May 22, 2018 Text Analytics & Unsupervised NLP - Part II

In the first article of the Text Analytics & Unsupervised NLP series, I walked through importing your data, handling stop words, cleaning, and tokenizing, and finished off by visualizing n-grams: uni-, bi-, and tri-grams.
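As a quick refresher on what an n-gram is, here is a minimal sketch using only the standard library. The `ngrams` helper and the sample sentence are made up for illustration and are not taken from the series itself.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip the token list against copies of itself shifted by 1..n-1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical, already-cleaned text split on whitespace.
tokens = "the quick brown fox jumps over the quick dog".split()

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(3))
```

Counts like these are what typically get passed to a bar chart when visualizing the most frequent n-grams.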

Read More

May 18, 2018 Space Efficiency with Pandas DataFrames

Lately I’ve been working with enormous datasets (tens of millions of rows) with anywhere from tens to hundreds or even thousands of features. What I have found to be absolutely critical is cutting memory use as much as possible. Assigning explicit dtypes to your DataFrame is the best thing you can do to improve performance.
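To give a feel for the idea, here is a small sketch that compares memory use before and after assigning narrower dtypes. The column names and data are invented for illustration, and the savings on real data will depend on its value ranges and cardinality.

```python
import numpy as np
import pandas as pd

n_rows = 100_000
rng = np.random.default_rng(0)

# Hypothetical frame using pandas' default (wide) dtypes.
df = pd.DataFrame({
    "user_id": np.arange(n_rows, dtype="int64"),
    "state": rng.choice(["IL", "IA", "NY"], size=n_rows),
    "score": np.linspace(0.0, 1.0, n_rows),
})
before = df.memory_usage(deep=True).sum()

# Downcast numeric columns and store the low-cardinality strings as categories.
compact = df.astype({"user_id": "int32", "state": "category", "score": "float32"})
after = compact.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```

Downcasting is only safe when the values actually fit the smaller type, so it is worth checking column ranges first.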

Read More

August 17, 2017 Python & Jupyter Notebook Tricks

Update: new tricks go at the top, and I will keep adding them as I discover them. A few things that I’ve come across from extensive googling have been game changers for me in Python programming. Check them out below:

Read More

May 28, 2017 Run Jupyter Notebook from AWS EC2

This tutorial is adapted from Chris Albon’s, found here, and either copies it directly or follows it very closely. There were a couple of things I needed clarification on, so I wanted to write everything down in case I need to go through this process again. The main thing I learned is that this will only work if you do literally everything exactly as shown on this page.

Read More

May 14, 2017 Merge DataFrames: Unique Rows from Two DataFrames

I am constantly trying to remember how to add a row into a dataframe only if it doesn’t already exist. My indices will never match up and are irrelevant, so I struggle to figure out how to ignore the indexes on the dataframes.
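One way to approach this, sketched under the assumption that two rows count as "the same" when all of their column values match: concatenate with `ignore_index=True` so the original indexes are discarded, then drop duplicates. The two frames below are made-up examples.

```python
import pandas as pd

# Hypothetical frames whose indexes do not line up and carry no meaning.
df1 = pd.DataFrame(
    {"name": ["Ann", "Bob"], "city": ["Ames", "Chicago"]}, index=[10, 11]
)
df2 = pd.DataFrame(
    {"name": ["Bob", "Cy"], "city": ["Chicago", "Des Moines"]}, index=[0, 1]
)

combined = (
    pd.concat([df1, df2], ignore_index=True)  # stack rows, discard old indexes
    .drop_duplicates()                         # keep the first copy of each row
    .reset_index(drop=True)                    # renumber 0..n-1
)

print(combined)
```

If only some columns define uniqueness, `drop_duplicates(subset=[...])` narrows the comparison to those columns.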

Read More

May 7, 2017 Bayes and Monty Hall

When learning about Bayesian statistics in class, we discussed the Monty Hall problem: the game show has 3 doors, and behind one of them is a car; the other two have goats. You pick the door that you think the car is behind, and then Monty Hall opens one of the remaining two to reveal a goat.
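The setup is easy to check by simulation. The sketch below assumes the standard rules: Monty always opens a door that hides a goat and is not your pick, choosing at random when two such doors are available. The function names are illustrative.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the first pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=0):
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Over many trials the estimates should settle near 1/3 for staying and 2/3 for switching, which matches the Bayesian result.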

Read More

April 30, 2017 Calculating Distance Between Coordinate Points

For our fourth project, we worked in groups on the already closed Kaggle competition for Predicting West Nile Virus. The competition provided data on all of the traps located throughout the neighborhoods of Chicago, along with the exact geographic coordinates, the date the trap was tested, the number of mosquitos in the sample, and the test results of whether or not the mosquitos have West Nile Virus.
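The method used in the post is not shown in this excerpt. One common choice for the distance between two latitude/longitude points is the haversine formula, sketched below. It assumes a spherical Earth with a mean radius of 6371 km, and the Chicago-area coordinates are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Hypothetical trap and weather-station coordinates.
trap = (41.95, -87.80)
station = (41.79, -87.75)
print(round(haversine_km(*trap, *station), 2), "km")
```

For points as close together as traps within one city, the spherical approximation is usually accurate enough.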

Read More

April 23, 2017 Webscraping

One of my favorite new skills to date is webscraping. In my last job, I had to use HTML and CSS frequently on our product to develop features and fix bugs. I then transitioned to a database-heavy position and really missed working on the website. Webscraping served as a nice reminder of how much I enjoyed my web development experience. With that said, please don’t judge the appearance of my blog. I wish I had the time to spruce it up.

Read More

April 16, 2017 Iowa Liquor Sales

For our second project, we were to take liquor sales from Iowa and determine ideal locations for opening a new liquor store. The data can be found on the government of Iowa’s website, here.

Read More

April 9, 2017 Logistic Regression

In this post, I’ll talk about some of the basics of logistic regression and why we would use this type of modeling over linear regression. To help illustrate these concepts, I’ll use the Titanic dataset, which can be found here.
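The core reason to prefer logistic over linear regression for a yes/no outcome can be shown in a few lines: the logistic (sigmoid) function squeezes any linear score into the range 0 to 1, so the output can be read as a probability. The intercept and coefficient below are hypothetical values, not fitted to the Titanic data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to the open interval (0, 1)."""
    # Fine for moderate z; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(x, intercept, coef):
    """The unbounded linear predictor that linear regression would output."""
    return intercept + coef * x


def predict_proba(x, intercept, coef):
    """Logistic regression: pass the linear score through the sigmoid."""
    return sigmoid(linear_score(x, intercept, coef))


# Hypothetical parameters for a single feature.
intercept, coef = -1.0, 0.05

for x in (0, 20, 100):
    print(x, linear_score(x, intercept, coef), round(predict_proba(x, intercept, coef), 3))
```

The linear score can fall below 0 or rise above 1, which makes no sense as a probability, while the logistic output always stays inside that range.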

Read More

April 2, 2017 Convert String to Number in Python

While working on a project, I came across an issue where I had a number with commas stored as a string. Because of the commas, it cannot be converted using int(string), float(string), or .astype(float). I’ve tried them all. I’ve also tried removing the commas and then converting. No such luck.
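The fix the post arrives at is not shown in this excerpt. As one possible approach, the sketch below strips every character that is not a digit, minus sign, or decimal point before converting, which also covers stray whitespace or currency symbols that plain comma removal would miss. It assumes US-style formatting (comma for thousands, period for decimals), and `to_number` is a hypothetical helper.

```python
import re


def to_number(text):
    """Convert a string such as '1,234,567' or '$1,234.50' to int or float."""
    # Drop anything that is not a digit, a minus sign, or a decimal point.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned) if "." in cleaned else int(cleaned)


print(to_number("1,234,567"))
print(to_number(" $1,234.50 "))
```

A string with no digits at all still raises `ValueError`, which is usually the right signal that the value needs a closer look.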

Read More

March 27, 2017 SQL for updating Pandas DataFrame

I’m fairly new to Python and even more so to Pandas, but I’m pretty experienced in SQL. As I encounter ever more issues in manipulating data in Pandas DataFrames, I find myself angrily thinking of how I could have solved each problem simply and quickly somehow using SQL on the df.
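One way to run SQL against a DataFrame, sketched here with the standard-library `sqlite3` module rather than whatever tool the post settles on: copy the frame into an in-memory SQLite table, run the statement, and read the result back. The table and column names are invented for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical data to update.
df = pd.DataFrame({"store": ["A", "B", "C"], "amount": [100, 250, 75]})

con = sqlite3.connect(":memory:")
try:
    # Copy the DataFrame into a temporary SQLite table.
    df.to_sql("store_sales", con, index=False)
    # Apply the change in SQL.
    con.execute("UPDATE store_sales SET amount = amount * 2 WHERE store = 'B'")
    # Pull the updated rows back into a new DataFrame.
    updated = pd.read_sql_query(
        "SELECT store, amount FROM store_sales ORDER BY store", con
    )
finally:
    con.close()

print(updated)
```

The round trip copies the data, so for very large frames a native pandas expression such as `df.loc[mask, "amount"] *= 2` is likely to be cheaper.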

Read More

March 19, 2017 SAT Scores

For a project, I was given SAT score data from eligible students (those who were in high school and took the SAT in 2001). We were told simply to see what insights we could draw. After my initial scan of the data, what immediately stood out to me was the number of low participation rates (the percentage of high school students who took the SAT), specifically in the Midwest.

Read More