If you went through the previous post about Python with Visual Studio and installed the data analysis tools there, you already have Jupyter since it comes installed with Anaconda. To run a notebook locally, type jupyter notebook in a terminal. Anaconda also installed all the required packages that we will be using in this post.
Asking Questions About our Data
Thanks to Super Data Science we can look at a training data set with some sales data and gain some insights from it. Let’s take a little look at the data as it looks in Excel.
Upon initial inspection of the data, we can start thinking of some questions about it that we would want to answer.
- What rep sold the most?
- What item sold the most?
- What region sold the most?
Loading the Data
import pandas as pd import numpy as np
We can use pandas to read in the CSV file with the
df = pd.read_csv("./OfficeSupplies.csv")
If there are no errors when executing then the file loaded with no errors. But now what do we do with it? Pandas takes the data and creates a DataFrame data structure with it. The DataFrame allows us to do quite a bit of analysis on the data. We can look at the number of rows and columns to get a quick idea of how big our data is.
There are 43 rows and six columns in our data set. It’s not that big of a data set, but even small data sets can yield some good insights.
General Analysis of the Data
To get a quick idea of what the data looks like, we can call the
head function on the data frame. By default, this returns the top five rows, but it can take in a parameter of how many rows to return.
This looks a lot like an Excel spreadsheet, doesn’t it? Under the hood, the data frame is a two-dimensional data structure and each column can have different types. To show that, we can call
dtypes attribute on the data frame to see what each column types are.
Even though the first four columns are objects, we can see from the data that it’s text. Due to pandas using Numpy behind the scenes, it interprets strings as objects.
This shows some descriptive statistics on the data set. Notice, it only shows the statistics on the numerical columns. From here you can see the following statistics:
- Row count, which aligns to what the
shapeattribute showed us.
- The mean, or average.
- The standard deviation, or how spread out the data is.
- The minimum and maximum value of each column
- The number of items that fall within the first, second, and third percentiles.
We can tell a decent amount of our data from this, such as the average number of units sold is 49 and the average price of each unit is around $20.
Answering our Questions
Now we can do further analysis on our data to answer our questions. Before that, we should see if there are any missing values in our data set. In our current data set of 43 rows we can scan the data to tell there aren’t any missing values, but with much bigger data sets looking at the data by hand to find missing values will be too tedious. This is easy to do in pandas. To check if there are any missing values in the entire data set we use the
isnull function, then see if there are any values.
We’re lucky we have such a nice data set and with no missing values. While we won’t focus on it in this post, a data scientist will spend their time cleaning (or wrangling ) the data. Since we don’t have any missing data, we can start doing further analysis on our data.
Who sold the most?
The first question we had was what rep had sold the most. Pandas has a lot of utility functions for querying the data frame to help us out. To answer this we can group by the “Rep” column and sum up the values in the columns. Notice in the result that pandas only does a sum on the numerical columns. Then we do a descending sort on the values based on the “Units” column.
Richard sold the most units, but did he sell the most in dollar amount? Does selling the most units mean the rep earned the most money? How can we determine that if there’s no column available to find that out? Pandas has you covered there, too. We can add columns to our data frame as we need (we can drop them, too, if they add too much noise to our data set). Pandas uses Numpy behind the scenes in the DataFrame object so it has the ability to do mathematical operations on columns, and it can do them quite fast.
df["Total Price"] = df["Units"] * df["Unit Price"] df.head()
Here we create a new column on our data frame called “Total Price” and multiply the “Units” and “Unit Price” columns to calculate it. We call the
head function again to make sure it got added.
Now we can find who sold the most in total dollar amount.
df.groupby("Rep").sum().sort_values("Total Price", ascending=False).head()
We do the same grouping of the “Rep” column and sum up the values, but now we do a descending sort on the “Total Price” column values and print out the results.
Richard may have sold the most units, but Matthew sold the most in terms of total price.
What region sold the most?
We now know who sold the most, but we also have a “Region” column. With this, we can see who sold the most within each region.
group = df.groupby(["Region","Rep"]).sum() total_price = group["Total Price"].groupby(level=0, group_keys=False) total_price.nlargest(5)
Now we group by two columns, “Region” and “Rep”, and sum those values up. With that result, we save it off into a new data frame. Then take the “Total Price” column and group the values. In our
groupby function, we specify what level to group by since we already did the first grouping, which creates a hierarchical index on the data. Then we take the largest five of those to print out. From the results, we can tell that Matthew sold the most in the Central region, Susan in the East, and James in the West.
Another thing to note from this result: the Central region has the same number of reps than East and West combined. That could be something to bring up to the sales team when presenting the data.
That was a good bit of analysis already. In a few commands with pandas we were able to answer some initial questions about our data. However, some of these can be a bit hard to interpret, especially if we have a much bigger data set. In our next post, we’ll answer these same questions with visualizations.