Still using Excel? Supercharge your productivity in 15 minutes with Python.

Genevieve
6 min read · May 14, 2020

I have successfully brainwashed my Excel-fanatic colleague into appreciating the virtues of Python. She became an eager student after seeing how quickly and easily my Python scripts produced the desired results: half an hour instead of two painful weeks of scrubbing Excel. Python is an easy-to-learn, high-level programming language that is freely distributed. It is also one of the most popular and in-demand programming languages for data science. While Excel is incredibly powerful, there are some things it doesn’t do well, such as joining large amounts of data. In this article I will outline everything the typical analyst needs to know to get started with Python.

Step 1 Installing Python

I recommend downloading the free Anaconda distribution which comes with Python and some helpful libraries: https://www.anaconda.com/products/individual

(Optional, in order to access Python from the command line.) After installation, you can add:

C:\Users\<your-username>\Anaconda3\Scripts; C:\Users\<your-username>\Anaconda3; C:\Users\<your-username>\Anaconda3\Library\bin

to your system path. Search for “Advanced system settings” in your Start menu and select “View advanced system settings”. In the pop-up, select “Environment Variables…” under the ‘Advanced’ tab. In the new window, under user variables (the first box), click on ‘Path’, then click on ‘Edit…’, and add the three paths above.

Now you can type ‘where python’ in the command prompt to verify the paths have been added correctly. You can also quickly run python in command prompt just by typing ‘python’.

Step 2 Installing libraries

Python is so powerful because it gives you access to thousands of libraries written by programmers all over the world. These libraries allow you to do things like read databases, create webpages, and also read Excel files. In order to read Excel files we need to install the ‘xlrd’ package. Just find the ‘Anaconda Prompt’ application and start it, then type ‘conda install xlrd’, hit Enter, and confirm when prompted.

Step 3 Start writing

If you installed Anaconda, it comes with an editor called Jupyter Notebook that lets you easily write Python code and display the results. Just find the ‘Anaconda Prompt’ application and start it, then type ‘jupyter notebook’ and hit Enter. Jupyter will open up in your browser:


Feel free to navigate to your preferred folder. You can also open Jupyter in your folder of choice by running cd <folder path> in the Anaconda Prompt before you run ‘jupyter notebook’.

This is the simplest editor for most beginners, but if you prefer you can use other editors. I personally like PyCharm.

Click on New Notebook: Python 3 to start coding.

Step 4 Importing libraries

import pandas as pd

We will be using the Pandas library extensively, so let’s go ahead and import it by typing the line above.
This imports the ‘pandas’ library and lets us refer to it as pd in the rest of the script. Pandas is a data analysis library that is very good at manipulating tables of data, much like what we do in Excel. These tables are called dataframes in Pandas.
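To see what a dataframe looks like, here is a minimal sketch; the PRODUCT and PRICE columns are made up for illustration:

```python
import pandas as pd

# A dataframe is Pandas' version of an Excel table: rows and named columns.
df = pd.DataFrame({'PRODUCT': ['apple', 'pear'], 'PRICE': [1.2, 0.8]})

print(df.shape)             # (2, 2) -> 2 rows, 2 columns
print(df.columns.tolist())  # ['PRODUCT', 'PRICE']
```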

Step 5 Read Excel data

I generated some random data for an imaginary grocer. You can download the same Excel file here. To read the file into your notebook, just type:

data = pd.read_excel("C:\\Users\\yourusername\\path_to_data\\MOCK_DATA.xlsx")

Note the use of the double backslash (\\): ‘\’ is an escape character in code; for example, ‘\n’ stands for the newline character and ‘\t’ stands for the tab character, so a literal backslash must be written as ‘\\’. Then hit Ctrl+Enter to run this code block. To see the data, just type data and run the cell.
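Here is a quick sketch of why the doubling matters, along with two common ways to avoid it (raw strings and forward slashes); the path shown is a placeholder:

```python
# '\n' is ONE newline character, not the two characters '\' and 'n'.
print(len("\n"))    # 1  (a single newline character)
print(len("\\n"))   # 2  (a literal backslash followed by 'n')

# Two alternatives to doubling every backslash in Windows paths:
path_raw = r"C:\Users\yourusername\MOCK_DATA.xlsx"  # raw string: backslashes kept as-is
path_fwd = "C:/Users/yourusername/MOCK_DATA.xlsx"   # forward slashes also work on Windows
```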

You will see a preview of your table appear below the cell.

Step 6 How to perform simple calculations

Let’s say we want to calculate the revenue, and add that as a column in data. That is really easy to do in Python:

We can create a new column just by running

data['column name'] = whatever_we_want_to_be_in_column
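As a concrete sketch, suppose our table has PRICE and QUANTITY columns (assumed names for illustration; substitute the ones in your own file):

```python
import pandas as pd

# Small stand-in for the grocer data.
data = pd.DataFrame({'PRICE': [2.0, 0.5, 3.0], 'QUANTITY': [10, 4, 2]})

# Column arithmetic is applied row by row, like dragging a formula down in Excel.
data['REVENUE'] = data['PRICE'] * data['QUANTITY']

print(data['REVENUE'].tolist())  # [20.0, 2.0, 6.0]
```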

Step 7 Grouping data

Let’s say we want to know how much revenue we made on each day. We will need to sum up all the REVENUE in each DATA_DATE. That is usually a job for pivot tables in Excel. In Python, we do:

data.groupby('DATA_DATE')['REVENUE'].sum()

We can replace ‘DATA_DATE’ with a list of columns as well, such as (here ‘PRODUCT’ stands in for any other column in your data):

data.groupby(['DATA_DATE', 'PRODUCT'])['REVENUE'].sum()

Note that when we refer to multiple columns, we need to wrap them in square brackets. The REVENUE column can be replaced with other column names, and you can also change sum to count, prod, cumprod (cumulative product), or cumsum (cumulative sum), among others. To find out more, refer to the Pandas documentation.

Let’s save our revenue result by writing it to a variable. Run:

revenue_result = data.groupby('DATA_DATE')['REVENUE'].sum()
revenue_result = revenue_result.reset_index()

The second line converts the revenue result back into a dataframe table. If we do not run it, revenue_result still refers to a single-column Series object, which will cause problems when we try to join it with another table.
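The Series-versus-dataframe distinction can be seen on a toy table (the values are made up, but the column names match the article’s example):

```python
import pandas as pd

# Toy stand-in for the grocer data.
data = pd.DataFrame({
    'DATA_DATE': ['2020-05-01', '2020-05-01', '2020-05-02'],
    'REVENUE':   [10.0, 5.0, 7.0],
})

grouped = data.groupby('DATA_DATE')['REVENUE'].sum()
print(type(grouped).__name__)  # Series -- a single indexed column

revenue_result = grouped.reset_index()
print(type(revenue_result).__name__)        # DataFrame -- a table again, ready to merge
print(revenue_result['REVENUE'].tolist())   # [15.0, 7.0]
```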

Step 8 Joining multiple tables

This is a huge advantage of Python compared to Excel. While Excel has VLOOKUP, it isn’t exactly fast or easy to use, especially when you are working with many sheets.

Let’s suppose our grocer also records the customers that come everyday. Let’s start by importing the customer data:

customer_data = pd.read_excel("C:\\Users\\username\\Downloads\\mock_customer_data.xlsx")

We now want to count the number of visitors everyday, and join that with our revenue data to see whether there is a trend.

Let’s count the number of customers first:

no_customers = customer_data.groupby('date_visited')['id'].count().reset_index()

We can rename the ‘id’ column to a better name:

no_customers = no_customers.rename(columns={'id':'no_cust'})

Now to merge the no_customers table with revenue_result table:

no_customers.merge(revenue_result, left_on='date_visited', right_on='DATA_DATE', how='inner')

The left_on parameter refers to the column on the left table (the left side of ‘.merge’: no_customers) that you want to merge on, while right_on refers to the column on the right table. Merge joins rows where the value in the left column matches the value in the right column. The ‘how’ parameter controls how merge handles values that appear in only one of the tables: ‘inner’ means an inner join, which drops all values not present in both tables; ‘left’/‘right’ keeps all values from the left/right table; and ‘outer’ keeps all values whether they come from the left or the right.
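The different ‘how’ behaviours can be checked on two toy tables (the values are invented, but the column names follow the article):

```python
import pandas as pd

# Tiny stand-ins for the two tables above.
no_customers = pd.DataFrame({
    'date_visited': ['2020-05-01', '2020-05-02', '2020-05-03'],
    'no_cust':      [4, 2, 6],
})
revenue_result = pd.DataFrame({
    'DATA_DATE': ['2020-05-01', '2020-05-02'],
    'REVENUE':   [15.0, 7.0],
})

# 'inner' keeps only dates present in BOTH tables, so 2020-05-03 is dropped.
inner = no_customers.merge(revenue_result,
                           left_on='date_visited', right_on='DATA_DATE',
                           how='inner')
print(len(inner))  # 2

# 'left' keeps every row of no_customers; the missing revenue becomes NaN.
left = no_customers.merge(revenue_result,
                          left_on='date_visited', right_on='DATA_DATE',
                          how='left')
print(len(left))   # 3
```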


Step 9 Export the results to Excel

First, save the result to a variable.

result = no_customers.merge(revenue_result, left_on='date_visited', right_on='DATA_DATE', how='inner')

Next, let’s export:

result.to_excel('folder_path\\results.xlsx')

And we are done!

You can also export to clipboard by running:

result.to_clipboard()

Then paste the result into Excel. My colleague loves doing this, but it leads to a messy workflow if your scripts get very long, so I don’t recommend it.

In this article, I covered how to get started with Python and Pandas, and how to do 5 simple things:

  1. read Excel
  2. create a new column and do simple column operations
  3. group by columns
  4. merge 2 tables
  5. export your results into Excel

These are simple operations, but we can stack many of them to rapidly generate analyses on large datasets that Excel struggles to handle. The steps I covered take less than 5 minutes to write and run in Python.

I hope this inspires you to get started learning Python; these useful functions have saved us many hours of painful Excel computation. If you would like to learn more about the basics of the Python language, there are many free resources online; I think Codecademy is a good start. Do watch out for my future articles covering more advanced topics, like how to read from corporate databases and how to run rolling-window regressions.


Genevieve

I was an engineering student, then a software developer at a wealth fund, and am now a graduate student studying computational biology.