Data Wrangling with Open Refine

Published in

Towards Data Science

10 min readNov 15, 2020

Most times, when we hear about data wrangling, we think about data cleaning and manipulation using Python or some other language. In most real-life cases, the data is so messy or so large that cleaning up with a language can feel very manual and painful.

Open Refine is a free open source tool by Google that helps with messy data by cleaning, transforming it from one format to another, and extending the data using other web services. Open Refine can also be used locally, so you don’t have to worry about losing your data to the public.

In this article, we are going to walk through the process of wrangling data that belongs to a fictitious Sunshine Pharmacy stating their sales of different medications. We want to use Open Refine to clean the current data to have headers: Product ID, Product Name, Packing Type, Qty. Transfer, and Price.

Sunshine Pharmacy Sales Data.xlsx

drive.google.com

From the sheet above, we see that the data is pretty messy. Some of the challenges with the data includes:

Improper header names
Irregular data — (1,200) in the supposed Tranf. Quantity column.
Unwanted data embedded within the sheet, for example, we have pharmacy branches and total sales for each branch in the sheet which we are not interested in.

Open Refine Installation

Open Refine works on Windows, Linux, and Mac. To get installation instructions, please check the installation page for all the details.

Once you have installed it and run it, you should get a page like this on your browser:

Open Refine Start Page (Image by Author)

From this start page, you can create a project by importing your messy data. Note that the data can be in forms that are already familiar to us — csv, json, xls, xlsx, etc. Also, you can your data from multiple sources, your computer, URLs, databases, etc. Let's import our Sunshine Pharmacy data from our file browser on our computer and click Next.

Formatting before project creation (Image by Author)

Before the project is created, Open Refine shows us a preview of the project for us to make a few changes before we start the project. If we pay attention, we would notice that the column headers are not right. We will start cleaning up our data from here. The current headers don't seem useful so if we delete them, we are not losing any important data. Also, on line 2, we can get something closer to better column headers, so we would let Open Refine use the data on line 2 as our column headers. We can do that by doing the following:

Ignoring the first two lines in the project (Image by Author)

By updating the ignore first option to 2 ignores the first 2 lines and the Parse next option box uses the data on line 2 as the new column headers.

Updated column headers (Image by Author)

The columns are updated, though it's not correct, it has quite improved the look of our data. We will modify the column headers as we move along. Also, notice that the brackets around the numbers have been removed automatically and figures have been converted to numbers (in green). Now we can create the project by clicking on the Create Project button.

We have a new project now. Note that it shows the total number of rows at the top. Also, each page can only show a maximum of 50 rows and a minimum of 5 rows. This is where all the magic happens. Let's jump in!

Extract the dates to its own column

We will start by extracting the dates in the first column into a column of its own — the dates column. In Open Refine we can do this by creating a new column based on our first column.

Adding column based on another column (Image by Author)

By clicking on the caret on the column header, we get the option to Edit column and then select Add column based on this column. This is where Open Refine becomes really interesting, you can write code to filter this very messy column to get dates and put them in another column.

Using GREL to filter the column for dates (Image by Author)

From the image above, we have given our new column the name Date and then moved on to use General Refine Expression Language to filter the column. As you can see, it's very easy and clean. Since the sheet we are working on is for March, we are requesting that if any row in the first column contains the word ‘Mar’, the data in that column should be copied to the new Date column. Click OK and see what we have done.

Rows that meet requirements are copied to a new column (Image by Author)

Our filter worked, rows with dates are moved to the Date column. We still have some cleaning to do on the Date column since we still have a lot of blanks and unwanted rows with Totals. Also, note that I had to move like 4 pages to see this first date on row 190.

Fill Down on the Dates Column

Next, we are going to do is do a fill down on the new Date column. A fill down fills the blank rows with any data that is just right above it. This move will help fill our blank rows.

Doing a fill down on the Date column (Image by Author)

Click on the caret on the Date column header to select the Edit cells option and click on the Fill down option. Initially, you may not see any changes if you are on the first page of the sheet. This is because there is no data in the Date column until row 190, so let's check that out in the next few pages.

Fill down on the Date column (Image by Author)

We can safely assume that the header in the first column will be used for the first date on the Date column. Also, since we have filled down the column earlier, the remaining blank cells in the column will be meant for the first date. So let's fill the blank spaces with Monday, 02 Mar 2020. We will do this by using Text Facet. The Text Facet gets every unique value in the column and shows their frequency. Click on the caret on the Date header and select Facet -> Text Facet.

Doing a text facet on the Date column (Image by Author)

Text Facet on the Date column (Image by Author)

You can see the various unique entries in the Date column and their frequencies. However, we are interested in the blank cells so we can fill in the first date. Let's scroll down the text facet to see the blank cells.

blank cells with a frequency of 188 (Image by Author)

Click on the (blank) link to filter the records to only those that have blank cells on the Date column. You would see that all the records are those before Tuesday, 03 Mar 2020. Now, let’s edit blanks and fill them with the right date. Hover your mouse over the (blank) link to see and click on the edit link.

edit link to edit blank cells (Image by Author)

This brings up a prompt for you to fill the value you want. We will type in Monday, 02 Mar 2020 and click Apply. You see that we have 0 matching rows after the update since we no longer have any blank cells in that column. Close the Date Facet to see the whole datasheet again.

Cleaning up Columns and Column Headers

Let's rename our headers to give us a clearer idea of our data. In the beginning, I mentioned that we wanted this data to have columns: Post Date, Product ID, Product Name, Packing Type, Qty. Transf. and Price. We will start by renaming the first column. Let’s start with the first column. Click on the caret on the first header and select Edit Column -> Rename this column. This shows a prompt to rename the column.

Renaming the first column (Image by Author)

We will do the same for the Date column also. We will change the name of the column to Post Date. Product Code looks like it holds data of the names of the drugs, so we will call it Product Name. The column Product/Name holds the price of the drugs so we would rename it to Price. Curr. Cost Price column holds the quantity sold, so we rename that to Qty Transf.. We don't need Transf. Quantity, Transf. Value and Remaining Qty. We will delete Transf. Quantity and Transf. Value first by clicking the caret on the headers and selecting Edit Column -> Remove this column. We can rename the final column — Remaining Qty to Packing Type .

We have renamed the columns but it's still not in the arrangement in the requirements. We will start with the Product ID . Click on the caret and select Edit Columns -> Move column right. This moves the column after Post Date.

Moving Product ID to the right (Image by Author)

We would move the Price column to the end of the sheet by clicking on the caret on the header, select Edit Column -> Move column to the end. Finally, we move Qty Transf. to the right as we did earlier for Product ID .

Headers after clean up and rearrangement (Image by Author)

Remove unwanted data

We have cleaned up the headers, now, we clean up the data. From the first row, we can see that row does not fit as there is no Product Name, Quantity, or Price data in that row. There are many such rows in our 6360 records and we have to clean them up. We can do this by finding the blank cells in the columns. We will start with the Product Name. We use the Text Facet to get the blank cells just like we did for the Post Date column. And then we click on the (blank) value to filter to only blank cells in the column.

Blank cells for Product Name (Image by Author)

If we look closely at the filtered data, we see that these rows were serving as headers for different branches of Sunshine Pharmacy. We don’t need that data and so we will delete the rows that match. We do this by clicking the caret on the All column and selecting Edit rows -> Remove all matching rows.

Removing rows that match blank Product IDs (Image by Author)

This removes 103 unwanted records, that's loads of dirt out of our way. Let’s check other columns to see if we can see more blanks. We do the same for Qty Transfer. and Price . For Qty Transf. , we do the Text Facet and we notice we have blanks also and some text in a column that should have numbers. After inspecting, we see that the text are also headers within the data which we don't need and the blanks represent rows that were meant to show daily total sales which we don't want here either. We would delete them just as we did for the Product Name column.

28 rows matching blank Qty Transf. cells (Image by Author)

26 rows matching unwanted data in Qty Transf. column (Image by Author)

For the Price column, all looks well except for an interestingly large number which represents the grand total sales for the month. We don't need that, so we would delete it just like previously.

removing the row that represents the grand total (Image by Author)

All looks clean and dandy now that we have deleted all unwanted rows and columns.

We are done. This looks awesome, you have come a long way. There are few adjustments you may want to do like changing the negative numbers to positive and changing the dates to a typical mm/dd/yyyy format using GREL. I’ll leave that to you to play around with.

Open Refine is very easy to use to clean up very large data sets especially if you are not a programmer or don't want to go through the horrors of cleaning with a programming language.

I hope this was helpful. Thanks for reading.