Transforming Data
Overview
Teaching: 30 min
Exercises: 10 minQuestions
How can we transform our data to correct errors?
Objectives
Learn about clustering and how it is applied to group and edit typos
Split values from one column into multiple columns
Manipulate data using previous cleaning steps with undo/redo
Remove leading and trailing white spaces from cells
Learn to use GREL (General Refine Expression Language) for advanced data transformations
So far we have learned to use various facets to inspect and explore our data. Text facet also allowed us to directly edit a subset of data in bulk. We have also seen how we can transform the data type from text to numeric. OpenRefine offers a number of other functionalities to transform and restructure the data.
Data clustering
Clustering allows you to find groups of entries that are are not identical but are
sufficiently similar that they may be alternative representations of the same thing (term or data value).
For example, the two strings New York
and new york
are very likely to refer to the same concept and just have a
capitalisation differences. Likewise, Björk
and Bjork
probably refer to the same person. These kinds of variations
occur a lot in scientific data. Clustering gives us a tool to resolve them.
OpenRefine provides different clustering algorithms. The best way to understand how they work is to experiment with them.
- If you removed it, reinstate the
scientificName
text facet (you can also remove all the other facets to gain some space). In thescientificName
text facet box - click theCluster
button. - In the resulting pop-up window, you can change the
Method
and theKeying Function
. Try different combinations to see what different mergers of values are suggested. -
If you select the
key collision
method and themetaphone3
keying function. It should identify three clusters. - Tick the
Merge?
checkbox beside each group, then clickMerge Selected and Recluster
to apply the corrections to the dataset. Note that theNew Cell Value
column displays the new name that will replace the value in all the cells in the group. You can change this (but please don’t do so now) if you wish to choose a different value than the suggested one. - Try selecting different
Methods
andKeying Functions
again, to see what new merges are suggested. You may find there are still improvements that can be made, but do notMerge
again; justClose
when you are done. We will now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Important: If you Merge
using a different method or keying function, or more times than described in the instructions above,
your solutions for later exercises will not be the same as shown in those exercise solutions.
OpenRefine Wiki: Clustering
Full documentation on clustering can be found at OpenRefine Wiki: Clustering
Data splitting
It is easy to split data from one column into multiple columns if the parts are separated by a common separator (say a comma, or a space).
- Let us suppose we want to split the
scientificName
column into separate columns, one for genus and one for species. - Click the down arrow next to the
scientificName
column. ChooseEdit Column
>Split into several columns...
- In the pop-up, in the
Separator
box, replace the comma with a space (the box will look empty when you’re done). - Important! Uncheck the box that says
Remove this column
. - Click
OK
. You should get some new columns calledscientificName 1
,scientificName 2
,scientificName 3
, andscientificName 4
. - Notice that in some cases these newly created columns are empty (you can check by text faceting the column). Why? What do you think we can do to fix it?
The entries that have data in scientificName 3
and scientificName 4
but not the first two scientificName
columns
had an extra space at the beginning of the entry. Leading and trailing white spaces are very difficult to notice when cleaning data
manually. This is another advantage of using OpenRefine to clean your data - this process can be automated.
In newer versions of OpenRefine (from version 3.4.1) there is now an option to
clean leading and trailing white spaces from all data when importing the data initially and creating the project.
Because we didn’t clean white space at the time of importing the data, we will look at how to
fix leading and trailing white spaces manually in a moment - first we need to undo the splitting step.
Undoing / Redoing actions
It is common while exploring and cleaning a dataset to make a mistake or decide to change the order of the process
you wish to conduct. OpenRefine provides Undo
and Redo
operations to make it easy to roll back your changes.
-
Click
Undo / Redo
in the left side of the screen. All the changes you have made will appear in the left-hand panel. The current stage in the data processing is highlighted in blue (i.e. step 4. in the screenshot below). As you click on the different stages in the process, the step identified in blue will change and, far more importantly, the data will revert to that stage in the processing. - We want to undo the splitting of the column
scientificName
. Select the stage just before the split occurred and the newscientificName
columns will disappear. - Notice that you can still click on the last stage and make the columns reappear, and toggle back and forth between these states. You can also select the state more than one steps back and revert to that state.
- Let’s leave the dataset in the state in which the
scientificNames
were clustered, by selecting the stage just before the split.
Important: If you skip this step, your solutions for later exercises will not be the same as shown in those exercise solutions.
Trimming Leading and Trailing Whitespace
Words with spaces at the beginning or end are particularly hard for humans to identify from strings without these
spaces (as we have seen with the scientificName
column). However, blank spaces can make a big difference to computers, so we usually want to remove them.
- In the header for the column
scientificName
, chooseEdit cells
>Common transforms
>Trim leading and trailing whitespace
. - Notice that the
Split
step has now disappeared from theUndo / Redo
pane on the left and is replaced with aText transform on 3 cells
- Perform the same
Split
operation onscientificName
that you undid earlier. This time you should now only get two new columns.
Removing the leading white spaces means that each entry in this column has exactly one space
(between the genus and species parts of the original scientificName
data).
Therefore, when you now split with space as the separator, you should get only two columns. Let’s do this as an exercise.
Exercise
Repeat the splitting of column
scientificName
exercise.Solution
On the
scientificName
column, click the down arrow next to thescientificName
column and chooseEdit Column
>Split into several columns...
from the drop down menu. Use a blank character as a separator, as before. You should now get only two columnsscientificName 1
andscientificName 2
.
Renaming columns
We now have the genus and species parts neatly separated into 2 columns - scientificName 1
and scientificName 2
.
We want to rename these as genus
and species
, respectively.
- Let’s first rename the
scientificName 1
column. On the column, click the down arrow and thenEdit column
>Rename this column
. - Type “genus” into the box that appears.
Exercise
Try to change the name of the
scientificName 2
column tospecies
. What problem do you encounter? How can you fix the problem?Solution
- On the
scientificName 2
column, click the down arrow and thenEdit column
>Rename this column
.- Type “species” into the box that appears.
- A pop-up will appear that says
Another column already named species
. This is because there is another column with the same name where we’ve recorded the species abbreviation.- You can choose another name like
speciesName
for this column or change the otherspecies
column name tospecies_abbreviation
and then come back and rename this column tospecies
.
Important: Undo the splitting and renaming steps and retain the white space trimming step before moving on (it may be several steps back). If you skip this step, your solutions for later exercises will not be the same as shown in exercise solutions.
Transforming Data Using GREL
OpenRefine provides a way to write special expressions to accomplish more complex data transformations (such as string manipulation or mathematical calculations) to improve the structure of the data. These functions are written in a special language called GREL (General Refine Expression Language). GREL can be used in several places:
- when transforming cells in a column using the transformation function
- when adding a column based on another column
- when creating a custom Text or Numeric facet
- when creating a new column by fetching data from a URL
We will have a look at the first two of these options; you can explore other yourself - the principle of using GREL will be the same and all GREL input windows in OpenRefine will have a very similar outlook.
Let’s have a look at the column geolocation
- it contains latitude and longitude coordinates of locations
where observations took place combined together like this: ('30.438056', '-84.247155')
. As can be noted, data contains
round braces “(“ and “)” and single quotes “’” around data making it less useful for any processing.
We want to get rid of all these characters and split the data in two columns to contain individual values
for latitude and
longitude.
- First we want to create a duplicate of the
geolocation
column where we will perform our operations and keep the originalgeolocation
intact. To do so, on thegeolocation
column click the down arrow and thenEdit column
>Add column based on this column...
. -
You will be presented with a window to enter a GREL expression telling OpenRefine how to transform the current data when creating a new column based off it.
GREL
Expression
field contains the expression “value” to begin with. This indicates to use the current “value” of the cell as is when transforming data. In the Preview panel below you can also see the current cell value and what the new value would be after applying the GREL expression to it (in this case - both values will be the same as we are simply duplicating the column). In theNew column name
field type the new new name for our duplicate column, e.g.geolocation_new
. When finished, clickOK
to apply the action. -
After OpenRefine creates the
geolocation_new
column, we want to do further transformations on it to extract longitude and latitude values. To do so, selectEdit cells
>Transform...
from the drop down menu on thegeolocation_new
column. You will once again be presented with a similar window to enter a GREL expression. This time, we want to chain a few functions in the GREL expression to achieve the desired effect of removing round braces and single quotes, like so:value.replace("(","").replace(")","").replace("'","")
. We are replacing any occurrence of “(“ in the cell data value with a blank character (effectively deleting it), and then repeating (chaining) similar functions on the output value from the previous function until we remove all unwanted characters. Try typing one function at a time to see what effect it has on the data - you can see the result of applying each expression in the Preview panel.When finished, click
OK
to apply the data transformation. -
We are now ready to split the
geolocation_new
column using theEdit Column
>Split into several columns...
, as we learned earlier in this episode. The separator we want to use in this case is “, “ - a comma followed by a blank character. If, in addition, you select theGuess cell type
checkbox in the split column popup window, OpenRefine will correctly identify that the values in new columns are numeric and transform the data type for us as well. - You should now have 2 new columns with numeric data named
geolocation_new 1
andgeolocation_new 2
representing the extracted longitude and latitude values respectively. - Rename your new columns to
longitude
andlatitude
accordingly. You can now make further use of the extracted data from other applications, e.g. plot geolocations on a map.
GREL offers rich syntax and a large number of functions for complex string manipulations (and handling different text formats - JSON, HTML, XML), working with numbers, dates and boolean (TRUE/FALSE) values, logical and mathematical operations. We strongly recommend learning more on GREL syntax and functionalities.
GREL documentation
Check the official GREL documentation for the full reference on GREL. Here is another useful GREL guide to check out.
Key Points
Clustering can identify outliers in data and help us fix errors in bulk
GREL (General Refine Expression Language) is a powerful tool for transforming data