Painless ETL with Python Pandas and Lambda Functions
The Python Pandas library, and lambda functions in particular, make Extract-Transform-Load (ETL) work painless. ETL is an essential step in business computing before any analytics can begin, akin to the all-important taping and tarping prep before painting a house.
What are Lambda functions?
In Python and other languages, we use the lambda keyword to declare an anonymous function, which is why we refer to them as "lambda functions". An anonymous function is simply a function declared with no name. Although they look different syntactically, lambda functions behave the same way as named functions declared with the def keyword, with one restriction: a lambda's body must be a single expression.
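For instance, the following named function and lambda are interchangeable (a minimal illustration; the names here are ours):

def add_one(x):
    return x + 1

# The same logic as an anonymous function, bound to a name only so we can call it below.
add_one_lambda = lambda x: x + 1

print(add_one(41), add_one_lambda(41))  # 42 42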
Why are Lambda functions useful?
In conjunction with Pandas dataframes, we use lambda functions to transform fields or synthesize new ones, giving us a flexible, low-effort ETL. This lets us escape the limited SQL syntax we are stuck with when confined to a database environment alone. In addition, the solutions are highly customized to the client (unlike third-party ETL packages, which need a lot of tuning) and reusable, since many patterns repeat between use-cases. In this article we will see examples of the most common ones.
Lambdas and Notebooks
Typically, lambda functions are part of an ETL pipeline that can be housed, e.g., in a Jupyter Notebook for reproducibility and shareability. A notebook is a mixture of code and comments that helps third parties understand the overall flow.
Motivation
AMZN, FB, and many other clients make extensive use of Python notebooks in conjunction with their database work (MySQL, Hive, Presto, DynamoDB, etc.). There is no shortage of ETL work, and lambdas make for efficient ETL.
Lambda Origins
Lambda functions come from Alonzo Church's research in the 1930s on the lambda calculus. They pop up frequently in the LISP programming language. The function is defined without an identifier, so if it is used only once, it is syntactically lighter than a named function.
False Cognates
Note that Python's lambda functions are unrelated to AWS Lambda, the serverless compute service that happens to share the name.
Now let's look at some code examples and explain the common use-cases that motivate each.
Here is the starting point: some customer data with typical malformations and missing values.
This assumes we have set up a connection 'conn' with our database credentials and the appropriate driver (psycopg2 in the case of Postgres). The handy read_sql_query function copies the column names from the database table into the Pandas dataframe.
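A minimal sketch of that setup (the connection parameters and the table name 'customer' are placeholders, not taken from the original):

import pandas as pd
import psycopg2

# Placeholder credentials; substitute your own.
conn = psycopg2.connect(host='localhost', dbname='mydb', user='analyst', password='...')

# read_sql_query copies the column names from the result set into the dataframe;
# parse_dates ensures 'ts' arrives as a datetime column.
dat = pd.read_sql_query('SELECT * FROM customer', conn, parse_dates=['ts'])
print(dat.head(15))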
id name city industry_type someval ts
0 4 Samsonic Pleasant 1 NaN 2009-03-01
1 7 Samony Jackson 2 NaN 2014-04-02
2 9 Orange Jackson 2 NaN 2015-11-13
3 14 Frank\n None 4 NaN 2017-08-03
4 3 j motor city 4 32.0 2018-11-02
5 333 j mator city 6 NaN 2018-12-10
6 335 j mator city 6 31.0 2020-01-03
7 535 j mator city 6 11.0 2020-11-19
8 12 None Overton 3 NaN 2021-01-09
9 6 School Oaktown\n 1 NaN 2010-03-03
Let's take a look at this dataset. 'name' has some entries with an extraneous line feed ('\n'), some entries that look too short ('j'), and an entry that is missing ('None' is the default value passed from the Postgres database when a value is missing). 'city' has similar issues. 'someval' has quite a few missing values (NaN means 'not a number'), but 'ts' (the timestamp) looks complete and well-formed.
Finding the difference between timestamp entries
A common use-case is to take timestamps and calculate the delta between them. We compute the time delta as follows:
dat['day_intrvl'] = dat['ts'].diff()
Now we have a Timedelta object in the day_intrvl field. Note the first row will not have a valid value; this is represented by 'NaT' ('Not a Time').
id name city industry_type someval ts day_intrvl
0 4 Samsonic Pleasant 1 NaN 2009-03-01 NaT
1 7 Samony Jackson 2 NaN 2014-04-02 1858 days
2 9 Orange Jackson 2 NaN 2015-11-13 590 days
3 14 Frank\n None 4 NaN 2017-08-03 629 days
4 3 j motor city 4 32.0 2018-11-02 456 days
5 333 j mator city 6 NaN 2018-12-10 38 days
6 335 j mator city 6 31.0 2020-01-03 389 days
7 535 j mator city 6 11.0 2020-11-19 321 days
8 12 None Overton 3 NaN 2021-01-09 51 days
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965 days
However, the Timedelta, expressed as 'N days' where N is the number of days between consecutive rows, is not useful for calculations. We would rather have just the N and drop the 'days' part. We also want to make sure N is an integer.
Breaking a string into parts, converting the first part to an integer, and assigning 0 to a non-numeric value
To accomplish this, we perform a chained lambda transformation:
dat['day_intrvl'] = dat['ts'].diff().astype(str).apply(lambda x: x.split()[0]).apply(lambda x: 0 if x == 'NaT' else int(x))
print(dat.head(15))
What is happening here? We calculate the Timedelta as before, then convert it to a string with astype(str). We then apply a lambda that splits the string and keeps only the first token, at offset [0] (the N part of 'N days'). Finally, we set the value to zero if the token was 'NaT'; otherwise we convert the string representation of the integer into an actual integer.
This improves the usability of the day_intrvl field because now we can do calculations with it, plot it, and bin it.
id name city industry_type someval ts day_intrvl
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank\n None 4 NaN 2017-08-03 629
4 3 j motor city 4 32.0 2018-11-02 456
5 333 j mator city 6 NaN 2018-12-10 38
6 335 j mator city 6 31.0 2020-01-03 389
7 535 j mator city 6 11.0 2020-11-19 321
8 12 None Overton 3 NaN 2021-01-09 51
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965
Dividing the Data into 5 Bins and Finding the 95th Percentile
What's next? Let's divide the day intervals into bins (5 bins in our case), and also find the 95th percentile value.
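The binning code is not shown in the original; one way to reproduce the output below uses Pandas' quantile-based qcut together with quantile:

# Five quantile-based bins, numbered 0 through 4 (labels=False returns the bin index).
dat['intrvl_bins'] = pd.qcut(dat['day_intrvl'], 5, labels=False)

# The 95th-percentile day interval, broadcast to every row.
dat['intrvl95'] = dat['day_intrvl'].quantile(0.95)
print(dat.head(15))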
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank\n None 4 NaN 2017-08-03 629
4 3 j motor city 4 32.0 2018-11-02 456
5 333 j mator city 6 NaN 2018-12-10 38
6 335 j mator city 6 31.0 2020-01-03 389
7 535 j mator city 6 11.0 2020-11-19 321
8 12 None Overton 3 NaN 2021-01-09 51
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Binning is useful for capturing intensities; think of a 5-point or 7-point Likert scale grouping consumer propensities to take some action, such as booking a hotel room. We have now synthesized two new fields: intrvl_bins (the bin number) and intrvl95 (the 95th-percentile value). Finding a given percentile is a common use-case when metering a service; the percentile might be the goal we fall short of or exceed.
Now let's take a look at the city and name fields. We have business logic telling us that 'mator' is a common misspelling of 'motor', and additional logic telling us that both 'mator city' and 'motor city' can be normalized to the value 'Detroit'. Let's write a small function that does a regular-expression check on the variations and then normalizes. We also want to substitute 'Unknown' for 'name' and 'city' when there is no value ('None' coming from the database), and strip off extraneous '\n' line feeds where they exist.
Normalizing variations to a common value, dealing with missing values, and stripping off line feeds
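The original snippet is not reproduced here; the following sketch matches the behavior described (the function name 'regexf' comes from the text, but its body is our reconstruction):

import re

def regexf(x):
    # Business logic: 'mator' is a common misspelling of 'motor', and both
    # 'mator city' and 'motor city' normalize to 'Detroit'.
    if re.match(r'm[ao]tor city', x):
        return 'Detroit'
    return x

for col in ['name', 'city']:
    # Replace missing values (None from the database) with 'Unknown',
    # then strip extraneous '\n' line feeds.
    dat[col] = dat[col].apply(lambda x: 'Unknown' if x is None else x.strip('\n'))

dat['city'] = dat['city'].apply(regexf)
print(dat.head(15))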
Note that stripping off unprintable characters such as hex codes is a common use-case when we transport IBM mainframe (i.e., EBCDIC) data to Unix-style ASCII format; EBCDIC data will contain hex codes we want to replace, much as in the snippet above. In this case, we clean up 'city' and 'name' by normalizing missing values and removing the line feeds, and then run the user-defined function 'regexf' on 'city'. The procedure results in the following.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank Unknown 4 NaN 2017-08-03 629
4 3 j Detroit 4 32.0 2018-11-02 456
5 333 j Detroit 6 NaN 2018-12-10 38
6 335 j Detroit 6 31.0 2020-01-03 389
7 535 j Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 NaN 2021-01-09 51
9 6 School Oaktown 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Success! The matching variations have been normalized to 'Detroit'. The name field with a missing value has been converted to 'Unknown', and all name and city entries with an extraneous line feed have that extra '\n' snipped off.
Normalizing a value that is too short
It is very common for data to be invalid because it is too short. Here is how to fix it: in our case, the 'name' field has some entries of 'j', which our business logic deems invalid.
dat['name'] = dat['name'].apply(lambda x: x if len(x) > 1 else 'invalid')
print(dat.head(15))
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank Unknown 4 NaN 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 NaN 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 NaN 2021-01-09 51
9 6 School Oaktown 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Success! Name values that were too short have been normalized to 'invalid'.
Dealing with non-numeric values
If we take a look at the field 'someval', we notice many rows have NaN, which means Not a Number. We want to replace those with a default; in our case, 10.
A simple lambda expression applies the isnan function (from the math library, which we must import) to check whether the row is NaN. If it is, we put in the baseline value of 10.
import math
dat['someval'] = dat['someval'].apply(lambda x: 10 if math.isnan(x) else x)
print(dat.head(15))
We get rid of the NaN in 'someval' and replace those rows with 10. (Pandas also offers the built-in fillna for this exact job; the lambda form generalizes to arbitrary conditions.)
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
It worked! Now 'someval' has the default value of 10 inserted where previously it had the inconvenient NaN label. We often want to normalize the datatype and set a default so that subsequent calculations work on all the rows.
Synthesizing a new field based on the contents of two existing fields
It is quite common to derive a new field from multiple existing fields. For example, in Google Cloud, web traffic data includes a visitor's latitude and longitude. We could derive the metro region from this by writing a function that draws a bounding box around major metros such as New York City, Chicago, etc. Why do this? Because the CMO will often want web traffic broken out by major metro region for ad-spend planning. In our learning case we will just build a new field, 'net', that applies a tax to the 'industry_type' and 'someval' fields.
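The 'tax' function itself is not shown in the original; here is one implementation consistent with the output below (the rate schedule of 20%, 25%, and 35% by industry_type is an assumption reverse-engineered from the numbers):

def tax(industry_type, someval):
    # Assumed rate schedule: type 1 pays 20%, types 2-4 pay 25%, everything else 35%.
    if industry_type == 1:
        rate = 0.20
    elif industry_type in (2, 3, 4):
        rate = 0.25
    else:
        rate = 0.35
    return someval * (1 - rate)

# A row-wise lambda hands both fields to the function (axis=1 walks over rows).
dat['net'] = dat.apply(lambda row: tax(row['industry_type'], row['someval']), axis=1)
print(dat.head(15))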
We run the user-defined function 'tax' and get:
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95 net
0 0 1304.95 8.00
1 4 1304.95 7.50
2 3 1304.95 7.50
3 4 1304.95 7.50
4 3 1304.95 24.00
5 1 1304.95 6.50
6 2 1304.95 20.15
7 2 1304.95 7.15
8 1 1304.95 7.50
9 0 1304.95 8.00
We notice a new field 'net' has appeared and its value is derived from 'industry_type' and 'someval' fields as per the 'tax' function we invoke.
Creating a Goal: The 95th Percentile
It is often the case that we want to define a goal on a metric. In this case, we want to define a goal of the 95th percentile of 'net'. We use the built-in Pandas function quantile to define a new field 'n95'.
dat['n95'] = dat['net'].quantile(0.95)
print(dat.head(15))
The output now contains the 95th percentile field 'n95'.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95
0 0 1304.95 8.00 22.2675
1 4 1304.95 7.50 22.2675
2 3 1304.95 7.50 22.2675
3 4 1304.95 7.50 22.2675
4 3 1304.95 24.00 22.2675
5 1 1304.95 6.50 22.2675
6 2 1304.95 20.15 22.2675
7 2 1304.95 7.15 22.2675
8 1 1304.95 7.50 22.2675
9 0 1304.95 8.00 22.2675
Success! The new field 'n95' contains the goal, in this case 22.2675, populated for every row. Another common use-case is to calculate a lower and an upper bound (for example, the 30th and 70th percentiles) to capture a chunk of normally distributed data by subsetting according to those bounds. Here, we will just use the 95th percentile as a goal line to plot against at the end of our discussion.
Chained recalculation after a value change
Quite often a field (or fields) will change, and this requires recalculation. We simply chain the recalculations together so all the data moves in lockstep.
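For example, suppose business logic bumps 'someval' upward by 10%; the dependent fields 'net' and 'n95' must then be recomputed in the same cell (a sketch, reusing the hypothetical 'tax' function from above):

# The value change, followed by every field that depends on it.
dat['someval'] = dat['someval'].apply(lambda x: x * 1.1)
dat['net'] = dat.apply(lambda row: tax(row['industry_type'], row['someval']), axis=1)
dat['n95'] = dat['net'].quantile(0.95)
print(dat.head(15))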
Notice how all the related fields update:
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95
0 0 1304.95 8.800 24.49425
1 4 1304.95 8.250 24.49425
2 3 1304.95 8.250 24.49425
3 4 1304.95 8.250 24.49425
4 3 1304.95 26.400 24.49425
5 1 1304.95 7.150 24.49425
6 2 1304.95 22.165 24.49425
7 2 1304.95 7.865 24.49425
8 1 1304.95 8.250 24.49425
9 0 1304.95 8.800 24.49425
We see that after someval was adjusted upward by 10%, we also adjust other fields that depend on someval accordingly.
Calculating a Column Total
Sometimes we want to know the total of a field. We use the reduce function from the functools library to calculate the total of the 'net' field; it adds the elements pairwise until the answer is reduced to a single value.
import functools
dat['totalnet'] = functools.reduce(lambda a, b: a + b, dat['net'])
print(dat)
This gives us the new 'totalnet' field. (For a plain sum, dat['net'].sum() is the idiomatic shortcut; reduce earns its keep when the pairwise operation is more exotic than addition.)
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet
0 0 1304.95 8.800 24.49425 114.18
1 4 1304.95 8.250 24.49425 114.18
2 3 1304.95 8.250 24.49425 114.18
3 4 1304.95 8.250 24.49425 114.18
4 3 1304.95 26.400 24.49425 114.18
5 1 1304.95 7.150 24.49425 114.18
6 2 1304.95 22.165 24.49425 114.18
7 2 1304.95 7.865 24.49425 114.18
8 1 1304.95 8.250 24.49425 114.18
9 0 1304.95 8.800 24.49425 114.18
Success! We have calculated the total of the 'net' field in a new field, 'totalnet'.
Checking entries versus a dictionary
Many times we will have a key-value dictionary defined, and we will want to check whether an entry in our dataset is contained in it. In this example, we define statedict with the key being a city and the value being the name of its state. We then populate a new field 'state' when the city in the 'city' field matches a key in statedict; if there is no match, we set 'state' to 'unknown'.
statedict = {'Detroit': 'Michigan'}
dat['state'] = dat['city'].apply(lambda x: statedict[x] if x in statedict else 'unknown')
print(dat.head(15))
Take a look at the contents of the new 'state' field.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet state
0 0 1304.95 8.800 24.49425 114.18 unknown
1 4 1304.95 8.250 24.49425 114.18 unknown
2 3 1304.95 8.250 24.49425 114.18 unknown
3 4 1304.95 8.250 24.49425 114.18 unknown
4 3 1304.95 26.400 24.49425 114.18 Michigan
5 1 1304.95 7.150 24.49425 114.18 Michigan
6 2 1304.95 22.165 24.49425 114.18 Michigan
7 2 1304.95 7.865 24.49425 114.18 Michigan
8 1 1304.95 8.250 24.49425 114.18 unknown
9 0 1304.95 8.800 24.49425 114.18 unknown
The state is defined as Michigan for all 'Detroit' entries because the city key matched the dictionary in those cases.
String Manipulation and Classification
The next example shows a contrived situation where we want to classify the language origin of a name. Dutch contains a lot of double letters, so for the purpose of the example we will say that any double letter in the 'name' field means the name is of Dutch origin, and we populate a new field 'lang' with "Dutch". If the name has no double letters, we set 'lang' to "Not Dutch".
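The 'lang' function is not reproduced in the original; a sketch using a regular-expression back-reference to detect any doubled character:

import re

def lang(x):
    # r'(.)\1' matches any character immediately followed by itself ('oo', 'll', ...).
    return 'Dutch' if re.search(r'(.)\1', x) else 'Not Dutch'

dat['lang'] = dat['name'].apply(lang)
print(dat.head(15))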
Running the user-defined function 'lang' generates values for a new field 'lang':
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet state lang
0 0 1304.95 8.800 24.49425 114.18 unknown Not Dutch
1 4 1304.95 8.250 24.49425 114.18 unknown Not Dutch
2 3 1304.95 8.250 24.49425 114.18 unknown Not Dutch
3 4 1304.95 8.250 24.49425 114.18 unknown Not Dutch
4 3 1304.95 26.400 24.49425 114.18 Michigan Not Dutch
5 1 1304.95 7.150 24.49425 114.18 Michigan Not Dutch
6 2 1304.95 22.165 24.49425 114.18 Michigan Not Dutch
7 2 1304.95 7.865 24.49425 114.18 Michigan Not Dutch
8 1 1304.95 8.250 24.49425 114.18 unknown Not Dutch
9 0 1304.95 8.800 24.49425 114.18 unknown Dutch
As can be seen, the "School" name is classified as "Dutch" in the new 'lang' field. The others are "Not Dutch".
Plotting versus a Goal
In our last exercise, we construct a plot using matplotlib. We will plot the 'net' field versus the goal 'n95', the 95th percentile of 'net'.
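The plotting cell is not shown in the original; a sketch along these lines produces the goal-line chart (the axis labels are our choice):

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(dat.index, dat['net'], label='net')
plt.plot(dat.index, dat['n95'], label='n95 goal', linestyle='--')
plt.xlabel('time interval (row)')
plt.ylabel('net')
plt.legend()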
Running the matplotlib code ('inline' tells Jupyter to render the plot inline in the notebook) displays the goal-line chart.
For each time interval on the x-axis, the plot shows our shortfall from, or excess over, the n95 goal. Useful in status-report meetings!
This concludes our quick overview of Pandas Lambda ETL transformations. Give these handy-dandy Lambda ETL patterns a try!
---
About the author:
Dr. Mark Ginsburg has taught information systems topics such as networks and software development at NYU Stern, U of Arizona, and U California/Berkeley SIMS School. He has numerous peer-reviewed published articles in IEEE Computer, CACM, J. Electronic Commerce, and elsewhere. He was the lead developer on the NSF EDGAR-on-the-Internet project, a flagship effort to put mass corporate disclosure data online for the small investor. He is in the Data Engineering and Cloud Services practice at Unify Consulting. The techniques presented here come in very handy at large tech clients.
Selected publications: https://scholar.google.com/citations?user=vmQ29zkAAAAJ&hl=en