Painless ETL with Python Pandas and Lambda Functions
The Python Pandas library, and lambda functions in particular, make Extract-Transform-Load (ETL) work painless. ETL is an essential step in business computing before any analytics can begin, akin to the all-important taping and tarping prep before painting a house.
What are Lambda functions?
In Python and other languages, we use the lambda keyword to declare an anonymous function, which is why we refer to them as "lambda functions". An anonymous function is simply a function declared with no name. Although they look different syntactically, lambda functions behave the same way as named functions declared with the def keyword, with one restriction: a lambda's body must be a single expression.
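For instance, the following named function and lambda are interchangeable (a minimal illustration; the names here are ours):

def add_one(x):
    return x + 1

# The same logic as an anonymous function, bound to a name only so we can call it below.
add_one_lambda = lambda x: x + 1

print(add_one(41), add_one_lambda(41))  # 42 42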
Why are Lambda functions useful?
In conjunction with Pandas dataframes, we use lambda functions to transform fields or synthesize new ones, giving us a flexible, low-effort ETL. This lets us escape the limited SQL syntax we are stuck with when confined to a database environment alone. In addition, the solutions are highly customized to the client (unlike third-party ETL packages, which need a lot of tuning) and reusable, since many patterns repeat between use-cases. In this article we will see examples of the most common ones.
Lambdas and Notebooks
Typically, lambda functions are part of an ETL pipeline that can be housed, e.g., in a Jupyter Notebook for reproducibility and shareability. A notebook is a mixture of code and comments that helps third parties understand the overall flow.
Motivation
AMZN, FB, and many other clients make extensive use of Python notebooks in conjunction with their database work (MySQL, Hive, Presto, DynamoDB, etc.). There is no shortage of ETL work, and lambdas make for efficient ETL.
Lambda Origins
Lambda functions come from Alonzo Church's research in the 1930s on the lambda calculus. They pop up frequently in the LISP programming language. The function is defined without an identifier, so if it is used only once, it is syntactically lighter than a named function.
False Cognates
Note that Python's lambda functions are unrelated to AWS Lambda, the serverless compute service that happens to share the name.
Now let's look at some code examples and explain the common use-cases that motivate each.
Here is the starting point: some customer data with typical malformations and missing values.
This assumes we have set up a connection 'conn' with our database credentials and the appropriate driver (psycopg2 in the case of Postgres). The handy read_sql_query function copies the column names from the database table into the Pandas dataframe.
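A minimal sketch of that setup (the connection parameters and the table name 'customer' are placeholders, not taken from the original):

import pandas as pd
import psycopg2

# Placeholder credentials; substitute your own.
conn = psycopg2.connect(host='localhost', dbname='mydb', user='analyst', password='...')

# read_sql_query copies the column names from the result set into the dataframe;
# parse_dates ensures 'ts' arrives as a datetime column.
dat = pd.read_sql_query('SELECT * FROM customer', conn, parse_dates=['ts'])
print(dat.head(15))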
id name city industry_type someval ts
0 4 Samsonic Pleasant 1 NaN 2009-03-01
1 7 Samony Jackson 2 NaN 2014-04-02
2 9 Orange Jackson 2 NaN 2015-11-13
3 14 Frank\n None 4 NaN 2017-08-03
4 3 j motor city 4 32.0 2018-11-02
5 333 j mator city 6 NaN 2018-12-10
6 335 j mator city 6 31.0 2020-01-03
7 535 j mator city 6 11.0 2020-11-19
8 12 None Overton 3 NaN 2021-01-09
9 6 School Oaktown\n 1 NaN 2010-03-03
Let's take a look at this dataset. 'name' has some entries with an extraneous line feed ('\n'), some entries that look too short ('j'), and an entry that is missing ('None' is the default value passed from the Postgres database when a value is missing). 'city' has similar issues. 'someval' has quite a few missing values (NaN means 'not a number'), but 'ts' (the timestamp) looks complete and well-formed.
Finding the difference between timestamp entries
A common use-case is to take timestamps and calculate the delta between them. We compute the time delta as follows:
dat['day_intrvl'] = dat['ts'].diff()
Now we have a Timedelta object in the day_intrvl field. Note the first row will not have a valid value; this is represented by 'NaT' ('Not a Time').
id name city industry_type someval ts day_intrvl
0 4 Samsonic Pleasant 1 NaN 2009-03-01 NaT
1 7 Samony Jackson 2 NaN 2014-04-02 1858 days
2 9 Orange Jackson 2 NaN 2015-11-13 590 days
3 14 Frank\n None 4 NaN 2017-08-03 629 days
4 3 j motor city 4 32.0 2018-11-02 456 days
5 333 j mator city 6 NaN 2018-12-10 38 days
6 335 j mator city 6 31.0 2020-01-03 389 days
7 535 j mator city 6 11.0 2020-11-19 321 days
8 12 None Overton 3 NaN 2021-01-09 51 days
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965 days
However, the Timedelta, expressed as 'N days' where N is the number of days between consecutive rows, is not useful for calculations. We would rather have just the N and drop the 'days' part. We also want to make sure N is an integer.
Breaking a string into parts, converting the first part to an integer, and assigning 0 to a non-numeric value
To accomplish this, we perform a chained lambda transformation:
dat['day_intrvl'] = dat['ts'].diff().astype(str).apply(lambda x: x.split()[0]).apply(lambda x: 0 if x == 'NaT' else int(x))
print(dat.head(15))
What is happening here? We calculate the Timedelta as before, then convert it to a string with astype(str). We then apply a lambda that splits the string and keeps only the first token, at offset [0] (the N part of 'N days'). Finally, we set the value to zero if the token was 'NaT'; otherwise we convert the string representation of the integer into an actual integer.
This improves the usability of the day_intrvl field because now we can do calculations with it, plot it, and bin it.
id name city industry_type someval ts day_intrvl
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank\n None 4 NaN 2017-08-03 629
4 3 j motor city 4 32.0 2018-11-02 456
5 333 j mator city 6 NaN 2018-12-10 38
6 335 j mator city 6 31.0 2020-01-03 389
7 535 j mator city 6 11.0 2020-11-19 321
8 12 None Overton 3 NaN 2021-01-09 51
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965
Dividing the Data into 5 Bins and Finding the 95th Percentile
What's next? Let's divide the day intervals into bins (5 bins in our case), and also find the 95th percentile value.
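The binning code is not shown in the original; one way to reproduce the output below uses Pandas' quantile-based qcut together with quantile:

# Five quantile-based bins, numbered 0 through 4 (labels=False returns the bin index).
dat['intrvl_bins'] = pd.qcut(dat['day_intrvl'], 5, labels=False)

# The 95th-percentile day interval, broadcast to every row.
dat['intrvl95'] = dat['day_intrvl'].quantile(0.95)
print(dat.head(15))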
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank\n None 4 NaN 2017-08-03 629
4 3 j motor city 4 32.0 2018-11-02 456
5 333 j mator city 6 NaN 2018-12-10 38
6 335 j mator city 6 31.0 2020-01-03 389
7 535 j mator city 6 11.0 2020-11-19 321
8 12 None Overton 3 NaN 2021-01-09 51
9 6 School Oaktown\n 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Binning is useful for capturing intensities; think of a 5-point or 7-point Likert scale grouping consumer propensities to take some action, such as booking a hotel room. We have now synthesized two new fields: intrvl_bins (the bin number) and intrvl95 (the 95th-percentile value). Finding a given percentile is a common use-case when metering a service; the percentile might be the goal we fall short of or exceed.
Now let's take a look at the city and name fields. We have business logic telling us that 'mator' is a common misspelling of 'motor', and additional logic telling us that both 'mator city' and 'motor city' can be normalized to the value 'Detroit'. Let's write a small function that does a regular-expression check on the variations and then normalizes. We also want to substitute 'Unknown' for 'name' and 'city' when there is no value ('None' coming from the database), and strip off extraneous '\n' line feeds where they exist.
Normalizing variations to a common value, dealing with missing values, and stripping off line feeds
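The original snippet is not reproduced here; the following sketch matches the behavior described (the function name 'regexf' comes from the text, but its body is our reconstruction):

import re

def regexf(x):
    # Business logic: 'mator' is a common misspelling of 'motor', and both
    # 'mator city' and 'motor city' normalize to 'Detroit'.
    if re.match(r'm[ao]tor city', x):
        return 'Detroit'
    return x

for col in ['name', 'city']:
    # Replace missing values (None from the database) with 'Unknown',
    # then strip extraneous '\n' line feeds.
    dat[col] = dat[col].apply(lambda x: 'Unknown' if x is None else x.strip('\n'))

dat['city'] = dat['city'].apply(regexf)
print(dat.head(15))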
Note that stripping off unprintable characters such as hex codes is a common use-case when we transport IBM mainframe (i.e., EBCDIC) data to Unix-style ASCII format; EBCDIC data will contain hex codes we want to replace, much as in the snippet above. In this case, we clean up 'city' and 'name' by normalizing missing values and removing the line feeds, and then run the user-defined function 'regexf' on 'city'. The procedure results in the following.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank Unknown 4 NaN 2017-08-03 629
4 3 j Detroit 4 32.0 2018-11-02 456
5 333 j Detroit 6 NaN 2018-12-10 38
6 335 j Detroit 6 31.0 2020-01-03 389
7 535 j Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 NaN 2021-01-09 51
9 6 School Oaktown 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Success! The matching variations have been normalized to 'Detroit'. The name field with a missing value has been converted to 'Unknown', and all name and city entries with an extraneous line feed have that extra '\n' snipped off.
Normalizing a value that is too short
It is very common for data to be invalid because it is too short. Here is how to fix it: in our case, the 'name' field has some entries of 'j', which our business logic deems invalid.
dat['name'] = dat['name'].apply(lambda x: x if len(x) > 1 else 'invalid')
print(dat.head(15))
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 NaN 2009-03-01 0
1 7 Samony Jackson 2 NaN 2014-04-02 1858
2 9 Orange Jackson 2 NaN 2015-11-13 590
3 14 Frank Unknown 4 NaN 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 NaN 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 NaN 2021-01-09 51
9 6 School Oaktown 1 NaN 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
Success! Name values that were too short have been normalized to 'invalid'.
Dealing with non-numeric values
If we take a look at the field 'someval', we notice many rows have NaN, which means Not a Number. We want to replace those with a default; in our case, 10.
A simple lambda expression applies the isnan function (from the math library, which we must import) to check whether the row is NaN. If it is, we put in the baseline value of 10.
import math
dat['someval'] = dat['someval'].apply(lambda x: 10 if math.isnan(x) else x)
print(dat.head(15))
We get rid of the NaN in 'someval' and replace those rows with 10. (Pandas also offers the built-in fillna for this exact job; the lambda form generalizes to arbitrary conditions.)
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95
0 0 1304.95
1 4 1304.95
2 3 1304.95
3 4 1304.95
4 3 1304.95
5 1 1304.95
6 2 1304.95
7 2 1304.95
8 1 1304.95
9 0 1304.95
It worked! Now 'someval' has the default value of 10 inserted where previously it had the inconvenient NaN label. We often want to normalize the datatype and set a default so that subsequent calculations work on all the rows.
Synthesizing a new field based on the contents of two existing fields
It is quite common to derive a new field from multiple existing fields. For example, in Google Cloud, web traffic data includes a visitor's latitude and longitude. We could derive the metro region from this by writing a function that draws a bounding box around major metros such as New York City, Chicago, etc. Why do this? Because the CMO will often want web traffic broken out by major metro region for ad-spend planning. In our learning case we will just build a new field, 'net', that applies a tax to the 'industry_type' and 'someval' fields.
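The 'tax' function itself is not shown in the original; here is one implementation consistent with the output below (the rate schedule of 20%, 25%, and 35% by industry_type is an assumption reverse-engineered from the numbers):

def tax(industry_type, someval):
    # Assumed rate schedule: type 1 pays 20%, types 2-4 pay 25%, everything else 35%.
    if industry_type == 1:
        rate = 0.20
    elif industry_type in (2, 3, 4):
        rate = 0.25
    else:
        rate = 0.35
    return someval * (1 - rate)

# A row-wise lambda hands both fields to the function (axis=1 walks over rows).
dat['net'] = dat.apply(lambda row: tax(row['industry_type'], row['someval']), axis=1)
print(dat.head(15))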
We run the user-defined function 'tax' and get:
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95 net
0 0 1304.95 8.00
1 4 1304.95 7.50
2 3 1304.95 7.50
3 4 1304.95 7.50
4 3 1304.95 24.00
5 1 1304.95 6.50
6 2 1304.95 20.15
7 2 1304.95 7.15
8 1 1304.95 7.50
9 0 1304.95 8.00
We notice a new field 'net' has appeared and its value is derived from 'industry_type' and 'someval' fields as per the 'tax' function we invoke.
Creating a Goal: The 95th Percentile
It is often the case that we want to define a goal on a metric. In this case, we want to define a goal of the 95th percentile of 'net'. We use the built-in Pandas function quantile to define a new field 'n95'.
dat['n95'] = dat['net'].quantile(0.95)
print(dat.head(15))
The output now contains the 95th percentile field 'n95'.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 10.0 2009-03-01 0
1 7 Samony Jackson 2 10.0 2014-04-02 1858
2 9 Orange Jackson 2 10.0 2015-11-13 590
3 14 Frank Unknown 4 10.0 2017-08-03 629
4 3 invalid Detroit 4 32.0 2018-11-02 456
5 333 invalid Detroit 6 10.0 2018-12-10 38
6 335 invalid Detroit 6 31.0 2020-01-03 389
7 535 invalid Detroit 6 11.0 2020-11-19 321
8 12 Unknown Overton 3 10.0 2021-01-09 51
9 6 School Oaktown 1 10.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95
0 0 1304.95 8.00 22.2675
1 4 1304.95 7.50 22.2675
2 3 1304.95 7.50 22.2675
3 4 1304.95 7.50 22.2675
4 3 1304.95 24.00 22.2675
5 1 1304.95 6.50 22.2675
6 2 1304.95 20.15 22.2675
7 2 1304.95 7.15 22.2675
8 1 1304.95 7.50 22.2675
9 0 1304.95 8.00 22.2675
Success! The new field 'n95' contains the goal, in this case 22.2675, populated for every row. Another common use-case is to calculate a lower and an upper bound (for example, the 30th and 70th percentiles) to capture a chunk of normally distributed data by subsetting according to those bounds. Here, we will just use the 95th percentile as a goal line to plot against at the end of our discussion.
Chained recalculation after a value change
Quite often a field (or fields) will change, and this requires recalculation. We simply chain the recalculations together so all the data moves in lockstep.
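For example, suppose business logic bumps 'someval' upward by 10%; the dependent fields 'net' and 'n95' must then be recomputed in the same cell (a sketch, reusing the hypothetical 'tax' function from above):

# The value change, followed by every field that depends on it.
dat['someval'] = dat['someval'].apply(lambda x: x * 1.1)
dat['net'] = dat.apply(lambda row: tax(row['industry_type'], row['someval']), axis=1)
dat['n95'] = dat['net'].quantile(0.95)
print(dat.head(15))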
Notice how all the related fields update:
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95
0 0 1304.95 8.800 24.49425
1 4 1304.95 8.250 24.49425
2 3 1304.95 8.250 24.49425
3 4 1304.95 8.250 24.49425
4 3 1304.95 26.400 24.49425
5 1 1304.95 7.150 24.49425
6 2 1304.95 22.165 24.49425
7 2 1304.95 7.865 24.49425
8 1 1304.95 8.250 24.49425
9 0 1304.95 8.800 24.49425
We see that after someval was adjusted upward by 10%, we also adjust other fields that depend on someval accordingly.
Calculating a Column Total
Sometimes we want to know the total of a field. We use the reduce function from the functools library to calculate the total of the 'net' field; it adds the elements pairwise until the answer is reduced to a single value.
import functools
dat['totalnet'] = functools.reduce(lambda a, b: a + b, dat['net'])
print(dat)
This gives us the new 'totalnet' field. (For a plain sum, dat['net'].sum() is the idiomatic shortcut; reduce earns its keep when the pairwise operation is more exotic than addition.)
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet
0 0 1304.95 8.800 24.49425 114.18
1 4 1304.95 8.250 24.49425 114.18
2 3 1304.95 8.250 24.49425 114.18
3 4 1304.95 8.250 24.49425 114.18
4 3 1304.95 26.400 24.49425 114.18
5 1 1304.95 7.150 24.49425 114.18
6 2 1304.95 22.165 24.49425 114.18
7 2 1304.95 7.865 24.49425 114.18
8 1 1304.95 8.250 24.49425 114.18
9 0 1304.95 8.800 24.49425 114.18
Success! We have calculated the total of the 'net' field in a new field, 'totalnet'.
Checking entries versus a dictionary
Many times we will have a key-value dictionary defined, and we will want to check whether an entry in our dataset is contained in it. In this example, we define statedict with the key being a city and the value being the name of its state. We then populate a new field 'state' when the city in the 'city' field matches a key in statedict; if there is no match, we set 'state' to 'unknown'.
statedict = {'Detroit': 'Michigan'}
dat['state'] = dat['city'].apply(lambda x: statedict[x] if x in statedict else 'unknown')
print(dat.head(15))
Take a look at the contents of the new 'state' field.
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet state
0 0 1304.95 8.800 24.49425 114.18 unknown
1 4 1304.95 8.250 24.49425 114.18 unknown
2 3 1304.95 8.250 24.49425 114.18 unknown
3 4 1304.95 8.250 24.49425 114.18 unknown
4 3 1304.95 26.400 24.49425 114.18 Michigan
5 1 1304.95 7.150 24.49425 114.18 Michigan
6 2 1304.95 22.165 24.49425 114.18 Michigan
7 2 1304.95 7.865 24.49425 114.18 Michigan
8 1 1304.95 8.250 24.49425 114.18 unknown
9 0 1304.95 8.800 24.49425 114.18 unknown
The state is defined as Michigan for all 'Detroit' entries because the city key matched the dictionary in those cases.
String Manipulation and Classification
The next example shows a contrived situation where we want to classify the language origin of a name. Dutch contains a lot of double letters, so for the purpose of the example we will say that any double letter in the 'name' field means the name is of Dutch origin, and we populate a new field 'lang' with "Dutch". If the name has no double letters, we set 'lang' to "Not Dutch".
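The 'lang' function is not reproduced in the original; a sketch using a regular-expression back-reference to detect any doubled character:

import re

def lang(x):
    # r'(.)\1' matches any character immediately followed by itself ('oo', 'll', ...).
    return 'Dutch' if re.search(r'(.)\1', x) else 'Not Dutch'

dat['lang'] = dat['name'].apply(lang)
print(dat.head(15))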
Running the user-defined function 'lang' generates values for a new field 'lang':
id name city industry_type someval ts day_intrvl \
0 4 Samsonic Pleasant 1 11.0 2009-03-01 0
1 7 Samony Jackson 2 11.0 2014-04-02 1858
2 9 Orange Jackson 2 11.0 2015-11-13 590
3 14 Frank Unknown 4 11.0 2017-08-03 629
4 3 invalid Detroit 4 35.2 2018-11-02 456
5 333 invalid Detroit 6 11.0 2018-12-10 38
6 335 invalid Detroit 6 34.1 2020-01-03 389
7 535 invalid Detroit 6 12.1 2020-11-19 321
8 12 Unknown Overton 3 11.0 2021-01-09 51
9 6 School Oaktown 1 11.0 2010-03-03 -3965
intrvl_bins intrvl95 net n95 totalnet state lang
0 0 1304.95 8.800 24.49425 114.18 unknown Not Dutch
1 4 1304.95 8.250 24.49425 114.18 unknown Not Dutch
2 3 1304.95 8.250 24.49425 114.18 unknown Not Dutch
3 4 1304.95 8.250 24.49425 114.18 unknown Not Dutch
4 3 1304.95 26.400 24.49425 114.18 Michigan Not Dutch
5 1 1304.95 7.150 24.49425 114.18 Michigan Not Dutch
6 2 1304.95 22.165 24.49425 114.18 Michigan Not Dutch
7 2 1304.95 7.865 24.49425 114.18 Michigan Not Dutch
8 1 1304.95 8.250 24.49425 114.18 unknown Not Dutch
9 0 1304.95 8.800 24.49425 114.18 unknown Dutch
As can be seen, the "School" name is classified as "Dutch" in the new 'lang' field. The others are "Not Dutch".
Plotting versus a Goal
In our last exercise, we construct a plot using matplotlib. We will plot the 'net' field versus the goal 'n95', the 95th percentile of 'net'.
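The plotting cell is not shown in the original; a sketch along these lines produces the goal-line chart (the axis labels are our choice):

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(dat.index, dat['net'], label='net')
plt.plot(dat.index, dat['n95'], label='n95 goal', linestyle='--')
plt.xlabel('time interval (row)')
plt.ylabel('net')
plt.legend()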
Running the matplotlib code ('inline' tells Jupyter to render the plot inline in the notebook) displays the goal-line chart.
For each time interval on the x-axis, the plot shows our shortfall from, or excess over, the n95 goal. Useful in status-report meetings!
This concludes our quick overview of Pandas Lambda ETL transformations. Give these handy-dandy Lambda ETL patterns a try!
---
About the author:
Dr. Mark Ginsburg has taught information systems topics such as networks and software development at NYU Stern, U of Arizona, and U California/Berkeley SIMS School. He has numerous peer-reviewed published articles in IEEE Computer, CACM, J. Electronic Commerce, and elsewhere. He was the lead developer on the NSF EDGAR-on-the-Internet project, a flagship effort to put mass corporate disclosure data online for the small investor. He is in the Data Engineering and Cloud Services practice at Unify Consulting. The techniques presented here come in very handy at large tech clients.
Selected publications: https://scholar.google.com/citations?user=vmQ29zkAAAAJ&hl=en