PyData London

I recently attended the PyData conference in London, an in-person event at the Leonardo Royal Hotel by Tower Bridge. It offered a really interesting collection of talks, demos and workshops on Python packages, with a diverse line-up of speakers, and it was fascinating to see the latest trends and tooling.

A low-code Python library to build full web applications

Taipy is a new low-code Python package for quickly building data science web applications covering graphical visualisation and the management of algorithms, pipelines, and scenarios. It is composed of two main independent components: Taipy Core and Taipy GUI.


Taipy GUI can create an interactive and powerful user interface with a few lines of code. It uses Markdown notation to build simple web pages; see the example below.

from taipy import Gui

# a one-line page defined in Markdown; run() starts a local web server
Gui(page="# Getting started with *Taipy*").run()

This starts a local web server:

[Taipy][INFO]  * Server starting on http://127.0.0.1:5000

Opening the link in your browser shows a page that looks like this:

Webpage created using Markdown syntax
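Beyond static text, the Markdown can bind Python variables through Taipy's visual-element syntax. Here is a minimal sketch, assuming Taipy's built-in slider control (variable and page names are illustrative):

from taipy import Gui

value = 50

# <|{value}|slider|> is a Taipy visual element bound to the Python variable
page = """
# Getting started with *Taipy*

Move the slider: <|{value}|slider|>

The value is <|{value}|>.
"""

Gui(page=page).run()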

Taipy Core creates scenarios, uses models, retrieves metrics easily, and applies version control to application configuration. Overall, Taipy provides a really useful UI that is easy to code and quite complete in comparison to Plotly, another web app framework that is a little more complex in its coding.
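I did not capture the exact Core code from the talk, but the getting-started pattern looks roughly like this sketch based on Taipy's configuration API (the data node, task and scenario names are illustrative):

import taipy as tp
from taipy import Config

def double(nb):
    return nb * 2

# configure data nodes and a task, then wire them into a scenario
input_cfg = Config.configure_data_node("input", default_data=21)
output_cfg = Config.configure_data_node("output")
task_cfg = Config.configure_task("double", double, input_cfg, output_cfg)
scenario_cfg = Config.configure_scenario_from_tasks("my_scenario", [task_cfg])

# start the Core service, create a scenario from the config and run it
tp.Core().run()
scenario = tp.create_scenario(scenario_cfg)
tp.submit(scenario)
print(scenario.output.read())  # 42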


Polars data frame library written using the Rust programming language

Polars is a Python data frame library written in Rust that can be much faster than Pandas when processing large datasets. It uses a composable expression pattern and can run operations in parallel. Its syntax for creating a data frame from a dictionary is similar to the Pandas library.

import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

print(df)        

When a data frame is printed, Polars also displays its size automatically as a shape tuple.

Polars data frame print out

Polars data frames allow conditions to be applied to slice the data using filters.

print(df.filter(pl.col("B") > 2))        

The above code keeps only the rows where the value in column B is greater than 2.

Polars data frame filtered on column B values greater than 2

It also has group-by functionality like Pandas, and it can be chained with other expressions such as filter.

print(df.filter(pl.col("B") > 2).groupby("cars").agg(pl.sum("A")))        

The above code keeps the rows where column B is greater than 2, groups them by cars, and returns the sum of column A for each group.

Polars group by aggregate output
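The same query can also be written against Polars' lazy API, which is what lets the engine optimise the plan and run it in parallel. A minimal sketch using the data frame from above:

# lazy version of the same query: Polars builds a query plan and only
# optimises and executes it (in parallel) when .collect() is called
result = (
    df.lazy()
    .filter(pl.col("B") > 2)
    .groupby("cars")
    .agg(pl.sum("A"))
    .collect()
)
print(result)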


The new time series package

Sktime is a Python library that provides a unified framework for machine learning with time series. It offers various time series algorithms and modular tools compatible with the scikit-learn (sklearn) library. It can do time series classification, regression and clustering with custom distances, kernels and feature extraction steps.

from sktime.datasets import load_osuleaf

# load an example time series classification dataset as 3D numpy arrays
X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")

Sktime can import datasets in a similar way to sklearn datasets, and it also builds models with very similar syntax. Here is an example of a 3-nearest-neighbour classifier with simple dynamic time warping distance.

from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

# 3-nearest-neighbour classifier; the default distance is dynamic time warping
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3)

Here is another example: a 3-nearest-neighbour classifier with mean (over time points) pairwise Euclidean distance.

from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)        

The classifier fits and predicts like an sklearn estimator.

# fit/train the classifier
clf.fit(X_train, y_train)

# predict labels on new data
y_pred = clf.predict(X_new)        
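Because the interface is sklearn-compatible, the usual sklearn metrics also work on the predictions. A quick sketch checking in-sample accuracy (the test labels were discarded above, so this scores on the training split):

from sklearn.metrics import accuracy_score

# in-sample sanity check: score the classifier on the data it was trained on
print(accuracy_score(y_train, clf.predict(X_train)))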

Sktime also supports and recognises multiple data formats, including dask and xarray, through its abstract data-type system of scitypes.
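For example, sktime's datatypes module can convert between these representations. A minimal sketch, assuming the numpy3D training data loaded above:

from sktime.datatypes import convert_to

# convert the 3D numpy panel into a pandas multi-index representation
X_mi = convert_to(X_train, to_type="pd-multiindex")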

Best practices for data science

It was helpful to understand how to manage virtual environments and dependencies using pyproject.toml, venv, and pip-tools.
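As a rough sketch of that workflow (file names are illustrative, and it assumes a recent pip-tools that accepts pyproject.toml as input), the environment is created with venv and the dependencies pinned with pip-tools:

python -m venv .venv
source .venv/bin/activate

pip install pip-tools
pip-compile pyproject.toml   # resolve and pin dependencies into requirements.txt
pip-sync requirements.txt    # install exactly the pinned set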

Helper packages in a pyproject.toml file

It is always recommended to run pre-commit hooks on Python files using installed helper packages.


These helper packages can be listed in a pyproject.toml file with their configurations, as sketched after the list below.


black - reformats code in a consistent, PEP 8-compliant style and increases readability.

ruff - an extremely fast Python linter, written in Rust.

isort - sorts imports alphabetically and automatically separates into sections and by type.

pydocstyle - is a static analysis tool for checking compliance with Python docstring conventions.

sqlfluff - is a dialect-flexible and configurable SQL linter.
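Several of these tools read their settings directly from pyproject.toml, so the configuration can live alongside the project metadata. A minimal sketch of what those sections might look like (the values are illustrative, not recommendations):

[tool.black]
line-length = 88

[tool.isort]
profile = "black"

[tool.ruff]
line-length = 88

[tool.pydocstyle]
convention = "numpy"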

Also something to beware of is package squatting, which is essentially typo-squatting: a malicious package is uploaded with a name similar to a common package. Take, for example, misspelling pandas by forgetting the s in the install.

pip install panda         

Rather than installing the real package with the correct spelling:

pip install pandas        
PyWhy is a Python library used for causal machine learning

PyWhy is a Python project that provides various statistical methods for causal analysis, used to establish cause-and-effect relationships. The talk used a kidney dataset to build a causal model explaining machine learning; the code below follows the same pattern using DoWhy's synthetic linear dataset.

from dowhy import CausalModel
import dowhy.datasets

# synthetic dataset standing in for the kidney data from the talk;
# linear_dataset returns a dict with the data frame, graph and variable names
data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=5, num_samples=1000, treatment_is_binary=True
)

model = CausalModel(
    data=data["df"],
    graph=data["gml_graph"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"]
)

identified_estimand = model.identify_effect()

PyWhy focuses on the interpretability of its output, allowing you to inspect the untested assumptions, the identified estimands (if any) and the estimate (if any).

# estimate the average treatment effect (ATE) using backdoor linear regression
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    target_units='ate'
)

Here is a sample output of the linear regression estimator code above.

Causality meets machine learning
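DoWhy also encourages stress-testing the result with its built-in refuters; a minimal sketch using one of them (the estimated effect should barely move if the model is sound):

# refute the estimate by adding a random common cause to the data
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)
print(refutation)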
New features in the Dask data frame library

Dask, a parallel computing library developed closely alongside Pandas, has added new features in the last year, such as memory-stable shuffling, task queueing, and recent experiments in query optimization. A Dask data frame is a large parallel data frame composed of many smaller Pandas data frames, split along the index. It is partitioned row-wise, grouping rows by index value for efficiency.

The Dask implementation of the Pandas API:

import dask.dataframe as dd
import pandas as pd

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = dd.from_pandas(pd.DataFrame(data=d), npartitions=2)        
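Dask collections are lazy, so operations build a task graph that only executes when you call .compute(). A small sketch using the frame above:

# .sum() builds a task graph; .compute() runs it and returns a plain result
total = df['col1'].sum().compute()
print(total)  # 10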

The Dask implementation of the NumPy API:

import dask.array as da

x = da.random.random(size=(10000, 10000),
                     chunks=(1000, 1000))        
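The same lazy pattern applies to arrays, where each 1000 x 1000 chunk can be processed in parallel:

# mean over all chunks; compute() aggregates the partial results
print(x.mean().compute())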

The Dask-ML implementation of the sklearn API:

from dask_ml.linear_model import LogisticRegression

lr = LogisticRegression()
# X_train and y_train would be Dask arrays of features and labels
lr.fit(X_train, y_train)

Dask is improving the performance of tabular data processing in Python. Pandas, the community favourite, is also innovating with structural improvements like Arrow data types, copy-on-write, and more, but Dask brings the world's most popular data frame library to significantly better performance and memory use.


A scriptable annotation tool from the creators of spaCy

spaCy, the Python library for natural language processing, has over 140 million downloads. Its creators have built Prodigy, a modern scriptable annotation tool for machine learning developers. It currently has 8,000 users, with 700+ companies using Prodigy for large language models, and custom recipes can be coded as follows.

import prodigy

@prodigy.recipe("my-custom-recipe")
def custom_recipe(dataset, view_id):
    # load_my_custom_stream is a placeholder for your own data loader
    stream = load_my_custom_stream()

    # callback invoked whenever the annotator submits a batch of answers
    def update(examples):
        print(f"Received {len(examples)} answers!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update
    }
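A custom recipe like this is then started from the command line, pointing Prodigy at the file that defines it (the dataset name, view id and file name below are illustrative):

prodigy my-custom-recipe my_dataset text -F recipe.py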

