PyData London
Michelle Conway
Lead Data Scientist at Lloyds Banking Group | Data Science Top Voice | Tech Awards 2024 Judge | 20 in Data & Tech 2023 | MSc Data Science at Birkbeck University | BSc Maths Science at University College Dublin
I recently attended the PyData Conference in London, an in-person event at the Leonardo Royal Hotel by London Tower Bridge. It was a really interesting collection of talks, demos and workshops on Python packages, with a diverse line-up of speakers, and a great way to see the latest trends and tooling.
Taipy is a new low-code Python package for quickly building data science web applications that handle graphical visualization, algorithms, pipelines and scenarios. It is composed of two main independent components: Taipy Core and Taipy GUI.
Taipy GUI can create an interactive and powerful user interface with a few lines of code. It uses Markdown notation to build simple web pages, as in the example below.
from taipy import Gui
Gui(page="# Getting started with *Taipy*").run()
This starts a local web server like so:
[Taipy][INFO] * Server starting on http://127.0.0.1:5000
Opening the http link in your browser displays the rendered page.
Taipy Core creates scenarios, runs models, retrieves metrics easily and applies version control to application configuration. Overall, Taipy provides a really useful UI with simple code and high completeness compared to Plotly's Dash, another web app framework that is a little more complex to code.
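To give a flavour of the Core side, here is a minimal sketch based on the Taipy 3.x Core API; the function, identifiers and values are illustrative rather than taken from the talk.

import taipy as tp
from taipy import Config

def double(nb):
    return nb * 2

# configure a tiny pipeline: input data node -> task -> output data node
input_cfg = Config.configure_data_node(id="input", default_data=21)
output_cfg = Config.configure_data_node(id="output")
task_cfg = Config.configure_task(
    id="double", function=double, input=input_cfg, output=output_cfg
)
scenario_cfg = Config.configure_scenario(id="my_scenario", task_configs=[task_cfg])

tp.Core().run()  # start the Core service
scenario = tp.create_scenario(scenario_cfg)
tp.submit(scenario)  # execute the task graph
print(scenario.output.read())  # 42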
Polars is a data frame library written in Rust that can be much faster than Pandas when processing large datasets. It uses a composable expression pattern and runs queries in parallel. Its syntax for creating a data frame from a dictionary is similar to the Pandas library.
import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
print(df)
When printing out a data frame, it also outputs the size automatically, displayed as a shape tuple.
Polars data frames allow conditions to be applied to slice the data using filters.
print(df.filter(pl.col("B") > 2))
The above code keeps the rows where column B is greater than 2.
It also has group-by functionality like Pandas, which can be chained with other expressions such as filter.
print(df.filter(pl.col("B") > 2).groupby("cars").agg(pl.sum("A")))
The above code keeps the rows where column B is greater than 2, then groups by cars and returns the sum of column A per group.
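Polars can also express the same query through its lazy API, which lets it optimise the whole plan before executing it in parallel across cores. Here is a minimal sketch of the equivalent lazy query.

result = (
    df.lazy()
    .filter(pl.col("B") > 2)
    .groupby("cars")  # renamed group_by in newer Polars releases
    .agg(pl.sum("A"))
    .collect()  # triggers query optimisation and parallel execution
)
print(result)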
Sktime is a Python library providing a unified framework for machine learning with time series. It has various time series algorithms and modular tools that are compatible with the Sklearn library. It can do time series classification, regression and clustering with custom distances, kernels and feature extraction steps.
from sktime.datasets import load_osuleaf
X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")
Sktime can load datasets similarly to Sklearn datasets, and it builds models with very similar syntax. Here is an example of a 3-nearest neighbour classifier with simple dynamic time warping distance.
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3)
Here is another example: a 3-nearest neighbour classifier with mean (over time points) pairwise Euclidean distance.
from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist
mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)
The classifier fits and predicts like Sklearn.
# fit/train the classifier
clf.fit(X_train, y_train)
# predict labels on new data
y_pred = clf.predict(X_new)
Sktime also supports and recognises multiple data formats, including dask, xarray and its abstract data type system of scitypes.
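For example, sktime's datatypes module can convert a panel between its supported machine types, such as turning the numpy3D data loaded above into a pandas multi-index frame; this is a small sketch of that conversion.

from sktime.datatypes import convert_to

# convert the 3D numpy panel into a pandas DataFrame with a (case, time) MultiIndex
X_train_df = convert_to(X_train, to_type="pd-multiindex")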
It was also helpful to learn how to manage virtual environments and dependencies using pyproject.toml, venv, and pip-tools.
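A typical workflow, sketched below with illustrative file names, is to create a virtual environment, compile pinned requirements from the dependencies declared in pyproject.toml, and install exactly that pinned set.

python -m venv .venv                              # create an isolated environment
source .venv/bin/activate
pip install pip-tools
pip-compile pyproject.toml -o requirements.txt    # pin the dependency tree
pip-sync requirements.txt                         # install exactly the pinned set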
It is always recommended to run pre-commit hooks on Python files using installed helper packages.
These helper packages can be listed in a pyproject.toml file with their configurations, as in the sketch after the list below.
black - reformats code into a consistent, PEP 8-compliant style and increases readability.
ruff - an extremely fast Python linter, written in Rust.
isort - sorts imports alphabetically and automatically separates them into sections and by type.
pydocstyle - a static analysis tool for checking compliance with Python docstring conventions.
sqlfluff - a dialect-flexible and configurable SQL linter.
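As a sketch of what those configurations might look like in pyproject.toml (the specific values here are illustrative, not prescriptive):

[tool.black]
line-length = 88

[tool.isort]
profile = "black"  # keeps isort and black from fighting over formatting

[tool.ruff]
line-length = 88

[tool.pydocstyle]
convention = "numpy"

[tool.sqlfluff.core]
dialect = "ansi"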
Also, something to beware of is package squatting, essentially typo-squatting, which occurs when a malicious package is uploaded with a name similar to a common package. Take, for example, misspelling pandas by forgetting the s in the install:
pip install panda
Rather than installing the safe way, with the correct spelling:
pip install pandas
PyWhy is a Python project (home of the DoWhy package) that provides various statistical methods for causal analysis to establish cause-and-effect relationships. The talk's example used a kidney dataset to build a causal model; the snippet below shows the same pattern using the synthetic linear dataset that ships with DoWhy.
from dowhy import CausalModel
import dowhy.datasets

# linear_dataset returns a dict containing the data frame, the causal graph
# and the generated treatment/outcome variable names
kidney_data = dowhy.datasets.linear_dataset()
model = CausalModel(
    data=kidney_data["df"],
    graph=kidney_data["gml_graph"],
    treatment=kidney_data["treatment_name"],
    outcome=kidney_data["outcome_name"],
)
identified_estimand = model.identify_effect()
PyWhy emphasises the interpretability of its output, allowing you to inspect the untested assumptions, the identified estimands (if any) and the estimate (if any).
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    target_units="ate",
)
Running the linear regression estimator above prints the identified estimand alongside the resulting mean estimate.
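DoWhy can also probe those untested assumptions with refutation tests; for example, adding a random common cause should leave a well-specified estimate largely unchanged.

# refutation test: the estimate should be robust to a random common cause
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause",
)
print(refutation)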
Dask is a parallel computing library developed closely with Pandas. It has added new features in the last year, like memory-stable shuffling and task queueing, with recent experiments in query optimization. A Dask data frame is a large parallel data frame composed of many smaller Pandas data frames, split along the index. It is partitioned row-wise, grouping rows by index value for efficiency.
Dask implements the Pandas API:
import dask.dataframe as dd
import pandas as pd

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
# split the frame into two partitions that Dask can process in parallel
df = dd.from_pandas(pd.DataFrame(data=d), npartitions=2)
It also implements the Numpy API:
import dask.array as da

# a 10000 x 10000 random array backed by 1000 x 1000 chunks
x = da.random.random(size=(10000, 10000), chunks=(1000, 1000))
And Dask-ML implements the Sklearn API:
from dask_ml.linear_model import LogisticRegression

lr = LogisticRegression()
# X_train and y_train stand for Dask arrays of features and labels
lr.fit(X_train, y_train)
Dask is improving the performance of tabular data work in Python. Pandas, the community favourite, is also innovating with structural improvements like Arrow data types, copy-on-write, and more, but Dask brings the world's most popular dataframe library significantly better performance and memory use at scale.
Spacy, the Python library for natural language processing, has over 140 million downloads. Its creators have built Prodigy, a modern scriptable annotation tool for machine learning developers. Prodigy currently has 8,000 users across 700+ companies, is being used for large language model workflows, and a custom recipe can be coded as follows.
import prodigy

@prodigy.recipe("my-custom-recipe")
def custom_recipe(dataset, view_id):
    # load_my_custom_stream is a placeholder for your own data loader
    stream = load_my_custom_stream()

    def update(examples):
        # called whenever the web app sends back a batch of answers
        print(f"Received {len(examples)} answers!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update,
    }
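Assuming the recipe above is saved as recipe.py, it could then be started from the command line like so, with the dataset name and view_id passed as the recipe's arguments (the names here are illustrative):

prodigy my-custom-recipe my_dataset text -F recipe.py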