PyData London

I recently attended the PyData conference in London, an in-person event at the Leonardo Royal Hotel by Tower Bridge. It offered a really interesting collection of talks, demos and workshops on Python packages, with a diverse line-up of speakers, and it was fascinating to see the latest trends and tooling.

A low-code Python library to build full web applications

Taipy is a new low-code Python package for quickly building data science web applications covering graphical visualisation and the management of algorithms, pipelines, and scenarios. It is composed of two main independent components: Taipy Core and Taipy GUI.


Taipy GUI can create an interactive and powerful user interface with a few lines of code. It uses Markdown notation to build simple web pages; see the example below.

from taipy import Gui

# a one-line page defined in Markdown; run() starts a local web server
Gui(page="# Getting started with *Taipy*").run()

This starts a local web server:

[Taipy][INFO]  * Server starting on http://127.0.0.1:5000

Opening the link in your browser shows a page that looks like this:

Webpage created using Markdown syntax
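Beyond static text, the Markdown can bind Python variables through Taipy's visual-element syntax. Here is a minimal sketch, assuming Taipy's built-in slider control (variable and page names are illustrative):

from taipy import Gui

value = 50

# <|{value}|slider|> is a Taipy visual element bound to the Python variable
page = """
# Getting started with *Taipy*

Move the slider: <|{value}|slider|>

The value is <|{value}|>.
"""

Gui(page=page).run()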

Taipy Core creates scenarios, uses models, retrieves metrics easily, and applies version control to application configuration. Overall, Taipy provides a really useful UI that is easy to code and quite complete in comparison to Plotly, another web app framework that is a little more complex in its coding.
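I did not capture the exact Core code from the talk, but the getting-started pattern looks roughly like this sketch based on Taipy's configuration API (the data node, task and scenario names are illustrative):

import taipy as tp
from taipy import Config

def double(nb):
    return nb * 2

# configure data nodes and a task, then wire them into a scenario
input_cfg = Config.configure_data_node("input", default_data=21)
output_cfg = Config.configure_data_node("output")
task_cfg = Config.configure_task("double", double, input_cfg, output_cfg)
scenario_cfg = Config.configure_scenario_from_tasks("my_scenario", [task_cfg])

# start the Core service, create a scenario from the config and run it
tp.Core().run()
scenario = tp.create_scenario(scenario_cfg)
tp.submit(scenario)
print(scenario.output.read())  # 42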


Polars data frame library written using the Rust programming language

Polars is a Python data frame library written in Rust that can be much faster than Pandas when processing large datasets. It uses a composable expression pattern and can run operations in parallel. Its syntax for creating a data frame from a dictionary is similar to the Pandas library.

import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

print(df)        

When a data frame is printed, Polars also displays its size automatically as a shape tuple.

Polars data frame print out

Polars data frames allow conditions to be applied to slice the data using filters.

print(df.filter(pl.col("B") > 2))        

The above code keeps only the rows where the value in column B is greater than 2.

Polars data frame filtered on column B values greater than 2

It also has group-by functionality like Pandas, and it can be chained with other expressions such as filter.

print(df.filter(pl.col("B") > 2).groupby("cars").agg(pl.sum("A")))        

The above code keeps the rows where column B is greater than 2, groups them by cars, and returns the sum of column A for each group.

Polars group by aggregate output
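The same query can also be written against Polars' lazy API, which is what lets the engine optimise the plan and run it in parallel. A minimal sketch using the data frame from above:

# lazy version of the same query: Polars builds a query plan and only
# optimises and executes it (in parallel) when .collect() is called
result = (
    df.lazy()
    .filter(pl.col("B") > 2)
    .groupby("cars")
    .agg(pl.sum("A"))
    .collect()
)
print(result)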


The new time series package

Sktime is a Python library that provides a unified framework for machine learning with time series. It offers various time series algorithms and modular tools compatible with the scikit-learn (sklearn) library. It can do time series classification, regression and clustering with custom distances, kernels and feature extraction steps.

from sktime.datasets import load_osuleaf

# load an example time series classification dataset as 3D numpy arrays
X_train, y_train = load_osuleaf(split="train", return_type="numpy3D")
X_new, _ = load_osuleaf(split="test", return_type="numpy3D")

Sktime can import datasets in a similar way to sklearn datasets, and it also builds models with very similar syntax. Here is an example of a 3-nearest-neighbour classifier with simple dynamic time warping distance.

from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

# 3-nearest-neighbour classifier; the default distance is dynamic time warping
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3)

Here is another example: a 3-nearest-neighbour classifier with mean (over time points) pairwise Euclidean distance.

from sktime.dists_kernels.compose_tab_to_panel import AggrDist
from sktime.dists_kernels import ScipyDist

mean_eucl_dist = AggrDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=mean_eucl_dist)        

The classifier fits and predicts like an sklearn estimator.

# fit/train the classifier
clf.fit(X_train, y_train)

# predict labels on new data
y_pred = clf.predict(X_new)        
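Because the interface is sklearn-compatible, the usual sklearn metrics also work on the predictions. A quick sketch checking in-sample accuracy (the test labels were discarded above, so this scores on the training split):

from sklearn.metrics import accuracy_score

# in-sample sanity check: score the classifier on the data it was trained on
print(accuracy_score(y_train, clf.predict(X_train)))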

Sktime also supports and recognises multiple data formats, including dask and xarray, through its abstract data-type system of scitypes.
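For example, sktime's datatypes module can convert between these representations. A minimal sketch, assuming the numpy3D training data loaded above:

from sktime.datatypes import convert_to

# convert the 3D numpy panel into a pandas multi-index representation
X_mi = convert_to(X_train, to_type="pd-multiindex")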

Best practices for data science

It was helpful to understand how to manage virtual environments and dependencies using pyproject.toml, venv, and pip-tools.
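As a rough sketch of that workflow (file names are illustrative, and it assumes a recent pip-tools that accepts pyproject.toml as input), the environment is created with venv and the dependencies pinned with pip-tools:

python -m venv .venv
source .venv/bin/activate

pip install pip-tools
pip-compile pyproject.toml   # resolve and pin dependencies into requirements.txt
pip-sync requirements.txt    # install exactly the pinned set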

Helper packages in a pyproject.toml file

It is always recommended to run pre-commit hooks on Python files using installed helper packages.


These helper packages can be listed in a pyproject.toml file with their configurations, as sketched after the list below.


black - reformats code in a consistent, PEP 8-compliant style and increases readability.

ruff - an extremely fast Python linter, written in Rust.

isort - sorts imports alphabetically and automatically separates into sections and by type.

pydocstyle - is a static analysis tool for checking compliance with Python docstring conventions.

sqlfluff - is a dialect-flexible and configurable SQL linter.
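Several of these tools read their settings directly from pyproject.toml, so the configuration can live alongside the project metadata. A minimal sketch of what those sections might look like (the values are illustrative, not recommendations):

[tool.black]
line-length = 88

[tool.isort]
profile = "black"

[tool.ruff]
line-length = 88

[tool.pydocstyle]
convention = "numpy"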

Also something to beware of is package squatting, which is essentially typo-squatting: a malicious package is uploaded with a name similar to a common package. Take, for example, misspelling pandas by forgetting the s in the install.

pip install panda         

Rather than installing the real package with the correct spelling:

pip install pandas        
PyWhy is a Python library used for causal machine learning

PyWhy is a Python project that provides various statistical methods for causal analysis, used to establish cause-and-effect relationships. The talk used a kidney dataset to build a causal model explaining machine learning; the code below follows the same pattern using DoWhy's synthetic linear dataset.

from dowhy import CausalModel
import dowhy.datasets

# synthetic dataset standing in for the kidney data from the talk;
# linear_dataset returns a dict with the data frame, graph and variable names
data = dowhy.datasets.linear_dataset(
    beta=10, num_common_causes=5, num_samples=1000, treatment_is_binary=True
)

model = CausalModel(
    data=data["df"],
    graph=data["gml_graph"],
    treatment=data["treatment_name"],
    outcome=data["outcome_name"]
)

identified_estimand = model.identify_effect()

PyWhy focuses on the interpretability of its output, allowing you to inspect the untested assumptions, the identified estimands (if any) and the estimate (if any).

# estimate the average treatment effect (ATE) using backdoor linear regression
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    target_units='ate'
)

Here is a sample output of the linear regression estimator code above.

Causality meets machine learning
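DoWhy also encourages stress-testing the result with its built-in refuters; a minimal sketch using one of them (the estimated effect should barely move if the model is sound):

# refute the estimate by adding a random common cause to the data
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)
print(refutation)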
New features in the Dask data frame library

Dask, a parallel computing library developed closely alongside Pandas, has added new features in the last year, such as memory-stable shuffling, task queueing, and recent experiments in query optimization. A Dask data frame is a large parallel data frame composed of many smaller Pandas data frames, split along the index. It is partitioned row-wise, grouping rows by index value for efficiency.

The Dask implementation of the Pandas API:

import dask.dataframe as dd
import pandas as pd

d = {'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}
df = dd.from_pandas(pd.DataFrame(data=d), npartitions=2)        
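Dask collections are lazy, so operations build a task graph that only executes when you call .compute(). A small sketch using the frame above:

# .sum() builds a task graph; .compute() runs it and returns a plain result
total = df['col1'].sum().compute()
print(total)  # 10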

The Dask implementation of the NumPy API:

import dask.array as da

x = da.random.random(size=(10000, 10000),
                     chunks=(1000, 1000))        
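The same lazy pattern applies to arrays, where each 1000 x 1000 chunk can be processed in parallel:

# mean over all chunks; compute() aggregates the partial results
print(x.mean().compute())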

The Dask-ML implementation of the sklearn API:

from dask_ml.linear_model import LogisticRegression

lr = LogisticRegression()
# X_train and y_train would be Dask arrays of features and labels
lr.fit(X_train, y_train)

Dask is improving the performance of tabular data processing in Python. Pandas, the community favourite, is also innovating with structural improvements like Arrow data types, copy-on-write, and more, but Dask brings the world's most popular data frame library to significantly better performance and memory use.


A scriptable annotation tool from the creators of spaCy

spaCy, the Python library for natural language processing, has over 140 million downloads. Its creators have built Prodigy, a modern scriptable annotation tool for machine learning developers. It currently has 8,000 users, with 700+ companies using Prodigy for large language models, and custom recipes can be coded as follows.

import prodigy

@prodigy.recipe("my-custom-recipe")
def custom_recipe(dataset, view_id):
    # load_my_custom_stream is a placeholder for your own data loader
    stream = load_my_custom_stream()

    # callback invoked whenever the annotator submits a batch of answers
    def update(examples):
        print(f"Received {len(examples)} answers!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update
    }
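A custom recipe like this is then started from the command line, pointing Prodigy at the file that defines it (the dataset name, view id and file name below are illustrative):

prodigy my-custom-recipe my_dataset text -F recipe.py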

