The dlt Project, Data Science at the Command Line
Rami Krispin
Senior Manager - Data Science and Engineering at Apple | Docker Captain | LinkedIn Learning Instructor
This week's agenda:
- Open Source of the Week: the dlt (data load tool) Python library
- New Learning Resources
- Book of the Week: Data Science at the Command Line
Open Source of the Week
The dlt (data load tool) Python library is a relatively new data engineering project from dltHub. The library provides a framework for ingesting data from a source into a destination, for example, from an API into a Postgres database on AWS. Its main goal is to simplify the ELT (extract, load, transform) process.
Here is a simple example from the project README that illustrates the dlt workflow:
import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='duckdb',
    dataset_name='player_data'
)

# Grab some player data from Chess.com API
data = []
for player in ['magnuscarlsen', 'rpragchess']:
    response = requests.get(f'https://api.chess.com/pub/player/{player}')
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
pipeline.run(data, table_name='player')
This workflow includes the following three steps:
- Create a dlt pipeline, defining the destination (DuckDB) and the dataset name
- Extract the player data from the Chess.com API
- Run the pipeline, which extracts, normalizes, and loads the data into the player table
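Once the pipeline has run, you can inspect the loaded data. Here is a minimal sketch, assuming dlt's default DuckDB behavior of writing a <pipeline_name>.duckdb file in the working directory and placing tables under the dataset_name schema:

import duckdb

# Open the database file created by the pipeline above
# (assumption: dlt's default DuckDB destination writes <pipeline_name>.duckdb)
conn = duckdb.connect("chess_pipeline.duckdb")

# Tables land under the dataset_name schema ("player_data")
print(conn.sql("SELECT * FROM player_data.player LIMIT 5").df())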
Here are some of the key features of the library:
The project is released under the Apache 2.0 license. More details are available in the project documentation:
The workshop below is a good place to start with dlt:
New Learning Resources
Here are some new learning resources that I came across this week.
Code Generation & Synthetic Data With Loubna Ben Allal
A great conversation on the AI Stories podcast between Neil Leiser and Loubna Ben Allal, a machine learning engineer at Hugging Face, about building LLMs for code generation:
Build and Deploy a RAG Chatbot
This two-hour tutorial by Ania Kubow and freeCodeCamp walks through building and deploying a RAG chatbot using tools such as JavaScript, LangChain.js, Next.js, Vercel, and OpenAI.
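The tutorial's stack is JavaScript-based, but the core retrieve-then-generate pattern behind RAG is language-agnostic. Here is a minimal Python sketch of that pattern using the OpenAI client; the documents, model names, and top-k value are illustrative assumptions, not taken from the tutorial:

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy in-memory document store (a real app would use a vector database)
docs = [
    "dlt is a Python library for building data ingestion pipelines.",
    "DuckDB is an in-process analytical database.",
    "LangChain provides building blocks for LLM applications.",
]

def embed(texts):
    # Embed a batch of texts with an OpenAI embedding model
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question, k=2):
    # Retrieve: rank documents by cosine similarity to the question
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
    # Generate: answer the question grounded in the retrieved context
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is dlt?"))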
Insurance Premium Prediction
This short tutorial by NeuralNine demonstrates a machine learning application that predicts insurance premiums using Kaggle's US Health Insurance Dataset.
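For context, here is a minimal sketch of the kind of regression workflow such a tutorial covers, using scikit-learn; the column names follow the public US Health Insurance Dataset (insurance.csv), but the model choice and file path are assumptions, not taken from the video:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Load the dataset (file path is an assumption)
df = pd.read_csv("insurance.csv")
X, y = df.drop(columns="charges"), df["charges"]

# One-hot encode the categorical columns, pass numeric columns through
pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["sex", "smoker", "region"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("rf", RandomForestRegressor(random_state=42))])

# Train on a holdout split and report the mean absolute error
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))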
Book of the Week
One of the most important data science tools that most academic institutes do not bother to teach is the command line. Like many other data scientists, I hit the command line "wall" when I started my first data science role. Luckily, I found the Data Science at the Command Line book by Jeroen Janssens, which does a great job of simplifying and explaining how the command line works and its data science use cases. Last week, I finally had the pleasure of meeting Jeroen Janssens in person at the PyData NYC 2024 conference.
The book covers the following topics:
The book is available online, and a hard copy is available to purchase on Amazon:
Have any questions? Please comment below!
See you next Tuesday!
Thanks,
Rami