The dlt Project, Data Science at the Command Line
Rami Krispin
Senior Manager - Data Science and Engineering at Apple | Docker Captain | LinkedIn Learning Instructor
This week's agenda:
- Open Source of the Week: the dlt (data load tool) Python library
- New Learning Resources
- Book of the Week: Data Science at the Command Line
Open Source of the Week
The dlt (data load tool) Python library is a relatively new data engineering project from dltHub. The library provides a framework for ingesting data from a source into a destination, for example, from an API into a Postgres database on AWS. Its main goal is to simplify the ELT (extract, load, transform) process.
Here is a simple example from the project README that illustrates the dlt workflow:
import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='duckdb',
    dataset_name='player_data'
)

# Grab some player data from Chess.com API
data = []
for player in ['magnuscarlsen', 'rpragchess']:
    response = requests.get(f'https://api.chess.com/pub/player/{player}')
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
pipeline.run(data, table_name='player')
This workflow includes the following three steps:
- Create a dlt pipeline, defining the destination (DuckDB) and the dataset name
- Extract the player data from the Chess.com API
- Run the pipeline, which extracts, normalizes, and loads the data into the player table
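Once the pipeline has run, you can inspect the loaded data. Here is a minimal sketch, assuming dlt's default DuckDB behavior of writing a <pipeline_name>.duckdb file in the working directory and placing tables under the dataset_name schema:

import duckdb

# Open the database file created by the pipeline above
# (assumption: dlt's default DuckDB destination writes <pipeline_name>.duckdb)
conn = duckdb.connect("chess_pipeline.duckdb")

# Tables land under the dataset_name schema ("player_data")
print(conn.sql("SELECT * FROM player_data.player LIMIT 5").df())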
Here are some of the key features of the library:
The project is released under the Apache 2.0 license. More details are available in the project documentation:
The workshop below is a good place to start with dlt:
New Learning Resources
Here are some new learning resources that I came across this week.
Code Generation & Synthetic Data With Loubna Ben Allal
A great conversation on the AI Stories podcast between Neil Leiser and Loubna Ben Allal, a machine learning engineer at Hugging Face, about building LLMs for code generation:
Build and Deploy a RAG Chatbot
This two-hour tutorial by Ania Kubow and freeCodeCamp walks through building and deploying a RAG chatbot using tools such as JavaScript, LangChain.js, Next.js, Vercel, and OpenAI.
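The tutorial's stack is JavaScript-based, but the core retrieve-then-generate pattern behind RAG is language-agnostic. Here is a minimal Python sketch of that pattern using the OpenAI client; the documents, model names, and top-k value are illustrative assumptions, not taken from the tutorial:

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy in-memory document store (a real app would use a vector database)
docs = [
    "dlt is a Python library for building data ingestion pipelines.",
    "DuckDB is an in-process analytical database.",
    "LangChain provides building blocks for LLM applications.",
]

def embed(texts):
    # Embed a batch of texts with an OpenAI embedding model
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question, k=2):
    # Retrieve: rank documents by cosine similarity to the question
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = "\n".join(docs[i] for i in np.argsort(sims)[::-1][:k])
    # Generate: answer the question grounded in the retrieved context
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("What is dlt?"))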
Insurance Premium Prediction
This short tutorial by NeuralNine demonstrates a machine learning application that predicts insurance premiums using Kaggle's US Health Insurance Dataset.
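For context, here is a minimal sketch of the kind of regression workflow such a tutorial covers, using scikit-learn; the column names follow the public US Health Insurance Dataset (insurance.csv), but the model choice and file path are assumptions, not taken from the video:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Load the dataset (file path is an assumption)
df = pd.read_csv("insurance.csv")
X, y = df.drop(columns="charges"), df["charges"]

# One-hot encode the categorical columns, pass numeric columns through
pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["sex", "smoker", "region"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("rf", RandomForestRegressor(random_state=42))])

# Train on a holdout split and report the mean absolute error
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))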
Book of the Week
One of the most important data science tools that most academic institutes do not bother to teach is the command line. Like many other data scientists, I hit the command line "wall" when I started my first data science role. Luckily, I found the Data Science at the Command Line book by Jeroen Janssens, which does a great job of simplifying and explaining how the command line works and its data science use cases. Last week, I finally had the pleasure of meeting Jeroen Janssens in person at the PyData NYC 2024 conference.
The book covers the following topics:
The book is available online, and a hard copy is available to purchase on Amazon:
Have any questions? Please comment below!
See you next Tuesday!
Thanks,
Rami