The dlt Project, Data Science at the Command Line

This week's agenda:

  • Open Source of the Week - the dlt project
  • Learning resources - code generation with LLMs at Hugging Face, building a RAG application with JS, insurance premium prediction
  • Book of the week - Data Science at the Command Line by Jeroen Janssens

I am also on Bluesky, Telegram, Instagram, and WhatsApp

Open Source of the Week

The dlt (data load tool) Python library is a relatively new project for data engineering applications from dltHub. The library provides a framework for data ingestion from a source to a destination, for example, from an API to a Postgres database on AWS. The library's main goal is to simplify the ELT (extract, load, transform) process.

Here is a simple example from the project README that illustrates the dlt workflow:

import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='duckdb',
    dataset_name='player_data'
)

# Grab some player data from Chess.com API
data = []
for player in ['magnuscarlsen', 'rpragchess']:
    response = requests.get(f'https://api.chess.com/pub/player/{player}')
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
pipeline.run(data, table_name='player')        

This workflow includes the following three steps:

  • Defining the pipeline
  • Sending a GET request to an API to pull the data
  • Processing the data and loading it into a database, in this case DuckDB
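To make the load-then-transform order of ELT more concrete, here is a minimal sketch that uses only Python's standard library instead of dlt. The records, table names, and column names are hypothetical, and SQLite stands in for the destination database. It illustrates the idea only and is not how dlt works internally.

```python
import sqlite3

# Extract: hypothetical records standing in for an API response.
# Values arrive as strings, as raw payloads often do.
records = [
    {"username": "player_a", "followers": "120"},
    {"username": "player_b", "followers": "45"},
]

# An in-memory SQLite database stands in for the destination
conn = sqlite3.connect(":memory:")

# Load: store the raw values as-is, without cleaning them first
conn.execute("CREATE TABLE raw_player (username TEXT, followers TEXT)")
conn.executemany(
    "INSERT INTO raw_player (username, followers) VALUES (?, ?)",
    [(r["username"], r["followers"]) for r in records],
)

# Transform: runs inside the database, after the load step
conn.execute(
    """
    CREATE TABLE player AS
    SELECT username, CAST(followers AS INTEGER) AS followers
    FROM raw_player
    """
)

rows = conn.execute(
    "SELECT username, followers FROM player ORDER BY followers DESC"
).fetchall()
print(rows)
```

The point of the sketch is the ordering: the raw data lands in the database first, and the cleanup happens afterward with SQL, which is what separates ELT from ETL.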

Here are some of the key features of the library:

  • Schema definition and inspection
  • Data normalization and verification
  • Deployment anywhere Python runs (e.g., Airflow and serverless functions)
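To give a feel for what normalization and schema inference mean in practice, here is a rough, standard-library-only sketch. It flattens a nested record into flat column names and guesses a type for each column. The `flatten` and `infer_schema` helpers and the sample record are hypothetical, and the double-underscore separator is my assumption about how nested fields are typically named. This is an illustration of the concept, not dlt's actual implementation.

```python
from typing import Any


def flatten(record: dict, parent: str = "", sep: str = "__") -> dict:
    """Flatten nested dictionaries into a single level of column names."""
    row: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent name along
            row.update(flatten(value, column, sep))
        else:
            row[column] = value
    return row


def infer_schema(rows: list) -> dict:
    """Map each column to the type name of its first non-null value."""
    schema: dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            if value is not None and column not in schema:
                schema[column] = type(value).__name__
    return schema


# Hypothetical record, loosely shaped like a player profile
player = {
    "username": "player_a",
    "followers": 100,
    "streaming": {"is_streamer": True, "platform": None},
}

row = flatten(player)
schema = infer_schema([row])
print(row)
print(schema)
```

Columns that only contain null values are left out of the inferred schema here, since there is nothing to infer a type from.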

The project is released under the Apache 2.0 license. More details are available in the project documentation:

The workshop below is a good place to start with dlt:


New Learning Resources

Here are some new learning resources that I came across this week.

Code Generation & Synthetic Data With Loubna Ben Allal

A great conversation on the AI Stories podcast between Neil Leiser and Loubna Ben Allal, a machine learning engineer at Hugging Face, about building LLMs for code generation:

Build and Deploy a RAG Chatbot

This two-hour tutorial by Ania Kubow and freeCodeCamp focuses on the steps of building and deploying a RAG chatbot using tools such as JavaScript, LangChain.js, Next.js, Vercel, and OpenAI.

Insurance Premium Prediction

This short tutorial by NeuralNine demonstrates a machine learning application that predicts insurance premiums using Kaggle's US Health Insurance Dataset.



Book of the Week

One of the most important data science tools that most academic institutions do not bother to teach is the command line. Like many other data scientists, I hit the command line "wall" when I started my first data science role. Luckily, I found the Data Science at the Command Line book by Jeroen Janssens, which does a great job of simplifying and explaining how the command line works and its data science use cases. Last week, I finally had the pleasure of meeting Jeroen Janssens in person at the PyData NYC 2024 conference.

The book covers the following topics:

  • Core CLI commands
  • Data processing
  • Working with bash and shell scripts
  • Parallel processing with multiple cores
  • Data modeling

Data Science at the Command Line; Image credit: book website

The book is available online, and a hard copy is available to purchase on Amazon:


Have any questions? Please comment below!

See you next Tuesday!

Thanks,

Rami

Join my Data Science Channel for daily updates




