DuckDB Access Over HTTPS

Let's do a deeper dive with an example from Hugging Face.

The Hugging Face Hub is dedicated to providing open access to datasets for everyone and giving users the tools to explore and understand them. That's right: open access to the datasets.


Note the /parquet in the dataset URLs used in this article. We will discuss this in a bit.

Load remote data over HTTPS

import duckdb

url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet"

# In-memory DuckDB connection
con = duckdb.connect()
# The httpfs extension lets DuckDB read remote files over HTTP(S)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
        

Above, we set up a URL pointing to a public Hugging Face file, instantiated DuckDB, and set up HTTPS connectivity by installing and loading the httpfs extension.

About the Parquet URL:

Parquet is a columnar format, which makes data easier to load, store and analyze. This is important, especially when working with large datasets, and we're seeing it more and more in the LLM age. Datasets Server converts a dataset and publishes it on the Hub in Parquet format. You can retrieve the URL of the Parquet file using the /parquet endpoint.

Example URL:

https://datasets-server.huggingface.co/parquet        
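
To make the /parquet endpoint a bit more concrete, here is a minimal sketch of how one might build the request URL and pull file URLs out of the response. The `dataset` query parameter and the `parquet_files` / `split` / `url` response fields are assumptions about the endpoint, not something shown above, and the helper names are hypothetical. The payload is hand-written so the snippet runs offline.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Endpoint taken from the article.
PARQUET_ENDPOINT = "https://datasets-server.huggingface.co/parquet"


def parquet_endpoint_url(dataset: str) -> str:
    """Build the request URL. The `dataset` query parameter is an assumption."""
    query = urlencode({"dataset": dataset})
    return f"{PARQUET_ENDPOINT}?{query}"


def extract_parquet_urls(payload: dict, split: Optional[str] = None) -> list:
    """Return file URLs from a payload assumed to look like
    {"parquet_files": [{"split": "...", "url": "..."}, ...]}."""
    files = payload.get("parquet_files", [])
    return [f["url"] for f in files if split is None or f.get("split") == split]


# Hand-written sample payload (NOT real server output) to exercise the parser offline.
sample_payload = json.loads("""
{
  "parquet_files": [
    {"split": "train", "url": "https://example.com/train-00000.parquet"},
    {"split": "validation", "url": "https://example.com/validation-00000.parquet"}
  ]
}
""")

request_url = parquet_endpoint_url("blog_authorship_corpus")
train_urls = extract_parquet_urls(sample_payload, split="train")
print(request_url)
print(train_urls)
```

Against the live service, the same parser could be fed the JSON returned by requesting `request_url`, provided the response really has the assumed shape.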

You can find some of the datasets used to train large language models (LLMs) like Falcon, Dolly and MPT on Hugging Face. Go on and play around.

Now let's run SQL queries over it:

# Average blog length per horoscope sign, top 5 by length
con.sql(f"""
    SELECT horoscope,
        count(*),
        AVG(LENGTH(text)) AS avg_blog_length
    FROM '{url}'
    GROUP BY horoscope
    ORDER BY avg_blog_length DESC
    LIMIT 5
""").show()
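
The file name above ends in 00000-of-00002, which suggests the training split comes in two shards. As a rough sketch, the same aggregate could be run over both by passing a list of URLs to DuckDB's `read_parquet`. The second shard's URL is inferred from the naming pattern and is an assumption, and `build_horoscope_query` and `run_query` are hypothetical helper names.

```python
BASE = (
    "https://huggingface.co/datasets/blog_authorship_corpus/resolve/"
    "refs%2Fconvert%2Fparquet/blog_authorship_corpus/"
)
# Shard 0 is the URL from the article; shard 1 is assumed from the naming pattern.
urls = [BASE + f"blog_authorship_corpus-train-{i:05d}-of-00002.parquet" for i in range(2)]


def build_horoscope_query(file_urls, limit=5):
    """Build the aggregate query over one or more Parquet URLs."""
    # Quote each URL as a SQL string literal, doubling any single quotes.
    quoted = ", ".join("'" + u.replace("'", "''") + "'" for u in file_urls)
    return (
        "SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length "
        f"FROM read_parquet([{quoted}]) "
        "GROUP BY horoscope "
        "ORDER BY avg_blog_length DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(file_urls):
    """Execute the query remotely; needs DuckDB and network access."""
    import duckdb  # imported lazily so the query builder works without DuckDB installed

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    return con.sql(build_horoscope_query(file_urls)).fetchall()


query = build_horoscope_query(urls)
print(query)
```

Calling `run_query(urls)` would send the query over HTTPS. It is left uncalled here so the snippet has no network dependency.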

        

Cool, eh? We can now run SQL over a URL! It never occurred to me that we would come to see this. Impressive!

Credits shared with Steven Liu of Hugging Face fame!

DuckDB from AWS

Duck db with PyArrow

Polars

Apache Spark vs Polars


Author: Remesh Govind N. M