Presto(PrestoDB) - What it Offers and Where and How it can be used

Given my constant inclination to learn, over the past few months I have been looking for something new to explore. In October 2022, at ECR Chennai, I bumped into an old colleague who is now a Chief Architect. Having been an architect myself earlier, we discussed various topics, and at some point he brought up Presto and casually said, "You should try PrestoDB, you would like it". So I started reading, eventually did a POC myself with a few connectors, and found it good. Here I am, a few months later, sharing what I learned about Presto :)

What is Presto?

Presto (or PrestoDB) is a fast, reliable, distributed SQL query engine that queries large data sets (TB, PB) across heterogeneous data sources and processes them in memory. It originated at Facebook because Hive took too long to execute queries at TB/PB scale: Hive stored intermediate results on disk, which caused significant I/O overhead. In 2015, Netflix showed that PrestoDB was roughly 10 times faster than Hive, and even faster in some cases. Presto is written in Java. It resembles a massively parallel processing (MPP) system: it separates storage from compute and lets you scale computing power horizontally by adding more servers.

What Presto is Not?

Although Presto understands SQL, it is not a general-purpose relational database. It is not a replacement for MySQL, Oracle, etc., even though it provides features of a standard database, and it was not designed to handle OLTP. Its main value is in data warehousing and analytics, where large volumes of data are collected from various sources to produce reports; it fits into the world of OLAP. Big Data encompasses many things: capturing, storing, searching, visualizing, querying, and analyzing data. Presto is a big data tool that fits into the querying part of that ecosystem.

Presto Architecture


Presto Concepts

Coordinator: The brain of Presto. It receives the query from the client; parses, plans, and manages the worker nodes; keeps track of worker activity; and coordinates the execution of the query. It fetches results from the workers and returns the final result to the client. Additionally, Presto runs a discovery service on the coordinator, where each worker registers and periodically sends a heartbeat. This runs on the same HTTP server, including the same port.

Worker: Worker nodes execute tasks and process data. They fetch data from connectors and exchange intermediate data with each other. Communication between coordinator and workers, between coordinator and clients, and among workers is over HTTP.

Connector: Presto uses connectors to connect to various data sources; in the world of databases, this equates to DB drivers. Each connector needs to implement four SPIs (Service Provider Interfaces):

  1. Metadata SPI
  2. Data Location SPI
  3. Data Statistics SPI
  4. Data Source SPI

Catalog: A catalog contains schemas and references a data source via a connector.

Schema: A schema is a collection of tables. In an RDBMS like PostgreSQL or MySQL, this translates to the concept of a schema or a database.

Table: Collection of data in terms of rows, columns, and the associated data types.
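The catalog/schema/table hierarchy above determines how a query addresses a table: catalog.schema.table. A minimal JDBC sketch of this addressing, assuming a hypothetical Hive catalog named hive with a schema sales and a table orders, and a coordinator on localhost:8080 (all of these names are assumptions, not from this article):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FullyQualifiedNames {
    public static void main(String[] args) throws Exception {
        // Default catalog and schema can go in the URL...
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:8080/hive/sales", "user", null);
        try (Statement stmt = conn.createStatement();
             // ...or a table can be spelled out fully: <catalog>.<schema>.<table>
             ResultSet rs = stmt.executeQuery(
                     "SELECT count(*) FROM hive.sales.orders")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
        conn.close();
    }
}
```

The same session can mix catalogs in one statement (for example, joining hive.sales.orders with a MySQL table), which is where the connector abstraction pays off.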

Statement: Statements are defined in the ANSI SQL standard, consisting of clauses, expressions, and predicates.

Query: A SQL statement is parsed into a query, and Presto creates a distributed query plan consisting of a series of interconnected stages that contain all of the elements below.

Stage: The execution is structured in a hierarchy of stages that resembles a tree. They model the distributed query plan but are not executed by the worker nodes.

Task: Each stage consists of a series of tasks that are distributed over the Presto worker nodes. Tasks contain one or more parallel drivers.

Split: Tasks operate on splits, which are sections of a larger data set.

Driver: Drivers work with the data and combine operators to produce output that is aggregated by a task and delivered to another task in another stage. Each driver has one input and one output.

Operator: An operator consumes, transforms, and produces data.

Exchange: Exchanges transfer data between Presto nodes for different stages in a query.

Presto Connectors

Overall, there are 30+ known connectors that Presto supports. The following are a few well-known ones:

  1. BigQuery Connector
  2. Cassandra Connector
  3. Elasticsearch Connector
  4. Hive Connector
  5. JMX Connector
  6. Kafka Connector
  7. MongoDB Connector
  8. Oracle/MySQL/PostgreSQL Connector
  9. Prometheus Connector
  10. Redis Connector
  11. Redshift Connector
  12. Delta Lake Connector

Event Listener

One of the nice things about Presto is its clean abstractions. One such abstraction is the Event Listener, which allows you to write custom functions that listen to events happening inside the engine. Event listeners are invoked for the following events:

  1. Query creation
  2. Query completion
  3. Split completion

To create a custom listener, we need to do the following:

  1. Implement the EventListener and EventListenerFactory interfaces.
  2. Register the plugin and deploy it to Presto.
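As a sketch of step 1, the interfaces live in the presto-spi module (package com.facebook.presto.spi.eventlistener). The class and method names below follow that SPI, but treat the exact signatures as assumptions and check them against the presto-spi version you build against:

```java
import java.util.Map;

import com.facebook.presto.spi.eventlistener.EventListener;
import com.facebook.presto.spi.eventlistener.EventListenerFactory;
import com.facebook.presto.spi.eventlistener.QueryCompletedEvent;
import com.facebook.presto.spi.eventlistener.QueryCreatedEvent;
import com.facebook.presto.spi.eventlistener.SplitCompletedEvent;

// A listener that simply logs each of the three event types
public class LoggingEventListener implements EventListener {
    @Override
    public void queryCreated(QueryCreatedEvent event) {
        System.out.println("query created: " + event.getMetadata().getQueryId());
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {
        System.out.println("query completed: " + event.getMetadata().getQueryId());
    }

    @Override
    public void splitCompleted(SplitCompletedEvent event) {
        System.out.println("split completed for query: " + event.getQueryId());
    }
}

// The factory Presto uses to instantiate the listener from its config
class LoggingEventListenerFactory implements EventListenerFactory {
    @Override
    public String getName() {
        return "logging-event-listener"; // referenced from event-listener.properties
    }

    @Override
    public EventListener create(Map<String, String> config) {
        return new LoggingEventListener();
    }
}
```

Step 2 is then packaging this as a plugin jar (a Plugin implementation that returns the factory), dropping it into the plugin directory, and selecting it with event-listener.name=logging-event-listener in etc/event-listener.properties.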

Query Optimization

PrestoDB uses two optimizers. The Rule-Based Optimizer (RBO) applies filters to remove irrelevant data and uses hash joins to avoid full Cartesian joins; its strategies include predicate pushdown, limit pushdown, column pruning, and decorrelation. The Cost-Based Optimizer (CBO) uses table statistics (e.g., number of distinct values, number of null values, distribution of column data) to optimize queries and reduce I/O and network overhead. The following commands show the available statistics and the cost-based analysis of a query:

SHOW STATS FOR table_name - Approximated statistics for the named table

SHOW STATS FOR ( SELECT query ) - Approximated statistics for the query result

EXPLAIN SELECT query - Show the distributed execution plan for the statement, with the estimated cost of each operation, without executing it.

EXPLAIN ANALYZE SELECT query - Execute the statement and show the distributed execution plan with the actual cost and duration of each operation.
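These commands are ordinary SQL statements, so they can be run from any client. A minimal JDBC sketch, assuming a hypothetical table hive.sales.orders and a coordinator on localhost:8080 (both assumptions; the SHOW STATS output columns named below follow the Presto documentation):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowStatsExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:8080/hive/sales", "user", null);
        try (Statement stmt = conn.createStatement();
             // One row per column plus a summary row; includes distinct-value
             // count, null fraction, and row count where statistics exist
             ResultSet rs = stmt.executeQuery("SHOW STATS FOR hive.sales.orders")) {
            while (rs.next()) {
                System.out.println(rs.getString("column_name")
                        + " ndv=" + rs.getDouble("distinct_values_count")
                        + " nulls=" + rs.getDouble("nulls_fraction"));
            }
        }
        conn.close();
    }
}
```

If SHOW STATS reports no statistics for a table, the CBO has nothing to work with; running ANALYZE (where the connector supports it) populates them.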

SQL Language and SQL Statement Syntax

We can use the DDL, DML, DQL, DCL, and TCL statements that modern databases support. The following are supported in PrestoDB:

  1. DDL - Create, Alter, Drop, Truncate
  2. DML - Insert, Delete, Call
  3. TCL - Commit, Rollback, Start Transaction
  4. DQL - Select
  5. DCL - Grant, Revoke

It also supports the following data types

BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, DECIMAL, VARCHAR, CHAR, JSON, DATE, TIME, TIMESTAMP, ARRAY, MAP, IPADDRESS

Also, Presto supports various joins like Cartesian join, Inner join, Left Outer join, Right Outer join, Full Outer join, Semi Join, Anti Join

Presto supports various functions and operators - logical operators, comparison operators, conditional expressions, lambda expressions, string functions, date and time functions, binary functions, aggregate functions, regular expression functions, json functions, mathematical functions, geospatial functions, ip functions, bitwise functions

An Example with JMX Connector

Java Management Extensions (JMX) provides information about the Java Virtual Machine and the software running inside it. With the JMX connector we can query JMX information from all nodes in a Presto cluster. The connector can also be configured so that chosen JMX information is periodically dumped and stored in tables (in the "jmx" catalog), which can then be queried. The JMX connector is useful for debugging and monitoring Presto metrics.

To configure the JMX connector, create a catalog properties file etc/catalog/jmx.properties with the following:

connector.name=jmx

The JMX connector supports two schemas: current and history.

To enable periodical dumps, define the following properties:

connector.name=jmx

jmx.dump-tables=java.lang:type=Runtime,com.facebook.presto.execution.scheduler:name=NodeScheduler

jmx.dump-period=10s

jmx.max-entries=86400

We will use the JDBC driver (com.facebook.presto.jdbc.PrestoDriver) to connect to Presto.

The following is used to extract the JVM version of every node. Note that table names containing special characters, such as java.lang:type=runtime, must be double-quoted in the SQL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

String dbUrl = "jdbc:presto://localhost:9000/catalogName/schemaName";
Connection conn = null;
Statement stmt = null;
try {
    // Load the Presto JDBC driver
    Class.forName("com.facebook.presto.jdbc.PrestoDriver");
    conn = DriverManager.getConnection(dbUrl, "username", "password");
    stmt = conn.createStatement();
    // Quote the table name because it contains ':' and '='
    String sql = "SELECT node, vmname, vmversion FROM jmx.current.\"java.lang:type=runtime\"";
    ResultSet res = stmt.executeQuery(sql);
    while (res.next()) {
        String node = res.getString("node");
        String vmname = res.getString("vmname");
        String vmversion = res.getString("vmversion");
    }
    res.close();
} catch (SQLException se) {
    se.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (stmt != null) stmt.close();
    } catch (SQLException sqlException) {
        sqlException.printStackTrace();
    }
    try {
        if (conn != null) conn.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

To see all of the available MBeans, run SHOW TABLES FROM jmx.current.

To see the open and maximum file descriptor counts for each node, use the following query: SELECT openfiledescriptorcount, maxfiledescriptorcount FROM jmx.current."java.lang:type=operatingsystem"
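The history schema can be queried in the same way once jmx.dump-tables is configured as above; each dumped table gains a timestamp column holding the sample time. A small JDBC sketch (assuming the connection details from the earlier example and that java.lang:type=runtime was listed in jmx.dump-tables; uptime is a standard attribute of the Runtime MBean):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JmxHistoryExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:9000/jmx/history", "username", "password");
        try (Statement stmt = conn.createStatement();
             // "timestamp" is added by the connector to every dumped sample,
             // so uptime can be tracked over time per node
             ResultSet rs = stmt.executeQuery(
                     "SELECT \"timestamp\", node, uptime "
                     + "FROM jmx.history.\"java.lang:type=runtime\"")) {
            while (rs.next()) {
                System.out.println(rs.getTimestamp("timestamp")
                        + " " + rs.getString("node")
                        + " up " + rs.getLong("uptime") + " ms");
            }
        }
        conn.close();
    }
}
```

With jmx.dump-period=10s and jmx.max-entries=86400 as configured above, this retains roughly ten days of samples per table.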

Where can Presto be used?

  1. In data warehousing, where TBs and PBs of data are fetched from multiple sources to query and process large datasets
  2. To run ad hoc queries against various sources, through multiple connectors, anytime we want and wherever the data resides
  3. For generating reports and dashboards for analytics and business intelligence, where data is collected from various sources in multiple formats
  4. To aggregate TBs of data from multiple data sources and run ETL queries against that data; instead of using legacy batch-processing systems, we can use Presto to run efficient, high-throughput queries
  5. To query data on a data lake without the need for transformation; various connectors let us query any type of data in a data lake, both structured and unstructured
