Presto(PrestoDB) - What it Offers and Where and How it can be used

Given my constant inclination to learn, over the past few months I have been looking for something new to explore. In October 2022, at ECR Chennai, I bumped into an old colleague who is now a Chief Architect. Having been an architect myself earlier, we discussed various topics, and at some point he brought up Presto and casually said, "You should try PrestoDB, you would like it". So I started reading, eventually did a POC myself with a few connectors, and found it good. Here I am, a few months later, sharing what I learned about Presto :)

What is Presto?

Presto (or PrestoDB) is a fast, reliable, distributed SQL query engine that queries large data sets (TB, PB) across heterogeneous data sources and processes them in memory. It originated at Facebook because Hive took too long to execute queries at TB/PB scale: Hive stored intermediate results on disk, which caused significant I/O overhead. In 2015, Netflix showed that PrestoDB was roughly 10 times faster than Hive, and even faster in some cases. Presto is written in Java. It resembles a massively parallel processing (MPP) system: it separates storage from compute and lets you scale computing power horizontally by adding more servers.

What Presto is Not?

Although Presto understands SQL, it is not a general-purpose relational database. It is not a replacement for MySQL, Oracle, etc., even though it provides features of a standard database, and it was not designed to handle OLTP. Its main value is in data warehousing and analytics, where large volumes of data are collected from various sources to produce reports; it fits into the world of OLAP. Big Data encompasses many things: capturing, storing, searching, visualizing, querying, and analyzing data. Presto is a big data tool that fits into the querying part of that ecosystem.

Presto Architecture


Presto Concepts

Coordinator: The brain of Presto. It receives the query from the client; parses, plans, and manages the worker nodes; keeps track of worker activity; and coordinates the execution of the query. It fetches results from the workers and returns the final result to the client. Additionally, Presto runs a discovery service on the coordinator, where each worker registers and periodically sends a heartbeat. This runs on the same HTTP server, including the same port.

Worker: Worker nodes execute tasks and process data. They fetch data from connectors and exchange intermediate data with each other. Communication between coordinator and workers, between coordinator and clients, and among workers is over HTTP.

Connector: Presto uses connectors to connect to various data sources; in the world of databases, this equates to DB drivers. Each connector needs to implement four SPIs (Service Provider Interfaces):

  1. Metadata SPI
  2. Data Location SPI
  3. Data Statistics SPI
  4. Data Source SPI

Catalog: A catalog contains schemas and references a data source via a connector.

Schema: A schema is a collection of tables. In an RDBMS like PostgreSQL or MySQL, this translates to the concept of a schema or a database.

Table: Collection of data in terms of rows, columns, and the associated data types.
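The catalog/schema/table hierarchy above determines how a query addresses a table: catalog.schema.table. A minimal JDBC sketch of this addressing, assuming a hypothetical Hive catalog named hive with a schema sales and a table orders, and a coordinator on localhost:8080 (all of these names are assumptions, not from this article):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FullyQualifiedNames {
    public static void main(String[] args) throws Exception {
        // Default catalog and schema can go in the URL...
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:8080/hive/sales", "user", null);
        try (Statement stmt = conn.createStatement();
             // ...or a table can be spelled out fully: <catalog>.<schema>.<table>
             ResultSet rs = stmt.executeQuery(
                     "SELECT count(*) FROM hive.sales.orders")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
        conn.close();
    }
}
```

The same session can mix catalogs in one statement (for example, joining hive.sales.orders with a MySQL table), which is where the connector abstraction pays off.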

Statement: Statements are defined in the ANSI SQL standard, consisting of clauses, expressions, and predicates.

Query: A SQL statement is parsed into a query, and Presto creates a distributed query plan consisting of a series of interconnected stages that contain all of the elements below.

Stage: The execution is structured in a hierarchy of stages that resembles a tree. They model the distributed query plan but are not executed by the worker nodes.

Task: Each stage consists of a series of tasks that are distributed over the Presto worker nodes. Tasks contain one or more parallel drivers.

Split: Tasks operate on splits, which are sections of a larger data set.

Driver: Drivers work with the data and combine operators to produce output that is aggregated by a task and delivered to another task in another stage. Each driver has one input and one output.

Operator: An operator consumes, transforms, and produces data.

Exchange: Exchanges transfer data between Presto nodes for different stages in a query.

Presto Connectors

Overall, there are 30+ known connectors that Presto supports. The following are a few well-known ones:

  1. BigQuery Connector
  2. Cassandra Connector
  3. Elasticsearch Connector
  4. Hive Connector
  5. JMX Connector
  6. Kafka Connector
  7. MongoDB Connector
  8. Oracle/MySQL/PostgreSQL Connector
  9. Prometheus Connector
  10. Redis Connector
  11. Redshift Connector
  12. Delta Lake Connector

Event Listener

One of the nice things about Presto is its clean abstractions. One such abstraction is the Event Listener, which allows you to write custom functions that listen to events happening inside the engine. Event listeners are invoked for the following events:

  1. Query creation
  2. Query completion
  3. Split completion

To create a custom listener, we need to do the following:

  1. Implement the EventListener and EventListenerFactory interfaces.
  2. Register the plugin and deploy it to Presto.
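As a sketch of step 1, the interfaces live in the presto-spi module (package com.facebook.presto.spi.eventlistener). The class and method names below follow that SPI, but treat the exact signatures as assumptions and check them against the presto-spi version you build against:

```java
import java.util.Map;

import com.facebook.presto.spi.eventlistener.EventListener;
import com.facebook.presto.spi.eventlistener.EventListenerFactory;
import com.facebook.presto.spi.eventlistener.QueryCompletedEvent;
import com.facebook.presto.spi.eventlistener.QueryCreatedEvent;
import com.facebook.presto.spi.eventlistener.SplitCompletedEvent;

// A listener that simply logs each of the three event types
public class LoggingEventListener implements EventListener {
    @Override
    public void queryCreated(QueryCreatedEvent event) {
        System.out.println("query created: " + event.getMetadata().getQueryId());
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event) {
        System.out.println("query completed: " + event.getMetadata().getQueryId());
    }

    @Override
    public void splitCompleted(SplitCompletedEvent event) {
        System.out.println("split completed for query: " + event.getQueryId());
    }
}

// The factory Presto uses to instantiate the listener from its config
class LoggingEventListenerFactory implements EventListenerFactory {
    @Override
    public String getName() {
        return "logging-event-listener"; // referenced from event-listener.properties
    }

    @Override
    public EventListener create(Map<String, String> config) {
        return new LoggingEventListener();
    }
}
```

Step 2 is then packaging this as a plugin jar (a Plugin implementation that returns the factory), dropping it into the plugin directory, and selecting it with event-listener.name=logging-event-listener in etc/event-listener.properties.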

Query Optimization

PrestoDB uses two optimizers. The Rule-Based Optimizer (RBO) applies filters to remove irrelevant data and uses hash joins to avoid full Cartesian joins; its strategies include predicate pushdown, limit pushdown, column pruning, and decorrelation. The Cost-Based Optimizer (CBO) uses table statistics (e.g., number of distinct values, number of null values, distribution of column data) to optimize queries and reduce I/O and network overhead. The following commands show the available statistics and the cost-based analysis of a query:

SHOW STATS FOR table_name - Approximated statistics for the named table

SHOW STATS FOR ( SELECT query ) - Approximated statistics for the query result

EXPLAIN SELECT query - Show the distributed execution plan for the statement, with the estimated cost of each operation, without executing it.

EXPLAIN ANALYZE SELECT query - Execute the statement and show the distributed execution plan with the actual cost and duration of each operation.
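These commands are ordinary SQL statements, so they can be run from any client. A minimal JDBC sketch, assuming a hypothetical table hive.sales.orders and a coordinator on localhost:8080 (both assumptions; the SHOW STATS output columns named below follow the Presto documentation):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowStatsExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:8080/hive/sales", "user", null);
        try (Statement stmt = conn.createStatement();
             // One row per column plus a summary row; includes distinct-value
             // count, null fraction, and row count where statistics exist
             ResultSet rs = stmt.executeQuery("SHOW STATS FOR hive.sales.orders")) {
            while (rs.next()) {
                System.out.println(rs.getString("column_name")
                        + " ndv=" + rs.getDouble("distinct_values_count")
                        + " nulls=" + rs.getDouble("nulls_fraction"));
            }
        }
        conn.close();
    }
}
```

If SHOW STATS reports no statistics for a table, the CBO has nothing to work with; running ANALYZE (where the connector supports it) populates them.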

SQL Language and SQL Statement Syntax

We can use the DDL, DML, DQL, DCL, and TCL statements that modern databases support. The following are supported in PrestoDB:

  1. DDL - Create, Alter, Drop, Truncate
  2. DML - Insert, Delete, Call
  3. TCL - Commit, Rollback, Start Transaction
  4. DQL - Select
  5. DCL - Grant, Revoke

It also supports the following data types

BOOLEAN, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, DECIMAL, VARCHAR, CHAR, JSON, DATE, TIME, TIMESTAMP, ARRAY, MAP, IPADDRESS

Also, Presto supports various joins like Cartesian join, Inner join, Left Outer join, Right Outer join, Full Outer join, Semi Join, Anti Join

Presto supports various functions and operators - logical operators, comparison operators, conditional expressions, lambda expressions, string functions, date and time functions, binary functions, aggregate functions, regular expression functions, json functions, mathematical functions, geospatial functions, ip functions, bitwise functions

An Example with JMX Connector

Java Management Extensions (JMX) provides information about the Java Virtual Machine and the software running inside it. With the JMX connector we can query JMX information from all nodes in a Presto cluster. The connector can also be configured so that chosen JMX information is periodically dumped and stored in tables (in the "jmx" catalog), which can then be queried. The JMX connector is useful for debugging and monitoring Presto metrics.

To configure the JMX connector, create a catalog properties file etc/catalog/jmx.properties with the following:

connector.name=jmx

The JMX connector supports two schemas: current and history.

To enable periodical dumps, define the following properties:

connector.name=jmx

jmx.dump-tables=java.lang:type=Runtime,com.facebook.presto.execution.scheduler:name=NodeScheduler

jmx.dump-period=10s

jmx.max-entries=86400

We will use the JDBC driver (com.facebook.presto.jdbc.PrestoDriver) to connect to Presto.

The following is used to extract the JVM version of every node. Note that table names containing special characters, such as java.lang:type=runtime, must be double-quoted in the SQL:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

String dbUrl = "jdbc:presto://localhost:9000/catalogName/schemaName";
Connection conn = null;
Statement stmt = null;
try {
    // Load the Presto JDBC driver
    Class.forName("com.facebook.presto.jdbc.PrestoDriver");
    conn = DriverManager.getConnection(dbUrl, "username", "password");
    stmt = conn.createStatement();
    // Quote the table name because it contains ':' and '='
    String sql = "SELECT node, vmname, vmversion FROM jmx.current.\"java.lang:type=runtime\"";
    ResultSet res = stmt.executeQuery(sql);
    while (res.next()) {
        String node = res.getString("node");
        String vmname = res.getString("vmname");
        String vmversion = res.getString("vmversion");
    }
    res.close();
} catch (SQLException se) {
    se.printStackTrace();
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (stmt != null) stmt.close();
    } catch (SQLException sqlException) {
        sqlException.printStackTrace();
    }
    try {
        if (conn != null) conn.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

To see all of the available MBeans, run SHOW TABLES FROM jmx.current.

To see the open and maximum file descriptor counts for each node, use the following query: SELECT openfiledescriptorcount, maxfiledescriptorcount FROM jmx.current."java.lang:type=operatingsystem"
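The history schema can be queried in the same way once jmx.dump-tables is configured as above; each dumped table gains a timestamp column holding the sample time. A small JDBC sketch (assuming the connection details from the earlier example and that java.lang:type=runtime was listed in jmx.dump-tables; uptime is a standard attribute of the Runtime MBean):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JmxHistoryExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:presto://localhost:9000/jmx/history", "username", "password");
        try (Statement stmt = conn.createStatement();
             // "timestamp" is added by the connector to every dumped sample,
             // so uptime can be tracked over time per node
             ResultSet rs = stmt.executeQuery(
                     "SELECT \"timestamp\", node, uptime "
                     + "FROM jmx.history.\"java.lang:type=runtime\"")) {
            while (rs.next()) {
                System.out.println(rs.getTimestamp("timestamp")
                        + " " + rs.getString("node")
                        + " up " + rs.getLong("uptime") + " ms");
            }
        }
        conn.close();
    }
}
```

With jmx.dump-period=10s and jmx.max-entries=86400 as configured above, this retains roughly ten days of samples per table.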

Where can Presto be used?

  1. In data warehousing, where TBs and PBs of data are fetched from multiple sources to query and process large datasets
  2. To run ad hoc queries against various sources, through multiple connectors, anytime we want and wherever the data resides
  3. For generating reports and dashboards for analytics and business intelligence, where data is collected from various sources in multiple formats
  4. To aggregate TBs of data from multiple data sources and run ETL queries against that data; instead of using legacy batch-processing systems, we can use Presto to run efficient, high-throughput queries
  5. To query data on a data lake without the need for transformation; various connectors let us query any type of data in a data lake, both structured and unstructured
