Presto(PrestoDB) - What it Offers and Where and How it can be used
Bhargav Maddikera
Engineering Leader | Technology Leader | Enterprise Architect | Technical Architect | Big Data | Building Data Intensive & Large Scale Systems | EAI | SaaS | BaaS | SOA | Data Pipeline | Middleware | 0-1 B2B | 0-1 B2C
Given my constant inclination to learn, over the past few months, i have been searching to learn something new. In October 2022, in ECR Chennai, i bumped into my old colleague who is a Chief Architect now. being an Architect myself earlier, we both discussed on various topics. at some point during our discussion, he spoke about Presto and casually said - "You should try PrestoDB and you would like it". so i started reading and eventually did a POC myself with few connectors. i found it good. here i am few months later, sharing about Presto :)
What is Presto?
Presto (or PrestoDB) is a distributed, fast, reliable SQL Query Engine that fetches data from heterogeneous data sources querying large sets((TB, PB) of data and processes in memory. it originated in Facebook as a result of Hive taking a longer time to execute queries of TB, PB magnitude. the problem with Hive was that it would store intermediate results on disk which resulted in a significant I/O overhead on disk. In 2015, Netflix showed PrestoDB was, in fact, 10 times faster than Hive and even faster in some cases.?Presto is written in Java. Presto resembles a massively parallel processing (MPP) system that facilitates the separation between storage and compute and allows you to scale its computing power horizontally by adding more servers
What Presto is Not?
since Presto understands SQL, it is not a general purpose relational database. it is not a replacement for MySQL, Oracle etc though it provides features of a standard database. it was not designed to handle OLTP. its main benefit and value can be seen in Data Warehousing and Analytics , where large volume of data is collected from various sources to produce reports. They fit into the world of OLAP. Big Data encompasses?many things like capturing, storing, searching, visualizing, querying, analysis. Presto fits into the querying ecosystem and is a big data tool
Presto Architecture
Presto Concepts
Coordinator: This is the brain of presto. it receives the query from the client, parses, plans and manages the worker nodes. it keeps track of the activity of worker nodes and coordinates the execution of the query. it fetches the results from the worker nodes and returns the final result to the client.?Additionally, Presto uses a discovery service that is running on the coordinator, where each worker can register and periodically send its heartbeat. This runs on the same HTTP server — including the same port.
Worker: worker nodes are the nodes that execute the tasks and processes data. they fetch data from connectors and exchange intermediate data with each other. HTTP is the communication between Coordinators and Workers, Coordinators and Clients and between Workers
Connector: Presto uses connector to connect to various data sources. in the world of databases, this equates to DB drivers. Each connector need to implement 4 SPI(Service Provider Interface)
Catalog: Catalog contains schemas and references to a data source via a connector
Schema : Schema is a collection of tables. In RDBMS like PostgreSQL and MySQL, this translates to the concept of Schema or a Database.
Table: Collection of data in terms of rows, columns, and the associated data types.
Statement: Statements are defined in the ANSI SQL standard, consisting of clauses, expressions, and predicates.
Query:? The previous SQL statement is parsed into a query and creates a distributed query plan consisting of a series of interconnected stages that contain all of the below elements.
Stage: The execution is structured in a hierarchy of stages that resembles a tree. They model the distributed query plan but are not executed by the worker nodes.
Task: Each stage consists of a series of tasks that are distributed over the Presto worker nodes. Tasks contain one or more parallel drivers.
Split: Tasks operate on splits, which are sections of a larger data set.
Driver: Drivers work with the data and combine operators to produce output that is aggregated by a task and delivered to another task in another stage. Each driver has one input and one output.
Operator: An operator consumes, transforms, and produces data.
Exchange: Exchanges transfer data between Presto nodes for different stages in a query.
Presto Connectors
Overall there are 30+known connectors that Presto supports. The following are few well know connectors that Presto supports.
Event Listener
One of the nice things about Presto is clean abstractions. one such clean abstraction is Event Listeners. Event Listener allows you to write custom functions that listens to events happening inside engine. Event listeners are invoked for following events:
To Create Custom Listeners we would need to do the following:
Query Optimization
PrestoDB uses two optimizers. The Rule-Based Optimizer (RBO) applies filters to remove irrelevant data and uses hash joins to avoid full cartesian joins. This includes strategies such as predicate pushdown, limit pushdown, column pruning, and decorrelation. It also uses Cost-Based Optimizer (CBO). Here it uses statistics of the table (e.g., number of distinct values, number of null values, distributions of column data) to optimize queries and reduce I/O and network overhead. The following are ways to see available statistics and see cost-based analysis of a query
SHOW STATS FOR table_name - Approximated statistics for the named table
SHOW STATS FOR ( SELECT query ) - Approximated statistics for the query result
EXPLAIN SELECT query - Execute statement and show the distributed execution plan with the cost of each operation.
EXPLAIN ANALYZE SELECT - Execute statement and show the distributed execution plan with the cost and duration of each operation.
SQL Language and SQL Statement Syntax
we can use DDL, DML, DQL, DCL, TCL that modern databases support . the following are supported in PrestoDB
It also supports the following data types
Boolean, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, DECIMAL, VARCHAR, CHAR, JSON, DATE, TIME, TIMESTAMP, ARRAY, MAP, IPADDRESS
Also, Presto supports various joins like Cartesian join, Inner join, Left Outer join, Right Outer join, Full Outer join, Semi Join, Anti Join
Presto supports various functions and operators - logical operators, comparison operators, conditional expressions, lambda expressions, string functions, date and time functions, binary functions, aggregate functions, regular expression functions, json functions, mathematical functions, geospatial functions, ip functions, bitwise functions
An Example with JMX Connector
Java Management Extensions (JMX) gives information about the Java Virtual Machine and software running inside JVM. with JMX connector we can query JMX information from all nodes in a Presto cluster. JMX is actually a connector that is figured so that chosen JMX information will be periodically dumped and stored in tables (in the “jmx” catalog)?which can be queried. JMX is useful for debugging and monitoring Presto metrics.?
To configure the JMX connector, create a catalog properties file?etc/catalog/jmx.properties?with the following
connector.name=jmx
领英推荐
JMX connector supports 2 schemas - current and history
To enable periodical dumps, define the following properties:
connector.name=jmx
jmx.dump-tables=java.lang:type=Runtime,com.facebook.presto.execution.scheduler:name=NodeScheduler
jmx.dump-period=10s
jmx.max-entries=86400
we will use jdbc driver to connect to presto - com.facebook.presto.jdbc.PrestoDriver
the following is used to extract the JVM version of every node.
String dbUrl= "jdbc:presto://localhost:9000/catalogName/schemaName";?
???Connection conn = null;
???Statement stmt = null;
???try {
?????Class.forName("com.facebook.presto.jdbc.PrestoDriver");
?????conn = DriverManager.getConnection(dbUrl, "username", "password");
?????stmt = conn.createStatement();
?????String sql = "SELECT node, vmname, vmversion?from jmx.current.java.lang:type=runtime";
?????ResultSet res = stmt.executeQuery(sql);???
?????while (res.next()) {
?????????String node= res.getString("node");
????????String vmname= res.getString("vmname");
?????????String vmversion= res.getString("vmversion");
?????}
??????res.close();
?????stmt.close();
?????conn.close();
???} catch (SQLException se) {
?????se.printStackTrace();
???} catch (Exception e) {
?????e.printStackTrace();
???} finally {????
?????try {
???????if (stmt != null) stmt.close();
?????} catch (SQLException sqlException) {
???????sqlException.printStackTrace();
?????}
?????try {
???????if (conn != null) conn.close();
?????} catch (Exception e) {
???????e.printStackTrace();
?????}
???}
?}
}
if we want to see?all of the available MBeans by running SHOW TABLES, we can use SHOW TABLES FROM jmx.current
if we want to see the open and maximum file descriptor counts for each node then the following is the query -?SELECT openfiledescriptorcount, maxfiledescriptorcount FROM jmx.current.java.lang:type=operatingsystem`
Where Presto can be used?