登录查看更多内容

Data Ingestion with Spark Scala and SQL through JDBC

Steven Murhula

Machine Learning l Data Engineer l Scala l Python l Data Analysis l Big Data Development l SQL I AWS l DevOps l ETL GCP & Azure

发布日期: 2020年1月11日

Introduction :

One of data science's raison d'être is the difficulty of manipulating large datasets. Much of the data of interest to a company or research group cannot fit conveniently in a single computer's RAM. Storing the data in a way that is easy to query is therefore a complex problem.

Relational databases have been successful at solving the data storage problem. Originally proposed in 1970 (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf), the overwhelming majority of databases in active use today are still relational. In that time, the price of RAM per megabyte has decreased by a factor of a hundred million. Similarly, hard drive capacity has increased from tens or hundreds of megabytes to terabytes. It is remarkable that, despite this exponential growth in data storage capacity, the relational model has remained dominant.

Virtually all relational databases are described and queried with variants of SQL (Structured Query Language). With the advent of distributed computing, the position of SQL databases as the de facto data storage standard is being challenged by other types of databases, commonly grouped under the umbrella term NoSQL. Many NoSQL databases are more partition-tolerant than SQL databases: they can be split into several parts residing on different computers. Even though I still expect that NoSQL databases will become increasingly popular, SQL databases are likely to remain prevalent as a data persistence mechanism; hence, a significant portion of this book is devoted to interacting with SQL from Scala.

Interacting with JDBC

JDBC is an API for connecting to SQL databases in Java. It remains the simplest way

of connecting to SQL databases from Scala. Furthermore, the majority of higher-level

abstractions for interacting with databases still use JDBC as a backend. JDBC is not a library in itself. Rather, it exposes a set of interfaces to interact with databases. Relational database vendors then provide specific implementations of these interfaces.

Let's start by creating a build.sbt file. We will declare a dependency on the MySQL JDBC connector:

scalaVersion := "2.11.7"

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.36"

First steps with JDBC

Let's start by connecting to JDBC from the command line. To follow with the examples, you will need access to a running MySQL server. If you added the MySQL connector to the list of dependencies, open a Scala console by typing the following command:

$ sbt console

Let's import JDBC:

scala> import java.sql._

import java.sql._

We then need to tell JDBC to use a specific connector. This is normally done using reflection, loading the driver at runtime:

scala> Class.forName("com.mysql.jdbc.Driver")

Class[_] = class com.mysql.jdbc.Driver

This loads the appropriate driver into the namespace at runtime. If this seems somewhat magical to you, it's probably not worth worrying about exactly how this works. This is the only example of reflection that we will consider in this book, and it is not particularly idiomatic Scala.

Connecting to a database server

Connecting to a database server Having specified the SQL connector, we can now connect to a database. Let's assume that we have a database called test on host 127.0.0.1, listening on port 3306. We create a connection as follows:

scala> val connection = DriverManager.getConnection(

"jdbc:mysql://127.0.0.1:3306/test",

"root", // username when connecting

"" // password

)

java.sql.Connection = com.mysql.jdbc.JDBC4Connection@12e78a69

The first argument to getConnection is a URL-like string with jdbc:mysql://host[:port]/database. The second and third arguments are the username and password. Pass in an empty string if you can connect without a password.

Creating tables

Now that we have a database connection, let's interact with the server. For these examples, you will find it useful to have a MySQL shell open (or a MySQL GUI such as MySQLWorkbench) as well as the Scala console. You can open a MySQL shell by typing the following command in a terminal:

$ mysql

As an example, we will create a small table to keep track of famous physicists. In a mysql shell, we would run the following command:

mysql> USE test;

mysql> CREATE TABLE physicists (

id INT(11) AUTO_INCREMENT PRIMARY KEY,

name VARCHAR(32) NOT NULL

);

To achieve the same with Scala, we send a JDBC statement to the connection:

scala> val statementString = """

CREATE TABLE physicists (

id INT(11) AUTO_INCREMENT PRIMARY KEY,

name VARCHAR(32) NOT NULL

)

"""

scala> val statement = connection.prepareStatement(statementString)

PreparedStatement = JDBC4PreparedStatement@c983201: CREATE TABLE ...

scala> statement.executeUpdate()

results: Int = 0

Inserting data

Now that we have created a table, let's insert some data into it. We can do this with a SQL INSERT statement:

scala> val statement = connection.prepareStatement("""

INSERT INTO physicists (name) VALUES ('Isaac Newton')

""")

scala> statement.executeUpdate()

Int = 1

In this case, executeUpdate returns 1. When inserting rows, it returns the number of rows that were inserted. Similarly, if we had used a SQL UPDATE statement, this would return the number of rows that were updated. For statements that do not manipulate rows directly (such as the CREATE TABLE statement in the previous section), executeUpdate just returns 0.

Let's just jump into a mysql shell to verify the insertion performed correctly:

mysql> select * from physicists ;

Data Ingestion with Spark Scala and SQL through JDBC

Steven Murhula

Machine Learning l Data Engineer l Scala l Python l Data Analysis l Big Data Development l SQL I AWS l DevOps l ETL GCP & Azure

Interacting with JDBC

First steps with JDBC

Connecting to a database server

Creating tables

Inserting data

更多精彩文章

社区洞察

其他会员也浏览了

Choosing between SQL and NoSQL

Top 20 Big Data Platforms: The Best Open Source Tools (updated April 2020)

SQL

Introduction to Apache Cassandra: A Distributed NoSQL Database for Big Data

To SQL or to NoSQL: The Trend of Differentiation

What are SQL and NoSQL Databases?

SQL turns 50: A look back at its history

PySpark vs Spark MySQL vs SQL ETL vs ELT Data Warehouse and Database Data mart vs Data Lake

Neo4j Graph Database: The non-relational database that improves on relations.

SQL vs. No SQL

Interacting with JDBC

First steps with JDBC

Connecting to a database server

Creating tables

Inserting data

HOW TO HANDLE HISTORICAL DATA IN DATA WAREHOUSE IMPLEMENTATION

2023年5月1日

when to chose graph database vs nosql database

2023年4月25日

best practices for data warehousing with Azure real world scenario

2023年4月25日

Apache Airflow as an open source orchestration solution for Data Engineering

2022年12月7日

K-means clustering - European countries protein consumption. R Project

2020年1月11日

Stock Price Prediction with Regression Algorithms

2020年1月6日

An eye on Machine Learning

2020年1月6日

SCALA FOR DATA SCIENCES PROFESSIONALS

2020年1月3日

Creating a Chatbot That Gets Smarter Over Time with a Business Case Study

2018年5月25日

Business Value in Sentiment Analysis and Natural Processing with Use Case

2018年4月16日

社区洞察

其他会员也浏览了

Choosing between SQL and NoSQL

Top 20 Big Data Platforms: The Best Open Source Tools (updated April 2020)

SQL

Introduction to Apache Cassandra: A Distributed NoSQL Database for Big Data

To SQL or to NoSQL: The Trend of Differentiation

What are SQL and NoSQL Databases?

SQL turns 50: A look back at its history

PySpark vs Spark MySQL vs SQL ETL vs ELT Data Warehouse and Database Data mart vs Data Lake

Neo4j Graph Database: The non-relational database that improves on relations.

SQL vs. No SQL