Data Ingestion with Spark Scala and SQL through JDBC

Data Ingestion with Spark Scala and SQL through JDBC

Introduction :

One of data science's raison d'être is the difficulty of manipulating large datasets. Much of the data of interest to a company or research group cannot fit conveniently in a single computer's RAM. Storing the data in a way that is easy to query is therefore a complex problem.

Relational databases have been successful at solving the data storage problem. Originally proposed in 1970 (https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf), the overwhelming majority of databases in active use today are still relational. In that time, the price of RAM per megabyte has decreased by a factor of a hundred million. Similarly, hard drive capacity has increased from tens or hundreds of megabytes to terabytes. It is remarkable that, despite this exponential growth in data storage capacity, the relational model has remained dominant.

Virtually all relational databases are described and queried with variants of SQL (Structured Query Language). With the advent of distributed computing, the position of SQL databases as the de facto data storage standard is being challenged by other types of databases, commonly grouped under the umbrella term NoSQL. Many NoSQL databases are more partition-tolerant than SQL databases: they can be split into several parts residing on different computers. Even though I still expect that NoSQL databases will become increasingly popular, SQL databases are likely to remain prevalent as a data persistence mechanism; hence, a significant portion of this book is devoted to interacting with SQL from Scala.

Interacting with JDBC

No alt text provided for this image

JDBC is an API for connecting to SQL databases in Java. It remains the simplest way

of connecting to SQL databases from Scala. Furthermore, the majority of higher-level

abstractions for interacting with databases still use JDBC as a backend. JDBC is not a library in itself. Rather, it exposes a set of interfaces to interact with databases. Relational database vendors then provide specific implementations of these interfaces.

Let's start by creating a build.sbt file. We will declare a dependency on the MySQL JDBC connector:

scalaVersion := "2.11.7"

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.36"

First steps with JDBC

No alt text provided for this image

Let's start by connecting to JDBC from the command line. To follow with the examples, you will need access to a running MySQL server. If you added the MySQL connector to the list of dependencies, open a Scala console by typing the following command:

$ sbt console

Let's import JDBC:

scala> import java.sql._

import java.sql._

We then need to tell JDBC to use a specific connector. This is normally done using reflection, loading the driver at runtime:

scala> Class.forName("com.mysql.jdbc.Driver")

Class[_] = class com.mysql.jdbc.Driver

This loads the appropriate driver into the namespace at runtime. If this seems somewhat magical to you, it's probably not worth worrying about exactly how this works. This is the only example of reflection that we will consider in this book, and it is not particularly idiomatic Scala.

Connecting to a database server

Connecting to a database server Having specified the SQL connector, we can now connect to a database. Let's assume that we have a database called test on host 127.0.0.1, listening on port 3306. We create a connection as follows:

scala> val connection = DriverManager.getConnection(

"jdbc:mysql://127.0.0.1:3306/test",

"root", // username when connecting

"" // password

)

java.sql.Connection = com.mysql.jdbc.JDBC4Connection@12e78a69

The first argument to getConnection is a URL-like string with jdbc:mysql://host[:port]/database. The second and third arguments are the username and password. Pass in an empty string if you can connect without a password.

Creating tables

No alt text provided for this image

Now that we have a database connection, let's interact with the server. For these examples, you will find it useful to have a MySQL shell open (or a MySQL GUI such as MySQLWorkbench) as well as the Scala console. You can open a MySQL shell by typing the following command in a terminal:

$ mysql

As an example, we will create a small table to keep track of famous physicists. In a mysql shell, we would run the following command:

mysql> USE test;

mysql> CREATE TABLE physicists (

id INT(11) AUTO_INCREMENT PRIMARY KEY,

name VARCHAR(32) NOT NULL

);

To achieve the same with Scala, we send a JDBC statement to the connection:

scala> val statementString = """

CREATE TABLE physicists (

id INT(11) AUTO_INCREMENT PRIMARY KEY,

name VARCHAR(32) NOT NULL

)

"""

scala> val statement = connection.prepareStatement(statementString)

PreparedStatement = JDBC4PreparedStatement@c983201: CREATE TABLE ...

scala> statement.executeUpdate()

results: Int = 0

Inserting data

Now that we have created a table, let's insert some data into it. We can do this with a SQL INSERT statement:

scala> val statement = connection.prepareStatement("""

INSERT INTO physicists (name) VALUES ('Isaac Newton')

""")

scala> statement.executeUpdate()

Int = 1

In this case, executeUpdate returns 1. When inserting rows, it returns the number of rows that were inserted. Similarly, if we had used a SQL UPDATE statement, this would return the number of rows that were updated. For statements that do not manipulate rows directly (such as the CREATE TABLE statement in the previous section), executeUpdate just returns 0.

Let's just jump into a mysql shell to verify the insertion performed correctly:

mysql> select * from physicists ;


No alt text provided for this image




要查看或添加评论,请登录

社区洞察

其他会员也浏览了