Oracle Cloud Infrastructure Data Flow
Oracle provides a fully managed Apache Spark service that makes running Spark applications easier than ever. Oracle Cloud Infrastructure Data Flow simplifies and speeds up the delivery of Spark-based big data and machine learning applications by managing all aspects of Apache Spark jobs, operations, and infrastructure in the cloud.
Getting started is easy: upload your Spark application to Oracle Cloud, configure the resources it needs, and run it.
Before you start using it, check out Getting Started with Data Flow, which covers setting up the required user groups, storage buckets, and associated IAM policies to create, manage, and run applications in Data Flow.
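For reference, the IAM setup boils down to a few policy statements along the lines of the sketch below. The group name, compartment placeholder, and even the dataflow-family resource type here are my assumptions, so use the exact statements from the Getting Started guide:
Allow group dataflow-users to manage dataflow-family in compartment <compartment-name>
Allow group dataflow-users to manage objects in compartment <compartment-name>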
This being my first try with OCI Data Flow, I started with a basic Python-based application that reads data from a CSV file in Object Storage and loads it into a table in Oracle Autonomous Database. A sample CSV datafile can be downloaded here. The following diagram shows the architecture -
Here's the Python code -
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sample Data Flow App") \
    .getOrCreate()

print("Reading data from Object Storage!")

src_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .load(
        "oci://<Bucket-Name>@<Namespace-Name>/<CSV-filename>"  # Datafile location in OCI Object Storage
    )
    .cache()  # cache the dataset to speed up later actions
)

# Write data to Oracle Autonomous Database using the Spark Oracle Datasource
src_df.write \
    .format("oracle") \
    .option("adbId", "<Autonomous Database OCID>") \
    .option("connectionId", "<Database connection: database_high/medium/low>") \
    .option("dbtable", "<Target Tablename>") \
    .option("user", "<DB User>") \
    .option("password", "<DB User password>") \
    .mode("overwrite") \
    .save()  # mode("overwrite") replaces an existing table

print("\nData written to Autonomous Database.")
Create Data Flow Application
Log in to the Oracle Cloud Console, open the navigation menu, and click Data Flow under Analytics & AI. Click Create application and enter the details -
While creating a Data Flow application, select the Spark version, the Driver and Executor shapes (number of OCPUs and amount of memory), and the number of Executors.
Select Python from the Language options and specify the location of the Python file (already uploaded to an OCI Object Storage bucket).
Under Advanced options, select the Enable Spark Oracle data source property to use the Spark Oracle Datasource. Spark Oracle Datasource is an extension of the Spark JDBC datasource that simplifies connecting to Oracle databases from Spark: in addition to all the options of Spark's JDBC datasource, it adds options such as adbId, which lets the application reference an Autonomous Database directly by its OCID, as in the code above.
Specify an OCI Logging Log Group and Log (OCI Logging provides Spark diagnostic logs and application logs). Click Create to create the application.
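The same application definition can also be created programmatically with the OCI Python SDK. Below is a minimal sketch; the client and model names reflect my understanding of the SDK, and the shape and Spark version values are placeholders, so verify both against the SDK reference before using it:
import oci

# Load the default OCI config (~/.oci/config) and create a Data Flow client
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Define the application: language, code location, shapes, and executor count
app_details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="<Compartment OCID>",
    display_name="Sample Data Flow App",
    spark_version="<Spark version, e.g. 3.2.1>",
    language="PYTHON",
    file_uri="oci://<Bucket-Name>@<Namespace-Name>/<python-filename>",
    driver_shape="<Driver shape, e.g. VM.Standard2.1>",
    executor_shape="<Executor shape, e.g. VM.Standard2.1>",
    num_executors=1,
)
app = df_client.create_application(app_details)
print(app.data.id)  # Application OCID, needed to start a run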
Run Data Flow Application
A Data Flow Run captures and securely stores the application's output, logs, and statistics. Runs also give secure access to the Spark UI for debugging and diagnostics. To run the application, click Run on the Application details page. Data Flow handles everything from Spark job provisioning to execution to cleanup.
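A run can also be started programmatically. Here is a minimal sketch using the OCI Python SDK, reusing the Application OCID from the previous sketch; again, the model and field names are my assumptions, so check them against the SDK reference:
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Start a run of an existing Data Flow application
run_details = oci.data_flow.models.CreateRunDetails(
    application_id="<Application OCID>",
    compartment_id="<Compartment OCID>",
    display_name="sample-data-flow-run",
)
run = df_client.create_run(run_details)
print(run.data.lifecycle_state)  # e.g. ACCEPTED while the run is being provisioned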
Data Flow securely captures all job output, making it easy to access for analysis. Each run of an application generates a set of output logs, which you can access from the Logs section on the Run details page -
Clicking on the Application Log will display the logs generated in OCI Logging -
The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. Click Spark UI on the Run details page to launch the Spark UI -
Apache Spark provides a suite of web user interfaces (UIs) that can be used to monitor the status and resource consumption of your Spark cluster. It also helps in exploring logical and physical execution plans. For streaming, it provides insights into processing progress, for example, input/output rates, offsets, durations, and statistical distributions. Details from a job's event timeline are easily accessible, even drilling down to each job's individual stages.
After the job completed successfully and I had reviewed the logs, I connected to the Oracle Autonomous Database and verified that the data was inserted into the table specified in the Python code -
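The same check can also be done from Spark itself by reading the target table back with the Spark Oracle Datasource and comparing row counts. A small sketch that could be appended to the end of the application script, reusing the placeholders and the src_df DataFrame from the code above:
# Sanity check: read the target table back and compare row counts
tgt_df = (
    spark.read.format("oracle")
    .option("adbId", "<Autonomous Database OCID>")
    .option("dbtable", "<Target Tablename>")
    .option("user", "<DB User>")
    .option("password", "<DB User password>")
    .load()
)
print("Rows in source CSV  :", src_df.count())
print("Rows in target table:", tgt_df.count())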
Conclusion
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets—without infrastructure to deploy or manage. Developers can also use Spark Streaming to perform cloud ETL on their continuously produced streaming data. This enables rapid application delivery because developers can focus on app development, not infrastructure management. OCI Data Flow handles infrastructure provisioning, network setup, and teardown when Spark jobs are complete. Storage and security are also managed, which means less work is required for creating and managing Spark applications for big data analysis.
Try OCI Data Flow - A fully managed cloud service that simplifies and streamlines Spark applications.