Oracle Cloud Infrastructure Data Flow

Oracle provides a fully managed Apache Spark service that makes running Spark applications easier than ever before. Oracle Cloud Infrastructure Data Flow simplifies and speeds up the delivery of Spark-based Big Data and Machine Learning applications by managing all aspects of Apache Spark jobs, operations, and infrastructure in the cloud.

Getting started is easy: just upload your Spark application to Oracle Cloud, configure the resources needed, and run the application.

Before you start using it, check out Getting Started with Data Flow for setting up the required user groups, storage buckets, and associated IAM policies to create, manage, and run applications in Data Flow.

This being my first try with OCI Data Flow, I started with a basic Python-based application that reads data from a CSV file in Object Storage and uploads it to a table in Oracle Autonomous Database. A sample CSV data file can be downloaded here. The following diagram shows the architecture -

Architecture Diagram

Here's the Python code -

from pyspark.sql import SparkSession

# Create a Spark Session
spark = SparkSession \
    .builder \
    .appName("Sample Data Flow App") \
    .getOrCreate()
    
print("Reading data from object storage !")

src_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .load(
        "oci://<Bucket-Name>@<Namespace-Name>/<CSV-filename>" # Datafile location in OCI Object Storage
    )
    .cache()
)  # cache the dataset to increase computing speed

# Write data to Oracle Autonomous Database.
# mode("overwrite") replaces the target table if it already exists.
src_df.write \
    .format("oracle") \
    .option("adbId", "<Autonomous Database OCID>") \
    .option("connectionId", "<Database Connection database_high/low/medium>") \
    .option("dbtable", "<Target Tablename>") \
    .option("user", "<DB User>") \
    .option("password", "<DB User password>") \
    .mode("overwrite") \
    .save()
    
print("\nData written to autonomous database.")        

Create Data Flow Application

Log in to the Oracle Cloud Console, open the navigation menu, and click Data Flow under Analytics & AI. Click Create application and enter the details -

Create Data Flow Application

While creating a Data Flow application, select the Spark version, the Driver and Executor shapes (number of OCPUs and amount of memory), and the number of Executors.

Data Flow Application configuration

Select Python from the Language options and specify the location of the Python file (already uploaded to an OCI Object Storage bucket).
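
The same application can also be created programmatically. The sketch below is illustrative only and assumes the OCI Python SDK (the "oci" package) with a config file at ~/.oci/config; the compartment OCID, Spark version, shapes, and file path are placeholders to replace with your own values:

import oci

# Minimal sketch of creating the Data Flow application with the OCI Python SDK.
# All OCIDs, shapes, and paths below are placeholders.
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

app_details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..<placeholder>",
    display_name="Sample Data Flow App",
    language="PYTHON",
    spark_version="3.2.1",               # pick a version offered by Data Flow
    file_uri="oci://<Bucket-Name>@<Namespace-Name>/<python-filename>.py",
    driver_shape="VM.Standard2.1",       # example shape
    executor_shape="VM.Standard2.1",     # example shape
    num_executors=1,
)

app = df_client.create_application(app_details).data
print("Created application:", app.id)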

Advanced Options

Under advanced options, select the Enable Spark Oracle data source property to use the Spark Oracle Datasource. Spark Oracle Datasource is an extension of the Spark JDBC datasource that simplifies connecting to Oracle databases from Spark. In addition to all the options provided by Spark's JDBC datasource, it provides the following (a plain-JDBC comparison sketch follows this list):

  • It automatically downloads the wallet from the autonomous database, so there is no need to download the wallet and keep it in Object Storage or Vault.
  • It automatically distributes the wallet bundle from Object Storage to the driver and executors without any customized code from users.
  • It includes JDBC driver JAR files, and so eliminates the need to download them and include them in your archive.zip file. The JDBC driver is version 21.3.0.0.
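
For comparison, this is roughly what the write step would look like with the plain Spark JDBC datasource, where you download and unzip the wallet and provide the JDBC driver yourself. It is a minimal sketch; the TNS alias and wallet path are placeholders:

# Equivalent write using the generic JDBC datasource instead of the
# Spark Oracle Datasource (wallet and driver managed by you).
src_df.write \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@<tns_alias>?TNS_ADMIN=/path/to/unzipped/wallet") \
    .option("dbtable", "<Target Tablename>") \
    .option("user", "<DB User>") \
    .option("password", "<DB User password>") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .mode("overwrite") \
    .save()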

Specify an OCI Logging Log Group and Log (OCI Logging provides Spark diagnostic logs and application logs). Click Create to create the application.

Data Flow Application

Run Data Flow Application

The Data Flow Run captures and securely stores the application's output, logs, and statistics. Runs also give secure access to the Spark UI for debugging and diagnostics. To run the application, click Run on the Application details page. Data Flow does it all, from Spark job provisioning to execution to cleanup.
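
A Run can also be started programmatically. This is a minimal sketch, again assuming the OCI Python SDK; the application and compartment OCIDs are placeholders, and the wait step simply polls until the Run reaches a terminal state:

import oci

# Minimal sketch of starting a run and waiting for it to finish.
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..<placeholder>",
    application_id="ocid1.dataflowapplication.oc1..<placeholder>",
    display_name="Sample Data Flow App - run 1",
)
run = df_client.create_run(run_details).data

# Poll until the run reaches a terminal lifecycle state.
oci.wait_until(
    df_client,
    df_client.get_run(run.id),
    evaluate_response=lambda r: r.data.lifecycle_state in ("SUCCEEDED", "FAILED", "CANCELED"),
    max_wait_seconds=3600,
)
print("Run finished with state:", df_client.get_run(run.id).data.lifecycle_state)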

Data Flow Run details

Data Flow securely captures all job output, making it easy to gain access to analytics. Each run of an application generates a set of output logs; a sketch of retrieving them with the OCI Python SDK follows the list:

  • Application logs, such as stdout and stderr. These logs are available immediately after the Run completes.
  • Diagnostics logs, such as Spark driver and executor logs. These logs are uploaded every 10 to 12 minutes during the Run execution, so depending on how long the Run takes, they aren't always all uploaded before it finishes, and more than one set of diagnostic logs may be uploaded.
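
The same logs can also be retrieved without the console. This is a minimal sketch assuming the OCI Python SDK's Data Flow client; the Run OCID is a placeholder:

import oci

# Minimal sketch: list the log files attached to a run and download one of them.
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

run_id = "ocid1.dataflowrun.oc1..<placeholder>"
logs = df_client.list_run_logs(run_id).data
for log in logs:
    print(log.name, log.type, log.source)    # e.g. stdout/stderr, application/diagnostic

# Download the first log file returned; the raw bytes are in response.data.content.
if logs:
    log_bytes = df_client.get_run_log(run_id, logs[0].name).data.content
    with open(logs[0].name, "wb") as f:
        f.write(log_bytes)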

Access the logs from the Logs section on the Run details page -

Data Flow Run Logs

Clicking the Application Log displays the logs generated in OCI Logging -


The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. Click Spark UI on the Run details page to launch it -

Spark UI

Apache Spark provides a suite of web user interfaces (UIs) that can be used to monitor the status and resource consumption of your Spark cluster. It also helps in exploring logical and physical execution plans. For streaming, it provides insights on processing progress, for example, input/output rates, offsets, durations, and statistical distribution. Details from a job's event timeline are easily accessible, even drilling down to each job's individual stages.

After the job completed successfully and all the logs were verified, I connected to the Oracle Autonomous Database and confirmed that the data was inserted into the table specified in the Python code -

Target Autonomous Database

Conclusion

Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets—without infrastructure to deploy or manage. Developers can also use Spark Streaming to perform cloud ETL on their continuously produced streaming data. This enables rapid application delivery because developers can focus on app development, not infrastructure management. OCI Data Flow handles infrastructure provisioning, network setup, and teardown when Spark jobs are complete. Storage and security are also managed, which means less work is required for creating and managing Spark applications for big data analysis.


Try OCI Data Flow - A fully managed cloud service that simplifies and streamlines Spark applications.
