Oracle Cloud Infrastructure Data Flow
Oracle provides a fully managed Apache Spark service that makes running Spark applications easier than ever. Oracle Cloud Infrastructure Data Flow simplifies and speeds up the delivery of Spark-based big data and machine learning applications by managing all aspects of Apache Spark jobs, operations, and infrastructure in the cloud.
Getting started is easy: upload your Spark application to Oracle Cloud, configure the resources it needs, and run it.
Before you start using it, check out Getting Started with Data Flow, which covers setting up the required user groups, storage buckets, and associated IAM policies to create, manage, and run applications in Data Flow.
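For reference, the IAM setup boils down to a few policy statements along the lines of the sketch below. The group name, compartment placeholder, and even the dataflow-family resource type here are my assumptions, so use the exact statements from the Getting Started guide:
Allow group dataflow-users to manage dataflow-family in compartment <compartment-name>
Allow group dataflow-users to manage objects in compartment <compartment-name>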
This being my first try with OCI Data Flow, I started with a basic Python-based application that reads data from a CSV file in Object Storage and loads it into a table in Oracle Autonomous Database. A sample CSV datafile can be downloaded here. The following diagram shows the architecture -
Here's the Python code -
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sample Data Flow App") \
    .getOrCreate()

print("Reading data from Object Storage!")

src_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .load(
        "oci://<Bucket-Name>@<Namespace-Name>/<CSV-filename>"  # Datafile location in OCI Object Storage
    )
    .cache()  # cache the dataset to speed up later actions
)

# Write data to Oracle Autonomous Database using the Spark Oracle Datasource
src_df.write \
    .format("oracle") \
    .option("adbId", "<Autonomous Database OCID>") \
    .option("connectionId", "<Database connection: database_high/medium/low>") \
    .option("dbtable", "<Target Tablename>") \
    .option("user", "<DB User>") \
    .option("password", "<DB User password>") \
    .mode("overwrite") \
    .save()  # mode("overwrite") replaces an existing table

print("\nData written to Autonomous Database.")
Create Data Flow Application
Log in to the Oracle Cloud Console, open the navigation menu, and click Data Flow under Analytics & AI. Click Create application and enter the details -
While creating a Data Flow application, select the Spark version, the Driver and Executor shapes (number of OCPUs and amount of memory), and the number of Executors.
Select Python from the Language options and specify the location of the Python file (already uploaded to an OCI Object Storage bucket).
Under Advanced options, select the Enable Spark Oracle data source property to use the Spark Oracle Datasource. Spark Oracle Datasource is an extension of the Spark JDBC datasource that simplifies connecting to Oracle databases from Spark: in addition to all the options of Spark's JDBC datasource, it adds options such as adbId, which lets the application reference an Autonomous Database directly by its OCID, as in the code above.
Specify an OCI Logging Log Group and Log (OCI Logging provides Spark diagnostic logs and application logs). Click Create to create the application.
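The same application definition can also be created programmatically with the OCI Python SDK. Below is a minimal sketch; the client and model names reflect my understanding of the SDK, and the shape and Spark version values are placeholders, so verify both against the SDK reference before using it:
import oci

# Load the default OCI config (~/.oci/config) and create a Data Flow client
config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Define the application: language, code location, shapes, and executor count
app_details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="<Compartment OCID>",
    display_name="Sample Data Flow App",
    spark_version="<Spark version, e.g. 3.2.1>",
    language="PYTHON",
    file_uri="oci://<Bucket-Name>@<Namespace-Name>/<python-filename>",
    driver_shape="<Driver shape, e.g. VM.Standard2.1>",
    executor_shape="<Executor shape, e.g. VM.Standard2.1>",
    num_executors=1,
)
app = df_client.create_application(app_details)
print(app.data.id)  # Application OCID, needed to start a run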
Run Data Flow Application
A Data Flow Run captures and securely stores the application's output, logs, and statistics. Runs also give secure access to the Spark UI for debugging and diagnostics. To run the application, click Run on the Application details page. Data Flow handles everything from Spark job provisioning to execution to cleanup.
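A run can also be started programmatically. Here is a minimal sketch using the OCI Python SDK, reusing the Application OCID from the previous sketch; again, the model and field names are my assumptions, so check them against the SDK reference:
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Start a run of an existing Data Flow application
run_details = oci.data_flow.models.CreateRunDetails(
    application_id="<Application OCID>",
    compartment_id="<Compartment OCID>",
    display_name="sample-data-flow-run",
)
run = df_client.create_run(run_details)
print(run.data.lifecycle_state)  # e.g. ACCEPTED while the run is being provisioned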
Data Flow securely captures all job output, making it easy to access for analysis. Each run of an application generates a set of output logs, which you can access from the Logs section on the Run details page -
Clicking on the Application Log will display the logs generated in OCI Logging -
The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. Click Spark UI on the Run details page to launch the Spark UI -
Apache Spark provides a suite of web user interfaces (UIs) that can be used to monitor the status and resource consumption of your Spark cluster. It also helps in exploring logical and physical execution plans. For streaming, it provides insights into processing progress, for example, input/output rates, offsets, durations, and statistical distributions. Details from a job's event timeline are easily accessible, even drilling down to each job's individual stages.
After the job completed successfully and I had reviewed the logs, I connected to the Oracle Autonomous Database and verified that the data was inserted into the table specified in the Python code -
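The same check can also be done from Spark itself by reading the target table back with the Spark Oracle Datasource and comparing row counts. A small sketch that could be appended to the end of the application script, reusing the placeholders and the src_df DataFrame from the code above:
# Sanity check: read the target table back and compare row counts
tgt_df = (
    spark.read.format("oracle")
    .option("adbId", "<Autonomous Database OCID>")
    .option("dbtable", "<Target Tablename>")
    .option("user", "<DB User>")
    .option("password", "<DB User password>")
    .load()
)
print("Rows in source CSV  :", src_df.count())
print("Rows in target table:", tgt_df.count())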
Conclusion
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service that performs processing tasks on extremely large datasets—without infrastructure to deploy or manage. Developers can also use Spark Streaming to perform cloud ETL on their continuously produced streaming data. This enables rapid application delivery because developers can focus on app development, not infrastructure management. OCI Data Flow handles infrastructure provisioning, network setup, and teardown when Spark jobs are complete. Storage and security are also managed, which means less work is required for creating and managing Spark applications for big data analysis.
Try OCI Data Flow - A fully managed cloud service that simplifies and streamlines Spark applications.