Design Approach: On-Premise Database Processing Using AWS Glue

The objective of this solution is to establish connectivity between an on-premise database and AWS Glue over HTTP/HTTPS only.


Inbound Process

1. An application is written to capture change data (CDC) from the on-premise database.

2. A built-in translator program in the application reads the CDC records and converts them to JSON/CSV.

3. The application uploads the JSON/CSV files (raw data) to an S3 bucket over HTTP/HTTPS.

4. An AWS Glue crawler creates a table for each stage of the data, run either by a job trigger or on a predefined schedule. In this case, an AWS Lambda function triggers the ETL process every time a new file is added to the raw-data S3 bucket, and the ETL job transforms the file into a format suitable for Glue processing (if any transformation is required for analysis).

5. The crawler also updates the Glue Data Catalog on a schedule by reading the data structure from S3.

6. The transformed data is registered in the Glue Data Catalog. The catalog holds the table metadata; the data files themselves stay in S3.

7. A PySpark job, a native Java library, or native Python SQL can be written to read from and write to the Glue schema for data processing.
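As a rough illustration of steps 2 and 4 above, here is a minimal Python sketch. It is not the actual implementation behind this design: the Glue job name `raw-to-curated-etl`, the `--source_path` job argument, and the shape of the change records are all assumptions made for the example. The first helper would run inside the on-premise application; the rest would run in the Lambda function subscribed to the raw-data bucket's object-created notifications.

```python
import json
import urllib.parse

# Hypothetical Glue job name, chosen only for this sketch.
GLUE_JOB_NAME = "raw-to-curated-etl"


def cdc_to_json_lines(changes):
    """On-premise side (step 2): serialize change records as JSON Lines.

    `changes` is assumed to be a list of plain dicts, one per captured change.
    """
    return "\n".join(json.dumps(change, sort_keys=True) for change in changes)


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 object-created notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def start_etl_runs(event, glue_client, job_name=GLUE_JOB_NAME):
    """AWS side (step 4): start one Glue job run per newly uploaded file."""
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        response = glue_client.start_job_run(
            JobName=job_name,
            # Assumed argument name; the Glue job would have to read it.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """Lambda entry point, invoked by the S3 notification."""
    import boto3  # imported here so the helpers above need no AWS SDK

    return {"job_run_ids": start_etl_runs(event, boto3.client("glue"))}
```

Passing the Glue client in as a parameter keeps the trigger logic testable without AWS credentials; only `lambda_handler` touches `boto3`.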

Outbound Process

1. A PySpark job can be written in Glue to apply any changes to, or processing of, the data in Glue.

2. Glue jobs write the changes, via the catalog, to output file(s) in S3.

3. A separate scheduler in the on-premise application should read the changes from S3 and update the records in the local database.
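Step 3 of the outbound process could look roughly like the sketch below. It assumes the Glue job writes its output as JSON Lines, with one change record per line carrying an `op` field, and it uses an in-memory SQLite table named `customers` as a stand-in for the local database; none of these details come from the design itself. Downloading the output file from S3 over HTTPS is left out, so the function simply takes the file contents as a string.

```python
import json
import sqlite3


def apply_changes(conn, json_lines):
    """Apply change records (one JSON object per line) to the local table.

    Returns the number of records applied.
    """
    applied = 0
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines in the output file
        change = json.loads(line)
        if change["op"] == "DELETE":
            conn.execute("DELETE FROM customers WHERE id = ?", (change["id"],))
        else:
            # INSERT and UPDATE are both treated as "replace the row by key".
            conn.execute(
                "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                (change["id"], change["name"]),
            )
        applied += 1
    conn.commit()
    return applied


if __name__ == "__main__":
    # Stand-in for the on-premise database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    sample = '{"op": "INSERT", "id": 1, "name": "Asha"}'
    print(apply_changes(conn, sample), "change(s) applied")
```

The on-premise scheduler would call `apply_changes` once per new output file, ideally remembering which files it has already processed so that a rerun does not apply the same changes twice.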


**Watch this space for more updates on the development side of it**

sathish kumar

Solution Architect at HCL Technologies

2 years ago

I am learning AWS. Can you provide the code with sample data, or a YouTube link, for on-premise database processing using AWS Glue?

sathish kumar

Solution Architect at HCL Technologies

2 years ago

Hi Prosenjit
