AWS Redshift DWH and Steps to store Data from S3 and DynamoDB

This article helps in configuring Amazon Redshift for AWS-based data classification and data warehouse (DWH) needs, and walks through loading data into it from S3 and DynamoDB.

INTRODUCTION

Amazon Redshift is a fully managed, petabyte-scale data warehouse (DWH) service on AWS. It is designed to handle large-scale data storage and complex queries, making it a popular choice for businesses looking to perform advanced analytics and gain insights from their data.

Copying data from Amazon S3 and DynamoDB to Amazon Redshift is essential for efficient data analysis and reporting. Here’s why it’s beneficial to move data from these sources into Redshift:

1. Optimized for Analytics and Complex Queries

- Data Warehouse vs. Operational Databases: DynamoDB is a NoSQL database optimized for high-transaction applications, while S3 is an object storage service for flexible data storage. They’re not built for complex analytical queries. Redshift, as a data warehouse, is designed for running complex queries on large datasets, so moving data there makes it easier to perform in-depth analysis.

- Columnar Storage and Parallel Processing: Redshift uses columnar storage and massively parallel processing (MPP), which significantly speeds up complex analytical queries compared to S3 or DynamoDB.

2. Data Consolidation for a Single Source of Truth

- Centralized Data for Analysis: By copying data from S3 (which might contain semi-structured data) and DynamoDB (typically used for real-time applications), Redshift allows you to consolidate disparate data sources. This centralized view supports more accurate reporting and insights, providing a single source of truth for business intelligence (BI).

- Easier Data Transformation and Cleaning: Once in Redshift, data can be standardized and joined across sources, making it ready for analysis. This step is often crucial for preparing data for BI tools like Tableau or Amazon QuickSight.

3. Scalability and Performance on Large Datasets

- Scaling for Big Data: Redshift can handle petabyte-scale data, making it suitable for big data analytics. S3 and DynamoDB can store massive amounts of data, but Redshift can process these large datasets much faster and more cost-effectively in analytic contexts.

- Efficient Data Loading and Querying: Redshift’s COPY command can load large datasets from S3 quickly and in parallel, minimizing the load time and allowing for rapid analysis.

4. Cost Efficiency for Analytical Workloads

- Separation of Storage and Compute: Since Redshift is optimized for analytics, it provides more cost-effective solutions for querying large volumes of data compared to querying directly in DynamoDB or S3, which could lead to high operational costs.

- Data Lifecycle Management: Frequently accessed analytical data can reside in Redshift, while colder data can remain in S3. This arrangement keeps storage costs down while maintaining fast access to active data.

5. Integration with AWS Ecosystem

- Seamless Integration with Other AWS Services: Redshift easily integrates with other AWS services like AWS Glue (for ETL), S3, and DynamoDB, enabling efficient data movement and transformation pipelines. This integration supports the end-to-end flow from raw data collection to analytical insights.

By copying data into Redshift, organizations gain a powerful platform for data analytics and BI, making it easier to derive actionable insights from their data across multiple sources.

Let us look at some easy steps to copy and load data from both an S3 bucket and a DynamoDB table into Redshift in the following sections.


Download the sample CSV file into your AWS CLI environment:

curl -O -L https://raw.githubusercontent.com/linuxacademy/content-aws-database-specialty/master/S06_Additional%20Database%20Services/redshift-data.csv
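It is worth a quick look at what was downloaded. Judging by the rows that show up in Redshift later in this article, each line should hold an ID, name, department and expense code, comma-separated:

head redshift-data.csv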

Similarly, download the sample JSON file:

curl -O -L https://raw.githubusercontent.com/linuxacademy/content-aws-database-specialty/master/S06_Additional%20Database%20Services/redshift-data.json
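Since this file is later fed to batch-write-item, it is expected to follow the DynamoDB request-items format: the table name mapped to a list of PutRequest items. A minimal sketch of one such item, assuming attributes that mirror the rows loaded later in this article (the actual file contents may differ):

{
  "redshift-import": [
    { "PutRequest": { "Item": {
        "ID": {"N": "0"},
        "Name": {"S": "David"},
        "Department": {"S": "Engineering"},
        "ExpenseCode": {"N": "200"}
    } } }
  ]
}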


Create Bucket:

aws s3 mb s3://redshift-import-786        

Copy file into bucket:

aws s3api put-object --bucket redshift-import-786 --key redshift-data.csv --body redshift-data.csv        
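To confirm the upload landed, list the bucket contents:

aws s3 ls s3://redshift-import-786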

Similarly, create a DynamoDB table for storing the JSON data:

aws dynamodb create-table --table-name redshift-import --attribute-definitions AttributeName=ID,AttributeType=N --key-schema AttributeName=ID,KeyType=HASH --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5        
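Table creation takes a few seconds; waiting for the table to become active avoids a failed write in the next step:

aws dynamodb wait table-exists --table-name redshift-import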

Once the table is created, load the JSON file into DynamoDB:

aws dynamodb batch-write-item --request-items file://redshift-data.json        

Now we can scan the table to confirm the data was written to DynamoDB:

aws dynamodb scan --table-name redshift-import        
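If the full scan output is too noisy, a COUNT-only scan confirms how many items were written:

aws dynamodb scan --table-name redshift-import --select COUNT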

After verifying the data, create an appropriate IAM role and assign it to Redshift by following these steps:

IAM ROLE CREATION

Create a service role with EC2 as the trusted service; the trust policy will be edited later to use Redshift instead.



Add read-only access for the S3 bucket:

Add read-only access for DynamoDB:


Edit the trust policy to change the trusted service from EC2 to Redshift:
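For completeness, here is a rough CLI equivalent of the console steps above. The trust policy file names are illustrative; each file simply names ec2.amazonaws.com or redshift.amazonaws.com as the trusted service.

# create the role with an EC2 trust policy (ec2-trust.json trusts ec2.amazonaws.com)
aws iam create-role --role-name redshift-import --assume-role-policy-document file://ec2-trust.json

# attach read-only access for S3 and DynamoDB
aws iam attach-role-policy --role-name redshift-import --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam attach-role-policy --role-name redshift-import --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess

# switch the trusted service from EC2 to Redshift (redshift-trust.json trusts redshift.amazonaws.com)
aws iam update-assume-role-policy --role-name redshift-import --policy-document file://redshift-trust.json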


ASSOCIATE IAM ROLE TO REDSHIFT

Navigate to the Redshift console, click 'Manage IAM roles', and select the recently created role:

Click 'Associate IAM role' to link the role to the cluster for further data loading:
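If you prefer the CLI, the same association can be done with modify-cluster-iam-roles; the cluster identifier and role ARN below are placeholders:

aws redshift modify-cluster-iam-roles --cluster-identifier <CLUSTER_ID> --add-iam-roles <IAM_ROLE_ARN>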

Once the role is associated with Redshift, go back to the CLI to get the cluster endpoint:

aws redshift describe-clusters | head -25        
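If you would rather pull just the endpoint address than scan the first 25 lines of output, a JMESPath query does it (assuming a single cluster in the account):

aws redshift describe-clusters --query "Clusters[0].Endpoint.Address" --output text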

Now put the endpoint into an environment variable:

export PGHOST=redshiftcluster-tin3jzeinmnn.cibxcgq1y6pn.us-east-1.redshift.amazonaws.com        

After this, run echo $PGHOST to confirm the environment variable is set in the CLI.


Now, get the ARN of the redshift-import role created earlier:

aws iam list-roles        
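list-roles can produce a lot of output; a JMESPath filter narrows it down to just the ARN we need:

aws iam list-roles --query "Roles[?RoleName=='redshift-import'].Arn" --output text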


After this, connect to Redshift from the CLI; this is where the real work begins!
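A typical way to connect is psql; this is a sketch, assuming your cluster's master user and database names (the prompt shown later in this article suggests a database called import-test), with 5439 being the default Redshift port:

# PGHOST is already exported above, so psql picks up the host automatically
psql -p 5439 -U <MASTER_USER> -d <DATABASE>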

Create Table in Redshift

create table users_import (ID int, Name varchar, Department varchar, ExpenseCode int);          

Let us copy from S3 to Redshift by following this template:

copy users_import from 's3://<BUCKET_NAME>/redshift-data.csv' iam_role '<IAM_ROLE_ARN>' delimiter ',';        

Now use the actual IAM role ARN to copy the records from S3 to Redshift:

copy users_import from 's3://redshift-import-786/redshift-data.csv' iam_role 'arn:aws:iam::975049934529:role/redshift-import' delimiter ',';        

This loads the data from S3 into Redshift; simple! A quick query confirms the rows landed:

import-test=# select * from users_import;
 id | name  | department  | expensecode 
----+-------+-------------+-------------
  0 | David | Engineering |         200
  1 | Clay  | HR          |         100
  2 | Sue   | Sales       |         300
  3 | Betty | Marketing   |         400
(4 rows)        


After this, let us clear out this table and redo the copy, this time loading the DynamoDB records into Redshift.

Emptying the table while keeping its structure is possible with the TRUNCATE command:
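Against the users_import table created above, that is simply:

truncate table users_import;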

After this, let us copy the data from DynamoDB to Redshift using this template:

copy users_import from 'dynamodb://redshift-import' iam_role '<IAM_ROLE_ARN>' readratio 50;        
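Using the same role ARN as in the S3 copy, the concrete command looks like this; readratio 50 tells Redshift to consume at most about half of the table's provisioned read capacity during the load:

copy users_import from 'dynamodb://redshift-import' iam_role 'arn:aws:iam::975049934529:role/redshift-import' readratio 50;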

Now the data is loaded from the DynamoDB table!


Please feel free to share your views on data warehouse options similar to Redshift, such as Snowflake, and your experience with them in the comments section.


Follow me on LinkedIn: Link

Like this article? Subscribe to the Engineering Leadership, Digital Accessibility, Digital Payments Hub and Motivation newsletters to enjoy reading useful articles. Press the SHARE and REPOST buttons to help share the content with your network.

#LinkedInNewsUK #FinanceLeadership
