Best Practices for Data & Analytics Architecture on AWS
Igor Royzis
CTO | Software Engineering Leader in Cloud, Data & AI | Scaling Organizations, Driving Innovation, Delivering Results
"Best practice is a procedure that has been shown by research and experience to produce optimal results and that is established or proposed as a standard suitable for widespread adoption" - Merriam-Webster Dictionary
Data, Analytics, Web & Mobile on AWS
Here is how your architecture would look on AWS if you needed to implement most of the common data, analytics, web and mobile use cases.
Yes, it looks very busy and complex. The good news is that most organizations only need to implement part of this architecture for their specific use cases.
So let's get right into it and cover some of the popular use cases and architecture best practices. As you go through each use case, notice that they all share the same data lake foundation. In other words, regardless of what kind of data you're ingesting, your data lake structure stays the same. This provides a consistent approach to storing, organizing, securing and governing your data, and allows you to transform and analyze data from different sources and of different types using common technologies and even the same codebase.
Ingest, process and organize CSV files in near real-time on AWS
This is a straightforward and very popular use case for organizations that have many departments or lines of business with heavy spreadsheet use. At some point these organizations realize that spending days or weeks creating, combing through and aggregating data from 20, 50 or 100 spreadsheets just to produce end-of-month reports is very inefficient. This architecture lets you ingest and organize various spreadsheets into an AWS data lake, transform and aggregate the data in near real time using Glue jobs, and explore it with Athena/SQL queries.
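A key part of keeping the data lake structure consistent is a predictable S3 key layout. Here is a minimal sketch of a helper that builds Hive-style partitioned keys (the zone, source and dataset names are hypothetical; the year/month/day partition scheme is one common convention that Glue crawlers and Athena can discover automatically):

```python
from datetime import datetime, timezone
from typing import Optional

def lake_key(zone: str, source: str, dataset: str, filename: str,
             ts: Optional[datetime] = None) -> str:
    """Build a Hive-style partitioned S3 key (zone/source/dataset/date)
    so Glue crawlers and Athena can discover partitions automatically."""
    ts = ts or datetime.now(timezone.utc)
    return (f"{zone}/{source}/{dataset}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}")

# Example: landing an incoming spreadsheet export in the raw zone.
key = lake_key("raw", "finance", "invoices", "invoices_2024-01-05.csv")
```

With a layout like this, every ingested CSV lands in the same structure, so the same Glue job and the same Athena table definitions work across departments.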
Ongoing replication of small to medium-sized Oracle or MS SQL Server databases to an AWS data lake
Another popular use case is establishing a data warehouse and BI foundation on AWS. This architecture provides near-real-time replication of data from an on-premises database to the AWS data lake via DMS (Database Migration Service), ETL/ELT capability via Glue jobs, and data exploration using Athena/SQL.
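DMS writes CDC files that include an `Op` column ('I' insert, 'U' update, 'D' delete) alongside the table's columns. A downstream Glue job typically merges those change records into the latest table state. Here is a minimal, in-memory sketch of that merge logic (the `id` primary-key column and the record shapes are assumptions for illustration; a real job would do this with Spark over S3 files):

```python
def apply_cdc(snapshot: dict, changes: list, key: str = "id") -> dict:
    """Merge DMS-style CDC records into a snapshot of the table.

    Each change record carries an 'Op' column: 'I' (insert),
    'U' (update) or 'D' (delete). Records are applied in order.
    """
    state = dict(snapshot)
    for rec in changes:
        op = rec.get("Op", "I")
        row_key = rec[key]
        if op == "D":
            state.pop(row_key, None)          # delete the row if present
        else:
            # insert/update: keep all columns except the Op marker
            state[row_key] = {k: v for k, v in rec.items() if k != "Op"}
    return state
```

The same idea scales up in Glue as a join-and-overwrite (or a format with merge support, such as Apache Iceberg or Hudi) rather than a Python dict.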
Process and organize events in near real-time
Many organizations have adopted event-based or event-sourced architectures for their applications. This use case is appropriate for organizations that need to store and organize application-produced events in an AWS data lake in near real time.
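When delivering events to the lake through Kinesis Data Firehose, producers have to respect the service's batching limits (`PutRecordBatch` accepts at most 500 records per call). A minimal sketch of the batching step, with the limit hard-coded as an assumption you would confirm against current service quotas:

```python
def batch_events(events: list, max_records: int = 500) -> list:
    """Split a stream of events into batches that fit within the
    Kinesis Data Firehose PutRecordBatch limit of 500 records per call."""
    return [events[i:i + max_records]
            for i in range(0, len(events), max_records)]

# Each batch would then be sent with boto3's
# firehose_client.put_record_batch(DeliveryStreamName=..., Records=batch),
# checking FailedPutCount in the response and retrying failed records.
```

Firehose then buffers and delivers the events to S3, where they land in the same partitioned data lake structure as the other use cases.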
Run ETL/ELT jobs and publish results to Redshift
Some organizations already have a way of ingesting data into Amazon S3, but need a proven way of transforming and loading (ETL) or loading and transforming (ELT) data into a Redshift data warehouse.
And now let's put it all together for a typical medium complexity data platform on AWS with both internal and external data sources
Conclusion
There are many more use cases that organizations are implementing on AWS while utilizing best practices. The key to a successful implementation is choosing the right architectural patterns and technologies for the job.