T1A Develops Cloud File Transfer Framework to Democratize Data Integration

TL;DR

Working with a customer in the Media domain, T1A developed a Cloud File Transfer Framework that, at little cost, enabled and democratized data integration with third parties for Data Analysts across the organization.

Introduction

Imagine a Data Platform that’s operating smoothly, with data pipelines functioning flawlessly day in and day out. Then, all of a sudden, new demands emerge that involve integrating various files into the platform, for example:

  • A business user urgently approaches you, asking for specific data from a file or another system to conduct tests or validate numbers.
  • Your Data Engineering team receives a new assignment: to add hundreds of files to the existing data pipelines on a daily basis.
  • Analysts on your team want to compare your internal data with information from an external Excel or CSV file.

Faced with these challenges, you, as a responsible Manager, have a dilemma: Should you treat each request as a one-off, assigning a team of Data Engineers to create individual pipelines? Or would you rather develop a framework that enables end-users to ingest data themselves and perform analytics with external and partner data?

This article discusses how we approached these ad-hoc data ingestion requests by developing a Cloud File Transfer Framework.

Problem and Solution

First, let’s identify the problem clearly. We need an efficient method to import data files into a Data Lake, Data Repository, or Data Platform. Additionally, this method should be simple to modify and easy to maintain. Traditional approaches often rely on custom data pipelines, using a sequence of ETL (Extract, Transform, Load) tasks to move data at scheduled times from source systems to a target Data Lake or Database. However, these custom pipelines come with limitations:

  • They are restricted by the resources initially allocated, such as CPU, memory, and network capacity.
  • They require considerable time from Data Engineers to develop and from DevOps teams to deploy and manage.
  • They often experience bottlenecks, such as network limitations or resource contention, especially during heavy ETL periods when data is processed in batches.

To solve these challenges, we at T1A Engineering collaborated with our customer in the Media domain to create a Cloud File Ingestion Framework, also called the File Transfer-Transformation Framework (referred to throughout this article as the Cloud File Transfer Framework). This framework uses serverless technologies, enabling pipelines to run automatically and on demand. As a result, you can expect high performance without the headaches of resource limitations, along with:

  • Built-in automatic monitoring for smooth operations
  • Rerun-ability: the ability to restart from the point of failure
  • Streamlined credential management for enhanced security
  • Empowerment of power users, who can now load their data without developing and maintaining custom pipelines

Challenges

File ingestion comes with inherent issues. First of all, it is file ingestion, not record, table, or object ingestion. In essence, a file is a series of bytes in some format that we copy to the destination, so we need a robust mechanism to transfer files over multiple protocols and to extract data from files in various formats. The next challenge is network protocol and authentication: how do we get the data over the network? Last but not least, how do we allocate resources to run the integration? There are many more challenges in such a pipeline, but I will focus mainly on these three.

“File Challenge”

Nowadays, files and blobs are the new black; just look at the popularity of Data Lakes, Lakehouses, and Mesh architectures. Often there is no need to move from files to a relational model (loading data into tables). However, not all file formats are suitable for consumption and storage, so we decided the service should be able to save files and convert them into an appropriate, efficient format. We also considered integration with relational databases, mostly via external tables or bulk copy commands. To sum up:

  • Use file storage as the main storage for moved or ingested data
  • Add the ability to reformat files into a more efficient file type for consumption
  • Expose data to users and downstream pipelines via external tables or bulk copy operations into Lake House or Data Marts

“Protocol and Network Challenge”

If we are building a scalable solution, how will we copy hundreds of gigabytes of data daily? We cannot allocate a single server and expect it to perform well in all cases: there are questions of concurrency, network interface availability, and server allocation itself (if we use SPOT instances to save money, AWS does not guarantee that a SPOT instance will be available on request). So the obvious answer: we decided to run the workload on serverless services:

  • AWS Lambda
  • AWS ECS on Fargate

In our opinion, Serverless wins for the Cloud File Transfer Framework because:

  • No upfront costs or commitments
  • Practically unlimited scalability, bounded only by your Cloud account limits
  • Guaranteed resource availability: capacity is available whenever you need it

“Resource Challenge”

The resource challenge is closely related to performance: if you allocate more resources to a task, in theory it should finish sooner (clearly, this depends on how well the task scales). Therefore, we decided to split every integration request into the smallest possible unit of work: a single file. To download one file, we invoke a Lambda function or a container exactly once.

  • If the download takes less than 10 minutes, we run it in AWS Lambda
  • If the download takes more than 10 minutes, we run it in AWS ECS on Fargate (the serverless container service)
  • We run one task per file, even if it is a one-byte file. In this way, we parallelize and scale the service down to the minimum unit of work, which is one file (see the dispatch sketch after this list)
  • This approach gives us almost linear scalability in the range of 10 to 1,000 concurrent download sessions.
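
The dispatch logic described above is small enough to sketch. Below is a minimal, hypothetical Python (boto3) example of the per-file fan-out; the function, cluster, and task definition names, the subnet, and the task payload shape are placeholders for illustration, not the actual implementation.

import json

import boto3

lambda_client = boto3.client("lambda")
ecs_client = boto3.client("ecs")

# Downloads expected to finish within 10 minutes go to Lambda; the rest go to Fargate.
LAMBDA_THRESHOLD_SECONDS = 10 * 60


def dispatch_download(file_task: dict) -> None:
    """Fan out one download task per file: fast downloads run in Lambda,
    long-running ones in an ECS Fargate task that packages the same code."""
    if file_task["estimated_seconds"] <= LAMBDA_THRESHOLD_SECONDS:
        # Asynchronous invocation: one Lambda execution per file.
        lambda_client.invoke(
            FunctionName="fif-download-worker",      # hypothetical function name
            InvocationType="Event",
            Payload=json.dumps(file_task).encode("utf-8"),
        )
    else:
        # The same worker code, packaged as a container and run as a one-off Fargate task.
        ecs_client.run_task(
            cluster="fif-cluster",                   # hypothetical cluster name
            launchType="FARGATE",
            taskDefinition="fif-download-worker",    # hypothetical task definition
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-0123456789abcdef0"],
                    "assignPublicIp": "ENABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": "downloader",
                        "environment": [
                            {"name": "FILE_TASK", "value": json.dumps(file_task)}
                        ],
                    }
                ]
            },
        )


# Example: dispatch_download({"url": "ftp://partner.example.com/export.csv",
#                             "target": "s3://landing-bucket/partner/",
#                             "estimated_seconds": 120})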

You can read more about the details of such an implementation here.

How it works

AWS Serverless services provide ephemeral servers, or serverless computing power, defined through code and provisioned at run time to the required scale. For example, to download 50,000 files, data engineers would typically run 5, 10, or 20 sessions that download files simultaneously. By tapping into the practically bottomless resources available to AWS customers, with infrastructure as code, T1A's Data Team has been able to run 100 or even 500 parallel processes that download files and store them in an S3 bucket.

We designed a Cloud Transfer Framework using AWS Serverless with three main modules in mind:

  • Control flow: manages the ingestion process flow and the status of each step
  • Execution and data: sets up the infrastructure that performs the data movement
  • Configuration, metadata, credentials, and privilege management: stores credentials and addresses and manages privileges (mainly who can access what and how tables are mapped), and makes the ingested data available in the catalog to a wide range of users

To build each of these modules, T1A and the Media customer used the following AWS serverless services (a minimal sketch of how the pieces fit together follows the lists below):

Control flow

  • AWS Step Function: It is the centerpiece and the top level of the framework. It manages dependencies, step-by-step execution, and notifications. It can also be used as a user interface for data analysts to integrate third-party data.
  • AWS S3: Stores the raw and processed data
  • AWS Systems Manager Parameter Store: Stores the framework configuration files and credential aliases.

Execution and data management

  • AWS Elastic Container Service on AWS Fargate: Runs code in containers on serverless infrastructure, while Fargate handles dynamic resource allocation per task.
  • AWS Lambda: Runs code in the cloud and dynamically allocates resources, including the network capacity to download files.
  • Databricks Job Clusters or AWS EMR: Engines that run Spark jobs on demand, using ephemeral (job) clusters to process data, convert it into a suitable, consumable format, or load it into the target destination. We mainly use pySpark for this task.

Metadata and privilege management

  • Databricks Unity Catalog or AWS Glue Catalog: A central cloud data catalog that allows centralized schema management.
  • Databricks Unity Catalog or AWS Lake Formation: A centralized method for managing data lake access on the database/table level.
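
To make these modules concrete, here is a hedged, minimal sketch of how the control-flow layer could be wired up as a Step Functions state machine using Python and boto3. The state names, function and task definition names, ARNs, subnet, and IAM role are illustrative assumptions, not the customer's actual definition.

import json

import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: Download -> Transform -> Load.
definition = {
    "Comment": "Sketch of a Cloud File Transfer Framework control flow",
    "StartAt": "Download",
    "States": {
        "Download": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "fif-download-worker",  # hypothetical Lambda
                "Payload.$": "$"  # pass the user config through
            },
            "Next": "Transform"
        },
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
                "FunctionName": "fif-submit-spark-job",  # hypothetical: submits a Databricks/EMR job
                "Payload.$": "$"
            },
            "Next": "Load"
        },
        "Load": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "Cluster": "fif-cluster",          # hypothetical ECS cluster
                "TaskDefinition": "fif-db-loader", # hypothetical task definition
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED"
                    }
                }
            },
            "End": True
        }
    }
}

sfn.create_state_machine(
    name="cloud-file-transfer-framework",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/fif-step-function-role",  # placeholder role
)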

It is easier to see something once than to read about it a hundred times:

Let’s dive into the architecture bit by bit:

  • On the left, you see a variety of data sources that store files and allow access to these files.
  • On the right are the data consumers: usually a BI tool, but they can also be any SQL-compatible service or even other ETL processes downstream in the data pipeline.

In the center, we have two layers.

  • The top layer (Execution Control) is responsible for managing the data flow;
  • the bottom layer contains the execution services, where the work is done: downloading and repackaging the data, loading it into the Lakehouse, and finally ingesting it into the target Data Marts.

T1A chose AWS Step Functions as the service entry point and for control flow because it is an ideal tool for step-by-step execution and has deep native integration with a wide range of AWS services. The diagram below illustrates an example of a Step Function execution that starts by pulling data from a third party, processes it with Amazon Elastic MapReduce (EMR), and then loads it into Redshift. Keep in mind that in this particular instance of the Step Function we used AWS EMR as the pySpark engine; it can be replaced with Databricks loading data into the Lakehouse.

You can read more about Step Function here

In the Framework, the Step Function serves not only as a workflow service but also as a user interface. Users can open the AWS console, open the Step Function, specify a config for the Cloud File Transfer Framework, and run it with that config. In this way, we separate the privileges under which the Step Function runs from the privileges of the caller: we grant a specific user the privilege to invoke the Step Function, and that user can then ingest data into specific systems. If required, we can create multiple Step Function instances for various sources (for example, if different departments run those Step Functions).
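
For example, an analyst with nothing more than permission to start this specific state machine could launch an ingestion like the sketch below. This is a hedged illustration assuming boto3 and a placeholder state machine ARN and config path; the real input format is defined by the framework configuration described later in this article.

import json

import boto3

sfn = boto3.client("stepfunctions")

# The analyst's config (or a pointer to it in S3) is passed as the execution input.
execution_input = {
    "configS3Path": "s3://system-config-bucket/cloud-file-transfer-config/my-source.yaml"  # placeholder
}

sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:cloud-file-transfer-framework",  # placeholder
    name="adhoc-partner-load-001",  # optional; must be unique per execution
    input=json.dumps(execution_input),
)

The caller only needs the states:StartExecution permission on this state machine; the privileges to call Lambda, ECS, Databricks, and the target databases belong to the state machine's own execution role.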

An alternative solution would require:

  • Workflow management and scheduling (an ETL Orchestration or other ETL tool job?)
  • A user interface to interact with the service (a table with the list of tasks?)
  • Quite a lot of programming to integrate AWS Services with Workflow management and Privilege management

As you can see, Step Function solved all these challenges for us out of the box.

You can find an example of Step Function flow during execution below:

The Cloud File Transfer Framework configuration is quite straightforward:

  • It is a text file in JSON or YAML format
  • It specifies which service you want to run
  • It contains a configuration for each service that you want to run, including:
      • An alias for the authentication parameters (user and password for FTP or SSH, stored in AWS Systems Manager Parameter Store)
      • The source and target paths: where to pick up files, where to store them, or where to ingest the data
      • How to extract dates from file names or file paths, or whether to use the current date for processing

The AWS Step Function instance of Cloud File Transfer Framework contains the following steps:

Download: Get files from a third party and save them to the AWS S3 bucket:

  • During this step, the Cloud File Transfer Framework decides whether to download files using Lambda or a Docker container. The service uses Lambda only for files that can be downloaded in 10 minutes or less; in other cases, it kicks off a container via the ECS service (running the same code). You can read about Lambda and its limitations here.
  • In addition to downloading, this step can repair external tables (register new partitions in external tables), making the ingested data available to end users via Redshift or Athena tables (a minimal sketch of this partition-repair call follows below).
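
As an illustration of the partition-repair part of this step, the call could look roughly like the following. This is a hedged sketch using the Athena API via boto3; the database, table, and query-results location are placeholder names, and the actual framework may register partitions differently.

import boto3

athena = boto3.client("athena")

# Register any newly landed S3 partitions so the ingested data is immediately queryable.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE raw_partner_files",   # hypothetical external table
    QueryExecutionContext={"Database": "landing_zone"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/fif/"},
)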

Convert/Transform: Run Databricks or EMR clusters to convert data to a different format or extract additional fields from a complex structure.

  • During this step, the Framework calls the Databricks service to create a job cluster and passes it the parameters and the script that converts the data from text or Avro into a format appropriate for the Data Lake (usually Parquet with high compression, or Delta format). After execution, the Databricks or EMR cluster is automatically destroyed. You can read about EMR here. A minimal pySpark sketch of such a conversion follows below.
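
The conversion job itself is ordinary pySpark. Below is a minimal sketch under assumed paths and options (the real job receives them from the framework configuration); Parquet is shown, with Delta as a drop-in alternative on Databricks.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fif-convert").getOrCreate()

# Read the raw text/CSV files landed by the download step (placeholder path).
df = (
    spark.read
    .option("header", "true")
    .csv("s3://landing-bucket/partner/2024-01-01/")
)

# Write a compressed, analytics-friendly copy (placeholder target path).
(
    df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("s3://processed-bucket/partner/2024-01-01/")
)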

Database file copy: Load data into the target database as needed.

  • During this step, the framework calls an ECS container that invokes COPY commands, Create External Table or Create Athena Table statements, or loads data into Snowflake. We run these commands to load data from S3 and register metadata, and we load data into multiple targets, such as Databricks, Snowflake, or RDS (a sketch of a Redshift load follows below).
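
For the Redshift case, the load boils down to a COPY statement issued against the cluster. The sketch below uses the Redshift Data API via boto3; the cluster, database, secret, table, S3 path, and IAM role are placeholder assumptions rather than the framework's actual parameters.

import boto3

redshift_data = boto3.client("redshift-data")

# Bulk-load the converted Parquet files from S3 into the target table.
copy_sql = """
    COPY analytics.partner_events
    FROM 's3://processed-bucket/partner/2024-01-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="analytics",                   # hypothetical database
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-loader",  # placeholder
    Sql=copy_sql,
)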

In the illustrated example, the Step Function calls ECS containers that execute the orchestration logic: they process and validate the workflow configuration and decide which other services to call and how to process the data. At the end of the execution, the ECS containers send the status and metadata back to the Step Function, making them available to the user in the interface. The service downloads hundreds of files and saves them to S3, then registers new partitions for the tables so users can access the downloaded data.

Practical Daily Use by Analysts

Let's consider the Cloud File Transfer Framework from a Data Analyst's perspective. Analysts are tasked with extracting insights from multiple data sources and synthesizing this information to drive business decisions. Typically, this involves waiting for Data Engineers to set up complex pipelines to bring in third-party data, and this traditional model creates a dependency that can slow down the analytical process.

The Self-Service Paradigm

Our Cloud File Transfer Framework flips this model on its head by enabling analysts to ingest data on their own, bypassing the lengthy pipeline development process. Here’s how:

  1. Ease of Configuration: Analysts need only create a simple configuration file, specifying details like source file location, an alias for credentials stored in AWS System Parameters, and other settings like target schema, table name, and load type (append, reload, increment).
  2. User-Friendly Workflow: After setting up the configuration file, running the framework is as simple as a few clicks within the AWS console. This user interface is intuitive even for those not deeply familiar with cloud services.
  3. Flexibility: Whether the data is in an Excel file on a remote FTP server or a continuously updating CSV on an API, the framework is capable of ingesting it. This makes the tool versatile and applicable to a range of data sources.

Example of a YAML configuration file for running the Cloud File Ingestion Framework:

# Cloud File Transfer Framework Configuration
version: "1.0"

# Control Flow Parameters
controlFlow:
  notificationEmail: "[email protected]"

# Execution and Data Management Settings
execution:
  databricksJobClusterConfig:
    clusterId: "databricks-cluster-id"
    libraryPaths: 
      - "/dbfs/mnt/library/"

# Metadata and Privilege Management
metadata:
  catalogType: "DatabricksUnityCatalog"

# Service Configuration
services:
  - serviceName: "FileDownloadService"
    authAlias: "s3-credentials-for-3rd-party"
    sourcePath: "s3://source-bucket/folder/"
    targetPath: "s3://target-bucket/processed-data/"
    timeout: 600  # in seconds
    concurrency: 500

  - serviceName: "DataTransformationService"
    databricksJobName: "TransformJob"
    sourceFormat: "text"
    targetFormat: "parquet"
    dateExtraction:
      method: "fileName"
      format: "YYYY-MM-DD"

  - serviceName: "DataLoadService"
    targetDatabase: "RedshiftDB"
    targetTable: "dataTable"
    copyCommandParameters:
      - "gzip"
      - "REMOVEQUOTES"


# Example Usage of S3 for AWS System Configuration
systemConfigS3Path: "s3://system-config-bucket/cloud-file-transfer-config/"        

Democratizing Data Across Teams

The framework not only empowers individual analysts but also fosters an environment of data democratization within the organization. What does this mean?

  1. Reduced Bottlenecks: Since analysts can ingest data themselves, it frees up Data Engineering resources, allowing them to focus on more complex tasks.
  2. Fast-Track to Insights: The self-service nature of this framework expedites the data-to-insights timeline. What earlier took days can now be achieved in a matter of hours.
  3. Data Governance and Security: Despite its ease of use, the framework doesn’t compromise on security. All credentials are stored securely in AWS System Parameters, and the use of aliasing ensures that sensitive information is never exposed.
  4. Collaboration: By allowing analysts to share data more easily, the framework facilitates better collaboration between departments. This integrated approach can often result in richer insights derived from a more diverse dataset.

Performance and Scalability at a Glance

The performance of the Cloud File Transfer Framework made me proud of the T1A team. We achieved outstanding performance thanks to the architectural approach and its implementation, and we paid particular attention to the service's ability to scale using cloud resources. There are two levels of scalability:

  1. The number of Cloud File Transfer Framework calls: each Step Function call creates its own execution context to complete its task. We can easily run anywhere from 10 to 100 instances concurrently, and this worked without issues.
  2. The number of concurrent jobs running in parallel within one Cloud File Transfer Framework call, specifically for downloading data (how many Lambda functions download files). This varies from 10 to 1,000 parallel download instances.

Let us share some of the numbers we achieved during Framework operations.

  1. Comparable processes on a cloud-based ETL server deployed on AWS EC2 run for 6 to 12 hours, whereas the Cloud File Transfer Framework completes them in minutes at no additional cost.
  2. The framework regularly downloads over 1,000 files from third parties simultaneously.
  3. The transfer rate from third parties regularly reaches 300 GB/minute.
  4. The data transformation rate, including recompression and format changes, is over 60 GB/minute.
  5. If necessary, the Cloud File Transfer Framework can upload data to customer S3 buckets at a rate of 500 TB per day.

Lessons Learned

Even though Media Customer and T1A had a clear vision of what needed to be implemented with Cloud File Transfer Framework, we still faced some hurdles and learned valuable lessons along the way:

  1. Lambda is best suited for downloading S3-to-S3 data, or data from another cloud blob/object store to S3, especially if the source file is small enough to be downloaded within the Lambda time limit (15 minutes or less). Another benefit of Lambda is that it allows almost limitless horizontal scalability.
  2. ECS containers running on Fargate are great for long-running download requests, for example from FTPS or SSH sources.
  3. Use a framework-managed schema for intermediate tables in Databricks and Snowflake so that data analysts can truncate the tables and reload data if necessary.
  4. CI/CD deployment: never place frequently changed configs under CI/CD; keep configuration and code separate in CI/CD.
  5. Regression testing: always have test calls for each type of integration, protocol, and logic. It will save you many, many hours during production use.
  6. Separate the configs that users run ad hoc from the configs that run on a daily schedule. Users can make mistakes and replace the wrong config, so keep access to scheduled production configs and ad-hoc user configs separate.

Implemented by the Media customer and T1A, the Cloud File Transfer Framework handles hundreds of thousands of files and terabytes of data in minutes, without anyone having to worry about scale and resources. Relying on this robust framework, data analysts now pull data from sources independently and use it without delay to feed the state-of-the-art recommendation engines that support the customer's retention and acquisition strategies.

Opportunities

We continue improving the framework; for example, we introduced a couple of new features:

  • Automatic file schema recognition, meaning that the Framework creates tables in the target database based on the file schema
  • User “allow lists”, meaning a short config file that specifies which schemas each user can access. This lets us grant extra privileges to the Framework itself while limiting the permissions of the users who call it (a minimal sketch of such a check follows this list)
  • Integration with more cloud services and APIs to get files from
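
As a hedged illustration of the allow-list idea, the check could be as simple as the following Python sketch; the YAML layout, user names, and schema names are assumptions for illustration, not the framework's actual format.

import yaml  # PyYAML

# Hypothetical allow-list: which target schemas each user may ingest into.
ALLOW_LIST = yaml.safe_load("""
analyst_alice:
  - sandbox_media
  - partner_staging
analyst_bob:
  - sandbox_media
""")


def is_allowed(user: str, target_schema: str) -> bool:
    """Return True if the user may ingest into the requested schema.
    The framework itself runs with broader privileges, so it performs
    this check before acting on a user's config."""
    return target_schema in ALLOW_LIST.get(user, [])


# Example: is_allowed("analyst_alice", "partner_staging") -> True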

Here are a couple of ideas for future improvements:

  • Implementation for GCP and Azure deployment (reuse the same code with a small modification to deploy it in other Public Clouds)
  • Extended exception handling for typical errors on 3rd party side
  • Adding a new step to the Framework, for example, data processing and transformation
  • Adding new data transformation engines

Conclusion

The article outlines the development and implementation of a Cloud File Transfer Framework designed to simplify and automate data ingestion into a Data Lake or Data Repository. It highlights the need for a more efficient and scalable system to manage new data integration demands, such as importing files or integrating external data sources. Traditional ETL methods face challenges like resource limitations and require significant development and maintenance time. The framework uses serverless technologies like AWS Lambda and AWS ECS on Fargate to overcome these issues.

The framework is modular, comprising three main modules: Control flow, Execution and data management, and Metadata and privilege management. AWS Step Function is highlighted as a crucial tool for managing workflow and also serving as a user interface. The framework is configured through simple text files, making it user-friendly.

By leveraging serverless architecture, the framework offers high performance, automatic monitoring, and simplified credential management. It empowers analysts and other power users to easily load their own data, thereby democratizing data access and integration across the organization.

In summary, the self-service nature of the framework transforms analysts from being passive recipients of data to active participants in the data integration process. This shift towards self-sufficiency doesn’t just make life easier for analysts; it makes the entire organization more agile, more secure, and ultimately, more data-driven.

Authors: Kirill Chufarov, Boris Vasilev.

