High Scale Ingestion Meets Big Data Analytics
Why are we the “Easy Button” for data ingestion?
Solving a problem that makes a customer happy is the goal of every business. So when a customer described us as the “easy button” for data ingestion, we fulfilled our mission of removing the complexity of building and orchestrating pipelines. The term “easy button” stuck, and it’s a principle we continuously work towards, even in our partnerships.
Our core feature might be data ingestion, but we don’t build in isolation. Data ingestion is one part of the process of taking your raw data from the source and converting it into actionable insights that end users find valuable. By partnering with ClickHouse, we combine complementary features to ensure your dashboards and reports serve fresh, high-quality data.
Lightning fast analysis
ClickHouse is an open-source OLAP database management system that enables you to generate reports using SQL queries in real time. Leveraging a column-oriented architecture, ClickHouse can query billions of rows of data and return results in milliseconds.
To obtain this level of performance, you simply load data into one of ClickHouse’s specialized table engines. Upsolver’s ClickHouse connector makes it easy to ingest your high-volume data into these specialized tables in near real time.
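For illustration, here’s a minimal sketch of what such a target table might look like in ClickHouse, using the widely used MergeTree engine (the database, table, and column names here are hypothetical):

-- Hypothetical ClickHouse target table using the MergeTree engine
CREATE TABLE sales_db.orders_tbl
(
    orderid      String,
    order_date   DateTime,
    hashed_email String,
    net_total    Float64
)
ENGINE = MergeTree
ORDER BY (order_date, orderid);

The ORDER BY clause defines the sorting key that MergeTree uses to physically organize and index the data on disk, which is a big part of where the query speed comes from.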
Our no-code Wizard guides you through the steps of connecting to your source and target to create an ingestion job.
Optionally, you can load data into your data lake and use query federation via the ClickHouse native S3 integration to analyze your data. However, you won’t see the performance gains of querying a ClickHouse table that is finely tuned for speed. If you want blazing-fast results, the best place to analyze your data is directly in ClickHouse - and who wouldn't?!
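That said, for the occasions when querying in place does make sense, a federated query over Parquet files in the lake might look like this rough sketch using ClickHouse’s s3 table function (the bucket, path, and column names are hypothetical):

-- Query Parquet files in S3 directly via ClickHouse's s3 table function
SELECT orderid, net_total
FROM s3('https://my-bucket.s3.amazonaws.com/lake/orders/*.parquet', 'Parquet')
WHERE net_total > 100;

Every such query reads from S3 over the network, which is exactly why the native-table route above remains the faster option.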
Roll your own
Upsolver’s data ingestion Wizard enables you to be as hands-off as you want when creating an ingestion job. Alternatively, if you’re familiar with SQL, you can roll your own and build more advanced jobs, such as the one below:
CREATE SYNC JOB load_kinesis_to_clickhouse
    COMMENT = 'Ingest sales orders from Kinesis to ClickHouse'
    START_FROM = BEGINNING
    CONTENT_TYPE = AUTO
    EXCLUDE_COLUMNS = ('customer.password')
    COLUMN_TRANSFORMATIONS = (hashed_email = MD5(customer.email))
    DEDUPLICATE_WITH = (COLUMNS = (orderid), WINDOW = 4 HOURS)
    COMMIT_INTERVAL = 5 MINUTES
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO CLICKHOUSE my_clickhouse_connection.sales_db.orders_tbl
WITH EXPECTATION exp_custid_not_null
    EXPECT customer.customer_id IS NOT NULL ON VIOLATION DROP
WITH EXPECTATION exp_taxrate
    EXPECT taxrate = 0.12 ON VIOLATION WARN;
As you can see, we support highly configurable options for customizing your jobs. You can define the exact starting point from which to load events (START_FROM), and specify how frequently to write events to ClickHouse (COMMIT_INTERVAL).
Personally identifiable information (PII) is easy to manage using options to prevent selected columns from being written to the target (EXCLUDE_COLUMNS) and mask values (COLUMN_TRANSFORMATIONS) to protect sensitive data.
Let’s not ignore the ability to dedupe data in flight (DEDUPLICATE_WITH) to prevent duplicate rows from polluting your reports.
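To illustrate how these options combine, here’s a hypothetical, stripped-down variant of the job above that replays from a fixed point in time and commits more frequently (we’re assuming here that START_FROM accepts an explicit timestamp; check the Upsolver documentation for the exact syntax your version supports):

-- Hypothetical variant: replay from a fixed timestamp, commit every minute
CREATE SYNC JOB load_recent_orders
    START_FROM = TIMESTAMP '2024-01-01 00:00:00'
    COMMIT_INTERVAL = 1 MINUTE
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO CLICKHOUSE my_clickhouse_connection.sales_db.orders_tbl;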
Quality in, quality out
Use expectations to prevent bad data from reaching your dashboards and reports, where spurious results would cause your consumers to mistrust the data.
As well as protecting and deduplicating your data, you can check each row against a predefined predicate (WITH EXPECTATION) and either drop the row from the pipeline or raise a warning.
Using Upsolver’s supported functions and operators, you can build any number of predicates to check each row and define how mismatched rows should be managed. Upsolver handles the rows based on your definition while maintaining a tally of rows that fail to meet the expectation so you can measure the overall quality of your data.
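Expectations don’t have to be baked in at creation time, either. Following the pattern above, attaching a new expectation to an existing job might look like this sketch (the ALTER JOB syntax here is assumed; verify it against Upsolver’s documentation):

-- Sketch: attach an additional expectation to the job created earlier
ALTER JOB load_kinesis_to_clickhouse
    ADD EXPECTATION exp_taxrate_non_negative
    EXPECT taxrate >= 0 ON VIOLATION WARN;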
Do it your way
The above example loads the source data directly from Amazon Kinesis into ClickHouse. Another option would be to retain a cold store of data in your lakehouse using an ingestion job to load the events into Apache Iceberg tables. You could then load your hot data into ClickHouse using a transformation job, selecting only the relevant data from Amazon S3:
CREATE SYNC JOB transform_and_load_to_clickhouse
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
    RUN_PARALLELISM = 20
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl
MAP_COLUMNS_BY_NAME
SELECT
    orders,
    MD5(customer_email) AS customer_id,
    ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
FROM default_glue_catalog.lake_db.raw_order_events
WHERE TIME_FILTER();
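For completeness, the cold-store step mentioned above - landing the raw events in the lake before any transformation - might look something like this sketch (it assumes raw_order_events was created as an Iceberg table in your Glue catalog):

-- Sketch: ingest raw events into an Iceberg table in the lake (cold store)
CREATE SYNC JOB load_kinesis_to_lake
    START_FROM = BEGINNING
    CONTENT_TYPE = AUTO
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO default_glue_catalog.lake_db.raw_order_events;

With the raw history retained in Iceberg, the transformation job above can cherry-pick only the hot data that ClickHouse needs.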
Upsolver includes an extensive library of functions and operators that you can use to enhance your jobs, ensuring you not only load quality data into your target but also include the right columns and types.
Observability as standard
Job monitoring and data observability are integral to the Upsolver platform, so there’s no need for you to spend time and money building custom reports. Our real-time metrics provide everything you need to troubleshoot issues with your jobs and discover quality issues within your data.
Use the Datasets feature to check column uniqueness and density, and view the number of rows not meeting expectations. It is easy to find the first and last time data was seen in each column, alongside its minimum and maximum values.
By monitoring the processing volume of data, you can easily spot spikes or dips in the number of events, and zoom in on dates where volume is questionable.
Summary
As you can see, Upsolver offers frictionless integration with ClickHouse: you can ingest data into your data lake or lakehouse and then transform and load it into ClickHouse, or create a job that loads events directly into your ClickHouse tables.
Whichever way you choose, you will enjoy near real-time analysis, a high level of data quality, and built-in observability. Upsolver also manages schema evolution, and guarantees strongly ordered, exactly-once delivery of your data.
To see how easy it is to do this yourself, check out our guide on how to ingest your data into ClickHouse.
Please reach out if you have any questions, or schedule a no-obligation demo to see Upsolver in action.