High Scale Ingestion Meets Big Data Analytics
Why are we the “Easy Button” for data ingestion?
Solving a problem that makes a customer happy is the goal of every business. So when a customer described us as the “easy button” for data ingestion, we fulfilled our mission of removing the complexity of building and orchestrating pipelines. The term “easy button” stuck, and it’s a principle we continuously work towards, even in our partnerships.
Our core feature might be data ingestion, but we don’t build in isolation. Data ingestion is one part of the process of taking your raw data from the source and converting it into actionable insights that end users find valuable. By partnering with ClickHouse, we combine complementary features to ensure your dashboards and reports serve fresh, high-quality data.
Lightning fast analysis
ClickHouse is an open-source OLAP database management system that enables you to generate reports using SQL queries in real time. Leveraging a column-oriented architecture, ClickHouse can query billions of rows of data and return results in milliseconds.
To obtain this level of performance, you simply load data into one of ClickHouse’s specialized table engines. Upsolver’s ClickHouse connector makes it easy to ingest your high-volume data into these specialized tables in near real time.
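For illustration, here’s a minimal sketch of what such a target table might look like in ClickHouse, using the widely used MergeTree engine (the database, table, and column names here are hypothetical):

-- Hypothetical ClickHouse target table using the MergeTree engine
CREATE TABLE sales_db.orders_tbl
(
    orderid      String,
    order_date   DateTime,
    hashed_email String,
    net_total    Float64
)
ENGINE = MergeTree
ORDER BY (order_date, orderid);

The ORDER BY clause defines the sorting key that MergeTree uses to physically organize and index the data on disk, which is a big part of where the query speed comes from.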
Our no-code Wizard guides you through the steps of connecting to your source and target to create an ingestion job.
Optionally, you can load data into your data lake and use query federation via the ClickHouse native S3 integration to analyze your data. However, you won’t see the performance gains of querying a ClickHouse table that is finely tuned for speed. If you want blazing-fast results, the best place to analyze your data is directly in ClickHouse - and who wouldn't?!
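That said, for the occasions when querying in place does make sense, a federated query over Parquet files in the lake might look like this rough sketch using ClickHouse’s s3 table function (the bucket, path, and column names are hypothetical):

-- Query Parquet files in S3 directly via ClickHouse's s3 table function
SELECT orderid, net_total
FROM s3('https://my-bucket.s3.amazonaws.com/lake/orders/*.parquet', 'Parquet')
WHERE net_total > 100;

Every such query reads from S3 over the network, which is exactly why the native-table route above remains the faster option.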
Roll your own
Upsolver’s data ingestion Wizard enables you to be as hands-off as you want when creating an ingestion job. Alternatively, if you’re familiar with SQL, you can roll your own and build more advanced jobs, such as the one below:
CREATE SYNC JOB load_kinesis_to_clickhouse
    COMMENT = 'Ingest sales orders from Kinesis to ClickHouse'
    START_FROM = BEGINNING
    CONTENT_TYPE = AUTO
    EXCLUDE_COLUMNS = ('customer.password')
    COLUMN_TRANSFORMATIONS = (hashed_email = MD5(customer.email))
    DEDUPLICATE_WITH = (COLUMNS = (orderid), WINDOW = 4 HOURS)
    COMMIT_INTERVAL = 5 MINUTES
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO CLICKHOUSE my_clickhouse_connection.sales_db.orders_tbl
WITH EXPECTATION exp_custid_not_null
    EXPECT customer.customer_id IS NOT NULL ON VIOLATION DROP
WITH EXPECTATION exp_taxrate
    EXPECT taxrate = 0.12 ON VIOLATION WARN;
As you can see, we support highly configurable options for customizing your jobs. You can define the exact starting point from which to load events (START_FROM), and specify how frequently to write events to ClickHouse (COMMIT_INTERVAL).
Personally identifiable information (PII) is easy to manage using options to prevent selected columns from being written to the target (EXCLUDE_COLUMNS) and mask values (COLUMN_TRANSFORMATIONS) to protect sensitive data.
Let’s not ignore the ability to dedupe data in flight (DEDUPLICATE_WITH) to prevent duplicate rows from polluting your reports.
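To illustrate how these options combine, here’s a hypothetical, stripped-down variant of the job above that replays from a fixed point in time and commits more frequently (we’re assuming here that START_FROM accepts an explicit timestamp; check the Upsolver documentation for the exact syntax your version supports):

-- Hypothetical variant: replay from a fixed timestamp, commit every minute
CREATE SYNC JOB load_recent_orders
    START_FROM = TIMESTAMP '2024-01-01 00:00:00'
    COMMIT_INTERVAL = 1 MINUTE
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO CLICKHOUSE my_clickhouse_connection.sales_db.orders_tbl;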
Quality in, quality out
Use expectations to prevent bad data from reaching your dashboards and reports, where spurious results would cause your consumers to mistrust the data.
As well as protecting and deduplicating your data, you can check each row against a predefined predicate (WITH EXPECTATION) and either drop the row from the pipeline or raise a warning.
Using Upsolver’s supported functions and operators, you can build any number of predicates to check each row and define how mismatched rows should be managed. Upsolver handles the rows based on your definition while maintaining a tally of rows that fail to meet the expectation so you can measure the overall quality of your data.
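Expectations don’t have to be baked in at creation time, either. Following the pattern above, attaching a new expectation to an existing job might look like this sketch (the ALTER JOB syntax here is assumed; verify it against Upsolver’s documentation):

-- Sketch: attach an additional expectation to the job created earlier
ALTER JOB load_kinesis_to_clickhouse
    ADD EXPECTATION exp_taxrate_non_negative
    EXPECT taxrate >= 0 ON VIOLATION WARN;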
Do it your way
The above example loads the source data directly from Amazon Kinesis into ClickHouse. Another option would be to retain a cold store of data in your lakehouse using an ingestion job to load the events into Apache Iceberg tables. You could then load your hot data into ClickHouse using a transformation job, selecting only the relevant data from Amazon S3:
CREATE SYNC JOB transform_and_load_to_clickhouse
    START_FROM = BEGINNING
    RUN_INTERVAL = 1 MINUTE
    RUN_PARALLELISM = 20
AS INSERT INTO CLICKHOUSE my_clickhouse_connection.sales_db.target_tbl
MAP_COLUMNS_BY_NAME
SELECT
    orders,
    MD5(customer_email) AS customer_id,
    ARRAY_SUM(data.items[].quantity * data.items[].price) AS net_total
FROM default_glue_catalog.lake_db.raw_order_events
WHERE TIME_FILTER();
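For completeness, the cold-store step mentioned above - landing the raw events in the lake before any transformation - might look something like this sketch (it assumes raw_order_events was created as an Iceberg table in your Glue catalog):

-- Sketch: ingest raw events into an Iceberg table in the lake (cold store)
CREATE SYNC JOB load_kinesis_to_lake
    START_FROM = BEGINNING
    CONTENT_TYPE = AUTO
AS COPY FROM KINESIS my_kinesis_connection
    STREAM = 'orders'
INTO default_glue_catalog.lake_db.raw_order_events;

With the raw history retained in Iceberg, the transformation job above can cherry-pick only the hot data that ClickHouse needs.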
Upsolver includes an extensive library of functions and operators that you can use to enhance your jobs, ensuring you not only load quality data into your target but also include the right columns and types.
Observability as standard
Job monitoring and data observability are integral to the Upsolver platform, so there’s no need for you to spend time and money building custom reports. Our real-time metrics provide everything you need to troubleshoot issues with your jobs and discover quality issues within your data.
Use the Datasets feature to check column uniqueness and density, and view the number of rows not meeting expectations. It is easy to find the first and last time data was seen in each column, alongside its minimum and maximum values.
By monitoring the processing volume of data, you can easily spot spikes or dips in the number of events, and zoom in on dates where volume is questionable.
Summary
As you can see, Upsolver offers frictionless integration with ClickHouse: you can ingest data into your data lake or lakehouse and then transform and load it into ClickHouse, or create a job that loads events directly into your ClickHouse tables.
Whichever way you choose, you will enjoy near real-time analysis, a high level of data quality, and built-in observability. Upsolver also manages schema evolution, and guarantees strongly ordered, exactly-once delivery of your data.
To see how easy it is to do this yourself, check out our guide on how to ingest your data into ClickHouse.
Please reach out if you have any questions, or schedule a no-obligation demo to see Upsolver in action.