Query Sitecore CDP Parquet Export
SQL! A blast from the past. I remember writing thousands of lines of T-SQL to process business logic, integrate systems, and generate custom reports. Here we are again, leveraging SQL to gather insights from a treasure trove of engagement data in Sitecore CDP. Many digital marketing roadmaps are prioritizing the ability to capture data, and for those of you ready to act on it, the CDP export could be a big value to you. This article will provide a walkthrough of the tools, developer process, data model, and querying examples. With this information, you can begin to collaborate with your team on how to unlock these insights.
Obtaining Sitecore CDP Data Export
The process is defined at doc.sitecore.com > Sitecore CDP > Perform a data lake export.
Access to the data is requested directly to Sitecore and provided to you in AWS. From here we will fast forward to gathering prerequisite tools and working with the actual files.
Tools Prerequisites
For this demonstration I am using DuckDB, an in-process SQL OLAP database. On the download page I chose Command Line, Windows, 64-bit.
Next, I will be using VSCode to write SQL and send it to the terminal. I also created a keyboard shortcut so that shift+enter will send my SQL to the terminal.
[
? ? {
? ? "key": "shift+enter", ? ? ? ? ?
? ? "command": "workbench.action.terminal.runSelectedText",
? ? } ? ?
[
Preparing Files
In this example I am working locally against the files with the in-memory DuckDB tool, which is an easy way to play around with the data. In a production or shared development scenario, you might load the data in Amazon Redshift, then access it with an ODBC driver. Back to the files - the export is organized in the same way shown in the ERD. This diagram is helpful to look at by the way when you are constructing your queries. Note that the ERD and field descriptions are all available in the link provided at the top of the article. In the image below you can see I dropped the files in respective folders. I also have my duckdb executable and SQL code.
Configure IDE
With the workspace loaded and terminal view shown, I start DuckDB with .\duckdb.exe. From there I can select any SQL I want and press shift+enter to execute the command. I save my .sql file for backup purposes and I could also commit it to a repository.
Load Data into Memory
Using read_parquet to create a table allows the developer to start writing SQL against the data. As shown in the screenshot there are dozens of parts of files. Using the * wildcard loads the entire directory.
领英推荐
Sample Query #1 - Event Count
For our first sample we will start simple and then expand from there. Getting event counts is pretty straightforward in both Sitecore CDP dashboards and Google Analytics. The point of this exercise is to build a foundation of understanding and grow from there.
SELECT events.type, COUNT(events.type)
FROM events
INNER JOIN sessions
ON events.meta_session_ref = sessions.meta_ref
WHERE events.core_point_of_sale = 'mysite'
AND sessions.core_channel = 'WEB'
AND events.meta_created_at >= '2023-03-07 00:00:00:000'
AND events.meta_created_at <'2023-03-14 23:59:59:998'
AND events.type IN ('VIEW', 'CLICK', 'SEARCH', 'FORM_VIEW',
'FORM_ERROR', 'FORM_COMPLETED', 'IDENTITY', 'FILE_DOWNLOAD')
GROUP BY events.type
ORDER BY count(events.type) DESC;
Sample Query #2 - Search Clickthrough
In this example we will execute a multistep query by staging data in a temporary table, then selecting a calculation off that table. In order for this report to work, you must be tracking an event not only for search, but the click of a search result.
CREATE TABLE search_clickthrough(searches INTEGER, search_clicks INTEGER);
INSERT INTO search_clickthrough
? ? SELECT a.searches, b.search_clicks
? ? FROM (SELECT COUNT(sessions.meta_ref) AS searches
? ? ? ? FROM sessions
? ? ? ? INNER JOIN events
? ? ? ? ON sessions.meta_ref = events.meta_session_ref
? ? ? ? WHERE sessions.type = 'WEB'
? ? ? ? AND events.type = 'SEARCH') AS a
? ? INNER JOIN
? ? ? ? (SELECT COUNT(sessions.meta_ref) as search_clicks
? ? ? ? FROM sessions
? ? ? ? INNER JOIN events
? ? ? ? ON sessions.meta_ref = events.meta_session_ref
? ? ? ? WHERE sessions.type = 'WEB'
? ? ? ? AND events.type = 'SEARCH_SELECTION') AS b
? ? ON 1=1;
SELECT CAST((search_clicks / searches) * 100 AS DECIMAL (4,2))
AS clickthroughpct
FROM search_clickthrough;
Sample Query #3 - Experience Characteristics
For the last example we will view the sequence of events for visitors that had sessions over two minutes. We want to get an idea of what our most engaged visitors are doing. Are they taking a long time because content cannot be found or are they just collecting a lot of data?
Collecting browser_ref is the same ID you can query in Sitecore CDP with the "bid: " prefix. In your investigation, you might write queries to pinpoint a certain BID with specific behaviors, then pop back over to Sitecore CDP to view the guest in the various user interfaces. Ordering by browser_ref and created_at basically groups our visitors and shows the events chronologically for each of their sessions.
SELECT e.meta_browser_ref, e.type, e.meta_created_at, e.core_page
FROM events e
INNER JOIN sessions s
ON e.meta_session_ref = s.meta_ref
WHERE e.core_point_of_sale = 'mysite'
AND s.core_duration > 2
AND s.core_channel = 'WEB'
ORDER BY e.meta_browser_ref, e.meta_created_at;
Conclusion
In this article we learned that we can obtain an export of all Sitecore CDP data by making a request to Sitecore. This gives us full access to create whatever insights we want. The options for leveraging the data is extensive: execute a marketing automation based off a calculation, create your own reporting dashboards, alert marketers based on KPI thresholds, or investigate a hypothesis about your visitors' behavior.
For next steps you may consider investigating the JSON columns more closely and create a way to parse and report on that data. Furthermore, this demonstration focused on events and sessions, but you can also join to guests and orders (if applicable to your implementation). I hope these examples will encourage you to capture boatloads of data and write some interesting queries!
Group Technology Director at Horizontal Digital
6 个月Great article - Thanks! When we do a default download data is stored in date folder with following structure : - YYYY-MM-DD - events meta_created_at_date=YYYY-MM-DD you just dumped all those .parquet files in events. guests, sessions folder?
Product lead for Analytics, CDP, Connect, Personalize, Search and Stream @ Sitecore
1 年Great article Walt!