#BigData – Bigger opportunities: Merging disparate datasets for deeper insights
One challenge stands out in the ever-evolving landscape of big data: merging seemingly unrelated datasets to uncover deeper insights. We recently consolidated all our big data environments into BigQuery, which opened new possibilities for merging previously independent datasets. The newly integrated data has already revealed deeper analytical opportunities, highlighting the power and potential of bringing data sources together.
Navigating a No-Key database landscape
Imagine a puzzle where two pieces representing distinct datasets do not fit together. This perfectly describes our initial situation: we had two datasets, one describing our offer (Air Shopping) system and the other describing our order (Air Booking) system. Without a clear connection between them, analysing the data holistically was impossible.
Rather than retreating, we embraced the challenge, seeing the immense potential in merging the datasets. Our first step was to engage our business analysts, tapping into their deep knowledge of airline operations and booking processes. That knowledge stems from Sabre's long history as a travel company. What's more, Sabre has been part of IATA since 1986, which gives us a deep understanding of the air travel business.
This invaluable collaboration resulted in the creation of unique, "homemade" database keys, developed according to business logic to bridge the gap between the data sources. Building domain-specific keys required a great deal of creativity, and this challenge was addressed by Hande Tuzel – our colleague from Sabre Labs, our R&D department. She invented the mapping algorithm and defined the necessary business-oriented keys. Our task in the Data & Analytics department was to translate these inventive algorithms into robust, production-ready data engineering solutions.
The challenge of merging an ocean of data
Equipped with these bespoke keys, we faced a new hurdle – the immense scale of the data itself. The offer data alone amounted to terabytes, rendering traditional one-to-one mapping approaches prohibitively slow and resource-intensive.
With cost and efficiency at the forefront, we explored various technologies, each offering unique advantages and drawbacks.
Our initial Proof of Concept (PoC), created before moving to Google Cloud, involved a challenging multi-step join process: we iteratively relaxed the matching criteria, starting with strict conditions and gradually loosening them to capture more data.
The join algorithm consisted of conditions at different levels, ranging from very restrictive mappings to looser ones that were easier to meet. The PoC began with a join in Hive, where successfully joined records were saved; for the remaining data, the matching conditions were slightly relaxed and another join was performed. This cycle was initially repeated eight times, and in the end we had to expand the process to 32 steps to join the entire dataset. The resulting SQL query was highly complicated and consumed significant resources on our Hadoop cluster.
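The original Hive queries are too complex to reproduce here, but a single relaxation step can be sketched roughly like this. All table and column names – air_shopping, air_booking, trip_signature, the match levels – are simplified stand-ins, not our production schema; the point is that each step keeps only rows matched under the current, slightly looser condition and not matched in any earlier step.

```sql
-- Simplified sketch of one relaxation step (here: step 2).
-- Offers already matched under stricter conditions are skipped, and the
-- slightly looser condition is applied only to the remaining rows.
INSERT INTO matched_offers_orders
SELECT
  s.offer_id,
  b.order_id,
  2 AS match_level                            -- which relaxation step produced the match
FROM air_shopping AS s
JOIN air_booking  AS b
  ON  s.trip_signature = b.trip_signature     -- the "homemade" business key
  AND s.request_date   = b.create_date        -- relaxed: match on date, not exact timestamp
WHERE s.offer_id NOT IN (
  SELECT offer_id FROM matched_offers_orders  -- exclude rows matched in earlier steps
);
```

Repeating this pattern with progressively looser ON conditions is what grew the process from 8 to 32 steps – and why the combined query became so expensive on the Hadoop cluster.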
From "Impossible" to "Implausible": Embracing simplicity and agility
Having experimented with different approaches, we decided to take a step back and embrace simplicity as our guiding principle. We shifted our perspective from a business-driven to a data-driven approach. By defining a solid plan for growing from ground zero to full production-size traffic, we limited the risk of failure during development.
Moreover, we expected challenges and failures to come up while developing such a massive data flow. Because of that, before starting any ETL coding, we designed a flexible data pipeline architecture that allowed us to test different approaches and technologies.
Breaking down the data silos with a FinOps strategy
With a flexible plan from both a data and an architecture perspective, we embraced an iterative approach: we tested solutions and estimated their costs on smaller subsets before scaling up to larger datasets. As illustrated in the diagram, we began with a defined plan and built incrementally, ensuring that each step was thoroughly evaluated and validated. This iterative approach allowed us to adapt and refine the solution as we progressed toward the full-scale data flow. We then started testing the candidate solutions one by one – and the results of the initial proof of concept surprised us!
Firstly, we found that a materialized view cannot be created over arrays or nested structures.
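To make that limitation concrete, here is a sketch of the kind of definition we mean – the dataset, table and field names are illustrative only. Shopping data of this kind typically keeps flight segments in nested, repeated (ARRAY) fields, and a view that flattens such a field is exactly the sort of definition that hit this limitation:

```sql
-- Illustrative only: a view that flattens a nested, repeated field.
-- Per the limitation above, this kind of definition could not be
-- created as a materialized view over the offer data.
CREATE MATERIALIZED VIEW demo_dataset.offer_segments_mv AS
SELECT
  offer_id,
  segment.origin      AS segment_origin,
  segment.destination AS segment_destination
FROM demo_dataset.air_shopping,
UNNEST(segments) AS segment;
```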
Secondly, we had to evaluate Dataflow (Apache Beam), which was our team's first choice. We come from Scala Spark, so the migration meant pivoting to Java and Apache Beam. However, Dataflow spent too much time on the read stage rather than on the transformations.
Thirdly, we ruled out a Redis cache for the larger datasets because of its data size constraints.
Our next candidate was to use Bigtable, Google's HBase-compatible store, to cache the staging data. At first, it required a similarly large number of Bigtable nodes. Moreover, when defining the keys coming from BigQuery, we encountered a challenge: Bigtable requires careful preparation of key cardinality to store data efficiently.
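"Cardinality preparation" here means designing the row keys so that reads and writes spread evenly across nodes instead of hot-spotting on a few lexicographic prefixes. Staying in SQL terms, a hedged sketch of that preparation might look like the query below – the salt bucket, field names and key layout are illustrative assumptions, not our production design:

```sql
-- Illustrative sketch: building a salted, composite row key for a staging cache.
-- A small hash-based prefix (0-15) spreads monotonically growing business keys
-- across nodes, while the rest of the key still allows lookups by trip and date.
SELECT
  CONCAT(
    CAST(MOD(ABS(FARM_FINGERPRINT(trip_signature)), 16) AS STRING), '#',  -- salt bucket
    trip_signature, '#',                                                  -- business key
    FORMAT_DATE('%Y%m%d', request_date)                                   -- date suffix
  ) AS row_key,
  offer_payload
FROM demo_dataset.air_shopping;
```

That extra key-engineering effort, on top of the large number of nodes, made Bigtable less attractive for us as a staging layer.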
While the initial four steps were exploratory and iterative, the final solution turned out to be surprisingly straightforward. It involved using BigQuery with a carefully designed data structure, leveraging the insights gained from the earlier steps. You might think that the first four steps were a wasted effort, but quite the opposite – they provided the knowledge that guided us to the final, straightforward solution.
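The final schema is beyond the scope of this article, so the following is only a minimal sketch of the general pattern: staging tables partitioned by date and clustered on the homemade business key, so that the join can be done far more directly than with the earlier many-step approach. The table names, the trip_signature key and the three-day booking window are illustrative assumptions.

```sql
-- Illustrative sketch of the pattern: partition and cluster the staging tables
-- on the business key so the join prunes and co-locates the data.
CREATE TABLE demo_dataset.air_shopping_staged
PARTITION BY request_date
CLUSTER BY trip_signature AS
SELECT * FROM demo_dataset.air_shopping;

CREATE TABLE demo_dataset.air_booking_staged
PARTITION BY create_date
CLUSTER BY trip_signature AS
SELECT * FROM demo_dataset.air_booking;

-- A direct join over the prepared tables; partition pruning and clustering keep
-- the scanned and shuffled data small. The 3-day window is purely illustrative.
SELECT
  s.offer_id,
  b.order_id,
  s.request_date
FROM demo_dataset.air_shopping_staged AS s
JOIN demo_dataset.air_booking_staged  AS b
  ON  s.trip_signature = b.trip_signature
  AND b.create_date BETWEEN s.request_date
                        AND DATE_ADD(s.request_date, INTERVAL 3 DAY);
```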
What for?
This data join initiative proved to be far more than a simple technical solution; it has sparked a range of exciting opportunities:
Improved travel agency performance with a 10% decrease in failure rates.
Enhanced AI capabilities through access to a comprehensive and unified travel data landscape.
Empowered data analytics teams with new tools to unlock deeper customer insights and fuel personalization initiatives.
Conclusions and Takeaways
Our journey in merging previously disparate datasets highlights the potential we unlocked when we challenged the seemingly impossible and prioritized collaboration and knowledge-sharing. It showcases how simple, cost-efficient solutions can pave the way for breakthrough innovations, allowing us to navigate the increasingly complex waters of our data-driven world with confidence and vision. If you ever find yourself daunted by the scale of data integration or face a similar challenge in the future, here’s our advice drawn from this year-long development journey.
Never underestimate the power of business knowledge: when conventional methods seem to falter, seek out insights from your business experts — they often hold the key to unlocking new possibilities.
Break down complex problems into manageable steps; tackling them piece by piece ensures steady progress and prevents the overwhelming nature of the task from stalling your efforts.
Embrace simplicity as a guiding principle — sometimes the most profound innovations arise from the simplest ideas.
Finally, always keep your focus on value and impact; every decision and action should drive meaningful outcomes for your business, ensuring that your work remains not only effective but also relevant.