An island of truth: practical data advice from Facebook and Airbnb
I’ll confess… more than once I have found myself producing, publishing, and publicizing incorrect data. I cannot recall exactly how I found that data — perhaps I ran a SHOW TABLES command in my data lake or data warehouse and got a result back that sounded legit. Or maybe I dug into a dashboard that referenced a column that seemed like what I needed. Perhaps I tried to track down the person who built a summary table a few months back, only to find they had left the company.
Once, after running a query and sharing the results broadly, my heart sank into my stomach when an executive emailed me asking, “Hey, where did you pull this from? Your manager just said this metric was 24% higher.” The painful part of these all-too-common stories, and the thing I’ve felt most acutely, is that trust in data is broken.
“Not that you lied to me, but that I no longer believe you, has shaken me.” — Friedrich Nietzsche
This blog post gives an inside glimpse into data at Facebook and Airbnb, with practical advice on how to build a trustworthy data ecosystem.
Data at Facebook
When I worked at Facebook back in 2008, my official role was data analyst for the growth team. Facebook had just developed Hive, and to help drive its adoption, I made it a side project to create intro classes that taught 550 colleagues how to write their first SQL query. It was a great experience, and my colleagues loved the feeling of becoming data informed.
Once my colleagues got through the basics of SELECT and FROM, the first question they asked was “how do I find the data I need?” This was a surprisingly challenging question to answer. We had a huge number of data tables with similar names and varying levels of relevance. As their teacher, I didn’t want to point someone to the wrong table, but how was I to know which one was right?
Was it dim_user, dim_users, or dim_users_extended? Even if I did manage to point them to the right table, I did not know the nuances of how to query it to generate accurate metrics. For example, a simple COUNT(*) on our dim_users table would return a number larger than the active user count we reported. It turned out that unless I filtered out user_type = -1 and kept only rows with active_30d = 1, my results would be dead wrong.
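To make that concrete, here is a sketch of the difference between the naive query and the correct one (schema details simplified from memory):

```sql
-- Naive query: counts every row, including deactivated accounts and
-- special user types, so it overstates the active user count.
SELECT COUNT(*) AS total_rows
FROM dim_users;

-- Correct query: exclude the special user type and keep only users
-- flagged as active in the last 30 days.
SELECT COUNT(*) AS active_users
FROM dim_users
WHERE user_type <> -1
  AND active_30d = 1;
```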
This metric definition caveat was a huge issue, causing frustration and embarrassment for my colleagues when they produced incorrect results. I’ve found these metric problems so challenging that my new company, Transform, is building tools and frameworks to help companies correctly define and catalog their key performance indicators.
Facebook architecture diagram
- *Note the potential problem with BI tools referencing two different data systems.
Data at Airbnb
When I left Facebook to join Airbnb in 2014 as the PM for data infrastructure and data tools, I vowed to get ahead of this problem before it became institutionalized and intractable. At first glance, the Airbnb data lake already looked pretty daunting, with thousands of tables and nearly a petabyte of data. To be perfectly honest, it was treacherous.
Due to some early infra challenges (that I wrote about here), the organization had low trust in data. We lacked a credible, single source of truth for important data tables and key metrics. Almost all analytical insights were generated by a select few data scientists who had context on their particular data domains, and those folks got bombarded with questions all day long. If one of those data scientists left the company, it was a nightmare to unwind their labyrinthine pipelines to find the actual SQL that defined their metrics.
Despite those early challenges, we stayed committed to fixing this problem because we recognized real potential to create trustworthy, accurate datasets. The stakes were high: if we didn’t build trust in data, the company would have wasted millions of dollars on big data infrastructure, on data tools to increase productivity, and on a technical data science staff. How could we rebuild faith in the accuracy and correctness of data when there were dangers all around?
Airbnb architecture diagram
- *We placed Core Data in the center of the data lake so that it could be joined to all other data assets, reducing the burden of keeping datasets in sync across different systems.
Creating the island of truth
Our answer was to build “Core Data” as our blissful island of truth within our big data lake. A sunny oasis, safe from shark-infested waters. At its heart, Core Data was a set of fact tables and dimension tables based on practical subject areas. When we first built it, we worked with domain experts around the company to unpack the logic used to generate tables, and then validated metric definitions with stakeholders in different departments. Facts like revenue earned, messages sent, and listings created were then easily joined with dimensions like region, language, and listing type.
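As a sketch (table and column names here are hypothetical, not Airbnb’s actual schema), a question like “how many listings were created, by region and listing type?” became a simple join:

```sql
-- Hypothetical Core Data query: join a fact table (listings created)
-- to a dimension table to slice the metric by region and listing type.
SELECT d.region,
       d.listing_type,
       COUNT(*) AS listings_created
FROM core_data.fct_listings_created f
JOIN core_data.dim_listings d
  ON f.listing_id = d.listing_id
GROUP BY d.region, d.listing_type;
```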
By removing many dangerous nuances from the tables and publishing great metadata, we reduced the barrier to entry for novice SQL users. Our team dedicated effort to ensuring the tables were always accurate, complete, reliable, relevant, unique, discoverable, and timely. With those qualities, we engendered trust from customers, then reinforced that trust with our consistency in delivering high-quality datasets.
Once the island of truth was established, we made Core Data the centerpiece of our data education efforts. People new to Airbnb could confidently begin their data journeys knowing that these tables were trustworthy. Teaching data classes was now an absolute joy! And employees could self-serve, lifting the burden from data scientists and data analysts to answer simple questions.
Example schema for an island of truth
- *Many fact tables (fct_*) required high levels of accuracy for financial reporting, so we sourced them from production database snapshots. We could enrich those facts with dimensions from click-stream logs, which were more tolerant of small errors or loss.
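In place of the diagram, here is a hedged Hive-style DDL sketch of the shape such a schema might take (names and types are illustrative):

```sql
-- Illustrative fact table, sourced from production database snapshots
-- (high accuracy, suitable for financial reporting).
CREATE TABLE core_data.fct_bookings (
  booking_id  BIGINT,
  listing_id  BIGINT,
  guest_id    BIGINT,
  revenue_usd DECIMAL(12,2)
)
PARTITIONED BY (ds STRING);  -- daily snapshot partition

-- Illustrative dimension table, enriched from click-stream logs
-- (more tolerant of small errors or loss).
CREATE TABLE core_data.dim_listings (
  listing_id   BIGINT,
  region       STRING,
  language     STRING,
  listing_type STRING
)
PARTITIONED BY (ds STRING);
```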
The next few sections will provide learnings on how other companies might begin creating their own islands of truth.
Where should we build this?
To create an island of truth, you’ll need to carve out a safe place in your data ecosystem that is easily accessible to your internal customers. It needs to be capable of running queries fast enough to encourage exploration and joining of datasets. At Facebook we created an island of truth by moving a subset of our Hive data into a rack of Oracle machines (note: this was before Presto was developed). This physical barrier ensured that only high-quality, trustworthy data made it across the warehouse gap, and analytics queries ran fast in Oracle. The downsides of this approach were that only a handful of people knew how to build the ETL pipelines that moved data between the systems, and it was challenging to keep the data synchronized. Additionally, nobody had direct SQL access to the Oracle tables, so the data there could only be reached through BI tools, which limited exploration.
At Airbnb we took a different approach, using a logical barrier to create our island of truth within our Hive ecosystem — not a physical barrier. The benefit was more people had access to Hive and therefore could easily join these trusted Core Data tables to their own datasets. Airbnb also ran Presto, which was fast enough to support analytics queries “at the speed of thought” right in our ecosystem. This was a great pattern because we didn’t need specialized data engineers to move data between two systems, and we saved ourselves headaches with keeping datasets synchronized.
Advice: Create your island of truth right in the middle of your data lake, not in a separate system. The rise of inexpensive file storage, combined with the speed of modern query engines like Presto, Dremio, Spark, and AtScale, makes this possible. If you’re fortunate enough to work at a company where a single data warehouse such as Snowflake, Redshift, Azure Data Warehouse, or Google BigQuery stores all ingested data and summary data, that works too. The key point is to create your island of truth in a place where all of your source data and summary data is accessible.
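As a rough sketch of the logical-barrier approach (role and database names are hypothetical, and exact GRANT syntax varies by engine), the island can be as simple as a dedicated database with asymmetric permissions:

```sql
-- The island of truth is just a dedicated database inside the lake.
CREATE DATABASE IF NOT EXISTS core_data;

-- Everyone can read the trusted tables and join them to their own data...
GRANT SELECT ON DATABASE core_data TO ROLE all_employees;

-- ...but only the owning team can publish into the island of truth.
GRANT INSERT ON DATABASE core_data TO ROLE core_data_eng;
```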
Who should contribute to our island of truth?
The old way of building a trusted analytics dataset like this was to hire a team of Business Intelligence engineers who were experts in data marts and OLAP cubes. In general, these folks were great at creating processes and building trustworthy reports for finance and operations teams whose datasets did not change often. The downside of this approach is that BI engineers were often separated from product development workflows, and their governance model did not keep up with the modern pace of product development.
Many organizations have made a shift toward embedding data analysts and data scientists within product development teams so they can quickly build tables, define metrics, and run experiments. This paradigm helps teams move quickly with data, but an anti-pattern is emerging where analysts and data scientists quickly prototype pipelines (in tools like Airflow) and then “throw their ETL over the wall”. BI engineers are asked to adopt pipelines with no context on what the data means or how it will be used, which creates hard feelings.
Advice: This is a tough problem, and one that differs from company to company. My suggestion is to create a small team of data engineering experts who can strike a balance between rigid process and the faster pace of product development. That group (or virtual team) should outline the structure and reproducible code patterns for committing to the island of truth, helping embedded data analysts and data scientists contribute back.
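One reproducible pattern such a team might standardize is build-validate-publish. The sketch below uses hypothetical table names and a templated ${ds} date variable (as in Hive or Airflow):

```sql
-- 1. Build the candidate daily partition in a staging database.
INSERT OVERWRITE TABLE staging.fct_listings_created
PARTITION (ds = '${ds}')
SELECT listing_id, host_id, created_at
FROM src.listings_snapshot
WHERE ds = '${ds}';

-- 2. Validate: the primary key must be unique (expect zero duplicates).
SELECT COUNT(*) - COUNT(DISTINCT listing_id) AS duplicate_keys
FROM staging.fct_listings_created
WHERE ds = '${ds}';

-- 3. Publish into the island of truth only if the checks pass.
INSERT OVERWRITE TABLE core_data.fct_listings_created
PARTITION (ds = '${ds}')
SELECT listing_id, host_id, created_at
FROM staging.fct_listings_created
WHERE ds = '${ds}';
```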
How can we maintain a quality bar?
Setting strict guidelines on data quality (accuracy, completeness, reliability, relevancy, uniqueness, discoverability, and timeliness) creates a high bar for entry. At Airbnb, we set service level objectives for landing times, tested columns for cardinality explosions, and alerted on data with mismatched types.
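A cardinality-explosion test, for example, can be a scheduled query; this is a Hive-flavored sketch with a hypothetical table and an illustrative 50% threshold:

```sql
-- Flag any day whose distinct listing count jumped more than 50%
-- over the prior day, a common symptom of an upstream fan-out bug.
WITH daily AS (
  SELECT ds, COUNT(DISTINCT listing_id) AS listing_cardinality
  FROM core_data.fct_listings_created
  GROUP BY ds
)
SELECT cur.ds,
       cur.listing_cardinality,
       prev.listing_cardinality AS prev_cardinality
FROM daily cur
JOIN daily prev
  ON prev.ds = date_sub(cur.ds, 1)
WHERE cur.listing_cardinality > 1.5 * prev.listing_cardinality;
```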
Advice: My observation is that there are two important parts of data engineering — sometimes called ETL, pipeline building, analytics engineering, or BI engineering.
- First: deriving data concepts like facts, dimensions, and metrics from source data
- Second: maintaining internal-user-facing datasets for consumption
Both are extremely valuable, but only the first is a good use of data engineering time because it harnesses creative thought and insight to generate novel data assets. The second is repetitive, time consuming, error prone, and mostly does not help data engineers learn or grow in their roles. Data engineers should invest in frameworks, tooling, testing, and processes to reduce their ongoing maintenance burden for keeping a high bar for their internal datasets.
What should we start with?
I have seen projects flounder when folks coordinate pre-meetings before meetings to talk about the meetings they want to have in order to get buy-in from a ton of uninvolved stakeholders. What I’ve seen work is when people just start building this island of truth. Picking a name like “core_” or “gold_” to prefix your tables is enough to start building and marketing your internal brand.
“Do not wait; the time will never be ‘just right.’ Start where you stand, and work with whatever tools you may have at your command, and better tools will be found as you go along.” — Napoleon Hill
Advice: Find the datasets that drive your most important company outcomes, and begin building out those tables. Ideally, your island of truth will start outlining the most critical facts and begin cataloging the funnels that drive your metrics. Having a well-defined star schema to separate facts and dimensions reduces the barrier to entry for internal customers. This blog post isn’t going to provide guidance on how to model a star schema, as there are many great resources like Kimball to help guide that journey.
How do we derive key metrics from these (mostly) normalized tables?
Key metrics can be complex to derive, even from a set of trustworthy tables. It’s even more challenging to create a metrics repository with definitions, annotations, anomaly detection, and lifecycle management. My new company is building tools to make this easier, but I will save that topic for another blog post…
Thanks and praises
Rupen Parikh, Siva Kolappa, Ray Ko, and Prashanth Nerella: thank you for your instrumental efforts in building this architecture at Facebook, and for everything you taught me about data architecture, governance, and ETL.
Riley Newman, Aaron Keys, Jonathan Parks, Sid Kumar, Max Beauchemin, Marcus Schmaus, Lauren Chircus, and the whole data science team at Airbnb: thank you. You were dedicated and thoughtful in building Core Data, helping make Airbnb a world-class place for people to work with data.
Further reading: How Airbnb Democratized Data and Scaling Airbnb’s Experimentation Platform
Thanks Jim Renaud for the killer graphics in this post.
Get in touch
Email … LinkedIn … Waitlist Signup for Transform Data … Medium Article Cross-Post