Horizontal Innovation in Data Science
Innovation is a key driver of progress and takes many forms across fields. In the context of Data Science, innovation can be broadly categorized into two types: vertical and horizontal. Vertical innovation is tailored to a specific field and involves developing new data science solutions, such as machine learning models, in areas that were previously less empowered by data science, such as sales. Horizontal innovation, on the other hand, advances existing data analysis techniques, algorithms, and tools, such as enhancing an experimentation platform for broader adoption. The former aims to address domain-specific challenges creatively, while the latter improves the rigor and efficiency of Data Science work once adopted across teams.
In this article, I would like to provide practical insights into how to drive horizontal innovation across the data science team. To illustrate this, I'll draw upon my experience from a past project. I hope that my reflections can be beneficial by shedding light on potential challenges and opportunities others might face.
Environmental factors and general steps for horizontal innovation
First, there are several environmental factors to consider when assessing whether horizontal innovation is likely to be cultivated:
Once these environmental factors are understood, there are three general steps to make horizontal innovation a reality in your data science team:
Now that we are familiar with the environmental factors and general steps, let's move into a case study in which I led the establishment of a new metrics foundation framework for LinkedIn's Data Science team in 2018.
Case study: developing a new metrics foundation framework
My pain point: “I don’t like Pig”
In early 2018, Data Scientists at LinkedIn wore multiple hats to create business value: besides leveraging data analysis to drive business decision-making, much other work, such as building the data foundations powering production metrics and dashboards (closer to Data Engineering), also fell on the Data Science team's roadmap.
During quarterly planning, I identified a need to surface some extra information in a dashboard for better business insights, but that information was not available in existing datasets, so I included an item, "adding new columns to a dataset," in my quarterly plan.
This item was considered a somewhat mundane task by Data Scientist standards: it was necessary for the business, but not too exciting. One needed to read through legacy code and modify the component to implement the updated logic. That was fine with me, except for one part I found unacceptable: the legacy code was written in Pig.
Speaking of Pig, it may not be a commonly known language at the time of writing (Aug 2023), but it was the first Hadoop language (hopefully people still know what Hadoop is) to enable semi-declarative, query-like data processing. It gained some popularity around 2012–2014 and then lost ground to more SQL-like languages (Hive, Presto), so far fewer companies continued using it to build data pipelines. Since LinkedIn was one of the earlier adopters (link1, link2), many data foundation pipelines were built in Pig, and people developing new pipelines found it easier to copy existing code logic than to rewrite it in alternative solutions (e.g. Hive). As a result, the Pig code base kept growing, and Pig was still a "must-know" language for new members joining the Data Science team in 2018.
I had largely been able to shield myself from writing Pig since joining the company in mid-2017 by creating new dataset pipelines (in Hive) to replace old ones. This time, unfortunately, the logic was simply too complicated to overhaul in a short time, and avoiding Pig was infeasible.
After a few days of diving into the code, I started to find it unbearable and began seriously thinking: could there be an alternative solution that I could push the organization to adopt so we could ditch Pig entirely, and neither I nor anyone else would need to write Pig anymore?
Finding the organization's pain point: logic inconsistency in metrics
It was not easy to convince everyone to ditch Pig. Why? Most people had already learned Pig during their time at the company; some liked the syntax and considered it more powerful than SQL-like languages (e.g. Hive), and some commented that while it might not be perfect, it wasn't worth the effort to overhaul the whole code base. Even after I demonstrated a better solution (i.e. Spark SQL) with much faster execution (2–3x), people still argued that continuing with Pig might be an acceptable tradeoff. To my bigger surprise, I later learned there was even a proposal for "Pig on Spark" underway, to speed up execution while keeping the Pig syntax. Clearly, strong resistance existed.
Why didn't people just agree with my proposal to use Spark, a clear industry trend? Initially, I was confused by this strange phenomenon, but later I found the reason: my pain point ("I don't like Pig") was not the organization's pain point at the time. If I wanted to convince the organization to ditch Pig, I needed to find the organization's pain point and develop a solution that addressed it.
Luckily, one big organizational pain point was emerging on the horizon: metric inconsistency. Imagine this situation: a customer sees a supposed-to-be-identical number (e.g. annual sales) on one product interface and a different number on another. This can seriously impair customer trust and potentially lead to customer attrition. The problem arose simply because the two data pipelines powering the (supposed-to-be-identical) numbers on different product surfaces used inconsistent metric calculation logic: one was powered by the updated new logic, while the other still used the old one.
This surprising observation led to the formation of a horizontal program inside the Data Science team to audit all logic units across the production code base. It was a huge effort to identify those discrepancies and resolve them manually. The team resolved some key discrepancies, but no one could promise this would never happen again. This was considered a big risk for the organization, and a great lever (for me) to align the organization's interests with my personal ones.
Understanding the root cause: scripting language?
What’s the root cause of the logic inconsistency? There are many reasons (e.g. human errors, code governance), but in my opinion, one fundamental problem lies in how data pipelines are designed.
A bit more background about the data pipeline foundation at LinkedIn in 2018 (disclaimer: this is all publicly available information): to democratize data pipeline creation, the company developed a powerful in-house platform, the Unified Metrics Platform (UMP). Anyone could write Pig/Hive/Presto code and leverage the platform to orchestrate the creation of a dataset/metric. It was a great innovation and made creating a new pipeline much easier (users didn't need to worry about infrastructure complexity, they just needed to write the scripting code). Adoption of UMP was high across the Data Science team, and hundreds of metrics were built on it.
Since most languages used on the platform were scripting languages (e.g. Pig, Hive), anyone who wanted to re-use part of a metric's calculation logic usually had to copy a code block from place A to place B in order to replicate the same logic for another metric calculation in place B. However, if the code block in place A was later updated by its original owner, the code in place B would not be updated accordingly (because the owner simply didn't know it had been copied!), leading to logic inconsistencies. This is the root cause.
By contrast, consider standard software design: a code module (e.g. A) with a specific functionality is usually encapsulated in a single function (or class), has unit tests to validate logic correctness, and can be called by other modules (e.g. B) that want to re-use the logic. Whenever the logic in A is updated, it is unit-tested and the change automatically propagates, so downstream modules (e.g. B) are updated as well and one doesn't need to worry about inconsistent behavior.
When I was thinking deeply in this direction, I found it fascinating how different the data engineering approach was from software design, and the root cause looked clearer. But how to solve the problem? Ditching Pig and replacing it with Hive/Presto, or even Spark SQL, could not fully address the inconsistency problem, because these are still scripting languages written in a "procedural paradigm". Then I looked deeper: Scala is the native language of Spark, and it lets users write code in an "object-oriented programming (OOP) paradigm"; if we could encapsulate key logic components and enable them to be referenced directly, we could transform metric building from a script-writing practice into an object-oriented design practice. Meanwhile, this paradigm shift would naturally bring along the speed advantage of Spark execution.
This felt exciting and much more promising: my proposed solution was to move from "writing procedural data pipelines" to "designing the data foundation under an OOP paradigm with logic modularization", and, as a happy side effect, Pig would no longer be needed :)
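To make the idea concrete, here is a minimal sketch of what logic modularization could look like. This is illustrative only, not the actual framework we built: the metric definition, column names, and module names (ActiveUserMetric, DashboardPipeline, SalesReportPipeline) are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct}

// Hypothetical shared logic module: the "daily active users" definition
// lives in exactly one place and is referenced (not copied) by every
// pipeline that needs it.
object ActiveUserMetric {
  def dailyActiveUsers(events: DataFrame): DataFrame =
    events
      .filter(col("member_id").isNotNull)
      .groupBy(col("activity_date"))
      .agg(countDistinct(col("member_id")).as("daily_active_users"))
}

// Two downstream pipelines call the same module instead of copying its
// code. If the metric definition changes, both pick up the change at the
// next compile/deploy, so they cannot silently diverge.
object DashboardPipeline {
  def build(events: DataFrame): DataFrame =
    ActiveUserMetric.dailyActiveUsers(events)
}

object SalesReportPipeline {
  def build(events: DataFrame): DataFrame =
    ActiveUserMetric.dailyActiveUsers(events)
      .filter(col("daily_active_users") > 0)
}
```

The specific metric doesn't matter; what matters is the direction of the dependency: the logic is defined once and imported everywhere, instead of being copied from script to script.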
Building the Minimum Viable Product (MVP)
I quickly wrote up a proposal, boldly named it "the next generation of metrics foundation framework", and claimed benefits including resolving metrics inconsistency and dramatically improving execution speed. Thanks to support from my managers, it went directly up to the data science leadership and attracted attention. I was asked to build a minimum viable product to demonstrate how it would work and prove the stated benefits.
I vividly remember April 21, 2018: during my 12-hour flight back to China, I coded on the airplane for eight hours straight. Without any internet access, there was also no debugging support from IntelliJ (the IDE I was using). I wrote roughly a thousand lines of Scala/Spark code to materialize the framework I had conceptualized, regardless of whether it would execute, laying out the structure and logic components (e.g. unit tests) as if it would work. When the flight landed and internet access was restored, I spent another two days debugging the issues and test-running the workflow. It worked: all the modules were developed in Scala under the proposed design paradigm; each module encoded one key piece of logic and could be referenced easily, and the whole codebase compiled error-free. Meanwhile, with the Spark execution engine, the speed was 3–5 times faster than Pig (I tested this multiple times).
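As an illustration of the kind of unit test each logic module carried, here is a sketch that validates the hypothetical ActiveUserMetric module from above. It uses a local SparkSession and ScalaTest purely for demonstration, not the actual production test setup.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ActiveUserMetricSpec extends AnyFunSuite {

  // A small local Spark session is enough to exercise the logic.
  private val spark = SparkSession.builder()
    .master("local[1]")
    .appName("metric-unit-test")
    .getOrCreate()
  import spark.implicits._

  test("dailyActiveUsers counts each member once per day") {
    // Tiny hand-built input: member 1 appears twice on the same day.
    val events = Seq(
      ("2018-04-21", 1L),
      ("2018-04-21", 1L),
      ("2018-04-21", 2L)
    ).toDF("activity_date", "member_id")

    val result = ActiveUserMetric.dailyActiveUsers(events).collect()

    assert(result.length == 1)
    assert(result.head.getAs[Long]("daily_active_users") == 2L)
  }
}
```

Because every pipeline reuses the tested module rather than a copied block, a logic change is validated once and then propagated everywhere at the next build.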
This MVP demonstrated that a data pipeline could be developed in an object-oriented fashion: one designs the needed data components, checks whether the logic already exists, and builds new components only if needed. This is very different from the previous "procedural approach", where one copies and pastes existing logic and writes the code mechanically. The code was stored in a new multi-repo codebase (referred to internally as a multiproduct), and it served its purpose as a demonstration: the MVP went well, and the leadership acknowledged the advantages of this approach. However, this would be a huge change, and it was still uncertain how others would perceive it, so I was asked to assess its accessibility and adaptability for the organization.
Organizational alignment and feature enhancement
The new framework was embraced by my immediate team (~20 members), as the combination of high execution speed and design thinking was seen as valuable. However, we soon encountered additional challenges that hindered the framework's expansion. These challenges can be roughly categorized as follows:
To tackle these challenges, I spent a lot of time discussing, learning, and brainstorming with many senior technical leaders across the data organization; these efforts eventually translated into the following solutions:
With these challenges successfully addressed, and with alignment from key technical and organizational leaders across the data organization, the framework was poised for broader expansion.
Roadshow, training, and broad adoption
While we expanded the Scala/Spark metrics framework, others also found Spark to be a better tool for more use cases (e.g. Spark SQL is generally faster for daily analytical work). So I was granted the DS leads' support to drive broader adoption of not only the Scala/Spark metrics-building framework but also Spark itself.
What came next may be less technical, but it was very much fun and exciting. We formed a Spark Cross-Team Forum, a technical committee of five senior DS from across the analytics teams, to drive Spark adoption inside the Data Science organization (~200 data scientists). To jump-start things, we sought funding from DS leadership to bring in external vendors, and it was swiftly granted. I remember that for the first Spark session we hosted, due to overwhelming interest from the DS community, the RSVP registration filled up within 2 minutes of the invitation email being sent out (yes, we told everyone ahead of time exactly when the RSVP would go out, and I watched the responses pour in), with a long waiting list on top. Later, the committee designed more customized training materials, ran different training sessions, and hosted office hours to help address specific problems people encountered. The broad spectrum of our approaches brought Spark and the new metrics framework to many in the Data Science team.
Besides the fun of chairing the committee and working with talented peers, it was very rewarding to simply watch our Spark adoption numbers go up every month. We had an internal dashboard tracking Spark adoption (both for metrics foundation building and for ad-hoc usage in daily analytics), and seeing those numbers climb made us feel our efforts were well worth it. By the time I left the company (around 2021), using Spark to run ad-hoc jobs and build metrics was already common practice.
Key learnings
This wraps up my case study on driving horizontal innovation. Upon reflection, this experience was exceptionally satisfying: it not only tested my technical design thinking but also sharpened my organizational skills by necessitating broad alignment. Ultimately, it helped solve one of the organization's most important problems and created value for the company.
While my specific case may not directly apply to all, several key learnings could have broader relevance:
Beyond these key takeaways, finding joy in the journey is equally (or even more) important. As I reflect on the process of establishing this framework, numerous moments of delight emerge: troubleshooting Spark executor issues alongside UMP Engineers, engaging with various technical leaders to gain diverse design perspectives, and delivering roadshow presentations to many analytics teams. A genuine sense of joy drives me forward and is indispensable throughout the journey. I hope this article extends that joy to you as well.