TalentPort: Secure, Scalable and Self-Serve ETL using Open Source Systems
Introduction
Major businesses in this era are driven by data. As the saying goes, data is the new oil: it must be refined through various treatments to yield insights, information and inference. As one of the most trusted platforms, LinkedIn handles its member and customer data securely and at scale. Talent Hub is LinkedIn's own Applicant Tracking System (ATS) within the Talent Solutions product family.
Organisations tend to accumulate large amounts of hiring data in an ATS. When they move from a 3rd party ATS to Talent Hub, transferring that legacy data to Talent Hub becomes a critical requirement. When customers sign up for Talent Hub, Implementation Consultants (ICs) help them onboard and take care of migrating all the legacy data from the 3rd party ATS to Talent Hub. TalentPort is the application ICs leverage to perform this white glove activity. In this blog we discuss how TalentPort solves various problems faced in this extract-transform-load (ETL) process by leveraging open source technologies. It showcases how to choose and connect various open source technologies as building blocks to forge the right application for the business, performing ETL securely at scale for Talent Hub.
Challenges
Firstly, let's understand the challenges TalentPort is aiming to solve:
Comparative Analysis
The solutions explained below work under different circumstances. They were explored, analysed and compared before we arrived at the current solution. A comparative analysis is presented below.
Standalone Scripts
Standalone Python-based scripts were developed for the end-to-end solution, with individual scripts catering to the extract, transform and load aspects. This met the short-term goals and provided the agility to quickly unblock the business. However, it had several drawbacks: security and compliance had to be enforced manually, there were many manual touch points for ICs and engineers, IC usability was poor, the scripts were mostly only vertically scalable, and end-to-end automation was missing. This solution was therefore deprecated in light of the long-term goals.
Azure Data Factory
Azure Data Factory is a cloud-based ETL service that orchestrates and automates data movement and data transformation. It offers a code-free, GUI-based ETL portal where pipelines can be designed and executed without worrying about the underlying infrastructure, providing a true serverless experience; refer to the Azure Data Factory documentation for details. A proof of concept was developed using test data from the Dynamics ATS. This solution works best when your applications are already running in Azure, leveraging cloud components, and there are expert teams to assist you. We observed a few pros and cons, listed below; because of the cons, this option was dropped.
Commercial ETL Frameworks
Some of the commercial ETL frameworks provide a design workbench for a code-free experience, while others don't. Some deal only with log processing. Some offer high-end features only in enterprise versions, which are too costly. It would also be difficult to make them interoperate with the LinkedIn stack.
Because of the reasons discussed above, we started exploring internally. Various ETL tools and augmenting open source technologies are heavily used at LinkedIn; some of these frameworks are already used for other use cases, and a few engineering teams have developed expertise in this domain. Considering all of this, designing an application that leverages LinkedIn infrastructure, the ETL tools already in use and the augmenting open source technologies was the best road ahead.
Solution
In this section let us explore TalentPort's architecture in greater detail.
Actors & Use Cases
Relevant use cases and actor interactions are captured in Figure 1.
Figure 1: Actors and Use cases
Open Source Technologies
Along with some internal systems, the following open source technologies have been used in developing TalentPort:
Architecture
Figure 2: TalentPort Architecture
All the pipeline flows are managed and triggered from the Azkaban UI (a distributed workflow manager).
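For illustration, here is a minimal Python sketch of how a pipeline flow could be triggered programmatically through Azkaban's AJAX API. The host, project and flow names are hypothetical placeholders, not TalentPort's actual configuration.

```python
import requests

AZKABAN_URL = "https://azkaban.example.com"  # hypothetical host

def azkaban_login(username, password):
    """Authenticate against Azkaban and return a session id."""
    resp = requests.post(
        AZKABAN_URL,
        data={"action": "login", "username": username, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["session.id"]

def execute_flow(session_id, project, flow):
    """Trigger one execution of a flow and return its execution id."""
    resp = requests.get(
        f"{AZKABAN_URL}/executor",
        params={
            "ajax": "executeFlow",
            "session.id": session_id,
            "project": project,
            "flow": flow,
        },
    )
    resp.raise_for_status()
    return resp.json().get("execid")

if __name__ == "__main__":
    sid = azkaban_login("ic_user", "********")
    exec_id = execute_flow(sid, "talentport_migration", "extract_transform_load")
    print(f"Started execution {exec_id}")
```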
Let's zoom in further on the architecture. Individual details of extract, transform and load are illustrated and explained in the sections below.
Data Extraction
In this phase, data entities such as Jobs, Candidates, Applications, Feedback etc. are extracted from the source (3rd party APIs or Sharepoint) in various formats (JSON, CSV) using Gobblin. The details and step-wise explanation are captured in Figure 3.
Figure 3: Flow for extracting data entities
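In production this extraction is driven by Gobblin connectors; purely for illustration, the Python sketch below captures the underlying idea of paging through a hypothetical 3rd-party ATS REST endpoint and staging the raw JSON records. The endpoint, parameters and staging path are assumptions.

```python
import json
import requests

# Hypothetical 3rd-party ATS endpoint and staging path; the real pipeline
# uses Gobblin connectors rather than hand-rolled code like this.
ATS_API = "https://ats.example.com/api/v1/candidates"
STAGING_FILE = "/tmp/candidates_raw.json"

def extract_candidates(api_token, page_size=100):
    """Page through the source API and yield raw candidate records."""
    page = 1
    while True:
        resp = requests.get(
            ATS_API,
            headers={"Authorization": f"Bearer {api_token}"},
            params={"page": page, "per_page": page_size},
        )
        resp.raise_for_status()
        records = resp.json().get("results", [])
        if not records:
            break
        yield from records
        page += 1

def stage_records(api_token):
    """Dump extracted records to a staging file (HDFS in the real flow)."""
    with open(STAGING_FILE, "w") as out:
        for record in extract_candidates(api_token):
            out.write(json.dumps(record) + "\n")
```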
Attachments Extraction
Attachments consist of resumes, cover letters, offer letters etc. No transformations are required for these entities; only their metadata is needed during schema transformation. This flow needs to be executed before data transformation. All the attachments are extracted from the source (3rd party APIs or Sharepoint) using Gobblin and dumped directly into the Ambry store (the final sink). Aggregated metadata is left on HDFS as a byproduct; it contains the blob ID, filename etc., which are leveraged during schema transformation. A watermark (a file on HDFS tracking failed URLs) is maintained so that failed downloads can be retried. Step-wise explanations are captured in Figure 4.
Figure 4: Flow for extracting attachments
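A minimal sketch of the watermark idea is shown below, assuming local files in place of HDFS and an injected blob-store call in place of the Ambry client; the real flow is implemented with Gobblin.

```python
import csv
import requests

# Illustrative paths; the real pipeline keeps the watermark on HDFS and
# writes blobs to Ambry via internal clients.
WATERMARK_FILE = "/tmp/failed_attachment_urls.txt"
METADATA_FILE = "/tmp/attachment_metadata.csv"

def download_attachments(urls, blob_store_put):
    """Download each attachment, push it to the blob store, and record
    metadata; failed URLs are written to the watermark file for retry."""
    failed = []
    with open(METADATA_FILE, "a", newline="") as meta:
        writer = csv.writer(meta)
        for url in urls:
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
                blob_id = blob_store_put(resp.content)   # e.g. a blob-store client call
                filename = url.rsplit("/", 1)[-1]
                writer.writerow([blob_id, filename, url])
            except requests.RequestException:
                failed.append(url)
    with open(WATERMARK_FILE, "w") as wm:
        wm.write("\n".join(failed))
    return failed

def retry_failed(blob_store_put):
    """Re-attempt URLs recorded in the watermark from the previous run."""
    with open(WATERMARK_FILE) as wm:
        urls = [line.strip() for line in wm if line.strip()]
    return download_attachments(urls, blob_store_put)
```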
Data Transformation
Post data extraction, a configuration file is generated. ICs update this configuration file in consultation with customers to capture the right mappings for interview stages, sourcing channels etc. The file is edited on Sharepoint and then pulled from there into HDFS. A Spark job performs the schema transformation from the 3rd party ATS format to the LinkedIn format (Figure 5). Once the data is converted, a smart sampler shortlists jobs for verification and validation. The shortlisted jobs are imported into a test contract for verification by ICs and the engineering team. Samples are also imported into customer contracts so that customers can review the sample data and verify the information before they commit to a full import; we may adjust this process as the pipelines mature. Jobs with minimal information are of little help for verification, so we compute a score based on active candidates, notes, attachments etc. and sample the jobs with the maximum score. This ensures a fruitful verification and validation phase. The details of the score calculation are captured in Figure 6. Post validation, the full dataset is imported.
Figure 5: Flow for schema transformation
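As an illustration of the schema transformation step, here is a minimal PySpark sketch that maps a hypothetical 3rd-party candidate schema onto a LinkedIn-style schema and joins in the attachment metadata produced earlier. The paths and column names are assumptions, not the actual Talent Hub schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("talentport-schema-transform").getOrCreate()

# Hypothetical input/output locations and column names, for illustration only.
raw = spark.read.json("hdfs:///talentport/extracted/candidates/")
attachment_meta = spark.read.csv(
    "hdfs:///talentport/extracted/attachment_metadata.csv", header=True
)

transformed = (
    raw.join(attachment_meta, raw["resume_url"] == attachment_meta["url"], "left")
       .select(
           F.col("candidate_id").alias("externalCandidateId"),
           F.concat_ws(" ", F.col("first_name"), F.col("last_name")).alias("fullName"),
           F.col("email").alias("emailAddress"),
           F.col("blob_id").alias("resumeMediaId"),   # attachment blob reference
       )
)

transformed.write.mode("overwrite").json("hdfs:///talentport/transformed/candidates/")
```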
Figure 6: Smart sampling
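The smart sampler can be thought of as ranking jobs by a richness score. Below is a minimal sketch assuming a simple weighted sum of active candidates, notes and attachments; the actual formula and weights are those captured in Figure 6.

```python
from pyspark.sql import functions as F

# Hypothetical weights; the production score calculation is described in Figure 6.
WEIGHTS = {"active_candidates": 3, "notes": 1, "attachments": 2}

def smart_sample(jobs_df, sample_size=25):
    """Score each job by how much verifiable content it carries and
    return the top-scoring jobs for the validation import."""
    scored = jobs_df.withColumn(
        "score",
        WEIGHTS["active_candidates"] * F.col("active_candidate_count")
        + WEIGHTS["notes"] * F.col("note_count")
        + WEIGHTS["attachments"] * F.col("attachment_count"),
    )
    return scored.orderBy(F.col("score").desc()).limit(sample_size)
```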
Data Load
Post data transformation, the records are present in a specified HDFS directory in the LinkedIn Talent Hub format. A Kafka-Samza nearline system is leveraged to ingest the records into Talent Hub; it lets us throttle and control the QPS directed towards the Talent Hub APIs. Samza calls specific Import APIs to ingest data into Espresso (the Talent Hub backend datastore). As shown in Figure 7, there exist data dependencies between the four data types: contract (a.k.a. account) metadata is imported first, followed by Jobs and Candidates data in parallel, with Application data imported last.
Figure 7: Flow for data import and its dependencies
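For illustration, the sketch below publishes records to Kafka in the dependency order described above, with a crude client-side throttle. The topic names, broker address and QPS limit are assumptions; in the real system the throttling is enforced by the Samza consumers that call the Import APIs.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from kafka import KafkaProducer  # kafka-python client

# Hypothetical topic names and rate limit; real throttling is enforced
# by the Samza consumers in front of the Talent Hub Import APIs.
TOPICS = {
    "contract": "talentport-contract-import",
    "jobs": "talentport-jobs-import",
    "candidates": "talentport-candidates-import",
    "applications": "talentport-applications-import",
}
MAX_QPS = 50

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(entity_type, records):
    """Publish records for one entity type with a simple client-side throttle."""
    for record in records:
        producer.send(TOPICS[entity_type], record)
        time.sleep(1.0 / MAX_QPS)
    producer.flush()

def run_import(datasets):
    """Respect the dependency order: contract first, then jobs and
    candidates in parallel, then applications."""
    publish("contract", datasets["contract"])
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(publish, "jobs", datasets["jobs"]),
            pool.submit(publish, "candidates", datasets["candidates"]),
        ]
        for f in futures:
            f.result()
    publish("applications", datasets["applications"])
```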
Performance
For early performance testing, we chose our internal test cluster and one of the representative migration pipelines. The test cluster had ample compute and storage but was constrained on GaaP resources (bandwidth to connect to the Internet). We considered 3 anonymised datasets: small (~1k candidates), mid (~20k candidates) and large (~60k candidates), specifying only the number of candidates for brevity; there are equivalent quantities of jobs, notes, applications, attachments etc. We also exercised different modes of concurrency, with a Java driver invoking the Azkaban APIs.
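The actual driver used in testing was written in Java; to keep the sketches here in one language, the Python snippet below shows the same idea of launching the dataset flows with bounded concurrency. It reuses the hypothetical `execute_flow` helper from the Azkaban sketch earlier, and the flow names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

DATASETS = ["small_1k", "mid_20k", "large_60k"]   # illustrative flow names

def run_concurrently(execute_flow, session_id, max_parallel=3):
    """Kick off one migration flow per dataset with bounded concurrency.
    `execute_flow` is the Azkaban trigger helper sketched earlier."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            name: pool.submit(execute_flow, session_id, "talentport_perf", name)
            for name in DATASETS
        }
    return {name: f.result() for name, f in futures.items()}
```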
Observations from the performance experiments:
We applied some of the resolutions listed below to overcome these issues:
The system proved to be scalable and reliable!
Conclusion and Future Work
In this blog post we have discussed TalentPort, an embodiment of the ETL use case built on open source technologies. It is a secure, scalable and self-serve application for moving enterprise data from 3rd party ATSes to Talent Hub. We have successfully developed migration pipelines for major ATSes and integrated them with standardisation APIs to regulate and normalise incoming data. As an outcome, we were able to meet all the major challenges mentioned earlier.
As future work, we want to contribute back some of the features and extensions built around the ETL frameworks and libraries. We also aim to externalise the Import APIs so that customers can directly leverage them for importing their data. We plan to re-run the experiments on other migration pipelines and fine-tune various Gobblin and Spark configurations to improve resource utilisation and performance.
Acknowledgements
It takes a village to build something significant! TalentPort was possible with the tremendous effort and collaboration from the following people: