TalentPort: Secure, Scalable and Self-Serve ETL using Open Source Systems

Introduction

Most major businesses today are driven by data. As the saying goes, data is the new oil: it has to be refined before it yields insights, information and inference. As a platform that members trust, LinkedIn handles its member and customer data securely and at scale. Talent Hub is LinkedIn's own Applicant Tracking System (ATS) within the Talent Solutions product family.

Organisations tend to accumulate large amounts of hiring data in their ATS. When they move from a 3rd party ATS to Talent Hub, transferring that legacy data becomes a critical requirement. When customers sign up for Talent Hub, Implementation Consultants (ICs) help them onboard and take care of migrating all the legacy data from the 3rd party ATS to Talent Hub. TalentPort is the application ICs leverage to perform this white-glove activity. In this blog we discuss how TalentPort solves the various problems faced in this extract-transform-load (ETL) process by leveraging open source technologies. It showcases how to choose and connect open source technologies as building blocks to forge the right application for the business, performing ETL securely at scale for Talent Hub.

Challenges

Firstly, let's understand the challenges TalentPort is aiming to solve:

  1. Secure and Compliant: 3rd party customer ATS data contains a lot of highly confidential information, for example interview feedback, resumes and offer letters. The solution should securely transfer data from the 3rd party ATS to Talent Hub and respect all the security guidelines set out by the Information Security team.
  2. Scale Horizontally: As the volume of data we need to import expands over time, the solution should have the capability to scale horizontally.
  3. IC and Engineering Efficiency: TalentPort should be automated end-to-end, greatly improving Implementation Consultants’ and Engineering teams’ efficiency. The solution should aim for zero engineering effort per customer and provide an IC-friendly interface for interacting with migration pipelines.
  4. Pipeline Customisation: Support migration from multiple Applicant Tracking Systems (ATS). Different customers use their ATS in slightly different ways and belong to different tiers. As a result, the solution should have highly customisable migration pipelines to incorporate various use cases.

Comparative Analysis

The solutions described below work under different circumstances; each was explored, analysed and compared before we arrived at the current design.

Standalone Scripts

Standalone Python-based scripts were initially developed for the end-to-end solution, with individual scripts catering to the extract, transform and load stages. This approach met the short-term goals and provided the agility to quickly unblock the business. However, it had several drawbacks: security and compliance had to be enforced manually, there were many manual IC and Engineering touch points, IC usability was poor, the scripts scaled mostly vertically, and end-to-end automation was missing. Considering the long-term goals, this solution was deprecated.

Azure Data Factory

Azure Data Factory is a cloud-based ETL service that orchestrates and automates data movement and data transformation. It offers a code-free, GUI-based ETL portal where pipelines can be designed and executed without worrying about the underlying infrastructure, providing a true serverless experience. A proof of concept was developed using test data from Dynamics ATS. This solution works best when your applications already run in Azure, leverage its cloud components and have expert teams to assist you. We observed a few pros and cons, summarised below; because of the cons, this thread was dropped.

Azure Data Factory - Pros and Cons

Commercial ETL Frameworks

Some of the commercial ETL frameworks provide a design workbench for a code-free experience, while others don’t. Some deal only with log processing, and some reserve high-end features for enterprise versions that are too costly. It would also be difficult to make them interoperate with the LinkedIn stack.

Because of the reasons discussed above, we started exploring internally. Various ETL tools and augmenting open source technologies are heavily used at LinkedIn. Some of the frameworks are already used for different use cases. There are a few engineering teams who have developed expertise in this domain. Considering all of this, designing an application leveraging LinkedIn Infrastructure, ETL tools in-use and the augmenting open source technologies was the best road ahead.

Solution

In this section let us explore TalentPort's architecture in greater detail.

Actors & Use Cases

  • Customers request Data Migration with their Implementation Consultant
  • Implementation Consultants invoke Migration pipelines on TalentPort
  • TalentPort extracts data from 3rd party ATS APIs, Sharepoint etc

Relevant use cases and actor interactions are captured in Figure 1.


Figure 1: Actors and Use cases

Open Source Technologies

Along with some internal systems, the following open source technologies have been used in developing TalentPort:

  1. Apache Gobblin - Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems
  2. Apache Kafka - A distributed event streaming platform
  3. Azkaban - A distributed workflow manager
  4. Apache Samza - A distributed stream processing framework
  5. Rest.li - An open source REST framework for building robust, scalable RESTful architectures using type-safe bindings and asynchronous, non-blocking IO
  6. Apache Spark - An open-source unified analytics engine for large-scale data processing

Architecture


Figure 2: TalentPort Architecture

  1. Data from various sources (3rd party APIs, Sharepoint) in various formats (JSON, CSV) is pulled over authenticated Gobblin connections through the GaaP proxy (the gatekeeper to the internet). Highly confidential data is encrypted before being dumped into HDFS (a field-encryption sketch follows this list). ICs provide API keys securely to the Gobblin extractors via Nuage (an internal tool) and KMS (Key Management System).
  2. The attachments are pushed directly to their final destination, i.e., Ambry (a distributed, highly secure object store).
  3. Post data extraction, a data-driven configuration file is generated to capture client-specific mappings (interview stage mapping, sourcing channel mapping, etc.) and flags.
  4. Data, along with the client configuration, is transformed into the Talent Hub format using Spark. This involves complex schema transformations.
  5. The transformed data is brought to the online systems using Kafka and Samza, and then loaded into the Talent Hub backend using the Import Rest.li APIs.
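Step 1 above encrypts highly confidential fields before the raw records land on HDFS. A minimal sketch of that idea is shown below, assuming hypothetical field names and using Fernet symmetric encryption as a stand-in; in the real pipeline the key would be served by KMS rather than generated in-process.

```python
# Illustrative sketch of field-level encryption before records are persisted.
# Assumptions: the field names are hypothetical and the key would normally be
# fetched from KMS, not generated in-process.
import json
from cryptography.fernet import Fernet

CONFIDENTIAL_FIELDS = {"interviewFeedback", "offerLetterText"}  # assumed names

def encrypt_confidential_fields(record: dict, fernet: Fernet) -> dict:
    """Return a copy of the record with confidential fields encrypted."""
    out = dict(record)
    for field in CONFIDENTIAL_FIELDS & record.keys():
        plaintext = json.dumps(record[field]).encode("utf-8")
        out[field] = fernet.encrypt(plaintext).decode("utf-8")
    return out

if __name__ == "__main__":
    key = Fernet.generate_key()          # stand-in for a key served by KMS
    fernet = Fernet(key)
    record = {"candidateId": "c-123", "interviewFeedback": "Strong hire"}
    print(encrypt_confidential_fields(record, fernet))
```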

All the pipeline flows are managed and triggered from the Azkaban UI.

Let's zoom-in further on the architecture. Individual details of Extract-Transform-Load are illustrated and explained in the sections below.

Data Extraction

In this phase, all the data entities such as Jobs, Candidates, Applications and Feedback are extracted from the source (3rd party APIs, Sharepoint) in various formats (JSON, CSV) using Gobblin. The details and step-wise explanations are captured in Figure 3.


Figure 3: Flow for extracting data entities
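Gobblin drives the extraction itself, but the core loop it performs for each entity, paginating through a source API and landing raw JSON for downstream processing, can be sketched as follows. The endpoint name, paging parameters and staging path are assumptions for illustration; the production pipeline authenticates through the GaaP proxy and writes to HDFS.

```python
# Illustrative sketch of the extraction loop Gobblin performs for one entity.
# The entity endpoint, paging parameters and output path are assumptions; the
# production pipeline authenticates through the GaaP proxy and lands data on HDFS.
import json
import requests

def extract_entity(base_url: str, api_key: str, entity: str, out_path: str,
                   page_size: int = 100) -> int:
    """Page through a third-party ATS API and append raw JSON records to a file."""
    total = 0
    offset = 0
    with open(out_path, "w", encoding="utf-8") as sink:
        while True:
            resp = requests.get(
                f"{base_url}/{entity}",
                headers={"Authorization": f"Bearer {api_key}"},
                params={"offset": offset, "limit": page_size},
                timeout=30,
            )
            resp.raise_for_status()
            records = resp.json().get("results", [])
            if not records:
                break
            for record in records:
                sink.write(json.dumps(record) + "\n")
            total += len(records)
            offset += page_size
    return total

# Example (hypothetical source):
# extract_entity("https://ats.example.com/api", key, "candidates", "/tmp/candidates.json")
```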

Attachments Extraction

Attachments consist of resumes, cover letters, offer letters, etc. No transformations are required for these entities; only their metadata is needed during schema transformation. This flow must be executed before Data Transformation. All the attachments are extracted from the source (3rd party APIs, Sharepoint) using Gobblin and dumped directly into the Ambry store (the final sink). Aggregated metadata is left on HDFS as a byproduct; it contains the blob ID, filename, etc., which are leveraged during schema transformation. A watermark (a file on HDFS tracking failed URLs) is maintained so that failed downloads can be retried. Step-wise explanations are captured in Figure 4.


Figure 4: Flow for extracting attachments
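The watermark-and-retry behaviour described above can be sketched as follows, with a local text file standing in for the HDFS watermark and a placeholder for the Ambry upload; the backoff parameters are assumptions.

```python
# Sketch of the watermark-based retry for attachment downloads.
# A local text file stands in for the HDFS watermark, and upload_to_ambry()
# is a placeholder for the real Ambry client; backoff parameters are assumptions.
import time
import requests

WATERMARK_FILE = "failed_urls.txt"   # tracks URLs to retry on the next run

def upload_to_ambry(name: str, payload: bytes) -> str:
    """Placeholder for the Ambry object-store upload; returns a fake blob id."""
    return f"ambry-blob-{abs(hash(name)) % 10**8}"

def fetch_attachment(url: str, retries: int = 3) -> bytes | None:
    """Download with exponential backoff; return None if all attempts fail."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            time.sleep(2 ** attempt)     # 1s, 2s, 4s ...
    return None

def run(urls: list[str]) -> list[dict]:
    metadata, failed = [], []
    for url in urls:
        payload = fetch_attachment(url)
        if payload is None:
            failed.append(url)
            continue
        filename = url.rsplit("/", 1)[-1]
        metadata.append({"url": url, "blobId": upload_to_ambry(filename, payload),
                         "filename": filename})
    with open(WATERMARK_FILE, "w", encoding="utf-8") as wm:   # watermark for retries
        wm.write("\n".join(failed))
    return metadata
```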

Data Transformation

Post data extraction, a configuration file is generated. This file is updated by ICs in consultation with customers to capture the right mappings for interview stages, sourcing channels, etc. It is edited on Sharepoint and then pulled from there into HDFS. A Spark job is executed for schema transformation from the 3rd party ATS format to the LinkedIn format (Figure 5).

Once the data is converted, a smart sampler is executed to shortlist jobs for verification and validation. The shortlisted jobs are imported into a test contract for verification by the ICs and the Engineering team. Samples are also imported into the customer's contract so that they can inspect the imported sample data and verify the information before the full import; we may flex this process as the pipelines mature. Jobs with minimal information are of little help for verification, so we compute a score based on active candidates, notes, attachments, etc. and sample the jobs with the highest scores. This ensures a fruitful verification and validation phase; the details of the score calculation are captured in Figure 6. Post validation, the full data is imported.


Figure 5: Flow for schema transformation
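A much-reduced PySpark sketch of the schema transformation is shown below. The column names, the client mapping and the Talent Hub output fields are illustrative assumptions; the real job covers many more entities and nested structures.

```python
# Much-reduced PySpark sketch of the schema transformation.
# Column names, the client mapping and the output fields are assumptions;
# the real job handles far more entities and nested structures.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("talentport-transform-sketch").getOrCreate()

# Client-specific configuration (normally generated post-extraction and edited
# by ICs); here a hypothetical interview-stage mapping.
stage_mapping = {"Phone Screen": "PHONE_SCREEN", "Onsite": "ONSITE_INTERVIEW"}
mapping_df = spark.createDataFrame(list(stage_mapping.items()),
                                   ["sourceStage", "thStage"])

applications = spark.read.json("hdfs:///talentport/raw/applications")  # assumed path

transformed = (
    applications
    .join(mapping_df, applications.stage == mapping_df.sourceStage, "left")
    .select(
        F.col("applicationId").alias("externalApplicationId"),
        F.col("candidateId").alias("externalCandidateId"),
        F.coalesce(F.col("thStage"), F.lit("UNKNOWN")).alias("interviewStage"),
        F.to_timestamp("appliedAt").alias("appliedAt"),
    )
)

transformed.write.mode("overwrite").json("hdfs:///talentport/transformed/applications")
```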


Figure 6: Smart sampling
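The smart sampler can be sketched in PySpark as a simple weighted score over job activity, followed by a top-N selection. The weights, column names and sample size below are assumptions, not the production formula captured in Figure 6.

```python
# Illustrative PySpark sketch of the smart sampler: score jobs by how much
# verifiable content they carry, then keep the top-N for verification.
# The weights, column names and sample size are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("talentport-sampler-sketch").getOrCreate()
jobs = spark.read.json("hdfs:///talentport/transformed/jobs")  # assumed path

SAMPLE_SIZE = 25  # assumed number of jobs surfaced for verification

scored = jobs.withColumn(
    "verificationScore",
    3 * F.col("activeCandidateCount")      # active candidates weigh most
    + 2 * F.col("noteCount")
    + F.col("attachmentCount"),
)

sampled = scored.orderBy(F.desc("verificationScore")).limit(SAMPLE_SIZE)
sampled.write.mode("overwrite").json("hdfs:///talentport/samples/jobs")
```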

Data Load

Post data transformation, the records are present in a specified HDFS directory in the LinkedIn Talent Hub format. A Kafka-Samza nearline system is leveraged to ingest the records into Talent Hub; it helps us throttle and control the QPS directed towards the Talent Hub APIs. Samza calls specific Import APIs to ingest data into Espresso (the Talent Hub backend datastore). As shown in Figure 7, there exist certain data dependencies among the four data types: contract (a.k.a. account) related metadata is imported first, followed by Jobs and Candidates data in parallel, with Application data going last.


Figure 7: Flow for data import and its dependencies
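The producer side of this nearline load amounts to publishing transformed records to a Kafka topic at a bounded rate, leaving the Samza job to consume them and call the Rest.li import APIs. A minimal sketch using kafka-python is shown below; the topic name and QPS cap are assumptions. The ordering in Figure 7 is enforced by running one such load step per entity type, in the required sequence.

```python
# Minimal sketch of the producer side of the nearline load: publish transformed
# records to Kafka at a bounded rate. The Samza consumer (not shown) reads the
# topic and calls the Rest.li import APIs. Topic name and QPS cap are assumptions.
import json
import time
from kafka import KafkaProducer   # pip install kafka-python

TOPIC = "talentport-import-applications"   # assumed topic name
MAX_QPS = 50                               # assumed throttle toward Talent Hub

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def publish(records):
    """Send records while respecting a simple fixed QPS cap."""
    interval = 1.0 / MAX_QPS
    for record in records:
        producer.send(TOPIC, value=record)
        time.sleep(interval)               # crude rate limiting
    producer.flush()
```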

Performance

For early performance testing, we chose our internal test cluster and one of the representative migration pipelines. The test cluster had ample compute and storage but was constrained on GaaP resources (bandwidth to connect to the internet). We considered three anonymised datasets: small (~1k candidates), mid (~20k candidates) and large (~60k candidates), quoting only the number of candidates for brevity (there are equivalent quantities of jobs, notes, applications, attachments, etc.), and different modes of concurrency. We had a Java driver invoking Azkaban APIs.
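The driver used for these runs was written in Java; a roughly equivalent Python sketch against Azkaban's AJAX API (login followed by executeFlow) is shown below, with the host, project and flow names as placeholders.

```python
# Minimal Python stand-in for the Java driver that triggered Azkaban flows.
# Uses Azkaban's AJAX API (login, then executeFlow); the host, project and flow
# names are placeholders for the real migration pipelines.
import requests

AZKABAN_URL = "https://azkaban.example.com:8443"   # placeholder host

def login(username: str, password: str) -> str:
    """Authenticate and return an Azkaban session id."""
    resp = requests.post(
        AZKABAN_URL,
        data={"action": "login", "username": username, "password": password},
    )
    resp.raise_for_status()
    return resp.json()["session.id"]

def execute_flow(session_id: str, project: str, flow: str) -> str:
    """Trigger one migration flow and return the Azkaban execution id."""
    resp = requests.get(
        f"{AZKABAN_URL}/executor",
        params={"ajax": "executeFlow", "session.id": session_id,
                "project": project, "flow": flow},
    )
    resp.raise_for_status()
    return str(resp.json()["execid"])

if __name__ == "__main__":
    sid = login("ic_user", "secret")
    for dataset in ("small", "mid", "large"):          # the three test datasets
        print(dataset, execute_flow(sid, "talentport", f"migration_{dataset}"))
```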

Observations from the performance experiments:

  1. With an increase in dataset size and concurrency, there was no drastic increase in data extraction and transformation time.
  2. With an increase in dataset size, attachment extraction time increased dramatically because of the constrained internet bandwidth and the exponential backoff and retry on failure.
  3. With increased dataset size and concurrency, the load time increased to some extent because of the throttling imposed by the Talent Hub APIs.
  4. Attachment extraction mostly failed for the large dataset with maximum concurrency; everything else succeeded.
  5. For the large dataset with maximum concurrency, data extraction took around 10-20 minutes and data transformation around 30-40 minutes. For the large dataset with medium concurrency, attachment extraction took around 2 hours. Data import took around 3-4 hours for the large datasets.

To overcome these issues, we applied the resolutions listed below:

  1. Increased internet bandwidth and updated the code to avoid creating a fresh authentication token for every attachment extraction
  2. Increased Java heap size to protect attachment extraction from memory overflow
  3. Fine tuned units of parallelism to control the QPS generated on Sharepoint APIs

The system proved to be scalable and reliable!

Conclusion and Future Work

In this blog post we have discussed TalentPort, an embodiment of the ETL use case built on open source technologies. It is a secure, scalable and self-serve application for moving enterprise data from 3rd party ATSes to Talent Hub. We have successfully developed migration pipelines for the major ATSes and integrated these pipelines with standardisation APIs to regulate and normalise incoming data. As an outcome, we were able to meet all the major challenges mentioned earlier.

As future work, we want to contribute back some of the features and extensions we built around the ETL frameworks and libraries. We also aim to externalise the Import APIs so that customers can directly leverage them to import their data. We plan to re-run the experiments for other migration pipelines and fine-tune various Gobblin and Spark configurations to improve resource utilisation and performance.

Acknowledgements

It takes a village to build something significant! TalentPort was possible with the tremendous effort and collaboration from the following people:

  1. Engineering Team: Abhishek Agrawal, Aditya Hegde, Balajee Sundaram, Lekshmy Raju, Maneesha Nidhi, Osho Parth, Pankaj Lohani, Prabal Rastogi, Priyanka Shukla, Purushottam Kumar, Ritu Jha, Sanket Dhopeshwarkar, Saurabh Batheja, Sujith Surendran
  2. Product Team: Swati Raina
  3. Implementation Consultant Team: Alessandro Rollo, Beth Loe, Raymen Au, Stephen Gallegos

