Hedge Fund Data Mastery: Governance and Quality Essentials

Data is the new oil, and this is especially true for hedge funds and asset management firms. Their competitive edge, or alpha, hinges on superior information sources used effectively and swiftly. Firms therefore relentlessly seek diverse data sources to bolster their positions in the market.

Data teams within investment firms find cutting-edge analytical work, such as nowcasting and ML-based revenue modelling, exciting and rewarding. However, the biggest challenge they face is managing the rapid growth in data complexity and quality issues. As more datasets are incorporated, the operational complexity increases exponentially. This often leads to quality problems and delays, which in turn erode the trust of downstream data consumers within the organization. Based on my experience working across different data teams, I have found that data quality checks and metadata management are essential for scaling a firm's data capabilities effectively.

Managing metadata offers significant opportunities for enhancing both the operational efficiency and strategic direction of a company, as well as accelerating alpha research. However, I'll reserve the exploration of data strategy and research potentials for another discussion. Here, my emphasis will be on establishing efficient processes and platforms to support the scalability of data operations.

Among the many challenges faced by data teams in operations, three stand out: data discovery, data quality, and broader data management.

In this article, I'll specifically address the issues of discovery and quality, leaving aside broader considerations such as data management. Rather than rehearsing the benefits of data quality, governance, cataloging, lineage, and observability, I'll explore their practical application within a hypothetical organization leveraging open-source technologies. Prioritizing value creation at every stage over reaching an elusive endpoint, I seek to demonstrate a smooth, cost-effective journey and propose solutions for various scaling bottlenecks encountered by data operations teams.

Journey to operational nirvana

Hypothetical Hedge Fund

Let's envision a fund grappling with a fragmented data architecture where data operations systems are entangled with other operations such as risk and trading. Compounding this challenge, the data team is understaffed, struggling to meet the growing demands of the business, including onboarding new data and developing innovative tools. Despite their efforts, they find themselves constantly firefighting operational issues, leaving little time for strategic initiatives. While it's tempting to idealize a firm with flawless architecture, the reality is often far from perfect. So let's begin from this imperfect world and explore solutions from there.

In this arrangement, the fund initially deployed an OLTP database (SQL Server) to manage trading and operational requirements, containing certain market and reference datasets. Subsequently, they onboarded Snowflake to function as a data lake for alternative datasets and a warehouse for risk reporting. ETL operations run on on-premises and cloud Linux systems via crontab, with some ELT tasks executed within the Snowflake warehouse and SQL Server as stored procedures, and the rest scheduled on Jenkins. Users directly interact with and query the data stores.

Pipeline 1.0

The initial step involves establishing a basic ETL framework before integrating any observability platform. This entails standardizing and centralizing key components, starting with scheduling using Apache Airflow. Focus should be solely on data operations without diversion to reporting or trade operations. Key actions include:

  • Centralize scheduling using Apache Airflow, and use Airflow operators like SnowflakeOperator and MsSqlOperator for integration and transformation pipelines wherever possible. For complex scripts, use SSHOperator or BashOperator instead of cron so those processes keep running on their Linux boxes as before. Add inlet and outlet definitions to enable lineage extraction in subsequent steps (see the sketch after this list).
  • Centralize logging across different machines and Airflow, leveraging tools like Filebeat to ship logs to Opensearch/ELK.
  • Implement source control for DAG code in Airflow, with CI/CD for ETL planned for later stages.
  • Set up Slack or email alerts for DAG failures.
  • Consider separating ELT and ETL logic if feasible.
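
To make this concrete, below is a minimal sketch of what one such centralized DAG could look like, assuming Airflow 2.x with the Snowflake, SSH, and DataHub provider packages installed; the connection IDs, script path, stored procedure, and dataset names are illustrative, not taken from any real setup.

```python
# A minimal sketch of one centralized DAG. Assumes Airflow 2.x with the Snowflake,
# SSH, and DataHub provider packages installed; connection IDs, the script path,
# the stored procedure, and dataset names are all illustrative.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.providers.ssh.operators.ssh import SSHOperator
from datahub_provider.entities import Dataset  # module path varies by plugin version

with DAG(
    dag_id="vendor_prices_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",   # replaces the old crontab entry
    catchup=False,
) as dag:
    # Keep running the existing download script on its original Linux box via SSH
    fetch_files = SSHOperator(
        task_id="fetch_vendor_files",
        ssh_conn_id="etl_host",
        command="python /opt/etl/fetch_vendor_prices.py",
    )

    # Push the transformation into Snowflake; inlets/outlets feed lineage extraction later
    load_to_snowflake = SnowflakeOperator(
        task_id="load_prices",
        snowflake_conn_id="snowflake_default",
        sql="CALL staging.load_vendor_prices()",              # illustrative stored procedure
        inlets=[Dataset("snowflake", "staging.vendor_prices_raw")],
        outlets=[Dataset("snowflake", "marketdata.prices")],
    )

    fetch_files >> load_to_snowflake
```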

Data Quality via Great Expectations

Now we will incorporate data quality assertions to assess the reliability of vendor data and data stores. Great Expectations (GE) stands out as our preferred solution due to its extensive integrations, adaptable assertion functionalities, and the backing of a helpful community. GE enables us to deploy a variety of assertions (illustrated in the sketch after this list), including:

  • Row count to identify empty files or missing data.
  • Schema change detection.
  • Distributional expectations for anomaly detection.
  • Conditional expectations - very handy for alternative and web-collection data.
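
To illustrate, here is a rough sketch of these four assertion types using Great Expectations' legacy pandas-dataset API (newer GX releases restructure this around data contexts and validators, so treat the exact calls as version-dependent); the file, columns, and thresholds are invented for illustration.

```python
# A rough sketch of the four assertion types above, using GE's legacy pandas API.
# Exact method availability depends on your Great Expectations version; the file,
# column names, and thresholds are invented for illustration.
import great_expectations as ge
import pandas as pd

df = pd.read_csv("vendor_prices.csv")
batch = ge.from_pandas(df)

# Row count: catch empty files or missing deliveries
batch.expect_table_row_count_to_be_between(min_value=1)

# Schema change detection: the column set we expect from the vendor
batch.expect_table_columns_to_match_set(
    column_set=["ticker", "close", "volume", "as_of_date", "source"]
)

# Distributional check for anomaly detection
batch.expect_column_mean_to_be_between("close", min_value=1, max_value=5000)

# Conditional expectation, handy for web-collected/alternative data
batch.expect_column_values_to_not_be_null(
    "volume",
    row_condition='source == "web_collection"',
    condition_parser="pandas",
)

results = batch.validate()
print(results.success)
```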

GE simplifies expectation checks across datasets as YAML specifications and generates human-readable Data Docs for expectation results. Integration into our pipeline can occur through occasional table checks, pre-loading data source checks (e.g., the Write-Audit-Publish pattern for API data), or embedding expectations within ETL tasks themselves. The simplest way is to apply expectation checks on data stores (tables, views) as the final task of the DAG, as sketched below. This approach halts downstream DAGs upon expectation failure, ensuring data quality control. Coupled with Airflow and GE alerts in Slack, we can effectively detect data delays and data quality issues.
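
Continuing the earlier Airflow sketch, the checkpoint-as-final-task pattern could look roughly like this, assuming the community airflow-provider-great-expectations package and an existing GE project with a checkpoint named prices_suite_checkpoint (both assumptions; parameter names can differ between provider versions).

```python
# A sketch of running a GE checkpoint as the final task of the DAG, so downstream
# DAGs halt when expectations fail. Assumes the community
# airflow-provider-great-expectations package and an existing checkpoint; names
# and paths are illustrative.
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

# Inside the same `with DAG(...)` block as the earlier sketch:
validate_prices = GreatExpectationsOperator(
    task_id="validate_prices",
    data_context_root_dir="/opt/airflow/great_expectations",
    checkpoint_name="prices_suite_checkpoint",
    fail_task_on_validation_failure=True,  # failed assertions fail the task
)

load_to_snowflake >> validate_prices
```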

After completing these two steps (Pipeline 1.0 and data quality checks), several benefits emerge:

  • Centralized job monitoring through Airflow.
  • Automated alerts on job statuses.
  • Centralized log monitoring and investigation.
  • Data quality expectations set at the table level.
  • Alerts when expectations fail.
  • Data Docs to understand and address data quality failures effectively.

Onboarding Datahub - Choosing a catalogue platform and ingesting metadata:

Choosing an observability and cataloging platform can be a daunting task, but open-source options like Datahub and Openmetadata offer compelling advantages compared to paid options like MonteCarlo, Alation, Manta, Octopai, and Collibra. Of the open-source options, I've found Datahub to be intuitive, customizable, and feature-rich, making it a practical choice for many use cases.

Another strong alternative is Databricks, a comprehensive analytical platform with built-in observability, governance, and data management features. However, it works best when most analytical tasks are conducted within it. Real-world data infrastructure is often fragmented, which poses challenges for Databricks' Unity Catalog experience, particularly with external pipeline lineages and OLTP stored procedures. One area where Databricks excels is data and policy management, a big pain point for many data teams; although not discussed here, it's worth noting. Datahub fully supports ingesting Databricks metadata, making it an excellent option even for Databricks users, as Datahub provides additional integrations and customization options.

Metadata Ingestion:

In this scenario, our initial focus with Datahub is to seamlessly connect our primary systems and automatically ingest metadata. Datahub's strong integrations with Airflow, Snowflake, and SQL Server (including stored procedures) enable automatic lineage extraction down to the column level for all data assets within these systems. Its intuitive UI and CLI-based ingestion interface simplify the management of ingestion schedules. Additionally, Datahub offers powerful APIs for manually constructing lineages, providing flexibility and customization options. Furthermore, Great Expectations (GE) integrates seamlessly with Datahub, allowing us to utilize the same assertions already in place, ensuring consistent and efficient data quality monitoring across our systems within Datahub's unified environment.
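
Where automatic extraction cannot reach, for example a bespoke script outside Airflow, a manually constructed lineage edge might look like the sketch below, which uses DataHub's Python REST emitter; the server address, platforms, and table names are illustrative.

```python
# A minimal sketch of DataHub's Python REST emitter for manually adding lineage
# where automatic extraction falls short. URNs, server address, and table names
# are illustrative.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")

# Declare that the Snowflake table is derived from the SQL Server source table
upstream = UpstreamClass(
    dataset=builder.make_dataset_urn("mssql", "trading.dbo.raw_prices", "PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(upstreams=[upstream])

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=builder.make_dataset_urn("snowflake", "marketdata.prices", "PROD"),
        aspect=lineage,
    )
)
```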

Cross system lineage example

Table and column-level lineage are crucial for diagnosing data problems. If a table has inaccurate or missing data, lineage helps us quickly trace it back to its source. By comparing two point-in-time (PiT) versions of lineage, we can identify changes that caused issues, such as alterations in column names. Lineage isn't just for debugging; it also helps with resource management, migrations, and more. Serving as the connections/edges in our metadata graph, lineage enables us to create incident management systems, business catalogs, and reporting frameworks, adding value and improving our data operations.

Metadata Graph enrichment - categorizing and classifying data assets

Cataloging datasets for exploration and analysis is a fundamental task for the data team, alongside technical and operational duties. Instead of relying solely on manual documentation, generating business metadata from technical metadata is favored. This approach guarantees smooth connections between data assets in the metadata graph and what the business interacts with, fostering discussions and improving comprehension for users.

In Datahub, organizing data assets into Data Products for each Domain or team simplifies the structure. For instance, an Equity Pricing dataset from Bloomberg could be named "Bloomberg Equity Price Dataset", including the relevant tables, views, and pipelines. Adding metadata, documentation, and vendor descriptions to data products enhances accessibility and comprehension. Tags and Custom Properties further help categorize data products by type, data type, asset class, sector, and so on. Business Glossary terms also aid in classification and reference management. Used effectively, these properties/objects enable data teams to streamline the organization of data assets. Below are some sample tags that we can use for datasets:

  • Type: Distinguishing between different types of data products, such as datasets, reports, or analyses.
  • Data Type: Market data or Alternative data.
  • Asset Classes: Equity, Fixed Income (FI), or Foreign Exchange (FX).
  • Sector: Healthcare, Finance, or Technology.

Dataset as a Data Product

Going beyond conventional datasets, data products can include diverse combinations of analytics, providing endless opportunities. For example, a "Retail Whitespace Analysis" could merge foot-traffic and demographic datasets with market datasets to evaluate a retailer's geographic placement. This showcases the expansive range of possibilities for data product scope, bound only by creativity. Such classification brings considerable value to the research and development process.

It's also important to assign owners to the Data Products, and Datahub's custom ownership types let the roles and user titles mirror the firm's structure. At this point you can choose to add tags and owners to the individual data assets or let them inherit from the data products, as sketched below.
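
As a sketch of what that could look like programmatically, the snippet below attaches a tag and an owner to a dataset using DataHub's Python SDK; the URNs and the user are illustrative, and note that emitting these aspects wholesale replaces any existing tags or owners, so additive patch helpers may be preferable in practice.

```python
# A rough sketch of attaching a tag and an owner to a dataset via DataHub's Python
# SDK. The URNs and user are illustrative; emitting GlobalTags/Ownership this way
# overwrites existing values, so prefer additive patch helpers for production use.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    GlobalTagsClass,
    OwnerClass,
    OwnershipClass,
    OwnershipTypeClass,
    TagAssociationClass,
)

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
dataset_urn = builder.make_dataset_urn("snowflake", "marketdata.bbg_equity_prices", "PROD")

# Tag the dataset with one of the classification tags listed above
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=builder.make_tag_urn("Equity"))])

# Assign an owner; custom ownership types can mirror the firm's structure
owners = OwnershipClass(
    owners=[OwnerClass(owner=builder.make_user_urn("jdoe"), type=OwnershipTypeClass.DATAOWNER)]
)

for aspect in (tags, owners):
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=aspect))
```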

Business Catalogue and an opportunity to connect to external workflows

The business catalog serves as the gateway for business and end-users to discover and manage available datasets. While a significant portion of the metadata graph contributes to the business catalog, additional aspects related to vendor contractual items, team assignments, sharing terms, costs, etc., are crucial components as well. Acquiring this information often involves digging through contracts, underscoring the importance of cross-team collaboration, documentation, and being able to connect these different aspects together. Now the question arises: how can we achieve this at scale? I believe the most effective approach is to broaden the reach of the metadata graph to external systems and functions. This allows us to establish connections and monitor activities efficiently. Through this method, we can seamlessly document and collaborate with cross-functional teams such as project management, vendor management, compliance, accounting, and various front office functions.

Putting this into practice is no simple task, as each of these teams operates with distinct systems, processes, and terminology. Initially, the approach is to reference the relevant metadata graph object of a dataset (Data Product) when collaborating with these functions. For instance, consider using Jira for project management and having a task associated with a specific dataset. If we incorporate the Data Product's URN as a property of this task entity in Jira, it becomes effortless to trace tasks, projects, and Jira objects back to the metadata graph and vice versa, as sketched below. This concept can be extended to other functions such as vendor management, data onboarding, compliance, accounting, and beyond. The goal is to establish connections between different functions to drive value across the organization.
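
As an illustration of the Jira idea, stamping a Data Product URN onto an issue via Jira's standard issue-edit REST endpoint could look like the sketch below; the custom field id, issue key, URN, and credentials are all hypothetical and would need to be created and configured in your own Jira instance.

```python
# A sketch of stamping a DataHub Data Product URN onto a Jira issue. Assumes a
# Jira Cloud instance and a pre-created custom field; the field id
# "customfield_10200", the issue key, URN, and credentials are hypothetical.
import requests

JIRA_BASE = "https://yourfirm.atlassian.net"
AUTH = ("svc-data-team@yourfirm.com", "<api-token>")

payload = {
    "fields": {
        # Hypothetical custom field that stores the Data Product URN
        "customfield_10200": "urn:li:dataProduct:bloomberg_equity_price_dataset",
    }
}

resp = requests.put(
    f"{JIRA_BASE}/rest/api/2/issue/DATA-123",  # standard Jira issue-edit endpoint
    json=payload,
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()
```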

Metadata objects being the glue connecting different functions

One aspect of Datahub that I appreciate is its intuitive and straightforward design. Paired with workflow tools like the Actions Framework and extensive programmatic access, it offers ample opportunities to tailor the platform and workflows to our specific needs. I strongly encourage you to explore the Datahub Customer Adoption Stories and their blog to discover the diverse ways in which users have leveraged the platform.

Incident Management

Late data, quality problems, and system downtimes are inevitable in operations, making incident management a vital aspect of data teams. Establishing a data incident management workflow for your pipelines involves five key steps:

  • Incident detection and communication
  • Root cause analysis
  • Resolution
  • Post-mortem analysis
  • Managing audit trails.

While using a service management tool like Jira is common practice, integrating it with an observability platform and catalog would greatly enhance the process and user experience, considering the workflow outlined above. Thankfully, Datahub provides a method to mark the unhealthy status of a data asset using the Incidents object. Each incident follows a lifecycle, featuring attributes such as state, title, and description.

Efficient communication is vital for effective incident management. The objective is to promptly detect issues and proactively inform stakeholders, avoiding end-users discovering problems first, which could undermine confidence in data quality and reliability. Datahub's robust API support enables automatic incident creation triggered by pipeline and assertion failures, as sketched below. Datahub Cloud ensures asset owners and users receive prompt alerts about incidents, facilitating swift resolution and tagging affected assets within the platform. Additionally, with API support and functionalities like Actions, we can effortlessly develop tailored solutions and integrate with customized service management tools.
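
As a sketch, raising an incident from a failed pipeline or assertion through DataHub's GraphQL API could look like this; the mutation and input fields follow DataHub's incidents API but should be verified against your DataHub version's GraphQL schema, and the URN, host, and token are illustrative.

```python
# A sketch of raising a DataHub incident from a failed pipeline or assertion via
# the GraphQL API. Verify the mutation and input fields against your DataHub
# version's schema; the URN, token, and host are illustrative.
import requests

DATAHUB_GRAPHQL = "http://datahub-gms:8080/api/graphql"
HEADERS = {"Authorization": "Bearer <datahub-token>"}

mutation = """
mutation raiseIncident($input: RaiseIncidentInput!) {
  raiseIncident(input: $input)
}
"""

variables = {
    "input": {
        "type": "OPERATIONAL",
        "title": "Vendor price file missing for 2024-05-01",
        "description": "Row-count assertion failed on marketdata.prices",
        "resourceUrn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,marketdata.prices,PROD)",
    }
}

resp = requests.post(
    DATAHUB_GRAPHQL,
    json={"query": mutation, "variables": variables},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```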

Using end-to-end table and column lineage, along with tracking Airflow task and assertion statuses in Datahub, and leveraging centralized logging enabled by ELK, we can swiftly pinpoint the root cause of issues. Furthermore, centralized catalog management allows for easy identification of dataset owners (internal/external) and facilitates issue escalation. Additionally, with point-in-time audit trails of incidents, pipeline runs, assertions, and categorized metadata, we can conduct various analyses and generate reports effortlessly. Moreover, ETL processes can be configured to execute only when there are no active critical incidents, thereby boosting operational efficiency and reducing potential data issues.

Automatic Incident Detection and Alerting

Reporting and data-driven decision making

Now, with all essential information readily available via the metadata graph, effective communication and decision-making become achievable. SLAs and KPIs can be established for both internal and external stakeholders.

Through the metadata graph, you have the following data points at your disposal:

  • A categorized list of data assets with ownership.
  • Incident details associated with data assets.
  • Usage metrics.
  • Pipeline status history.
  • Data quality issues.
  • SLA breaches with external data vendors and internal stakeholders.

Using these metrics, we can easily create data points that drive:

  • Vendor negotiations.
  • Internal stakeholder meetings.
  • Defining KPIs for Data teams.
  • Project management for Data team.
  • Data operation gap analysis.
  • Data Steering committee meetings.
  • Data Strategy gap analysis.

If we establish connections with other functions such as vendor management, accounting, and project management through the metadata graph, the potential is even greater.

Pipeline 2.0:

Although significant progress has been made in achieving observability and cataloguing, it's important to acknowledge that we have only scratched the surface. There's still much ground to cover, and many opportunities for further improvement lie ahead. In Pipeline 1.0, we focused on fundamental aspects to ensure data quality and observability. However, to attain a fully scalable platform, ongoing updates to our systems and processes are essential. Here are a few additions you can implement in the pipeline to enhance its performance, observability, and governance even further.

  • Data Contract

The notion of Data Contracts is relatively recent, with different platforms offering diverse interpretations and iterations of it. Essentially, it's a document that points to a specification outlining the structure, format, and meaning of data exchanged between two parties. Within Datahub, Data Contracts consist of assertions, which are verifiable statements enforced on individual data assets. These assertions pertain to schema-related attributes, service level agreements (SLAs), data freshness, and data quality.

Sample Data Contract in Datahub

  • Metadata driven ETL

Initially, we introduced a basic pipeline setup where downstream DAGs fail if upstream DAGs' assertions fail. However, this setup is problematic for several reasons. We're treating assertions as task failures and creating dependencies between DAGs, which may not always be scalable, especially considering different scheduling needs. An alternative approach could involve confirming the availability of the latest data in the upstream table before initiating downstream DAGs; however, this method may overlook other data quality issues. Therefore, there's a need for an efficient way to convey assertion status. While some may consider using XComs and other Airflow components to address this, doing so could introduce complex dependencies in pipelines. A more straightforward option is to integrate the metadata graph into your pipelines to verify the health and freshness of data assets. Using incident statuses and assertion results, we can construct simple pipelines without intricate dependencies between DAGs, as sketched below.
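
A sketch of such a metadata-driven gate is shown below: a ShortCircuitOperator asks DataHub whether the upstream dataset has any active incidents before letting downstream tasks run. The GraphQL query shape and the URN are assumptions to validate against your DataHub version.

```python
# A sketch of a metadata-driven gate: skip downstream work while the upstream
# dataset has active incidents in DataHub. The GraphQL query shape and the URN
# are assumptions; check them against your DataHub version.
import requests
from airflow.operators.python import ShortCircuitOperator

DATAHUB_GRAPHQL = "http://datahub-gms:8080/api/graphql"
HEADERS = {"Authorization": "Bearer <datahub-token>"}
UPSTREAM_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,marketdata.prices,PROD)"

QUERY = """
query activeIncidents($urn: String!) {
  dataset(urn: $urn) {
    incidents(state: ACTIVE, start: 0, count: 1) {
      total
    }
  }
}
"""


def upstream_is_healthy() -> bool:
    """Return True (run downstream) only when no active incidents are open."""
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"urn": UPSTREAM_URN}},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    total = resp.json()["data"]["dataset"]["incidents"]["total"]
    return total == 0


# Inside the downstream DAG's `with DAG(...)` block:
check_upstream_health = ShortCircuitOperator(
    task_id="check_upstream_health",
    python_callable=upstream_is_healthy,
)
```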

  • Refactoring Pipelines: Start by dissecting each ETL script/process separately, distinguishing between ETL and ELT workloads. Integrate Airflow operators wherever possible to extract entities and trace lineages. Eliminate pipeline dependencies on specific machines by managing and standardizing dependencies and packages, enabling Airflow pipelines to run on scalable infrastructure workers like Celery or Kubernetes.
  • Standardizing ELT with DBT: For those seeking to implement development-like lifecycles for ELT processes, DBT offers an excellent option with its strong integration with Datahub.
  • Standardizing ETL Procedures: Since most vendor data integrations adhere to specific patterns for fetching, processing, and validating assertions and SLAs, it's advisable to maintain a library of common pipeline templates, with each pipeline inheriting these standardized behaviors (see the sketch after this list).
  • Enhancing Source Metadata: Enhance table and other data asset DDLs with descriptions and comments to ensure they are included in the metadata graph during metadata ingestion.
  • Implementing CI/CD: With standardization comes the need for thorough checks and tests for every code change. This also presents an opportunity to enforce data governance standards and add tests that verify the presence of objects such as data contracts, and of inlets and outlets for unsupported operators.
  • Compiling Guidelines: Develop documentation outlining the steps for constructing ETL scripts, DAGs, and similar components to ensure adherence to proper standards.
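
As a toy sketch of the template-library idea mentioned above, a shared factory could generate the standard fetch-load-validate shape from a small amount of vendor-specific configuration; the helper names and config fields below are invented for illustration.

```python
# A toy sketch of a shared pipeline-template library: each vendor integration
# declares a small config and inherits the standard fetch -> load -> validate
# shape. The helper names and config fields are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_vendor_dag(vendor: str, schedule: str, fetch_fn, load_fn, validate_fn) -> DAG:
    """Standardized vendor-ingestion template: fetch, load, then validate."""
    with DAG(
        dag_id=f"{vendor}_ingest",
        start_date=datetime(2024, 1, 1),
        schedule_interval=schedule,
        catchup=False,
        tags=["vendor", vendor],
    ) as dag:
        fetch = PythonOperator(task_id="fetch", python_callable=fetch_fn)
        load = PythonOperator(task_id="load", python_callable=load_fn)
        validate = PythonOperator(task_id="validate", python_callable=validate_fn)
        fetch >> load >> validate
    return dag


# Each vendor pipeline then becomes a few lines of config plus its own callables, e.g.:
# dag = build_vendor_dag("bloomberg_prices", "0 6 * * *", fetch_bbg, load_bbg, validate_bbg)
```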

Beyond Graph

Graphs present boundless opportunities beyond mere operational duties. I'd like to wrap up by envisioning an intriguing possibility: merging knowledge graphs with the capabilities of LLMs. This amalgamation could offer a wholly new outlook on operations, observability, data strategy, and research. I'll explore this concept further and share the progress I've made using metadata graphs in a separate post.

Conclusion

When I joined my first Data team, I was tasked with constructing a Catalogue for alternative datasets. It was a thrilling challenge for a newcomer like me. However, gathering the necessary information proved tedious, involving combing through code, documentation, and conversations with colleagues. We had around 50 active datasets, with the top 30 being the most used. Additionally, there were 50 inactive datasets and over 100 trialed datasets. Focusing on cataloging just the top 30 active datasets adds more value than attempting to handle all 200+. This principle also holds true for data quality and observability; typically, 20% of datasets contribute to 80% of the challenges and value. Starting with these critical datasets and addressing major pain points not only saves time but also establishes a robust foundation for future development.

When you speed ahead, mistakes occur and fires ignite, but now they're less intimidating because you possess the tools to handle them. It's a chance to evolve and prepare the platform for genuine innovation. It's also crucial to share this perspective and educate other firefighting teams within your organization so they can leverage and improve upon what's already established.


