Python software engineering for the finance industry

In the last issue of this newsletter, I drew your attention to Big Money Python Jobs in finance that pay £600k per year and beyond.

Whether or not that's for you, I think there is educational and pragmatic value in understanding this market better, so here we will continue where we left off last time.

In addition, we'll take a deeper dive into the most common type of project that in-house software teams at these firms undertake: the ETL data pipeline, and how to approach it in a modern way with Python and FastAPI. All without using clunky monstrosities like Airflow - I'll show you how to build your own bespoke Airflow from scratch, optimised for the specific tasks at hand.

But first, let's continue deciphering the job description posted by a typical asset management company looking for a Python software engineer. If you run into unfamiliar terminology, please read the previous issue of this newsletter, where I walk you through the necessary background and the basic terms these firms use to communicate (and to filter out people unfamiliar with their inside finance lingo!)

Your Role at the Data Engineering Team of an Asset Management Firm

Picking up where we left off, the job description continues like this:

The data engineering team is part of systematic technology group responsible to design, implement and evolve systems required to enable the quantitative investment process. The successful candidate must possess strong knowledge of financial data including security master, financial time series (pricing data, etc.) across multiple asset classes, have solid coding skills (Python, Java or C++, SQL), and experience working with large datasets. This specific role focuses on the solutions for the Credit, Commodities and Fixed Income Businesses.

Let's decode the terminology as we read.

Earlier, they already gave us a bit of a feel for what the Data Engineering Team does. As in any big organisation, smaller teams are parts of larger groups which are parts of even larger divisions and so on. The "systematic technology group" they mention here sounds like a specialised group tasked with developing and maintaining the technological infrastructure that underpins the quantitative investment process. This most likely involves creating systems which can process and analyse large amounts of financial data to inform investment decisions.

The "Quantitative Investment Process" refers to the use of mathematical models and algorithms to identify investment opportunities and manage risks. The technology developed by this group enables these data-driven investment strategies.

And now we get to the part that's most relevant (and possibly a little unnerving) to you: "Strong Knowledge of Financial Data." Great. What does that even mean?

They do give us a bit more detail. The candidate is expected to be familiar with various types of financial data. Security Master is a comprehensive database which includes detailed information about each security (stocks, bonds, etc.) traded in the financial markets. It typically contains data like security identifiers, descriptions, issue details, and trading rules. So, there you have it - a fancy term for a database. You may not have seen this type of data before, but you've seen a database and you do know how to work with data.

The other term mentioned is "Financial Time Series." This includes historical data (or future projections!) on prices, volumes, and other market-related statistics over time, which is essential for analysing trends and making predictions. In other words, a sequence of data points, which can typically be plotted on a simple 2D graph against time to form a "curve" (so you'll often hear the term "curve" used interchangeably with "time series" in this context.)
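
To make this concrete, here's a minimal, hypothetical example of what such a "curve" looks like in Pandas - just a handful of made-up daily closing prices indexed by date:

import pandas as pd

# A tiny, made-up daily closing-price "curve": one value per business day
prices = pd.Series(
    [101.2, 101.9, 100.7, 102.3, 103.0],
    index=pd.date_range("2024-01-01", periods=5, freq="B"),
    name="close",
)

print(prices)
# prices.plot() would draw the familiar price-vs-time curve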

We already described "asset classes" (or the different types of securities) in the previous issue of this newsletter. The job description implies that the role requires understanding data across various types of investments, including equities, fixed income, commodities, etc., reflecting the diverse strategies employed by the hedge fund.

What I want to tell you from my experience is that you really REALLY shouldn't let any of this scare you away. Data is data. Yes, a lot of it won't make sense to you at first, when you are not familiar with how these different financial instruments work. But this is knowledge that can be picked up with experience. So even if you have never been exposed to any data like this in the past, I would strongly encourage you to apply for this job on the basis of your software engineering skills alone. Be absolutely transparent and upfront about it - let them know early on where your weaknesses are and where your strengths are. Highlight the fact that you are a quick learner and eager to pick up more of this knowledge. Once you start the job, make sure to tell everyone what a total noob you are and keep asking stupid questions until they're all blue in the face from explaining the basics to you over and over again. Do NOT be embarrassed by this. Do NOT expect to understand anything from a single explanation - you need to hear it again and again from different angles, from different people. It is the best way to learn and you will make very rapid progress this way.

Moving on... Next, they're asking for "solid coding skills." Yay. You already got this. (You do, right? If you don't - talk to me about my 6-month personalised Python for Software Engineering training, and I'll get you there, regardless of where you now stand.) You'll notice these kinds of job descriptions often mention multiple languages - sometimes they really get carried away. Here, they've only listed 4 - Python, Java, C++ and SQL. Of these, if you know Python and SQL, you're good to go. The vast majority (if not 100%) of the work you'll be doing will be just Python and SQL. Java and C++ are sometimes preferred where performance is important and a lot of data needs to be crunched in near real-time. If you have some background in these languages - great for you. If you don't - there's really no need to fret over it. Just say that you don't. Nobody actually expects anyone to be an expert in multiple languages. Also, there's a reason Python is the first on their list. Focus on that.

When they talk about "Experience Working with Large Datasets", this indicates the need for practical skills in data handling and analysis, including the ability to process, clean, and derive insights from large amounts of data. Most typically what this means is - you'll need some experience with a Python dataframe library like Pandas.
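
As a small, hypothetical sketch of what this looks like in practice: when a file doesn't fit comfortably in memory, Pandas lets you stream it in chunks and keep running aggregates (the file name and column names below are made up):

import pandas as pd

total_value = 0.0
total_volume = 0.0

# Stream a large trades file in chunks instead of loading it all at once
for chunk in pd.read_csv("big_trades.csv", chunksize=100_000):
    total_value += (chunk["price"] * chunk["volume"]).sum()
    total_volume += chunk["volume"].sum()

print(f"Total traded value: {total_value}, total volume: {total_volume}")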

To summarise everything we've said so far - if you know Python, some SQL, some Pandas and NOTHING else, I would encourage you to apply for this role (armed with the insights I've given you in these newsletters.)

The last part of the quoted paragraph indicates there is specific focus on investment strategies in these areas:

  • Credit: involves lending money and receiving interest payments, including corporate bonds and loans.
  • Commodities: include energy (e.g. electric power) and physical goods like oil, natural gas, liquefied natural gas (LNG), gold, and agricultural products, which are fundamental to global markets.
  • Fixed Income: refers to investments that pay regular interest, such as government and corporate bonds.


Roles and Responsibilities of a Python Software Engineer in Finance

Let's move forward with the job description - we're getting to the section that talks about specific responsibilities for the role they are hiring for:

Design, implement and evolve data platform including pipeline, repository, data warehouse and data access APIs.

For a Python software engineer, this means building and maintaining the infrastructure that manages the flow and storage of data. Specifically, it involves:

  • Planning the architecture of the data platform to ensure it meets the needs for data ingestion, processing, storage, and retrieval efficiently.
  • Writing Python code to create the data pipelines (automated processes for collecting and moving data), repositories (databases or storage systems where data is kept), data warehouses (centralized repositories for integrated data from various sources - usually for longer term storage), and data access APIs (interfaces through which users or systems can query and retrieve relevant data).
  • Continuously improving and updating the platform to accommodate new data sources, optimize performance, and incorporate new technologies or methodologies to enhance data handling and accessibility. This is what your day-to-day work will most likely consist of, once there's some working implementation in place.

Automate and support the Extract, Transform, and Load (ETL) processes from various market data vendors and internal operational data stores.

For a Python software engineer, this task entails creating automated systems to handle the ETL (Extract, Transform, and Load) process, which is a critical part of data management. Specifically, this involves:

  • Extract: Writing scripts to retrieve data from various sources, which could include market data from external vendors (like financial market feeds) and internal data stores (various databases within the company). This typically requires pulling data in various formats (e.g. JSON, XML, CSV, Parquet, etc.) using various techniques (SQL querying, API access, blobs in cloud buckets, network file storage, etc.)
  • Transform: Processing the extracted data to ensure it's in a consistent, usable format. This could involve cleaning / sanitising the data (removing or correcting errors), converting data types, and restructuring the data to fit the needs of the application or analysis.
  • Load: Developing mechanisms to efficiently insert the transformed data into a target data repository, such as a database or data warehouse, where it can be accessed and used for analysis or reporting.

Python's versatility and the extensive libraries available (such as Pandas for data manipulation, SQLAlchemy or other ORM libraries for database interactions, FastAPI for building small, fast micro-services, Spark for heavy distributed processing, and clunkier tools like Apache Airflow or Dagster for workflow orchestration) make it an ideal choice for implementing these processes.

Automating these steps ensures that data flows smoothly from its sources to where it's needed, ready for analysis or decision-making, with minimal (ideally no) manual intervention.
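
To tie the three steps together, here's a deliberately minimal, synchronous sketch of an ETL run in Python. Everything in it - the CSV file, the column names, the "trades" table and the connection string - is hypothetical; a real pipeline would swap in your actual vendor feeds and target database:

import pandas as pd
from sqlalchemy import create_engine

def extract(csv_path: str) -> pd.DataFrame:
    # Extract: pull raw trade data from a file (could equally be a vendor API or another database)
    return pd.read_csv(csv_path, parse_dates=["timestamp"])

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: normalise column names, then drop rows with missing prices or volumes
    clean = raw.copy()
    clean.columns = [col.lower() for col in clean.columns]
    return clean.dropna(subset=["price", "volume"])

def load(df: pd.DataFrame, db_url: str) -> None:
    # Load: append the cleaned rows into a target table in the data warehouse
    engine = create_engine(db_url)
    df.to_sql("trades", engine, if_exists="append", index=False)

if __name__ == "__main__":
    db_url = "postgresql+psycopg2://user:password@localhost:5432/marketdata"  # hypothetical
    load(transform(extract("trades.csv")), db_url)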

Design and Implementation of Security and Price Masters and related derived data

For a Python software engineer, this responsibility means developing systems to manage critical financial data:

  • Security and Price Masters: These are centralized databases that contain detailed information about financial securities (like stocks, bonds) and their pricing. The Security Master holds data on each security's characteristics (such as issuer, type, and identifiers), while the Price Master tracks historical and current prices of these securities. In large organisations and financial networks, data is often spread around in different locations, with different characteristics and varying quality. You'll sometimes hear the term "single source of truth" or the more archaic "master" to describe the one database you can trust for a certain type of core data.
  • Design: Planning the structure of these databases to ensure they can store all necessary data in an organized, efficient manner. This involves determining the schema (the organization of data within the database), the relationships between different data points, and how to handle updates or corrections to the data or to the schema itself (schema migrations). A minimal schema sketch follows this list.
  • Implementation: Writing the code to create these databases and populate them with data. This includes developing micro-services, long-running or serverless applications for data ingestion (extracting data from various sources), data validation (ensuring data accuracy and consistency), and data maintenance (updating with new data, handling corrections or deletions).
  • Derived Data: In addition to the raw data in the masters, you'll also create systems to generate and store data derived from the raw data, such as calculated metrics or aggregated statistics. This could involve writing algorithms to compute financial indicators or perform other analyses based on the security and price information. See the VWAP calculation example below.
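
To make the "Design" point above a bit more tangible, here's a minimal, hypothetical schema sketch using SQLAlchemy's declarative models - one table for the security master and one for daily prices, linked by a foreign key. The column choices are illustrative only, not a claim about what a real security master contains:

import datetime

from sqlalchemy import ForeignKey, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Security(Base):
    # Security master: one row per instrument, keyed by an internal id
    __tablename__ = "security_master"

    id: Mapped[int] = mapped_column(primary_key=True)
    isin: Mapped[str] = mapped_column(String(12), unique=True)  # public identifier
    name: Mapped[str]
    asset_class: Mapped[str]  # e.g. "equity", "fixed_income", "commodity"

class Price(Base):
    # Price master: one row per security per pricing date
    __tablename__ = "price_master"

    id: Mapped[int] = mapped_column(primary_key=True)
    security_id: Mapped[int] = mapped_column(ForeignKey("security_master.id"))
    price_date: Mapped[datetime.date]
    close_price: Mapped[float]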

Work in close cooperation with Portfolio Teams and internal teams to help with implementation of various high value derived data sets

This just means you'll need to talk to people to understand a) what is important to them and b) how to derive it. Once again: never be afraid to ask lots of questions, no matter how basic or stupid they seem. An interesting thing happens when you do this: not only does your own understanding gradually improve, but the people you talk to and the business itself often end up arriving at more clarity and specifics about things that may not have been completely thought through before. The point is: there is TREMENDOUS value in asking basic, outsider questions, for all parties involved (even when it irritates people or takes from their "precious" time.) Develop some skin and keep asking questions.

Derived Data Example: Calculating VWAP in Python

Let me give you a more specific example of Derived Data in the context of these ETL data pipelines. Let's say we are talking about commodities. There will often be some source of pricing data (this is time-series data, or "pricing curves" as we mentioned above), which can be used to price a certain contract at a given point in time, according to the market prices. Separately, the firm would have its own database of volume data, which contains information about what quantities of a given commodity were traded as part of any contract.

These two core streams of data are often combined in various ways.

The Volume Weighted Average Price (VWAP) is a trading benchmark used to calculate the average price a security has traded at throughout the day, based on both volume and price. It's a useful indicator for traders and investors to evaluate the market's direction.

VWAP is calculated by taking the sum of the product of each transaction's price and its volume, divided by the total volume traded over a specific time period.

So, if we wanted to calculate the VWAP for a given security for the day, we would need to perform the following fairly simple steps:

  1. Multiply each trade's price by the corresponding volume to get the total traded value for each transaction. (Here, we are just combining the available source data - price curves and volume.)
  2. Sum all the values obtained in step 1 to get the total traded value for the day.
  3. Sum the volumes of all trades to get the total traded volume for the day.
  4. Finally, divide the total traded value by the total traded volume to obtain the VWAP.

It's one of the simpler formulas in trading, yet quite commonly used.

Here's how we might do this in Python, using Pandas:

import pandas as pd

# Sample data: let's construct a DataFrame with columns for price and volume of trades
data = {
    'price': [100, 102, 101, 103, 102],
    'volume': [500, 300, 400, 600, 200]
}
df = pd.DataFrame(data)

# Calculate VWAP
vwap = (df['price'] * df['volume']).sum() / df['volume'].sum()

print(f"VWAP: {vwap}")
        

This is a tangible example of creating derived data (VWAP) from raw financial data (price and volume of trades).

Now back to that Job description.

Essential Skills Required for a Hedge Fund Software Engineer

The following skills were listed as "Essential" skills required for this job:

3+ years of Python experience

I'm pretty sure people make these numbers up. I completely ignore them. The number of years of experience is not a good metric of anything. You could have dangled your feet in an office for 3+ years. Or you could have taken my 6-month training course and be well ahead of people with 10 times more "experience" than you. You know what I mean. Your job is to demonstrate that you know what you're talking about.

Broad knowledge & experience of database concepts with proficiency in SQL. Knowledge of no-SQL databases and big data technologies.

Of these, I would focus exclusively on SQL. Yes, you need to have a good understanding of how databases work and how to query structured data with SQL. Regarding no-SQL - the mere fact they've used this generic term means that they most likely don't have any specific actual requirements around this, even though they've stuck it under "essential requirements". The term "no-SQL" can refer to all kinds of non-traditional databases, such as document storage systems, time-series databases, graph databases, etc. These are all VERY different beasts that don't have much in common, apart from the fact that they are all databases and don't typically use SQL for queries. So, unless they ask for something more concrete, I would ignore this.

Experience working with cloud or on prem technologies and data infrastructure (workflow management engines, databases, storage & file systems, analytics platforms)

You should have some familiarity with what modern cloud solutions have to offer, and how to interact with these services in Python. If you have some hands-on experience with one provider (e.g. Amazon's AWS, Google's GCP or Microsoft Azure) - those skills are usually fairly directly transferable to any of the other providers as well. And when they say "on prem" they mean you may be required to work directly on servers in their own data centres or their own private cloud. This typically means you may need some exposure to Kubernetes as well. Knowing what a pod is and how to deploy and manage a Dockerized app using kubectl is probably enough.

Knowledge of financial data (security master, tick & bars pricing data, etc.), a plus with experience in Bloomberg, Refinitiv.

Reading this and the previous issue of this newsletter should have equipped you fairly adequately to know what to learn so you can feel confident in ticking this box. As for the experience with the specific finance technology platforms like Bloomberg terminals and the Refinitiv data platform - there's no way to get this without prior experience at a financial institution. The people interviewing you will know this. That's why it's listed as "a plus."

Let me quickly tell you about "tick & bars pricing data" as we haven't encountered this specific term before.

"Tick data" represents the most granular level of pricing information, detailing every individual transaction (or "tick") that occurs, including price and volume at a specific time. It's crucial for high-frequency trading and detailed market analysis.

"Bar data" aggregates information over a specified time period (e.g., 1 minute, 1 hour, 1 day) into "bars" that show the opening price, closing price, high and low prices, and volume for that interval. It's used for technical analysis and trading strategies that don't require fine-grained tick-by-tick detail. If you've ever seen any stock market chart, you were most likely looking at a "candlestick" chart, which is the most common type of "bar data" used in trading. The body of the candlestick represents the opening and closing prices, and the "wicks" sticking out from either end represent the highest and lowest prices at which that stock traded during the period each bar represents.

Experience in processing large and complex datasets

Your skills with Pandas tick this box.

Emphasis on good engineering practices, design, testing and strategic solutions

This is really the most important requirement. Luckily, that's exactly where you shine, right? (Take my training if you don't.)

Strong communication skills

The ability to talk to people and be thick enough to ask the same stupid questions over and over again until it clicks.

Experience in working with Systematic Credit, Commodities or Fixed Income Desk

Ignore. You either have that experience, or you don't. If you do - great. If you don't - no big deal. Apply anyway. Let them know you don't have a finance or trading background early on, but you're a stellar software engineer, so still worth talking to.

BS in Computer Science, Engineering, Statistics, or related discipline with excellent academic credentials

Idk, friends... I did go to uni, but never graduated. Hasn't stopped me from getting some of the highest paying roles in London over and over again for the past 15 years. If you're reading this, you probably DO have some degree, so you're already ahead of me in this respect.

Self-motivated, proactive

Yay! Python ftw.

Ability to perform well under pressure

Screw this. Never perform under pressure. Nothing good comes out of it. Instead, focus on becoming excellent at prioritising what matters. If you can identify the ONE thing that would have the maximum leverage in bringing the most relevant results in any given situation - and if you can be thick enough to IGNORE everything else - you will never be under pressure, and you will be the most valuable person on the team.

Enthusiastic, flexible and adaptable

Smile :) You've got this.

Ability to work effectively and independently as well as part of a team

You really can't go wrong here. Just communicate your strengths. If you work best on your own, say that you're great at working independently. If you're great at teamwork - bring that up.

Ok, enough nonsense, let's go back to the more interesting stuff. We mentioned ETL data pipelines a few times. Here's one modern way of approaching this in Python.

Building State-of-the-Art ETL data pipelines with FastAPI and PostgreSQL

The step-by-step instructions for creating a modern Python ETL data pipeline are as follows. Adapt them as per your specific needs and requirements.

Step 1: Setup Your Development Environment. Install FastAPI and PostgreSQL. Use Poetry to manage your virtual environment and 3rd party Python dependencies (we have covered this in previous issues of this newsletter.)

Step 2: Always start with data modelling. Use SQLAlchemy (or something nicer like SQLModel) to define your data models. These models will mirror your financial data structures, such as securities (in a security master database) and pricing data.
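
We sketched SQLAlchemy models for a security and price master earlier; here's a minimal, hypothetical equivalent for a single pricing table using SQLModel (field names are illustrative), which has the nice property of doubling as a Pydantic model for FastAPI:

from datetime import date
from typing import Optional

from sqlmodel import Field, SQLModel

class PricePoint(SQLModel, table=True):
    # One row of pricing data for one security on one date
    id: Optional[int] = Field(default=None, primary_key=True)
    isin: str = Field(index=True)  # which security this price belongs to
    price_date: date
    close_price: float
    volume: float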

Step 3: Create Async Data Ingestion Functions. Design functions to extract data from various sources. These could be external APIs providing market data, internal databases, or even files. Utilize FastAPI's async-native design to make non-blocking I/O calls to these data sources, ensuring efficient data extraction without waiting on network responses (so you can do more than one thing at a time.) You can design each separate data ingestion model as a FastAPI Background Task or as a Serverless component like an AWS Lambda or an Azure Function.
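
A minimal sketch of an async ingestion function, assuming hypothetical vendor REST endpoints and using httpx for non-blocking HTTP calls (the URLs and response shapes are made up):

import asyncio

import httpx

PRICE_URL = "https://example-vendor.com/api/prices"    # hypothetical
VOLUME_URL = "https://example-vendor.com/api/volumes"  # hypothetical

async def fetch_json(client: httpx.AsyncClient, url: str) -> list[dict]:
    # Non-blocking GET: the event loop is free to do other work while we wait
    response = await client.get(url, timeout=30.0)
    response.raise_for_status()
    return response.json()

async def ingest() -> tuple[list[dict], list[dict]]:
    # Fetch both sources concurrently rather than one after the other
    async with httpx.AsyncClient() as client:
        prices, volumes = await asyncio.gather(
            fetch_json(client, PRICE_URL),
            fetch_json(client, VOLUME_URL),
        )
    return prices, volumes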

Step 4: Implement Transformation Logic. Write asynchronous functions to transform the extracted data. This might involve normalizing data formats, calculating financial metrics, or aggregating pricing data. Use libraries like Pandas or Numpy for complex data manipulation, keeping in mind the separation of heavy processing work if running within limited environments like AWS Lambda.
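
One way to keep heavy Pandas work from blocking the event loop is to push it onto a worker thread with asyncio.to_thread. A minimal, hypothetical sketch (the column names are made up):

import asyncio

import pandas as pd

def normalise(raw_rows: list[dict]) -> pd.DataFrame:
    # CPU-bound Pandas work: coerce types and drop rows we can't use
    df = pd.DataFrame(raw_rows)
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["volume"] = pd.to_numeric(df["volume"], errors="coerce")
    return df.dropna(subset=["price", "volume"])

async def transform(raw_rows: list[dict]) -> pd.DataFrame:
    # Run the blocking work in a thread so the FastAPI event loop stays responsive
    return await asyncio.to_thread(normalise, raw_rows)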

Step 5: Design the Load Mechanism. Develop asynchronous functions to load the transformed data into your PostgreSQL database using asyncpg. Ensure your database schema is designed to efficiently store and query the financial data, taking into account the needs of trading operations.
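
A minimal sketch of the load step with asyncpg, assuming a simplified, hypothetical pricing table and a made-up connection string (for very large batches, asyncpg's copy_records_to_table is a faster alternative):

import asyncpg

async def load_prices(records: list[tuple]) -> None:
    # records: (isin, price_date, close_price) tuples produced by the transform step
    conn = await asyncpg.connect("postgresql://user:password@localhost:5432/marketdata")  # hypothetical
    try:
        await conn.executemany(
            """
            INSERT INTO prices (isin, price_date, close_price)
            VALUES ($1, $2, $3)
            ON CONFLICT DO NOTHING
            """,
            records,
        )
    finally:
        await conn.close()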

Step 6: Automate ETL Processes. Leverage FastAPI's lifespan event hooks and the @repeat_every decorator from fastapi_utils - or an external scheduling service - or a message-passing system such as Kafka or Amazon EventBridge to orchestrate and automate the ETL tasks.
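
As one option among those mentioned, here's a minimal sketch of periodic scheduling using only FastAPI's lifespan hook and plain asyncio (run_etl_once is a placeholder standing in for the extract/transform/load functions from the earlier steps):

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

async def run_etl_once() -> None:
    # Placeholder: extract -> transform -> load, as sketched in the previous steps
    ...

async def etl_loop(interval_seconds: int = 3600) -> None:
    # Re-run the pipeline forever, once per interval
    while True:
        await run_etl_once()
        await asyncio.sleep(interval_seconds)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the background ETL loop on startup, cancel it on shutdown
    task = asyncio.create_task(etl_loop())
    yield
    task.cancel()

app = FastAPI(lifespan=lifespan)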

There are some important design decisions to be made here, according to the specific requirements and constraints of each project. Let me know what details you want me to focus on next time - or if you prefer specific examples of projects I've worked on.
