Deep Reinforcement Learning in Trading
Reinforcement learning is a rapidly advancing technology, inspired by behaviorist psychology, concerned with how agents take actions in an environment so as to maximize some notion of cumulative reward. It is the key driver behind AlphaGo, self-driving cars and many other Artificial Intelligence (AI) applications.
Thanks to open-source frameworks like OpenAI Gym and DeepMind Lab, anyone can create cutting-edge AI agents and test them on thousands of games and real-life scenarios, whether it be playing Super Mario or teaching a robot to walk.
In this simple example our agent (Mario) sees an opponent, and the optimal action is to jump over it; meanwhile, if Mario hits a block or moves forward toward the end of the level, it gets a reward (points or advancing to the next level). The only way to win the game is to collect as much reward as possible without getting killed. How can an agent do this? That's where reinforcement learning comes into play. I won't go into the details and make this article boring, but in short it combines ideas from supervised and unsupervised learning with optimization techniques borrowed from control theory. While supervised learning gives us the power of function approximation, reinforcement learning gives us policies, which tell the agent what actions to take.
The agent (Mario) is our AI. During the learning process the agent observes the environment (the game screen in pixels) and takes random actions in the beginning, and these random actions result in some rewards (from scoring points, rescuing the princess, advancing to the next level...) or in getting killed. Say the agent tests out all the buttons at the start of the game and finds that the "right" button gives it the maximum reward, since it moves forward in the game. Similarly, after each episode of interaction with the environment, our agent learns new maneuvers.
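This observe-act-learn loop is exactly what OpenAI Gym standardizes. Below is a minimal sketch of that loop on Gym's classic CartPole task, with a purely random policy standing in for a trained agent (assuming the classic Gym reset/step API, not the newer Gymnasium one):

```python
import gym

# Minimal agent-environment interaction loop with a random policy as a placeholder.
env = gym.make("CartPole-v1")

for episode in range(5):
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()             # a trained agent would pick its best action here
        state, reward, done, info = env.step(action)   # environment returns the next state and a reward
        total_reward += reward
    print(f"Episode {episode}: total reward = {total_reward}")

env.close()
```

Over many episodes, a learning agent gradually replaces the random action choice with whatever maximizes the cumulative reward.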
This mechanism of learning might sound familiar, because we have experienced it throughout our own existence.
A child (a.k.a. the agent) learns to walk through many failures, but after each failure it is the dopamine and serotonin rush inside its brain, invoked by the curiosity of seeing other two-legged agents walking around, that motivates (rewards) the child to repeat the action until it nails it. Even as we grow up, all our learning, whether at school or at home, has been incentivised by some reward-penalty structure.
It's not just humans; look around and you can see this learning process in all living beings. Reinforcement learning has adopted its principles from nature and neuroscience.
The brain's neuroplasticity is a well-recognized fact: our brain is plastic, and the physical connections between neurons change throughout an individual's life, mainly through learning and experiences from interaction with the external world. Similarly, in reinforcement learning our agent has a brain, built using neural networks. The agent's interaction with the environment changes the weights (or connections) between these neurons, giving it the ability to take actions that maximize the reward.
Well, why not stock markets? If we can build an agent to play a game, let's create an agent to trade the market. However, the market, unlike a game, is filled with randomness and complexity. In a game the states are deterministic: if Mario sees an opponent coming its way, the rational thing to do is jump over it, and the outcome is predictable. In the stock market that is not the case; states are random in nature. For example, an agent might have learned a particular policy from a pattern that occurred in the historical data during training, but it is not necessarily true that the next time this pattern occurs in the market the outcome will be the same; the market carries a lot of uncertainty. But just like with price prediction in general, we don't have to be right all the time. When fused with the right risk management techniques (like position sizing, diversification...), a strategy can be profitable in the long run even if it is not always right.
During my final semester at school, I decided to try reinforcement learning in trading. Long story short, I used a Double-Dueling DQN algorithm (an improvement over the original DQN from the paper "Human-level control through deep reinforcement learning", published by Google's DeepMind to play Atari 2600 games) to design an agent that trades in a single-stock environment.
With a few tweaks and examples from OpenAI's Gym game environments and some other open-source projects (a trading environment framework), I was able to set up a single-security environment that could ingest historical market data for training and testing.
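To give a flavour of what such an environment looks like, here is a stripped-down, hypothetical single-security environment in the Gym style; the class name, long-only position handling and fee logic are simplified assumptions rather than the exact implementation from my repository, and its observation mirrors the simple state described in the first experiment below:

```python
import gym
import numpy as np
from gym import spaces

class SingleStockEnv(gym.Env):
    """Toy single-security environment: the agent steps through a historical price series."""

    def __init__(self, prices, trading_fee=0.001):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.trading_fee = trading_fee
        self.action_space = spaces.Discrete(3)          # 0 = hold, 1 = buy, 2 = sell
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)

    def reset(self):
        self.t = 1
        self.position = 0          # 0 = flat, 1 = long (long-only for brevity)
        self.entry_price = 0.0
        return self._state()

    def _state(self):
        return np.array([self.prices[self.t - 1], self.prices[self.t],
                         self.entry_price, self.position], dtype=np.float32)

    def step(self, action):
        reward = 0.0
        price = self.prices[self.t]
        if action == 1 and self.position == 0:          # open a long position
            self.position, self.entry_price = 1, price
            reward -= self.trading_fee * price
        elif action == 2 and self.position == 1:        # close the position, realize P&L
            reward += (price - self.entry_price) - self.trading_fee * price
            self.position, self.entry_price = 0, 0.0
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done, {}
```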
First experiment:
The state that the agent receives at each step would be:
[price(t-1), price(t), entry price, position]
- price(t) = current price
- price(t-1) = price one time period earlier
- entry price = position entry price
- position = current position (can be long, short or flat)
The agent can take three actions – Buy, Sell or Hold
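With this four-element state and three discrete actions, the Q-network behind a dueling DQN can be sketched as follows. This is a minimal, hypothetical PyTorch version for illustration, assuming a single hidden layer; the actual network and hyperparameters in my project differ:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: separate value and advantage streams, combined into Q-values."""

    def __init__(self, state_dim=4, n_actions=3, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s): how good the state is
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a): how good each action is

    def forward(self, state):
        x = self.feature(state)
        v = self.value(x)
        a = self.advantage(x)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

# Example: Q-values for a single state [price(t-1), price(t), entry price, position]
net = DuelingQNetwork()
q = net(torch.tensor([[101.2, 100.8, 0.0, 0.0]]))
print(q)  # one Q-value per action: hold, buy, sell
```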
However, the agent wasn't learning much from just two price points. When a trader looks at a price chart, they look at 100, 1,000 or the entire price series before taking a position. So why not feed hundreds of past price points? To keep things simple, I decided instead to think in terms of a trader: what does a proprietary trader consider before making a trading decision? Most of the time raw price is just too complicated to decipher, so traders use functions of price and volume commonly known as technical indicators. I decided to stick with three oscillators, which are already stationary and only need scaling before being fed into the neural network.
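As an example of such an indicator pipeline, here is a minimal sketch that computes RSI (one common oscillator, using a simple moving-average variant rather than Wilder's smoothing) and rescales it to [0, 1] before it goes into the network; the price series here is synthetic and purely for illustration:

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index: a bounded (0-100) oscillator of price momentum."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Oscillators are already bounded/stationary, so a simple rescale to [0, 1] is enough
prices = pd.Series(np.random.default_rng(0).normal(0, 1, 300).cumsum() + 100)
scaled_rsi = rsi(prices) / 100.0
```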
The design of the trading environment was such that only realized P&L was passed on as reward, but in the real world a trader also looks at unrealized P&L to decide when to close a position. To imitate this, unrealized return was also included as one of the state features.
New State
[ADX(t), RSI(t), CCI(t), Vol(t), position, unrealized return]
- ADX = Average Directional Index
- RSI = Relative Strength Index
- CCI = Commodity Channel Index
- Vol = Volume
Reward: Realized P&L - Trading Fee - Time Fee
To emulate the real world and to control the agent's actions, I also included a commission and a Time Fee (like the rollover rate in Forex, or the time value of money in general). These rewards are what control and optimize the agent's behavior during the training phase.
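Putting this reward structure into code, the per-step reward can be sketched along these lines; the fee magnitudes are placeholders, not the values used in training:

```python
def step_reward(realized_pnl: float,
                trading_fee: float = 0.0,
                time_fee: float = 0.0) -> float:
    """Reward passed to the agent each step: realized P&L net of costs.

    trading_fee is charged only on steps where a trade is executed;
    time_fee is a small penalty applied every step for holding capital or a position,
    analogous to a rollover rate in Forex.
    """
    return realized_pnl - trading_fee - time_fee

# Example: closing a position for a 1.5 gain, paying 0.1 commission and 0.02 time fee
print(step_reward(1.5, trading_fee=0.1, time_fee=0.02))  # 1.38
```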
This new state-reward structure improved the performance of the agent multifold. But whatever the agent learned wasn't transferable across different securities; in other words, what works for Apple doesn't always work for Microsoft.
The agent was trained and tested on the different stocks in the S&P 500 between 2013 and 2018 with a 3:1 train-test split, and the average results were better than a buy-and-hold strategy over the same period for the respective stocks.
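For reference, a chronological 3:1 split (no shuffling, so the test period always follows the training period) can be done along these lines; the file layout, column names and exact cut-off are assumptions for illustration:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train_fraction: float = 0.75):
    """Split a time-indexed price DataFrame into train/test sets without shuffling."""
    df = df.sort_index()
    cut = int(len(df) * train_fraction)   # 3:1 split -> first 75% of the history for training
    return df.iloc[:cut], df.iloc[cut:]

# Example with daily bars between 2013 and 2018 (hypothetical CSV)
# prices = pd.read_csv("AAPL.csv", index_col="Date", parse_dates=True)
# train, test = chronological_split(prices.loc["2013":"2018"])
```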
There are many improvements that I would make to this project, starting with
- Prioritized experience replay instead of experience replay
- LSTM network instead of Vanilla Neural Network
- Convolutional Neural Network (a bit counter-intuitive since we already have the price, but worth testing to see if it can overcome the memory loss that results from forcing stationarity, an idea from Dr. Marcos Lopez de Prado)
- A3C (Asynchronous Advantage Actor-Critic) algorithm instead of DDDQN.
- Curiosity-driven Exploration by Self-supervised Prediction
- Multi-agent algorithms and many more....
A typical quant strategy workflow involves several risk, portfolio and control steps after the predictive modelling and before execution. Most of the time these modules are loosely coupled and require frequent calibration and optimization over the strategy's life cycle. But with reinforcement learning, we can engineer these constraints directly into the reward and state structure.
A simple example could be an agent managing a portfolio of stocks. A common requirement among investors is a portfolio with a beta close to 0 (uncorrelated with the market), and a popular way to achieve this is a long-short, factor-based portfolio. Using reinforcement learning, we can build a similar portfolio by penalizing the agent during training whenever its beta drifts outside a range close to zero.
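As a rough sketch of how that penalty could be engineered into the reward (the beta band, estimation method and penalty weight here are illustrative assumptions, not a tested configuration):

```python
import numpy as np

def beta_penalized_reward(portfolio_returns: np.ndarray,
                          market_returns: np.ndarray,
                          step_pnl: float,
                          beta_band: float = 0.1,
                          penalty_weight: float = 1.0) -> float:
    """Reward = step P&L minus a penalty whenever portfolio beta drifts outside +/- beta_band."""
    beta = np.cov(portfolio_returns, market_returns)[0, 1] / np.var(market_returns, ddof=1)
    penalty = max(0.0, abs(beta) - beta_band) * penalty_weight
    return step_pnl - penalty
```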
Supervised and unsupervised learning have made great strides in trading, but I believe the next big thing is reinforcement learning; its potential is enormous because it is the closest of these technologies to general intelligence.
To learn more, check out my report and repository.