The power of R for trading (part 1)
R is an object-oriented programming language and work environment for statistical analysis. It is not just for programmers, but for everyone conducting data analysis, including portfolio managers and traders. Even with limited coding skills, R outclasses Excel spreadsheets and boosts information efficiency. First, like Excel, the R environment is built around data structures, albeit far more flexible ones. Operations on data are simple and efficient, particularly for import, wrangling, and complex transformations. Second, R is a functional programming language. This means that functions can use other functions as arguments, making code succinct and readable. Specialized “functions of functions” map elaborate coding subroutines to data structures. Third, R users have access to a repository of almost 15,000 packages of functions for all sorts of operations and analyses. Finally, R supports a vast array of visualizations, which are essential in financial research for building intuition and trust in statistical findings.
See the full post on the Systemic Risk and Systematic Value site.
What is R?
The R project for statistical computing provides the leading open-source programming language and environment for statistical analysis. It has been developed by academics and statisticians for over 25 years.
- As a language, R supports effective object-oriented programming, including all the usual features such as conditionals, loops, and user-defined recursive functions. Unlike Python, it is not a general-purpose language but is heavily geared towards statistical work.
- As a work environment, R offers “an integrated suite of software facilities for data manipulation” or “an environment within which statistical techniques are implemented”.
Put simply, R can do whatever a spreadsheet can do, but much faster and far more efficiently, and it extends to vastly more applications. The primary benefits of R are data wrangling (making untidy data usable), data transformation (building customized data sets), data analysis (applying statistical models), all forms of visualization, and machine learning. The default workspace or integrated development environment (IDE) for R is RStudio. Among the many online resources for learning R, the data science courses of DataCamp stand out.
What makes R powerful for macro trading?
Most macro traders or portfolio managers rely on quantitative statistical analysis, typically in the form of charts (often inside Bloomberg, Reuters Eikon and so forth), calculators and spreadsheets. As responsible trading is demanding full-time work, senior traders often lack the time and experience to code up their analytical tools to professional standards. Even cooperation with desk quants that offer programming support can be difficult, as many traders do not wish to reveal their personal methods and struggle to translate their needs into suitable instructions for programmers.
R makes statistical programming and data science accessible. In particular, R is not just for programmers but for all finance professionals with some interest in statistical analysis. That is because R can initially be run interactively with a limited set of basic commands. In some sense, R can be used by nonprogrammers much like a sophisticated calculator. Even short snippets of code can go a long way in performing operations that would be very tedious in Excel. This means that R can be deployed with minimal programming skills and typically enhances the information efficiency of the investment process quickly. Interest in programming and advances in coding skills then follow almost naturally.
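As a minimal sketch of this calculator-like use, the snippet below computes simple returns and summary statistics from a made-up price series; all numbers are purely illustrative.

```r
# A made-up price series, typed in interactively
prices <- c(100, 102, 101, 105, 107)

# Simple one-period returns: differences divided by lagged prices
returns <- diff(prices) / head(prices, -1)

# Summary statistics, one short command each
mean(returns)
sd(returns)
```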
A personal analytical framework in R is highly extensible and goes far beyond the capacity of Excel spreadsheets. This is because it allows far greater creativity and far more data to be used (view post here).
The power of data structures
The R language and environment are built around data structures:
- The main homogeneous multidimensional data structure in R is the array, which is just a generalization of vectors and matrices. “Homogeneous” here means that the array contains just one type of data, such as numeric. Unlike in an Excel spreadsheet, it is easy and quick to perform a wide range of mathematical operations on one or more of these arrays. Whether one adds two numbers or two large, equally-shaped arrays makes practically no difference in R.
- The main heterogeneous data structure is the data frame. It is generally a two-dimensional structure, much like a data table. “Heterogeneous” here means that different columns can contain different types of data, such as numerics, dates, factors, character strings and so forth. Under the hood, a data frame is a list of equal-length vectors, the latter being the columns of the frame. As with arrays, it is easy to perform logical and mathematical operations on a data frame, albeit with more restrictions due to the different data types involved.
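A minimal sketch of both structures, with made-up values:

```r
# Homogeneous structure: element-wise maths on equally-shaped arrays
a <- matrix(1:4, nrow = 2)   # 2x2 numeric matrix
b <- matrix(5:8, nrow = 2)
a + b                        # adding two arrays is as easy as adding two numbers
a * 2                        # every element is scaled at once

# Heterogeneous structure: a data frame with mixed column types
df <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-01-02")),  # dates
  asset = c("EURUSD", "EURUSD"),                   # character strings
  ret   = c(0.001, -0.002)                         # numerics
)
df$ret * 100                 # mathematical operations work on the numeric column
```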
Importing data structures into R mostly relies on two types of tools. The first is web APIs (application programming interfaces) that link the local R environment with external databases, such as Bloomberg, DataStream, Macrobond and the data services of investment banks. The second is special R functions that read and import data files into the environment, including from Excel spreadsheets, csv files, and SQL databases. A particularly useful set of functions is provided by the readr package, which supports the customized import of all sorts of rectangular data.
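For instance, a customized csv import with readr might look like the sketch below; the file name and column layout are hypothetical.

```r
library(readr)

# Hypothetical csv file: a date column plus one numeric column per market
returns <- read_csv(
  "daily_returns.csv",                          # illustrative file name
  col_types = cols(
    date     = col_date(format = "%Y-%m-%d"),   # parse dates explicitly
    .default = col_double()                     # read all other columns as numeric
  )
)
```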
R offers a whole host of techniques for the immensely important job of data wrangling, i.e. the transformation of raw irregular data into a clean, tidy data set. The tidyr package provides functions for reshaping imported data into a standardized format that is conducive to standard operations, estimation and analysis, particularly for the other packages of the tidyverse (a collection of standard R packages for data science). An example of this reshaping step follows the list of conditions below.
A tidy data set meets the following conditions:
- Each column represents one variable, whose values represent a single attribute across observational units. This could be the daily returns of a specific asset or the values of a specific business survey.
- Each row represents one observation of these variables. In macro-finance data structures the rows typically span time periods, markets or currency areas.
- There is only one type of observational unit (e.g. years, months, cross-sections) per table. If one wishes to investigate data sets with different observational units, such as higher frequencies of observations, one should use different tables.
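As a sketch of the reshaping step referenced above, tidyr's pivot_longer() turns a hypothetical wide table (one column per currency area) into tidy form:

```r
library(tidyr)

# Hypothetical wide table: one return column per currency area
wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02")),
  USD  = c(0.001, -0.002),
  EUR  = c(0.000,  0.003)
)

# Tidy form: each column one variable, each row one observation
tidy <- pivot_longer(
  wide,
  cols      = c(USD, EUR),
  names_to  = "currency",
  values_to = "return"
)
```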
The R language is geared towards the manipulation of tidy data structures, much more so than Python. Selecting and subsetting data structures simply requires position indices, names or logical conditions. Going beyond basic operations, the dplyr package supports a wide range of manipulations of tidy data tables (a combined example follows this list), particularly:
- the manipulation of cases (the rows of a data table) in the form of summaries, such as means and standard deviations;
- the grouping of cases, for example into specific time periods;
- the extraction of special cases, based on certain conditions or through random sampling;
- the arrangement of cases, for example in the form of league tables based on the values of one variable;
- the manipulation of variables (the columns of a table), i.e. transforming variables or calculating new variables based on existing ones.
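The sketch below combines these dplyr verbs on a hypothetical tidy panel of returns; all names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical tidy panel: daily returns per currency area
panel <- data.frame(
  date     = rep(as.Date(c("2020-01-01", "2020-01-02")), each = 2),
  currency = rep(c("USD", "EUR"), times = 2),
  return   = c(0.001, 0.000, -0.002, 0.003)
)

panel %>%
  mutate(abs_return = abs(return)) %>%       # manipulate variables
  filter(abs_return < 0.05) %>%              # extract cases by condition
  group_by(currency) %>%                     # group cases
  summarise(mean_ret = mean(return)) %>%     # summarise cases
  arrange(desc(mean_ret))                    # arrange cases (a league table)
```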
Most data used for macro trading are time series. The xts (eXtensible Time Series) package has been developed for just this purpose. The package supports a special object class and functions for uniform handling of many R time series classes. An xts object is effectively an extension or special class of the zoo object (a class of indexed, totally ordered observations). Its practical benefits include easy conversion and reconversion from and to other classes and bespoke functionality for time series data. Key advantages of xts objects include reliable implementation of time lags, easy and intuitive subsetting with date strings, easy extraction of periodicity and time stamps, and consideration of different time zones. A complementary package for specialized operations on dates and times is lubridate, which handles time zones, leap days and daylight saving time.
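A minimal sketch of these xts conveniences, with a made-up daily series:

```r
library(xts)

# Made-up daily series as an xts object
dates  <- as.Date("2020-01-01") + 0:4
prices <- xts(c(100, 102, 101, 105, 107), order.by = dates)

prices["2020-01-02/2020-01-04"]   # intuitive subsetting with date strings
lag.xts(prices, k = 1)            # reliable one-period time lag
periodicity(prices)               # extract the periodicity of the series
```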
Finally, the popular data.table package allows efficient operations on data structures with short code, particularly subsetting, grouping, updating, and univariate variable transformations. Hence, it is particularly suitable for extracting analytical summaries from large databases. The objective of the package is to reduce programming and computing time.
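The package's concise DT[i, j, by] syntax can be sketched as follows, with made-up data:

```r
library(data.table)

# Made-up panel as a data.table
DT <- data.table(
  currency = c("USD", "USD", "EUR", "EUR"),
  return   = c(0.001, -0.002, 0.000, 0.003)
)

# Subset (i), compute (j) and group (by) in one short expression
DT[return > -0.01, .(mean_ret = mean(return), n = .N), by = currency]
```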
The power of functions
Even non-programmers eventually build their own functions to perform special operations in different contexts. Functions reduce the quantity of code and the scope for errors. They can also make the intention of code much clearer. As a rule of thumb, a snippet of code should be turned into a function if it is copied and pasted more than two times. It is often best practice to start by [1] solving a specific simple example problem with a snippet, [2] testing and cleaning up the snippet, and then [3] transferring the clearly written working snippet into a function template.
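A sketch of this workflow, using a hypothetical z-score transformation:

```r
# [1] A snippet that solves one concrete problem: standardise one series
x <- c(0.5, 1.2, -0.3)
(x - mean(x)) / sd(x)

# [2]/[3] After testing and clean-up, the snippet goes into a function template
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

zscore(c(0.5, 1.2, -0.3))   # same result, now reusable everywhere
```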
Importantly, R is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions. In particular, R has first-class functions. This means that one can do anything with functions that one can do with data structures, including [i] assigning them to variables, [ii] storing them in lists, [iii] passing them as arguments to other functions, [iv] creating them inside functions, and [v] returning them as the result of a function.
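A brief illustration of these first-class properties (the names are, of course, made up):

```r
# [i] Assign a function to a variable
avg <- mean

# [ii] Store functions in a list
stats <- list(avg = mean, vol = sd)

# [iii] Pass a function as an argument to another function
sapply(list(a = 1:5, b = 6:10), stats$avg)

# [iv] and [v] Create a function inside a function and return it
power <- function(p) function(x) x^p
square <- power(2)
square(4)   # returns 16
```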
Functional programming simply uses functions as arguments in other functions. It is typically an alternative to for loops and preferable when for loops would obscure the purpose of code with repetitive standard procedures. For example, if a macro trading strategy requires a special way of transforming market or macroeconomic data and if that transformation has been captured in a function, this transformation can be applied efficiently and with little code to all relevant data collections.
In particular, a functional is a function that takes another function as an argument. Functionals make code more succinct. As a rule, functionals are preferable to explicit “for loops” because they express a high-level goal clearly. Functionals reduce bugs by better communicating intent. Most importantly, the functionals implemented in base R are well tested and efficient, because they are used by so many people.
There are two popular sets of functionals in R.
- The first is the apply family. These are really equivalent to “for loops” and no more difficult to use. The apply functions just take a collection of data, apply the input function to every element of the collection in turn, and then store the results in another collection.
- Similarly, the map family of functionals from the purrr package generally iterates the application of functions over a wide range of data structures (a short example of both families follows this list).
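In the sketch below, the same summary function is applied over the columns of a hypothetical data frame of returns, first with sapply() and then with purrr's map_dbl():

```r
library(purrr)

# Hypothetical data frame: one numeric column per exchange rate
df <- data.frame(EURUSD = rnorm(100, sd = 0.01),
                 USDJPY = rnorm(100, sd = 0.01))

# apply family: volatility of every column, collected into a vector
sapply(df, sd)

# purrr map family: the same iteration with type-stable numeric output
map_dbl(df, sd)
```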
The power of packages
In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. At present, there are almost 15,000 packages available on the “Comprehensive R Archive Network” (CRAN). The ability to find a suitable package for the job at hand can save substantial programming resources.
Moreover, portfolio managers can create their own packages, maybe with the help of a more experienced R programmer. In some sense, the creation of a custom package is just the natural progression of creating custom functions. An in-house package typically improves the documentation of such functions and makes them easier to use and to share.
The power of visualization
Visualization is the key link between scientific data analysis and decisions in macro portfolio management. Confidence in data-based decisions requires a good intuitive understanding of the data and trust in statistical findings. Graphics support intuition and trust better than words.
The R base package provides a range of convenient graphical functions for a “quick and dirty” visualization of data, often in the context of exploring a data set. Many are executed through the generic plot() function. A helpful overview can be found in R Base Graphics: An Idiot’s Guide. Base graphics are usually used for quick exploratory charts, as in the sketch below.
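A minimal sketch, with simulated data, of such quick exploratory charts:

```r
# Simulated random-walk series for illustration
x <- cumsum(rnorm(100))

plot(x, type = "l", main = "Simulated series", xlab = "Day", ylab = "Level")
hist(diff(x), main = "Distribution of daily changes", xlab = "Change")
```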
For more flexible and advanced visualization, the ggplot2 package provides a system for creating graphics. It is based on “The Grammar of Graphics”, a set of rules derived from the idea that one can build every graph from the same few components: a data set, a set of geoms (geometric objects that represent data points) and a coordinate system. Visualizing data with ggplot2 regularly involves three steps: [1] setting the links between data and plot elements (“aesthetic mappings”), [2] specifying the general type of plot (“geom”) to be used, and [3] adding detail such as “graphical primitives” and other layers.
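The three steps can be sketched as follows, with a hypothetical tidy panel of simulated returns:

```r
library(ggplot2)

# Hypothetical tidy panel: simulated daily returns for two currencies
panel <- data.frame(
  date     = rep(as.Date("2020-01-01") + 0:49, times = 2),
  currency = rep(c("USD", "EUR"), each = 50),
  return   = rnorm(100, sd = 0.01)
)

ggplot(panel, aes(x = date, y = return, colour = currency)) +  # [1] aesthetic mappings
  geom_line() +                                                # [2] geom: line chart
  labs(title = "Daily returns", x = NULL, y = "Return")        # [3] added layers/detail
```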
There is hardly a relevant visualization that ggplot2 cannot do (except maybe the manual drawing of trend elements in time series charts that is such a popular feature on Bloomberg and Reuters Eikon). A collection of the top 50 ggplot2 visualizations with related code can be found on r-statistics.co by Selva Prabhakaran, many of which have relevance for macro trading.
(A second part of the post focusing on statistical inference and learning will follow).