Data Analysis by Example in Python, BigQuery and Q
In this article, I take a simple, real-life problem and analyze different solutions in Python, BigQuery and Q. The problem is magical in the sense that unraveling it leads us to discover nice areas of these languages, including vector operations, Einstein summation, adverbs and the functional form of select statements. Each solution carries lessons that deepen our IT knowledge, especially if we consider performance aspects as well. I believe that people understand programming concepts better via examples.
Python and Q are two world-class, dynamic programming languages with many similarities and some interesting differences. Python is a general-purpose language with specialized libraries for different fields such as web and mobile development or data analysis. Q is primarily used in the financial industry for data analysis and for developing trading algorithms. On Wall Street, people often use these tools side by side, occasionally together with R and Matlab.
Problem Statement
We have a table of bid prices and sizes of two buyers. Bid price p with size s means that the buyer is willing to buy s units of the product at price p. We have a table of four columns:
- bid prices offered by the two buyers, pA and pB.
- bid sizes, sA and sB.
Our job is to add a new best size column (bS) to the table that holds the size available at the best price. If the two buyers have the same price, then bS is equal to sA + sB; otherwise, we need to take the bid size of the buyer that offers the higher price.
An example table with the desired extra column is below. The best prices are in bold.
All code snippets of this article are available on GitHub. Here I often embed pictures of code to provide the best look-and-feel across mobiles, tablets, and desktops.
The task sounds simple. To appreciate this article, you should stop here and come up with at least two solutions and also try to generalize the task in several ways. The following code lines generate a random table t to play with.
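The code for generating t is embedded as an image in the original post; a rough Python/Pandas equivalent (the row count, the 1-5 price range and the size range are my assumptions) might look like this:

import numpy as np
import pandas as pd

N = 1000000
t = pd.DataFrame({
    'pA': np.random.randint(1, 6, N),    # bid prices of buyer A, between 1 and 5
    'pB': np.random.randint(1, 6, N),    # bid prices of buyer B, between 1 and 5
    'sA': np.random.randint(1, 101, N),  # bid sizes of buyer A
    'sB': np.random.randint(1, 101, N),  # bid sizes of buyer B
})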
Database experts are also welcome to the game. PostgreSQL prepared statement and BigQuery parametric query are given below.
BigQuery function GENERATE_ARRAY cannot produce an array of more than 1048575 elements. If we would like to play with a sample table of, say, 10 billion rows, then we need to call GENERATE_ARRAY twice and do a cross join. Generating a sample table of an arbitrary size greater than 1048575 is not trivial with BigQuery.
An elementary solution
Complex problems need to be decomposed into simpler problems that are easier to solve. So let us forget about the table and take a single row with prices pA, pB and sizes sA and sB. To get the best size, we can come up with a function built from two IF-THEN-ELSE statements (expressed by operator $ in Q).
Function bestSizeIF can be applied to each row of table t.
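For reference, a Python sketch of bestSizeIF and its row-wise application (the exact snippet is on GitHub; this is one possible version):

def bestSizeIF(pA, sA, pB, sB):
    # equal prices: sum the sizes, otherwise take the size of the higher bidder
    if pA == pB:
        return sA + sB
    elif pA > pB:
        return sA
    else:
        return sB

t['bS'] = t.apply(lambda r: bestSizeIF(r.pA, r.sA, r.pB, r.sB), axis=1)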
If the data is stored in a database that supports SQL queries, such as MySQL, PostgreSQL, Apache Cassandra, Vertica, Apache Spark, Google BigQuery or Oracle, then a SELECT statement with a CASE expression does the job.
SELECT *, CASE WHEN pA = pB THEN sA + sB WHEN pA > pB THEN sA ELSE sB END AS bS FROM t
A more elegant solution
The more experienced a programmer you are, the fewer IF statements you use, replacing conditions with proper design. An IF-THEN-ELSE statement creates a condition in the logic that splits the execution path into two branches. The human brain and processor architectures prefer sequential execution, operation after operation, much as we read a book, line by line. Can we get rid of the IF statements completely?
Let us derive a Boolean vector that is true at position i if the i-th price is maximal. We prepare for the general case when we have more than two buyers. We can easily get the maximal price with function max (aka amax in Numpy). The maximal value can be compared to all elements of the price vector. We make use of the vector processing nature of Numpy and Q: in the statement x == y, a vector is returned if either x or y (or both, with the same length) is a vector. Q uses = for comparison.
Based on the Boolean vector we can follow two approaches. On the one hand, we can get the indices of the True values, use them to index the size vector and finally calculate a sum. Again, both Numpy and Q can index a vector with another vector. Furthermore, in both languages the function where returns the list of indices of the True values. Q saves some typing by letting you omit square brackets for indexing.
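A NumPy sketch of this first approach for a single row (p and s stand for the price and size vectors of that row; the function name is illustrative):

def bestSizeWhere(p, s):
    isBest = p == np.max(p)             # Boolean vector: True where the price is maximal
    return np.sum(s[np.where(isBest)])  # index the sizes by the positions of the True values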
On the other hand, we can rely on Boolean arithmetic, in which True acts as one and False acts as zero. We can replace the where statement and the indexing with a simple vector multiplication.
Multiplying two vectors element-wise and then summing the result is a weighted sum (inner product). We can simplify our code by leveraging the built-in weighted sum functions, np.dot/np.inner in Python and wsum in Q.
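A NumPy sketch of the weighted-sum variant, again for a single row (bestSizeDot is an illustrative name):

def bestSizeDot(p, s):
    # True acts as 1 and False as 0, so the dot product is exactly the weighted sum
    return np.dot(s, p == np.max(p))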
The final solution was proposed by András Dötsch. It is elegant and performant. Its efficiency comes from the fact that a weighted sum of vectors of size two can be replaced by a simple addition, and the two weights can be derived by two comparisons.
Observe the different usage of parentheses. There is operator precedence in Python. Developers need to remember that comparison has lower precedence than multiplication. In Q the expression is evaluated from right to left.
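In Python, with the table columns, the expression could read as follows (a sketch; note the parentheses, because * binds tighter than the comparisons):

t['bS'] = t.sA * (t.pA >= t.pB) + t.sB * (t.pA <= t.pB)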
This approach can also be translated to ANSI SQL. Boolean arithmetic is not supported by SQL so we need to cast bool values to integers manually.
SELECT *, sA * CAST(pA >= pB AS INT) + sB * CAST(pA <= pB AS INT) AS bS FROM t
Performance
Let us stop for a bit and compare the performance of the four approaches. The table below contains execution times in seconds on a sample table of size one million. Prices were random numbers between one and five. Tests were executed a hundred times. I used Q version 3.6, Python 3.4.4. Tests ran on a 64 bit, Red Hat Linux server with kernel version 3.10.0 and CPU Intel Xeon E5-2650 v4 @ 2.20 GHz.
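The timing harness itself is not shown in the article; a minimal Python sketch of how such measurements could be taken (bestSize stands for any of the solutions above, and the repetition count matches the hundred runs mentioned):

import timeit

runs = timeit.repeat(lambda: bestSize(t), repeat=100, number=1)
print('min:', min(runs), 'mean:', sum(runs) / len(runs))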
The function with the IF statement is the fastest, as it does not need to create a temporary variable p, which requires memory allocation. Although we played with vector operations, we gain little benefit from them for vectors of size two.
Q is two orders of magnitude faster than Python.
For our dataset and query, Google BigQuery was slower than PostgreSQL on a simple commodity server. The second run of the query returned immediately thanks to BigQuery's caching mechanism. If I added a single new row to the sample table, then the query time increased again.
Execution times scale linearly with the size of the table, i.e. if I double the number of rows, the query execution time also doubles. This does not hold for BigQuery. Its underlying technology, called Dremel, is massively parallel, hence it is less sensitive to the size of the input table. A 100-fold increase in the table size resulted in only a 2-3 fold increase in execution time.
Vector Programming
We took a step toward vector programming by refactoring function bestSize. The inputs of the function are still scalars, and we get the final result by applying a scalar function to four scalars obtained from four lists. If the input table contains one million rows, then we execute function bestSize one million times. Every single function call has a cost. Can we improve our function to accept four lists instead of four scalars and execute the function just once?
Numpy and Q support vector conditionals. Instead of using if and the $ operator for IF-THEN-ELSE, we can use the three-parameter version of np.where in Python and of ? in Q. Operator + works well on vectors and performs element-wise addition in both languages.
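A vectorized NumPy sketch of this (the function name bestSizeCond_V and passing the underlying arrays are my choices):

def bestSizeCond_V(pA, sA, pB, sB):
    # the three-parameter np.where is the vector conditional: an element-wise IF-THEN-ELSE
    return np.where(pA == pB, sA + sB, np.where(pA > pB, sA, sB))

t['bS'] = bestSizeCond_V(t.pA.values, t.sA.values, t.pB.values, t.sB.values)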
Unless the helper function is used multiple times or it helps readability, you can drop it completely and use a simple and stateless expression.
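Dropping the helper, the whole computation could collapse into a single expression, for example:

t['bS'] = np.where(t.pA == t.pB, t.sA + t.sB, np.where(t.pA > t.pB, t.sA, t.sB))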
Vector multiplication is supported by both Numpy and Q. We can rewrite bestSizeMult in two ways: either we store prices and sizes in a list of pairs or in a pair of lists, i.e. in a matrix of size N x 2 or 2 x N. We need to choose the proper shape based on the internal representation of the matrix and the operation we would like to perform on it. Traversing an array is faster if the CPU can process the data sequentially and CPU caching is fully utilized.
Numpy stores matrices in row-major order (aka. C-style), Q stores matrices in column-major order (aka. Fortran style). This implies that in our task, Numpy prefers shape N x 2, Q prefers shape 2 x N. For comparison, I will provide Q syntax and performance metric for the suboptimal solution.
In Python we need to employ the axis parameter of np.amax/np.max and np.multiply instead of * to handle vectors:
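A sketch of the N x 2 matrix version in NumPy (keepdims is my addition so that the row maxima broadcast back against the N x 2 matrix):

P = t[['pA', 'pB']].values                       # prices as an N x 2 matrix
S = t[['sA', 'sB']].values                       # sizes as an N x 2 matrix
isBest = P == np.amax(P, axis=1, keepdims=True)  # row-wise maxima compared element-wise
t['bS'] = np.multiply(S, isBest).sum(axis=1)     # row-wise weighted sum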
In Q we need to use a simple adverb. Function each applies a monadic function on each element of a vector (similarly as function map in Python). Operator * can remain intact - it handles lists and scalars, as expected. To create a list of pairs, we can transpose (function flip) a pair of two vectors.
Working with pairs of lists differs slightly.
In this approach, we rely on the polymorphic nature of Q functions. For example, the function max returns the maximal element if a list is provided as input; for matrices, it returns the maximal value of each column. The same applies to the function sum. We can use the adverb each to apply monadic functions to each row instead of to each column. For binary functions, we use the adverbs each-left and each-right, denoted by \: and /: respectively.
The second version of function bestSizeMult_V saves two expensive matrix transposes and hence runs faster.
We can simplify the code again by merging the multiplication and summation into a weighted sum. In Numpy we need yet another function, einsum, which implements Einstein summation and can express the element-wise product followed by a row sum. In Q we can leverage wsum again, as it automatically applies the weighted sum to corresponding columns if the parameters are two matrices.
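A NumPy sketch with einsum (the subscript 'ij,ij->i' multiplies element-wise and sums along each row; casting the Boolean mask to the size dtype keeps the summation numeric):

P = t[['pA', 'pB']].values
S = t[['sA', 'sB']].values
W = (P == np.amax(P, axis=1, keepdims=True)).astype(S.dtype)  # weights: 1 where the price is best
t['bS'] = np.einsum('ij,ij->i', S, W)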
Function bestSizeDotsch is the only one that does not need to be updated to accept four vectors instead of four scalars. Operations that it uses, i.e. comparison, multiplication, and addition all work seamlessly on scalars and vectors.
Performance
The run times of the vector-based approaches are in the table below.
We can see that the vector-based approach speeds up the Numpy-based solution by two orders of magnitude. The vector conditional features of Numpy and Q are very powerful; they yield super-fast solutions.
Generalization
What happens if a new buyer (denoted by C) shows up on the market? We can get prices from three sources. The super fast IF-THEN-ELSE approach becomes ugly since the number of branches increases exponentially - from three to seven in our case. Similarly, we cannot extend function bestSizeDotsch. In the general problem, we are given a table and a list that stores the names of price columns and another list that stores the names of size columns.
First, we investigate how the two-buyer case changes if the columns are not hardcoded but provided as parameters. Let us stick to the weighted sum solution and do a minor refactoring. Instead of accepting two separate vectors (pA and pB), the function accepts a pair of vectors, i.e. a vector of vectors, aka a matrix.
Generalizing the code from here is trivial. When selecting a list of columns we can provide a list of column names.
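A Pandas/NumPy sketch of the general version (priceCols and sizeCols are the parameter lists of column names; the three-buyer call is just an example):

def bestSizeGeneral(t, priceCols, sizeCols):
    P = t[priceCols].values
    S = t[sizeCols].values
    W = (P == np.amax(P, axis=1, keepdims=True)).astype(S.dtype)  # 1 where the price is best
    return np.einsum('ij,ij->i', S, W)

t['bS'] = bestSizeGeneral(t, ['pA', 'pB', 'pC'], ['sA', 'sB', 'sC'])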
Q allows developers to follow the same approach that Python does. Furthermore, we can use select/update statements, which often provide faster and more readable code than the Pandas equivalent. For completeness, I also provide a general version of the select/update-based solution.
In Q the column names are part of the update statement. If you need the flexibility of accepting a symbol list like `pA`pB`pC and you need an expression (pA; pB; pC) then you need to employ an advanced technique called the functional form of an update. In reality, every select and update statement translates to a functional select/update statement. For example, the simple update statement
translates to
The exclamation mark with four parameters stands for the update statement. The first parameter is the table, the second is the list of WHERE constraints (an empty list if there is no filtering). The third parameter is the group-by option, false if grouping is not required. The last parameter is a dictionary whose keys are the new column names and whose values are the parse-tree expressions that compute the new columns. In our example, we update a single column. You can create a list of size one with function enlist.
We can get the functional form of a select/update statement by employing helper function parse or buildSelect.
Without in-depth knowledge of parse trees, we can figure out that we simply need to prepend function enlist to the lists of price and size column names to get the functional form of the update statement we are looking for.
What about ANSI SQL?
Vector operations and Boolean arithmetic are not supported by SQL, so we need to implement the weighted sum manually and convert Boolean values to integers.
WITH t_ext AS ( SELECT *, GREATEST(pA, pB, pC) AS bP FROM t ) SELECT *, (sA * CAST(pA = bP AS INT) + sB * CAST(pB = bP AS INT) + sC * CAST(pC = bP AS INT)) AS bS FROM t_ext
The code is lengthy and harder to maintain. It is more prone to error and becomes cumbersome as the number of columns increases.
Performance
In my final experiment, I checked how the two languages scale if the number of columns increases. Numpy and Q scale in a similar way and Q is a few times faster than Numpy.
Optional bidding
Experienced data analysts know that a slight modification of the problem statement may require a completely new approach. Let us stick to the general case and have multiple buyers. The only change is that bidding is optional. We can represent this either by denoting a missing bid with a null value or by storing prices and sizes in variable-sized arrays. If the number of potential buyers is much larger than the average number of bidders, then the second approach is more memory efficient. Let us assume this data representation.
In the new table, we have single price and size columns that contain variable-length arrays. Code to create sample tables is below.
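A rough Pandas equivalent of the sample-table code (the number of bidders per row and the value ranges are my assumptions):

N = 100000
nBids = np.random.randint(1, 4, N)   # 1 to 3 bidders per row
t = pd.DataFrame({
    'price': [np.random.randint(1, 6, k) for k in nBids],
    'size':  [np.random.randint(1, 101, k) for k in nBids],
})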
BigQuery code is a bit lengthier.
Several systems support array columns, including Pandas, Q, PostgreSQL, Cassandra and the Standard SQL of Google's BigQuery platform. There is a huge difference in how these systems support operations on array columns. For example, in Cassandra you can only retrieve the complete column, append a new element to it or filter based on content, and filtering requires creating an index. In PostgreSQL, we can do more, e.g. fetch the fifth element of each array. Array handling in various database systems deserves a separate article. Let us go back to the original problem.
The major difference in the new setting is that the prices and sizes are no longer represented by matrices of size N x M, in which M denotes the number of buyers. Instead, prices are stored in an array of arrays of different lengths. We need to forget about matrix operations, hence drop the Numpy functions amax and einsum and fall back to the slow apply function.
t['bS'] = np.multiply(t['size'], t.price.apply(lambda r: r == r.max())).apply(sum)
In Q, we never really required the price and size arrays to be matrices. The each adverb (denoted by ') goes through each element like apply, but much faster.
t: update bS: size wsum' price = max each price from t
Google enters the scene
Google extended ANSI SQL by supporting arrays and dictionaries inside cells. The language is called Standard SQL. Many new functions are available to analyze array columns. Let us see Google's solution to the problem.
WITH t_ext AS ( SELECT *, (SELECT MAX(p) FROM UNNEST(price) as p) AS bestPrice FROM `myproject.t`) SELECT *, (SELECT SUM(s) FROM UNNEST(price) AS p WITH OFFSET JOIN UNNEST(size) AS s WITH OFFSET USING(OFFSET) WHERE p = bestPrice ) AS bS FROM t_ext
Some explanation. First, we create a temporary table t_ext that extends the original table with a single new column, the best price. Then we convert the array column into ordinary rows with the UNNEST operator. The nested SELECT statement with UNNEST allows aggregation and filtering of the array column. UNNEST returns the elements in arbitrary order, so we use the virtual column OFFSET to keep prices and sizes properly aligned with each other.
It is interesting to see the Standard SQL solution when the elements of the price and size columns are atomic and there are multiple price and size columns (e.g. pA, pB, pC and sA, sB, sC). Google's solution is to convert that data representation to the array column representation.
WITH t_ext AS ( SELECT * EXCEPT(arr), ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS prices, ARRAY(SELECT CAST(val AS INT64) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET >= ARRAY_LENGTH(arr) / 2) AS sizes, (SELECT MAX(CAST(val AS INT64)) FROM UNNEST(arr) val WITH OFFSET WHERE OFFSET < ARRAY_LENGTH(arr) / 2) AS bestPrice FROM ( SELECT *, REGEXP_EXTRACT_ALL(TO_JSON_STRING(T), r'(?:"(?:pA|pB|pC|sA|sB|sC)"):(\d+)') AS arr FROM `myproject.t` AS T ) ) SELECT * EXCEPT(prices, sizes), (SELECT SUM(s) FROM UNNEST(prices) AS p WITH OFFSET JOIN UNNEST(sizes) AS s WITH OFFSET USING(OFFSET) WHERE p = bestPrice ) AS bS FROM t_ext
Producing the temporary table t_ext looks scary. How does the first part work? In the innermost SELECT statement, we wrap all columns into a single column: function TO_JSON_STRING converts each row to a JSON string that can easily be processed by REGEXP_EXTRACT_ALL. This function creates an array column and fills the array with the values that match the pattern. Prices are in the first part of the array, sizes in the second. In the outer SELECT, we split the array column into two array columns using the virtual column OFFSET and the length of the array.
Google made a huge step towards extending the functionality of ANSI SQL. It is possible to achieve what was not possible before. Unfortunately, BigQuery is still a far cry from Pandas and Q when it comes to elegance and simplicity.
Performance
The performance comparison is below. Again, Q is the fastest with a significant margin.
Python and Q execution times scale linearly with the size of the input table. BigQuery is less sensitive to the input table size. For tables larger than 200M rows, BigQuery becomes faster than Q.
Conclusion
I've presented a simple data analytics problem and walked through solutions in Python, Q and SQL at various expertise levels. Junior developers are able to come up with a simple and readable IF-THEN-ELSE based solution, which is two orders of magnitude faster in Q than in Python. Intermediate developers who know vector operations come up with an even faster and still readable solution. The Numpy and Q solutions are very similar both in syntax and in performance.
Solving the general problem requires Boolean arithmetic and vector operations. In Q, we can keep using operators like *, max and wsum. In Numpy, we need to drop * and dot and pick up multiply and einsum as soon as we move from simple vectors to vectors of vectors. This resonates with the different design principles of Q and Python. In Q, there are a few powerful and orthogonal tools which you can plug together to solve complex problems. In Python, there are far more functions and parameters, and orthogonality is less of a concern. Also, developers need to type more.
Q also allows developers to use select/update statements. If the column names come from parameters, then we need to know the functional form of the select/update statement. With the help of parse and buildSelect, even an intermediate Q developer can come up with the proper solution. However, programmers with expertise in other common programming languages are less likely to be able to read, support and modify the Q code.
Pandas can only handle in-memory data sets. Q is not limited by the memory of the box: data can also reside on disk, and the same select query can be used. If the data is segmented, then Q (more precisely kdb+) takes advantage of parallel I/O and processes segments concurrently.
Similarly, a table in BigQuery is not limited by the size of available memory. BigQuery is cloud-based and backed by tens of thousands of servers. Jumping from 100M rows to 10B in the input data is almost seamless, and even the execution times hardly increase.
Acknowledgment
I would like to thank Péter Györök and Andrew Wilson - two black belt Q masters - for their valuable feedback. Also, suggestions on Google's Standard SQL from Mikhail Berlyant are much appreciated.