Using Data Models and Algorithms
Eduardo Ed Eddie Contreras
Sr. Data Modeler/Analyst & Data Engineer; works with statistical regression and big data in notebooks (JupyterLab) using Python, R, or Scala, on AWS and Azure (Databricks, Snowflake) and in data warehouses (AWS Redshift, Azure Synapse)
Algorithm Uses, Common Business Practices
This edition (Volume IV) of the newsletter brings us to a discussion of algorithms alongside data modeling and applicable use cases. The data model captures information that summarizes key metrics about an organization across a variety of industries and business functions, including Research and Development (R&D) and Operations. It conveys a static view of processes, costs, prices, and revenues at a given moment of interest: close of business (COB), end of month (EOM), or other. The fundamental questions from business owners may be answered in a presentation that shows details about outputs, quantities, and related matters. Algorithms, in tandem with existing data, can simplify predictive analytics and thereby provide an important opportunity to look behind the curtain where data is stored and left to gather in breadth and depth.
Background
Often at the end of a day, customers, operations staff, and management quickly scatter off to their respective places of import, only to find that this routine has to be interrupted from time to time by the generation of analysis and reporting required by rules and regulations, the government, and curious top-level management. Simply totaling the week's sales can be an arduous task, and calculating the average sale by product a time-consuming consideration. In other words, data is often left untouched, or stored somewhere in the confines of an Excel workbook, a SQL database, or last month's report. Have you ever just adjusted the report from last month based on word of mouth? Surely others, if not yourself, rely on word of mouth or inventory to deduce the percent of stock on hand, or how today's sales compare to last year's. However, the details can be modeled to simplify and synchronize reporting, which then becomes less ad hoc and more routine.
In this newsletter we have relied on simple examples, sales by menu item, for instance, which draw on data gathered from food and beverage managers and the respective shift management. Food was tallied by shift, and the sales generated were copiously reported. In fact, the tabular format that shows totals by food item and day of the week clearly captures the fundamentals of data modeling: gather, summarize, and visualize the data.
Moreover, the conclusions drawn from the tabularization of data were less easy to decipher. While expensive items generally generated the most revenue, it was not the case that those items were the most popular. We concluded that in order to improve the bottom line, a deep dive into which items could sell better, and more importantly why, had to be paramount for future consideration.
Methodology
Data suggests that some menu items were conducive to the purchase of strong alcoholic beverages, including wine. Meanwhile, other items were conducive to the purchase of fewer items; filling, low-cost items (neutral in price) generated less revenue for beverages and desserts. A thorough data analysis of restaurants beyond the given institution in question, or of the market (similar restaurants statewide, for example), might benefit the local establishment. A thorough data analysis would include modeling from greater counts of meals and dishes to discover whether or not behaviors assumed to be true were actually reported.
Statistical regression, or the consideration of correlation and theory, would hash out further whether the consumption of wine and cocktails was correlated with the purchase of less fatty or less common items such as sandwiches or pasta. Chi-square and t-tests are employed to compare data across various domains, in this case points of service (restaurant POS machines). Moreover, the data might show other conclusions after additional theory is introduced.
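As a sketch of what such tests look like in practice, here is a minimal Python example using SciPy; the ticket counts and per-ticket revenues below are invented for illustration, not the restaurant's actual figures.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: two POS terminals x (wine ordered, no wine)
observed = np.array([[42, 58],   # terminal A
                     [25, 75]])  # terminal B
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p:.4f}")

# Hypothetical per-ticket revenue for two menu items (sandwich vs. pasta)
sandwich = np.array([18.5, 22.0, 19.75, 25.0, 21.5])
pasta = np.array([24.0, 27.5, 23.0, 29.0, 26.5])
t, p_t = stats.ttest_ind(sandwich, pasta, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p-value = {p_t:.4f}")
```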
Algorithms for popularity might be employed that show how popular an item was. If an item was popular, would that suggest that the restaurant was also close to revenue optimization? In other words, a popular item might generate higher total meal revenue if that item was conducive to greater purchases of wines and desserts. If, on the other hand, the algorithm for popularity has more to do with how quickly and how simply the dish was prepared, then perhaps adjusting a marketing strategy to attach more expensive sides and desserts to it would be in order. Algorithms employ multiple characteristics, or variables, to draw conclusions from data. In so doing, the characteristics better depict facts or totals. One theory might be that having more fast-food items during peak television or movie times enhances the bottom line, because families and their children are rushing off to watch Star Wars on Amazon Prime at 8 or a sporting event at 7:30. Thus an algorithm for popularity might include not just the quantity sold but the quantity sold by hour.
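To make that concrete, here is a minimal sketch of a popularity score that weights quantity sold by hour; the items, hours, and weights are assumptions for illustration only.

```python
import pandas as pd

# Invented ticket data: item sold, hour of sale (24h clock), quantity
tickets = pd.DataFrame({
    "item": ["burger", "burger", "pasta", "burger", "pasta", "salad"],
    "hour": [19, 20, 19, 20, 18, 19],
    "qty":  [3, 5, 2, 4, 1, 2],
})

# Weight peak hours (say, 19:00-20:00, just before the 8 pm movie) more heavily
peak_weight = tickets["hour"].isin([19, 20]).map({True: 1.5, False: 1.0})
tickets["weighted_qty"] = tickets["qty"] * peak_weight

popularity = tickets.groupby("item")["weighted_qty"].sum().sort_values(ascending=False)
print(popularity)  # popularity score per item, with peak hours counted extra
```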
In such cases the containerization or compartmentalization of data, what I call a three-dimensional (3-D) view, comes into focus. Two-dimensional models are standard and, of course, effective. They compare data on an x- and y-axis, with sales on one and product on the other. For example, in a discussion of the restaurant example, one would be interested in menu items, and thus in a table such as the one shown below:
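(The table itself appears as an image in the original newsletter. As a stand-in, here is a minimal sketch of building such a two-dimensional view in pandas; the categories echo the discussion that follows, but the revenue figures are invented.)

```python
import pandas as pd

# Invented one-evening revenue by menu category, in USD
sales = pd.DataFrame({
    "category": ["Appetizer", "Main Course", "Dessert", "Drink"],
    "revenue":  [450.0, 2100.0, 430.0, 440.0],
})
print(sales.set_index("category"))  # a simple x/y view: category vs. revenue
```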
At this particular restaurant and on this particular evening, one might be amazed that a great majority of the revenue came from the Main Course, suggesting that desserts and drinks were not the focus of financial attention. The two-dimensional chart allows us to quickly surmise that appetizers, drinks, and desserts were equally weighted in comparison to the Main Course, or equally popular. At the very least, a manager might want to know what she can do to get the dessert orders up and the drink orders to meet or exceed the revenue from the appetizers.
In a high-volume and exhausting environment, just getting the figures tabulated for ownership required an extra few minutes of time at the close or by the start of the next business day. "Enough already" was the response from the team involved.
However, a simple tweak to the analysis might be in order. If the managers can assign a responsible staff member to focus on an individual item, then perhaps a clearer picture would ensue. In other words, what was the most popular appetizer and what was the most popular dessert would not be easy to measure using the two-dimensional model. No, in fact a hard-working restaurant employee would have to shuffle through loads of tickets to tabulate such information, which is not even presented in the aforementioned two-dimensional model.
I.T. quickly got wind of the questions and issues, as the restaurant staff approached her with the question. I.T. received a set of data points from the kitchen and the host stand:
Evidently management had been asked to track the main dishes and desserts one night, and the above was captured. For ease of visualization, I.T. employed a histogram in a JupyterLab notebook running under Anaconda; the pandas and NumPy libraries also came in handy (Anaconda allows Windows users to run Python, which is normally hosted in a Linux environment).
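A minimal sketch of what that notebook code might have looked like follows; the price points and quantities are invented stand-ins for the kitchen and host-stand data, and the actual chart appears as an image in the newsletter.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented orders by category and price point (USD)
orders = pd.DataFrame({
    "category": ["dessert"] * 3 + ["main"] * 3,
    "price":    [6, 8, 10, 15, 22, 28],
    "qty":      [40, 38, 41, 120, 30, 25],
})

# Bar/histogram view: quantity ordered at each price point
fig, ax = plt.subplots()
ax.bar(orders["price"].astype(str) + " USD", orders["qty"])
ax.set_xlabel("price point")
ax.set_ylabel("quantity ordered")
ax.set_title("Orders by price point (hypothetical data)")
plt.show()
```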
Et voilà. Here we are:
Immediately one noticed that price was no concern for the dessert fan such as you or I! Customers were equally interested in the desserts regardless of price. As for the main entrées, one might conclude that the most popular dish was priced at 15 USD and represented over two thirds of the items sold. In other words, if as a restaurant owner I was concerned with how well my dishes and desserts were doing, I would argue that perhaps something can be done to bring my higher-priced items closer to parity with the lower-priced ones. However, I am also now concerned with how I can increase consumption of desserts. Perhaps lowering the price will increase the quantity and the bottom line.
My lead waiter indicated he needed to know more. So the trusty I.T. person showed what results he could, based on tickets for the appetizers and beverages, thanks to his contacts at the bar. Here we see what was discovered:
In this department, the bartender can tell you she sees that the beer (priced at 5 USD) was ordered more frequently than the cocktail or the wine, but that regardless of price her costlier appetizer was doing better than the modestly priced one (in this case the wings outperformed the soups). Using a simple mean function, the I.T. department noted that increasing the quantity of items sold at a price equal to or higher than the mean price, for example, would suggest profit was better maximized. In other words, without a breakdown by item and quantity, he or she is hard-pressed to understand how to optimize price. Without a three-dimensional model one cannot actually understand what drives the revenue on any given work day.
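Here is a minimal sketch of that mean-price idea; the bar items, prices, and quantities are invented for illustration.

```python
import pandas as pd

# Invented bar and appetizer tickets: item, price (USD), quantity sold
bar = pd.DataFrame({
    "item":  ["beer", "wine", "cocktail", "soup", "wings"],
    "price": [5.0, 9.0, 11.0, 6.0, 12.0],
    "qty":   [60, 20, 25, 15, 30],
})
bar["revenue"] = bar["price"] * bar["qty"]

# Flag items priced at or above the mean price and compare revenue contribution
mean_price = bar["price"].mean()
bar["at_or_above_mean"] = bar["price"] >= mean_price
print(bar.groupby("at_or_above_mean")["revenue"].sum())
```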
Analysis
Histograms are of value for a number of reasons. In the restaurant example we broke down revenue by menu item. One could also break down histograms to show not just items, but items by quantity sold. One could thus see which price points were the most popular. In another example we would analyze larger quantities of data. A case such as the stock market often compels us to consider extracting, transforming, and loading data into a model much like the one employed above.
Data is often stored on a website whose server can be accessed via an API that sends data to customers. Data is provided by day, time, and watchlist, for example. For any given day I may wish to know how my stock did in comparison to another portfolio or set of stocks. I might decide my Vanguard account is being outperformed by a value stock today because of the point in time and its relation to a business cycle.
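As a sketch of that workflow, the snippet below pulls daily percent moves for a watchlist and compares them to a benchmark; the endpoint URL and response fields are hypothetical, not a real service's contract.

```python
import requests
import pandas as pd

API_URL = "https://example.com/api/v1/quotes"  # hypothetical quotes endpoint

def daily_pct_change(symbols: list) -> pd.Series:
    """Fetch day-over-day percent moves for a watchlist (hypothetical API)."""
    resp = requests.get(API_URL, params={"symbols": ",".join(symbols)})
    resp.raise_for_status()
    quotes = pd.DataFrame(resp.json())  # assumes columns: symbol, pct_change
    return quotes.set_index("symbol")["pct_change"]

# Compare a personal watchlist against a benchmark position on the same day
watchlist = daily_pct_change(["TRIP", "PGR", "SCHW", "NXPI"])
benchmark = daily_pct_change(["VTI"])  # e.g., a Vanguard total-market ETF
print(watchlist.mean() - benchmark.iloc[0])  # average out/under-performance
```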
Below is an image of data that concerned me last week, including popularity (volume) and price move. An algorithm might be used to answer: is the best performance a factor of popularity and the market trends of a related sector, or is there more to it?
As one looks at a portfolio, the first question would be whether stocks that were trending or popular were doing better or worse than the average. Please note, above is a sample of data from a basket of equities grouped together by volume of trades on a given day; the positive items shown above were TripAdvisor, Progressive Insurance, Schwab Bank and Brokerage Services, and NXP Semiconductors. The vast majority shown were not in the black. An experienced financial professional might suggest that if there is a correlation between volume of trades and price action, then surely one might be interested in following stocks that are traded at a similar rate of exchange or frequency of trades. Because this data example is a relatively trivial quantity, a histogram is a prime way of presenting the data without wasting hundreds of rows of space on this interface. Moreover, a cursory view of the board above shows that in fact the correlation does not exist: stocks toward the top of the chart are not doing better than those toward the bottom. For example, JETS, an ETF focused on airlines, landed toward the top of the chart in terms of volume and among the higher percents of the chart toward the downside.
However, as we know from statistics courses, a subset of data may not correctly predict the population statistics unless it is a valid subset of the data, or what is known as a sample. I did nothing more than take a snapshot of a dozen or so rows with no rhyme or reason; as such it was not representative of the entire population of stocks in that portfolio, which is to say it was not a sample. Without further ado, the entire portfolio can be modeled using a two-dimensional diagram here:
In the two-dimensional model we see above, there is some correlation between the number of contracts traded and the price action. Looking at the most negative and most positive ranges, far fewer contracts were traded in the most negative ranges of -4% through -9%. That is to say, at these very low ranges or extremes, one finds that about ten percent of the stocks in that portfolio were moving very far in the negative, where very far is described as moving to the downside by more than 4%. Inversely, there were four times as many stocks traded on the higher end of the range, which starts at about 0% to the upside and ends at a positive 8% move. The mean of the move to the topside, incidentally, can be calculated from the data set itself (not the presentation or model shown above) and is presented for those who might inquire: of those in the positive range the mean was 1.4%, and because the histogram included the range just below zero, that adjusted mean is about 0.5%. Generally, one could surmise that the more popular a stock was on that day, the more likely it was to be in the money, or advancing in price.
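Here is a minimal sketch of that binning and the positive-range means; the percent moves below are invented placeholders for the portfolio data behind the chart.

```python
import numpy as np

# Invented daily percent moves for a portfolio of stocks
pct_moves = np.array([-8.5, -5.2, -4.4, -1.0, -0.5, 0.2, 0.4, 0.8,
                      1.1, 1.3, 1.6, 2.4, 3.0, 4.5, 7.9])

# Count names at the negative extreme (-9% to -4%) vs. the upper range (0% to 8%)
extreme_neg = ((pct_moves >= -9) & (pct_moves < -4)).sum()
upper = ((pct_moves >= 0) & (pct_moves <= 8)).sum()
print(extreme_neg, upper)

# Mean of the positive names, then the "adjusted" mean once the bin just
# below zero is folded in
print(pct_moves[pct_moves > 0].mean())
print(pct_moves[pct_moves > -1].mean())
```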
While this is exactly the opposite of what was depicted previously, it is a very important detail about statistics, or the science itself. In data science, as in statistics, there are differences between so-called samples of a population and actual statistically significant samples. For example, if I know the state of Texas is approximately ten percent of the population, I cannot actually confirm it is a good sample of the United States in terms of a statistic such as height or weight. In other words, there might be a better sample, one that includes people from various regions of the country, that would describe the entire population better due to genetics and other factors. Please see a detailed description of proper sampling methodology in any number of texts on descriptive statistics, which was included in the curriculum of my major, Public Finance, at the George Washington University, where I earned a Master's in Public Policy and Public Administration.
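For a concrete contrast between an arbitrary snapshot and a random sample, consider this minimal sketch; the population values are invented.

```python
import pandas as pd

# Invented population of daily percent moves for every stock in a portfolio
population = pd.DataFrame({"pct_move": range(-9, 9)})

snapshot = population.head(12)                    # a dozen rows, no rhyme or reason
sample = population.sample(n=12, random_state=1)  # a simple random sample

# The arbitrary snapshot's mean drifts far from the population's; the random
# sample's mean tends to land much closer
print(population["pct_move"].mean(), snapshot["pct_move"].mean(), sample["pct_move"].mean())
```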
Please also note that data modeling and sampling are intrinsic to the study of data analytics, Machine Learning (or data training), and of course Artificial Intelligence (AI).
Back to data and algorithms. We have seen how data modeling begins with the two-dimensional model. In the stocks case we surveyed data on a given day and purported that activity and price action were generally correlated. In statistics we speak of "generally correlated" and, more importantly, of to what extent. How strongly two features are correlated has to do with a specific metric, so we can actually plug in the data to determine its strength. From my vantage point I can see the data is correlated to an average extent, but not very, very highly. From a cursory view of the data we see that the central range is actually higher with regard to volume as compared to the lower ends. However, because the middle range is intrinsically, or obviously, higher in percent move than the lower one, one would say that both the mid range and the high range of volume are higher than the very low ones, whereas a stronger correlation between price and volume would be shown by the mid range sitting between the volume noted at the high and low ranges. That is to say, the middle range has the higher volume. This distribution could also be calculated in a statistical model using R, Python, statistical software (SPSS), or even Excel (Microsoft) and its plugins.

Bear in mind we referred to the model above as a two-dimensional model, because we were capturing a summary of data from the data set, much as we saw in the previous example. However, for ease of use I did add the counts of items, which is a statistical calculation, or sum, not normally shown on a table. Added functions differentiate a two-dimensional model, so in theory it is also a histogram where counts and categories are involved. Table views are generally referred to as 2-D or flat files, and generally they are the first step toward plotting a chart, even a histogram. Thus, for finality, here is a histogram chart for the stocks data:
(Also, please consult your favorite data modeling coursebook, or material on OLAP and SQL 3-D visualization with data warehousing, for distinctions in greater detail between 2-D flat files and 3-D models.)
The arrow in the chart shows how, ideally, a one-to-one correlation (the slope of a straight line) would be stronger: the -1% to 8% range should be higher than the middle range, or the -1% to -4% range. Please consult your statistics book of choice for further reading.
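To actually plug in the data and measure that strength, as suggested above, here is a minimal sketch using SciPy; the volume and percent-move values are invented placeholders.

```python
import numpy as np
from scipy import stats

# Invented per-stock trading volume and same-day percent move
volume = np.array([1.2e6, 9.5e5, 7.0e5, 5.5e5, 4.0e5, 2.0e5, 1.0e5])
pct_move = np.array([1.1, -0.8, 2.3, 0.4, -2.5, 0.2, -3.8])

# Pearson's r: |r| near 1 is a very strong linear relationship; a middling
# value matches the "correlated to an average extent" reading above
r, p_value = stats.pearsonr(volume, pct_move)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```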
Algorithms for trading are concepts widely adopted by students of finance and experts alike. The algorithm that follows from the above, in my estimation, would relate to the features above. As I have said, a stronger correlation is necessary to conclude this was a good day for picking bullish stocks. In fact, you can see the histogram shows the medium range to include negative-trending stocks. Thus trading, and the algorithm for "bullish" or "bearish", might be affected by data such as this, where one would consider including a feature for non-bullish or non-bearish. Perhaps a sentiment algorithm might include the blend of bearish, bullish, and neutral for a given stock that depended on its volume and this sentiment, not just the volume (hence an algorithm as a sum of two components is employed).
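A minimal sketch of such a two-component blend follows; the weights, the volume normalization, and the sentiment labels are all assumptions for illustration.

```python
def blended_signal(volume: float, avg_volume: float, sentiment: str,
                   w_volume: float = 0.5, w_sentiment: float = 0.5) -> float:
    """Score roughly in [-1, 1]; positive leans bullish, negative bearish."""
    sentiment_score = {"bullish": 1.0, "neutral": 0.0, "bearish": -1.0}[sentiment]
    # Above-average volume pushes the score up, capped at 1.0
    volume_score = min(volume / avg_volume - 1.0, 1.0)
    return w_volume * volume_score + w_sentiment * sentiment_score

# A heavily traded name with bearish sentiment does not automatically score bullish
print(blended_signal(volume=2_000_000, avg_volume=1_000_000, sentiment="bearish"))
```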