Data Strategy for AI Solutions

Given the advances in machine learning (ML) research and the media attention around them, it is understandable that many organizations feel a strong urgency, at times bordering on desperation, to implement AI solutions in some form, whether to improve their existing products and services or to develop new ones. CEOs and senior managers at many firms are setting a clear goal for their divisional and functional heads: leverage the organization’s idle repository of data to implement at least one ML use case in their respective business function. Such a push from senior management is a good step toward building an AI-savvy organizational culture. It is critical to understand, however, that having a huge volume of internal data does not necessarily mean that machine learning can be applied effectively to solve business problems.

Most real-life machine learning problems require more than one type of dataset to train algorithms that predict a given output accurately. For example, a self-driving car predicts its next action on the road by processing several types of data: video and images of the surrounding traffic captured by the car’s cameras, radar signals that are sent from the car and bounce off nearby objects, laser data from 360-degree rotating lidar units on the car, and third-party Google Maps data used to analyze traffic and optimize routes. Hence, once a business problem has been fully defined, the most crucial next step in the ML project plan is to work out which datasets the problem needs, whether those datasets already exist in the organization’s databases and, if not, what the strategy is for getting access to them.
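As a rough illustration of that brainstorming step, the sketch below checks a list of required datasets against an internal catalog and reports which ones are missing. The function name, the dataset names and the catalog contents are all hypothetical, loosely based on the self-driving example above; a real data inventory would be far more detailed.

```python
def find_data_gaps(required, internal_catalog):
    """Split the required dataset names into those already held and those missing."""
    available = [name for name in required if name in internal_catalog]
    missing = [name for name in required if name not in internal_catalog]
    return available, missing


# Hypothetical dataset names for the self-driving example.
required = ["camera_video", "radar_signals", "lidar_scans", "map_traffic"]

# Hypothetical internal catalog: what the organization already stores.
internal_catalog = {"camera_video", "radar_signals", "lidar_scans", "telemetry_logs"}

available, missing = find_data_gaps(required, internal_catalog)
print("available:", available)
print("missing:", missing)
```

Anything that lands in the missing list is a dataset the team needs an acquisition plan for before modeling starts.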

A build/buy/partner strategy works best when it comes to collecting all the datasets required to solve ML and AI problems. To explain this in more detail, let’s take the example of a retail chain trying to set its stores’ inventory levels by predicting how many units of each item customers will buy in the next month. To collect the right datasets for this problem, the retail chain will have to follow a three-pronged strategy, as explained below:

  1. Build: Build refers to extracting relevant datasets from an organization’s own internal databases. Organizations generally have huge central data warehouses that store many data elements, not all of which are related to the prediction problem being solved. Hence, a lot of work is required to extract and build the relevant dataset from the existing warehouse by applying the appropriate filters. In our example, the retail chain might have millions of customer-related records spanning a long period of time in its database. However, to solve the problem at hand, it would need to filter this down to customer transaction data, and only over a recent period (maybe a couple of years), so that the data is a good predictor of current buying patterns.
  2. Buy: To solve a particular ML problem and get accurate predictions, an organization might need access to datasets that don’t currently reside within its own databases. In that case, it will have to reach out to the organizations that hold those datasets and purchase them. For example, to better predict future customer sales, the retail chain would need data on forecasted weather conditions, projected economic growth, and any recent disease or flu outbreaks that drive sales of health-related products. The retail chain will have to buy this data from a weather forecasting company, an economic research think tank and a health company, respectively.
  3. Partner: Often, the datasets required for the problem reside with external organizations that don’t want to sell them because the data is either proprietary or core to their own business. In those cases, a partnership strategy might work better, where one partners with the external organization in a way that is a win-win for both sides. Going back to our retail inventory problem, suppose the retail chain wants to predict sales of certain products that the original manufacturer sells both through the offline retail chain and through an online marketplace company. The retail chain would then need access to the online marketplace’s sales data as well to make accurate predictions about offline sales. However, the online marketplace company might decide not to sell this data to the offline retail chain, given that it is core to its own business. In such a scenario, the two companies can enter into an agreement to leverage each other’s data, so that each can build better inventory prediction models for its own business.
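To make the build and buy steps a little more concrete, here is a minimal sketch for the retail example. It filters internal transactions down to a recent window (build), aggregates them into monthly unit sales, and attaches purchased weather data by month (buy). All function names, field names and sample values are hypothetical and chosen only for illustration; the two-year window is approximated as 730 days.

```python
from datetime import date, timedelta


def build_recent_transactions(transactions, as_of, years=2):
    """Build: keep only transactions from roughly the last `years` years."""
    cutoff = as_of - timedelta(days=365 * years)
    return [t for t in transactions if cutoff <= t["date"] <= as_of]


def monthly_units(transactions):
    """Aggregate units sold per (year, month, product)."""
    totals = {}
    for t in transactions:
        key = (t["date"].year, t["date"].month, t["product"])
        totals[key] = totals.get(key, 0) + t["units"]
    return totals


def add_external_features(totals, weather_by_month):
    """Buy: attach purchased monthly weather data (None where no data exists)."""
    rows = []
    for (year, month, product), units in sorted(totals.items()):
        rows.append({
            "year": year,
            "month": month,
            "product": product,
            "units": units,
            "avg_temp_c": weather_by_month.get((year, month)),
        })
    return rows


# Hypothetical internal transaction records.
transactions = [
    {"date": date(2019, 5, 1), "product": "umbrella", "units": 3},  # outside the window
    {"date": date(2023, 11, 3), "product": "umbrella", "units": 4},
    {"date": date(2023, 11, 20), "product": "umbrella", "units": 6},
    {"date": date(2023, 12, 2), "product": "umbrella", "units": 5},
]

# Hypothetical purchased weather data, keyed by (year, month).
weather_by_month = {(2023, 11): 9.5}

recent = build_recent_transactions(transactions, as_of=date(2024, 1, 1))
rows = add_external_features(monthly_units(recent), weather_by_month)
for row in rows:
    print(row)
```

Data obtained through a partnership, such as the online marketplace’s sales figures, could be joined in the same way, keyed by month and product. Rows with a missing external value show where the purchased or partner data does not yet cover the period being modeled.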

In conclusion, I would say that machine learning project teams should develop an appropriate strategy upfront for getting access to all the relevant datasets before getting deep into any project. The more holistic and relevant the data, the more accurate the final predictions will be, and hence the better the return on investment. So ask yourself: do you have the right data and, if not, do you know how to get it?
