Machine learning, data and getting started
When businesses talk about developing AI today, they’re usually talking about building mathematical models that can be trained on data to make decisions. That’s a specific subset of AI called machine learning. AI is a generalised, and often abused, term that refers to a wider set of methods; it includes, for instance, robotic process automation and rule-based systems such as expert systems. Generally, each machine learning model is specialised to make a certain set of decisions based on particular sets of data. So the intelligence encapsulated in machine learning models is narrow and targeted at specific problems.
This is the third in a series of posts, with a focus on machine learning. In the first post I reviewed the steps in becoming an AI-first business. Then I looked at aligning business and AI strategy. In this and subsequent posts, I focus on the development and deployment of machine learning models that are specific to your business value proposition and are aimed at increasing revenue and market share, as opposed to those that are about reducing costs or increasing operational effectiveness.
You won’t be surprised to learn that results from machine learning projects can take time, and you shouldn’t expect to achieve success overnight. You should regard your machine learning initiative as a mid-term project. In the software development world, estimating time has always been a challenge. This challenge is at least as big in the case of machine learning projects.
It makes sense, then, to start with an initial problem or opportunity to tackle. Sometimes that might be some obvious “low-hanging fruit”, but it might also be about building a prototype for a new product or service you envisage offering.
An initial project is very likely a business problem or opportunity you have already thought about and identified. The range and diversity of projects that we have seen, as consultants, is wide so it’s hard and probably unhelpful to make generalisations. However, it is likely to be around some acute pain point that you have in your business, or more likely, it will be a particular opportunity that you have seen for increasing revenues, improving service or pleasing customers. Often you will envisage it as a new product or a new component in an existing product.
Assuming you have a machine learning project you have decided to tackle, or at least explore, the next consideration, and it’s a hugely important one, is data. You are likely to be aware that machine learning needs good data to be successful. Indeed, leaders of AI and platform-driven organisations know that data is their most important asset. As devices and people produce more and more data, so more becomes possible with AI. Don’t be fooled, however, into thinking that you need the terabytes of data that companies like DeepMind are using in their deep learning models. Other types of machine learning models don’t necessarily need such vast amounts of data. As specialists in probabilistic programming, we are particularly adept at building useful machine learning models for relatively small initial datasets.
At this point you might be asking: so how much data do I need? Sorry, but it really does depend: on the problem you are addressing, on the specific machine learning techniques you are using and on the accuracy that you are seeking from the models you’ll develop and deploy.
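One practical way to get a feel for this on your own problem is a learning curve: train on progressively larger samples and see where accuracy levels off. The sketch below is only an illustration under stated assumptions. The data is synthetic (two overlapping groups of numbers), the "model" is a deliberately simple threshold rule, and names such as `learning_curve` are invented for this example rather than taken from any library.

```python
import random
import statistics


def make_dataset(rng, n):
    """Synthetic stand-in data: two overlapping 1-D classes, labels alternate 0/1."""
    rows = []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 1.5
        rows.append((rng.gauss(centre, 1.0), label))
    return rows


def fit_threshold(train):
    """Toy model: the midpoint between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, test):
    """Share of test rows where 'x above threshold' matches label 1."""
    correct = sum((x > threshold) == (y == 1) for x, y in test)
    return correct / len(test)


def learning_curve(sizes, repeats=200, seed=0):
    """Mean held-out accuracy for each training size (each size must be >= 2)."""
    rng = random.Random(seed)
    test = make_dataset(rng, 2000)  # fixed held-out set shared by all sizes
    curve = {}
    for n in sizes:
        scores = [
            accuracy(fit_threshold(make_dataset(rng, n)), test)
            for _ in range(repeats)
        ]
        curve[n] = statistics.mean(scores)
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([4, 16, 64, 256]).items():
        print(f"{n:>4} training rows -> mean accuracy {acc:.3f}")
```

If the curve has flattened by the time you reach the amount of data you already hold, more of the same data is unlikely to help much. If it is still climbing, collecting more is probably worthwhile.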
Typically, and in our experience with clients, it isn’t a simple binary case of having more than enough data or not having anywhere near enough. It’s more usual that a client has some data, but it’s not enough of the right kind or it’s missing significant pieces needed to embark on a project. Of course, you might have been drawn into the popular hype that the only way to compete is to build a business with huge amounts of data, a so-called “data moat”. As top investors Andreessen Horowitz state, “data effects need more thoughtful consideration than leaping from ‘we have lots of data’ to ‘therefore we have long-term defensibility’”.
If you don’t think you have the data internally, you’ll need to make plans to collect and/or acquire it. Often we are working with companies that have some initial data but have definite plans to acquire more as their business develops.
Remember there might be external datasets that you can purchase or that are available for free. Increasingly there are public datasets, and resources that can link you to these and other datasets. Below we’ve put together links to many of the most useful data resources and lists:
- Kaggle datasets
- Open data on AWS
- Google dataset search
- UCI machine learning repository
- Microsoft Research open data
- Awesome public datasets on GitHub
- Government datasets, e.g. US, EU and UK
There are numerous other datasets scattered around the web which might be applicable to your problem, so it’s worth doing a search for them. In our work, and as part of our general development, we keep our own database of data sources; in our case, because we work across multiple industry sectors, this is very diverse. We recommend you invest in doing the same for your business.
To succeed with machine learning and AI you’ll need to become proficient at acquiring data strategically. You will need to identify data sources, build data pipelines and, no doubt, clean and prepare data.
For larger companies, a solid data strategy is particularly important. For instance, an over-regulated information policy, or simply the hoarding of data across departments, can really slow down AI adoption. That’s another reason why AI strategy should be introduced and guided by the highest management levels.
Of course, the larger a business gets, the more its data tends to get spread out across multiple silos and systems. So an important contribution to becoming AI capable is to form a cross-business-unit taskforce, which takes steps to integrate different datasets and sort out inconsistencies. Before implementing machine learning in your business, it makes sense to sort out these issues and clean your data.
But remember that even if your dataset is messy and unstructured, it’s not necessarily a death sentence for your data science initiative. Today, data scientists are well equipped with a number of practices to apply during the preparation stage to restructure and clean your dataset, and further optimise it for efficient modelling.
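To make the integration and clean-up steps a little more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the two "silos" (a CRM export and a billing export), the field names, the alias table and the helper functions are invented for illustration, and a real project would more likely use a dedicated data library and far richer rules.

```python
from datetime import datetime

# Hypothetical exports from two departmental systems (invented for illustration).
crm_rows = [
    {"customer_id": " C001 ", "country": "UK", "signup": "2019-03-01"},
    {"customer_id": "c002", "country": "United Kingdom", "signup": "01/04/2019"},
    {"customer_id": "C003", "country": "", "signup": "2019-05-20"},
]
billing_rows = [
    {"cust": "C001", "country": "United Kingdom", "total_spend": "1,200.50"},
    {"cust": "C002", "country": "GB", "total_spend": "310"},
    {"cust": "C004", "country": "US", "total_spend": "n/a"},
]

# Assumed mapping from the spellings seen in the data to one agreed code.
COUNTRY_ALIASES = {"uk": "GB", "united kingdom": "GB", "gb": "GB", "us": "US"}


def clean_id(raw):
    """Strip stray spaces and use one letter case so IDs match across systems."""
    return raw.strip().upper()


def clean_country(raw):
    """Return the agreed country code, or None when missing or unrecognised."""
    return COUNTRY_ALIASES.get(raw.strip().lower())


def clean_date(raw):
    """Accept the two date formats seen above; return None for anything else."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_amount(raw):
    """Parse amounts such as '1,200.50'; return None for placeholders like 'n/a'."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


def integrate(crm, billing):
    """Combine both sources into one record per customer ID."""
    merged = {}
    for row in crm:
        merged[clean_id(row["customer_id"])] = {
            "country": clean_country(row["country"]),
            "signup": clean_date(row["signup"]),
            "total_spend": None,
        }
    for row in billing:
        record = merged.setdefault(
            clean_id(row["cust"]),
            {"country": None, "signup": None, "total_spend": None},
        )
        record["total_spend"] = clean_amount(row["total_spend"])
        # Keep the CRM country when present; otherwise fall back to billing.
        record["country"] = record["country"] or clean_country(row["country"])
    return merged


if __name__ == "__main__":
    for customer, record in sorted(integrate(crm_rows, billing_rows).items()):
        print(customer, record)
```

Values that cannot be cleaned are kept as `None` rather than guessed, so the gaps stay visible and someone who knows the business can decide how to fill them.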
In the next post, I’ll take a look at the other crucial element in implementing your machine learning roadmap: people. Should you train, hire, partner or outsource when tackling your machine learning projects?
This post originally appeared on https://www.datajavelin.com/post/data-and-getting-started
Phil Cheetham, Engineering Consultant at Phil Cheetham Consultant
5 years ago: Colin, thanks for the educational article, just right for novices like me. Regarding data, it’s worth pointing out there may be legal or other data issues which need to be considered in AI.