Machine learning, data and getting started

Colin Hayhurst

CEO @mojeek | No-Tracking Search Engine

Published: 29 August 2019

When businesses talk about developing AI today, they’re usually talking about building mathematical models that can be trained on data to make decisions. That’s a specific subset of AI called machine learning. AI is a generalised, and often abused, abbreviation that refers to a wider set of methods; it includes for instance robotic process automation and rule-based systems, such as expert systems. Generally, each machine learning model is specialised to make a certain set of decisions based on particular sets of data. So the intelligence encapsulated in machine learning models is narrow and targeted at specific problems.

This is the third in a series of posts, with a focus on machine learning. In the first post I reviewed the steps in becoming an AI-first business. Then I looked at aligning business and AI strategy. In this and subsequent posts, I focus on the development and deployment of machine learning models which are specific to your business value proposition and are aimed at increasing revenue and market share, as opposed to those that are about reducing costs or increasing operational effectiveness.

You won’t be surprised to learn that results from machine learning projects can take time and you shouldn’t expect to achieve success overnight. You should regard your machine learning initiative as a mid-term project. In the software development world, time estimations have always been a challenge. This challenge is at least as big in the case of machine learning projects.

It makes sense then to start with an initial problem or opportunity to tackle. Sometimes that might be some obvious “low hanging fruit” but it might also be about building a prototype for a new product or service you envisage offering.

An initial project is very likely a business problem or opportunity you have already thought about and identified. The range and diversity of projects that we have seen, as consultants, is wide so it’s hard and probably unhelpful to make generalisations. However, it is likely to be around some acute pain point that you have in your business, or more likely, it will be a particular opportunity that you have seen for increasing revenues, improving service or pleasing customers. Often you will envisage it as a new product or a new component in an existing product.

Assuming you have a machine learning project you have decided to tackle, or at least explore, the next consideration, and it’s a hugely important one, is data. You are likely to be aware that machine learning needs good data to be successful. Indeed, leaders of AI and platform-driven organisations know that data is their most important asset. As devices and people produce more and more data, more becomes possible with AI. Don’t be fooled, however, into thinking that you need the terabytes of data that companies like DeepMind are using in their deep learning models. Other types of machine learning models don’t necessarily need such vast amounts of data. As specialists in probabilistic programming, we are particularly adept at building useful machine learning models from relatively small initial datasets.

At this point you might be asking: so how much data do I need? Sorry, but it really does depend: on the problem you are addressing, on the specific machine learning techniques you are using, and on the accuracy that you are seeking from the models you’ll develop and deploy.

Typically, and in our experience with clients, it isn’t a simple binary case of having more than enough data or not having anywhere near enough. It’s more usual that a client has some data, but it’s not enough of the right kind or it’s missing significant pieces needed to embark on a project. Of course, you might have been drawn into the popular hype that the only way to compete is to build a business with huge amounts of data, a so-called “data moat”. As the top investors Andreessen Horowitz state, “data effects need more thoughtful consideration than leaping from ‘we have lots of data’ to ‘therefore we have long-term defensibility’”.

If you don’t think you have the data internally, you’ll need to make plans to collect and/or acquire it. Often we are working with companies that have some initial data but have definite plans to acquire more as their business develops.

Remember there might be external data sets that you can purchase or are available for free. Increasingly there are datasets available that are public & free, and resources that can link you to these and other datasets. Below we’ve put together links to many of the most useful data resources and lists:

Kaggle datasets
Open data on AWS
Google dataset search
UCI machine learning repository
Microsoft Research open data
Awesome public datasets on Github
Government datasets: e.g. US, EU and UK

There are numerous other datasets scattered around the web which might be applicable to your problem, so it’s worth doing a search for them. In our work, and as part of our general development, we keep our own database of data sources; in our case, because we work across multiple industry sectors, this is very diverse. We recommend you invest in doing the same for your business.
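A database of data sources does not need to be elaborate to be useful. The sketch below is one possible shape for such a catalogue, not a description of our own tooling: the `DataSource` fields, the `find_sources` helper and the three entries are all hypothetical and exist only to illustrate the idea of recording sources with enough metadata to search them later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One entry in a hypothetical in-house catalogue of data sources."""
    name: str
    sector: str        # illustrative values such as "retail" or "health"
    access: str        # "free" or "paid"
    tags: tuple = ()   # free-form keywords used for searching


def find_sources(catalogue, sector=None, access=None, tag=None):
    """Return the entries that match every filter that was supplied."""
    matches = []
    for source in catalogue:
        if sector is not None and source.sector != sector:
            continue
        if access is not None and source.access != access:
            continue
        if tag is not None and tag not in source.tags:
            continue
        matches.append(source)
    return matches


# These entries are made up for illustration only.
CATALOGUE = [
    DataSource("Example retail transactions", "retail", "paid", ("sales", "baskets")),
    DataSource("Example open weather archive", "cross-sector", "free", ("weather", "time-series")),
    DataSource("Example public health statistics", "health", "free", ("demographics",)),
]

free_sources = find_sources(CATALOGUE, access="free")
print([source.name for source in free_sources])
```

In practice a spreadsheet or a small database table with the same columns would serve equally well; the point is that each source is recorded once, with its sector and access terms, so that it can be found again when a new project starts.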

To succeed with machine learning and AI you’ll need to become proficient at acquiring data strategically. You will need to identify data sources, build data pipelines and, no doubt, clean and prepare data.

For larger companies, a solid data strategy is particularly important. For instance, an over-regulated information policy or simply hoarding of data across departments can really slow down AI adoption. That’s another reason why AI strategy should be introduced and guided by the highest management levels.

Of course, the larger a business gets the more data tends to get spread out in multiple data silos and systems. So an important contribution to becoming AI capable is to form a cross-business unit taskforce, which takes steps to integrate different data sets together and sort out inconsistencies. Before implementing machine learning into your business, it makes sense to sort out these issues and clean your data.
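As a rough illustration of what integrating data sets and sorting out inconsistencies can involve, the sketch below joins records from two silos on a shared key and lists the fields where the two systems disagree. Everything in it is an assumption made for the example: the silo names (`crm_rows`, `billing_rows`), the choice of email address as the join key, and the rule that the first value seen is kept while the disagreement is logged for a person to resolve.

```python
def normalise_email(email):
    """Make the join key comparable across systems."""
    return email.strip().lower()


def merge_silos(*silos):
    """Merge lists of dict records keyed on email and collect conflicts.

    Returns (merged, conflicts), where conflicts is a list of
    (email, field, kept_value, rejected_value) tuples.
    """
    merged = {}
    conflicts = []
    for rows in silos:
        for row in rows:
            key = normalise_email(row["email"])
            record = merged.setdefault(key, {"email": key})
            for field_name, value in row.items():
                if field_name == "email" or value in (None, ""):
                    continue
                if field_name in record and record[field_name] != value:
                    # Keep the first value seen and log the disagreement.
                    conflicts.append((key, field_name, record[field_name], value))
                else:
                    record[field_name] = value
    return merged, conflicts


# Hypothetical records from two departments.
crm_rows = [
    {"email": " Ann@Example.com ", "country": "UK", "segment": "SME"},
]
billing_rows = [
    {"email": "ann@example.com", "country": "United Kingdom", "plan": "pro"},
    {"email": "bob@example.com", "country": "US", "plan": "free"},
]

merged, conflicts = merge_silos(crm_rows, billing_rows)
print(len(merged), "customers;", len(conflicts), "conflict(s) to review")
```

The conflict list is the useful output for a cross-business taskforce: here it would show that one system stores "UK" and the other "United Kingdom" for the same customer, which is exactly the kind of inconsistency worth agreeing a convention for before any modelling starts.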

But remember that even if your data set is messy and unstructured, it’s not necessarily a death sentence for your data science initiative. Today, data scientists are well equipped with a number of practices to apply during the preparation stage to restructure, clean your data set, and further optimise it for efficient modelling.
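To give a flavour of the preparation stage, here is a minimal sketch of a few common cleaning steps applied to messy records: trimming whitespace, standardising column names, dropping rows that lack a required field, converting a numeric column, and removing duplicates. The column names (`customer_id`, `spend`) and the sample rows are hypothetical, and a real project would typically use a data library and many more checks than this.

```python
def clean_rows(raw_rows, required=("customer_id",)):
    """Return tidied copies of raw_rows, skipping unusable or duplicate rows."""
    cleaned = []
    seen = set()
    for row in raw_rows:
        tidy = {}
        for key, value in row.items():
            # Trim stray whitespace and treat empty strings as missing.
            if isinstance(value, str):
                value = value.strip()
                if value == "":
                    value = None
            tidy[key.strip().lower()] = value

        # Drop rows that lack a required field.
        if any(tidy.get(name) is None for name in required):
            continue

        # Convert the spend column to a number where possible.
        spend = tidy.get("spend")
        if isinstance(spend, str):
            try:
                tidy["spend"] = float(spend.replace(",", ""))
            except ValueError:
                tidy["spend"] = None  # unparseable values become missing

        # Skip exact duplicates of a row that has already been kept.
        fingerprint = tuple(sorted(tidy.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


# Made-up sample of messy input.
raw_rows = [
    {"Customer_ID": " 001 ", "Spend": "1,200.50"},
    {"Customer_ID": "001", "Spend": "1200.5"},   # duplicate once cleaned
    {"Customer_ID": "", "Spend": "30"},          # missing required field
    {"Customer_ID": "002", "Spend": "n/a"},      # unparseable spend
]

cleaned = clean_rows(raw_rows)
print(cleaned)
```

Note that the unparseable spend value is kept as a missing value rather than the row being discarded; whether to drop, keep or impute such rows is a modelling decision that depends on the problem.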

In the next post, I’ll take a look at the other crucial element in implementing your machine learning roadmap: people. Should you train, hire, partner or outsource when tackling your machine learning projects?

This post originally appeared on https://www.datajavelin.com/post/data-and-getting-started

#ai #machinelearning #data #datasets #datasources


Phil Cheetham, Engineering Consultant at Phil Cheetham Consultant

5 years ago

Colin. Thanks for the educational article, just right for novices like me. Regarding data it’s worth pointing out there may be legal or otherwise data which needs to be considered in AI.
