Building a data company in 2022.

I've had a pretty varied career in machine learning and software development. I've worked for ten-person startups and 70,000-employee behemoths. I've worked as a machine learning researcher, a data engineer, a cloud architect, a project manager and an innovator. I've developed products as diverse as a natural language search engine, spectral calibration for drone cameras, identity verification systems, precision agriculture, remote water measurement and content recommendation. The one constant has been change. But there are commonalities between all of these experiences.

I guess if you're reading this, and you want to get into data, or are thinking of moving your company that way, these are my top-five (in a "High Fidelity" kind of a way) things I want you to learn from my experiences.

Data analysis first, then Data science.

If you want to start using data, and you find yourself saying "we have gigabytes of data in our operational databases", stop. Sure, loads of data is great, but you don't want *data*, you want *information*. A running count of all the grains of sand on a beach can take up GB of data, but there's really only one number in there - the total number of grains.

Test your hypothesis that you have loads of data early. Get a data analyst on the case, give them a dump of your operational databases and ask them "okay, so what's in here?". You may be shocked by the reply (often, the answer is really not much), but if you don't do it early on, you'll be shocked and saddled with a bunch of expensive data scientists who have no data. Don't get me wrong, there are lots of things they can do, but nothing comes close to the value you can get from real data.
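As a rough illustration of that first "what's in here?" pass, here is a minimal sketch of a column profiler. It assumes the dump has already been loaded as a list of dictionaries; the function names and the sample rows are hypothetical, not taken from any real system. For each column it reports how often it is filled in, how many distinct values it holds, and its Shannon entropy in bits (0 means every value is the same, so the column tells you nothing).

```python
from collections import Counter
from math import log2


def profile_column(values):
    """Summarise how much information a single column carries."""
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    total = len(present)
    # Shannon entropy in bits: 0 means all present values are identical.
    entropy = sum((c / total) * log2(total / c) for c in counts.values())
    return {
        "rows": len(values),
        "filled": total / len(values) if values else 0.0,
        "distinct": len(counts),
        "entropy_bits": entropy,
    }


def profile_table(rows):
    """Profile every column that appears in a list of row dictionaries."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical four-row dump: plenty of rows, very little information.
rows = [
    {"user_id": 1, "country": "CH", "plan": "free"},
    {"user_id": 2, "country": "CH", "plan": "free"},
    {"user_id": 3, "country": "CH", "plan": None},
    {"user_id": 4, "country": "CH", "plan": "free"},
]

report = profile_table(rows)
for column, stats in report.items():
    print(column, stats)
```

In this toy dump only `user_id` varies, which is the grains-of-sand situation in miniature: many rows, almost nothing to learn from them.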

Be data driven. In everything.

If you want to become a data company, you need to live and breathe data yourself. If you don't, why should anyone else trust you and your pronouncements about "the new oil"? This means using data to answer your own problems as a business. Not sure which of two possible variations of a UX/UI to deploy? Run some user A/B tests. Want to know which market demographics to try to engage? Look at which customers are easiest to onboard first.
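To make the A/B test idea concrete, here is a minimal sketch of one common way to compare two UI variants: a two-sided two-proportion z-test on conversion counts. The function name and the visitor and conversion numbers are made up for illustration, and this is only one of several reasonable tests, not a prescribed method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 2,400 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between variants is unlikely to be noise. Deciding the sample size and the significance threshold before looking at the results keeps the test honest.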

Of course, this isn't to say that you should leave every decision you take to your analysts. Sometimes there isn't the data available (a common problem in counterfactual situations), or it's hard or expensive to collect beyond a small sample. In these cases, be strong and make as well-informed a choice as you can (see Ken Kocienda's excellent book "Creative Selection" for how they did this at Apple for the iPhone). Know that if you're making a decision which goes against the data, it's important to have a compelling reason to do so. If this kind of decision turns out badly, don't ignore what you just learned - be a good Bayesian and update your priors!

Control how your data is created

The best-functioning machine learning companies out there control their entire ecosystem. We're often told that Google and Facebook just happened upon amazing treasure troves of data produced by their users' interactions with their tools, but I'm afraid that's bunk. The reality is that their whole product is designed to provide data in a virtuous cycle - the experience is intended to guide you to part with precisely the right data in precisely the right way. The actual data collected is of dubious ethical taste in the case of some big tech companies, but one thing they get very right is controlling the whole data generation process.

Machine learning still isn't that smart, and it's still pretty fragile. One of the main reasons that radiologists haven't been replaced (as so many ML practitioners predicted five years ago) isn't that machine learning models can't perform better: on tasks such as diagnosing tumours, they definitely can. The models fall short simply because they fail to generalise. A radiologist can walk down the corridor to a different MRI machine and do just as well, but an ML model trained on data from only one machine will fail badly on data from another one.
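One way to spot this kind of fragility before your users do is to evaluate on data from a source the model never saw in training. The sketch below is a plain-Python leave-one-group-out splitter, assuming each sample is tagged with the machine it came from; the function name and scanner labels are hypothetical.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one group at a time."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    for held_out in sorted(by_group):
        test_idx = by_group[held_out]
        # Train on every sample that did NOT come from the held-out group.
        train_idx = sorted(
            i for group, indices in by_group.items() if group != held_out for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical dataset: one entry per scan, tagged with its source machine.
scanner_ids = ["scanner_a", "scanner_a", "scanner_b", "scanner_c", "scanner_b"]
splits = list(leave_one_group_out(scanner_ids))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

If accuracy on the held-out machine is much worse than on a random split, the model has probably learned the machine rather than the task.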

Make sure that, as much as possible, you keep your users on the happy path that gives them a pleasant experience of using your products, and at the same time generates data that you can use. Too much platform flexibility can have severe, hidden costs in the form of unusable data, and unhappy users when ML models are involved.

Start with simple models and baselines

This point can almost be seen as a follow-on from the last two. It's tempting when starting out to try to take a top-down approach to machine learning systems, and design something approximating a general intelligence for your specific use case or market vertical. I guarantee you, no matter who you are, you don't have enough data for that (yet!).

Instead, start from the bottom up. This can really mean having human beings initially doing a repetitive task (e.g. labelling images, categorising emails, recommending content) until you have enough data to consider automating this process. The first question you need to ask yourself is "how well do I need my humans to perform to make this a viable feature?". If you know you need 70% accuracy, go for the absolute simplest model you can that will get you there. Sure, there are fancier techniques that your data scientists would love to try, but they will cost you in understanding and explainability.
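As a sketch of what "the absolute simplest model" can look like, the snippet below scores a majority-class baseline (always predict the most common label) against a required accuracy. The 0.70 target comes from the example above; the function name and the email-category labels are invented for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most common training label and score it on the test set."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


REQUIRED_ACCURACY = 0.70  # the viability threshold used as an example in the text

# Hypothetical human-labelled email categories.
train_labels = ["billing"] * 6 + ["support"] * 3 + ["sales"] * 1
test_labels = [
    "billing", "support", "billing", "billing", "sales",
    "billing", "support", "billing", "sales", "billing",
]

label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
meets_target = accuracy >= REQUIRED_ACCURACY
print(f"always predict {label!r}: accuracy {accuracy:.2f}, meets target: {meets_target}")
```

Any model you build afterwards has to beat this number to justify its extra complexity, and the gap between the baseline and the target tells you how much modelling effort the feature really needs.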

And the reality of those "almost-AGIs" that some companies have, like Microsoft's XiaoIce chatbot? They are made up of hundreds of smaller moving parts, each crafted one at a time. Don't make perfect the enemy of good; you have to start somewhere with the first little feature.

Don't sell it as AI. It's 2022. Nobody cares.

In 2016, it was broadly the case that you could say AI/ML to investors and they would hurl bricks of cash at you, in much the same way as Web 3.0 and NFTs seem to be hitting the mark now. Some VCs were less interested in what you were solving than how you were solving it, which has led to a lot of people starting sales pitches with "We're doing <insert random task here> with ML/AI" and waiting for the contracts to roll in.

The reality is, it's 2022. We're over the hump of the AI Gartner hype cycle. People want to know *what* you're doing, not how you're doing it. If you're doing it with magic elves and pixie dust, what matters is that you're doing it. If you're not doing it, all the AI in the world isn't going to help. Focus on the mission, not the method.

Thanks for sharing Chris Pedder PhD, some really good lessons for anyone interested in data and the forbidden word.

Kornelia Papp

Group Head of Conversational AI & Intelligent Automation

2 years ago

Agree, agree, agree. Thanks for sharing.

Hollie Bayliss

Neural Networking | Executive Technical Recruiter

2 years ago

‘Don't sell it as AI. It's 2022. Nobody cares’ haha
