AI 101, Part II: How to Deal with Data Preparation

My first post in this series covered what marketers and sales leaders need to know about the four main phases of building predictive models. The second of these phases – data preparation – tends to be the least understood part of AI and predictive analytics in marketing. In this post, I’ll dig deeper into the key considerations surrounding that process, namely data volume and data quality. When my company introduces our predictive platform to companies, two of the biggest concerns we hear are: (1) Do I have enough data? and (2) Is my data “clean” enough?


How Much Data is Needed for Machine Learning?

There’s a rule of thumb for how much data you need in order to be successful with a predictive model, and the most important number is the count of positive signals, or “good” examples, in your data set. In the case of historical customer data for lead or account scoring, this would be the total number of opportunities or closed/won deals you have in your CRM database.

Of course, these positive signals sit alongside negative ones. Make sure your positive is defined as a relatively significant achievement in the pipeline. For example, the creation of an opportunity is a meaningless milestone if it happens for every single free trial that comes in. Instead, consider going further down the funnel to find a tougher hurdle that really points to lead quality.


Predictions will be most accurate when you have around 400 to 500 of these positive results. In that range, they can be randomized and split into two portions (60% and 40%), with the larger portion used to build the model and the smaller one used to test and compare models. If you have fewer than a hundred examples to test your model over, your results won’t be quite as precise as you might want (until you add more data over time and refresh the model).
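To make the 60/40 split concrete, here is a minimal Python sketch. It is only an illustration, not how any particular platform does it: the function name `split_examples`, the fixed random seed, the made-up deal IDs, and the idea of treating the larger portion as the "build" set and the smaller as a held-out test set are all my assumptions.

```python
import random


def split_examples(examples, first_fraction=0.6, seed=0):
    """Shuffle the examples and split them into two portions.

    Hypothetical helper: first_fraction=0.6 gives the 60%/40% split
    described above. The seed only makes the shuffle repeatable.
    """
    shuffled = list(examples)               # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # randomize the order
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up data: 450 closed/won deals, identified only by a label.
positives = [f"deal-{i}" for i in range(450)]
build_set, test_set = split_examples(positives)

print(len(build_set), len(test_set))
```

With 450 positives, the smaller portion holds 180 examples, comfortably above the hundred-example mark. With only 200 positives, the same split would leave just 80 examples to test against.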


How AI Solves the Data Hygiene Problem

The truth is that no business has perfect-quality, complete data, but that’s okay. Modern data preparation techniques are built to work around that very problem, so there’s no need to delay AI initiatives while you wade through cumbersome data clean-up projects. If you do, you’ll just leave revenue opportunities on the table. By matching whatever limited lead data you have with hundreds of external signals from the web, predictive platforms like Infer can build a complete picture of each prospect or customer. In fact, our algorithms can produce lead scores with nothing more than a company name or an email address. That’s thanks to advanced data science approaches like Natural Language Processing (NLP), which can bridge gaps in your data by looking for patterns in web crawls, performing title normalization, and doing spam analysis on form input.


Title Normalization

Anyone who has sold into IT or the sales and marketing industry knows that job titles are all over the place (or sometimes not included in the data at all). Title normalization techniques tend to be especially important for lead fit models, because you need to know that “Marketing Director” might be equivalent to “demand gen lead,” or, applying the same idea to company names, that “IBM” and “International Business Machines” are the same company. NLP essentially splits out each word that exists across all of your records, and uses an algorithm to assess related patterns and find the words that show up most often in positive outcomes for a particular data set.
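The word-splitting idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Infer's algorithm: the helper names `tokenize` and `positive_rate_by_word`, the sample records, and the choice to measure each word by the share of its records that were positive are all assumptions on my part.

```python
import re
from collections import Counter


def tokenize(title):
    """Lowercase a job title and split it into individual words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def positive_rate_by_word(records):
    """Map each word to the share of records containing it that were positive.

    records: iterable of (title, is_positive) pairs (hypothetical format).
    """
    seen = Counter()   # records in which the word appears
    won = Counter()    # positive records in which the word appears
    for title, is_positive in records:
        for word in set(tokenize(title)):   # count each word once per record
            seen[word] += 1
            if is_positive:
                won[word] += 1
    return {word: won[word] / seen[word] for word in seen}


# Made-up records: (job title, did the lead become a positive outcome?)
records = [
    ("Marketing Director", True),
    ("Director of Demand Gen", True),
    ("demand gen lead", True),
    ("Marketing Intern", False),
    ("Student", False),
]

rates = positive_rate_by_word(records)
for word in ("director", "demand", "marketing", "intern"):
    print(word, rates[word])
```

In this toy data set, "director" and "demand" only ever appear in positive records, while "marketing" appears in one positive and one negative record. That is how differently worded titles can end up pointing to the same kind of buyer.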

Spam Analysis

Another sophisticated feature to look for is spam analysis – something that’s often used in consumer search engines like Google. By analyzing the number of capitalized characters and the keystroke pattern for a name, company, title or email, you can assess the likelihood that each data point is a legitimate input. For example, the way a person’s fingers traveled across the keyboard (e.g. the number of row switches) often indicates whether their entry is legitimate. An email like [email protected] doesn’t travel very far and is probably not a real address. Machine learning can perform these checks on every single record, regardless of whether or not it matches a known website domain.
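Here is a rough Python sketch of the two signals mentioned above: keyboard row switches and capitalized characters. Treat it as an illustration only. The QWERTY row layout, the function names, and the sample inputs are my assumptions, and a real system would feed features like these into a model rather than judge an entry on them directly.

```python
# Assumed US QWERTY layout; rows are numbered from top to bottom.
KEYBOARD_ROWS = ("1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm")
ROW_OF = {ch: i for i, row in enumerate(KEYBOARD_ROWS) for ch in row}


def row_switches(text):
    """Count how often consecutive typed characters sit on different rows.

    Characters that are not in the layout above (spaces, '@', '.') are skipped.
    """
    rows = [ROW_OF[ch] for ch in text.lower() if ch in ROW_OF]
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)


def capital_count(text):
    """Count the capitalized characters in the input."""
    return sum(1 for ch in text if ch.isupper())


def input_features(text):
    """Bundle the simple signals for one form field (hypothetical feature set)."""
    return {
        "length": len(text),
        "row_switches": row_switches(text),
        "capitals": capital_count(text),
    }


# Made-up form inputs: one typed along a single row, one ordinary word.
for value in ("asdfasdf", "marketing", "JOHN smith"):
    print(value, input_features(value))
```

An entry typed along a single row produces zero row switches, whereas an ordinary word of similar length moves between rows several times.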

As you can imagine, NLP alone can help you immediately improve your data hygiene. That’s why, instead of doing months of data cleansing first in hopes of being able to get better intelligence later on, it’s smarter to get your predictive and AI initiatives started now, with the data you have. There’s no sense in spending time and money augmenting fields and cleaning up data that isn’t helpful for your models anyway. Rather, use machine learning to figure out what your most important data points actually are, and then focus your data cleanup efforts there as needed.

It’s important to understand common data science methodologies like these as you move forward, even if you never intend to work with the algorithms yourself. This knowledge will help you spot flaws, unrealistic expectations, hidden assumptions, and missing pieces in predictive and AI solutions, so that you can thoughtfully evaluate them. In my next post, I’ll expand further on basic model types and more problems sales and marketing teams can solve with data.


