Stuck in the Muck: Big Data means Big Problems
Imagine that your organization is a sleek thing of beauty, like a very fast, very expensive, highly polished Ferrari. But it can't reach top speed – or its true potential – because it's stuck in the mud all the way to the top of its wheels.
And your data is the key problem because data is the new mud. Big Data, especially Big Text Data, is slowing you down and holding you back – no matter how hard your team pushes, no matter how fast your Ferrari can go. Regardless of what datacenter and database vendors tell you, your data is the problem, not the solution. And by the way, generative AI won't magically save you from the data swamps that your organization has created – it's built on top of the same quivering muck that you need to be rid of.
Decades ago, the Big Data movement created a paradise for statistics- and math-minded workers as organizations everywhere accumulated oceans of numeric data. But it also opened the floodgates to a vast mudslide of text data: labels, words, queries, sentences, reviews, documents, emails, comments, suggestions, complaints, text messages, and more. Data that no one was prepared for. Data that databases don't work well with. Billions upon billions of texts! I've even seen estimates that today 80% of all data is text data.
Take LinkedIn for one example. When I worked there, member profiles (those mini-résumés that everyone posts to promote themselves) mentioned more than 150 million deduped job titles, in 23 languages. (Now there are many, many more!) And there were millions more job titles (usually worded differently from profile jobs) in job postings, as well. Text, text, and more text!
A treasure trove, right? Wrong!
It's more like a humongous, web-scale headache! Search, ads, insights, and recommendations – all key drivers of revenue and engagement – are constantly marred by wildly irrelevant, incorrectly labeled "related" results. For one simple example, the vast majority of jobs for which I am supposedly a "top candidate" (the very best matches!) are positions that I have absolutely no training, experience, or qualifications for (much less any interest in). Today I got a new "top pick": Senior Archeologist! I also, apparently, use many of the same keywords that a very highly qualified Distinguished Software Engineer uses – but why should I care? Granted, my background looks like a dog's breakfast, but their simplistic string-matching approach yields results that are most often useless. This painfully naïve approach also means that I'll see the wrong ads, retrieve the wrong search results, and statistics about or insights into job trends will count me (and millions of others) in the wrong places. Clearly, LinkedIn will also get more complaints, so they'll have higher support costs and more missed or delayed sales.
All because they have too much text data and too few reliable engineering methods for processing it.
Each and every mismatch like this is a tear in LinkedIn's reputation and a dip in revenue. For web-scale operations, that adds up to very real money left on the table. Instead of being the best, most reliable service, they're simply better than most others – a rather low bar, as it turns out. My team there ended up developing robust methods and resources to address these problems, so I know that solutions are available, even if they are not widely known, widely understood, or widely adopted.
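To make the failure mode concrete, here is a minimal, hypothetical sketch of keyword-overlap matching in Python. The scoring function and the sample texts are invented for illustration; they are not LinkedIn's actual system.

```python
import re

# Hypothetical illustration of naive keyword-overlap "matching":
# shared surface strings are treated as evidence of relevance.

def keywords(text: str) -> set[str]:
    """Lowercase the text and extract word tokens, a typical quick baseline."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_overlap(profile_text: str, job_text: str) -> float:
    """Score a profile against a job by the fraction of job keywords it shares."""
    profile_kw, job_kw = keywords(profile_text), keywords(job_text)
    return len(profile_kw & job_kw) / len(job_kw) if job_kw else 0.0

profile = ("Senior engineer who has led field research, data analysis, "
           "site surveys, and excavation of legacy systems")
job = "Senior Archeologist: field research, site surveys, excavation, data analysis"

# High overlap, yet the person is obviously not an archeologist:
print(f"{keyword_overlap(profile, job):.2f}")  # ~0.89, a spurious "top candidate"
```

The two texts share eight of nine job keywords, so a purely lexical matcher ranks this as a near-perfect fit even though the meanings are entirely different.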
LinkedIn is just a single example. Almost all other organizations have similar problems. If customers are searching for products or content and can't find them, then you won't sell. If clients depend on your data but it's not reliable, then they won't renew their subscriptions. If you match ads to viewers incorrectly, then your clients won't sell as much. If you build gen AI on top of all this mucky data, you get hallucinations.
In sum, the state of the art for Text Data is this:
More data is definitely NOT better data, no matter what statistically minded advisors say. The key lessons learned from decades of Big Data are that high-quality data is far more valuable than high-volume data, and that text data is very hard to process reliably with engineering methods.
Slogging through LinkedIn-scale text data shows very clearly that what's really needed to advance both business goals and the state of the art is substantially more investment in meaning-centric methods. To dig out from this vast collection of messy, muddy data, we need to transform it into meaningful bricks that we can build with. We can shape these bricks into powerfully structured knowledge graphs, which have already proven invaluable when we need to aggregate, integrate, and evaluate data or tame unruly language models.
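As a minimal illustration of what those bricks might look like, here is a toy sketch in Python. The concept IDs and relation names are invented for this example, not drawn from any real schema: raw job-title strings are normalized to canonical concepts and linked by explicit relations, so downstream matching can follow meaning instead of surface strings.

```python
# Toy sketch of the "bricks" idea: a tiny knowledge-graph fragment as triples.
# All identifiers below are invented for illustration.

triples = [
    # (subject,                    relation,          object)
    ("Sr. Software Eng",           "normalizes_to",   "title:software_engineer"),
    ("Software Developer",         "normalizes_to",   "title:software_engineer"),
    ("title:software_engineer",    "requires_skill",  "skill:programming"),
    ("title:senior_archeologist",  "requires_skill",  "skill:excavation"),
    ("title:software_engineer",    "in_family",       "family:engineering"),
]

def titles_requiring(skill: str) -> set[str]:
    """Find canonical job titles that explicitly require a given skill."""
    return {s for s, rel, o in triples if rel == "requires_skill" and o == skill}

# Matching now follows explicit, curated relations instead of shared strings:
print(titles_requiring("skill:programming"))  # {'title:software_engineer'}
```

A production knowledge graph would of course carry far richer typing, provenance, and multilingual coverage, but even this toy shows the shift from matching strings to matching concepts.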
Your Ferrari will go much faster on a track paved with bricks than on one knee-deep in mud.
Comments

Founder Proprietor at Knowledge Enabler Systems
Text is "inappropriately" dubbed and "blamed" as "unstructured," but most text we deal with is certainly well-structured according to the language grammar and conventions humans use in their communications. It may not be perfect, but it is NOT unstructured. Text in print or speech is the ultimate form for agreements among humans. The only thing is that text is NOT readily machine-processable yet, so there is no scope for "semantic computation" equivalent to "numerical computation." With "knowledge hypergraphs" the problem of machine compatibility of text is well settled, at least conceptually. Now it is a short leap to interconvert text and knowledge hypergraphs, with human assistance. Progressively that can become automatic, with flagging for necessary human validation. Then we can rely on machines to aid humans in making sense of and validating their understanding of most text. Some text will still need human analysis, argument, and negotiation, but that should be a minuscule fraction of the volume of text now in use. Soon enough, bulk text will not be in much use; X tweets and messages will dominate human communications and interworking. Let us develop on this. [email protected]
Artificially Intelligent. Bringing together people, ideas, and data. I am because we are.
How to get every knowledge worker engaged in #dataintelligence and own their domain #datagovernance? How to bring business context into data-centric decision making?
Mike Dillinger, PhD https://www.dhirubhai.net/feed/update/urn:li:activity:7224624904082448384
Data Quality Evangelist for 20 years, steadily improving my ability to communicate the importance of DQ for Cost Reduction & Data Monetization.
Loved "data is the new mud".
Researcher
Clearly, briefly, and well stated, Mike. The photo is a bit out of sync with your otherwise well-expressed message. Great to see Knowledge Architecture supported as system informatics. Thank you for your posting, Mike.