Stuck in the Muck: Big Data means Big Problems

Mike Dillinger, PhD

Published: July 31, 2024

Imagine that your organization is a sleek thing of beauty, like a very fast, very expensive, highly polished Ferrari. But it can't reach top speed – or its true potential – because it's stuck in the mud all the way to the top of its wheels.

And your data is the key problem because data is the new mud. Big Data, especially Big Text Data, is slowing you down and holding you back – no matter how hard your team pushes, no matter how fast your Ferrari can go. Regardless of what datacenter and database vendors tell you, your data is the problem, not the solution. And by the way, generative AI won't magically save you from the data swamps that your organization has created – it's built on top of the same quivering muck that you need to be rid of.

Decades ago, the Big Data movement created a paradise for statistics- and math-minded workers as organizations everywhere accumulated oceans of numeric data. But it also opened the floodgates for a vast mudslide of text data, as well: labels, words, queries, sentences, reviews, documents, emails, comments, suggestions, complaints, text messages, and more. Data that no one was prepared for. Data that databases don't work well with. Billions upon billions of texts! I've even seen estimates that today 80% of all data is text data.

Take LinkedIn for one example. When I worked there, member profiles (those mini-résumés that everyone posts to promote themselves) mentioned more than 150 million deduped job titles, in 23 languages. (Now there are many, many more!) And there were millions more job titles (usually worded differently from profile jobs) in job postings, as well. Text, text, and more text!

A treasure trove, right? Wrong!

It's more like a humongous, web-scale headache! Search, ads, insights, and recommendations – all key drivers of revenue and engagement – are constantly marred with wildly irrelevant, incorrectly labeled "related" results. For one simple example, the vast majority of jobs for which I am supposedly a "top candidate" (the very best matches!) are positions that I have absolutely no training, experience, or qualifications for (much less any interest). Today I got a new "top pick": as a Senior Archeologist! I also, apparently, use many of the same keywords that a very highly qualified Distinguished Software Engineer uses – but why should I care? Granted, my background looks like a dog's breakfast, but their simplistic string-matching approach yields results that are most often useless. This painfully naïve approach also means that I'll see the wrong ads, retrieve the wrong search results, and stats about or insights into job trends will see me (and millions of others) counted in the wrong places. Clearly, LinkedIn will also get more complaints, so they'll have more support costs and more missed or delayed sales, as well.

All because they have too much text data and too few reliable engineering methods for processing it.

Each and every mismatch like this is a tear in LinkedIn's reputation and a dip in revenue. For web-scale operations, that ends up being very real money left behind. Instead of being the best, most reliable service, they're simply better than most others – a rather low bar as it turns out. My team there ended up developing robust methods and resources to address these problems, so I know that solutions are available, even if they are not widely known, widely understood, or widely adopted.

LinkedIn is just a single example. Almost all other organizations have similar problems. If customers are searching for products or content and can't find them, then you won't sell. If clients depend on your data but it's not reliable, then they won't renew their subscriptions. If you match ads to viewers incorrectly, then your clients won't sell as much. If you build gen AI on top of all this mucky data, you get hallucinations.

In sum, the state of the art for Text Data is this:

A huge swath of revenue generation depends directly on text data and how we process it. There's a huge amount of business value that is untapped and virtually unexplored.
Organizations have accumulated vast amounts of text data – Big Data – without understanding which parts of it are reliable, valuable, or interoperable.
Huge expenses are associated with storing, protecting, and processing all this data – with unclear returns.
Engineering methods for processing text – as seen in lousy search results and generative AI's hallucinations – are simply ineffectual in real life. These methods relate words based on their order or spelling rather than on their meaning.
Software engineers' training pays precious little attention to text and string data. To the point where they call text "unstructured data" – essentially, random stuff. That's not exactly a rich conceptual framework for them to start from, especially given the scale and impact of such an important problem. The software engineers that we rely on so heavily for processing numeric data are clearly at a loss for how to effectively face this massive onslaught of words.
Effective but little-known methods are available for addressing the complexities of text data.
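
To make the "spelling rather than meaning" point concrete, here is a minimal sketch in Python. It is not what LinkedIn or any particular product runs; it simply scores pairs of job titles by character-level similarity using the standard library's difflib.SequenceMatcher. The example titles and the helper name spelling_similarity are assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical titles, chosen only to illustrate the problem.
# Two names for roughly the same job, spelled very differently:
same_job = spelling_similarity("Software Engineer", "Programmer")
# Two different jobs that happen to share most of their letters:
different_job = spelling_similarity("Software Engineer", "Sales Engineer")

print(f"Software Engineer vs Programmer:     {same_job:.2f}")
print(f"Software Engineer vs Sales Engineer: {different_job:.2f}")
```

In this sketch the pair of different jobs scores higher than the pair of near-synonyms, which is the kind of mismatch that surfaces as an irrelevant "top pick".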

More data is definitely NOT better data, no matter what statistically-minded advisors say. The key lessons learned from decades of Big Data are that high-quality data is far more valuable than high-volume data and that text data is very hard to process reliably with engineering methods.

Slogging through LinkedIn-scale text data shows very clearly that what's really needed to advance both business goals and the state of the art is substantially more investment in meaning-centric methods. To dig out from this vast collection of messy, muddy data, we need to transform it into meaningful bricks that we can build with. These bricks we can shape into powerfully structured knowledge graphs that have already proven invaluable when we need to aggregate, integrate, and evaluate data or tame unruly language models.
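
One hedged way to picture the "bricks": map each surface string to a concept identifier, then aggregate by concept instead of by spelling. The sketch below uses a tiny hand-made mapping as a stand-in for a real knowledge graph. Every name in it (TITLE_TO_CONCEPT, BROADER, count_by_concept, the concept IDs) is hypothetical and invented for illustration, not taken from any actual system.

```python
from collections import Counter

# Hypothetical, hand-curated mapping from surface strings to concept IDs.
TITLE_TO_CONCEPT = {
    "software engineer": "concept:software_engineer",
    "programmer": "concept:software_engineer",
    "sw eng": "concept:software_engineer",
    "sales engineer": "concept:sales_engineer",
}

# Hypothetical "broader concept" edges: the graph part of the knowledge graph.
BROADER = {
    "concept:software_engineer": "concept:engineering_role",
    "concept:sales_engineer": "concept:sales_role",
}


def count_by_concept(titles):
    """Count titles per concept; unmapped titles are flagged, not guessed."""
    counts = Counter()
    for title in titles:
        key = title.strip().lower()
        counts[TITLE_TO_CONCEPT.get(key, "concept:UNMAPPED")] += 1
    return counts


# Illustrative input only.
profiles = ["Software Engineer", "Programmer", "SW Eng",
            "Sales Engineer", "Archeologist"]
counts = count_by_concept(profiles)
for concept, n in counts.most_common():
    print(concept, n, "->", BROADER.get(concept, "(no broader concept)"))
```

The design choice worth noting is the explicit UNMAPPED bucket: a title the mapping does not cover is set aside for review rather than matched to whatever string looks closest.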

Your Ferrari will go much faster on a track paved with bricks than on one knee-deep in mud.

Knowledge Architecture

Putcha Narasimham

Founder Proprietor at Knowledge Enabler Systems

3 months ago

Text is "inappropriately" dubbed and "blamed" as "unstructured", but most text we deal with is certainly well-structured according to the language grammar and conventions humans use in their communications. It may not be perfect, but it is NOT unstructured. Text in print or speech is the ultimate form for agreements among humans. The only thing is that text is NOT readily machine-processable yet. So, there is no scope for "semantic computation" equivalent to "numerical computation". With "knowledge hypergraphs" the problem of machine compatibility of text is well settled, at least conceptually. Now, it is a short leap to interconvert "text to knowledge hypergraphs" and vice versa, with human assistance. Progressively that can be automatic, with flagging for necessary human validation. Then we can rely on machines to aid humans to make sense and validate understanding of most text. Some text will still need human analysis, arguments and negotiations, but that should be a minuscule fraction of the volume of text now in use. Soon enough bulk text will not be in much use. X tweets and messages will dominate human communications and interworking. Let us develop on this [email protected]

Gideon Kory, CFA

Artificially Intelligent. Bringing together people, ideas, and data. I am because we are.

3 months ago

How to get every knowledge worker engaged in #dataintelligence and owning their domain's #datagovernance? How to bring business context into data-centric decision making?

IZEON IT TRAINING

3 months ago

Mike Dillinger, PhD https://www.dhirubhai.net/feed/update/urn:li:activity:7224624904082448384

Gordon Hamilton

Data Quality Evangelist for 20 years, steadily improving my ability to communicate the importance of DQ for Cost Reduction & Data Monetization.

3 months ago

Loved "data is the new mud".

Andrew McFadzean

Researcher

3 months ago

Clearly, briefly and well stated Mike. The photo is a bit out of sync to your otherwise well expressed message. Great to see Knowledge Architecture supported as system informatics. Thank you for your posting Mike.
