DataScava: How It Pinpoints Unstructured Text Data Using Your Business Language

To learn more, read Machines in the Conversation: The Case for a More Data-Centric AI, published in CDO Magazine, and a series of commissioned articles about DataScava and TalentBrowser written by Scott Spangler, former IBM Watson Health Researcher, Chief Data Scientist, and author of the book Mining the Talk: Unlocking the Business Value in Unstructured Information.


By John Harney and Janet Dwyer



Do you know what a 'zettabyte' is?

By 2025, IDC estimates there will be a mind-boggling 175 zettabytes of global data (in bytes, that's 175 followed by 21 zeros). An astonishing 80% of this data will be unstructured, predominantly in text format, and it holds the key to valuable insights.

Yet, for many organizations, extracting meaningful information from these vast, untapped data sources within a specific business context remains a formidable challenge.

Enter DataScava.

Our unstructured data miner leverages your business and domain-specific language to pinpoint the high-value text data you need for applications in AI, LLMs, ML, RPA, BI, Talent, Research, BAU applications, and more.

In this article, we'll explain how it helps people to unlock the full potential of their data by defining the abstract topics and themes that represent their own business and subject matter expertise, applying both to big data sets in real time.

It also keeps the Human in Command in the era of AI and advanced analytics.


Establish a Common Language Across Teams

This recent MIT Management post, How to Set Technology Strategy in the Age of AI, outlines three strategic imperatives for companies navigating the realm of artificial intelligence: value creation, value capture, and value delivery.

Within this context, it emphasizes the significance of "complementary assets, which refers to what a company develops to exploit the knowledge that its innovation generates."

The post identifies two key types of complementary assets that help companies create more value from their generative AI products:


"One is data, because a large language model trained on a unique, proprietary data set may be more valuable than one trained on widely available public data sets. Another is proprietary technology that improves accuracy, reduces bias, and makes models stand out in the crowd."


It's within this framework that DataScava emerges as a pivotal complementary asset. Collaborating seamlessly with human expertise, DataScava takes center stage in helping businesses transform their chaotic unstructured text data into structured data that is accessible, comprehensible, measurable, and leads to actionable insights.


Use Cases: Empowering Various Roles



"DataScava enables a more data-centric approach to business applications, with topic models which reflect the primary areas of focus, flexible topic scoring to encode your organization’s priorities, and customized text processing that mirrors the way people actually communicate in the industry." - Scott Spangler


DataScava is designed for a diverse range of professionals, including data scientists, data analysts, researchers, BI and operations specialists, SMEs, talent professionals, IT, and more. By bridging the gap between technical and non-technical teams, DataScava facilitates seamless cross-collaboration.

With DataScava, you can:

  • Index, measure, curate, filter, match, classify, and label raw messy text data automatically using your business language and domain expertise.
  • Curate quality training data from large data sets to unleash AI and optimize machine learning models.
  • Identify relevant unstructured text data to enhance business intelligence tools and research.
  • Triage, filter, route, and efficiently manage emails, inquiries, and service desk tickets for BAU operations or RPA.
  • Measure, filter, and match skills for talent intelligence and job matching using TalentBrowser's Automated Skills Analytics powered by DataScava.
  • Mine raw text to extract relevant information from notes, reports, contracts, transcripts, news feeds, subscriptions, and beyond.


The Three Proprietary Methods of DataScava

Our practical, easy-to-use toolset lets you capture the business ontologies and context that provide the critical bridge between unstructured data analysis using standard data science techniques and the human expertise that gives your organization its competitive edge.

DataScava employs three proprietary methodologies, each designed to get you the data and enhance your capabilities:

Tailored Topics Taxonomies (TTT):

  • TTT models features and topics within heterogeneous text, with specialized taxonomies that can be selected, created, edited, or imported.
  • Facilitates the capture of user-defined and controlled business language and domain expertise.
  • Allows for the creation of highly customized business vocabularies and logic, essential for processing complex documents.
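
To make the idea of a user-controlled taxonomy concrete, here is a minimal Python sketch. It is an illustration only, not DataScava's proprietary implementation: the `Topic` and `Taxonomy` classes, their methods, and the JSON import format are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One user-defined topic: a name plus the business terms that signal it."""
    name: str
    terms: set = field(default_factory=set)


@dataclass
class Taxonomy:
    """A named collection of topics that users can create, edit, or import."""
    name: str
    topics: dict = field(default_factory=dict)

    def add_term(self, topic_name: str, term: str) -> None:
        # Create the topic on first use, then store the term in lowercase.
        topic = self.topics.setdefault(topic_name, Topic(topic_name))
        topic.terms.add(term.lower())

    def remove_term(self, topic_name: str, term: str) -> None:
        # Editing a taxonomy: drop a term that no longer reflects the business.
        self.topics[topic_name].terms.discard(term.lower())

    @classmethod
    def from_json(cls, text: str) -> "Taxonomy":
        # Hypothetical import format: {"name": ..., "topics": {topic: [terms]}}
        raw = json.loads(text)
        taxonomy = cls(raw["name"])
        for topic_name, terms in raw["topics"].items():
            for term in terms:
                taxonomy.add_term(topic_name, term)
        return taxonomy


# Import a small taxonomy, then edit it with a new business term.
imported = Taxonomy.from_json(
    '{"name": "Financial Services", "topics": {"Risk": ["VaR", "Stress Test"]}}'
)
imported.add_term("Risk", "Credit Risk")
```

Storing terms in lowercase is a simplifying assumption so that later matching can be case-insensitive; a real vocabulary might need to preserve case for acronyms.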

Domain-Specific Language Processing (DSLP):

  • DSLP complements or serves as an alternative to Natural Language Processing (NLP) and Natural Language Understanding (NLU).
  • Incorporates domain-specific knowledge into the language processing pipeline.
  • Indexes text at the file level to generate weighted topic scores and surface relevant textual files.
  • Produces value-added metadata for use in other systems and to enhance charting.
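
File-level indexing can be pictured with a short Python sketch. This is a simplified stand-in, not the DSLP method itself: it assumes a taxonomy is just a mapping from topic names to term lists, and the names `TAXONOMY`, `count_term`, and `index_file` are hypothetical.

```python
import re

# Hypothetical taxonomy: each topic maps to the domain terms that signal it.
TAXONOMY = {
    "Risk Management": ["credit risk", "var", "stress test"],
    "Cloud": ["aws", "kubernetes"],
}


def count_term(text: str, term: str) -> int:
    """Count whole-term, case-insensitive occurrences of a term in the text."""
    # Lookarounds stop "var" from matching inside a word such as "variance".
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def index_file(text: str, taxonomy: dict) -> dict:
    """Return {topic: {term: count}} for one file, keeping only terms found."""
    index = {}
    for topic, terms in taxonomy.items():
        hits = {term: count_term(text, term) for term in terms}
        index[topic] = {term: n for term, n in hits.items() if n > 0}
    return index


sample = "The VaR model failed the stress test. Credit risk and VaR limits were breached."
metadata = index_file(sample, TAXONOMY)
```

The resulting dictionary is the kind of per-file metadata that could feed a scoring step or be exported to another system.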


Weighted Topic Scoring (WTS):

  • WTS accurately measures and matches topics, operating based on user-defined score thresholds.
  • Labels documents into appropriate cohesive categories utilizing heuristic techniques customized for specific business purposes.
  • Facilitates a true human-machine partnership, adapting to an ever-changing environment.
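
One simple way to picture threshold-based topic scoring is a weighted sum of term counts per topic. The Python sketch below assumes that scoring rule for illustration; the actual WTS formula is proprietary and not described here, and the function names, weights, and thresholds are hypothetical.

```python
def topic_score(term_counts: dict, term_weights: dict) -> float:
    """Sum count x weight over the terms found in one file for one topic."""
    return sum(
        count * term_weights.get(term, 0.0)
        for term, count in term_counts.items()
    )


def match_file(file_counts: dict, weights: dict, thresholds: dict) -> dict:
    """Return {topic: score} for every topic that meets its user-set threshold."""
    matches = {}
    for topic, term_weights in weights.items():
        score = topic_score(file_counts.get(topic, {}), term_weights)
        # A topic with no threshold defined never matches.
        if score >= thresholds.get(topic, float("inf")):
            matches[topic] = score
    return matches


# Hypothetical user-defined weights and thresholds.
weights = {
    "Risk": {"var": 2.0, "stress test": 3.0},
    "Cloud": {"aws": 1.0},
}
thresholds = {"Risk": 5.0, "Cloud": 2.0}

# Term counts for one file, grouped by topic.
file_counts = {"Risk": {"var": 2, "stress test": 1}, "Cloud": {"aws": 1}}
labels = match_file(file_counts, weights, thresholds)
```

Because the weights and thresholds are plain data that a person sets, a subject matter expert can adjust them and immediately see which files match, which is the human-in-the-loop behavior the bullets above describe.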


A Weighted Topic Scoring File Match


Weighted Topic Scores Metadata in a File Matches data grid




How DataScava Sets Itself Apart

DataScava . . .

Encapsulates Expertise: continuously incorporates your business language and domain subject matter expertise into its software.

Precise Tuning: is tuned to discover precisely what you're seeking, avoiding assumptions made by traditional AI or ML systems.

Pre-Built Taxonomies: offers pre-built editable taxonomies for the financial services, IT, and recruitment domains while allowing you to create your own.

A Top-Down Approach: works top-down through your entire corpus at the file level, not the sentence level.

Auditable Transparency: offers auditable corpus-level statistics that are explainable, transparent, and provable.

Sortable Metadata: generates sortable metadata, making it easy to summarize textual content in a numerical format.

Focused Insights: measures and color-codes topics, highlights key terms in relevant files, and efficiently filters out irrelevant data.
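
As a small illustration of why numerical topic metadata is convenient, the Python sketch below ranks files by one topic's score and drops files that do not register on it. The rows, file names, and `rank_by_topic` function are hypothetical and only mimic the shape of a data grid.

```python
# Hypothetical per-file topic scores, shaped like rows in a data grid.
rows = [
    {"file": "ticket_001.txt", "Risk": 7.0, "Cloud": 0.0},
    {"file": "ticket_002.txt", "Risk": 0.0, "Cloud": 4.0},
    {"file": "ticket_003.txt", "Risk": 2.5, "Cloud": 1.0},
]


def rank_by_topic(rows: list, topic: str, min_score: float = 0.0) -> list:
    """Keep rows scoring above min_score on the topic, highest score first."""
    relevant = [row for row in rows if row.get(topic, 0.0) > min_score]
    return sorted(relevant, key=lambda row: row.get(topic, 0.0), reverse=True)


top_risk_files = [row["file"] for row in rank_by_topic(rows, "Risk")]
```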



Get in Touch

Visit our DataScava and TalentBrowser websites to view our videos and get more product information. Contact me at [email protected].
