Semantics Is Not Search (but it can help!)
A common misconception I've seen when introducing people to semantics is the tendency to see it as "search". Semantics, it seems, is perceived as being a lot like Sriracha sauce - a little sprinkle of it will make your search spicier and more relevant.
The problem with this is not that it's incorrect - semantics can make search more relevant, if used right - but that it assumes most semantic technologies can be added as a little sprinkle to already existing search systems and suddenly everything tastes better. The reality is that using semantics in this way probably gives you the smallest bang for your buck.
Most traditional document searches, when it comes right down to it, use an index to associate one bit of data with another. The most expensive type of search you can do in the document space is to examine every word or sequence of words in a document. This is not feasible when you're talking about more than a few documents, and when you start talking about thousands of documents or more, it's pretty much a non-starter.
So how do search engines work? Well, they do precisely this - examining every single word in a document. The difference is that they do it precisely once - when the document enters the document repository. Take A Tale of Two Cities, by Charles Dickens, with its famous opening line "It was the best of times, it was the worst of times."
A typical search engine will associate an identifier with a document, then "tokenize" the document - breaking it up by word, eliminating white space and most punctuation. So, let's say that the identifier is something like #Tale2Cities# (just to make it easy to differentiate). The search engine then creates an index, with the word as a lookup key and the document as a target:
"it" ? #Tale2Cities#
"was" ? #Tale2Cities#
"the" ? #Tale2Cities#
"best" ? #Tale2Cities#
"of" ? #Tale2Cities#
"times" ? #Tale2Cities#
"worst" ? #Tale2Cities#
Notice that once they find a word, they don't need to create another index entry for it. Now suppose a second document is stored - call it #MissingBook# - containing the phrase "The book, it was missing". Once this second document is loaded into the search engine, the table would look like the following:
"it" ? #Tale2Cities#,#MissingBook#
"was" ? #Tale2Cities#,#MissingBook#
"the" ? #Tale2Cities#
"best" ? #Tale2Cities#
"of" ? #Tale2Cities#
"times" ? #Tale2Cities#
"worst" ? #Tale2Cities#
"the" ? #MissingBook#
"book" ? #MissingBook#
"missing" ? #MissingBook#
So if you were to look for the word "book", you would only get back the documents that have that word (in this case, just #MissingBook#). If you have multiple keywords (such as "the book"), then in the simplest case you'd look up each keyword's documents and, depending upon the database engine, take either the intersection of the sets or the union (an AND versus an OR search), as the sketch below shows.
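To make this concrete, here is a minimal sketch of such an inverted index in Python. This is an illustration, not any particular engine's implementation - the tokenizer, document identifiers, and AND/OR logic are all deliberately simplified:

import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into words, dropping punctuation.
    return re.findall(r"[a-z']+", text.lower())

index = defaultdict(set)  # word -> set of document identifiers

def add_document(doc_id, text):
    # Index each distinct word once per document.
    for word in set(tokenize(text)):
        index[word].add(doc_id)

add_document("#Tale2Cities#", "It was the best of times, it was the worst of times.")
add_document("#MissingBook#", "The book, it was missing.")

def search(query, mode="and"):
    # Look up each keyword's documents, then intersect (AND) or union (OR).
    doc_sets = [index.get(word, set()) for word in tokenize(query)]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets) if mode == "and" else set.union(*doc_sets)

print(search("the book"))             # {'#MissingBook#'} - both words required
print(search("the book", mode="or"))  # both documents - either word suffices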
Now, most modern document stores have more than one index. Some indexes retain additional information about the document matches, such as the location of a word in the document, how many times the document contains that word, and so forth. Other indexes look at sequences such as "it was", "was the", "the best", "best of", "of times" and so forth, and store these as keys. This is used heavily in natural language processing, both because it makes it much easier to predict what a person is likely to enter given two or three words, and because the number of distinct two-, three-, or n-term phrases that actually occur is roughly comparable to the number of unique words in a book. A sketch of such a word-pair index follows.
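Here is a hedged sketch of a bigram (word-pair) index, reusing the tokenize() helper and imports from the sketch above - real engines also store positions and frequency statistics, but the lookup principle is the same:

bigram_index = defaultdict(set)

def add_document_bigrams(doc_id, text):
    # Index each adjacent pair of words as a single key.
    words = tokenize(text)
    for first, second in zip(words, words[1:]):
        bigram_index[(first, second)].add(doc_id)

add_document_bigrams("#Tale2Cities#", "It was the best of times, it was the worst of times.")

# Phrase lookup and next-word prediction both reduce to key lookups:
print(bigram_index[("it", "was")])                             # {'#Tale2Cities#'}
print([pair[1] for pair in bigram_index if pair[0] == "was"])  # words seen after "was"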
Relational database indexes work in a similar fashion, but in this case, rather than a key pointing to a document identifier, the key usually points to a row within a table (which can be roughly generalized as analogous to an object). However, it is also possible to set up lexical indexes on specific fields within an RDBMS that match a desired keyword with a given row in a table, as sketched below.
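As a small illustration of a lexical index on a field within a relational database, here is a sketch using SQLite's FTS5 full-text extension from Python (this assumes your SQLite build includes FTS5; the table and data are invented for the example):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE books USING fts5(title, body)")
con.execute("INSERT INTO books VALUES ('A Tale of Two Cities', 'It was the best of times...')")
con.execute("INSERT INTO books VALUES ('The Missing Book', 'The book, it was missing.')")

# MATCH consults the full-text index rather than scanning every row's text.
for (title,) in con.execute("SELECT title FROM books WHERE books MATCH 'missing'"):
    print(title)  # The Missing Book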
The challenge in all of these cases is that these indexes take up space and, as they get larger, time in searching. One technique used in databases to reduce the latter is to hash a word - convert it into a numeric value that can then be ordered and walked more quickly. This isn't always perfect - a hash function can map multiple words to the same number - but this can be treated as the possibility of returning false positives in a match. Similar techniques can be used to map location coordinates or similar constructs such as dates.
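A toy version of that hashing idea follows - the bucket count is kept deliberately tiny so that collisions (and therefore false positives) are likely, which is why the candidates have to be re-checked:

BUCKETS = 8
hashed_index = {}

for word in ["it", "was", "the", "best", "of", "times", "worst"]:
    slot = hash(word) % BUCKETS  # several words may map to the same number
    hashed_index.setdefault(slot, []).append(word)

# The numeric lookup is fast, but the candidates must be re-checked to
# weed out false positives from colliding words:
candidates = hashed_index.get(hash("best") % BUCKETS, [])
print([w for w in candidates if w == "best"])  # ['best']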
Ultimately, however, almost all search involves indexes to turn text or numeric keys into document references.
Semantics works a bit differently. A triple store can be thought of as two keys that point to either a value or another key. For instance, consider a person named Jane Doe who owns a cat named Ms. Whisper and a dog named MacRuff. This can be written as a set of triples:
<JaneDoe> <is_a> <Person>.
<JaneDoe> <owns_pet> <MsWhisper>.
<JaneDoe> <owns_pet> <MacRuff>.
<JaneDoe> <is_age> years(24).
<JaneDoe> <is_gender> <gender_Female>.
<MsWhisper> <is_a> <Pet>.
<MsWhisper> <is_pet_type> <Cat>.
<MsWhisper> <is_gender> <gender_Female>.
<MsWhisper> <is_breed> <catBreed_RussianBlue>.
<MsWhisper> <is_age> years(4).
<MsWhisper> <is_fictional> boolean(true).
<MacRuff> <is_a> <Pet>.
<MacRuff> <is_pet_type> <Dog>.
<MacRuff> <is_gender> <gender_Male>.
<MacRuff> <is_breed> <dogBreed_Terrier>.
<MacRuff> <is_age> years(3).
<MacRuff> <is_fictional> boolean(true).
This differs from the indexes discussed above in several ways. For starters, it becomes possible to compose complex sentences from simple ones to get information. As an example, I can ask what breeds of animal a person has by composing a set of triple patterns:
select ?petType ?breed where {
  ?person <owns_pet> ?pet.
  ?pet <is_pet_type> ?petType.
  ?pet <is_breed> ?breed.
} values ?person { <JaneDoe> }
This returns a table as follows:
<Dog> <dogBreed_Terrier>
<Cat> <catBreed_RussianBlue>
What this means in practice is that we are using relationships to create (and query) compositions, rather than using keys to perform lookups (as is the case for search). It should be noted that some triple store optimizers do a form of lookup optimization as well, but this is intended primarily as a means to optimize performance, not change the nature of the operation.
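As a toy illustration of that composition idea (and emphatically not a real SPARQL engine), a triple store can be modeled as a set of (subject, predicate, object) tuples, with a query being a chain of lookups joined on shared variables:

triples = {
    ("JaneDoe", "owns_pet", "MsWhisper"),
    ("JaneDoe", "owns_pet", "MacRuff"),
    ("MsWhisper", "is_pet_type", "Cat"),
    ("MsWhisper", "is_breed", "catBreed_RussianBlue"),
    ("MacRuff", "is_pet_type", "Dog"),
    ("MacRuff", "is_breed", "dogBreed_Terrier"),
}

def objects_of(subject, predicate):
    # Every object o such that the triple (subject, predicate, o) exists.
    return [o for s, p, o in triples if s == subject and p == predicate]

# "What breeds of animal does Jane Doe have?" as composed lookups:
for pet in objects_of("JaneDoe", "owns_pet"):
    for pet_type in objects_of(pet, "is_pet_type"):
        for breed in objects_of(pet, "is_breed"):
            print(pet_type, breed)  # Cat catBreed_RussianBlue / Dog dogBreed_Terrier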
So given these distinctions, what does this mean for search vs. semantics? It often comes down to the distinction between concepts and lexical terms. A lexical term is a sequence of characters. To a computer, there's really nothing special about a lexical term - the sequence "Ms. Whisper", for instance, is simply the characters "M","s","."," ",...,"p","e","r". A concept, on the other hand, is an entity - a person, animal, place, event, abstraction and so forth. In a semantic system, <MsWhisper> is a reference to the entity of the cat that Jane Doe owns as a pet.
What a semantic system can do, though, is look for a lexical label of a concept and see if it is in a particular document. As an example, let's say that the triple store also includes two other critical statements:
<MsWhisper> <has_name> "Ms. Whisper".
<MissingBook> <has_representation> #MissingBook#.
What this means is that we can create another index, one that says that if the lexical term "Ms. Whisper" is found in the work (let's say <MissingBook>), then create an association between the document and the concept <MsWhisper>:
<MissingBook> <references> <MsWhisper>.
A point to clarify here - <MissingBook> is a concept - it's a way of abstractly talking about a particular document. #MissingBook#, on the other hand, is the way the system in question identifies which book is referred to. It may be some kind of internal system identifier, or it may be a URL - the key is that the identifier points to the data file that contains the sequence of lexical terms that make up the book itself. In semantic terms, #MissingBook# can be thought of as a "representation" of the work.
Most semantic classifiers work in similar ways - they create an association between the label(s) that a particular concept uses (along with alternate labels or synonyms) and a lexical term within the document. Once that association is created, a person can write a semantic query that uses the constraints set up by the associated triple store to find documents.
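In sketch form, such a classifier can be as simple as the following - real classifiers add synonym lists, stemming, and disambiguation, and the labels and documents here are invented for the example:

labels = {
    "MsWhisper": ["Ms. Whisper"],
    "JaneDoe": ["Jane Doe"],
}
documents = {
    "MissingBook": "One day Jane Doe noticed that Ms. Whisper was missing...",
}

references = set()
for doc, text in documents.items():
    for concept, names in labels.items():
        # On a label hit, assert a <references> triple between document and concept.
        if any(name in text for name in names):
            references.add((doc, "references", concept))

print(sorted(references))
# [('MissingBook', 'references', 'JaneDoe'), ('MissingBook', 'references', 'MsWhisper')]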
For instance, suppose that you wanted to find all works that include fictional cats identified as Russian Blues. You could do a text search for "Russian Blue", but it's possible that your authority for a given cat being a Russian Blue came from the author or some other source, rather than from the text of the book itself. In that case, the book would be missed by the query.
If you did the search semantically, on the other hand, you could do the following query:
select ?bookLocation where {
  ?cat <is_pet_type> <Cat>.
  ?cat <is_breed> <catBreed_RussianBlue>.
  ?cat <is_fictional> boolean(true).
  ?book <references> ?cat.
  ?book <has_representation> ?bookLocation.
}
This would then return a list of references to the documents in the document database that have fictional Russian Blue cats.
Many triple stores also double as document stores - MarkLogic, Allegro, Virtuoso, OntoText and others - and these frequently have extensions that let you do full or partial searches on documents from within SPARQL. For instance, if you were looking for Russian Blue cats in conjunction with the name "Jane Doe" in MarkLogic, you could use the SPARQL query above with one additional line:
filter (cts:contains(fn:doc(?bookLocation), cts:word-query("Jane Doe")))
This uses the fn:doc() function to retrieve the document (based upon its location) from the database, and then runs the word query on that document.
This is very important, because rather than searching through the indexes that may have thousands of references to this person (unlikely here, but far more common in real world examples), you are only looking for those documents for which you have already established that a fictional Russian Blue cat exists. In this sense, the semantic search has served to constrain the result down to perhaps only a handful of documents, and as such can be blazingly fast.
This is not the only thing that can be done with semantics, of course - many use cases for semantics do not even touch documents directly. However, there are also some significant caveats here. The first is that semantics in this case serves primarily as a way of indexing or pre-calculating relationships between a document and a concept, and if the concept (its triples) does not exist in the triple store, then no relationship will be made.
This means that, in general, you have to do the hard work of building and curating an ontology ahead of time in order for it to be useful as a classifier. Semantics is a form of magic, and like all magic, there is always an associated cost. This is why many semantic classification systems are very limited in topical scope - say, to the breeds and appearances of cats and dogs. It is possible to bootstrap some of this with applications such as SmartLogic, Temis or Gate/Jade, but the reality is that somewhere along the line you will need to spend time constructing that classifying ontology beforehand. (Writing classifiers is a topic for another article.)
The benefit, on the other hand, is that once you create such an ontology, you can use inferences and queries to add to it, both by identifying entities by type more readily and by refining the rulesets that determine mappings. It is, in effect, another form of machine learning, and it is becoming more sophisticated as new triple store environments come online.
The examples here are certainly not real world, but in most domains the need for semantic constraints based upon relationships is much higher, not lower. In health insurance, for instance, you may have hundreds of millions of documents - plan memberships, benefits documents, medical claims, and so forth - each connected by models that may have hundreds or even thousands of types of objects. Allowing a person to search her benefits requires establishing a chain of contexts that can be quite deep (and complex), and making it easy both to search that context and to make decisions based upon a combination of context and text content is a holy grail of the insurance industry.
So, semantics is not search, but semantics provides a layer of context that can make search feasible as we move into the era of Big Text.
Kurt Cagle is the founder of Semantical, LLC. Curiously enough, he owns a Russian Blue cat.