Dimensions of Thesaurus Lifecycle
For two decades, thesauri have been one of the most intelligent and orderly ways to store and maintain (enterprise, scientific, private) knowledge on computer systems, in a language-independent (*) and very efficient way.
A thesaurus is like a "knowledge" map of your explicit knowledge. It has a life of its "own", bound to how humans use concepts over time. Therefore, unless its only purpose is a fixed historical one, e.g. reflecting history, a (useful) thesaurus has its own lifecycle (birth = creation, development, change = maintenance, death = archiving?), which induces taxonomists to keep it up to date with the current use of concepts. Will a thesaurus ever die? So far it does not seem to me that a thesaurus would ever really "die". Compare, though, the field of records management or document management: records and documents are first introduced into the system, then maintained, and they "die" at their "end of usefulness" by being archived.
Keeping a thesaurus up to date preserves it from "dying". The decision whether a thesaurus should "die" or not is obviously bound to the question of whether it is still useful (a thesaurus being mostly used in search). What can you change in it to keep it "alive"? You can restructure its concepts (one could also say concept refactoring) and/or update its concept labels.
Preventing thesaurus "death" through merging - a rescue for explicit knowledge?
One possible destiny of a useful thesaurus is to be merged with another one, as long as a person or a commission thinks it is worth it. This makes me think that the essence of a thesaurus (the essence of [explicit] knowledge) seldom actually dies; more often it transforms.
Thesaurus "merging" is more commonly known as "thesaurus aligning". Aligning means here that the concepts of one thesaurus are aligned with the concepts of another one, i.e. recognised and marked by an actor as being similar or the same concept. Please wait a paragraph or two for the question of who or what should be the actor here.
When are two concepts in a thesaurus the same? Since a thesaurus lacks precise mathematical constraints (which are present in taxonomies), one could argue: two concepts are the same if they have the same label. (I will not require here that the labels be in the same language or that the SKOS model be taken as given; requiring either would not change the argument.) What if two concepts have the same label but a different upper tree? By "upper tree" I mean the inverse abstraction path descending from the thesaurus root down to that "concrete" concept. If the labels are the same but the upper trees differ (and this could really be the normal situation), the question arises again: are two concepts with the same label but different upper trees the same concept?
Where does an upper tree (the location of a concept) come from? It usually comes from the order of abstraction the thesaurus creator had in mind, which puts a concept at its position beside the other concepts in the same thesaurus. If we take the upper tree to be a kind of "context" of one specific concept in one thesaurus, the contexts in two or more thesauri could be (and I am convinced they will be) quite different every time. Contexts reflect the abstractions needed while developing a thesaurus as a closed world, so there is "suggestive evidence" that such contexts will always differ. This underlines that the apparently simple task of thesaurus alignment is, in the "real" case, not so straightforward at all.
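To make the "same label, different upper tree" situation concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Concept` class, the single-parent tree model (real thesauri may allow several broader concepts) and the use of a Jaccard overlap as a rough measure of how much two contexts share. It is not an alignment method, only a way to look at the problem.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Concept:
    """Hypothetical concept model: a label plus at most one broader concept."""
    label: str
    broader: Optional[str] = None  # label of the parent concept, None at the root


def upper_tree(thesaurus: Dict[str, Concept], label: str) -> List[str]:
    """Return the labels on the path from the root down to the parent of `label`.

    Assumes a cycle-free thesaurus where every concept has at most one parent.
    """
    path: List[str] = []
    current = thesaurus[label].broader
    while current is not None:
        path.append(current)
        current = thesaurus[current].broader
    return list(reversed(path))


def context_overlap(path_a: List[str], path_b: List[str]) -> float:
    """Jaccard overlap of two upper trees, compared as sets of lower-cased labels."""
    a = {p.lower() for p in path_a}
    b = {p.lower() for p in path_b}
    if not a and not b:
        return 1.0  # two root concepts share the same (empty) context
    return len(a & b) / len(a | b)


def build(pairs) -> Dict[str, Concept]:
    """Build a thesaurus from (label, broader label) pairs."""
    return {label: Concept(label, broader) for label, broader in pairs}


# Two invented thesauri that both contain a concept labelled "Mercury".
thesaurus_a = build([("Science", None), ("Astronomy", "Science"), ("Mercury", "Astronomy")])
thesaurus_b = build([("Science", None), ("Chemistry", "Science"), ("Mercury", "Chemistry")])

path_a = upper_tree(thesaurus_a, "Mercury")
path_b = upper_tree(thesaurus_b, "Mercury")
overlap = context_overlap(path_a, path_b)

print(path_a, path_b, round(overlap, 2))
```

The two "Mercury" concepts share a label and even a root, yet only one of the three distinct context labels is common to both paths. Whether that makes them the same concept is exactly the decision that a number like this cannot take for the taxonomist.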
The actor - who should do the alignment here? There are (a few) systems which list thesaurus concepts to a human user and suggest to her how to align them. Such systems work merely on the basis of labels. Should the alignment be done by a machine? We love machine learning so much nowadays, and maybe it could help here. There is no evidence for a clear "no" and no evidence for a "yes". Since thesauri were made by humans for humans, it appears to me that, apart from the merely syntactical (resilient) comparison of labels, there is no way to re-engineer the inverse abstraction tree of a concept automatically. Thesaurus alignment appears to remain in the hands of (human) taxonomists, who must understand how and why the contexts were conceived.
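The label-based suggestion step mentioned above can be sketched in a few lines of Python. This is an assumption about how such a tool might work, not a description of any existing system: the function names, the normalisation rules and the `cutoff` value are all hypothetical, and the fuzzy comparison comes from the standard library module `difflib`. The output is only a list of candidates for a human taxonomist to accept or reject.

```python
import difflib
import unicodedata
from typing import Dict, Iterable, List


def normalise(label: str) -> str:
    """Lower-case, strip accents and collapse whitespace so trivial variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def suggest_alignments(
    labels_a: Iterable[str], labels_b: Iterable[str], cutoff: float = 0.85
) -> Dict[str, List[str]]:
    """For each label of thesaurus A, list labels of thesaurus B that look similar.

    `cutoff` is an arbitrary similarity threshold chosen for this example.
    """
    # Index thesaurus B by normalised label, keeping the original spellings.
    index_b: Dict[str, List[str]] = {}
    for label in labels_b:
        index_b.setdefault(normalise(label), []).append(label)

    suggestions: Dict[str, List[str]] = {}
    for label in labels_a:
        close = difflib.get_close_matches(normalise(label), list(index_b), n=3, cutoff=cutoff)
        candidates = [original for key in close for original in index_b[key]]
        if candidates:
            suggestions[label] = candidates
    return suggestions


# Invented labels from two thesauri.
suggestions = suggest_alignments(
    ["Café culture", "Mercury", "Data base"],
    ["Cafe Culture", "Mercury", "Database", "Geology"],
)
for label_a, candidates in suggestions.items():
    print(f"{label_a!r} may correspond to {candidates}")
```

Note that "Mercury" is suggested for "Mercury" regardless of whether one sits under astronomy and the other under chemistry. The comparison is purely syntactical, which is why the final decision stays with the taxonomist.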
Preserving thesaurus "life" by renewing concept labels - adapting names!
In some easier situations there is no need to refactor the upper trees of thesaurus concepts, only to change or rotate their labels (e.g. giving a concept a new preferred label in some languages, rotating the old preferred label to a hidden label, etc.). This task is accomplished by reading a quantity of current, domain-focused journal publications and extracting terms (candidate new labels) which could suggest a label update for some concepts. This is a corpus management task: in one predefined and fixed language, terms are extracted, ranked, stored and then considered by an actor as the basis for a change. Again: who or what is the actor here? Because this updating task is simpler, there luckily is a way to relieve (human) actors by means of machines. Here machines can perform the heavier task of "ingesting" quantities of text from current sources, building a changing text corpus, and extracting / ranking / storing terms to be presented as suggestions to the (human) taxonomists, who thanks to this machine work have more time to do their valuable job.
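As a rough illustration of this division of labour, the Python sketch below extracts and ranks candidate terms from a tiny corpus and then rotates a preferred label once a human has decided. All names (`ConceptLabels`, `candidate_terms`, `rotate_pref_label`), the stop word list and the choice of plain two-word frequency counts are assumptions made for this example; a real term extraction pipeline would be considerably more elaborate.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple

# A deliberately tiny stop word list, sufficient for the example only.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "are", "with"}


@dataclass
class ConceptLabels:
    """Hypothetical container for the labels of one concept in one language."""
    pref_label: str
    alt_labels: Set[str] = field(default_factory=set)
    hidden_labels: Set[str] = field(default_factory=set)

    def rotate_pref_label(self, new_label: str) -> None:
        """Make `new_label` preferred and keep the old preferred label as a hidden label."""
        if new_label == self.pref_label:
            return
        self.hidden_labels.add(self.pref_label)
        self.alt_labels.discard(new_label)
        self.hidden_labels.discard(new_label)
        self.pref_label = new_label


def candidate_terms(
    documents: Iterable[str], known_labels: Iterable[str], top_n: int = 5
) -> List[Tuple[str, int]]:
    """Rank two-word terms by frequency, skipping terms that are already labels."""
    known = {label.lower() for label in known_labels}
    counts: Counter = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        for first, second in zip(tokens, tokens[1:]):
            if first in STOPWORDS or second in STOPWORDS:
                continue
            counts[f"{first} {second}"] += 1
    ranked = [(term, n) for term, n in counts.most_common() if term not in known]
    return ranked[:top_n]


# Machine part: ingest an (invented) corpus and rank candidate terms.
corpus = [
    "Remote work is now common in many companies.",
    "Surveys on remote work show rising adoption.",
    "Telecommuting, today called remote work, changes office planning.",
]
ranked_terms = candidate_terms(corpus, known_labels=["Telecommuting"])
print(ranked_terms[0])

# Human part: the taxonomist reviews the list and decides to rotate the label.
concept = ConceptLabels(pref_label="Telecommuting")
concept.rotate_pref_label("Remote work")
print(concept.pref_label, concept.hidden_labels)
```

The old preferred label is kept as a hidden label, so searches using the former term can still reach the concept.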
What is the "cost" of all that?
The first task (thesaurus aligning) occurs for a given domain at the beginning of some projects, and afterwards only now and then, when new thesauri arise which have not already been merged. In my opinion, very seldom. Here you need solid knowledge of how the trees in the thesaurus should be refactored. Tools should allow you to change concept connections in a stable but reversible way. Taxonomists who are more fluent in RDF could use SPARQL directly; the others should use some RDF middleware services to ease the refactoring.
The second task (re-labeling) requires corpus ingestion and term extraction machinery which, at predefined intervals, presents you with lists of "new" term suggestions to be considered.
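One way to picture a "stable but reversible" change is to describe each refactoring step as a small object that can produce both its SPARQL update and its own inverse. The Python sketch below does only that; it is an assumption about how a tool could be organised, not a reference to any existing middleware. The class `MoveConcept` and the example.org IRIs are hypothetical, only `skos:broader` triples are touched (any stored `skos:narrower` triples would need the same treatment), and the IRIs are not validated.

```python
from dataclasses import dataclass

SKOS_PREFIX = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>"


@dataclass(frozen=True)
class MoveConcept:
    """Hypothetical refactoring step: move a concept from one broader concept to another."""
    concept: str      # IRI of the concept being moved
    old_broader: str  # IRI of its current broader concept
    new_broader: str  # IRI of the broader concept it should get

    def to_sparql(self) -> str:
        """Render the step as a SPARQL 1.1 Update request (two operations)."""
        return (
            f"{SKOS_PREFIX}\n"
            f"DELETE DATA {{ <{self.concept}> skos:broader <{self.old_broader}> }} ;\n"
            f"INSERT DATA {{ <{self.concept}> skos:broader <{self.new_broader}> }}"
        )

    def inverse(self) -> "MoveConcept":
        """Return the step that undoes this one."""
        return MoveConcept(self.concept, self.new_broader, self.old_broader)


# Invented IRIs, for illustration only.
move = MoveConcept(
    concept="http://example.org/thesaurus/mercury",
    old_broader="http://example.org/thesaurus/astronomy",
    new_broader="http://example.org/thesaurus/chemistry",
)

print(move.to_sparql())            # apply the change
print(move.inverse().to_sparql())  # roll it back
```

Keeping the executed steps in a list gives a simple change log: replaying the inverses in reverse order restores the earlier structure.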
Do we have a real choice?
For both tasks you have to choose. Either you "fall in love" with some expensive, marketing-talented company selling the "yes - but what is your question", which lets you dream of a dearly paid "best-in-class" solution. Or you do it yourself, with real help from less marketing-aware but "same-in-class" people who can give you advice that saves time and resources and keeps the solution sustainable at a more reasonable cost. Interestingly, in both cases you will also have to spend YOUR TIME to specify / direct / learn situations and INTERACT with whoever helps you, for more or less money.
Thank you for reading
Again hoping for your numerous and brilliant constructive comments
Yours Fabio Ricci from Semweb
(*) In some cases this does not apply, namely where concept building is shared across languages of a different "nature", e.g. Arabic and English. So this sentence is valid primarily for thesauri which have one predominant language and several (secondary) languages, where labels (and concept structure) only approximate the concept(s) in the main language. That is the reason why in SKOS there is a "default" language.