Schema.org, Wikidata & Co:
Leveraging Open Repositories

This article was the subject of a small white paper published in September 2020. It was slightly updated on 12/07/2024 10:54:15!

What is an invoice? What are its properties? What's an article? a press article? a technical article? a website? a web page? a contact page? a garage? a car? a motorcycle? a plane? a boat? a customer? a supplier? and so ... what's an invoice?

These are questions that computer scientists all over the world ask themselves, or have asked themselves, at one time or another in their careers. Their answers are crystallized in all the world's software - sometimes, unfortunately, incoherently. However, the means exist to erase these inconsistencies.

Schema.org

Schema.org is an attempt at standardization, a coherent approach, a pursuit of convergence so that through the programs we manipulate every day, a common understanding can emerge. So that when we talk about an invoice in Beijing, Brussels, New York or Paris, we have the same representation: a description of what an invoice is, the characteristics of an invoice, the types of information in each property, and so on.

Schema.org is an initiative of Google, Microsoft, Yahoo and Yandex, organized as an open community that aims to establish ontologies, from the Greek ὄντος ("being") and λόγος ("discourse, speech, study"), structured sets of terms and relations covering specific concepts in particular domains.

Returning to my invoice example, it's obvious that this "object" belongs to a wider domain, just like an "offer", a "purchase order", a "credit note", a "payment" and so on. It's this higher domain that an ontology covers; they're like Russian dolls.

An ontology is the modelling of a field of knowledge through words and terms generally used to cover the field in question, terms that are related. More simply, schema.org will talk about vocabularies that help us understand the world around us. What's a product? a service? a company? a hospital? an event? an artist, an album, a concert, a ticket, etc.? And also ... what is an invoice!

These vocabularies and ontologies cover entities[1] in the same way as Wikidata, which assigns a unique identifier to each one. For example, you'll find a definition of an invoice at the following URLs: https://schema.org/Invoice for schema.org (A statement of the money due for goods or services; a bill) and https://wikidata.org/wiki/Q190581 for Wikidata (commercial document issued by a seller to a buyer, relating to a sale transaction and indicating the products, quantities, and agreed prices for products or services the seller has provided the buyer). In these two definitions alone, you'll find a set of words that make sense together - sale, transaction, products, quantities, prices, ... - and, when speaking of invoices on a site you're building, you can draw on such words to improve your search engine ranking, especially since schema.org is primarily a search engine initiative. Making use of all these properties and entities also enables you to give your OCR procedures the intelligence they need to understand the nature of a document, classify it automatically and route it within your organization.
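
To make this concrete, here is a minimal sketch (in Python) of how such vocabularies can drive automatic document classification after OCR. Only the schema.org type URLs are real; the keyword lists are invented for illustration and would in practice be derived from the schema.org/Wikidata definitions themselves.

```python
# Sketch: classify an OCR'd document by matching vocabulary drawn from
# schema.org / Wikidata definitions. The type URLs are real schema.org
# types; the keyword lists are illustrative assumptions, not official data.
VOCABULARY = {
    "https://schema.org/Invoice": {"invoice", "bill", "amount due", "payment"},
    "https://schema.org/Order":   {"purchase order", "order number", "quantity"},
    "https://schema.org/Offer":   {"offer", "quotation", "valid until"},
}

def classify(text):
    """Return the schema.org type whose keywords best match the text."""
    lowered = text.lower()
    scores = {
        entity: sum(1 for kw in keywords if kw in lowered)
        for entity, keywords in VOCABULARY.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("INVOICE #42 - amount due: 100 EUR - payment within 30 days"))
```

A real pipeline would weight rarer terms more heavily, but the principle - shared vocabularies turning raw text into a typed entity - is the same.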

In computer science, ontologies are used in particular in the Semantic Web and in Artificial Intelligence to grasp and understand a particular domain. Each domain is described with reference to the types of objects that make it up (classes) and their properties (or attributes). These objects are linked to each other (relationships) and respond to events that induce changes in their properties or in their relationship to each other. They are essential for anyone wishing to embark on a serious Digital Transformation.

Ontologies are a point where multiple techniques converge, such as the Web Ontology Language (OWL), a knowledge representation language based on RDF (Resource Description Framework), a graph model designed to describe Web resources and enable their automatic processing. This is a broad field of computer science, and one that is essential to our understanding of natural languages.
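
The RDF model mentioned above boils down to subject-predicate-object triples. Here is a toy illustration in Python; the graph itself is invented, and the URIs simply reuse identifiers quoted elsewhere in this article.

```python
# Minimal illustration of the RDF idea: every statement is a
# (subject, predicate, object) triple. A real application would use a
# dedicated RDF library; plain tuples are enough to show the model.
triples = [
    ("https://wikidata.org/wiki/Q190581", "rdf:type", "https://schema.org/Invoice"),
    ("https://wikidata.org/wiki/Q190581", "schema:customer", "_:jane"),
    ("_:jane", "rdf:type", "https://schema.org/Person"),
    ("_:jane", "schema:name", "Jane Doe"),
]

def objects(graph, subject, predicate):
    """All objects for a given subject/predicate pair (a simple graph query)."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects(triples, "_:jane", "schema:name"))
```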

Semantic Web

I'll just briefly mention the Semantic Web, as it's not the focus of my article. Nevertheless, it's not a minor topic, because a well-constructed website that makes sense to humans and robots alike is a guarantee of universal exposure.

In addition to serving as a directory of entities perfectly described in terms of their properties and inheritance (in the sense of object-oriented programming), Schema.org provides a number of examples that help to better qualify what's hidden in the depths of HTML code.

See, for example, how to present an invoice in HTML with reference to the Schema.org vocabulary (and therefore ... shared understanding of the information that is entered):

<div itemscope itemtype="https://schema.org/Invoice">
  <h1 itemprop="description">New furnace and installation</h1>
  <div itemprop="broker" itemscope itemtype="https://schema.org/LocalBusiness">
    <b itemprop="name">ACME Home Heating</b>
  </div>
  <div itemprop="customer" itemscope itemtype="https://schema.org/Person">
    <b itemprop="name">Jane Doe</b>
  </div>
  <time itemprop="paymentDueDate">2015-01-30</time>
  <div itemprop="minimumPaymentDue" itemscope itemtype="https://schema.org/PriceSpecification">
    <span itemprop="price">0.00</span>
    <span itemprop="priceCurrency">USD</span>
  </div>
  <div itemprop="totalPaymentDue" itemscope itemtype="https://schema.org/PriceSpecification">
    <span itemprop="price">0.00</span>
    <span itemprop="priceCurrency">USD</span>
  </div>
  <link itemprop="paymentStatus" href="https://schema.org/PaymentComplete" />
  <div itemprop="referencesOrder" itemscope itemtype="https://schema.org/Order">
    <span itemprop="description">furnace</span>
    <time itemprop="orderDate">2014-12-01</time>
    <span itemprop="orderNumber">123ABC</span>
    <div itemprop="orderedItem" itemscope itemtype="https://schema.org/Product">
      <span itemprop="name">ACME Furnace 3000</span>
      <meta itemprop="productID" content="ABC123" />
    </div>
  </div>
  <div itemprop="referencesOrder" itemscope itemtype="https://schema.org/Order">
    <span itemprop="description">furnace installation</span>
    <time itemprop="orderDate">2014-12-02</time>
    <div itemprop="orderedItem" itemscope itemtype="https://schema.org/Service">
      <span itemprop="description">furnace installation</span>
    </div>
  </div>
</div>
        

This information is completely disambiguated. A robot can differentiate between each zone. For example, all date fields are perfectly established: we know which one is the payment due date, which one the order date, and so on.

I'll leave you to visit the https://schema.org/Product page, which provides examples of semantic expressions: Microdata, RDFa, JSON-LD. This way of presenting your product catalog ensures that relevant information is picked up by search engines. That's no mean feat!
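
As a taste of the JSON-LD flavour, here is a sketch that emits a schema.org/Product description; the product data is invented, and the resulting string is what you would place in a `<script type="application/ld+json">` tag.

```python
import json

# Sketch: emit a JSON-LD block for a product page, following the
# schema.org/Product vocabulary. The product data here is invented.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "ACME Furnace 3000",
    "productID": "ABC123",
    "offers": {
        "@type": "Offer",
        "price": "1999.00",
        "priceCurrency": "USD",
    },
}

# The resulting string goes inside <script type="application/ld+json">...</script>
print(json.dumps(product, indent=2))
```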

I can't resist the temptation to give you one last example: an event, again taken from Schema.org (https://schema.org/Event) ... and what company or organization of a certain size doesn't organize events? Knowing how to present this information in a structured way ensures that it is correctly indexed by search engines and ... on X (formerly Twitter), Facebook, etc.!

<div itemscope itemtype="https://schema.org/TouristAttraction">
  <h1><span itemprop="name">Musée Marmottan Monet</span></h1>
  <div>
    <span itemprop="description">It's a museum of Impressionism and French nineteenth-century art.</span>
  </div>
  <div itemprop="event" itemscope itemtype="https://schema.org/Event">It is hosting the
    <span itemprop="about">Hodler</span>,
    <span itemprop="about">Monet</span> and
    <span itemprop="about">Munch</span> exhibit:
    <span itemprop="name">"Peindre l'impossible"</span>.
    <meta itemprop="startDate" content="2016-09-15" />Start date: September 15 2016
    <meta itemprop="endDate" content="2017-01-22" />End date: January 22 2017
  </div>
</div>
        

I'm sure you've understood the advantages of using the entities defined in Schema.org.

And the others ...

Freebase

There are other initiatives besides Schema.org that model the world with an even wider spectrum. Freebase, for example, covers almost 39 million topics, including people (e.g. Bob Dylan - /m/01vrncs), places (e.g. Brussels - /m/0177z) and things (e.g. a truck - /m/07r04). I won't dwell on Freebase, though, so as to spend more time, further on, on Wikidata!

Geni

Let's move on to Bob Dylan, if you don't mind. For this singer and spokesman of his generation, you can turn to the geni.com directory, where Bob Dylan is known under the ID 6000000017944190389. Another open directory from which to extract information.

The case of Bob Dylan is a perfect example of what I'm trying to show you, and illustrates my general point, which is to show just how much computer scientists can benefit from these ontologies, vocabularies and other general directories that model the world.

On TRQL Radio I've programmed an automatic announcement/back-announcement system. A program scans the day's playlist (which is itself built entirely automatically), examines the programmed tracks and "announces" the next track or "back-announces" one or more previous tracks. Let's imagine that one of these songs is by ... Bob Dylan. The program can then check whether there is anything to say about the American singer in an open-access directory such as Geni, Freebase or ... Wikidata: it can choose between several sources of information. Let's say it chooses the Freebase identifier: it ends up on the https://freebase.toolforge.org/m/01vrncs page, which says ... Bob Dylan is an American singer-songwriter, artist, and writer. He has been influential in popular music and culture for more than five decades. [...]. The program passes this information to a text-to-speech engine (Amazon Polly, for example, or ElevenLabs, ...) and you get an intelligent, AUTOMATIC announcement to play on air (Polly gives you back an .mp3 which, after all, is no different from a traditional piece of music: you can insert the announcement/back-announcement into the playlist as if it were a song). A perfect example of automation.
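
The announcement step can be sketched as follows; the directory lookup is stubbed with a hard-coded summary (abridged from the Freebase text quoted above), and the function names are mine for illustration, not TRQL Radio's actual code.

```python
# Sketch of the announcement step: given a short artist summary fetched
# from an open directory (Freebase/Wikidata), build the text that would
# be handed to a text-to-speech engine. The fetch itself is stubbed out.
SUMMARIES = {
    "Bob Dylan": ("Bob Dylan is an American singer-songwriter, artist, "
                  "and writer."),
}

def build_announcement(artist, title):
    """Text to send to a TTS engine (Amazon Polly, ElevenLabs, ...)."""
    blurb = SUMMARIES.get(artist)
    intro = f"Coming up next: {title}, by {artist}."
    return f"{intro} {blurb}" if blurb else intro

print(build_announcement("Bob Dylan", "Like a Rolling Stone"))
```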


Wikidata. Ah, Wikidata!

I'm leaving the other initiatives similar to schema.org to concentrate a little on Wikidata, a directory with which I've been able to gain very practical experience, both from a Why and a How point of view.

Wikidata is a free, collaborative, multilingual database. This database contains structured data which is used to feed Wikipedia, and a whole range of other projects in the Wikimedia movement.

Wikidata covers 111,000,753 concepts at the time of writing!

Wikidata data is published under a Creative Commons Public Domain Transfer (CC0 1.0) license, which means you can copy, modify, share and enhance the data, even for commercial use, without having to ask permission.

I've published a generic service that queries this huge directory of entities. Bob Dylan is listed under ID Q392. Here's the URL: https://wikidata.org/wiki/Q392. It says: American recording artist, singer-songwriter, musician, author, artist and and Nobel Laureate in 2016. Despite the slight error in the double "and", this is a possible variation of automatic announcements/back-announcements [2].

Take a look at the incredible amount of information you can glean about Bob Dylan. You'll see all the artist's aliases, you'll see in which other directories he appears, and you'll see that the Bob Dylan entity, known as Q392, is an instance of the human entity (property P31 means instance of, which you can check with https://www.wikidata.org/w/api.php?action=wbgetentities&ids=P31&format=xml), known as Q5 (https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5&format=xml), which is part of the humanity entity Q1156970 (https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1156970&format=xml) ... and that you can follow documentaries about humanity on TED thanks to https://www.ted.com/topics/humanity.

All this, and a host of other things, is thus available to programs, and is a candidate for the automation into which Digital Transformations are pushing us. There's nothing wrong with using it; in fact, there's an undeniable advantage in sharing this knowledge, structuring it, disseminating it and standardizing it within the applications to be built, whatever your field of activity. It's all about using what already exists and is freely shared, and avoiding reinventing the wheel with every IT system devised in response to a need.
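
Programmatically, the chain Q392 → P31 → Q5 described above goes through the wbgetentities end-point. Here is a minimal sketch that builds those calls; no request is actually sent, the point is simply the URL construction.

```python
from urllib.parse import urlencode

# Sketch: build the wbgetentities calls used throughout this section.
# The endpoint and parameters are the real Wikidata API ones.
API = "https://www.wikidata.org/w/api.php"

def entity_url(entity_id, fmt="json"):
    """URL returning the full description of an entity or property."""
    query = urlencode({"action": "wbgetentities", "ids": entity_id, "format": fmt})
    return f"{API}?{query}"

# Follow the chain described above: Q392 (Bob Dylan) -> P31 -> Q5 (human)
for wd_id in ("Q392", "P31", "Q5"):
    print(entity_url(wd_id))
```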

Wikidata also lists the IDs under which the entities it knows appear in other directories: Freebase, DBPedia, KBPedia, Geonames, ISNI (International Standard Name Identifier for an identity), BNF, NORAF, ...

In addition to the directories I've just mentioned, there are a number of public directories such as CBE-open-data, Plateforme ouverte des données publiques françaises, and many others. Take a look at what you can get out of them. See how these directories enable dematerialization, digitization and ... automation.

Hard work and ... study

All this opens the door to quite extraordinary possibilities, which are, in fact, the very basis of LLMs (Large Language Models). On the other hand, it's a real painstaking task to delve into these gigantic directories, to understand them, to grasp the dependencies that are introduced into them, to unravel the meaning of each ID, each property and so on. A real study, but a very profitable one nonetheless. As far as Wikidata is concerned, the following page may help you: Browse and view all properties on Wikidata.

It's not uncommon for programs to have to go back and forth between numerous end-points to build up a meaningful view[3]. As these operations are time-consuming, it immediately occurs to us to cache the information obtained and create ready-to-use semantic fields. This is what we do at TRQL Radio when we gather our knowledge of artists, and also for our news aggregator, Infusio.

We have a "database" of around 225,000 artists, which we hope to grow to 400,000. This knowledge has been extracted from multiple directories[4], ingested, classified, structured, compared, annotated, appreciated, digested... to form our own corpus, a corpus that is regularly reviewed (knowledge is constantly evolving: things that were true yesterday are no longer true today) - an important aspect that I won't discuss here for fear of being too technical and tiresome. Suffice it to say that accumulated knowledge in a given ontological field is like science: we know what we know until we know we're wrong. This implies a constant revision of our accumulated knowledge.

In concrete terms...

Let's imagine that TRQL Radio is searching for Jim Croce's birth and death dates. A search on Wikidata reveals that Jim Croce's ID (spelled Jim_Croce in the search) is Q464277.

https://www.trql.fm/vaesoli!/?wikidata-search=Jim%20Croce&xml: search for Jim Croce on Wikidata.

We therefore go (programmatically as shown above) to the page describing the Jim Croce entity and search for the properties P19 (place of birth), P3150 (birthday), P569 (date of birth), P1477 (birth name), P570 (date of death), P1196 (manner of death), P20 (place of death), P119 (place of burial), and P509 (cause of death).

For TRQL Radio, this information forms the skeleton of what our knowledge engine records. I'll concentrate on birth and death dates for ease of demonstration, i.e. properties P569 and P570:

<property id="P569">
    <datavalue type="time">
        <value time="+1943-01-10T00:00:00Z" timezone="0"/>
    </datavalue>
</property>

<property id="P570">
    <datavalue type="time">
        <value time="+1973-09-20T00:00:00Z" timezone="0"/>
    </datavalue>
</property>

A simple date calculation gives Jim Croce's age at the time of his death, and this information can be used to generate automatic variable text. A real asset for a radio station!
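
As a sketch, here is how a program might extract those two properties and compute the age. The embedded XML mirrors the Jim Croce fragments quoted above; the enclosing <properties> root and the exact parsing paths are assumptions made so the example is self-contained.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Sketch: extract P569 (date of birth) and P570 (date of death) from the
# XML structure shown above and compute the artist's age at death.
XML = """<properties>
  <property id="P569"><datavalue type="time">
    <value time="+1943-01-10T00:00:00Z" timezone="0"/></datavalue></property>
  <property id="P570"><datavalue type="time">
    <value time="+1973-09-20T00:00:00Z" timezone="0"/></datavalue></property>
</properties>"""

def wikidata_date(root, prop):
    """Parse the +YYYY-MM-DDTHH:MM:SSZ value of a time property."""
    raw = root.find(f".//property[@id='{prop}']/datavalue/value").get("time")
    return datetime.strptime(raw.lstrip("+"), "%Y-%m-%dT%H:%M:%SZ").date()

root = ET.fromstring(XML)
birth, death = wikidata_date(root, "P569"), wikidata_date(root, "P570")
# Subtract one year if the death date falls before the birthday that year
age = death.year - birth.year - ((death.month, death.day) < (birth.month, birth.day))
print(age)  # Jim Croce was 30 when he died
```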

In a similar way, it's totally possible to find out when a band was formed (a rather special birth date). Take the Doobie Brothers, for example: this is Wikidata entity Q506670. Their formation date is given by property P571 and you get the following XML for this property:

<property id="P571">
    …
    <datavalue type="time">
        <value time="+1970-01-01T00:00:00Z"/>
    </datavalue>
</property>

The Doobies were formed in 1970, as shown by the results of various search engines:

Proven by search engines

As with the date calculations in the Jim Croce example, you are then able to create a clever announcement like "The %artist.name%, formed in %artist.inception%, on %radio%! Only music; no blahblah!". This phrase, whose variable parts are substituted (The Doobie Brothers, formed in 1970, on TRQL Radio! Only music; no blahblah!), can have multiple variants, generated by programming. In the case of TRQL Radio, we have around 4,000 variants that we select at random. This is part of TRQL Radio's Digital Transformation.
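
The substitution mechanism itself is simple; here is a sketch (the function and placeholder names follow the example above, not TRQL Radio's actual implementation):

```python
import re

# Sketch of the variable-text mechanism: %...% placeholders substituted
# from an artist record. Placeholder names follow the example above.
def render(template, values):
    """Replace every %name% placeholder with its value."""
    return re.sub(r"%([\w.]+)%", lambda m: str(values[m.group(1)]), template)

announcement = render(
    "The %artist.name%, formed in %artist.inception%, on %radio%! "
    "Only music; no blahblah!",
    {"artist.name": "Doobie Brothers", "artist.inception": 1970,
     "radio": "TRQL Radio"},
)
print(announcement)
```

Selecting one of several thousand such templates at random is then just a matter of `random.choice` over the template list.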

What for?

You may rightly ask how this kind of digging into the mysteries of Schema.org, Wikidata and other directories is likely to give TRQL Radio any kind of competitive edge.

In our case, the competitive advantage for a micro-radio with (very) few people behind the consoles lies in the ability to detect and use this information without having to devote the slightest amount of time to it: programs are constantly searching for information and using it to, for example, create intelligent announcements/back-announcements (see above), create automatic tributes, flesh out the depth of the presentation, generate news and ephemera (see this new service that allows us to retrieve the day's events - today in history), etc.

Everyone needs to take an open-minded look at what these directories can bring to their own core business. As a friend rightly said, it's better to love your problems than your solutions. The question remains, "Why?" and Simon Sinek is certainly not one to contradict me. Asking and answering that question is up to you!

Your first steps with schema.org

To help you take your first steps with schema.org, here are 2 XML files listing the classes covered and their properties (extracted March 2020). Personally, I've used these 2 files to AUTOMATICALLY generate over 850 PHP classes, which are now part of my development arsenal.

  1. schemaorg-classes.trql.xml
  2. schemaorg-properties.trql.xml
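
As an illustration of such automatic generation, here is a sketch. The internal layout of these XML files is not documented here, so the `<class name="..." parent="..."/>` structure below is a pure assumption to be adapted to the real files, and Python stubs stand in for the author's PHP classes.

```python
import xml.etree.ElementTree as ET

# Sketch: generate class stubs from a class-list XML. The layout assumed
# here (<class name="..." parent="..."/>) is hypothetical; adapt the
# element and attribute names to the real schemaorg-classes.trql.xml.
SAMPLE = """<classes>
  <class name="Invoice" parent="Intangible"/>
  <class name="Order" parent="Intangible"/>
</classes>"""

def generate_stubs(xml_text):
    """Yield one class definition per <class> element."""
    root = ET.fromstring(xml_text)
    for cls in root.iter("class"):
        yield f"class {cls.get('name')}({cls.get('parent')}):\n    pass\n"

for stub in generate_stubs(SAMPLE):
    print(stub)
```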

Practical: services for immediate use

To help you get the most out of these open, public directories, I'm providing you with 3 Wikipedia/Wikidata access services: to find out what Wikipedia has on a term (or set of terms), to search for entities corresponding to a term or terms (Elvis Presley, for example), and to obtain the details of an entity/property.

wikipedia-search

This service lets you find out everything Wikipedia knows about a term (or set of terms) - terms must be given in English:

Examples

wikidata-search

This service returns entities that match the search term(s). Not to be confused with wikipedia-search. Only English is supported at the moment.

wikidata-entity

This service lets you find out the definition of an entity or property:

I'll update the list of services regularly until I'm able to generate the service documentation automatically (there's an extended version of the list of services that I provide, and I will keep it up to date).

Our news aggregator: Infusio

The Infusio product, a news aggregator, parses the text of an article and searches it for key texts (almost 100,000) to create a list of concepts (Wikidata entityID). Parsing has to be very, very fast; that's why I use word trees. In the end, this gives me a list of coarse subjects, a list of fine subjects and a list of entities documented in Wikidata. Thanks to this technique, I can extract news items and “hashtag” them automatically... I'll cover this topic in more detail in another article.
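
The "word tree" idea can be sketched as a trie over multi-word key texts, so that a single pass over an article finds every known phrase. The phrase-to-entity mappings below are illustrative (Q392 really is Bob Dylan on Wikidata); Infusio's actual implementation is not shown here.

```python
# Sketch of a trie ("word tree") mapping multi-word key texts to
# Wikidata entity IDs, matched in one pass over the article's words.
def build_trie(phrases):
    root = {}
    for phrase, entity in phrases.items():
        node = root
        for word in phrase.lower().split():
            node = node.setdefault(word, {})
        node["__entity__"] = entity  # marks the end of a known phrase
    return root

def find_entities(trie, text):
    """Return the entity IDs of every known phrase found in the text."""
    words = text.lower().split()
    found = []
    for i in range(len(words)):
        node = trie
        for word in words[i:]:
            if word not in node:
                break
            node = node[word]
            if "__entity__" in node:
                found.append(node["__entity__"])
    return found

trie = build_trie({"bob dylan": "Q392", "nobel prize": "Q7191"})
print(find_entities(trie, "Bob Dylan won the Nobel Prize in 2016"))
```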
