Using Tarql to Convert Excel Spreadsheets to RDF
Over the years, I've worked on a number of semantics projects. While some of them involved pulling data from relational databases, a pattern kept emerging: a significant proportion of the metadata within an organization - the operational data that controls everything from movie production to publications to business descriptions - ultimately ended up residing in spreadsheets, specifically Excel spreadsheets.
In a way this makes sense. Semantic data is data about other data, and it is usually collected not in databases running applications but in spreadsheets running models. It's often much easier to open up a spreadsheet, create several tables, then package it and send it to someone else for review or additional processing, whereas the data contained within databases tends to be more transactional in nature.
There have been, over the years, a number of different tools for pulling information from spreadsheets; however, they typically require a fair amount of upfront work. Tarql (Tables for SPARQL) is a Java-based command-line tool that converts CSVs of various sorts (comma-separated, tab-separated, pipe-separated, with or without header rows) into SPARQL variables that can then be used either to query the spreadsheet or to generate (via SPARQL CONSTRUCT statements) N-Triples or Turtle files. The latter can then be uploaded into a triple store for more sophisticated processing or querying.
Before digging deeper into how to work with tarql, it is worth identifying what it doesn't do. Tarql is built on the Jena toolkit (ARQ), which means that it has many of the same capabilities and limitations as the Jena/Fuseki2 RDF server, and it can be extended in the same way that ARQ can (see https://jena.apache.org/documentation/query/library-function.html for details about the ARQ extension library). Most of the standard SPARQL 1.1 functions are available as well.
However, there is no way to persist content back into the CSV from tarql; the CSV file is a static store. This isn't a huge limitation - Jena can generate a CSV from a SELECT statement - and it actually makes tarql act more like XSLT: a transformer tool in a pipeline. Additionally, while tarql can do some deduplication of redundant triples, that behavior has to be requested explicitly.
Installing tarql
You can get tarql on github at https://github.com/tarql/tarql (which also includes links to the tarql documentation).
I used Chocolatey as my package manager on Windows, though you can use Homebrew on Linux or macOS to do much the same thing. Because tarql is a Java-based application, you also need to install the Java JDK and Maven to build it. Navigate to where you want to install tarql, then retrieve and build the project:
> choco install jdk8
> choco install maven
> gh repo clone tarql/tarql
> cd tarql
> mvn package appassembler:assemble
Once this is done, add the bin directory (tarql/target/appassembler/bin/) to your environment's PATH variable, then restart the shell terminal to pick up the change. This makes tarql available as a command from anywhere on the command line.
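As a rough sketch of what that PATH change can look like (the install locations below are assumptions - substitute wherever you cloned tarql):

# bash or zsh (append to ~/.bashrc or ~/.zshrc); the install path is an assumption
export PATH="$PATH:$HOME/tarql/target/appassembler/bin"

# Windows PowerShell; again, the install path is an assumption
setx PATH "$env:PATH;C:\tools\tarql\target\appassembler\bin"

# verify that the shell can find it
tarql --help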
Querying with tarql
I'm the community editor for https://datasciencecentral.com, which means, among other things, that I produce a newsletter every week of the top data science stories submitted by contributors from across the data science community. Building the newsletter each week involved copying and pasting hyperlink content into an HTML form, an especially time-consuming operation. We had been tracking the articles published in previous newsletters - going back through the previous two editors, for almost ten years - in an Excel spreadsheet, and I had modified that spreadsheet to build out lists of articles, authors, and related metadata, with nearly 6,000 entries. This seemed like a good candidate for tarql.
The CSV file (WDSummary.csv) was created by saving an Excel document in CSV format. The exact export steps vary by version, so I would recommend checking the documentation for your Excel instance. The first line of the CSV contained the headers that would eventually be used:
Hyperlink,Title,URL,Author,Page views,Creation date,Content Type,Seed,Hash,Social Post,HTML,Tags,HasImage
In its default mode, tarql will look for this first line if it's available, then create SPARQL variables with the same names. For instance, the contents of Title will be available in ?Title, URL will be available in ?URL, and so forth. Where a label contains spaces, tarql replaces each space with an underscore: Page views becomes the variable ?Page_views. Note that case is significant here.
If no header row is present (or you simply want to use the Excel column letters as variables), include the command-line argument --no-header-row in the tarql call:
tarql --no-header-row ...
In that case, the content in column A is contained in the SPARQL variable ?a, column B is in ?b, column AB is in ?ab, and so forth. With the above CSV file, this would mean that the contents of the Title column would be in the variable ?b.
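As a minimal sketch (listTitles.sparql is a hypothetical file name), pulling the Title and Author columns positionally might look like this:

# invoked as: tarql --no-header-row listTitles.sparql WDSummary.csv
# column B (Title) lands in ?b, column D (Author) in ?d
SELECT ?b ?d
WHERE {
   filter(bound(?b))
}
LIMIT 10

Keep in mind that in this mode the header line itself comes through as an ordinary data row, so you may want to filter it out.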
These variables are then used in a SPARQL script, and don't need to be explicitly declared. For instance, the following SPARQL script (contained in queryAuthorTitle.sparql) would retrieve a list of authors, titles, and dates:
SELECT ?ROWNUM ?Author ?Article_Title ?Creation_date
WHERE {
   bind (substr(?Title,0,36) as ?Article_Title)
   filter(?Content_Type = "Article")
   filter(contains(?Creation_date,'2021'))
}
LIMIT 20
This is then invoked on the command line as:
tarql queryAuthorTitle.sparql WDSummary.csv
which then generates the following output:
----------------------------------------------------------------------------------------
| ROWNUM | Author              | Article_Title                         | Creation_date |
==========================================================================================
| 5137   | "Vincent Granville" | "Spectacular Visualization: The Eye " | "1/4/2021"    |
| 5138   | "AISara"            | "Keras Hyperparameter Tuning Has Nev" | "1/4/2021"    |
| 5139   | "Chirag Shivalker"  | "4 Key Principles of Data Collection" | "1/4/2021"    |
| 5140   | "Aileen Scott"      | "HR Technology: What the CHROs Must " | "1/4/2021"    |
| 5142   | "Bill Schmarzo"     | "Ethical AI, Monetizing False Negati" | "1/4/2021"    |
| 5146   | "Kedar Supekar"     | "Global Web Hosting Service Industry" | "1/4/2021"    |
| 5149   | "Stephanie Glen"    | "Fraudulent Covid-19 Data and Benfor" | "1/4/2021"    |
| 5150   | "Ajit Jaokar"       | "Interesting AI papers published in " | "1/4/2021"    |
| 5151   | "Madiha Jamal"      | "Artificial Intelligence in the Cont" | "1/4/2021"    |
| 5153   | "Kholoud"           | "Linear Regression Assumptions"       | "1/4/2021"    |
| 5155   | "Zoe Zbar"          | "Best Naming Conventions When Writin" | "1/4/2021"    |
| 5156   | "Ryan Williamson"   | "FinTech: How AI is Improving This I" | "1/11/2021"   |
| 5157   | "Ryan Jason"        | "How the Blend of Artificial Intelli" | "1/11/2021"   |
| 5158   | "jwork.ORG"         | "Encyclosearch.org is launched"       | "1/11/2021"   |
| 5159   | "Lokesh"            | "20 Of The Most Important Machine Le" | "1/11/2021"   |
| 5160   | "Justin Schulberg"  | "Women in Politics | Exploratory Dat" | "1/11/2021"   |
| 5161   | "Laura Jean"        | "Top Advanced Analytics Use Cases"    | "1/11/2021"   |
| 5162   | "Logan berger"      | "How Data Science Ensures that Custo" | "1/11/2021"   |
| 5163   | "Aileen Scott"      | "Triumvirate ? AI, IoT, and Blockcha" | "1/11/2021"   |
| 5165   | "Kedar Supekar"     | "Facial Recognition Biometric Techno" | "1/11/2021"   |
----------------------------------------------------------------------------------------
It's worth dissecting the query. First, note that the variables ?Author, ?Title, ?Creation_date, and so forth are already predefined; they come straight from the CSV headers. The WHERE clause does additional post-processing, such as truncating the titles to 36 characters (useful given that DSC titles can get long), then filters the contents by a particular content type (only Articles) and by year. If the dates had been given in YYYY-MM-DD format, this same list could also have been sorted and grouped by month and day using the SPARQL ORDER BY statement (ORDER BY ASC(?Creation_date)).
Additionally, the ?ROWNUM variable is a magic variable provided by tarql itself that gives the number of the row in the original CSV file that the record comes from. This can be handy for creating specialized keys.
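For instance, a minimal sketch of a BIND that could be dropped into the WHERE clause (the example.com base IRI is just a placeholder) to mint a stable, unique key per row:

# turn the magic ?ROWNUM variable into a per-row IRI key
bind(IRI(concat("http://example.com/row/", str(?ROWNUM))) as ?rowKey)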
Even using tarql just for querying, you can build fairly elaborate tables. Tarql does not have a CSV output format, but a relatively simple awk script should be able to convert the above table back into CSV.
Generating Turtle RDF With tarql
Where tarql really shines, however, is in the generation of RDF. Tarql defaults to generating Turtle, but it can also generate N-Triples format if the --ntriples flag is used. As with querying, generating RDF is done with a SPARQL script, in this case one that contains one or more CONSTRUCT statements (contained in toTurtle.sparql):
PREFIX class:       <https://thecaglereport.com/ns/_Class/>
PREFIX article:     <https://thecaglereport.com/ns/_Article/>
PREFIX author:      <https://thecaglereport.com/ns/_Author/>
PREFIX contentType: <https://thecaglereport.com/ns/_ContentType/>
PREFIX xsd:         <https://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
   ?article a class:_Article;
      article:hasTitle ?title;
      article:hasURL ?url;
      article:hasAuthor ?author;
      article:hasCreationDate ?Creation_date;
      article:hasContentType ?contentType;
      article:hasHash ?Hash;
      article:hasSocialPost ?Social_Post;
      article:hasPageViews ?pageViews;
      article:hasHTML ?HTML;
      .
   ?author a class:_Author;
      author:hasName ?authorStr;
      author:hasURL ?authorURL;
      .
   ?contentType a class:_ContentType;
      contentType:hasLabel ?contentTypeStr;
      .
}
WHERE {
   # create a node for the article
   bind(IRI(concat("https://thecaglereport.com/ns/_Article/_",MD5(?URL))) as ?article)
   bind (STRDT(?Title,xsd:string) as ?title)
   bind (STRDT(?URL,xsd:string) as ?url)

   # create a node for the author
   bind(IRI(concat("https://thecaglereport.com/ns/_Author/_",MD5(?Author))) as ?author)
   bind (strdt(?Author,xsd:string) as ?authorStr)
   bind (strdt(concat("https://www.datasciencecentral.com/profile/",replace(?Author,"\\s+","")),xsd:string) as ?authorURL)
   bind (strdt(?Content_Type,xsd:string) as ?contentTypeStr)

   # create a node for the content type
   bind(IRI(concat("https://thecaglereport.com/ns/_ContentType/_",replace(?Content_Type,"\\s+",""))) as ?contentType)
   bind (strdt(?Page_views,xsd:integer) as ?pageViews)

   # For demonstration purposes show only one row
   filter(?ROWNUM = 5137)
}
This is a more complex example, but I figured it would be worth illustrating this with a real use-case scenario. The CONSTRUCT statement iterates over each line of the converted CSV file. The first bind creates a URI for the article, taking advantage of the MD5() hashing function to turn a potentially unwieldy URL into a clean, compact identifier. The bindings for ?Title and ?URL make use of the STRDT() (string + datatype) function to create typed literals.
The lower-case ?author and ?contentType variables illustrate how IRIs can be constructed for linking purposes, with the CONSTRUCT statement then using these IRIs to create entries of the appropriate type. There will be a lot of duplication in the triples, but once these are loaded into a triple store, the duplicates will be removed behind the scenes, since any given triple is indexed only once.
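If you would rather thin out duplicates before loading, tarql's --dedup option (which takes a window size of recently generated triples to check against) can be added to the call; the window size below is just an illustrative value:

> tarql --dedup 10000 toTurtle.sparql WDSummary.csv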
Finally, the script restricts the output to a single row number (via the ?ROWNUM filter) to keep the example short, though in practice you would probably not limit the conversion script. When this is run:
> tarql toTurtle.sparql WDSummary.csv
This generates the following output:
@prefix rdf:         <https://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix author:      <https://thecaglereport.com/ns/_Author/> .
@prefix xsd:         <https://www.w3.org/2001/XMLSchema#> .
@prefix class:       <https://thecaglereport.com/ns/_Class/> .
@prefix contentType: <https://thecaglereport.com/ns/_ContentType/> .
@prefix article:     <https://thecaglereport.com/ns/_Article/> .

article:_25c97410fa440f25765c80aa78e84590
        rdf:type                 class:_Article ;
        article:hasTitle         "Spectacular Visualization: The Eye of the Riemann Zeta Function" ;
        article:hasURL           "https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function" ;
        article:hasAuthor        author:_c3969164fdd72efafb3391c6f2c2a3d3 ;
        article:hasCreationDate  "1/4/2021" ;
        article:hasContentType   contentType:_Article ;
        article:hasHash          "VincentGranville44200" ;
        article:hasSocialPost    "Spectacular Visualization: The Eye of the Riemann Zeta Function by Vincent Granville https://www.datasciencecentral.com/profiles/blogs/spectacular-visualization-the-eye-of-the-riemann-zeta-function #dsc_Article #dsc_hash_VincentGranville44200" ;
        article:hasHTML          "<li><a >Spectacular Visualization: The Eye of the Riemann Zeta Function</a></li>" .

author:_c3969164fdd72efafb3391c6f2c2a3d3
        rdf:type        class:_Author ;
        author:hasName  "Vincent Granville" ;
        author:hasURL   "https://www.datasciencecentral.com/profile/VincentGranville" .

contentType:_Article
        rdf:type              class:_ContentType ;
        contentType:hasLabel  "Article" .
Tarql's performance is surprisingly good. On a reasonably fast machine, the six-thousand-line spreadsheet was processed in under a second.
What's more, it is possible to chain together multiple CONSTRUCT statements so that the triples defined in the first CONSTRUCT become available to the second CONSTRUCT statement. Additionally, more than one CSV source file can be used, making it possible to create joins from multiple tables (or an Excel spreadsheet with multiple worksheets). For more on that, I'd recommend reading Bob DuCharme's excellent article about tarql (which inspired me to write this one) at https://www.bobdc.com/blog/sparqlcsvjoin/.
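At its simplest, multiple CSV files can be passed on the command line and tarql applies the same mapping to each one in turn (Authors.csv here is a hypothetical second export; for true joins across tables, see DuCharme's article above):

> tarql toTurtle.sparql WDSummary.csv Authors.csv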
Modeling Thoughts and Summary
There is a tendency, when looking at tools such as tarql, to see them as ways to avoid doing hard modeling: it doesn't take that much work to put together a model from multiple spreadsheets. However, keep in mind that such models are frequently "naive," in the sense that the person putting the spreadsheet together most likely was not thinking about the long-term implications of creating a model, but instead built things up organically, based upon what was important at the time.
I've found, especially when working with spreadsheets developed by different groups, that it's worth taking the time to develop an intermediate ontology that gets the information into the system, then storing this information in a separate graph in a triple store. This holds true especially if the goal is to create a centralized knowledge base.
On the other hand, if your goal is to make the data within the spreadsheet available through other processes (such as GraphQL), then tarql (or similar tools built into many contemporary triple stores) may be exactly the ticket, especially if you can infer the structure of a given model from the data.
Either way, tarql should be a key part of any data analyst's workbench.
Kurt Cagle is the Community Editor for DataScienceCentral.com, and the editor of the LinkedIn Newsletter The Cagle Report, also collected at https://thecaglereport.com.