Data Wrangling: Generating a U.S. Presidential Election Knowledge Graph from a Google Spreadsheet
Kingsley Uyi Idehen
Founder & CEO at OpenLink Software | Driving GenAI-Based AI Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)
Arising from curiosity about the potential outcome of the 2020 US Presidential Election, I’ve constructed a spreadsheet comprising data about eligible voters in said election, a breakdown of early-votes cast by state, and the final tallies for the 2016 election.
I am interested in states that show very high early-vote counts as a percentage of the total vote count in the prior 2016 Presidential Election.
Data Wrangling Goal?
I want to generate an explorable Knowledge Graph from my Google Spreadsheet which itself comprises a merge of data from various sources that include:
- DBpedia
- 2016 Election Dataset published to Github by Kris Shaffer
- 2020 Early Voting Data from the U.S Elections Project
I will use the Virtuoso instance behind the public URIBurner Service (which includes an enabled Sponger Transformation Middleware Module), as my tool of choice.
Steps?
[1] Create Google Spreadsheet. Here’s my example.
[2] Pass the Spreadsheet URL to URIBurner using options for returning CSV rather than HTML — https://linkeddata.uriburner.com/about/html/https/docs.google.com/spreadsheets/d/1PTqUkqv-9BPWY1V1cFq13xR92EL6vJ9b8N4Ox3TUagE/gviz/tq?tqx=out:csv&range=A2:O53&sheet=2020_Election_Analysis
That returns an HTML-Entity Description Page which is also an entry point into a newly generated Knowledge Graph. This pages includes a listing of Entities derived from Rows in the original Spreadsheet.
For instance, clicking on the hyperlink “Record2” results in a lookup that returns data for the state of Texas.
Alternatively, you can click on the “Browse using” drop-down or manually construct an alternative URL-pattern to obtain a different HTML-based Entity Description page that includes powerful Faceted Browsing capability i.e., the use of Entity Attributes to provide pivot-style exploration across various dimensions.
For instance, here is a list of instances of the Class "2020 Election Analysis" derived from the Google Spreadsheet.
Here's a view of a specific Record, a specific instance of the "2020 Election Analysis" Class derived from the Google Spreadsheet.
Here's the Ontology derived from the Google Spreadsheet.
A Little Reasoning & Inference
At this point, I have two views of my Knowledge Graph, but the labelling for each state could be clearer.
To solve the preferred labelling problem, I’ve applied a SPARQL INSERT statement to generate an additional attribute that will ultimately be used as my preferred label, courtesy of the skos:prefLabel term from the SKOS Ontology.
PREFIX lod: <https://lod.openlinksw.com/> PREFIX rdfs: <https://www.w3.org/2000/01/rdf-schema#> PREFIX schema: <https://schema.org/> PREFIX owl: <https://www.w3.org/2002/07/owl#> PREFIX skos: <https://www.w3.org/2004/02/skos/core#> PREFIX election-2020-data: <https://docs.google.com/spreadsheets/d/1PTqUkqv-9BPWY1V1cFq13xR92EL6vJ9b8N4Ox3TUagE/gviz/tq?tqx=out:csv&range=A2:O53&sheet=2020_Election_Analysis#> INSERT { GRAPH <urn:us:election:2020:data:cleanup> { ?record skos:prefLabel ?label } } WHERE { ?record election-2020-data:State_Name ?label }
The following screenshots depict effects of the INSERT statement above i.e., how the underlying Faceted Browsing Engine has selected the preferred Entity Attribute for Display Labeling used in the “About:” heading.
And here's a revamped view of "Texas" rather than "Record2" .
Conclusion
Courtesy of the data management and transformation power of Virtuoso, I’ve successfully used a single mouse-click to generate a Knowledge Graph deployed using Linked Data principles from my Google Spreadsheet — comprising data about the U.S. Presidential Election collated from a variety of sources.
Related
- Data Wrangling Definition from the OpenLink Technology Glossary
- US 2020 Presidential Elections Google Spreadsheet
- Ontology Generated from the Google Spreadsheet -- RDF Turtle Format
- Ontology Generated from the Google Spreadsheet -- JSON-LD Format
- What is the Virtuoso Sponger, and Why is it Important?
- Virtuoso Home Page
Partner @ ObscureIQ??Data Broker Expert??Privacy Recovery for VIPs
4 年Thanks for this!
Founder & CEO at OpenLink Software | Driving GenAI-Based AI Agents | Harmonizing Disparate Data Spaces (Databases, Knowledge Bases/Graphs, and File System Documents)
4 年I've also added a new column to the spreadsheet for integrating interactive elections results data from The New York Times.
Software Engineering | Cloud | ML/AI | Solution Architecture | IT Strategy
4 年Kingsley Uyi Idehen You forgot to include referenced screenshots ("The following screenshots depict effects ..."). But, more importantly ... I couldn't figure out how to access the autogenerated ontology. I have found a way to get to the corresponding ontology page [https://linkeddata.uriburner.com/describe/?url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1PTqUkqv-9BPWY1V1cFq13xR92EL6vJ9b8N4Ox3TUagE%2Fgviz%2Ftq%3Ftqx%3Dout%3Acsv%26range%3DA2%3AO53%26sheet%3D2020_Election_Analysis%23ontology] (though it was a weird path - e.g.: container of -> (select state, e.g., Texas) -> (select any attribute, e.g., 2020 Early Vote Count) -> isDefinedBy -> 2020 Election Analysis Ontology - I think that such an important entity as ontology should be referenced at the main page of a resource), but I don't see a way to export a relevant Turtle (.ttl) file. Neither do I see a relevant SPARQL endpoint. Would you like to clarify?