Linked data for relational minds

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

Associate Librarian, Laurentian University

2019-11-26 Creative Commons License

I love relational databases

But there is another way…

Image credit: Paul Clarke on Wikimedia Commons Creative Commons License

Web of documents

  • A web address (like https://laurentian.ca/)
  • … returns an HTML document -- a web page
  • … which may link to other web addresses

Web of data

  • Not just documents that have meaning for humans
  • … but also data that has meaning for machines
  • a semantic web

Modelling universities

Entity-relationship diagram

Relational

Table             Columns
University        id (PK), name, founding date, location, ...
UniversityPerson  id (PK), university (FK), person (FK), role
Person            id (PK), name, birth date, death date, ...
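
A minimal sketch of these three tables using Python's built-in sqlite3 module. The table and column names follow the slide; the column types and sample rows are assumptions for illustration.

```python
import sqlite3

# In-memory database. Column types and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE University (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        founding_date TEXT,
        location TEXT
    );
    CREATE TABLE Person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        birth_date TEXT,
        death_date TEXT
    );
    CREATE TABLE UniversityPerson (
        id INTEGER PRIMARY KEY,
        university INTEGER NOT NULL REFERENCES University(id),
        person INTEGER NOT NULL REFERENCES Person(id),
        role TEXT
    );
""")
conn.execute("INSERT INTO University VALUES (1, 'Laurentian University', '1960', 'Sudbury')")
conn.execute("INSERT INTO Person VALUES (1, 'Dan Scott', NULL, NULL)")
conn.execute("INSERT INTO UniversityPerson VALUES (1, 1, 1, 'Associate Librarian')")

# Relating a person to a university takes two JOINs through the link table.
rows = conn.execute("""
    SELECT p.name, up.role, u.name
    FROM Person p
    JOIN UniversityPerson up ON up.person = p.id
    JOIN University u ON u.id = up.university
""").fetchall()
print(rows)
```

The link table is what lets one person hold roles at several universities, at the cost of an extra JOIN in every query.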

Entities and relationships: graphs

  • A set of nodes, vertices, or points connected by edges, arcs, or lines
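
One common way to hold such a graph in memory is an adjacency list. The sketch below is only an illustration: the node and edge names are borrowed from the university example, and the `neighbours` helper is hypothetical.

```python
from collections import defaultdict

# Labelled, directed edges: (source node, edge label, target node).
edges = [
    ("Dan Scott", "member of", "Laurentian University"),
    ("Laurentian University", "location", "Sudbury"),
]

# Adjacency list: each node maps to its outgoing (label, target) pairs.
graph = defaultdict(list)
for source, label, target in edges:
    graph[source].append((label, target))

def neighbours(node):
    """Return the nodes reachable from `node` by following one edge."""
    return [target for _label, target in graph.get(node, [])]

print(neighbours("Dan Scott"))
```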

Graph

Linked data principles

  1. Use URIs as names for things.
  2. Use HTTP URIs (web addresses) so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

Berners-Lee, T. (2009, June 18). Linked Data - Design Issues. Retrieved March 10, 2019, from https://www.w3.org/DesignIssues/LinkedData.html

Plain language statements

Subject                Predicate      Object
Dan Scott              member of      Laurentian University
Laurentian University  location       Sudbury
Laurentian University  founding date  1960

Resource Description Framework (RDF)

Use HTTP URIs (web addresses) to identify things:

Subject                 Predicate                       Object
https://dscott.ca/#i    http://schema.org/memberOf      https://laurentian.ca/
https://laurentian.ca/  http://schema.org/location      https://greatersudbury.ca/
https://laurentian.ca/  http://schema.org/foundingDate  "1960"^^xsd:gYear

These three-part statements are called triples.
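
To make the idea concrete, here is a small sketch that stores those three triples as Python tuples and writes them out as N-Triples lines. The `to_ntriples` helper is hypothetical and skips the character escaping a real serializer would need.

```python
# Each triple is a (subject, predicate, object) tuple.
# URIs are plain strings; a typed literal is a (value, datatype URI) pair.
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"

triples = [
    ("https://dscott.ca/#i", "http://schema.org/memberOf", "https://laurentian.ca/"),
    ("https://laurentian.ca/", "http://schema.org/location", "https://greatersudbury.ca/"),
    ("https://laurentian.ca/", "http://schema.org/foundingDate", ("1960", XSD_GYEAR)),
]

def to_ntriples(triple):
    """Serialize one triple as an N-Triples line (no escaping; sketch only)."""
    s, p, o = triple
    if isinstance(o, tuple):          # typed literal
        value, datatype = o
        obj = f'"{value}"^^<{datatype}>'
    else:                             # object is a URI
        obj = f"<{o}>"
    return f"<{s}> <{p}> {obj} ."

for t in triples:
    print(to_ntriples(t))
```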

Vocabularies and ontologies

Vocabularies: (mostly) naming things

  • Classes (types of things)
  • Properties (relationships between things, or predicates)
  • Often expressed in RDFS:
    • Domain and range restrictions
    • Human-readable descriptions
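
A domain restriction lets a machine infer the class of a subject from the property used on it. The sketch below applies that single RDFS rule to a handful of in-memory triples. The `ex:` names are hypothetical and do not come from a real vocabulary.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"

# One hypothetical vocabulary statement plus one data statement.
triples = {
    ("ex:memberOf", RDFS_DOMAIN, "ex:Person"),
    ("ex:dan", "ex:memberOf", "ex:laurentian"),
}

def infer_types_from_domains(triples):
    """If (p rdfs:domain C) and (s p o) both hold, infer (s rdf:type C)."""
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    inferred = set()
    for prop, cls in domains:
        for s, p, _o in triples:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))
    return inferred

print(infer_types_from_domains(triples))
```

Note that the restriction does not reject data: it adds a new statement about the subject.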

Ontology: "the study of being"

  • Describes a worldview for a given domain
  • Often expressed in OWL (more complex than RDFS):
    • Cardinalities
    • Class disjointness
    • Class intersections and unions
  • Can link vocabularies (owl:equivalentClass, owl:equivalentProperty) and things (owl:sameAs)
  • Enables reasoning over / deriving inferences from your data

Notable vocabularies/ontologies

  • FOAF - linking people
  • schema.org - general vocabulary, led by search engines
  • SKOS - thesauri, taxonomies, classification schemes
  • BIBFRAME - bibliographic description; replacement for MARC 21

Publishing linked data

Syntaxes

  • Inline syntax: JSON-LD, RDFa, microdata
  • Parallel data (content negotiation): Turtle, RDF/XML, JSON-LD, NTriples, …
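
As a rough illustration of the inline approach, this sketch builds a JSON-LD description of Laurentian University and wraps it in a script element for an HTML page. The property names come from schema.org; the `@type` value is an assumption, since the slides do not state a class.

```python
import json

# "@context", "@id", and "@type" are JSON-LD keywords.
# The "@type" value is an assumption made for this example.
doc = {
    "@context": "http://schema.org/",
    "@id": "https://laurentian.ca/",
    "@type": "CollegeOrUniversity",
    "name": "Laurentian University",
    "foundingDate": "1960",
    "location": {"@id": "https://greatersudbury.ca/"},
}

# Inline publishing: embed the JSON-LD in a script element of the page.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(doc, indent=2)
    + "\n</script>"
)
print(script_tag)
```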

Common approaches

  • Inline: modify HTML templates
  • Content negotiation:
    • Different templates
    • R2RML - mapping database tables to RDF
  • SPARQL endpoint (for queries)
  • Data dumps
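
Content negotiation comes down to the client naming the format it wants in an Accept header. The sketch below builds such a request without sending it. Whether a given server answers with Turtle depends on its configuration, so the URI here is only a placeholder taken from the earlier slides.

```python
import urllib.request

def turtle_request(uri):
    """Build (but do not send) a request that asks for a Turtle representation."""
    return urllib.request.Request(uri, headers={"Accept": "text/turtle"})

req = turtle_request("https://laurentian.ca/")
print(req.get_method(), req.full_url, req.get_header("Accept"))
# To send it: urllib.request.urlopen(req)
```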

Origins of Wikidata

  • Wikipedia (and the many Wikipedias)
  • Wikimedia Commons
  • The Wikimedia Foundation
  • A human-editable repository of linked data under the CC-0 license
  • Now over 75 million items

What kind of data?

  • Must meet at least one criterion of Wikidata's notability policy:
    • It contains a valid sitelink to a Wikimedia Foundation project page
    • It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references
    • Or fulfills some structural need--supporting another item
  • Tension: representing everything that would benefit from a multilingual representation in Wikipedia and related projects, without tackling literally everything. There is room for MusicBrainz and other subject-dedicated projects.

Wikidata runs on Wikibase -- all open source software

  • MediaWiki (PHP + MySQL + Apache)
  • Wikibase extension for MediaWiki
  • ElasticSearch (full text search)
  • BlazeGraph triple store (query service)
  • Nginx (query service proxy)

Querying linked data: SPARQL

  • Just different enough from SQL to be confusing 😕
    • SELECT variables rather than columns
    • Usually no FROM clause: by default you query the entire dataset of triples!
    • WHERE clause creates a pattern for the triples you want
    • No explicit JOINs: triple patterns that share a variable connect the entities
    • FILTER attribute values to narrow further
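
To show how a WHERE pattern stands in for JOINs, here is a toy matcher over in-memory triples. It is a sketch of the idea only, not how a triple store is implemented; `match_pattern` and `query` are hypothetical helpers, and terms starting with `?` are treated as variables.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                      # variable
            if new_binding.setdefault(term, value) != value:
                return None                           # clashes with an earlier binding
        elif term != value:                           # constants must match exactly
            return None
    return new_binding

def query(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

triples = [
    ("Dan Scott", "member of", "Laurentian University"),
    ("Laurentian University", "location", "Sudbury"),
    ("Laurentian University", "founding date", "1960"),
]

# The shared variable ?org does the work of a JOIN.
results = query(
    [("?person", "member of", "?org"), ("?org", "location", "?place")],
    triples,
)
print(results)
```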

Hands-on with Wikidata: randomness

  1. Open query.wikidata.org in a browser.
  2. On the right hand side, type the following query:
    SELECT * WHERE {
      ?s ?p ?o
    }
    LIMIT 10
  3. Press CTRL + ENTER, or click the arrow button on the left, to submit the query.
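
The same query can also be sent from a script. This sketch only builds the HTTP request for the query service's SPARQL endpoint, so it runs without network access. The User-Agent string is a hypothetical placeholder that you would replace with one identifying your own script.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT * WHERE {
  ?s ?p ?o
}
LIMIT 10
"""

def build_request(query):
    """Build a GET request for the endpoint, asking for JSON results."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical User-Agent; replace with one identifying your script.
        "User-Agent": "linked-data-demo/0.1 (example script)",
    }
    return urllib.request.Request(url, headers=headers)

req = build_request(QUERY)
print(req.full_url)
# To run it: urllib.request.urlopen(req), then parse the JSON response body.
```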

Wikidata SPARQL: Laurentian

  1. Let's get something less random. Change ?o to "Laurentian University":
    SELECT * WHERE {
      ?s ?p "Laurentian University"
    }
    LIMIT 10
  2. One of the subjects should be Q3551432. Click on it to see the structured data about Laurentian University.

Wikidata SPARQL: Universities

Let's retrieve all instances of universities in Wikidata:
  1. Change your predicate to wdt:P31 (instance of).
  2. Change your object to wd:Q3918 (university).
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918
    }
  • wdt: = "truthy" direct property (the best-ranked statement values)
  • wd: = entity

Wikidata SPARQL: countries

Let's get the countries for each university in Wikidata:
  1. Append the "AND" operator (a period .) to your first statement.
  2. Add a statement that asks for the country (predicate = wdt:P17) for the subject and store the value in a new variable, ?country.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country
    }

Wikidata SPARQL: labels!

Okay, we need human-friendly labels.
  1. Add a statement that asks for the rdfs:label of the subject.
  2. Add a statement that asks for the rdfs:label of the country.
  3. Add a LIMIT 100 clause to keep things fast.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
    }
    LIMIT 100

Don't forget the "AND" operator (period .) for your statements!

Wikidata SPARQL: English labels!

Okay, we really need labels in a particular language.
  1. Add FILTER statements that restrict the language of each label to "en", for example FILTER(LANG(?countryLabel) = "en"):
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    LIMIT 100
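
Why the filters matter: every label in every language is its own triple, so the unfiltered query returns one row per combination of labels. A small sketch with made-up label lists (the specific labels are illustrative):

```python
from itertools import product

# Labels as (text, language tag) pairs, as RDF stores language-tagged strings.
s_labels = [("Laurentian University", "en"), ("Université Laurentienne", "fr")]
country_labels = [("Canada", "en"), ("Kanada", "de")]

# Without a FILTER, every label combination is a separate result row.
all_rows = list(product(s_labels, country_labels))

# With FILTER(LANG(...) = "en") on both labels, one row remains.
english_rows = [
    (s_text, c_text)
    for (s_text, s_lang), (c_text, c_lang) in all_rows
    if s_lang == "en" and c_lang == "en"
]
print(len(all_rows), english_rows)
```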

Wikidata SPARQL: ORDER

Let's sort the results in order of country name.
  1. Add an ORDER BY ?countryLabel clause just before the LIMIT clause:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    ORDER BY ?countryLabel
    LIMIT 100

Wikidata SPARQL: on a map

Let's display the results on a map. We'll need coordinates for that.
  1. Add a clause requesting wdt:P625 (coordinate location) and store its value in a ?coords variable:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
      ?s wdt:P625 ?coords .
    }
    ORDER BY ?countryLabel
    LIMIT 100
  2. Change the display type from the standard table to a map by clicking the eye icon.
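
If you want to plot the results yourself, the coordinate values need parsing. As far as I know, the query service returns them as WKT text of the form Point(longitude latitude), longitude first. The sketch below assumes that form; `parse_point` is a hypothetical helper and the sample value is approximate.

```python
import re

# Matches simple decimal coordinates only (no exponents); sketch, not a full WKT parser.
WKT_POINT = re.compile(r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$")

def parse_point(wkt):
    """Parse 'Point(longitude latitude)' into a (latitude, longitude) pair of floats."""
    match = WKT_POINT.match(wkt)
    if match is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    lon, lat = (float(group) for group in match.groups())
    return lat, lon

print(parse_point("Point(-80.97 46.47)"))
```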

Wikidata SPARQL: subclasses and OPTIONAL

Let's focus on Canadian universities, including instances of subclasses of university, and make the coordinates optional:

# show Canadian instances of university, or of any of its subclasses, on a map
#defaultView:Map

SELECT * WHERE {
  ?s wdt:P31/wdt:P279* wd:Q3918 . # instance of university, or of any subclass of it
  ?s wdt:P31 ?instanceOf .        # instance of what?
  ?s wdt:P17 wd:Q16 .             # in Canada
  ?s rdfs:label ?sLabel .
  ?instanceOf rdfs:label ?instanceLabel .
  FILTER(LANG(?sLabel) = "en") .  # give us the English label
  FILTER(LANG(?instanceLabel) = "en") .  # give us the English label
  OPTIONAL{?s wdt:P625 ?coords} . # and coordinates, if possible
}
ORDER BY ?sLabel
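
The path wdt:P31/wdt:P279* means one "instance of" step followed by zero or more "subclass of" steps. Here is a sketch of that traversal over a hypothetical mini-dataset. The item and class names are placeholders, not real Wikidata entries.

```python
# Hypothetical data. P279 (subclass of): child class -> parent classes.
subclass_of = {
    "public university": ["university"],
    "research university": ["university"],
    "public research university": ["public university", "research university"],
}
# P31 (instance of): item -> classes.
instance_of = {
    "School A": ["university"],
    "School B": ["public research university"],
    "Museum C": ["museum"],
}

def is_subclass_or_same(cls, target):
    """True if `cls` reaches `target` via zero or more P279 steps (P279*)."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, []))
    return False

def instances_of(target):
    """Items matching the pattern ?s wdt:P31/wdt:P279* target."""
    return sorted(
        item
        for item, classes in instance_of.items()
        if any(is_subclass_or_same(c, target) for c in classes)
    )

print(instances_of("university"))
```

School B is found even though it is never directly declared a university, which is the reason for using the path in the query.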

Wikidata SPARQL: aggregates

SPARQL can count and group too:
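  1. Change your SELECT clause to select ?countryLabel (COUNT (?s) AS ?cnt).
  2. Remove the ?sLabel and ?coords statements, which the counts do not need.
  3. Add a GROUP BY ?countryLabel clause after the WHERE block, and change the ORDER BY clause to ORDER BY DESC(?cnt):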
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    ORDER BY DESC(?cnt)
    LIMIT 100

Wikidata SPARQL: HAVING filter

SPARQL can filter aggregate results, much like HAVING in SQL:
  1. Add a HAVING(COUNT(?s) <= 50) clause just before the ORDER BY clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    HAVING(COUNT(?s) <= 50)
    ORDER BY DESC(?cnt)
    LIMIT 100
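
For readers who think in code rather than SQL, the grouping, HAVING, and ordering steps can be sketched in plain Python. The rows below are hypothetical stand-ins for what the WHERE clause returns, and `count_by_country` is an illustrative helper.

```python
from collections import Counter

# Hypothetical (university, country) rows, as the WHERE clause might return them.
rows = [
    ("U1", "Canada"), ("U2", "Canada"), ("U3", "Canada"),
    ("U4", "France"), ("U5", "France"),
    ("U6", "Iceland"),
]

def count_by_country(rows, max_count):
    """GROUP BY country, HAVING count <= max_count, ORDER BY count DESC."""
    counts = Counter(country for _university, country in rows)      # GROUP BY + COUNT
    kept = [(c, n) for c, n in counts.items() if n <= max_count]    # HAVING
    return sorted(kept, key=lambda pair: pair[1], reverse=True)     # ORDER BY DESC

print(count_by_country(rows, max_count=2))
```

As in SQL, the HAVING step runs on the grouped counts, whereas FILTER runs on individual rows before grouping.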

In practice

A few research directions

  • Ontology development
  • Information retrieval
  • Data quality

Further resources