Linked data for relational minds

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

Associate Librarian, Laurentian University

2019-03-15 Creative Commons License

I love relational databases

But there is another way…

Image credit: Paul Clarke on Wikimedia Commons (Creative Commons License)

Web of documents

  • A web address (like https://mcgill.ca/)
  • … returns an HTML document -- a web page
  • … which may link to other web addresses

Web of data

  • Not just documents that have meaning for humans
  • … but also data that has meaning for machines
  • a semantic web

Modelling today's class

People

Type or class        Instance
Professor            Dr. Guastavino
Student              You*
Teaching assistant   Richard Yanaky
Guest lecturer       Dan Scott

Entities and relationships: graphs

  • A set of nodes, vertices, or points connected by edges, arcs, or lines
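To make the definition concrete, here is a minimal Python sketch that stores today's class as a labelled graph. The people come from the table above; the node "Today's class" and the edge labels ("teaches", "assists", and so on) are illustrative assumptions, not terms from any vocabulary.

```python
from collections import defaultdict

# Each edge is (source node, edge label, target node).
# The labels and the "Today's class" node are made up for illustration.
edges = [
    ("Dr. Guastavino", "teaches", "Today's class"),
    ("Richard Yanaky", "assists", "Today's class"),
    ("Dan Scott", "guest lectures", "Today's class"),
    ("You", "attends", "Today's class"),
]


def build_adjacency(edge_list):
    """Map each node to the list of (label, target) edges leaving it."""
    adjacency = defaultdict(list)
    for source, label, target in edge_list:
        adjacency[source].append((label, target))
    return dict(adjacency)


def neighbours_in(edge_list, node):
    """Return, in sorted order, the nodes with an edge pointing at `node`."""
    return sorted(source for source, _, target in edge_list if target == node)


graph = build_adjacency(edges)
print(graph["Dan Scott"])
print(neighbours_in(edges, "Today's class"))
```

An edge list like this is already close to the subject, predicate, object shape that RDF uses.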

Linked data principles

  1. Use URIs as names for things.
  2. Use HTTP URIs (web addresses) so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

Berners-Lee, T. (2009, June 18). Linked Data - Design Issues. Retrieved March 10, 2019, from https://www.w3.org/DesignIssues/LinkedData.html

Plain language statements

Subject             Predicate       Object
Dan Scott           member of       McGill University
McGill University   location        Montreal
McGill University   founding date   1821

Resource Description Framework (RDF)

Use HTTP URIs (web addresses) to identify things:

Subject                Predicate                        Object
https://dscott.ca/#i   http://schema.org/memberOf       https://mcgill.ca/
https://mcgill.ca/     http://schema.org/location       https://ville.montreal.qc.ca/
https://mcgill.ca/     http://schema.org/foundingDate   "1821"^^xsd:gYear

These three-part statements are called triples.
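As a rough sketch of how these triples look as data, the Python below holds the three statements as tuples and prints them in N-Triples, one of the simplest RDF syntaxes. The helper names (format_term, to_ntriples) are hypothetical, and the sketch assumes the literal value needs no escaping; a real application would more likely use an RDF library.

```python
# xsd:gYear expanded to its full IRI, as N-Triples requires.
XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"

# Each triple is (subject, predicate, object). An object is either an IRI
# string or a (lexical value, datatype IRI) pair for a typed literal.
triples = [
    ("https://dscott.ca/#i", "http://schema.org/memberOf", "https://mcgill.ca/"),
    ("https://mcgill.ca/", "http://schema.org/location", "https://ville.montreal.qc.ca/"),
    ("https://mcgill.ca/", "http://schema.org/foundingDate", ("1821", XSD_GYEAR)),
]


def format_term(term):
    """Render an IRI as <...> or a typed literal as "value"^^<datatype>."""
    if isinstance(term, tuple):
        value, datatype = term
        # Assumption: `value` contains no quotes or backslashes to escape.
        return f'"{value}"^^<{datatype}>'
    return f"<{term}>"


def to_ntriples(triple_list):
    """Serialize triples as N-Triples: one statement per line, ending in ' .'."""
    return "\n".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} ."
        for s, p, o in triple_list
    )


print(to_ntriples(triples))
```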

Vocabularies and ontologies

Vocabularies: (mostly) naming things

  • Classes (types of things)
  • Properties (relationships between things, or predicates)
  • Often expressed in RDFS:
    • Domain and range restrictions
    • Human-readable descriptions

Ontology: "the study of being"

  • Describes a worldview for a given domain
  • Often expressed in OWL (more complex than RDFS):
    • Cardinalities
    • Class disjointness
    • Class intersections and unions
  • Can link vocabularies (owl:equivalentClass, owl:equivalentProperty) and things (owl:sameAs)
  • Enables reasoning over / deriving inferences from your data
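A toy illustration of deriving inferences: the Python sketch below applies a single RDFS-style rule (if x has type C, and C is a subclass of D, then x has type D) until nothing new appears. It covers only this one rule, far less than an OWL reasoner does, and the ex: names and class hierarchy are invented for the example.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical data: the ex: identifiers and the hierarchy are assumptions.
triples = {
    ("ex:DanScott", RDF_TYPE, "ex:GuestLecturer"),
    ("ex:GuestLecturer", SUBCLASS_OF, "ex:Instructor"),
    ("ex:Instructor", SUBCLASS_OF, "ex:Person"),
}


def infer_types(triple_set):
    """Apply (x type C) + (C subClassOf D) => (x type D) to a fixed point."""
    inferred = set(triple_set)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, sup in subclass_pairs:
                new_triple = (s, RDF_TYPE, sup)
                if o == sub and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


for triple in sorted(infer_types(triples) - triples):
    print(triple)
```

The two printed triples were never stated in the data; they follow from the class hierarchy alone.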

Notable vocabularies/ontologies

  • FOAF - linking people
  • schema.org - general vocabulary, led by search engines
  • SKOS - thesauri, taxonomies, classification schemes
  • BIBFRAME - bibliographic description; replacement for MARC 21

Publishing linked data

Syntaxes

  • Inline syntax: JSON-LD, RDFa, microdata
  • Parallel data (content negotiation): Turtle, RDF/XML, JSON-LD, NTriples, …

Common approaches

  • Inline: modify HTML templates
  • Content negotiation:
    • Different templates
    • R2RML - mapping database tables to RDF
  • SPARQL endpoint (for queries)
  • Data dumps
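Content negotiation means the server inspects the HTTP Accept header and returns the matching serialization from the same URI. The Python sketch below shows one simplified way to make that choice. The function names are hypothetical, and falling back to HTML for unsupported types (rather than answering 406 Not Acceptable) is a simplifying assumption.

```python
# Media types for the syntaxes listed above.
SUPPORTED = {
    "text/html": "HTML page (optionally with inline JSON-LD or RDFa)",
    "text/turtle": "Turtle",
    "application/ld+json": "JSON-LD",
    "application/rdf+xml": "RDF/XML",
    "application/n-triples": "N-Triples",
}


def parse_accept(header):
    """Return (media type, q) pairs sorted by descending q value."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # the default weight when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    # sorted() is stable, so ties keep the order the client listed them in.
    return sorted(choices, key=lambda choice: choice[1], reverse=True)


def negotiate(header, default="text/html"):
    """Pick the best supported media type, or `default` if none matches."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return default


print(negotiate("application/rdf+xml;q=0.5, application/ld+json"))
```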

Querying linked data: SPARQL

  • Just different enough from SQL to be confusing 😕
    • SELECT variables rather than columns
    • Usually no FROM clause: by default you query the entire dataset of triples!
    • WHERE clause creates a pattern for the triples you want
    • No explicit JOINs: reusing a variable across patterns expresses the relationships between entities
    • FILTER attribute values to narrow further
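To show what the WHERE clause is doing, here is a minimal Python sketch of triple-pattern matching: terms starting with ? are variables, and a variable shared by two patterns acts as the join. It illustrates the idea only, not how a real SPARQL engine is built, and the plain-language sample triples are typed in by hand.

```python
# A small hand-typed dataset of (subject, predicate, object) triples.
triples = [
    ("McGill University", "instance of", "university"),
    ("McGill University", "country", "Canada"),
    ("Laurentian University", "instance of", "university"),
    ("Laurentian University", "country", "Canada"),
    ("Montreal", "instance of", "city"),
    ("Montreal", "country", "Canada"),
]


def match_pattern(pattern, triple, bindings):
    """Extend `bindings` so `pattern` matches `triple`, or return None."""
    new_bindings = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # Bind the variable, or check it agrees with an earlier binding.
            if new_bindings.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return new_bindings


def query(patterns, triple_list):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triple_list:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Both patterns use ?s, so only universities get a ?country binding.
results = query(
    [("?s", "instance of", "university"), ("?s", "country", "?country")],
    triples,
)
for row in results:
    print(row)
```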

Hands-on with Wikidata: randomness

  1. Open query.wikidata.org in a browser.
  2. On the right-hand side, type the following query:
    SELECT * WHERE {
      ?s ?p ?o
    }
    LIMIT 10
  3. Press CTRL + ENTER, or click the arrow button on the left, to submit the query.
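The same query can also be sent from a script instead of the browser. The sketch below uses only the Python standard library and assumes the public endpoint at https://query.wikidata.org/sparql with JSON results; the function names and the User-Agent string are placeholders to replace with your own.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """SELECT * WHERE {
  ?s ?p ?o
}
LIMIT 10"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the SPARQL query as a GET request URL asking for JSON."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def run_query(query, user_agent="linked-data-class-example/0.1 (placeholder)"):
    """Send the query and return the list of result rows (bindings).

    Needs network access; the User-Agent value is a placeholder.
    """
    request = Request(build_url(query), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=60) as response:
        data = json.load(response)
    return data["results"]["bindings"]


# Printing the URL needs no network access; call run_query(QUERY) to fetch rows.
print(build_url(QUERY))
```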

Wikidata SPARQL: McGill

  1. Let's get something less random. Change ?o to "McGill University":
    SELECT * WHERE {
      ?s ?p "McGill University"
    }
    LIMIT 10
  2. Two of the subjects should be Q201492. Click on one of them to see the structured data about McGill University.

Wikidata SPARQL: Universities

Let's retrieve all instances of universities in Wikidata:
  1. Change your predicate to wdt:P31 (instance of).
  2. Change your object to wd:Q3918 (university).
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918
    }
  • wdt: = "truthy" direct property (the best-ranked, non-deprecated values)
  • wd: = entity

Wikidata SPARQL: countries

Let's get the countries for each university in Wikidata:
  1. Append the "AND" operator (a period .) to your first statement.
  2. Add a statement that asks for the country (predicate = wdt:P17) for the subject and store the value in a new variable, ?country.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country
    }

Wikidata SPARQL: labels!

Okay, we need human-friendly labels.
  1. Add a statement that asks for the rdfs:label of the subject.
  2. Add a statement that asks for the rdfs:label of the country.
  3. Add a LIMIT 100 clause to keep things fast.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
    }
    LIMIT 100

Don't forget the "AND" operator (period .) for your statements!

Wikidata SPARQL: English labels!

Okay, we really need labels in a particular language.
  1. Add FILTER statements that restrict the language of each label to "en" (FILTER(LANG(?countryLabel) = "en"))
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    LIMIT 100
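The reason for the language FILTER is that each label is a language-tagged string, and one entity usually has many of them. A minimal Python sketch of the same filtering, using a small hand-typed sample of labels (the function name labels_in is hypothetical):

```python
# (entity, label text, language tag): a tiny hand-typed sample.
labels = [
    ("wd:Q201492", "McGill University", "en"),
    ("wd:Q201492", "Université McGill", "fr"),
    ("wd:Q16", "Canada", "en"),
    ("wd:Q16", "Kanada", "de"),
]


def labels_in(language, label_rows):
    """Keep one label per entity: the one tagged with `language`."""
    return {
        entity: text
        for entity, text, lang in label_rows
        if lang == language
    }


# Without the filter there are four rows; with it, one per entity.
print(len(labels))
print(labels_in("en", labels))
```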

Wikidata SPARQL: ORDER

Let's sort the results in order of country name.
  1. Add an ORDER BY ?countryLabel clause just before the LIMIT clause:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    ORDER BY ?countryLabel
    LIMIT 100

Wikidata SPARQL: aggregates

SPARQL can count and group too:
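  1. Change your SELECT clause to select ?countryLabel (COUNT (?s) AS ?cnt).
  2. Remove the ?sLabel statement and its FILTER, which are no longer needed.
  3. Add a GROUP BY ?countryLabel clause just before the ORDER BY clause.
  4. Change the ORDER BY clause to ORDER BY DESC(?cnt):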
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    ORDER BY DESC(?cnt)
    LIMIT 100

Wikidata SPARQL: HAVING filter

SPARQL can filter aggregate results:
  1. Add a HAVING(COUNT(?s) <= 50) clause just before the ORDER BY clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    HAVING(COUNT(?s) <= 50)
    ORDER BY DESC(?cnt)
    LIMIT 100
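For readers who think in code rather than query clauses, the Python sketch below mirrors GROUP BY, HAVING, ORDER BY, and LIMIT. The rows are made-up placeholders, not Wikidata results, and the threshold is lowered to 2 so the filter has a visible effect.

```python
from collections import Counter

# Made-up (university, country label) rows standing in for query results.
rows = [
    ("u1", "Canada"), ("u2", "Canada"), ("u3", "Canada"),
    ("u4", "France"), ("u5", "France"),
    ("u6", "Japan"),
]


def count_by_country(result_rows, max_count, limit):
    """Count rows per country, keep small groups, sort, then truncate."""
    # GROUP BY ?countryLabel with COUNT(?s) AS ?cnt
    counts = Counter(country for _, country in result_rows)
    # HAVING(COUNT(?s) <= max_count)
    kept = [(country, cnt) for country, cnt in counts.items() if cnt <= max_count]
    # ORDER BY DESC(?cnt)
    kept.sort(key=lambda item: item[1], reverse=True)
    # LIMIT limit
    return kept[:limit]


print(count_by_country(rows, max_count=2, limit=100))
```

Note that HAVING runs after grouping, so it can test the count, while FILTER runs on individual rows before they are grouped.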

Further resources