Linked data for relational minds

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

Associate Librarian, Laurentian University

2020-11-18 Creative Commons License

Image credit: Paul Clarke on Wikimedia Commons Creative Commons License

Web of documents

  • A web address (like https://mcgill.ca/)
  • … returns an HTML document -- a web page
  • … which may link to other web addresses

Web of data

  • Not just documents that have meaning for humans
  • … but also data that has meaning for machines
  • a semantic web

Linked data principles

  1. Use URIs as names for things.
  2. Use HTTP URIs (web addresses) so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

Berners-Lee, T. (2009, June 18). Linked Data - Design Issues. Retrieved March 10, 2019, from https://www.w3.org/DesignIssues/LinkedData.html

Plain language statements

Subject | Predicate | Object
Dan Scott | member of | McGill University
McGill University | location | Montreal
McGill University | founding date | 1821

Resource Description Framework (RDF)

Use HTTP URIs (web addresses) to identify things:

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | McGill University
McGill University | location | Montreal
McGill University | founding date | 1821

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | https://mcgill.ca/
https://mcgill.ca/ | location | Montreal
https://mcgill.ca/ | founding date | 1821

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | https://mcgill.ca/
https://mcgill.ca/ | location | https://ville.montreal.qc.ca/
https://mcgill.ca/ | founding date | 1821

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | http://schema.org/memberOf | https://mcgill.ca/
https://mcgill.ca/ | http://schema.org/location | https://ville.montreal.qc.ca/
https://mcgill.ca/ | http://schema.org/foundingDate | 1821

RDF: using expressive literals

Subject | Predicate | Object-or-value
https://dscott.ca/#i | http://schema.org/memberOf | https://mcgill.ca/
https://mcgill.ca/ | http://schema.org/location | https://ville.montreal.qc.ca/
https://mcgill.ca/ | http://schema.org/foundingDate | "1821"^^xsd:gYear
https://ville.montreal.qc.ca/ | http://schema.org/name | "Montreal"@en
https://ville.montreal.qc.ca/ | http://schema.org/name | "Montréal"@fr

These three-part statements are called triples.
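
As a rough illustration, the triples in the table above can be held in Python as plain three-item tuples. This is only a sketch: the `Literal` class, the `triples` list, and the `names` helper are hypothetical names chosen for this example, and the `xsd:` prefix is written out in full.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """A literal value with an optional datatype URI or language tag."""
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


SCHEMA = "http://schema.org/"
XSD = "http://www.w3.org/2001/XMLSchema#"
MCGILL = "https://mcgill.ca/"
MONTREAL = "https://ville.montreal.qc.ca/"

# Each triple is (subject, predicate, object-or-value).
triples = [
    ("https://dscott.ca/#i", SCHEMA + "memberOf", MCGILL),
    (MCGILL, SCHEMA + "location", MONTREAL),
    (MCGILL, SCHEMA + "foundingDate", Literal("1821", datatype=XSD + "gYear")),
    (MONTREAL, SCHEMA + "name", Literal("Montreal", lang="en")),
    (MONTREAL, SCHEMA + "name", Literal("Montréal", lang="fr")),
]


def names(subject, lang):
    """Return the schema:name values of a subject in one language."""
    return [
        o.value
        for s, p, o in triples
        if s == subject
        and p == SCHEMA + "name"
        and isinstance(o, Literal)
        and o.lang == lang
    ]


print(names(MONTREAL, "fr"))
```

Objects that are URIs stay as plain strings, so they can in turn appear as the subject of other triples; literals cannot.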

Vocabularies and ontologies

Vocabularies: (mostly) naming things

  • Classes (types of things)
  • Properties (relationships between things, or predicates)
  • Often expressed in RDFS:
    • Domain and range restrictions
    • Human-readable descriptions

Ontology: "the study of being"

  • Describes a worldview for a given domain
  • Often expressed in OWL (more complex than RDFS):
    • Cardinalities
    • Class disjointness
    • Class intersections and unions
  • Can link vocabularies (owl:equivalentClass, owl:equivalentProperty) and things (owl:sameAs)
  • Enables reasoning over / deriving inferences from your data
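
To give a feel for what "deriving inferences" means, here is a minimal sketch that applies two RDFS-style rules (domain and subclass) until nothing new can be derived. It is far simpler than a real OWL reasoner, and the `ex:` vocabulary, the sample data, and the `infer` function are all hypothetical.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_SUBCLASS = "rdfs:subClassOf"

# Hypothetical vocabulary (ex:) and data, for illustration only.
triples = {
    ("ex:foundingDate", RDFS_DOMAIN, "ex:University"),
    ("ex:University", RDFS_SUBCLASS, "ex:Organization"),
    ("ex:mcgill", "ex:foundingDate", "1821"),
}


def infer(triples):
    """Apply the domain and subclass rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            # Domain rule: using property p implies the subject
            # belongs to the class named as p's domain.
            for p2, rel, cls in inferred:
                if rel == RDFS_DOMAIN and p2 == p:
                    new.add((s, RDF_TYPE, cls))
            # Subclass rule: an instance of a class is also an
            # instance of each of its superclasses.
            if p == RDF_TYPE:
                for sub, rel, sup in inferred:
                    if rel == RDFS_SUBCLASS and sub == o:
                        new.add((s, RDF_TYPE, sup))
        if new <= inferred:
            return inferred
        inferred |= new


for triple in sorted(infer(triples) - triples):
    print(triple)
```

Nothing in the input says that `ex:mcgill` is an organization; that statement is derived from the vocabulary alone.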

Publishing linked data

Syntaxes

  • Embedded within HTML: JSON-LD, RDFa, microdata
  • Parallel data (content negotiation): Turtle, RDF/XML, JSON-LD, NTriples, …
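
Of these syntaxes, N-Triples is the simplest: one triple per line, URIs in angle brackets, and a closing period. The sketch below writes triples in that shape. It is a simplification: it escapes only backslashes and double quotes, and the function names and the tuple layout for literals are assumptions made for this example.

```python
def nt_term(term):
    """Format one term.

    A URI is a plain string; a literal is a (value, datatype, lang) tuple.
    """
    if isinstance(term, str):
        return f"<{term}>"
    value, datatype, lang = term
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if lang:
        return f'"{escaped}"@{lang}'
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(
        f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ." for s, p, o in triples
    )


XSD_GYEAR = "http://www.w3.org/2001/XMLSchema#gYear"
print(to_ntriples([
    ("https://mcgill.ca/", "http://schema.org/foundingDate",
     ("1821", XSD_GYEAR, None)),
    ("https://ville.montreal.qc.ca/", "http://schema.org/name",
     ("Montréal", None, "fr")),
]))
```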

Published linked data example

Knowledge graphs

  • But crawling the web is a lot of work!
  • A knowledge graph is a set of queryable linked data, organized via an ontology, that represents entities and their relationships
  • Google Knowledge Graph
  • DBpedia: linked data extracted from Wikipedia
  • Wikidata: linked data supporting Wikimedia Foundation projects such as Wikipedia

Querying linked data: SPARQL

  • Just different enough from SQL to be confusing 😕
    • SELECT variables rather than columns
    • Usually no FROM clause: by default, you query the entire dataset of triples!
    • WHERE clause creates a pattern for the triples you want
    • No explicit JOINs: reusing a variable across triple patterns links entities together
    • FILTER attribute values to narrow further
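
For readers who think in SQL, the toy matcher below may help show why no JOIN keyword is needed: a variable that appears in two patterns must take the same value in both, which is the join. This is only an in-memory sketch and not how a real SPARQL engine works; the `match` and `query` functions and the sample triples are hypothetical.

```python
def match(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match this triple.
            return None
    return result


def query(patterns, triples):
    """Return every set of bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            for triple in triples:
                extended = match(pattern, triple, solution)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical data with plain-language names instead of URIs.
triples = [
    ("mcgill", "instance of", "university"),
    ("mcgill", "country", "Canada"),
    ("laurentian", "instance of", "university"),
    ("laurentian", "country", "Canada"),
    ("montreal", "instance of", "city"),
    ("montreal", "country", "Canada"),
]

patterns = [
    ("?s", "instance of", "university"),
    ("?s", "country", "?country"),  # the shared ?s acts as the join
]

for row in query(patterns, triples):
    print(row)
```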

Hands-on with Wikidata: randomness

  1. Open query.wikidata.org in a browser.
  2. On the right hand side, type the following query:
    SELECT * WHERE {
      ?s ?p ?o
    }
    LIMIT 10
  3. Press CTRL + ENTER, or click the arrow button on the left, to submit the query.
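
The same query can also be sent from a script rather than the browser. The sketch below assumes the public endpoint at https://query.wikidata.org/sparql and its `query` and `format` URL parameters; the User-Agent string and the function names are placeholders made up for this example, so substitute a description of your own tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(sparql):
    """Build (but do not send) a GET request for a SPARQL query."""
    url = ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
    headers = {
        # Placeholder identifier: describe your own tool here.
        "User-Agent": "linked-data-workshop-example/0.1",
        "Accept": "application/sparql-results+json",
    }
    return Request(url, headers=headers)


def run_query(sparql):
    """Send the query and return the list of result rows (needs network)."""
    with urlopen(build_request(sparql)) as response:
        return json.load(response)["results"]["bindings"]


request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
print(request.full_url)
```

Calling `run_query` with the same string should return ten rows, each a dictionary keyed by variable name (`s`, `p`, `o`).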

Wikidata SPARQL: McGill

  1. Let's get something less random. Change ?o to "McGill University":
    SELECT * WHERE {
      ?s ?p "McGill University"
    }
    LIMIT 10
  2. Two of the subjects should be Q201492. Click on one of them to see the structured data about McGill University.

Wikidata SPARQL: Universities

Let's retrieve all instances of universities in Wikidata:
  1. Change your predicate to wdt:P31 (instance of).
  2. Change your object to wd:Q3918 (university).
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918
    }
  • wdt: = "truthy" direct property (the best-ranked value of each statement)
  • wd: = entity

Wikidata SPARQL: countries

Let's get the countries for each university in Wikidata:
  1. Append the "AND" operator (a period .) to your first statement.
  2. Add a statement that asks for the country (predicate = wdt:P17) for the subject and store the value in a new variable, ?country.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country
    }

Wikidata SPARQL: labels!

Okay, we need human-friendly labels.
  1. Ask for the rdfs:label of the subject.
  2. Ask for the rdfs:label of the country.
  3. Add a LIMIT 100 clause to keep things fast.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
    }
    LIMIT 100

Don't forget the "AND" operator (period .) for your statements!

Wikidata SPARQL: English labels!

  1. Add FILTER statements that restrict the language of each label to "en" (FILTER(LANG(?countryLabel) = "en"))
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    LIMIT 100
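
To see what the language FILTER does, it can help to look at the shape of the results. In the standard SPARQL JSON results format, each row is a dictionary of variables and each literal may carry an `xml:lang` tag. The sketch below applies the same language test after the fact to a hand-written sample; the sample rows and the `filter_lang` function are illustrative, not real query output.

```python
# Hand-written sample in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["s", "sLabel"]},
    "results": {
        "bindings": [
            {
                "s": {"type": "uri",
                      "value": "http://www.wikidata.org/entity/Q201492"},
                "sLabel": {"type": "literal", "xml:lang": "en",
                           "value": "McGill University"},
            },
            {
                "s": {"type": "uri",
                      "value": "http://www.wikidata.org/entity/Q201492"},
                "sLabel": {"type": "literal", "xml:lang": "fr",
                           "value": "Université McGill"},
            },
        ]
    },
}


def filter_lang(bindings, variable, lang):
    """Keep rows whose literal for `variable` has the given language tag."""
    return [
        row for row in bindings
        if row.get(variable, {}).get("xml:lang") == lang
    ]


english = filter_lang(sample["results"]["bindings"], "sLabel", "en")
print([row["sLabel"]["value"] for row in english])
```

Without the filter, every university appears once per label language, which is why the unfiltered query returns so many near-duplicate rows.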

Wikidata SPARQL: English labels!

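  1. Replace the rdfs:label and FILTER statements with Wikidata's fast label service, and name the label variables in the SELECT clause so that the service fills them in:
    SELECT ?s ?sLabel ?country ?countryLabel WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100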

Advanced SPARQL

Wikidata SPARQL: ORDER

Sort the results by country name.
  1. Add an ORDER BY ?countryLabel clause just before the LIMIT clause:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    ORDER BY ?countryLabel
    LIMIT 100

Wikidata SPARQL: on a map

  1. Add a clause requesting wdt:P625 (coordinate location) and store it in a ?coords variable:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
      ?s wdt:P625 ?coords .
    }
    ORDER BY ?countryLabel
    LIMIT 100
  2. Change the display type from the standard table to a map by clicking the eye icon
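
If you want to plot the results outside the query service, note that Wikidata returns coordinates as WKT literals of the form `Point(longitude latitude)`, with longitude first. The parser below is a minimal sketch for simple decimal values only; the `parse_point` name and the sample coordinates are made up for this example.

```python
import re

# Matches "Point(<lon> <lat>)" with plain decimal numbers only.
_POINT = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$",
    re.IGNORECASE,
)


def parse_point(wkt):
    """Return (latitude, longitude) from a WKT point literal."""
    found = _POINT.match(wkt.strip())
    if found is None:
        raise ValueError(f"not a simple WKT point: {wkt!r}")
    longitude = float(found.group(1))
    latitude = float(found.group(2))
    # Most mapping libraries expect latitude first, so swap the order.
    return latitude, longitude


# Illustrative value, roughly in Montreal.
print(parse_point("Point(-73.58 45.5)"))
```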

Wikidata SPARQL: subclasses and OPTIONAL

Let's focus on Canadian universities

# show instances & subclasses of Canadian universities on a map
#defaultView:Map

SELECT * WHERE {
  ?s wdt:P31/wdt:P279* wd:Q3918 . # instances of university or of any of its subclasses
  ?s wdt:P31 ?instanceOf .        # instance of what?
  ?s wdt:P17 wd:Q16 .             # in Canada
  ?s rdfs:label ?sLabel .
  ?instanceOf rdfs:label ?instanceLabel .
  FILTER(LANG(?sLabel) = "en") .  # give us the English label
  FILTER(LANG(?instanceLabel) = "en") .  # give us the English label
  OPTIONAL{?s wdt:P625 ?coords} . # and coordinates, if possible
}
ORDER BY ?sLabel
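
The property path `wdt:P31/wdt:P279*` reads as "one instance-of step, then zero or more subclass-of steps". The sketch below walks the same path over a tiny in-memory graph. The class names, the two dictionaries, and the function names are hypothetical stand-ins for Wikidata's real items.

```python
def superclasses(cls, subclass_of):
    """Return cls plus every class reachable via zero or more subclass steps."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, []))
    return seen


def instances_via_path(instance_of, subclass_of, target):
    """Items whose class is the target or any direct or indirect subclass."""
    return sorted(
        item
        for item, classes in instance_of.items()
        if any(target in superclasses(c, subclass_of) for c in classes)
    )


# Hypothetical data: one "instance of" step per item.
instance_of = {
    "McGill University": ["research university"],
    "Laurentian University": ["university"],
    "Montreal": ["city"],
}
# Hypothetical "subclass of" hierarchy.
subclass_of = {
    "research university": ["university"],
    "university": ["educational institution"],
}

print(instances_via_path(instance_of, subclass_of, "university"))
```

A plain `wdt:P31 wd:Q3918` pattern corresponds to checking only the direct class, which would miss items typed with a more specific subclass.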

Wikidata SPARQL: aggregates

SPARQL can count and group too:
  1. Change your SELECT clause to select ?countryLabel (COUNT (?s) AS ?cnt).
  2. Add a GROUP BY ?countryLabel clause just before the ORDER BY clause, and change the ORDER BY clause to ORDER BY DESC(?cnt):
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    ORDER BY DESC(?cnt)
    LIMIT 100

Wikidata SPARQL: HAVING filter

SPARQL can filter aggregate results
  1. Add a HAVING(COUNT(?s) <= 50) clause just before the ORDER BY clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    HAVING(COUNT(?s) <= 50)
    ORDER BY DESC(?cnt)
    LIMIT 100
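
The GROUP BY, HAVING, ORDER BY, and LIMIT clauses run in that order, which can be mimicked in a few lines of Python. This is only an analogy over made-up rows; the `count_by_country` function, its parameters, and the sample data are hypothetical.

```python
from collections import Counter


def count_by_country(rows, max_count=None, limit=100):
    """Count universities per country, then filter, sort, and truncate."""
    # GROUP BY ?countryLabel with COUNT(?s)
    counts = Counter(country for _university, country in rows)
    # HAVING(COUNT(?s) <= max_count)
    kept = [
        (country, n)
        for country, n in counts.items()
        if max_count is None or n <= max_count
    ]
    # ORDER BY DESC(?cnt)
    kept.sort(key=lambda pair: pair[1], reverse=True)
    # LIMIT
    return kept[:limit]


# Made-up (university, country) rows.
rows = [
    ("A", "Canada"), ("B", "Canada"), ("C", "France"),
    ("D", "Canada"), ("E", "France"), ("F", "Japan"),
]

print(count_by_country(rows))
print(count_by_country(rows, max_count=2))
```

As in SPARQL, the HAVING step sees the aggregated counts, whereas a FILTER inside the WHERE block only sees individual rows.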

Hands-on app time!

If you are familiar with HTML and JavaScript, try building your own app that enhances its content via a SPARQL query against Wikidata!

Get the code and try it out

Progressively enhance

  • We start basic: just a list of pets with their Wikidata concept URIs
  • At each step, I add a chunk of vanilla JavaScript
    • Simple DOM manipulation
    • Iteratively building the SPARQL query
    • Parsing and displaying the results (entity summarization)
  • Get your hands dirty! Get more data, or try a different dataset!
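
The "iteratively building the SPARQL query" step can be sketched independently of the browser. The workshop app itself is written in JavaScript; the Python below only illustrates the idea, and the `build_query` function, the choice of a VALUES clause, and the example items (cat Q146, dog Q144) and property (image, P18) are assumptions rather than the app's actual code.

```python
def build_query(concept_uris, properties=()):
    """Build a SPARQL query for a list of Wikidata concept URIs.

    `properties` is a sequence of (property id, variable name) pairs;
    each pair adds one OPTIONAL pattern to the query.
    Note: the inputs are not validated, so pass only trusted values.
    """
    values = " ".join(f"<{uri}>" for uri in concept_uris)
    lines = [
        "SELECT * WHERE {",
        f"  VALUES ?item {{ {values} }}",
    ]
    for prop, var in properties:
        lines.append(f"  OPTIONAL {{ ?item wdt:{prop} ?{var} }}")
    lines.append("}")
    return "\n".join(lines)


# Illustrative pets: cat (Q146) and dog (Q144).
pets = [
    "http://www.wikidata.org/entity/Q146",
    "http://www.wikidata.org/entity/Q144",
]

# Start basic, then add one property at a time.
print(build_query(pets))
print(build_query(pets, [("P18", "image")]))
```

Each enhancement step then amounts to adding one more pair to `properties` and displaying the extra variable in the results.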

Supporting exploratory search

Helping users learn while searching by tapping into Wikidata:

Further resources