Linked data, Wikidata, and SPARQL

A McGill SIS workshop

Dan Scott <https://dscott.ca/#i>

Associate Librarian, Laurentian University

2024-11-26 Creative Commons License

Web of documents

  • A web address (like https://laurentian.ca/)
  • … returns an HTML document -- a web page
  • … which may link to other web addresses

Web of data

  • Not just documents that have meaning for humans
  • … but also data that has meaning for machines
  • a semantic web

Linked data principles

  1. Use URIs as names for things.
  2. Use HTTP URIs (web addresses) so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

Berners-Lee, T. (2009, June 18). Linked Data - Design Issues. Retrieved March 10, 2019, from https://www.w3.org/DesignIssues/LinkedData.html

Plain language statements

Subject | Predicate | Object
Dan Scott | member of | Laurentian University
Laurentian University | location | Sudbury
Laurentian University | founding date | 1960

Resource Description Framework (RDF)

Use HTTP URIs (web addresses) to identify things:

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | Laurentian University
Laurentian University | location | Sudbury
Laurentian University | founding date | 1960

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | https://laurentian.ca/
https://laurentian.ca/ | location | Sudbury
https://laurentian.ca/ | founding date | 1960

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | member of | https://laurentian.ca/
https://laurentian.ca/ | location | https://greatersudbury.ca/
https://laurentian.ca/ | founding date | 1960

RDF: using HTTP URIs

Subject | Predicate | Object-or-value
https://dscott.ca/#i | http://schema.org/memberOf | https://laurentian.ca/
https://laurentian.ca/ | http://schema.org/location | https://greatersudbury.ca/
https://laurentian.ca/ | http://schema.org/foundingDate | 1960

RDF: using expressive literals

Subject | Predicate | Object-or-value
https://dscott.ca/#i | http://schema.org/memberOf | https://laurentian.ca/
https://laurentian.ca/ | http://schema.org/location | https://greatersudbury.ca/
https://laurentian.ca/ | http://schema.org/foundingDate | "1960"^^xsd:gYear
https://greatersudbury.ca/ | http://schema.org/name | "Greater Sudbury"@en
https://greatersudbury.ca/ | http://schema.org/name | "Grand Sudbury"@fr

These three-part statements are called triples.
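
To make the idea of a triple concrete, here is a minimal Python sketch that stores a few of the statements above as (subject, predicate, object) tuples and prints them in N-Triples syntax. The Literal class and the function names are hypothetical helpers invented for this illustration, and the escaping is deliberately simplified; a real project would more likely use an RDF library.

```python
from dataclasses import dataclass
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """Hypothetical helper: a value with an optional language tag or datatype."""
    value: str
    lang: Optional[str] = None      # e.g. "en"
    datatype: Optional[str] = None  # full datatype URI


def term_to_nt(term) -> str:
    """Render one term: URIs in angle brackets, literals in double quotes."""
    if isinstance(term, Literal):
        # Simplified escaping: only backslashes and double quotes are handled.
        escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
        if term.lang:
            return f'"{escaped}"@{term.lang}'
        if term.datatype:
            return f'"{escaped}"^^<{term.datatype}>'
        return f'"{escaped}"'
    return f"<{term}>"


def to_ntriples(triples) -> str:
    """One line per triple, each terminated by ' .'"""
    return "\n".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} ." for s, p, o in triples
    )


triples = [
    ("https://dscott.ca/#i", "http://schema.org/memberOf", "https://laurentian.ca/"),
    ("https://laurentian.ca/", "http://schema.org/foundingDate",
     Literal("1960", datatype=XSD + "gYear")),
    ("https://greatersudbury.ca/", "http://schema.org/name",
     Literal("Grand Sudbury", lang="fr")),
]

print(to_ntriples(triples))
```

The same three positions (subject, predicate, object) appear on every output line, which is all a triple is.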

Vocabularies and ontologies

Vocabularies: (mostly) naming things

  • Classes (types of things)
  • Properties (relationships between things, or predicates)
  • Often expressed in RDFS:
    • Domain and range restrictions
    • Human-readable descriptions

Ontology: "the study of being"

  • Describes a worldview for a given domain
  • Often expressed in OWL (more complex than RDFS):
    • Cardinalities
    • Class disjointness
    • Class intersections and unions
  • Can link vocabularies (owl:equivalentClass, owl:equivalentProperty) and things (owl:sameAs)
  • Enables reasoning over / deriving inferences from your data

Prefixes are shortcuts

  • You'll often see links like rdfs:label or dc:creator
  • "But Dan," you say, "those aren't links!"
  • The prefix rdfs: is just a shorter way of saying http://www.w3.org/2000/01/rdf-schema#
  • rdfs:label really means http://www.w3.org/2000/01/rdf-schema#label
  • dc:creator expands to http://purl.org/dc/terms/creator
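
Prefix expansion is plain string concatenation. The sketch below assumes a small hand-written prefix table that mirrors the two prefixes on this slide; the expand function is a hypothetical name for illustration, not part of any library.

```python
# Prefix table matching the slide; real documents declare their own prefixes.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/terms/",
}


def expand(prefixed_name: str, prefixes=PREFIXES) -> str:
    """Turn 'prefix:local' into a full URI by looking up the prefix."""
    prefix, separator, local = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"Unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


print(expand("rdfs:label"))
print(expand("dc:creator"))
```

Because a prefix is only a local shortcut, two documents can bind the same prefix to different URIs; the expanded URI is what actually identifies the property.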

Publishing linked data

Syntaxes

  • Embedded within HTML: JSON-LD, RDFa, microdata
  • Parallel data (content negotiation): Turtle, RDF/XML, JSON-LD, NTriples, …
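
As a rough illustration of the embedded option, this Python sketch builds a JSON-LD description of Laurentian University from the earlier example triples and wraps it in the script element used to embed JSON-LD in HTML. The variable and function names are made up for this example, and the "@vocab" context is one assumption among several valid ways to map short keys to schema.org URIs.

```python
import json

# "@vocab" makes short keys such as "location" expand to http://schema.org/location.
laurentian = {
    "@context": {"@vocab": "http://schema.org/"},
    "@id": "https://laurentian.ca/",
    "location": {"@id": "https://greatersudbury.ca/"},
    "foundingDate": {
        "@value": "1960",
        "@type": "http://www.w3.org/2001/XMLSchema#gYear",
    },
}


def as_script_tag(doc: dict) -> str:
    """Serialize the document and wrap it for embedding in an HTML page."""
    body = json.dumps(doc, indent=2, ensure_ascii=False)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(as_script_tag(laurentian))
```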

Published linked data example

Knowledge graphs

  • But crawling the web is a lot of work!
  • A knowledge graph is a set of queryable linked data, organized via an ontology, that represents entities and their relationships
  • Google Knowledge Graph
  • DBpedia: linked data extracted from Wikipedia
  • Wikidata: linked data supporting all Wikimedia Foundation products

Querying linked data: SPARQL

  • Just different enough from SQL to be confusing 😕
    • SELECT variables rather than columns
    • No FROM clause: just the entire dataset of triples!
    • WHERE clause creates a pattern for the triples you want
    • No explicit JOINs: reusing a variable across patterns expresses the relationships between entities
    • FILTER attribute values to narrow further
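
To see how a WHERE pattern behaves without any SPARQL engine, here is a toy Python matcher. It is a sketch of the general idea only (real engines are far more sophisticated): terms starting with "?" are variables, and reusing a variable in two patterns acts as the join. The data and function names are invented for illustration.

```python
# Toy dataset of plain-language triples.
DATA = [
    ("Laurentian University", "instance of", "university"),
    ("Laurentian University", "country", "Canada"),
    ("McGill University", "instance of", "university"),
    ("McGill University", "country", "Canada"),
    ("Sudbury", "instance of", "city"),
]


def is_var(term) -> bool:
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def query(patterns, triples):
    """Apply each pattern in turn, keeping only the bindings that still match."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return solutions


results = query(
    [("?s", "instance of", "university"), ("?s", "country", "?country")],
    DATA,
)
for row in results:
    print(row)
```

Sudbury is dropped by the first pattern, and the shared ?s variable ties each university to its own country.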

Hands-on with Wikidata: randomness

  1. Open the Wikidata Query Service (WDQS) in a browser.
  2. On the right hand side, type the following query:
    SELECT * WHERE {
      ?s ?p ?o
    }
    LIMIT 10
  3. Press CTRL + ENTER, or click the arrow button on the left, to submit the query.
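
The same query can also be sent from a script rather than the browser. The sketch below uses only the Python standard library and assumes the public endpoint at https://query.wikidata.org/sparql; the function names and the User-Agent string are placeholders to replace with your own. The network call is left commented out so the example runs offline.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"


def build_url(query: str) -> str:
    """Encode the SPARQL query as a GET request URL asking for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def run_query(query: str, user_agent: str = "sparql-workshop-example/0.1") -> dict:
    """Send the query and parse the JSON response (requires network access)."""
    request = urllib.request.Request(
        build_url(query),
        headers={
            "User-Agent": user_agent,  # placeholder: identify your own script
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


print(build_url(QUERY))
# results = run_query(QUERY)   # uncomment to actually contact the service
```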

Wikidata SPARQL: Laurentian

  1. Let's get something less random. Change your object from ?o to "Laurentian University":
    SELECT * WHERE {
      ?s ?p "Laurentian University"
    }
    LIMIT 10
  2. One of the subjects should be Q3551432. Click it to see the structured data about Laurentian University.

Wikidata SPARQL: Universities

Let's retrieve all instances of universities in Wikidata:
  1. Change your predicate from ?p to wdt:P31 (instance of).
  2. Change your object from "Laurentian University" to wd:Q3918 (university).
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 # instance of university
    }
  • wdt: = "truthy" direct property (the best-ranked value for each statement)
  • wd: = entity

Wikidata SPARQL: countries

Let's get the countries for each university in Wikidata:
  1. Append the "AND" operator (a period .) to your first statement.
  2. Add a statement that asks for the country (predicate = wdt:P17) for the subject and store the value in a new variable, ?country.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country
    }

Wikidata SPARQL: labels!

Okay, we need human-friendly labels.
  1. Ask for the rdfs:label of the subject.
  2. Ask for the rdfs:label of the country.
  3. Add a LIMIT 100 clause to keep things fast.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
    }
    LIMIT 100

Don't forget the "AND" operator (period .) for your statements!

Wikidata SPARQL: English labels!

  1. Add FILTER statements that restrict the language of each label to "en" (FILTER(LANG(?countryLabel) = "en"))
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    LIMIT 100
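
Why the language filters matter: every entity has one label per language, and each combination of a subject label and a country label produces its own result row. A small Python sketch with two labels on each side shows the effect; the label lists are illustrative data, not query output.

```python
from itertools import product

# (label, language tag) pairs for one university and its country.
s_labels = [("Laurentian University", "en"), ("Université Laurentienne", "fr")]
country_labels = [("Canada", "en"), ("Canada", "fr")]

# Without filters, every label combination becomes a row.
unfiltered = list(product(s_labels, country_labels))

# Equivalent of FILTER(LANG(?sLabel) = "en") and FILTER(LANG(?countryLabel) = "en").
filtered = [
    (s, c) for s, c in unfiltered
    if s[1] == "en" and c[1] == "en"
]

print(len(unfiltered), "rows without filters")
print(len(filtered), "row with both filters")
```

With hundreds of label languages per entity in Wikidata, the unfiltered row count grows very quickly, which is why the LIMIT clause alone is not enough.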

Wikidata SPARQL: English labels!

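  1. Replace the FILTER statements with Wikidata's fast label service, and name the label variables explicitly in the SELECT clause (the service only fills in ?sLabel and ?countryLabel when they are listed; SELECT * does not trigger it):
    SELECT ?s ?sLabel ?country ?countryLabel WHERE {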
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      SERVICE wikibase:label { 
        bd:serviceParam wikibase:language "en".
      }
    }
    LIMIT 100

SPARQL semi-colon operator

  1. To avoid repeating the subject (?s) in successive statements, use a semi-colon ; instead of the period (.):
    SELECT ?s ?sLabel ?country ?countryLabel WHERE {
      ?s wdt:P31 wd:Q3918;
         wdt:P17 ?country .
      SERVICE wikibase:label { 
        bd:serviceParam wikibase:language "en".
      }
    }
    LIMIT 100

UNION: the OR operator

  1. Until now, you've been strictly using AND expressions. But the UNION operator allows you to create OR expressions. For example, to list universities in Ontario or Québec:
    SELECT ?s ?sLabel WHERE {
      { ?s wdt:P31 wd:Q108403040 }  # university in Ontario
      UNION
      { ?s wdt:P31 wd:Q3551519 } .  # university in Quebec
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en".
      }
    }
    LIMIT 100

Advanced SPARQL

SPARQL: ORDER

Sort the results by country name.
  1. Add an ORDER BY ?countryLabel clause just before the LIMIT clause:
    SELECT ?s ?sLabel ?country ?countryLabel WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en".
      }
    }
    ORDER BY ?countryLabel
    LIMIT 100

Wikidata SPARQL: on a map

  1. Add a clause requesting P625 (coordinates) and store it in a ?coords variable:
    SELECT ?s ?sLabel ?country ?countryLabel ?coords WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s wdt:P625 ?coords . # give us coordinates
      SERVICE wikibase:label {
        bd:serviceParam wikibase:language "en".
      }
    }
    ORDER BY ?countryLabel
    LIMIT 100
  2. Change the display type from the standard table to a map by clicking the eye icon

Wikidata SPARQL: subclasses and OPTIONAL

Let's focus on Canadian universities

# show instances & subclasses of Canadian universities on a map
#defaultView:Map

SELECT * WHERE {
  ?s wdt:P31/wdt:P279* wd:Q3918 . # instance of university, or of any subclass of university
  ?s wdt:P31 ?instanceOf .        # instance of what?
  ?s wdt:P17 wd:Q16 .             # in Canada
  ?s rdfs:label ?sLabel .
  ?instanceOf rdfs:label ?instanceLabel .
  FILTER(LANG(?sLabel) = "en") .  # give us the English label
  FILTER(LANG(?instanceLabel) = "en") .  # give us the English label
  OPTIONAL{?s wdt:P625 ?coords} . # and coordinates, if possible
}
ORDER BY ?sLabel
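
The path wdt:P31/wdt:P279* reads as "one instance-of hop, followed by zero or more subclass-of hops". The Python sketch below mimics that traversal over a tiny made-up class hierarchy; the entity and class names are hypothetical and do not come from Wikidata.

```python
# Hypothetical toy data: class -> its direct superclasses (subclass of, P279).
SUBCLASS_OF = {
    "public university": {"university"},
    "research university": {"university"},
    "public research university": {"public university", "research university"},
    "university": {"educational institution"},
}

# Hypothetical toy data: entity -> its direct classes (instance of, P31).
INSTANCE_OF = {
    "University A": {"public university"},
    "University B": {"university"},
    "University D": {"public research university"},
    "City C": {"city"},
}


def superclasses(cls, subclass_of):
    """cls itself plus everything reachable via zero or more subclass-of hops."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def instances_of(target, instance_of, subclass_of):
    """Entities whose direct class is target or any subclass of target."""
    return sorted(
        entity
        for entity, classes in instance_of.items()
        if any(target in superclasses(c, subclass_of) for c in classes)
    )


print(instances_of("university", INSTANCE_OF, SUBCLASS_OF))
```

University B matches with zero subclass hops, University A with one, and University D with two, which is exactly what the * in the path allows.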

SPARQL: aggregates

SPARQL can count and group too:
  1. Change your SELECT clause to select ?countryLabel (COUNT (?s) AS ?cnt).
  2. Add a GROUP BY ?countryLabel clause, followed by an ORDER BY DESC(?cnt) clause, just before the LIMIT clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    ORDER BY DESC(?cnt)
    LIMIT 100
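
For comparison, the grouping, counting, and descending sort in this query correspond to a few lines of Python over rows already fetched. The rows below are placeholder data, and count_by_country is a hypothetical helper name.

```python
from collections import Counter

# Placeholder (university, countryLabel) rows standing in for query results.
rows = [
    ("University A", "Canada"),
    ("University B", "Canada"),
    ("University C", "France"),
    ("University D", "Canada"),
    ("University E", "France"),
    ("University F", "Japan"),
]


def count_by_country(rows, limit=100):
    """GROUP BY country, COUNT the rows, ORDER BY count descending, LIMIT."""
    counts = Counter(country for _university, country in rows)
    return counts.most_common(limit)


for country, cnt in count_by_country(rows):
    print(country, cnt)
```

Doing the aggregation in SPARQL instead means only the summary rows travel over the network.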

SPARQL: aggregate functions

  1. The aggregate functions defined in SPARQL 1.1 are COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT, and SAMPLE
  2. GROUP_CONCAT takes an expression and, optionally, the separator string to place between the values, e.g. GROUP_CONCAT(?instrument; SEPARATOR="; ")
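
What GROUP_CONCAT produces can be sketched in Python as a join over each group's values. The musician rows are invented placeholder data. One difference to keep in mind: this sketch preserves input order, whereas SPARQL does not promise any particular order for the concatenated values.

```python
from collections import defaultdict

# Placeholder (musician, instrument) rows standing in for query results.
rows = [
    ("Musician A", "guitar"),
    ("Musician A", "piano"),
    ("Musician B", "drums"),
]

# Group the instrument values by musician.
grouped = defaultdict(list)
for musician, instrument in rows:
    grouped[musician].append(instrument)

# Concatenate each group's values with "; " as the separator.
concatenated = {musician: "; ".join(values) for musician, values in grouped.items()}

for musician, instruments in concatenated.items():
    print(musician, "->", instruments)
```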

Wikidata SPARQL: HAVING filter

SPARQL can filter aggregate results
  1. Add a HAVING(COUNT(?s) <= 50) clause just before the ORDER BY clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    HAVING(COUNT(?s) <= 50)
    ORDER BY DESC(?cnt)
    LIMIT 100
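
FILTER tests individual rows before they are grouped, while HAVING tests each group after it has been counted. A minimal Python sketch of the HAVING step, using placeholder rows and a hypothetical helper name:

```python
from collections import Counter

# Placeholder (university, countryLabel) rows standing in for query results.
rows = [
    ("University A", "Canada"),
    ("University B", "Canada"),
    ("University C", "France"),
    ("University D", "Canada"),
    ("University E", "France"),
    ("University F", "Japan"),
]


def countries_with_at_most(rows, maximum):
    """Count per country, keep groups with count <= maximum, sort descending."""
    counts = Counter(country for _university, country in rows)
    kept = {country: n for country, n in counts.items() if n <= maximum}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


print(countries_with_at_most(rows, 2))
```

The condition can only be checked once the whole group has been counted, which is why it cannot be written as an ordinary FILTER inside the WHERE clause.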

SPARQL: CONSTRUCT graphs

Hands-on app time!

Wikidata-powered infocards

  • Years ago, I taught our Evergreen library catalogue to display Wikidata infocards about musicians and bands
  • Sadly, we migrated away from Evergreen in 2020
  • But the generic code is still available
  • The following lab is a simplified version of that code

Get the code and try it out

Progressively enhance

  • We start basic: just a list of pets with their Wikidata concept URIs
  • At each stage, I add a chunk of vanilla JavaScript
    • Simple DOM manipulation
    • Iteratively building the SPARQL query
    • Parsing and displaying the results (entity summarization)
  • Get your hands dirty! Get more data, or try a different dataset!
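
The parsing step depends on the standard SPARQL JSON results layout: a head.vars list of variable names and a results.bindings list with one object per row. The lab itself uses JavaScript; the sketch below shows the same idea in Python on a hand-written sample response, so the sample data and the simplify function are illustrative assumptions, not the lab's code.

```python
import json

# Hand-written sample in the SPARQL JSON results format; ?coords is unbound.
SAMPLE = """
{
  "head": {"vars": ["s", "sLabel", "coords"]},
  "results": {"bindings": [
    {"s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q3551432"},
     "sLabel": {"type": "literal", "xml:lang": "en", "value": "Laurentian University"}}
  ]}
}
"""


def simplify(results: dict) -> list:
    """Flatten each binding to {variable: value}, skipping unbound variables."""
    variables = results["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in results["results"]["bindings"]
    ]


rows = simplify(json.loads(SAMPLE))
for row in rows:
    print(f'{row["sLabel"]} ({row["s"]})')
```

Unbound variables, such as an OPTIONAL value that was not found, are simply absent from a binding, so the code checks for each key instead of assuming it exists.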