Linked data for relational minds

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

Associate Librarian, Laurentian University

2019-09-07 Creative Commons License

I love relational databases

But there is another way…

Image credit: Paul Clarke on Wikimedia Commons Creative Commons License

Web of documents

  • A web address (like https://laurentian.ca/)
  • … returns an HTML document -- a web page
  • … which may link to other web addresses

Web of data

  • Not just documents that have meaning for humans
  • … but also data that has meaning for machines
  • a semantic web

Modelling today's gathering

People

Type or class    Instance
Organizers       Em, David
Attendees        You*
Speakers         Kelsey, Dan

Entities and relationships: graphs

  • A set of nodes, vertices, or points connected by edges, arcs, or lines

Linked data principles

  1. Use URIs as names for things.
  2. Use HTTP URIs (web addresses) so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

Berners-Lee, T. (2009, June 18). Linked Data - Design Issues. Retrieved March 10, 2019, from https://www.w3.org/DesignIssues/LinkedData.html

Plain language statements

Subject                  Predicate       Object
Dan Scott                member of       Laurentian University
Laurentian University    location        Sudbury
Laurentian University    founding date   1960

Resource Description Framework (RDF)

Use HTTP URIs (web addresses) to identify things:

Subject                   Predicate                        Object
https://dscott.ca/#i      http://schema.org/memberOf       https://laurentian.ca/
https://laurentian.ca/    http://schema.org/location       https://greatersudbury.ca/
https://laurentian.ca/    http://schema.org/foundingDate   "1960"^^xsd:gYear

These three-part statements are called triples.
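
A minimal sketch of how the triples above might be held in memory, assuming plain Python tuples rather than a real RDF library. The `Triple` class and the `about` helper are names invented for this illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One three-part statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


SCHEMA = "http://schema.org/"

# The three statements from the table above.
triples = [
    Triple("https://dscott.ca/#i", SCHEMA + "memberOf", "https://laurentian.ca/"),
    Triple("https://laurentian.ca/", SCHEMA + "location", "https://greatersudbury.ca/"),
    Triple("https://laurentian.ca/", SCHEMA + "foundingDate", '"1960"^^xsd:gYear'),
]


def about(subject, graph):
    """Return every triple whose subject is the given URI."""
    return [t for t in graph if t.subject == subject]


for t in about("https://laurentian.ca/", triples):
    print(t.predicate, "->", t.obj)
```

Because every statement has the same shape, adding a new fact is just appending another tuple; no schema change is needed.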

Knowledge Graph Search API

  • Enable the API in the Google API Console
  • Daily quota of 100,000 requests, 17,000 "Discovery" requests
  • Undocumented result limit of 200; defaults to 20
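
A rough sketch of building a search request URL for the API, assuming the `entities:search` endpoint with its `query`, `key`, and `limit` parameters. The helper name `build_kg_search_url` and the placeholder key are made up for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_search_url(query, api_key, limit=20):
    """Build a search URL; 20 mirrors the default result limit noted above."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; substitute the key from the API Console.
url = build_kg_search_url("John Kasich", "YOUR_API_KEY", limit=5)
print(url)
```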

Knowledge Graph Discovery Widget

<head>
  <link type="text/css" rel="stylesheet"
    href="https://www.gstatic.com/knowledge/kgsearch/widget/1.0/widget.min.css">
  <style>.kge-search-picker { width: 25em; }</style>
  <script type="text/javascript"
    src="https://www.gstatic.com/knowledge/kgsearch/widget/1.0/widget.min.js"></script>
</head>
<body>
  <form id='myform'>
    <label>Search: <input type="text" id="myinput"></label>
  </form>
  <script>
    KGSearchWidget(API_KEY, document.getElementById('myinput'), {});
  </script>
</body>

Awesome live demo!

KG Search API: Requests

KG Search API: Results

{ "@context": {
    "@vocab": "http://schema.org/",
    "goog": "http://schema.googleapis.com/",
    "EntitySearchResult": "goog:EntitySearchResult",
    "detailedDescription": "goog:detailedDescription",
    "resultScore": "goog:resultScore",
    "kg": "http://g.co/kg"
  },
  "@type": "ItemList", "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@id": "kg:/m/02zzm_",
        "name": "John Kasich",
        "@type": [ "Person", "Thing" ],
        "description": "Governor of Ohio",
        "image": {
          "contentUrl": "http://t1.gstatic.com/images?q=tbn:ANd9GcRoou4pZKD6FoNaE71ngNlv4RGgUS46mgtin5YJtyEoh42CIs4x",
          "url": "https://en.wikipedia.org/wiki/John_Kasich"
        },
        "detailedDescription": {
          "articleBody": "John Richard Kasich is an American politician, the 69th and current Governor of Ohio. First elected in 2010 and re-elected in 2014, Kasich is a member of the Republican Party. His term is set to end by January 2019.\n",
          "url": "https://en.wikipedia.org/wiki/John_Kasich",
          "license": "https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License"
        },
        "url": "https://johnkasich.com"
      },
      "resultScore": 22.28808
    }
  ]
}
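
A small sketch of pulling the useful fields out of a response like the one above, using only the standard `json` module. The response text here is a trimmed copy of the example, and `summarize` is a hypothetical helper name.

```python
import json

# Trimmed copy of the example response above.
response_text = """
{
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@id": "kg:/m/02zzm_",
        "name": "John Kasich",
        "@type": ["Person", "Thing"],
        "description": "Governor of Ohio"
      },
      "resultScore": 22.28808
    }
  ]
}
"""


def summarize(response):
    """Return (id, name, score) for each entity in the result list."""
    rows = []
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        rows.append((result.get("@id"), result.get("name"), element.get("resultScore")))
    return rows


rows = summarize(json.loads(response_text))
print(rows)
```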
                        

Knowledge Graph IDs

"kg": "http://g.co/kg"
...
"@id": "kg:/m/02zzm_"

So: http://g.co/kg/m/02zzm_

URIs don't have to resolve, but it's nice when they do!
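
The prefix expansion above can be sketched in a few lines. This is a simplified illustration, not a full JSON-LD processor, and `expand` is a name chosen for this example.

```python
def expand(compact_iri, context):
    """Replace a known prefix (the part before the first colon) with its URI."""
    prefix, sep, suffix = compact_iri.partition(":")
    # Leave full URIs such as "https://..." alone: their suffix starts with "//".
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return compact_iri


context = {"kg": "http://g.co/kg"}
print(expand("kg:/m/02zzm_", context))
```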

Knowledge Graph API problems

  • Limited constraints = poor precision
    • Can't limit to name of the entity
  • Type mapping to schema.org loses more precision
    • A Fictional character is a Thing, not Person
  • Simplistic entity results: no facts, no relationships
    • When was John Kasich born? - no birthdate returned

Knowledge Graph: entity relations

Note: The Knowledge Graph Search API returns only individual matching entities, rather than graphs of interconnected entities. If you need the latter, we recommend using data dumps from Wikidata instead.
https://developers.google.com/knowledge-graph/

Vocabularies and ontologies

Vocabularies: (mostly) naming things

  • Classes (types of things)
  • Properties (relationships between things, or predicates)
  • Often expressed in RDFS:
    • Domain and range restrictions
    • Human-readable descriptions

Ontology: "the study of being"

  • Describes a worldview for a given domain
  • Often expressed in OWL (more complex than RDFS):
    • Cardinalities
    • Class disjointness
    • Class intersections and unions
  • Can link vocabularies (owl:equivalentClass, owl:equivalentProperty) and things (owl:sameAs)
  • Enables reasoning over / deriving inferences from your data
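
To make "deriving inferences" concrete, here is a toy sketch of one rule: if a property has an rdfs:domain, then any subject using that property can be inferred to belong to that class. The `ex:` names are a hypothetical vocabulary invented for this example, and a real reasoner handles many more rules.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_types_from_domain(triples):
    """Apply the RDFS domain rule once and return only the new triples."""
    # Collect (property, class) pairs declared with rdfs:domain.
    domains = [(s, o) for s, p, o in triples if p == RDFS_DOMAIN]
    inferred = {
        (s, RDF_TYPE, cls)
        for s, p, _ in triples
        for prop, cls in domains
        if p == prop
    }
    return inferred - set(triples)


graph = {
    ("ex:memberOf", RDFS_DOMAIN, "ex:Person"),    # vocabulary statement (hypothetical)
    ("ex:dan", "ex:memberOf", "ex:laurentian"),   # data statement (hypothetical)
}

new = infer_types_from_domain(graph)
print(new)
```

Nobody stated that `ex:dan` is a person; the type follows from the vocabulary plus the data.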

Notable vocabularies/ontologies

  • FOAF - linking people
  • schema.org - general vocabulary, led by search engines
  • SKOS - thesauri, taxonomies, classification schemes
  • BIBFRAME - bibliographic description; replacement for MARC 21

Publishing linked data

Syntaxes

  • Inline syntax: JSON-LD, RDFa, microdata
  • Parallel data (content negotiation): Turtle, RDF/XML, JSON-LD, NTriples, …
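
As a sketch of the inline approach, the Laurentian statements from earlier can be written as JSON-LD and wrapped in a script element for an HTML page. The `as_script_tag` helper is hypothetical, and the `@vocab` context follows the one shown in the Knowledge Graph results.

```python
import json

# The Laurentian University statements from the RDF slide, as JSON-LD.
laurentian = {
    "@context": {"@vocab": "http://schema.org/"},
    "@id": "https://laurentian.ca/",
    "location": {"@id": "https://greatersudbury.ca/"},
    "foundingDate": "1960",
}


def as_script_tag(data):
    """Serialize JSON-LD and wrap it for embedding in an HTML template."""
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(data, indent=2)
        + "\n</script>"
    )


print(as_script_tag(laurentian))
```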

Common approaches

  • Inline: modify HTML templates
  • Content negotiation:
    • Different templates
    • R2RML - mapping database tables to RDF
  • SPARQL endpoint (for queries)
  • Data dumps

Querying linked data: SPARQL

  • Just different enough from SQL to be confusing 😕
    • SELECT variables rather than columns
    • Usually no FROM clause: by default, you query the entire dataset of triples!
    • WHERE clause creates a pattern for the triples you want
    • No explicit JOINs: triple patterns that share a variable connect entities
    • FILTER attribute values to narrow further
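
To show how a WHERE pattern works without JOINs, here is a toy matcher over the plain language statements from earlier. It is a deliberately naive sketch, not how a real SPARQL engine is implemented; `match` and `query` are names invented for this example, and it needs Python 3.8 or later.

```python
def match(pattern, triple, bindings):
    """Extend bindings so the pattern equals the triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable binds on first use and must agree afterwards.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Return every set of bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


triples = [
    ("Dan Scott", "member of", "Laurentian University"),
    ("Laurentian University", "location", "Sudbury"),
    ("Laurentian University", "founding date", "1960"),
]

# The shared ?org variable does the work a JOIN would do in SQL.
results = query(
    [("?person", "member of", "?org"), ("?org", "location", "?place")],
    triples,
)
print(results)
```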

Hands-on with Wikidata: randomness

  1. Open query.wikidata.org in a browser.
  2. On the right hand side, type the following query:
    SELECT * WHERE {
      ?s ?p ?o
    }
    LIMIT 10
  3. Press CTRL + ENTER, or click the arrow button on the left, to submit the query.
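
The same query can also be prepared from a script instead of the browser. This sketch assumes the public endpoint at https://query.wikidata.org/sparql with `query` and `format` parameters; the User-Agent string and the `build_request` helper are hypothetical, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT * WHERE {
  ?s ?p ?o
}
LIMIT 10
"""


def build_request(sparql):
    """Build a GET request for the query, asking for JSON results."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})
    return Request(url, headers={
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; describe your own script here.
        "User-Agent": "linked-data-workshop-example/0.1",
    })


request = build_request(QUERY)
print(request.full_url)
```

To run it for real, pass `request` to `urllib.request.urlopen` and parse the response body with `json.load`.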

Wikidata SPARQL: Laurentian

  1. Let's get something less random. Change ?o to "Laurentian University":
    SELECT * WHERE {
      ?s ?p "Laurentian University"
    }
    LIMIT 10
  2. One of the subjects should be Q3551432. Click on it to see the structured data about Laurentian University.

Wikidata SPARQL: Universities

Let's retrieve all instances of universities in Wikidata:
  1. Change your predicate to wdt:P31 (instance of).
  2. Change your object to wd:Q3918 (university).
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918
    }
  • wdt: = "truthy" property (the direct, best-ranked value of a statement)
  • wd: = entity

Wikidata SPARQL: countries

Let's get the countries for each university in Wikidata:
  1. Append the "AND" operator (a period .) to your first statement.
  2. Add a statement that asks for the country (predicate = wdt:P17) for the subject and store the value in a new variable, ?country.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country
    }

Wikidata SPARQL: labels!

Okay, we need human-friendly labels.
  1. Add a statement that asks for the rdfs:label of the subject.
  2. Add a statement that asks for the rdfs:label of the country.
  3. Add a LIMIT 100 clause to keep things fast.
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
    }
    LIMIT 100

Don't forget the "AND" operator (period .) for your statements!

Wikidata SPARQL: English labels!

Okay, we really need labels in a particular language.
  1. Add FILTER statements that restrict the language of each label to "en" (FILTER(LANG(?countryLabel) = "en"))
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    LIMIT 100
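
For comparison, here is what the language filter does, sketched on the client side over results in the SPARQL JSON format, where each literal may carry an `xml:lang` key. The sample bindings are hand-written for illustration rather than real query output, and `english_only` is a made-up helper.

```python
# Hand-written sample in the SPARQL JSON results layout (not real output).
sample = {
    "head": {"vars": ["s", "sLabel"]},
    "results": {"bindings": [
        {
            "s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q3551432"},
            "sLabel": {"type": "literal", "xml:lang": "en",
                       "value": "Laurentian University"},
        },
        {
            "s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q3551432"},
            "sLabel": {"type": "literal", "xml:lang": "fr",
                       "value": "Université Laurentienne"},
        },
    ]},
}


def english_only(results, var, lang="en"):
    """Keep bindings whose label has the wanted language tag."""
    return [
        b for b in results["results"]["bindings"]
        if b[var].get("xml:lang") == lang
    ]


labels = [b["sLabel"]["value"] for b in english_only(sample, "sLabel")]
print(labels)
```

Filtering in the query itself is usually preferable, since the unwanted rows never leave the server.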

Wikidata SPARQL: ORDER

Let's sort the results in order of country name.
  1. Add an ORDER BY ?countryLabel clause just before the LIMIT clause:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
    }
    ORDER BY ?countryLabel
    LIMIT 100

Wikidata SPARQL: on a map

Let's display the results on a map. We'll need coordinates for that.
  1. Add a statement requesting wdt:P625 (coordinate location) and store the value in a ?coords variable:
    SELECT * WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?s rdfs:label ?sLabel .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?sLabel) = "en") .
      FILTER(LANG(?countryLabel) = "en") .
      ?s wdt:P625 ?coords .
    }
    ORDER BY ?countryLabel
    LIMIT 100
  2. Change the display type from the standard table to a map by clicking the eye icon.
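
If you process the results yourself instead of using the built-in map, note that Wikidata returns coordinates as WKT literals of the form Point(longitude latitude). A small parsing sketch follows; the sample value is a made-up placeholder, not the real coordinates of any university, and `parse_point` is a hypothetical helper.

```python
import re

# Matches "Point(<longitude> <latitude>)" with optional decimals and signs.
POINT = re.compile(
    r"^Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$",
    re.IGNORECASE,
)


def parse_point(wkt):
    """Return (latitude, longitude) from a WKT point literal."""
    m = POINT.match(wkt.strip())
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    lon, lat = float(m.group(1)), float(m.group(2))
    return lat, lon


# Placeholder value for illustration only.
print(parse_point("Point(-81.0 46.5)"))
```

WKT puts longitude first, so the function swaps the order to return the more familiar latitude, longitude pair.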

Wikidata SPARQL: aggregates

SPARQL can count and group too:
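  1. Change your SELECT clause to select ?countryLabel (COUNT (?s) AS ?cnt), and remove the ?sLabel and ?coords statements, which are no longer needed.
  2. Add a GROUP BY ?countryLabel clause just before the ORDER BY clause, and change the ORDER BY clause to ORDER BY DESC(?cnt):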
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    ORDER BY DESC(?cnt)
    LIMIT 100

Wikidata SPARQL: HAVING filter

SPARQL can filter aggregate results:
  1. Add a HAVING(COUNT(?s) <= 50) clause just before the ORDER BY clause:
    SELECT ?countryLabel (COUNT (?s) AS ?cnt) WHERE {
      ?s wdt:P31 wd:Q3918 .
      ?s wdt:P17 ?country .
      ?country rdfs:label ?countryLabel .
      FILTER(LANG(?countryLabel) = "en") .
    }
    GROUP BY ?countryLabel
    HAVING(COUNT(?s) <= 50)
    ORDER BY DESC(?cnt)
    LIMIT 100
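
For readers who think more easily in code than in query clauses, the grouping steps map onto a few lines of Python. This is a sketch over invented (university, country) rows, not real Wikidata output, and `count_by_country` is a hypothetical helper.

```python
from collections import Counter

# Invented rows standing in for (?s, ?countryLabel) results.
rows = [
    ("U1", "Canada"), ("U2", "Canada"), ("U3", "Canada"),
    ("U4", "France"), ("U5", "France"),
    ("U6", "Japan"),
]


def count_by_country(rows, max_count=None):
    """Group, count, optionally filter on the count, then sort descending."""
    # GROUP BY ?countryLabel with COUNT(?s)
    counts = Counter(country for _, country in rows)
    # HAVING(COUNT(?s) <= max_count)
    kept = [
        (country, n) for country, n in counts.items()
        if max_count is None or n <= max_count
    ]
    # ORDER BY DESC(?cnt)
    return sorted(kept, key=lambda item: item[1], reverse=True)


print(count_by_country(rows))
print(count_by_country(rows, max_count=2))
```

As in SQL, the HAVING step runs after grouping, which is why it can test the count while a plain FILTER inside the WHERE block cannot.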

In practice

Laurentian library catalogue

Further resources