Wikidata: McGill SIS Workshop

Dan Scott <https://dscott.ca/#i>

Doctoral student, McGill University

2018-11-12

A few examples

Origins

  • Wikipedia (and the many Wikipedias)
  • Wikimedia Commons
  • The Wikimedia Foundation

Authority data on Wikipedia

  • Bands or musicians from Montreal?
  • Authority entries on Wikipedia pages
  • Show the coding: {{Authority Control}}
  • Show the inter-wiki links (if any)

Laurentian catalogue

  • Search for music recording: "Innocence"
  • Click through a few of the "author" entries
  • The Alice Neary entry could use more info
  • Add links (linked data!) to MusicBrainz and AllMusic entries

Minding your Ps and Qs

What kind of data?

  • Must fulfill Wikidata's notability policy by meeting at least one criterion:
    • It contains a valid sitelink to a Wikimedia Foundation project page
    • Or it refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references
    • Or it fulfills some structural need, such as supporting another item
  • Tension: Wikidata tries to represent everything that would benefit from a multilingual representation in Wikipedia and related projects, without tackling literally everything. There is still a space for MusicBrainz and other subject-dedicated projects

Back to Alice Neary

  • Look at one of the references - aha, received the Pierre Fournier Award!
  • Try to add a link for "award received" for the Pierre Fournier Award - the award does not exist
  • Add a new item "Pierre Fournier Award", modelled after the Peabody Award
    • Establishes the pattern: follow the lead of an existing, high-profile item when choosing properties
  • Credit Alice Neary with the award

How does this work?

Who is using Wikidata's data?

Artisanal curation vs. bulk manipulation in Wikidata

Reconciliation (creating links to Wikidata entities)

Some SPARQL hands-on

  • Walk through map of public libraries in Ontario
  • Limits of the Wikidata Query Service (60 second timeouts)
  • Example queries
  • User-friendly interface (hover help, CTRL-Space completion)

More examples

Wikidata runs on Wikibase -- all open source software

  • MediaWiki (PHP + MySQL + Apache)
  • Wikibase extension for MediaWiki
  • Elasticsearch (full text search)
  • Blazegraph triple store (query service)
  • Nginx (query service proxy)

Linked open data

Making statements about things

Dereferencing links

Links may give you more information about themselves (if you ask nicely):
curl -LH 'Accept: text/turtle' http://example.org/metropolis

@prefix example: <http://example.org/> .

example:metropolis a example:Item ;
    example:subclass_of example:city ;
    example:named_after example:mother ;
    example:named_after example:city .
(with apologies for the Turtle syntax)

(Wikidata) opaque identifiers

Opaque identifiers allow neutral multilingualism:
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

wd:Q340 wdt:P31 wd:Q200250 .

    @en: Montreal -- instance of -- metropolis
    @fr: Montréal -- nature de l'élément -- métropole
    @atj: Moriak -- nature de l'élément -- métropole

Linked open data

  • Wikidata: CC-0 license
  • (Roughly: Do whatever you want, no attribution required)

Wikidata advanced data model

Qualifying statements

wd:Q340 wdt:P1082 "+1704694"^^xsd:decimal .
Population of Montréal - but when?

Qualifier: A statement about a statement

wd:Q340 p:P1082 wds:Q340-4460bad7 .

wds:Q340-4460bad7 a wikibase:Statement ;
  ps:P1082 "+1704694"^^xsd:decimal ;
  pq:P585 "2016-01-01T00:00:00Z"^^xsd:dateTime .

Establishing trust

References - Blocks of statements about a statement
wd:Q340 p:P1082 wds:Q340-4460bad7 .

wds:Q340-4460bad7 a wikibase:Statement ;
  ps:P1082 "+1704694"^^xsd:decimal ;
  prov:wasDerivedFrom wdref:229126ee7cccf5ae097b4f604191a1a3e66b97ac .

wdref:229126ee7cccf5ae097b4f604191a1a3e66b97ac a wikibase:Reference ;
  pr:P248 wd:Q16955163 .

wd:Q16955163 a wikibase:Item ;
  rdfs:label "Canada 2016 Census"@en .
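
The statement, qualifier, and reference nodes above can be walked in one SPARQL query. A minimal Python sketch follows; it only builds the query text. The function name and defaults are illustrative, and it assumes the query service predefines the wd:, p:, ps:, pq:, pr:, and prov: prefixes.

```python
def statement_query(item: str, prop: str,
                    qualifier: str = "P585", source: str = "P248") -> str:
    """Build a SPARQL query: item -> statement node -> value, qualifier, reference."""
    return f"""SELECT ?value ?qualifierValue ?source WHERE {{
  wd:{item} p:{prop} ?statement .
  ?statement ps:{prop} ?value .
  OPTIONAL {{ ?statement pq:{qualifier} ?qualifierValue . }}
  OPTIONAL {{
    ?statement prov:wasDerivedFrom ?reference .
    ?reference pr:{source} ?source .
  }}
}}"""


# Population (P1082) of Montreal (Q340), with "point in time" (P585)
# and "stated in" (P248) where they exist.
query = statement_query("Q340", "P1082")
print(query)
```

Both OPTIONAL blocks keep statements that lack a qualifier or a reference in the results.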

Getting data: dereferencing

URI patterns

URI type         Pattern
Wiki (human)     http://www.wikidata.org/wiki/(P|Q)###
Concept (RWO)    http://www.wikidata.org/entity/(P|Q)###
Data (document)  http://www.wikidata.org/wiki/Special:EntityData/(P|Q)###
  • P prefixes identify properties
  • Q prefixes identify items

Serializations

Media type (Accept:)    Extension
application/json        .json [1]
application/n-triples   .nt
application/rdf+xml     .rdf
text/html               .html
text/n3                 .n3
text/turtle             .ttl

[1] Wikidata's JSON format is not JSON-LD.
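
There are two ways to ask for a serialization: send an Accept header to the concept URI, or append the extension to the data URL. A hedged sketch using only the Python standard library follows. The helper names are made up for illustration, and nothing is sent over the network until the request is passed to `urllib.request.urlopen`.

```python
from urllib.request import Request

# Extension -> media type, copied from the table above.
MEDIA_TYPES = {
    "json": "application/json",
    "nt": "application/n-triples",
    "rdf": "application/rdf+xml",
    "html": "text/html",
    "n3": "text/n3",
    "ttl": "text/turtle",
}


def negotiated_request(entity_id: str, extension: str) -> Request:
    """Content negotiation: concept URI plus an Accept header."""
    return Request(
        f"http://www.wikidata.org/entity/{entity_id}",
        headers={"Accept": MEDIA_TYPES[extension]},
    )


def extension_url(entity_id: str, extension: str) -> str:
    """No negotiation: data URL with an explicit file extension."""
    if extension not in MEDIA_TYPES:
        raise ValueError(f"unknown extension: {extension!r}")
    return f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.{extension}"


req = negotiated_request("Q6801308", "ttl")
url = extension_url("Q6801308", "ttl")
print(req.full_url, req.get_header("Accept"))
print(url)
```

`urlopen(req)` follows redirects by default, much like `curl -L`.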

Dereference and dig in

curl -LH 'Accept: text/turtle' http://www.wikidata.org/entity/Q6801308

@prefix wikibase: <http://wikiba.se/ontology-beta#> .
@prefix wdata: <https://www.wikidata.org/wiki/Special:EntityData/> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix wd: <http://www.wikidata.org/entity/> .

wdata:Q6801308 a schema:Dataset ;
  schema:about wd:Q6801308 ;
  cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
  schema:softwareVersion "0.1.0" ;
  schema:version "632535452"^^xsd:integer ;
  schema:dateModified "2018-02-17T22:02:36Z"^^xsd:dateTime ;
  wikibase:statements "9"^^xsd:integer ;
  wikibase:identifiers "2"^^xsd:integer ;
  wikibase:sitelinks "2"^^xsd:integer .

Labels and descriptions

curl -LH 'Accept: text/turtle' http://www.wikidata.org/entity/Q6801308

@prefix wikibase: <http://wikiba.se/ontology-beta#> .
@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix p: <http://www.wikidata.org/prop/> .

wd:Q6801308 a wikibase:Item ;
  rdfs:label "McGill University Library"@en ;
  skos:prefLabel "McGill University Library"@en ;
  schema:name "McGill University Library"@en ;
  rdfs:label "Bibliothèque de l'Université McGill"@fr ;
  skos:prefLabel "Bibliothèque de l'Université McGill"@fr ;
  schema:name "Bibliothèque de l'Université McGill"@fr ;
  schema:description "library system of McGill University in Montreal, Quebec, Canada"@en,
    "universiteitsbibliotheek in Montreal, Canada"@nl .

Truthy values

@prefix wdt: <http://www.wikidata.org/prop/direct/> .

wd:Q6801308 a wikibase:Item ;
  wdt:P17 wd:Q16 .
  • The country (P17) for the McGill library system is Canada (Q16).
  • wdt: is the "truthy" value for a given statement; qualifiers such as "start date" or "language of work" are stripped out, and instead appear in separate statements.
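
A small sketch of how truthy values are selected, assuming the best-rank rule described in the RDF dump format documentation: preferred-rank statements win if any exist, otherwise normal-rank statements are used, and deprecated statements never count. The `Statement` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    value: str
    rank: str = "normal"  # "preferred", "normal", or "deprecated"


def truthy_values(statements):
    """Return the values that would be exposed through wdt: (best rank only)."""
    preferred = [s.value for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s.value for s in statements if s.rank == "normal"]


# Hypothetical statements for a single property on a single item.
history = [
    Statement("older value"),
    Statement("current value", rank="preferred"),
    Statement("wrong value", rank="deprecated"),
]
print(truthy_values(history))
```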

Normalized values

@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .

wd:Q6801308 a wikibase:Item ;
  wdt:P2581 "01813570n" ;
  wdtn:P2581 <http://babelnet.org/rdf/s01813570n> ;
  wdt:P856 <http://www.mcgill.ca/library/> .
  • wdtn is the "truthy normalized" value for a given statement. The BabelNet ID (01813570n) is also available as a normalized URL.
  • The official website (P856) for the McGill library system is at the URL http://www.mcgill.ca/library/.
  • MARC cataloguers rejoice, the 856 lives!
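
The normalized value looks like the plain identifier substituted into a URI template. A minimal sketch under that assumption follows. The template string is inferred from the two BabelNet values above, and the `$1` placeholder convention is an assumption of this example.

```python
def normalize_external_id(formatter_uri: str, external_id: str) -> str:
    """Substitute an external identifier into a URI template containing $1."""
    if "$1" not in formatter_uri:
        raise ValueError("formatter URI has no $1 placeholder")
    return formatter_uri.replace("$1", external_id)


# Template inferred from the wdt:/wdtn: pair shown above (an assumption).
BABELNET_TEMPLATE = "http://babelnet.org/rdf/s$1"
normalized = normalize_external_id(BABELNET_TEMPLATE, "01813570n")
print(normalized)
```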

Qualified statements

@prefix wds: <http://www.wikidata.org/entity/statement/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .

wd:Q48035044 a wikibase:Item ;
  rdfs:label "South End Library"@en ;
  wdt:P856 <http://www.sudburylibraries.ca/en/aboutus/southendlibrary.asp>,
    <http://www.sudburylibraries.ca/fr/aboutus/southendlibrary.asp> .

wd:Q48035044 p:P856 wds:Q48035044-814e2829-4ba7-ec4b-c397-da81f893b64b .

wds:Q48035044-814e2829-4ba7-ec4b-c397-da81f893b64b a wikibase:Statement,
    wikibase:BestRank ;
  wikibase:rank wikibase:NormalRank ;
  ps:P856 <http://www.sudburylibraries.ca/en/aboutus/southendlibrary.asp> ;
  pq:P407 wd:Q1860 .
Best docs for Wikidata data model: RDF Dump Format

Getting data: SPARQL endpoint

Humans    https://query.wikidata.org/
Machines  https://query.wikidata.org/sparql
Built on Blazegraph

Friendly human interface

  • Auto-completion
  • Hover for labels
  • Results as maps, images, trees, timelines, charts...
  • Rich set of examples just a click away
Wikidata query editor screenshot

Quirks

  • Highly available, averaging over 3M queries/day
  • Long-running queries are killed after 60 seconds
  • Data is ~5 minutes behind main site
  • Wikibase services (extended functions)

Powering live lookups

Screenshot of Wikidata-powered infocard at Laurentian

Label service

Rather than:
SELECT ?library ?libraryLabel WHERE {
  ?library wdt:P31 wd:Q856234.
  ?library rdfs:label ?libraryLabel.
  FILTER(lang(?libraryLabel) = 'en' || lang(?libraryLabel) = 'fr')
}
use the label service:
SELECT ?library ?libraryLabel WHERE {
  ?library wdt:P31 wd:Q856234.
  SERVICE wikibase:label { bd:serviceParam
    wikibase:language "[AUTO_LANGUAGE],en,fr". }
}
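
For machines, the same label-service query can be sent to the SPARQL endpoint. The sketch below uses only the standard library: it prepares the request and flattens a SPARQL JSON result into plain rows. The User-Agent string and the sample response are placeholders, not real output.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?library ?libraryLabel WHERE {
  ?library wdt:P31 wd:Q856234.
  SERVICE wikibase:label { bd:serviceParam
    wikibase:language "[AUTO_LANGUAGE],en,fr". }
}
"""


def sparql_request(query: str, user_agent: str = "example-workshop-script/0.1") -> Request:
    """Prepare (but do not send) a GET request asking for JSON results."""
    url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    return Request(url, headers={
        "Accept": "application/sparql-results+json",
        "User-Agent": user_agent,  # placeholder; identify your own script
    })


def rows(results: dict) -> list:
    """Flatten SPARQL JSON bindings into {variable: value} dictionaries."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


req = sparql_request(QUERY)

# Hypothetical response, shaped like the SPARQL JSON results format.
sample = {"results": {"bindings": [{
    "library": {"type": "uri", "value": "http://www.wikidata.org/entity/Q6801308"},
    "libraryLabel": {"type": "literal", "xml:lang": "en",
                     "value": "McGill University Library"},
}]}}
print(rows(sample))
```

To run it for real, pass `req` to `urllib.request.urlopen` and feed the decoded JSON to `rows`.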

Geospatial services

  • wikibase:around
  • wikibase:box

wikibase:box example (libraries between San Jose and Sacramento)
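
A sketch of what such a bounding-box query might look like, built as a Python string. Several things here are assumptions: the corner coordinates are rough values for San Jose and Sacramento, Q7075 is taken to be the item for "library", and the corner parameter names follow the query service manual as best recalled. Check them against the manual before relying on this.

```python
def box_query(south_west, north_east, instance_of="Q7075"):
    """Build a wikibase:box query; corners are (longitude, latitude) pairs."""
    (sw_lon, sw_lat), (ne_lon, ne_lat) = south_west, north_east
    return f"""SELECT ?place ?placeLabel ?location WHERE {{
  SERVICE wikibase:box {{
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:cornerSouthWest "Point({sw_lon} {sw_lat})"^^geo:wktLiteral .
    bd:serviceParam wikibase:cornerNorthEast "Point({ne_lon} {ne_lat})"^^geo:wktLiteral .
  }}
  ?place wdt:P31 wd:{instance_of} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""


# Approximate corners: San Jose (south-west), Sacramento (north-east).
box = box_query((-121.89, 37.34), (-121.49, 38.58))
print(box)
```

WKT points put longitude first, which is easy to get backwards.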

Getting data: Wikibase API

Base URL for all modules: https://wikidata.org/w/api.php

Modules: wbavailablebadges, wbcheckconstraintparameters, wbcheckconstraints, wbcreateclaim, wbcreateredirect, wbeditentity, wbformatvalue, wbgetclaims, wbgetentities, wblinktitles, wbmergeitems, wbparsevalue, wbremoveclaims, wbremovequalifiers, wbremovereferences, wbsearchentities, wbsetaliases, wbsetclaim, wbsetclaimvalue, wbsetdescription, wbsetlabel, wbsetqualifier, wbsetreference, wbsetsitelink, wbsgetsuggestions, webapp-manifest

API: Search entities

  • Retrieve entities with labels or aliases that match a string: wbsearchentities
  • Query params:
    { "action": "wbsearchentities",
      "format": "json",
      "search": "mcgill university library",
      "language": "en" }
  • Results via API Sandbox
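
Outside the sandbox, the same search is a plain GET request. The sketch below builds the URL and picks IDs and labels out of a response. The helper names are illustrative, and the sample response is hypothetical, trimmed to the fields used here.

```python
from urllib.parse import urlencode

API_URL = "https://wikidata.org/w/api.php"


def search_url(text: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities request URL."""
    params = {
        "action": "wbsearchentities",
        "format": "json",
        "search": text,
        "language": language,
        "limit": limit,
    }
    return API_URL + "?" + urlencode(params)


def matches(response: dict) -> list:
    """Return (id, label) pairs from a wbsearchentities response."""
    return [(hit["id"], hit.get("label", "")) for hit in response.get("search", [])]


url = search_url("mcgill university library")

# Hypothetical response for illustration only.
sample = {"search": [{"id": "Q6801308", "label": "McGill University Library"}],
          "success": 1}
print(url)
print(matches(sample))
```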

API: Get entities

  • Retrieve claims for one or more items: wbgetentities
  • Query params:
    { "action": "wbgetentities",
      "format": "json",
      "ids": "Q6801308|Q48035044",
      "languages": "en|fr" }
  • Results via API Sandbox ("snak" format)
  • Claim IDs ("Q6801308$E938C340-3F01-41BA-BF4A-3B19F36337DA") are used in wbsetclaim, etc.
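
The "snak" format takes some digging. The sketch below builds a wbgetentities URL and pulls claim IDs and values for one property out of a response. The helper names are illustrative, and the sample response, including its claim ID, is made up to show the nesting.

```python
from urllib.parse import urlencode

API_URL = "https://wikidata.org/w/api.php"


def get_entities_url(ids, languages=("en", "fr")) -> str:
    """Build a wbgetentities request URL; multiple values are pipe-separated."""
    params = {
        "action": "wbgetentities",
        "format": "json",
        "ids": "|".join(ids),
        "languages": "|".join(languages),
    }
    return API_URL + "?" + urlencode(params)


def claim_values(response: dict, entity_id: str, prop: str) -> list:
    """Return (claim ID, value) pairs for one property of one entity."""
    claims = response["entities"][entity_id].get("claims", {}).get(prop, [])
    found = []
    for claim in claims:
        snak = claim["mainsnak"]
        if snak.get("snaktype") != "value":
            continue  # "somevalue" and "novalue" snaks carry no datavalue
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            value = value["id"]  # item-valued snak: keep just the Q-ID
        found.append((claim["id"], value))
    return found


url = get_entities_url(["Q6801308", "Q48035044"])

# Hypothetical, heavily trimmed response.
sample = {"entities": {"Q6801308": {"claims": {"P17": [{
    "id": "Q6801308$00000000-0000-0000-0000-000000000000",
    "rank": "normal",
    "mainsnak": {"snaktype": "value", "property": "P17",
                 "datavalue": {"type": "wikibase-entityid",
                               "value": {"entity-type": "item", "id": "Q16"}}},
}]}}}}
print(claim_values(sample, "Q6801308", "P17"))
```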

Best(?) LOD workflow

  1. Use wbsearchentities to find matching items
  2. Dereference the items with the LOD serialization of your preference

Editing data: QuickStatements

Q849751 → Len → "York University"
Item Q849751 has the label "York University"
  • Tab-delimited format for adding/editing Wikidata items (docs)
  • Editing a label (Llangcode) replaces the current label, if any.
  • Strings are delimited by double quotes

Editing statements

Q849751 → P17 → Q16
York University country is Canada

Adds a new statement unless an identical one (same property and value) already exists

Create an item

CREATE
LAST → Len → "Ajax Public Library"
LAST → P31 → Q28324850
New item has label "Ajax Public Library", is a library system
  • CREATE creates a new item
  • LAST refers to the most recently created item

Add/edit qualified statements

Q41506 → P2196 → 16336 → P585 → +2016-10-23T00:00:00Z/11
Stanford student count = 16,336 as of 2016-10-23
  • Add 1 or more P### qualifiers to the statement
  • Datetime values use +yyyy-MM-ddThh:mm:ssZ/precision format (precision 11 = day, 9 = year)
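
A small helper for producing these time values from Python dates. It is a sketch: `qs_time` is a hypothetical name, only day and year precision are handled, and year precision zeroes the month and day to match the QuickStatements sample later in this deck.

```python
from datetime import date

PRECISION_YEAR = 9
PRECISION_DAY = 11


def qs_time(d: date, precision: int = PRECISION_DAY) -> str:
    """Format a date as a QuickStatements time value."""
    if precision == PRECISION_DAY:
        month, day = d.month, d.day
    elif precision == PRECISION_YEAR:
        month, day = 0, 0  # month and day are not significant at year precision
    else:
        raise ValueError(f"unsupported precision: {precision}")
    return f"+{d.year:04d}-{month:02d}-{day:02d}T00:00:00Z/{precision}"


print(qs_time(date(2016, 10, 23)))
print(qs_time(date(2016, 1, 1), PRECISION_YEAR))
```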

Statements with sources

Q41506 → P2196 → 16336 → P585 → +2016-10-23T00:00:00Z/11
  → S854 → "https://example.com"
  → S813 → +2018-04-29T00:00:00Z/11
Source URL = https://example.com, retrieved on 2018-04-29

Use S as the prefix for P### IDs when used as source statements
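
When generating QuickStatements from a spreadsheet or script, each command is a single tab-delimited line. A minimal sketch of a line builder follows. `qs_line` is a hypothetical helper, and values are expected to be already formatted, with strings in double quotes and times in the format above.

```python
def qs_line(subject, prop, value, qualifiers=(), sources=()):
    """Build one tab-delimited QuickStatements command.

    qualifiers and sources are sequences of (P-ID, formatted value) pairs;
    source properties are rewritten from P### to S###.
    """
    fields = [subject, prop, value]
    for q_prop, q_value in qualifiers:
        fields += [q_prop, q_value]
    for s_prop, s_value in sources:
        if not s_prop.startswith("P"):
            raise ValueError(f"expected a P-ID, got {s_prop!r}")
        fields += ["S" + s_prop[1:], s_value]
    return "\t".join(fields)


line = qs_line(
    "Q41506", "P2196", "16336",
    qualifiers=[("P585", "+2016-10-23T00:00:00Z/11")],
    sources=[("P854", '"https://example.com"'),
             ("P813", "+2018-04-29T00:00:00Z/11")],
)
print(line)
```

Qualifiers come before sources on the line, as in the example above.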

Applying QuickStatements

QuickStatements sample

CREATE
LAST	Len	"Ajax Public Library"
LAST	Den	"library system in Ontario, Canada"
LAST	P31	Q28324850	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P17	Q16	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P131	Q386567	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P463	Q7570226	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P969	"55 Harwood Avenue South, Ajax, Ontario, L1S 2H8"	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P856	"http://ajaxlibrary.ca"	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11
LAST	P1174	9457	P585	+2016-00-00T00:00:00Z/9	S248	Q52147771	S854	"https://files.ontario.ca/opendata/ontario_public_library_statistics_open_data_2016.csv"	S813	+2018-04-22T00:00:00Z/11

Load into https://tools.wmflabs.org/quickstatements

Source: https://gitlab.com/denials/wikidata_ontario_public_libraries

Applying QuickStatements

Screenshot of QuickStatements web page

OpenRefine reconciliation

OpenRefine reconciliation screenshot

OpenRefine Wikidata extension

OpenRefine Wikidata extension screenshot

In development since August 2017

Details

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License