schema.org, Wikidata, Knowledge Graph:

strands of the modern semantic web

Dan Scott, Laurentian University
coffeecode.net / +Dan Scott / @denials

  • Knowledge Graph IDs in the middle
  • Wikidata IDs and relations at the edges
  • schema.org data as bugs (raw material for the semantic web)
 

Laziness

The virtue is about the avoidance of future work.

Virtues of the Perl Programmer, Larry Wall

Linked open data aspires to be lazy

  • Publish and let the data be consumed
  • Avoid the hard work of natural language parsing / entity recognition and disambiguation
  • Pull the data you need from various aggregators

Who is the governor of Ohio?

 

Knowledge Graph sources

  • Freebase
  • Wikipedia / Wikidata
  • Crawling the web
  • Licensed data

Freebase: Open graph database launched in 2007

Focused on entities and relationships
Each type of entity was well-structured, with browsable instances
  • Freely available with a great auto-suggest API
  • Purchased by Google in 2010
  • ... sunset August 31, 2016

Knowledge Graph Search API

  • Enable the API in the Google API Console
  • Daily quota of 100,000 requests, 17,000 "Discovery" requests
  • Undocumented result limit of 200; defaults to 20

Knowledge Graph Discovery Widget

<head>
  <link type="text/css" rel="stylesheet"
    href="https://www.gstatic.com/knowledge/kgsearch/widget/1.0/widget.min.css">
  <style>.kge-search-picker { width: 25em; }</style>
  <script type="text/javascript"
    src="https://www.gstatic.com/knowledge/kgsearch/widget/1.0/widget.min.js"></script>
</head>
<body>
  <form id='myform'>
    <label>Search: <input type="text" id="myinput"></label>
  </form>
  <script>
    KGSearchWidget(API_KEY, document.getElementById('myinput'), {});
  </script>
</body>

Awesome live demo!

KG Search API: Requests
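
A minimal sketch of how a request URL might be assembled in Python. The endpoint and the query, key, limit, and types parameters follow Google's published Knowledge Graph Search API documentation; the helper name build_kg_search_url and the placeholder key are assumptions made for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint documented for the Knowledge Graph Search API.
KG_SEARCH_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_search_url(query, api_key, limit=20, types=None):
    """Return a GET URL for an entity search (hypothetical helper)."""
    params = [("query", query), ("key", api_key), ("limit", limit)]
    # 'types' may be repeated to restrict results to schema.org types.
    for schema_type in types or []:
        params.append(("types", schema_type))
    return KG_SEARCH_ENDPOINT + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_kg_search_url("governor of ohio", "YOUR_API_KEY", limit=5, types=["Person"])
print(url)
```

Restricting by type is one of the few constraints on offer, so it is worth setting whenever the expected kind of entity is known.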

KG Search API: Results

{ "@context": {
    "@vocab": "http://schema.org/",
    "goog": "http://schema.googleapis.com/",
    "EntitySearchResult": "goog:EntitySearchResult",
    "detailedDescription": "goog:detailedDescription",
    "resultScore": "goog:resultScore",
    "kg": "http://g.co/kg"
  },
  "@type": "ItemList", "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@id": "kg:/m/02zzm_",
        "name": "John Kasich",
        "@type": [ "Person", "Thing" ],
        "description": "Governor of Ohio",
        "image": {
          "contentUrl": "http://t1.gstatic.com/images?q=tbn:ANd9GcRoou4pZKD6FoNaE71ngNlv4RGgUS46mgtin5YJtyEoh42CIs4x",
          "url": "https://en.wikipedia.org/wiki/John_Kasich"
        },
        "detailedDescription": {
          "articleBody": "John Richard Kasich is an American politician, the 69th and current Governor of Ohio. First elected in 2010 and re-elected in 2014, Kasich is a member of the Republican Party. His term is set to end by January 2019.\n",
          "url": "https://en.wikipedia.org/wiki/John_Kasich",
          "license": "https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License"
        },
        "url": "https://johnkasich.com"
      },
      "resultScore": 22.28808
    }
  ]
}
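
A rough sketch of pulling the useful fields out of a response like the one above. The JSON here is a trimmed copy of that response; the helper name summarize_results is an assumption, not part of any Google client library.

```python
import json

# Trimmed copy of the response shown above.
response_text = """
{
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@id": "kg:/m/02zzm_",
        "name": "John Kasich",
        "@type": ["Person", "Thing"],
        "description": "Governor of Ohio"
      },
      "resultScore": 22.28808
    }
  ]
}
"""


def summarize_results(response):
    """Yield (id, name, description, score) per entity (hypothetical helper)."""
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        yield (
            result.get("@id"),
            result.get("name"),
            result.get("description"),
            element.get("resultScore"),
        )


summaries = list(summarize_results(json.loads(response_text)))
for summary in summaries:
    print(summary)
```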
                        

Knowledge Graph IDs

"kg": "http://g.co/kg"
...
"@id": "kg:/m/02zzm_"

So: http://g.co/kg/m/02zzm_

URIs don't have to resolve, but it's nice when they do!
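
The prefix expansion above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a full JSON-LD processor; the function name expand_compact_iri is an assumption.

```python
def expand_compact_iri(compact_iri, context):
    """Expand 'prefix:suffix' using a JSON-LD style context (simplified)."""
    prefix, separator, suffix = compact_iri.partition(":")
    # Leave absolute IRIs such as "http://..." and unknown prefixes untouched.
    if separator and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return compact_iri


context = {"kg": "http://g.co/kg"}
uri = expand_compact_iri("kg:/m/02zzm_", context)
print(uri)
```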

Knowledge Graph API problems

  • Limited constraints = poor precision
    • Can't limit to name of the entity
  • Type mapping to schema.org loses more precision
    • A Fictional character is a Thing, not Person
  • Simplistic entity results: no facts, no relationships
    • When was John Kasich born? - no birthdate returned

Knowledge Graph: entity relations

Note: The Knowledge Graph Search API returns only individual matching entities, rather than graphs of interconnected entities. If you need the latter, we recommend using data dumps from Wikidata instead.
https://developers.google.com/knowledge-graph/

Wikidata

Wikidata entity search

REST-based MediaWiki API wbsearchentities module:

https://www.wikidata.org/w/api.php?action=wbsearchentities&search=john+kasich&language=en

Restrictions: Make requests serially (one at a time), or risk the ban hammer!
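
One way to respect that restriction, sketched in Python. The format=json parameter is the standard MediaWiki API switch for JSON output; the helper names, the one-second default pause, and the injected fetch function are assumptions for illustration. The demonstration passes a stand-in fetch that performs no network I/O.

```python
import time
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(term, language="en"):
    """Return a wbsearchentities URL that asks for JSON output."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def search_serially(terms, fetch, delay_seconds=1.0):
    """Call fetch(url) for each term, one at a time, pausing between calls."""
    results = []
    for index, term in enumerate(terms):
        if index:
            time.sleep(delay_seconds)  # assumed pause; tune to current API guidance
        results.append(fetch(build_search_url(term)))
    return results


# Stand-in fetch that simply echoes the URL instead of hitting the network.
urls = search_serially(["john kasich", "ohio"], fetch=lambda url: url, delay_seconds=0)
print(urls[0])
```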

wbsearchentities results

{
  "searchinfo": { "search": "john kasich" },
  "search": [
    {
      "id": "Q69319",
      "concepturi": "http://www.wikidata.org/entity/Q69319",
      "url": "//www.wikidata.org/wiki/Q69319",
      "title": "Q69319",
      "pageid": 72036,
      "label": "John Kasich",
      "description": "American politician",
      "match": {
        "type": "label",
        "language": "en",
        "text": "John Kasich"
      }
    }
  ]
}
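
Because each hit reports how it matched, a client can keep only exact label matches, which helps with the precision problem noted for the Knowledge Graph API. The sketch below assumes the response shape shown above; the helper name exact_label_matches is an assumption, and the second entry in the sample is invented purely to exercise the filter.

```python
def exact_label_matches(response):
    """Return (id, concepturi) for hits whose label equals the search term."""
    wanted = response["searchinfo"]["search"].casefold()
    return [
        (hit["id"], hit["concepturi"])
        for hit in response.get("search", [])
        if hit.get("match", {}).get("type") == "label"
        and hit["match"].get("text", "").casefold() == wanted
    ]


response = {
    "searchinfo": {"search": "john kasich"},
    "search": [
        {
            "id": "Q69319",
            "concepturi": "http://www.wikidata.org/entity/Q69319",
            "label": "John Kasich",
            "match": {"type": "label", "language": "en", "text": "John Kasich"},
        },
        {
            # Hypothetical entry, invented only to exercise the filter.
            "id": "Q-EXAMPLE",
            "concepturi": "http://www.wikidata.org/entity/Q-EXAMPLE",
            "label": "Example Person",
            "match": {"type": "alias", "language": "en", "text": "John Kasich"},
        },
    ],
}

matches = exact_label_matches(response)
print(matches)
```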

Relationships

    With http://www.wikidata.org/entity/ as prefix wde:, the entity wde:Q69319 ("John Kasich"):

    • has the property wde:P569 ("date of birth") with the literal value "+1952-05-13T00:00:00Z";
    • also has the property wde:P39 ("position held")
      • with the array member wde:Q13218630 ("United States representative")
      • having the qualifier property wde:P580 ("start time")
      • with the literal datavalue/value/time value of "+1983-01-03T00:00:00Z".

Graph Relationships

More simply...

Subject        Predicate  Object
wde:Q69319     wde:P569   "+1952-05-13T00:00:00Z"
wde:Q69319     wde:P39    wde:Q13218630
wde:Q13218630  wde:P580   "+1983-01-03T00:00:00Z"
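
To make the subject / predicate / object idea concrete, here is a toy sketch that stores the three triples above as Python tuples and matches them against a pattern. It illustrates the data model only; it is not how Wikidata stores or queries its data, and the match function is a hypothetical helper.

```python
# The three simplified triples from the table above.
triples = [
    ("wde:Q69319", "wde:P569", "+1952-05-13T00:00:00Z"),
    ("wde:Q69319", "wde:P39", "wde:Q13218630"),
    ("wde:Q13218630", "wde:P580", "+1983-01-03T00:00:00Z"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        triple
        for triple in triples
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    ]


# "When was John Kasich born?" becomes a pattern with an unknown object.
born = match(triples, subject="wde:Q69319", predicate="wde:P569")
print(born)
```

Leaving a position as None is roughly what a variable such as ?dob does in a SPARQL pattern.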

SPARQL endpoint

  • SPARQL is the primary and most powerful query interface for Wikidata
  • And it's a standard: yay!
  • SELECT / WHERE / GROUP BY / ORDER... just like SQL, right?
  • Ish - querying a graph instead of a relational database

SPARQL examples

We can pull some awesome results from Wikidata:
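
As one illustration, the sketch below builds a query for the positions an entity has held and when each began. It relies on the wd:, p:, ps:, pq:, wikibase:, and bd: prefixes that the Wikidata Query Service predefines; the helper name positions_held_query is an assumption, and the query is only constructed here, not executed.

```python
import re


def positions_held_query(entity_id, language="en"):
    """Build a SPARQL query for positions held (P39) and start times (P580)."""
    # Guard against malformed IDs being spliced into the query text.
    if not re.fullmatch(r"Q[1-9][0-9]*", entity_id):
        raise ValueError(f"not a Wikidata item ID: {entity_id!r}")
    return f"""
SELECT ?position ?positionLabel ?start WHERE {{
  wd:{entity_id} p:P39 ?statement .
  ?statement ps:P39 ?position .
  OPTIONAL {{ ?statement pq:P580 ?start . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}" . }}
}}
ORDER BY ?start
""".strip()


query = positions_held_query("Q69319")
print(query)
```

The statement node (?statement) is what lets the start time attach to one particular position rather than to the person as a whole.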

SPARQL endpoint

  • URL: https://query.wikidata.org/sparql?query=<QUERY>
  • 30-second timeout
  • Available formats:
    Format  HTTP header                              GET param
    XML     Accept: application/sparql-results+xml   format=xml
    JSON    Accept: application/sparql-results+json  format=json
    TSV     Accept: text/tab-separated-values
    CSV     Accept: text/csv
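
A hedged sketch of preparing a request against that endpoint with Python's standard library, asking for JSON through the Accept header. The helper names and the User-Agent string are assumptions; the sample result is hand-written in the standard SPARQL JSON results shape rather than captured from the live service, and the request is prepared but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def prepare_sparql_request(query):
    """Return an unsent GET request that asks for JSON results."""
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder identification string; replace with your own.
        "User-Agent": "example-sparql-client/0.1",
    }
    return Request(url, headers=headers)


def binding_values(results, variable):
    """Collect one variable's values from SPARQL JSON results."""
    rows = results["results"]["bindings"]
    return [row[variable]["value"] for row in rows if variable in row]


request = prepare_sparql_request("SELECT ?dob WHERE { wd:Q69319 wdt:P569 ?dob }")
# To send it: urllib.request.urlopen(request, timeout=30)

# Hand-written sample in the SPARQL JSON results format.
sample = {
    "head": {"vars": ["dob"]},
    "results": {"bindings": [{"dob": {"type": "literal", "value": "1952-05-13T00:00:00Z"}}]},
}
print(binding_values(sample, "dob"))
```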

Getting data into Wikidata

Notability requirements still apply

A typical four-year-old?

  • Not just cute anymore
  • Still growing rapidly: 25 million entities
  • A little unstable: queries can and do break
  • Will benefit from guidance

schema.org

Decentralized data publishing

Introducing modal news!

OpenGraph Protocol

<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#">
  <title>LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH – The Owl Mag – Medium</title>
  <meta name="title" content="LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH">
  <meta property="og:title" content="LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH">
  <meta property="og:url" content="https://medium.com/the-owl-mag/live-review-midpoint-music-festival-cincinnati-oh-a922af156600">
  <meta property="og:image" content="https://cdn-images-1.medium.com/proxy/1*MXL-j6S8fTEd8UFP_foEEw.png">
  <meta name="description" content="Cincinnati is not a music city by any means. Numerous bands skip the city in lieu of Columbus, OH on national tours and the city’s hottest music venues reside over the river in Kentucky. Midpoint…">
  <meta property="og:description" content="Cincinnati is not a music city by any means. Numerous bands skip the city in lieu of Columbus, OH on national tours and the city’s hottest music venues reside over the river in Kentucky. Midpoint…">
  <meta property="og:site_name" content="Medium">
  <meta property="og:type" content="article">
</head>
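
For a consumer, collecting these properties takes little more than a scan of the meta elements. A minimal sketch using Python's standard html.parser module; the class name OpenGraphParser and the trimmed sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values into a dict."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        if prop.startswith("og:") and "content" in attributes:
            self.properties[prop] = attributes["content"]


# Trimmed sample based on the markup above.
html = """<head>
  <meta property="og:site_name" content="Medium">
  <meta property="og:type" content="article">
  <meta name="description" content="Plain meta tags are ignored">
</head>"""

parser = OpenGraphParser()
parser.feed(html)
og = parser.properties
print(og)
```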

Twitter Cards

<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#">
  <title>LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH – The Owl Mag – Medium</title>
  <meta name="title" content="LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH">
  <meta name="description" content="Cincinnati is not a music city by any means. Numerous bands skip the city in lieu of Columbus, OH on national tours and the city’s hottest music venues reside over the river in Kentucky. Midpoint…">
  <meta name="twitter:description" content="Cincinnati is not a music city by any means. Numerous bands skip the city in lieu of Columbus, OH on national tours and the city’s hottest music venues reside over the river in Kentucky. Midpoint…">
  <meta name="twitter:image:src" content="https://cdn-images-1.medium.com/proxy/1*MXL-j6S8fTEd8UFP_foEEw.png">
  <meta name="twitter:site" content="@Medium">
</head>

schema.org

<head>
  <script type="application/ld+json">
  {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "image": {
      "@type": "ImageObject",
      "width": 1920,
      "height": 534,
      "url": "https://cdn-images-1.medium.com/max/1920/1*5ztbgEt4NqpVaxTc64C-XA.png"
    },
    "datePublished": "2011-09-27T16:45:34.000Z",
    "dateModified": "2016-04-19T20:30:07.176Z",
    "headline": "LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH",
    "name": "LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH",
    "keywords": [
      "Review"
    ],
    "author": {
      "@type": "Person",
      "name": "The Owl Mag",
      "url": "https://medium.com/@theowlmag"
    },
    "creator": [
      "The Owl Mag"
    ],
    "publisher": {
      "@type": "Organization",
      "name": "The Owl Mag",
      "url": "https://medium.com/the-owl-mag",
      "logo": {
        "@type": "ImageObject",
        "width": 215,
        "height": 60,
        "url": "https://cdn-images-1.medium.com/max/215/1*5ztbgEt4NqpVaxTc64C-XA.png"
      }
    },
    "mainEntityOfPage": "http://www.theowlmag.com/live-reviews/midpoint-music-festival-cincinnati-oh/"
  }
  </script>
</head>
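
A crawler that wants this data has to find the script element and parse its contents as JSON. The sketch below does that with Python's standard library; the class name JsonLdParser is an assumption, the sample is a trimmed version of the markup above, and a production crawler would also need to cope with invalid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect parsed JSON-LD documents from script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            # json.loads raises ValueError on malformed JSON.
            self.documents.append(json.loads("".join(self._buffer)))


# Trimmed sample based on the markup above.
html = """<head>
  <script type="application/ld+json">
  {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "datePublished": "2011-09-27T16:45:34.000Z",
    "publisher": {"@type": "Organization", "name": "The Owl Mag"}
  }
  </script>
</head>"""

parser = JsonLdParser()
parser.feed(html)
article = parser.documents[0]
print(article["@type"], article["publisher"]["name"])
```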

schema.org (inline RDFa)

<body vocab="http://schema.org/" typeof="NewsArticle">
  <h1 property="name headline">LIVE REVIEW: Midpoint Music Festival — Cincinnati, OH</h1>
  <div>
    <b>Published by</b>
    <em property="publisher" typeof="Organization">
      <a href="https://medium.com/the-owl-mag" property="url">
        <span property="name">The Owl Mag</span>
      </a>
    </em>
    <span property="datePublished" content="2011-09-27T16:45:34.000Z">2011-09-27</span>
  </div>
  <p property="description">
  Cincinnati is not a music city by any means. Numerous bands skip the city in lieu of Columbus, OH on national tours and the city’s hottest music venues reside over the river in Kentucky. Midpoint...
  </p>
  <div property="image" typeof="ImageObject">
    <img property="url" src="https://cdn-images-1.medium.com/max/1920/1*5ztbgEt4NqpVaxTc64C-XA.png" />
    <meta property="width" content="1920" />
    <meta property="height" content="534" />
  </div>
</body>

Deep, rich structure

  • 584 types, 846 properties
  • Covering a broad set of domains:
    • commerce
    • culture
    • events
    • health
    • organizations
  • Run as a W3C Community Group
  • Extensible

Impatience (virtue #2)

  • Joined the Bibliographic Extension community
    • Fleshed out journals/magazines
    • Represented library holdings as Product/Offers
    • Dragged comics across the finish line
  • Implemented schema.org support in library systems
  • Evangelized schema.org at conferences and in articles
  • Aggregated library linked open data for analysis and reuse

Hubris (virtue #3)

Result: Despair (a bit)

Adoption from 2014 to 2015

Big data makes common schemas even more necessary.
[...]
In this sample 31.3 percent of pages have Schema.org markup, up from 22 percent one year ago.

Guha, R. V., Brickley, D., & Macbeth, S. (2016). Schema.org: Evolution of structured data on the web. Communications of the ACM, 59(2), 44-51. (http://cacm.acm.org/magazines/2016/2/197422-schema-org/fulltext)

The semantic web is alive

  • schema.org data is very rich in some communities
  • Crawling and aggregating schema.org data can make sense
  • That can feed into Wikidata
  • We can pull from and build on top of Wikidata
  • That can also feed the Google Knowledge Graph
  • EVERYTHING GETS BETTER

Questions?

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License