In this codelab, you're going to take a simple web page that contains a codelab and enhance it so that it contains structured data. You will use the schema.org vocabulary and express it via RDFa attributes.
Audience: Beginner
Prerequisites: To complete this codelab, you will need a basic familiarity with HTML. The exercises can be found in schema_org_codelab.zip, with the solutions found in the exercises subdirectory. There are frequent checkpoints throughout the codelab, so if you get stuck at any point, you can use the checkpoint file to resume and work through this codelab at your own pace.
In this exercise, you will learn the basic steps required to add simple RDFa structured data to an existing web page.
Open exercise1.html in a text editor. You should see something like the following HTML source for the web page:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Structured data with schema.org codelab</title>
</head>
<body>
<h1>Structured data with schema.org codelab</h1>
<img style="float:right" src="squares.png" />
<p class="byline">
By <a href="http://example.com/AuthorName">Author Name</a>,
January 29, 2014
</p>
<h2>About this codelab</h2>
<h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
<h2>Exercise 2: Embedded types</h2>
<h2>Exercise 3: From strings to things</h2>
</body>
</html>
Note: In a pinch, you can use the browser development tools to view and edit the source of the web page (Ctrl+Shift+I in Chrome or Firefox, in the Elements or Inspector tab respectively).
There are a number of RDFa parsers, both online and locally installable, that can help you check the results of your work. Copy and paste the HTML source into each of the following online structured data extraction tools:
The results should (not surprisingly!) show that the page currently contains no structured data.
RDFa (Resource Description Framework in Attributes) enables us to embed descriptions of things (types) and their properties within HTML documents using just a handful of HTML attributes.
To avoid a tower of Babel situation where one person uses the type name "author" to refer to the same concept that someone else calls a "writer", collections of types and their properties are typically standardized and published as a vocabulary (also known as an ontology).
Each type and property is expected to have a dereferenceable URI so that you (or more realistically the machines) can look up the definition of the vocabulary element and determine its relationship (if any) to other vocabulary elements. For example, you can look up http://schema.org/TechArticle and learn that it is a subclass of the Thing / CreativeWork / Article hierarchy.
You could use the full URI for each vocabulary element, but that would be extremely verbose, especially given vocabularies that publish URIs like http://rdaregistry.info/Elements/a/countryAssociatedWithThePerson. Therefore, RDFa offers the @vocab attribute; if you add a vocab="http://<path/for/vocab>" attribute to an HTML element, the specified value is automatically prepended to the values of any RDFa @typeof and @property attributes within its scope.
We're going to use the schema.org vocabulary for our exercise, as it
includes types and properties that enable us to describe many things of
general interest without having to mix and match multiple vocabularies.
Declare the default vocabulary for the HTML document as http://schema.org/ on the <body> element.
Note: Do not forget the trailing slash (/)!
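To see why the slash matters, recall that RDFa builds the full URI by joining the @vocab value and the term. The fragment below is only an illustrative sketch, not one of the checkpoint files; it uses <div> elements and the TechArticle type purely for demonstration.

```html
<!-- With the trailing slash, typeof="TechArticle" expands to
     http://schema.org/TechArticle -->
<div vocab="http://schema.org/" typeof="TechArticle"></div>

<!-- Without the trailing slash, the same term expands to
     http://schema.orgTechArticle, which is not a schema.org type -->
<div vocab="http://schema.org" typeof="TechArticle"></div>
```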
Checkpoint: Your HTML page should now look like check_1c.html
Many vocabularies focus on a particular domain; for example:
In practice, documents often end up using types and properties from several different vocabularies. While vocabulary description languages like RDF Schema (RDFS) and the Web Ontology Language (OWL) offer ways to express equivalence between types and properties of different vocabularies, it can still be extremely complex to publish and consume mixed-vocabulary documents.
schema.org, on the other hand, tries to provide a vocabulary that can describe almost everything, albeit in many cases with less granularity than more specialized vocabularies.
Unless declared otherwise, web pages are assumed to have a type of WebPage. The choice of type is important as it dictates which properties you can "legally" use, so this section will help you find a more specific match for your purposes.
The schema.org types are arranged in a top-down hierarchy. Starting at the top level of the type hierarchy, browse through the CreativeWork -> Article -> TechArticle type hierarchy. Notice how each type inherits the properties from its parent (beginning with Thing), offers its own more specific definition for its raison d'être, and may add its own properties to enable you to describe it more completely.
To declare an RDFa type for an HTML document, add the @typeof attribute to the <body> element and set the value of the attribute to TechArticle.
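As a rough sketch of this step, assuming the @vocab declaration from the previous step is already in place, the opening <body> tag might look like the following:

```html
<!-- Declares that everything in the body describes one
     http://schema.org/TechArticle -->
<body vocab="http://schema.org/" typeof="TechArticle">
  <!-- existing page content stays here, unchanged -->
</body>
```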
Checkpoint: Your HTML page should now look like check_1d.html
Every schema.org type has a name property available to it, because the property is declared on the Thing type from which every other type inherits. In the case of a TechArticle, the title of the article is mapped to its name. Go ahead and add a @property="name" attribute to the <h1> element to assert that the content of that element is the name of the technical article.
Note: You might be tempted to add the attribute to the <title> element of the HTML document, but this would fall outside of the scope of your @typeof attribute. And while a search engine would likely guess that matching <title> and <h1> content on a given web page is the title, your explicit assertion of that property is stronger than an inference.
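A minimal sketch of the resulting markup, assuming the <body> element already carries the @vocab and @typeof attributes:

```html
<!-- The text content of the h1 becomes the value of the
     TechArticle's name property -->
<h1 property="name">Structured data with schema.org codelab</h1>
```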
This article has an author, and if you check the documentation for TechArticle you will find that there is indeed an author property. Notice that the expected type of the author property is either a Person or Organization type. For now, go ahead and add the @property="author" attribute to the <a> element for the author's name.
Note: You might be tempted to add the attribute to the <p class="byline"> element of the HTML document, but the scope of the <p> element includes more than just the name of the author, so you would be asserting (falsely!) that the author was "By Author Name, January 29, 2014".
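One way the byline might look after this step is sketched below; the attribute order is arbitrary and only the added @property attribute is new:

```html
<p class="byline">
  <!-- The property sits on the link only, not on the whole paragraph -->
  By <a property="author" href="http://example.com/AuthorName">Author Name</a>,
  January 29, 2014
</p>
```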
Right now a date of publication is visible on the page, but as the data just lives inside an undifferentiated string of text, it would be difficult for a machine to know what the data means. To remove this uncertainty, wrap the date in a <time> tag and add the @property="datePublished" attribute.
Checkpoint: Your HTML page should now look like check_1d_i.html
Check the results from various structured data parsers. Do they match your expectations? Look closely at the author value; you probably did not expect the value of the author property to be a URL. This is one of the subtleties of RDFa, where the href attribute value is used for an RDFa attribute value rather than the content of the <a> element.
Let's fix that: add a new span element that wraps the <a> tag, and move the @property="author" attribute to the new span tag. Run your structured data parsers again to ensure that you're getting the results that you expect.
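The difference between the two placements can be sketched as follows; the comments describe what an RDFa 1.1 parser would typically report for each variant:

```html
<!-- Before: the href wins, so author is the URL
     http://example.com/AuthorName -->
<a property="author" href="http://example.com/AuthorName">Author Name</a>

<!-- After: the span has no href, so author is the text "Author Name" -->
<span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>
```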
If you check the datePublished documentation, you will find that the expected value of the property is Date, which in turn is defined as a date value in ISO 8601 format. For a date, then, the expectation is a value in either YYYYMMDD or YYYY-MM-DD format.
You could change the human-visible content to match the ISO 8601 standard, but, while that will make the machines happier, it might confuse some of the poor humans in the audience who do not understand perfectly logical date formats. Let's make them both happy by supplying an inline, properly formatted value using the @content attribute. Go ahead and add @content="2014-01-29" to the <time> element.
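Put together, the date markup might look like this sketch:

```html
<!-- Humans read the element text; RDFa parsers use the content value
     "2014-01-29" as datePublished -->
<time property="datePublished" content="2014-01-29">January 29, 2014</time>
```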
Every type in schema.org can have an image property. One potential use case for search engines is to use image entities to provide a more visually attractive search result. Your technical article contains an image (let's not discuss whether it is visually attractive!). Add the @property="image" attribute to the <img> element.
You can help processors of your article find the actual substance of the article. Assume that all of the relevant content can be found in the sections under the <h2> elements... so once again, you need to add a new element strictly to support structured data. In this case, add a <div> element that wraps all of the <h2> elements. Include the @property="articleBody" attribute on the new element.
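The two additions from this step might look roughly like the following sketch, which reuses the headings from exercise1.html:

```html
<!-- The src value (resolved against the page URL) becomes the value
     of the image property -->
<img style="float:right" src="squares.png" property="image" />

<!-- The text inside the div becomes the value of articleBody -->
<div property="articleBody">
  <h2>About this codelab</h2>
  <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
  <h2>Exercise 2: Embedded types</h2>
  <h2>Exercise 3: From strings to things</h2>
</div>
```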
Checkpoint: Your HTML page should now look like check_1d_ii.html
You might have noticed that some of the RDFa parsers generate a rich snippet that shows you what your page might look like as a search result. You may also have noticed that the rich snippet did not contain much content of your page other than its title. To help search engines generate a better rich snippet, you should include a @property="description" attribute in your web page.
Move the <div property="articleBody"> element down so that it wraps only the "Exercise" <h2> elements. Then add a new <div property="description"> element that wraps only the "About this codelab" <h2> element.
It can be useful for technical articles to include an indication of their intended use. For example, this is a codelab, intended for hands-on learning that reinforces concepts that are introduced gradually throughout the codelab. Fortunately, schema.org offers the educationalUse property for this purpose on the CreativeWork type and its children. However, your page does not include an obvious place to attach this markup.
When you realize that a vocabulary has pointed out a possible deficiency in your work, you could revisit the web page and add an "Intended use" field that you could then use to classify all of your work. In this step, assume that you are working with a strict designer who forbids you from altering the look or content of the page. In that situation, your only option is to use a <meta> element to define the property value for the machines.
Go ahead and add <meta property="educationalUse" content="codelab"> anywhere within the scope of the TechArticle. The solutions add the element directly under the <h1> element.
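As a sketch of one possible placement (any position inside the element that carries typeof="TechArticle" should work equally well):

```html
<h1 property="name">Structured data with schema.org codelab</h1>
<!-- Not rendered by browsers, but read by RDFa parsers as a property
     of the enclosing TechArticle -->
<meta property="educationalUse" content="codelab">
```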
Note: Do not use this approach as a license to stuff your web page full of lascivious keywords that have no connection to your content in the hopes of drawing a larger audience to your site. The search engines learned about this "spiderfood" tactic back in the 1990s and will punish your site mercilessly with a low relevancy ranking if you are determined to have been trying to game their systems. The generally accepted best practice is to add machine-readable markup only to the same content that humans can see. Reserve <meta> elements for the most important purposes.
Checkpoint: Your HTML page should now look like check_1f.html
Go back to the TechArticle page and view the source for the Properties from CreativeWork table. Notice that the type and property hierarchies are themselves defined in RDFa, using the rdfs:subClassOf property from the RDF Schema vocabulary (a vocabulary for defining vocabularies) and the domainIncludes property that schema.org itself defines.
In this exercise, you learned:
@content attribute to supply machine-readable versions of human-oriented data
<meta> element to supply properties that would not otherwise be part of the content
So far you have described the page using a single type and a handful of properties. However, when you added the @property="author" attribute, the expected value for the property (the range) was not a simple text string; it was supposed to be either a Person or Organization type.
In this exercise, you will add several embedded types to the page to conform to the vocabulary definition and make your structured data even more useful.
Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_1f.html into a new file. As a refresher, your HTML should now look something like the following:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Structured data with schema.org codelab</title>
</head>
<body vocab="http://schema.org/" typeof="TechArticle">
<h1 property="name">Structured data with schema.org codelab</h1>
<img style="float:right" src="squares.png" property="image" />
<meta property="educationalUse" content="codelab">
<p class="byline">
By <span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>,
<time property="datePublished" content="20140129">January 29, 2014</time>
</p>
<div property="description">
<h2>About this codelab</h2>
</div>
<div property="articleBody">
<h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
<h2>Exercise 2: Embedded types</h2>
<h2>Exercise 3: From strings to things</h2>
</div>
</body>
</html>
Your @property="author" attribute needs to define a Person type to satisfy the expected value of author. Simply add the @typeof="Person" attribute to the same HTML element so that you are, in one step, defining the author property for the overall TechArticle type, while simultaneously starting a new Person type scope.
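A sketch of the byline after this change, starting from the refresher markup above:

```html
<!-- property="author" attaches the value to the TechArticle;
     typeof="Person" makes that value a new Person and opens its scope -->
By <span property="author" typeof="Person"><a href="http://example.com/AuthorName">Author Name</a></span>,
```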
Now that you have defined a Person
type, you can define
specific properties for it.
Declare that the <a> element is the sameAs property of the Person. While url might be tempting, it is usually reserved for linking to a URL where the thing that is described is available, whereas sameAs is used to link to a description of the thing.
Declare that the person's name is the name property of the Person type. For bonus points, nest the givenName and familyName properties inside of the name property.
Tip: Remember that you might need to add <span> tags to create a new scope for the properties that you want to add.
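One possible arrangement of these properties is sketched below, using "Dan Scott" as the author name for illustration; your checkpoint file may lay out the spans differently.

```html
By <span property="author" typeof="Person">
  <!-- The href supplies the sameAs value for the Person -->
  <a href="http://example.com/AuthorName" property="sameAs">
    <!-- The nested spans supply name, givenName, and familyName,
         all of which still describe the same Person -->
    <span property="name"><span property="givenName">Dan</span>
    <span property="familyName">Scott</span></span>
  </a>
</span>,
```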
Checkpoint: Your HTML page should now look like check_2a.html
Copyright is an important subject for both creators and organizations and individuals seeking to reuse or republish work, so naturally schema.org includes a copyrightHolder property that you can apply. In this case, however, the author and the copyrightHolder are one and the same, and you have already used the @property attribute.
To define multiple property values for the same attribute, simply
include the values as a whitespace-delimited list. In this case, edit
the HTML to declare @property="author copyrightHolder"
and check your work in one or more structured data validators.
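For instance, the opening tag of the author markup might become the following sketch, with the rest of the Person markup left unchanged:

```html
<!-- Both author and copyrightHolder of the TechArticle now point
     to the same Person -->
<span property="author copyrightHolder" typeof="Person">
  <!-- existing Person markup stays here -->
</span>
```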
Note: These are still relatively early days for structured data validators, and their output varies for more esoteric cases like multi-valued attributes. For example, the Structured Data Linter recognizes the second value for copyrightHolder but generates a "blank node" identifier for it, whereas Google's Structured Data Testing Tool only recognizes the last value of the multi-valued attribute. To complicate matters further, the search engines recognize that their tools have bugs that differ from what their actual production parser understands... so don't be overly alarmed if it seems like your markup is not being recognized by the testing tool.
Sometimes your HTML document does not group all of the content in such a way that you can cleanly keep all of the attributes for a given instance of a type within a single scope. In these cases, you may be able to use the @resource attribute to logically group the properties for that instance.
For example, assume that you have been asked to add the following author biography to the bottom of your technical article: <p>Dan Scott is a systems librarian at Laurentian University.</p>
Now you have the perfect description property for your author, but it is separated from your existing Person instance by all of the content in the middle.
To resolve the problem, simply add a @resource attribute to your existing Person declaration. The value of the new attribute should be unique on this page; use "#authorName" for the sake of simplicity.
Add a wrapping <div> element around the biography section, including a @resource attribute with a value of "#authorName" to match what you added above. This creates a new scope for the existing type instance, such that any properties declared within this new scope will be added to the existing type instance.
Now add a @property="description" attribute to the <p> element for the biography, and check your work in the RDFa parser tools.
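The two halves of the grouping can be sketched as follows. The inner Person markup is omitted here, and the biography text is taken from the example above.

```html
<!-- In the byline: name the Person instance with @resource -->
<span property="author" typeof="Person" resource="#authorName">
  <!-- existing Person markup stays here -->
</span>

<!-- At the bottom of the page: reopen the same instance by reusing
     the identifier, then attach the description to it -->
<div resource="#authorName">
  <p property="description">
    Dan Scott is a systems librarian at Laurentian University.
  </p>
</div>
```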
Checkpoint: Your HTML page should now look like check_2c.html
Given the new author biography, there are several other structured data assertions you can now make on the Person type:
jobTitle property.
worksFor property. Note that the expected type is Organization. Add another embedded type here, and use <meta> elements to make some assertions such as sameAs.
In this exercise, you learned:
@property attribute
@resource to group assertions for a single type on a page
So far you have described the page using types and properties that are inside the page itself. But if you have to update some information that is common to many of your pages, that could be painful to roll out... and even if you have an automated process for updating that information across all of your pages, there is no guarantee that anything extracting data from your site will extract all of the updates at one time.
Fortunately, the problem of providing one copy of information on the web was solved at the same time the web was created: via the simple power of the link! And structured data is no different; in fact, linked data is a term that has emerged over the past few years marking a more pragmatic approach to building a web of structured data than the somewhat classically academic semantic web.
The following principles of linked data were first articulated by Tim Berners-Lee in a 2006 design note:
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
Include links to other URIs, so that they can discover more things.
Keep these principles in mind as you work through the following steps!
Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_2c.html into a new file. As a refresher, your HTML should now look something like the following:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Structured data with schema.org codelab</title>
</head>
<body vocab="http://schema.org/" typeof="TechArticle">
<h1 property="name">Structured data with schema.org codelab</h1>
<img style="float:right" src="squares.png" property="image" />
<meta property="educationalUse" content="codelab">
<p class="byline">
By <span property="author" typeof="Person" resource="#authorName">
<a href="http://example.com/AuthorName" property="sameAs"><span
property="name"><span property="givenName">Dan</span>
<span property="familyName">Scott</span></span></a></span>,
<time property="datePublished" content="20140129">January 29, 2014</time>
</p>
<div property="description">
<h2>About this codelab</h2>
</div>
<div property="articleBody">
<h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
<h2>Exercise 2: Embedded types</h2>
<h2>Exercise 3: From strings to things</h2>
</div>
<div resource="#authorName">
<p property="description">
Author Name is a systems librarian at Laurentian University.
</p>
</div>
</body>
</html>
Take a look at how the page has developed over time; there is now a lot of HTML markup just to describe the author. It's a perfect candidate for refactoring; you can move the bulk of the markup to a separate page about the author. Once it is a separate page, then you can simply link to it from this page... as well as from any other pages that want to provide information about this author.
Create a new file named authorName.html in your text editor, and copy the @resource="#authorName" markup into the file.
As the new file describes a single type, you can move the declaration of the type into the <body> element, and you can remove the @resource attributes from the markup that you pasted into the file. Don't forget the @vocab declaration! Use your existing page as a template.
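A minimal sketch of what authorName.html might contain is shown below. The <title> and <h1> text are assumptions made for illustration, and the actual checkpoint file may carry more of the original author markup.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Author Name</title>
</head>
<!-- The whole page now describes a single Person, so the type
     declaration moves to the body and @resource is no longer needed -->
<body vocab="http://schema.org/" typeof="Person">
<h1 property="name">Author Name</h1>
<p property="description">
Author Name is a systems librarian at Laurentian University.
</p>
</body>
</html>
```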
Use the RDFa parsers to ensure that the markup in the new file expresses the same information as it did in the original file.
Now, replace the inline markup in the original page with a simple link to your new file. You still want to state that "Author Name" is the author of the technical article using the @property="author" assertion, but now you can simply add that property directly to the <a> element that links to your new file. This is a signal to any RDFa parser that the linked resource contains the data for the named property.
Note: "when the element contains the href (or src) attribute, @property is automatically associated with the value of this attribute rather than the textual content of the <a> element" (Adida, Ben; Birbeck, Mark; Herman, Ivan; Sporny, Manu. RDFa 1.1 Primer - Second edition). Using a @property attribute on the same element as a @resource attribute works in a similar fashion; the target of the @resource attribute is used as the value of the @property attribute.
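The refactored byline might then look like the following sketch. The relative link target authorName.html is an assumption based on the file name used in this exercise; adjust it to wherever you saved the author page.

```html
<p class="byline">
  <!-- The href, not the link text, becomes the value of author, so
       the TechArticle now points at the separate author page -->
  By <a property="author" href="authorName.html">Author Name</a>,
  <time property="datePublished" content="2014-01-29">January 29, 2014</time>
</p>
```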
Checkpoint: Your original HTML page should now look like check_3b.html and your new author HTML page should look like exercises/check_3b_authorName.html.
Now that you have created an entirely separate author page, you can add much more information about the author; for example, you can include an email address, links to their personal web sites and social media accounts, a list of their publications and previous talks... far more information than you would have wanted to publish inline in the article itself.
Following the principles of linked data can lead not only to more efficient maintenance of information and (potentially) more useful results in search engines and other aggregators of data, but also to a better information design and experience for your users.
Use the Person properties to flesh out the "about this author" page with properties such as address, birthDate, email, follows, and telephone. Be adventurous, and remember to try to use nested types and ranges appropriately!
In this exercise, you learned:
@property and @href attributes to link to data on another page
Dan Scott is a systems librarian at Laurentian University.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.