In this codelab, you're going to take a simple web page that contains a codelab and enhance it so that it contains structured data. You will use the schema.org vocabulary and express it via RDFa attributes.
To complete this codelab, you will need a basic familiarity with HTML. The
exercises can be found in schema_org_codelab.zip,
with the solutions found in the
exercises subdirectory. There are
frequent checkpoints throughout the codelab, so if you get stuck at any point,
you can use the checkpoint file to resume and work through this codelab
at your own pace.
In this exercise, you will learn the basic steps required to add simple RDFa structured data to an existing web page.
Open exercise1.html in a text
editor. You should see something like the following HTML source for the
page:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
  </head>
  <body>
    <h1>Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" />
    <p class="byline">
      By <a href="http://example.com/AuthorName">Author Name</a>,
      January 29, 2014
    </p>
    <h2>About this codelab</h2>
    <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
    <h2>Exercise 2: Embedded types</h2>
    <h2>Exercise 3: From strings to things</h2>
  </body>
</html>
Note: In a pinch, you can use the browser development tools to
view and edit the source of the web page (in
Chrome or Firefox, use the Elements or Inspector tab).
There are a number of RDFa parsers, both online and locally installable, that can help you check the results of your work. Copy and paste the HTML source into each of the following online structured data extraction tools:
The results should (not surprisingly!) show that the page currently contains no structured data.
RDFa (Resource Description Framework in Attributes) enables us to embed descriptions of things (types) and their properties within HTML documents using just a handful of HTML attributes.
To avoid a tower of Babel situation where one person uses the type name "author" to refer to the same concept that someone else calls a "writer", collections of types and their properties are typically standardized and published as a vocabulary (also known as an ontology).
Each type and property is expected to have a dereferenceable URI so that
you (or more realistically the machines) can look up the definition of
the vocabulary element and determine its relationship (if any) to other
vocabulary elements. For example, you can look up the TechArticle type
and learn that it is a subclass of Article, within the
Thing / CreativeWork / Article type hierarchy.
You could use the full URI for each vocabulary element, but that would
be extremely verbose, especially given vocabularies that publish long URIs.
Therefore, RDFa offers the
@vocab attribute; if you add this
attribute to an HTML element, the values of RDFa attributes such as
@property within its scope are automatically
prefixed with the specified value.
We're going to use the schema.org vocabulary for our exercise, as it
includes types and properties that enable us to describe many things of
general interest without having to mix and match multiple vocabularies.
Declare the default vocabulary for the HTML document by adding
@vocab="http://schema.org/" on the <body> element.
Note: Do not forget the trailing slash (/).
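As a rough sketch of this step (a fragment only, with the rest of the exercise page left unchanged), the <body> element might look like this:

```html
<!-- @vocab sets the default vocabulary for RDFa attributes inside <body> -->
<body vocab="http://schema.org/">
  <h1>Structured data with schema.org codelab</h1>
  <!-- ...rest of the page as before... -->
</body>
```

With this declaration in scope, a short term such as TechArticle is resolved to http://schema.org/TechArticle, which is why the trailing slash matters.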
Checkpoint: Your HTML page should now look like check_1c.html
Many vocabularies focus on a particular domain; for example:
In practice, documents often end up using types and properties from several different vocabularies. While vocabulary description languages like RDF Schema (RDFS) and the Web Ontology Language (OWL) offer ways to express equivalence between types and properties of different vocabularies, it can still be extremely complex to publish and consume mixed-vocabulary documents.
schema.org, on the other hand, tries to provide a vocabulary that can describe almost everything, albeit in many cases with less granularity than more specialized vocabularies.
Unless declared otherwise, web pages are assumed to have a type of WebPage. The choice of type is important as it dictates which properties you can "legally" use, so this section will help you find a more specific match for your purposes.
The schema.org types are arranged in a top-down hierarchy. Starting at
the top level of the type
hierarchy, browse through the CreativeWork -> Article -> TechArticle
type hierarchy. Notice how each type inherits the properties from its parents
(all the way up to Thing), offers its own more specific definition
for its raison d'etre, and may add its own properties to enable you
to describe it more completely.
To declare an RDFa type for an HTML document, add the
@typeof attribute to the <body> element
and set the value of the attribute to TechArticle.
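A minimal sketch of the opening tag after this step, assuming the default vocabulary has already been declared on the same element:

```html
<!-- @typeof names the type; the term is resolved against the @vocab value -->
<body vocab="http://schema.org/" typeof="TechArticle">
```

Every @property attribute placed inside this element now describes the TechArticle, unless a nested element starts a new type scope.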
Checkpoint: Your HTML page should now look like check_1d.html
Every schema.org type has a
name property available to it, because the
property is declared on the
Thing type from which every other type
inherits. In the case of a
TechArticle, the title of the article
is mapped to its name. Go ahead and add a
@property="name" attribute to the
<h1> element to assert that
the content of that element is the name of the technical article.
Note: You might be tempted to add the attribute to the
<title> element of the HTML document, but this would
fall outside of the scope of your
@typeof attribute. And
while a search engine would likely make a best guess that, if the
content of the <title> and <h1> elements
for a given web page match, then that's likely the title, your explicit
assertion of that property is stronger than an inference.
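For illustration, the heading might be marked up as follows (a fragment that assumes it sits inside the element carrying the TechArticle type):

```html
<!-- The text content of <h1> becomes the value of the name property -->
<h1 property="name">Structured data with schema.org codelab</h1>
```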
This article has an author, and if you check the documentation for TechArticle you will find that
there is indeed an
author property. Notice that the expected
type of the
author property is either a Person or
Organization type. For now, go ahead and add the
@property="author" attribute to the <a>
element for the author's name.
Note: You might be tempted to add the attribute to the
<p class="byline"> element of the HTML document,
but the scope of the
<p> element includes more
than just the name of the author, so you would be asserting (falsely!)
that the author was "By Author Name, January 29, 2014".
Right now a date of publication is visible on the page, but as the
data just lives inside an undifferentiated string of text, it would be
difficult for a machine to know what the data means. To remove
this uncertainty, wrap the date in a <time>
tag and add the @property="datePublished" attribute.
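A sketch of how the byline might look at this point, with the author property on the link and the date wrapped in its own element:

```html
<p class="byline">
  <!-- author is attached to the link for now; a later step revisits this -->
  By <a property="author" href="http://example.com/AuthorName">Author Name</a>,
  <!-- the date gets its own element so it can carry a property -->
  <time property="datePublished">January 29, 2014</time>
</p>
```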
Checkpoint: Your HTML page should now look like check_1d_i.html
Check the results from various structured data parsers. Do they match
your expectations? Look closely at the
author value; you
probably did not expect the value of the author property
to be a URL. This is one of the subtleties of RDFa, where the
href attribute value is used as the property value
rather than the text content of the element.
Let's fix that: add a new
span element that wraps the
<a> tag, and move the @property="author"
attribute to the new
span tag. Run your structured data
parsers again to ensure that you're getting the results that you expect.
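One way the corrected author markup could look is sketched below; the wrapping element has no href of its own, so its text content is used as the value:

```html
<!-- The span carries the property; the link inside is plain HTML again -->
<span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>
```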
If you check the datePublished
documentation, you will find that the expected value of the property is
Date, which in turn is defined as a
date value in ISO 8601
format. For a date, then, the expectation is a value in either
YYYY-MM-DD or YYYYMMDD format.
You could change the human-visible content to match the ISO 8601 standard,
but, while that will make the machines happier, it might confuse some of
the poor humans in the audience who do not understand perfectly logical date
formats. Let's make them both happy by supplying an inline, properly
formatted value using the
@content attribute. Go ahead and add
@content="2014-01-29" to the <time> element.
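As a small sketch, the date element then carries both the machine-readable and the human-readable form:

```html
<!-- @content supplies the ISO 8601 value; the visible text stays as written -->
<time property="datePublished" content="2014-01-29">January 29, 2014</time>
```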
Every type in schema.org can have an
image property. One
potential use case for search engines is to use image
values to provide a more visually attractive search result. Your
technical article contains an image (let's not discuss whether it is
visually attractive!). Add the @property="image" attribute to the <img> element.
You can help processors of your article find the actual substance of the
article. Assume that all of the relevant content can be found
in the sections under the
<h2> elements... so once
again, you need to add a new element strictly to support structured
data. In this case, add a
<div> element that wraps
all of the
<h2> elements. Include the
@property="articleBody" attribute on the new element.
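A sketch of the wrapping element described above, using the section headings from the exercise page:

```html
<!-- A new element added purely to give articleBody a scope -->
<div property="articleBody">
  <h2>About this codelab</h2>
  <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
  <h2>Exercise 2: Embedded types</h2>
  <h2>Exercise 3: From strings to things</h2>
</div>
```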
Checkpoint: Your HTML page should now look like check_1d_ii.html
You might have noticed that some of the RDFa parsers generate a rich
snippet that shows you what your page might look
like as a search result. You may also have noticed that the rich snippet
did not contain much content of your page other than its title. To help
search engines generate a better rich snippet, you should include a
@property="description" attribute in your web page.
Move the opening tag of the
<div property="articleBody"> element down so that
it wraps only the "Exercise"
<h2> elements. Then add a
<div property="description"> element
that wraps only the "About this codelab" section.
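The split between the two properties might be sketched like this:

```html
<!-- The introductory section becomes the description... -->
<div property="description">
  <h2>About this codelab</h2>
</div>
<!-- ...and only the exercise sections remain in the article body -->
<div property="articleBody">
  <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
  <h2>Exercise 2: Embedded types</h2>
  <h2>Exercise 3: From strings to things</h2>
</div>
```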
It can be useful for technical articles to include an indication of
their intended use. For example, this is a codelab, intended
for hands-on learning that reinforces concepts that are introduced
gradually throughout the codelab. Fortunately, schema.org offers the
educationalUse property for this purpose on the
CreativeWork type and its children. However, your page
does not include an obvious place to attach this markup.
When you realize that a vocabulary has pointed out a possible
deficiency in your work, you could revisit the web page and add an
"Intended use" field that you could then use to classify all of your
work. In this step, assume that you are working with a strict designer
who forbids you from altering the look or content of the page. In that
situation, your only option is to use a <meta>
element to define the property value for the machines.
Go ahead and add
<meta property="educationalUse" content="codelab">
anywhere within the scope of the
TechArticle. The solutions
add the element near the top of the <body> element, directly after the <img> element.
Note: Do not use this approach as a license to stuff your
web page full of lascivious keywords that have no connection to your
content in the hopes of drawing a larger audience to your site. The
search engines learned about this "spiderfood" tactic back in the 1990s
and will punish your site mercilessly with low relevancy ranking if you
are determined to have been trying to game their systems. The generally
accepted best practice is to add machine-readable markup only to
the same content that humans can see. Reserve <meta>
elements for only the most important purposes.
Checkpoint: Your HTML page should now look like check_1f.html
Go back to the TechArticle page
and view the source for the Properties from CreativeWork table. Notice
that the type and property hierarchies are themselves defined in RDFa using
terms from the RDF Schema vocabulary
(a vocabulary for defining vocabularies).
In this exercise, you learned:
How to use the @content attribute to supply machine-readable versions of human-oriented data
How to use the <meta> element to supply properties that would not otherwise be part of the content
So far you have described the page using a single type and a handful of
properties. However, when you added the @property="author"
attribute, the expected value for the property (the range) was
not a simple text string; it was supposed to be either a Person or Organization type.
In this exercise, you will add several embedded types to the page to conform to the vocabulary definition and make your structured data even more useful.
Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_1f.html into a new file. As a refresher, your HTML should now look something like the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
  </head>
  <body vocab="http://schema.org/" typeof="TechArticle">
    <h1 property="name">Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" property="image" />
    <meta property="educationalUse" content="codelab">
    <p class="byline">
      By <span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>,
      <time property="datePublished" content="20140129">January 29, 2014</time>
    </p>
    <div property="description">
      <h2>About this codelab</h2>
    </div>
    <div property="articleBody">
      <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
      <h2>Exercise 2: Embedded types</h2>
      <h2>Exercise 3: From strings to things</h2>
    </div>
  </body>
</html>
The element carrying the @property="author" attribute needs to define a
Person type to satisfy the expected value of
author. Simply add the @typeof="Person"
attribute to the same HTML element so that you are, in one step,
defining the author property for the overall
TechArticle type, while simultaneously starting a new
Person type scope.
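Sketched as markup, the author element might now look like this:

```html
<!-- One element both sets the article's author and opens a Person scope -->
<span property="author" typeof="Person">
  <a href="http://example.com/AuthorName">Author Name</a>
</span>
```

Properties added inside this element describe the Person rather than the TechArticle.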
Now that you have defined a
Person type, you can define
specific properties for it.
Declare that the
<a> element is the
sameAs property of the Person. While
url might be tempting, it is usually reserved for linking
to a URL where the thing that is described is available, whereas
sameAs is used to link to a description of the thing.
Declare that the person's name is the
name property of the
Person type. For bonus points, nest the givenName and
familyName properties inside of the name property.
Tip: Remember that you might need to add
<span> tags to create a new scope for the properties
that you want to add.
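One possible shape for the finished Person markup is sketched below, using the name that appears in the later refresher listing:

```html
<span property="author" typeof="Person">
  <!-- The href value becomes the sameAs value -->
  <a href="http://example.com/AuthorName" property="sameAs">
    <!-- name wraps the two finer-grained name properties -->
    <span property="name"><span property="givenName">Dan</span> <span property="familyName">Scott</span></span>
  </a>
</span>
```

The extra span elements exist only to give each property its own scope; they do not change how the page renders.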
Checkpoint: Your HTML page should now look like check_2a.html
Copyright is an important subject for both creators and organizations
and individuals seeking to reuse or republish work, so naturally
schema.org includes a
copyrightHolder property that you
can apply. In this case, however, the author and the copyrightHolder
are one and the same, and you have already used the @property attribute on that element.
To define multiple property values for the same attribute, simply
include the values as a whitespace-delimited list. In this case, edit
the HTML to declare @property="author copyrightHolder"
and check your work in one or more structured data validators.
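A sketch of the multi-valued attribute, with the inner Person markup shortened for readability:

```html
<!-- Two properties of the TechArticle point at the same Person -->
<span property="author copyrightHolder" typeof="Person">
  <a href="http://example.com/AuthorName" property="sameAs"><span property="name">Dan Scott</span></a>
</span>
```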
Note: These are still relatively early days for structured
data validators, and their output varies for more esoteric cases
like multi-valued attributes. For example, the Structured Data Linter
recognizes the second value for the @property attribute but
generates a "blank node" identifier for it, whereas Google's
Structured Data Testing Tool only recognizes the last value of the
multi-valued attribute. To complicate matters further, the search
engines recognize that their tools have bugs that differ from what
their actual production parser understands... so don't be overly
alarmed if it seems like your markup is not being recognized by the tools.
Sometimes your HTML document does not group all of the content in such
a way that you can cleanly keep all of the attributes for a given
instance of a type within a single scope. In these cases, you may be
able to use the
@resource attribute to logically group the
properties for that instance.
For example, assume that you have been asked to add the following author
biography to the bottom of your technical article:
<p>Dan Scott is a systems librarian at Laurentian University.</p>
Now you have the perfect
description property for your
author--but it is separated from your existing Person
instance by all of the content in the middle.
To resolve the problem, simply add a @resource attribute
to your existing
Person declaration. The value of the new
attribute should be unique on this page; use "authorName" for the sake of
simplicity.
Add a wrapping
<div> element around the
biography section, including a
@resource attribute with
a value of "authorName" to match what you added above. This creates a new
scope for the existing type instance, such that any properties declared
within this new scope will be added to the existing type instance.
Now add a
@property="description" attribute to the
<p> element for the biography, and check your work
in the RDFa parser tools.
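The grouping could be sketched as follows. The identifier is written here as "#authorName", which is how the refresher listing in the next exercise spells it; the assumption in this sketch is simply that both @resource values match exactly.

```html
<p class="byline">
  <!-- The Person instance is given an identifier... -->
  By <span property="author" typeof="Person" resource="#authorName">
    <span property="name">Dan Scott</span>
  </span>
</p>

<!-- ...article content... -->

<!-- ...and a second scope elsewhere on the page refers back to it -->
<div resource="#authorName">
  <p property="description">Dan Scott is a systems librarian at Laurentian University.</p>
</div>
```

A parser should report the description as a property of the same Person that the byline describes, rather than of the TechArticle.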
Checkpoint: Your HTML page should now look like check_2c.html
Given the new author biography, there are several other structured data
assertions you can now make on the Person, such as the
worksFor property. Note that the expected type is
Organization. Add another embedded type here, and use
<meta> elements to make further assertions.
In this exercise, you learned:
How to use @resource to group assertions for a single type on a page
So far you have described the page using types and properties that are inside the page itself. But if you have to update some information that is common to many of your pages, that could be painful to roll out... and even if you have an automated process for updating that information across all of your pages, there is no guarantee that anything extracting data from your site will extract all of the updates at one time.
Fortunately, the problem of providing one copy of information on the web was solved at the same time the web was created: via the simple power of the link! And structured data is no different; in fact, linked data is a term that has emerged over the past few years marking a more pragmatic approach to building a web of structured data than the somewhat classically academic semantic web.
The following principles of linked data were first articulated by Tim Berners-Lee in a 2006 design note:
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
Keep these principles in mind as you work through the following steps!
Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_2c.html into a new file. As a refresher, your HTML should now look something like the following:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
  </head>
  <body vocab="http://schema.org/" typeof="TechArticle">
    <h1 property="name">Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" property="image" />
    <meta property="educationalUse" content="codelab">
    <p class="byline">
      By <span property="author" typeof="Person" resource="#authorName">
      <a href="http://example.com/AuthorName" property="sameAs"><span property="name"><span property="givenName">Dan</span> <span property="familyName">Scott</span></span></a></span>,
      <time property="datePublished" content="20140129">January 29, 2014</time>
    </p>
    <div property="description">
      <h2>About this codelab</h2>
    </div>
    <div property="articleBody">
      <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
      <h2>Exercise 2: Embedded types</h2>
      <h2>Exercise 3: From strings to things</h2>
    </div>
    <div resource="#authorName">
      <p property="description">
        Author Name is a systems librarian at Laurentian University.
      </p>
    </div>
  </body>
</html>
Take a look at how the page has developed over time; there is now a lot of HTML markup just to describe the author. It's a perfect candidate for refactoring; you can move the bulk of the markup to a separate page about the author. Once it is a separate page, then you can simply link to it from this page... as well as from any other pages that want to provide information about this author.
Create a new file named
authorName.html in your text editor,
and copy the
@resource="#authorName" markup into the file.
As the new file describes a single type, you can move the
declaration of the type into the
<body> element, and
you can remove the
@resource attributes from the markup
that you pasted into the file. Don't forget the @vocab
declaration! Use your existing page as a template.
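A rough sketch of what the new author page might contain is shown below. The <title> text and the choice of an <h1> for the name are assumptions made for this illustration; the checkpoint file may arrange the markup differently.

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <!-- Hypothetical title, not taken from the checkpoint file -->
    <title>About the author</title>
  </head>
  <!-- The type moves to <body>; no @resource is needed on this page -->
  <body vocab="http://schema.org/" typeof="Person">
    <h1 property="name"><span property="givenName">Dan</span> <span property="familyName">Scott</span></h1>
    <p property="description">
      Dan Scott is a systems librarian at Laurentian University.
    </p>
  </body>
</html>
```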
Use the RDFa parsers to ensure that the markup in the new file expresses the same information as it did in the original file.
Now, replace the inline markup in the original page with a simple link
to your new file. You still want to state that "Author Name" is the
author of the technical article using the @property="author"
assertion, but now you can simply add that property directly to the
<a> element that links to your new file. This is a
signal to any RDFa parser that the linked resource contains the data
for the named property.
Note: "when the element contains the @href (or @src) attribute,
@property is automatically
associated with the value of this attribute rather than the textual
content of the
<a> element" (Adida, Ben;
Birbeck, Mark; Herman, Ivan; Sporny, Manu. RDFa 1.1 Primer - Second
edition). Using a
@property attribute on the
same element as a
@resource attribute works in a similar
fashion; the target of the
@resource attribute is used as
the value of the property.
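A sketch of the simplified byline in the original page, assuming the new file is named authorName.html and sits in the same directory:

```html
<p class="byline">
  <!-- The link target, not the link text, becomes the author value -->
  By <a property="author" href="authorName.html">Author Name</a>,
  <time property="datePublished" content="2014-01-29">January 29, 2014</time>
</p>
```

This is the same href behavior that produced a surprising result in Exercise 1, now used deliberately to point at a page that describes the author.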
Checkpoint: Your original HTML page should now look like check_3b.html and your new author HTML page should look like exercises/check_3b_authorName.html.
Now that you have created an entirely separate author page, you can add much more information about the author; for example, you can include an email address, links to their personal web sites and social media accounts, a list of their publications and previous talks... far more information than you would have wanted to publish inline in the article itself.
Following the principles of linked data can lead not only to more efficient maintenance of information and (potentially) more useful results in search engines and other aggregators of data, but also to a better information design and experience for your users.
Use additional Person properties to flesh out the "about this author" page. Be
adventurous, and remember to try to use nested types and ranges.
In this exercise, you learned:
How to use @href attributes to link to data on another page
Dan Scott is a systems librarian at Laurentian University.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.