Assessing the quality of linked bibliographic data

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

2019-04-29 Creative Commons License

Printed card from a library catalogue describing a book published in 1969

Photo credit: worthpoint.com

MARC record from a library catalogue describing the same book, published in 1969, with the same values
Bibliographic data                       Linked data
Printed cards                            A web of documents
Digital card catalogues                  A web of data
Records (fields with values)             Entities and relationships
Text                                     Text and media
Records are shared, modified, reshared   Anyone can say anything about anything
Metadata                                 Metadata and data

Linked bibliographic data

  • From bibliographic records to linked bibliographic data:
    • Dublin Core Metadata Terms (dcterms) - 2003-
    • Bibliographic Ontology (bibo) - 2008-2009
    • FRBR-Aligned Bibliographic Ontology (fabio) - 2010-
    • Bibliographic Framework (BIBFRAME) - 2012-
    • Resource Description and Access (RDA) - 2014-
Part of a MARC21 record showing fields and text
MARC record data
Corresponding data as BIBFRAME showing relationships between the book and film
Linked data
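
To make the shift from fields with values to entities and relationships concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record values are invented, the https://example.org/ namespace is a placeholder, and the ex: predicates are hypothetical stand-ins. A real BIBFRAME description would use that vocabulary's own classes and properties.

```python
# Hypothetical, simplified record: MARC-style tags mapped to plain text values.
record = {
    "001": "example-0001",     # control number (invented)
    "100": "Author, Example",  # main entry, personal name
    "245": "An example title", # title statement
    "260": "1969",             # publication date, heavily simplified
}

BASE = "https://example.org/"  # placeholder namespace, not a real service


def record_to_triples(rec):
    """Turn flat fields into (subject, predicate, object) statements.

    The ex: predicates are illustrative only; they are not BIBFRAME terms.
    """
    work = f"{BASE}work/{rec['001']}"
    instance = f"{BASE}instance/{rec['001']}"
    # Build a crude slug for the agent so it can be an entity, not just a string.
    slug = rec["100"].lower().replace(", ", "-").replace(" ", "-")
    agent = f"{BASE}agent/{slug}"
    return [
        (instance, "ex:instanceOf", work),
        (instance, "ex:date", rec["260"]),
        (work, "ex:title", rec["245"]),
        (work, "ex:creator", agent),
        (agent, "ex:name", rec["100"]),
    ]


triples = record_to_triples(record)
for triple in triples:
    print(triple)
```

The point of the sketch is that the author stops being a text value inside one record and becomes an entity with its own identifier, which other descriptions (for example, a film adaptation) could link to.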

Data quality: why?

  • Standards are not sufficient, nor standard (Zhu et al., 2016)
  • Bibliographic data: history of collaborative efforts and challenges
  • Linked data: broad scope, sophisticated models, but limited results
  • Linked bibliographic data: early days, wide open

Seminal definition of data quality

"data that are fit for use by data consumers" (Wang & Strong, 1996)
  • "data consumers" means context is critical
  • Context for this work was MIS with defence systems

Data quality dimensions

Dimension CS models LIS models MIS models Total
Intrinsic structural consistency 6 4 4 14
Intrinsic accuracy 4 3 4 11
Intrinsic completeness 4 3 4 11
Accessibility 4 2 2 8
Completeness 5 1 2 8
Relevancy 4 2 2 8
Reputation 5 1 2 8
Timeliness 6 0 2 8
Relational accuracy 4 2 1 7
Relational structural consistency 5 1 1 7
Intrinsic concise representation 5 0 2 7

15 studies across computer science (CS, n=7), library and information science (LIS, n=4), and management information systems (MIS, n=4)

Data quality dimensions (continued)

Dimension CS models LIS models MIS models Total
Intrinsic complexity 0 2 2 4
Relational semantic consistency 1 3 0 4
Currency 1 2 1 4
Volatility 1 2 1 4

15 studies across CS (n=7), LIS (n=4), and MIS (n=4)

Data quality research methods

Method CS LIS MIS Grand Total
Automated analysis 9 2 1 12
Benchmark 3 3
Case studies 1 1 1 3
Content analysis 8 8
Deductive reasoning 2 2 4
Expert opinion 2 2
Survey (quantitative) 2 3 5
Test datasets 1 2 3

33 studies across CS (n=11), LIS (n=19), and MIS (n=3)

Gaps and opportunities

Generalizability

  • Linked data: yes, but specificity is lacking
  • Bibliographic literature: no
    • Small scale
    • Dublin Core
    • Context bound

Reproducibility

  • External validity threat: history vs. open datasets
  • Lack of access to implementations, in-house datasets

Automation: breadth vs. depth

  • Billions of records: mostly syntax
  • Handfuls of records: rich semantics
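
The contrast between syntax and semantics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a tool from the literature: the description keys (id, published, creator_born) and the single plausibility rule are hypothetical. Syntactic checks need only the value itself, so they scale easily; semantic checks need knowledge of how values relate to each other.

```python
import re
from urllib.parse import urlparse

YEAR = re.compile(r"\d{4}")


def syntax_problems(desc):
    """Checks that look only at the form of each value."""
    problems = []
    parsed = urlparse(desc.get("id", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("id is not an absolute HTTP(S) IRI")
    for key in ("published", "creator_born"):
        value = desc.get(key)
        if value is not None and not YEAR.fullmatch(value):
            problems.append(f"{key} is not a four-digit year")
    return problems


def semantic_problems(desc):
    """Checks that depend on how values relate to one another."""
    problems = []
    published = desc.get("published")
    born = desc.get("creator_born")
    if published and born and YEAR.fullmatch(published) and YEAR.fullmatch(born):
        # Hypothetical rule: a creator cannot be born in or after the publication year.
        if int(born) >= int(published):
            problems.append("creator born in or after the publication year")
    return problems


# Invented description: well formed, yet implausible.
desc = {
    "id": "https://example.org/instance/1",
    "published": "1969",
    "creator_born": "1975",
}
print(syntax_problems(desc))
print(semantic_problems(desc))
```

The invented description passes every syntactic check and still fails the semantic one, which is the kind of problem that automated analysis over billions of records tends to miss.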

Research questions

  1. How should libraries conceive of quality in the domain of linked bibliographic data?
  2. How can libraries meaningfully assess the quality of linked bibliographic data?
  3. How can libraries assess the quality of linked bibliographic data at scale?

References

Appendices

Proposal

  1. Synthesize a conceptual model for quality assessment of linked bibliographic data
  2. Compile a gold standard dataset of linked bibliographic data using the BIBFRAME vocabulary (Aalberg et al., 2018; Decourselle, 2016)
  3. Implement the conceptual model metrics in an automated assessment tool

Significance

  • Constructivist grounded theory is a new approach to model development for data quality
  • Gold standard dataset will ease the efforts of other researchers, in data quality and other inquiries
  • Assessment tool should enable libraries to evaluate and improve their linked bibliographic data early

Methodologies

  • Surveys
  • Interviews
  • Card sorting
  • Content analysis
  • Log analysis
  • Functional requirements heuristics
  • Deductive reasoning
  • Automated tools

Aggregations

From national union catalogues, cooperative cataloguing efforts, and OAI-PMH harvested records, to open knowledge graphs

Boissonnas (1979); Bruce & Hillmann (2004); Debattista et al. (2018); El-Sherbini (2010); Färber et al. (2017); Hider & Tan (2008); Moen et al. (1998); Stvilia et al. (2007).

Fitness for use

  • Heavy dependence on IFLA LRM (2017) tasks:
    • Find
    • Identify
    • Select
    • Obtain
    • Explore
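
One way to read "fit for use" against these tasks is to ask, task by task, whether a description carries the properties that task would need. The Python sketch below assumes a mapping from tasks to required properties. That mapping is hypothetical and invented for this example; IFLA LRM does not prescribe it, and the property names are placeholders.

```python
# Hypothetical task profiles: which properties each user task is assumed to need.
TASK_PROFILES = {
    "find": {"title", "creator"},
    "identify": {"title", "creator", "date", "identifier"},
    "select": {"language", "format"},
    "obtain": {"access_url"},
    "explore": {"related"},
}


def task_fitness(desc, profiles=TASK_PROFILES):
    """Return, per task, the share of required properties that are present and non-empty."""
    scores = {}
    for task, required in profiles.items():
        present = sum(1 for prop in required if desc.get(prop))
        scores[task] = present / len(required)
    return scores


# Invented description with no identifier, language, access URL, or related links.
desc = {
    "title": "An example title",
    "creator": "Author, Example",
    "date": "1969",
    "format": "print",
}
scores = task_fitness(desc)
for task, score in scores.items():
    print(f"{task}: {score:.2f}")
```

The same description can score well for one task and poorly for another, which is why a single overall quality number would hide the dependence on context.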

Data consumers

  • "Data consumers" depend on context
  • Linked data broadens possible contexts
  • Consumers... but also producers?