Assessing the quality of linked bibliographic data

Dan Scott <https://dscott.ca/#i>

PhD student, McGill University

2019-04-02 Creative Commons License

Bibliographic data                     | Linked data
Printed cards                          | A web of documents
Digital card catalogues                | A web of data
Records (fields with values)           | Entities and relationships
Text                                   | Text and media
Records are shared, modified, reshared | Anyone can say anything about anything
Metadata                               | Metadata and data

[Figure: MARC record data. Part of a MARC21 record showing fields and text.]
[Figure: Linked data. Corresponding data as BIBFRAME showing relationships between the book and film.]
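
The contrast between records (fields with values) and linked data (entities and relationships) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tags 100 and 245 are real MARC 21 tags, but the field values, the `http://example.org/` identifiers, and the `ex:` property names are invented here and are not BIBFRAME terms.

```python
# A record: fields with text values. Tags 100 (main entry, personal name)
# and 245 (title statement) are real MARC 21 tags; the values are invented.
marc_like_record = {
    "100": "Author, Example",
    "245": "An example novel",
    "500": "Basis for a film of the same name",
}

# Linked data: entities identified by IRIs and connected by relationships.
# The example.org IRIs and the "ex:" property names are hypothetical.
EX = "http://example.org/"
triples = [
    (EX + "book/1", "ex:title", "An example novel"),
    (EX + "book/1", "ex:creator", EX + "person/1"),
    (EX + "person/1", "ex:name", "Author, Example"),
    (EX + "film/1", "ex:basedOn", EX + "book/1"),
]


def related_entities(triples, iri):
    """Return the IRIs of entities linked to `iri` in either direction."""
    linked = set()
    for subject, _predicate, obj in triples:
        if subject == iri and obj.startswith(EX):
            linked.add(obj)       # outgoing link to another entity
        elif obj == iri:
            linked.add(subject)   # incoming link from another entity
    return linked


print(related_entities(triples, EX + "book/1"))
```

In the record, the film is only mentioned in a free-text note; in the triples, the film is an entity in its own right that software can follow from the book.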

Linked bibliographic data

  • From bibliographic records to linked bibliographic data:
    • Dublin Core Metadata Terms (dcterms) - 2003-
    • Bibliographic Ontology (bibo) - 2008-2009
    • FRBR-Aligned Bibliographic Ontology (fabio) - 2010-
    • Bibliographic Framework (BIBFRAME) - 2012-
    • Resource Description and Access (RDA) - 2014-

Data quality: why?

  • Standards are not sufficient, nor standard (Zhu et al., 2016)
  • Bibliographic data: history of collaborative efforts and challenges
  • Linked data: broad scope, sophisticated models, but limited results
  • Linked bibliographic data: early days, wide open

Seminal definition of data quality

"data that are fit for use by data consumers" (Wang & Strong, 1996)
  • "data consumers" means context is critical
  • Context for this work was MIS with defence systems

Wang & Strong (1996) model

Category         | Dimensions
Intrinsic        | believability, accuracy, objectivity, reputation
Contextual       | value-added, relevancy, timeliness, completeness, amount
Representational | interpretability, ease of understanding, representational consistency, concise representation
Accessibility    | accessibility, access security

Bibliographic data models

Paper                   | Categories (# dimensions)
Moen et al. (1998)      | Completeness, accuracy, serviceability
Bruce & Hillmann (2004) | Completeness, accuracy, provenance, conformance, coherence, timeliness, accessibility
Stvilia et al. (2007)   | Intrinsic (9), relational/contextual (12), reputational (1)
Ochoa & Duval (2009)    | Adopts the Bruce & Hillmann (2004) categories
Gavrilis et al. (2015)  | Completeness, accuracy, consistency, appropriateness, auditability

Linked data models

Paper                    | Categories (# dimensions)
Pipino et al. (2002)     | Wang & Strong (1996) + free-of-error, objectivity
Fürber & Hepp (2011)     | Completeness, accuracy, timeliness, uniqueness
Zaveri et al. (2016)     | Intrinsic (5), contextual (4), representational (4), accessibility (5)
Färber et al. (2017)     | Intrinsic (3), contextual (3), representational (2), accessibility (3)
Debattista et al. (2018) | Intrinsic, contextual, representational, accessibility
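
One way to compare the models above is to treat each as a set of named dimensions and look at the overlap. The sketch below covers only the four models from the two tables that list their dimensions by name; models given as counts per category are left out. The function names are illustrative assumptions.

```python
from collections import Counter

# Dimensions exactly as named in the tables above. Models that list only
# counts per category (e.g. "Intrinsic (9)") are omitted.
MODELS = {
    "Moen et al. (1998)": {"completeness", "accuracy", "serviceability"},
    "Bruce & Hillmann (2004)": {
        "completeness", "accuracy", "provenance", "conformance",
        "coherence", "timeliness", "accessibility",
    },
    "Gavrilis et al. (2015)": {
        "completeness", "accuracy", "consistency",
        "appropriateness", "auditability",
    },
    "Fürber & Hepp (2011)": {
        "completeness", "accuracy", "timeliness", "uniqueness",
    },
}


def dimension_counts(models):
    """Count how many models name each dimension."""
    return Counter(dim for dims in models.values() for dim in dims)


def shared_by_all(models):
    """Return the dimensions that every model names."""
    return set.intersection(*models.values())


counts = dimension_counts(MODELS)
print(sorted(shared_by_all(MODELS)))  # dimensions common to all four models
print(counts.most_common())           # every dimension with its model count
```

For these four models, only completeness and accuracy are named by all of them; most other dimensions appear in a single model.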

Gaps and opportunities

Generalizability

  • Linked data: yes, but specificity is lacking
  • Bibliographic literature: no
    • Small scale
    • Dublin Core
    • Context bound

Reproducibility

  • External validity threat: history vs. open datasets
  • Lack of access to implementations, in-house datasets

Automation: breadth vs. depth

  • Billions of records: mostly syntax
  • Handfuls of records: rich semantics
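
The breadth versus depth trade-off can be illustrated with two checks of very different cost. This is a sketch under stated assumptions, not a description of any surveyed tool: the first check needs only the triple itself and so scales to billions of statements, while the second needs domain knowledge about what the values mean. The dictionary keys `published` and `born` are hypothetical.

```python
from urllib.parse import urlparse


def is_syntactically_valid(triple):
    """Cheap, scalable check: subject and predicate are absolute HTTP(S) IRIs."""
    subject, predicate, _obj = triple
    for term in (subject, predicate):
        parsed = urlparse(term)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return False
    return True


def is_semantically_plausible(work, creator):
    """Costly check that needs domain knowledge and linked context:
    a work should not be published before its creator was born."""
    return work["published"] >= creator["born"]


good = ("http://example.org/book/1", "http://example.org/title", "An example novel")
bad = ("book/1", "http://example.org/title", "An example novel")
print(is_syntactically_valid(good), is_syntactically_valid(bad))

# Both statements below would pass a syntax check, yet together they are implausible.
print(is_semantically_plausible({"published": 1850}, {"born": 1900}))
```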

Research questions

  1. How should libraries conceive of quality in the domain of linked bibliographic data?
  2. How can libraries meaningfully assess the quality of linked bibliographic data?
  3. How can libraries assess the quality of linked bibliographic data at scale?

References

Appendices

Proposal

  1. Synthesize a conceptual model for quality assessment of linked bibliographic data
  2. Compile a gold standard dataset of linked bibliographic data using the BIBFRAME vocabulary (Aalberg et al., 2018; Decourselle, 2016)
  3. Implement the conceptual model metrics in an automated assessment tool
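
As a rough illustration of how steps 2 and 3 could fit together, the sketch below scores a candidate description against a gold standard description of the same resource. It is an assumption for illustration only: the proposal does not specify its metrics, and the `ex:` property names and the simple set-overlap definition of completeness are hypothetical.

```python
def completeness(candidate, gold):
    """Share of gold standard statements that the candidate also contains."""
    if not gold:
        return 1.0  # nothing is expected, so nothing is missing
    return len(candidate & gold) / len(gold)


def missing_statements(candidate, gold):
    """Gold standard statements absent from the candidate."""
    return gold - candidate


# Each statement is a (property, value) pair about one resource.
gold = {
    ("ex:title", "An example novel"),
    ("ex:creator", "http://example.org/person/1"),
    ("ex:language", "eng"),
    ("ex:subject", "http://example.org/topic/1"),
}
candidate = {
    ("ex:title", "An example novel"),
    ("ex:creator", "http://example.org/person/1"),
    ("ex:language", "eng"),
    ("ex:note", "Local note"),
}

print(completeness(candidate, gold))
print(missing_statements(candidate, gold))
```

Exact set matching is the simplest possible comparison; a real tool would likely need to tolerate equivalent IRIs and variant literal forms.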

Significance

  • Constructivist grounded theory is a new approach to model development for data quality
  • Gold standard dataset will ease the efforts of other researchers, in data quality and other inquiries
  • Assessment tool should enable libraries to evaluate and improve their linked bibliographic data early

Methodologies

  • Surveys
  • Interviews
  • Card sorting
  • Content analysis
  • Log analysis
  • Functional requirements heuristics
  • Deductive reasoning
  • Automated tools

Aggregations

From national union catalogues, cooperative cataloguing efforts, and OAI-PMH harvested records, to open knowledge graphs

Boissonnas (1979); Bruce & Hillmann (2004); Debattista et al. (2018); El-Sherbini (2010); Färber et al. (2017); Hider & Tan (2008); Moen et al. (1998); Stvilia et al. (2007)

Fitness for use

  • Heavy dependence on IFLA LRM (2017) tasks:
    • Find
    • Identify
    • Select
    • Obtain
    • Explore
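
Fitness for use against the five user tasks could be sketched as a per-task score. The task names come from the list above; the mapping from each task to the properties it needs is a hypothetical assumption made for illustration and is not taken from IFLA LRM.

```python
# Hypothetical mapping from each user task to the properties it relies on.
TASK_REQUIREMENTS = {
    "find": {"title", "creator", "subject"},
    "identify": {"title", "creator", "date", "identifier"},
    "select": {"language", "format", "subject"},
    "obtain": {"identifier", "location"},
    "explore": {"related_work", "subject"},
}


def task_support(description):
    """For each task, return the share of required properties with a value."""
    present = {name for name, value in description.items() if value}
    return {
        task: len(required & present) / len(required)
        for task, required in TASK_REQUIREMENTS.items()
    }


description = {
    "title": "An example novel",
    "creator": "http://example.org/person/1",
    "date": "",  # an empty value counts as missing
    "identifier": "http://example.org/book/1",
}

for task, score in task_support(description).items():
    print(f"{task}: {score:.2f}")
```

Scoring by task rather than by record makes the dependence on context explicit: the same description can serve one task well and another not at all.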

Data consumers

  • "Data consumers" depend on context
  • Linked data broadens possible contexts
  • Consumers... but also producers?