VI. Datasets/Databases
The Library is aware that, in some cases, the provision of datasets and databases for current research uses (including support for the U.S. Congress) may depend upon native formats and associated software, while preservation and long-term access may depend upon data migration via transport or export formats, with a concomitant risk of loss of precision and accuracy. Given that the focus of this document is preservation and long-term access, the following format preferences favor those outcomes.
- Datasets
(For Geospatial Data, see Section VI.ii below)
Preferred:
- Formats
- Platform-independent, character-based formats are preferred over native or binary formats as long as the data is complete and retains full detail and precision. Preferred formats include well-developed, widely adopted, de facto marketplace standards, e.g.
- Self-describing, e.g. JSON, XML-based data formats using well known schemas, XML-based data formats accompanied by schema employed
- Line-oriented, e.g. TSV, CSV, fixed-width
- Platform-independent open formats, e.g. SQLite (.sqlite, .db, .db3)
- Character Encoding, in descending order of preference:
- UTF-8, UTF-16 (with BOM)
- US-ASCII or ISO 8859-1
- Other named encoding
- Related Materials
- Consult the appropriate sections of this document to identify the preferred formats for supplementary material
- Delivery Method
- Hard drive; CD-ROM; DVD-ROM, or other media not yet assigned
- Metadata
- Deposits should include all applicable metadata, data dictionaries, XML schemas, and technical specifications as appropriate. Discipline-specific metadata standards should be used whenever possible
- As supported by format:
- Title
- Creator
- Creation date
- Place of publication
- Publisher/ producer/ distributor
- Contact information
- A list of software used to produce, render or compress the data (if applicable)
- Character encoding
- Include if available:
- Language of work
- Other relevant identifiers (e.g., DOI, LCCN, canonical URL, etc.)
- Subject descriptors
- Abstracts
- Key or reference to each data field
- Checksums
- Version-control IDs or tags
- Information about how the data was collected and any sampling or post-processing which has been applied
- Known copyright terms, especially for datasets which combine data from multiple sources
- For datasets serving as part of a database: proprietary database package and version
- Technological Measures
- Files must contain no measures (such as digital rights management or encryption) that control access to or prevent use of the digital work.
- Files in formats which support linking or embedding external resources (e.g. XML, JSON, Excel) should be self-contained to remain useful in the event of external service changes.
- Files in formats which support executable code (e.g. Excel) do not contain executable code.
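The character-encoding order and the checksum item in the metadata list above can be checked before deposit. The following is a minimal sketch, not a Library-endorsed tool: the function names `guess_encoding` and `build_manifest`, the manifest keys, and the choice of SHA-256 are all assumptions made for illustration. Note that ISO 8859-1 cannot be detected from bytes alone, so anything that is not valid UTF-8 or BOM-marked UTF-16 is reported as undetermined and should be declared in the metadata.

```python
import hashlib
from pathlib import Path


def guess_encoding(raw: bytes) -> str:
    """Label raw bytes according to the encoding preference order (illustrative)."""
    # UTF-16 is only recognised here when a byte-order mark is present.
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        try:
            raw.decode("utf-16")
            return "UTF-16 (with BOM)"
        except UnicodeDecodeError:
            pass
    # Pure 7-bit content is US-ASCII, which is also valid UTF-8.
    try:
        raw.decode("ascii")
        return "US-ASCII"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Could be ISO 8859-1 or another named encoding; bytes alone cannot tell.
        return "undetermined"


def build_manifest(root: Path) -> list[dict]:
    """Return one entry per file under root: relative path, SHA-256, encoding label."""
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Reads each file fully into memory; very large files would need chunked hashing.
        raw = path.read_bytes()
        entries.append(
            {
                "file": path.relative_to(root).as_posix(),
                "sha256": hashlib.sha256(raw).hexdigest(),
                "encoding": guess_encoding(raw),
            }
        )
    return entries
```

Calling `build_manifest(Path("deposit"))` on a hypothetical `deposit` directory yields a list that could be written out alongside the data as part of the accompanying metadata.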
Acceptable:
- Formats
- Non-proprietary, publicly documented formats endorsed as standards by a professional community or government agency, e.g. CDF, HDF
- Any proprietary format that is a de facto standard for a profession or supported by multiple tools (e.g. Excel .xls or .xlsx)
- Related Materials
- See Preferred
- Delivery Method
- See Preferred
- Metadata
- See Preferred
- Technological measures
- Files in formats which support executable code do not depend on embedded programs for purposes other than display (e.g. search, filtering, etc.); the raw data is available without executing code.
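Where a dataset has to be migrated to a character-based export format, the concern raised above is that the data stay complete and keep full detail and precision. The sketch below illustrates one way to export a SQLite table to UTF-8 CSV and read a numeric column back for comparison. It is an illustration under stated assumptions rather than a prescribed workflow: the function names `export_table_to_csv` and `read_float_column` are hypothetical, and the sketch does not distinguish NULL from an empty string or handle BLOB columns.

```python
import csv
import sqlite3


def export_table_to_csv(db_path: str, table: str, csv_path: str) -> int:
    """Write every row of one SQLite table to a UTF-8 CSV file; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [col[0] for col in cursor.description]
        count = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            for row in cursor:
                # Python writes floats in their shortest round-trip form,
                # so REAL values can be parsed back without loss.
                # NULL values are written as empty fields (a known limitation).
                writer.writerow(row)
                count += 1
        return count
    finally:
        conn.close()


def read_float_column(csv_path: str, column: str) -> list[float]:
    """Read one column of the exported CSV back as floats, for comparison."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        return [float(row[column]) for row in csv.DictReader(fh)]
```

Comparing the values returned by `read_float_column` with the values queried directly from the database gives a simple check that the export did not lose precision.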
- Geospatial Data
Preferred:
- Formats, in order of preference
- Most complete data (all layers, appendices), even if proprietary
- Formats compatible with widely adopted GIS (e.g. ArcGIS)
- Formats compatible with recommendations and tools from geospatial open source and open data communities
- Formats developed or endorsed by the Open Geospatial Consortium (OGC) (e.g., GML; see http://www.opengeospatial.org/)
- Formats supported by well-supported open source software libraries such as GDAL, OGR and GeoTools
- Related Materials
- Consult the appropriate sections of this document to identify the preferred formats for supplementary material
- Delivery Method
- Hard drive; CD-ROM; DVD-ROM, or other media not yet assigned
- Metadata
- For metadata information, see the Federal Geographic Data Committee (FGDC)
- To the extent allowed by the underlying format, include available information about how the data was collected and any post-processing which has been applied
- Technological measures
- Files must contain no measures (such as digital rights management or encryption) that control access to or prevent use of the digital work.
Acceptable:
- Formats, in order of preference
- See Preferred
- Related Materials
- See Preferred
- Delivery Method
- See Preferred
- Metadata
- See Preferred
- Technological measures
- See Preferred
- Databases
Preferred:
- Preservation
- Complete set of the content contained within the database, conforming to preferred specifications in sec. VI.i-ii
- Access, in order of preference
- Publisher web interface, with
- Comprehensive and user-friendly search and discovery
- COUNTER-compliant usage statistics
- Delivered preservation content
Acceptable:
- Preservation
- See Preferred
- Access, in order of preference
- Documented API