VII. Websites
This format specification covers the Library’s preferred format for archived web content, as well as a preferred “format” for presenting web content for archiving (in other words, best practices that help content creators build preservation-friendly websites). The Library is aware that websites, including blogs, social media, and the other web content they comprise, are created and presented in formats designed for viewing in a web browser, and that these formats often differ from the standard format recommended for preservation and long-term access. Because the focus of this document is preservation and long-term access, the following format preferences favor those outcomes and include recommended best practices that better enable preservation of web content.
Preferred:
- Technical Characteristics
- Website creators can improve the archivability of web content by following best practices such as:
- Using sitemaps and stable URLs
- Using open formats
- Following accessibility standards, such as:
- Section 508 (https://www.access-board.gov/guidelines-and-standards/communications-and-it/about-the-section-508-standards/guide-to-the-section-508-standards)
- Web Content Accessibility Guidelines (WCAG) (https://www.w3.org/WAI/intro/wcag)
- United States Web Design Standards (https://standards.usa.gov/)
- Providing page-specific titles and descriptions, publication or update dates, and meaningful web addresses, when possible, to convey the substance of the content presented
- Resources that address this further and may be helpful to content creators can be found on the Library of Congress Guide to Creating Preservable Websites (http://loc.gov/webarchiving/preservable.html)
- Formats
- The Library and other organizations involved in web archiving preserve web content in the Web ARChive (WARC) format
- Delivery Method
- Capture using tools that produce non-proprietary output, to conform with standard formats and requirements
- Metadata
- Refer to the WARC specification (ISO 28500) for mandatory and recommended metadata fields
- When displaying archived content, the following should be clearly indicated:
- archiving institution,
- date and time of capture,
- statements about functionality within the archive to distinguish from the live site
- Technological Measures
- Websites should not contain measures (such as content behind logins or only accessible through search functions) that control access to or prevent capture of the digital work.
- Robots.txt restrictions should be set so that they do not block crawlers from capturing important content, such as images and style sheets, which allow the site to be replayed as it looked at the time of capture.
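The robots.txt recommendation above can be checked before a crawl. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` module; the user-agent name `example-archiver`, the sample rules, and the `example.org` URLs are hypothetical and stand in for a site's real robots.txt and the resources needed for replay.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt: str, user_agent: str, urls: List[str]) -> List[str]:
    """Return the URLs that the given robots.txt rules forbid for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules: the style sheet directory is disallowed for all crawlers.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /css/
Disallow: /private/
"""

# Hypothetical resources a crawler would need to replay the page faithfully.
REPLAY_RESOURCES = [
    "https://example.org/css/site.css",
    "https://example.org/images/logo.png",
    "https://example.org/index.html",
]

blocked = blocked_resources(SAMPLE_ROBOTS, "example-archiver", REPLAY_RESOURCES)
print(blocked)
```

Any style sheet or image URL that appears in the result would be skipped by a crawler that honors robots.txt, so the archived page may not replay as it originally looked.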
Acceptable:
- Technical Characteristics
- See Preferred
- Formats
- Internet Archive’s ARC_IA format, a precursor to the WARC format
- Delivery Method
- Transmission of WARC or ARC_IA files created by web content producers or other archiving organizations
- Metadata
- ARC_IA files should be named in a manner that easily identifies the archiving institution (see the WARC standard for recommended naming conventions)
- Technological Measures
- Tools currently available cannot capture all web content, so certain types of web content may not be preservable through web capture at this time. These include:
- Multimedia-rich content
- Streaming media
- Deep web content
- Databases
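To illustrate the file-naming guidance under Metadata above, here is a minimal sketch of building a file name that identifies the archiving institution. It assumes the Prefix-Timestamp-Serial-Crawlhost pattern suggested in the WARC standard's informative annex; the function name `archive_file_name`, the five-digit serial padding, the `arc.gz` extension, and all sample values are hypothetical, so consult the standard itself before adopting a convention.

```python
from datetime import datetime, timezone


def archive_file_name(prefix: str, captured_at: datetime, serial: int,
                      crawl_host: str, extension: str = "arc.gz") -> str:
    """Build a name of the form Prefix-Timestamp-Serial-Crawlhost.extension.

    prefix:      short label identifying the institution or crawl (assumption)
    captured_at: timezone-aware datetime, rendered as a 14-digit UTC timestamp
    serial:      increasing serial number, zero-padded to five digits (assumption)
    crawl_host:  name of the machine that created the file
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{prefix}-{timestamp}-{serial:05d}-{crawl_host}.{extension}"


# Hypothetical values for illustration only.
name = archive_file_name(
    "EXAMPLEORG",
    datetime(2017, 3, 1, 12, 30, 0, tzinfo=timezone.utc),
    1,
    "crawler01.example.org",
)
print(name)
```

Placing the institution or project label first keeps files from the same source grouped together when sorted, and the UTC timestamp keeps them in capture order.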