Web data is already highly ephemeral


Post by asimm22 »

This End of Term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, the National Institute for Literacy (NIFL) website was archived in 2008; the domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov, were later folded into larger agency domains. Every archived web page is accessible through the Wayback Machine, and past and current End of Term collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.

The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.

“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving. “Web data is already highly ephemeral, and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”