Like anything with immense cultural and historical impact, the internet should be archived for posterity. At this Mozilla Festival event, Thomas Preece explained the reasons for and strategies behind web archiving.
Given its intangibility and vastness, it may seem strange to expend energy on archiving the internet. However, as Thomas Preece explained, there are important reasons to do so. The first is that the internet now underpins much of our lives, and by archiving it we can preserve our culture and collective history before it is deleted at the click of a button and reduced to a 404 error page. He clarified that, despite the warnings that what we post stays online forever, web content is often short-lived: around 40% of web citations in journal articles become inaccessible within four years of publication. This decay is often down to the cost of keeping a website running, but archiving offers a solution.
Another reason for web archiving is to combat the spread of disinformation, which is vital in this age of ‘fake news’. An example Preece gave was that a politician might make a statement and post it on their social media, before deleting it and pretending it had never happened. If the post had been archived, it would serve as evidence to the contrary.
The core technology used for web archiving is the crawler: software that visits a page, saves a copy, and then follows its links to capture further pages. One of the oldest archives is the Wayback Machine, run by the Internet Archive (founded in 1996), where you can explore 549 billion web pages saved over time. Some tools, such as Heritrix, perform automated capture, while services like archive.org or conifer.rhizome.org let you request a crawl of a specific site.
While crawling is a useful archiving tool, Preece outlined its limitations. The quality of a crawl is judged by how complete its coverage of a site is. However, the larger and more interactive a site, the harder it is for a crawler to capture it successfully, and a crawler can even get stuck in a loop, archiving the same page over and over. It is also unwise to rely solely on services like the Wayback Machine for comprehensive archiving, as they only capture sites they deem significant.
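The crawling process described above can be sketched in a few lines. The example below is a minimal illustration, not any real archiving tool: it crawls a hypothetical in-memory "site" (a real crawler would fetch pages over HTTP), and its `visited` set is what prevents the looping problem, since cyclic links between pages are simply skipped once seen.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": URL -> HTML body. A real crawler would
# fetch these pages over HTTP instead.
SITE = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',        # links back to home: a cycle
    "/news": '<a href="/news/1">Item</a>',
    "/news/1": '<a href="/news">Back</a>',   # another cycle
}

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(seed):
    """Breadth-first crawl from a seed URL. The visited set skips
    already-captured pages, so cyclic links cannot trap the crawler."""
    visited = set()
    frontier = [seed]
    while frontier:
        url = frontier.pop(0)
        if url in visited or url not in SITE:
            continue
        visited.add(url)  # "archive" the page
        parser = LinkExtractor()
        parser.feed(SITE[url])
        frontier.extend(parser.links)
    return visited

archived = crawl("/")
# Coverage check: a complete crawl reaches every page on the site.
print(archived == set(SITE))  # True
```

Coverage here is easy to verify because we know the full page set; on a real site, comparing the crawled URLs against a sitemap is one practical way to estimate how complete a capture is.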
There are ways to counter these problems. Preece emphasised that companies and institutions should bear the burden of archiving their own content using whatever strategy works best for them, since preservation should be a shared responsibility. Employees who understand how a particular website is built can modify a crawler to capture it more efficiently. It is also important to notify a crawler when a page changes, as you cannot rely wholly on the crawler's schedule to keep up with a site's frequent updates.
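One widely supported way for a site to signal changes to crawlers is a sitemap: a machine-readable list of a site's pages with last-modified dates. A minimal sketch (the URL is illustrative) might look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/news</loc>
    <!-- lastmod tells a crawler when this page last changed, so it can
         re-capture updated pages ahead of its usual schedule -->
    <lastmod>2024-05-01</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>
```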
This Mozilla Festival event highlighted the importance of archiving the internet, a topic that often goes unmentioned in discussions of preserving culture and history. We often take the existence of the internet for granted, but it is not as permanent as it seems.
You can find Thomas Preece’s web archiving resources here.