Web Archiving

By Alex Ball, DCC, UKOLN at University of Bath

Published: 1 March 2010

Web archiving is important not only for future research but also for organisations' records management processes.

There are technical, organisational, legal and social issues that Web archivists need to address, some general and some specific to types of content or archiving operations of a given scope.

Many of these issues are being addressed in current research and development projects, as are questions concerning how archived Web material may integrate with the live Web.

Since its invention in 1989 and subsequent release in 1991, the World Wide Web has grown in both size and popularity with such vigour that it has eclipsed most of the other applications that run on the Internet.

Its importance as an information resource is undisputed – the number of reference works scaling down or abandoning their print runs in favour of online editions is testament to that – but it is also increasingly important as an expression of contemporary culture.

The advent of blog service providers and social networking sites has lowered the barriers for people wishing to express themselves on the Web, meaning even those with few technical skills and limited Internet access can publish their thoughts, ideas and opinions.

The value of preserving snapshots of the Web for future reference and study was quickly recognised, with the Internet Archive and the National Library of Sweden both starting their large-scale harvests of Web sites in 1996. Since that time, Web archiving – the selection, collection, storage, retrieval, and maintenance of the integrity of Web resources – has become more widespread, assisted by ever more advanced tools. Perfecting the process remains something of a moving target, however, both in the quantities involved and in the sophistication and complexity of the subject material. This is without factoring in the growing demands of the research questions for which the archived material might be expected to act as evidence.

The DCC has produced a report that provides a snapshot of the state of the art of Web archiving in early 2010, noting areas of contemporaneous research and development.

It should be of interest to individuals and organisations concerned about the longevity of the Web resources to which they contribute or refer, and who wish to consider the issues and options in a broad context. The report begins by reviewing in more detail the motivations that lie behind Web archiving, from both an organisational and a research perspective. The most common challenges faced by Web archivists are discussed in section 3. The following two sections examine Web archiving at extremes of scale, with section 4 dealing with full-domain harvesting and the building of large-scale collections, and section 5 dealing with the ad hoc archiving of individual resources and small-scale collections. The challenges associated with particular types of difficult content are summarised in section 6, while methods for integrating archived material with the live Web are reviewed in section 7. Finally, some conclusions are drawn in section 8.

Further information

For the full text, download the report, Web Archiving (PDF, 261 KB), by Alex Ball, UKOLN, University of Bath.