
Preservation and Curation in Institutional Repositories

By Alex Ball, DCC, UKOLN at University of Bath

Published: 10 March 2010

Institutional repositories were originally intended as a way of giving immediate and wide access to research papers.

They are increasingly taking on a role as curators of institutional digital output, requiring the adoption of specific policies and tools for preservation and curation.

In the ten years since the first dedicated institutional repository software was released, a range of tools have been developed to assist in everything from drawing up preservation plans and policies to extracting preservation metadata from files, alongside modular architectures for linking all the tools together.

While the uptake of these technologies and techniques in repositories is modest, there are encouraging signs of progress.

The concept of online repositories of scientific publications has been around as long as the World Wide Web. The first e-print repository was arXiv, established in August 1991 at the Los Alamos National Laboratory and initially serving high energy theoretical physics.

Physicists in this area already had a culture of sending each other hard-copy pre-prints of articles as they were completed, as a more rapid form of dissemination than journals could provide.

The ‘hep-th’ database provided them with a dissemination route that was cheaper and much easier to administer, as well as being even faster than the paper-based systems. The arXiv repository has since expanded to cover most other areas of physics, as well as areas of mathematics and computer science.

The success of arXiv was followed by the launch of similar services for other disciplines and large institutions, and eventually led to the formation of the Open Archives Initiative (OAI).

One of the outcomes of the first meeting of the OAI was the adaptation of the software underlying CogPrints to make it easier for institutions to set up their own repositories: this software was named EPrints and released in 2000. Since then, other institutional repository systems have emerged, notably DSpace and Fedora.

The question of whether institutional repositories should have preservation responsibilities was raised early on.

In some quarters, there was (and still is) strong resistance to the notion, the argument being that repositories are solely for the purposes of accelerating the dissemination and widening the impact of high quality research; preservation should more properly target the ‘official’ printed record, published in journals and held by libraries.

There are, of course, arguments counter to this view. One is that the usefulness of e-prints does not cease once the printed versions are published. Open access to e-prints means that researchers can still access the material even if they do not belong to an institution that subscribes to the journal, or one that could obtain a copy through interlibrary loan.

Another argument is that institutional repositories can hold more than just surrogates of journal articles: for example, expanded versions of articles, underlying data, unpublished conference papers, teaching and learning resources, multimedia presentations, or corporate material (administrative records, publicity materials, and so on). Thus, while research and development on preservation and curation in the context of institutional repositories was slow to begin with, it has since accelerated and is gaining more mainstream attention.

It is telling, for example, that the JISC created a combined Repositories and Preservation Programme in April 2006 to further the work of both its Digital Repository Programme and its Digital Preservation and Asset Management Programme.

The DCC has produced a report that provides a snapshot of the state of the art of preservation and curation in an institutional repository context in early 2010, noting areas of recent and current research and development.

It should be of interest principally to institutional repository managers and others concerned with the strategic planning for these services.

The report begins with a brief introduction to preservation and curation, followed in chapter 3 by a summary of the current provision for these activities in EPrints, DSpace and Fedora. Some repository models and architectures relevant to preservation and curation are presented in chapters 4 and 5 respectively, while a selection of preservation planning tools of possible use in a repository context is described in chapter 6. Pertinent developments in metadata are reviewed in chapter 7, while tools for working with such metadata are presented in chapter 8. Technologies that assist in performing emulation, reverse engineering and migration are described in chapter 9. The issue of identifiers for repository materials is tackled in chapter 10. A selection of guidelines and tools for auditing curatorial aspects of institutional repositories is presented in chapter 11, and a selection of tools for calculating the costs and benefits of curation is presented in chapter 12. Finally, some conclusions are drawn in chapter 13.

Further information

For the full text, download the report, Preservation and Curation in Institutional Repositories (PDF, 650 KB), by Alex Ball, UKOLN, University of Bath.