Digital Forensics for Preservation

Alex Ball | 07 July 2011

What could someone discover about you from your mobile phone? That was one of the questions asked and answered at the Digital Preservation Coalition's Briefing Day on Digital Forensics for Preservation, held in Oxford on 28 June.

Digital forensics is the name given to a range of techniques used to recover data from a piece of hardware. As the name suggests, these techniques are mainly used to provide evidence in legal proceedings. To understand how, we turned to Simon Attfield of Middlesex University, who outlined how lawyers work their way through recovered data to find the evidence they need. The process is iterative, time-consuming and not particularly scalable, so there is a market for software to make it quicker and more intuitive. The focus is on collaboration tools, clustering techniques and improved data visualisation. But before all that comes into play, the data need to be recovered.

It is obviously important for legal purposes that the recovery process does not change the recovered data, and that this can be proved. This is just as important when the techniques are used to support the archiving of digital material. As Jeremy Leighton John of the British Library explained, even turning a computer on changes its state, so instead the forensic work must be done directly on the hard drive using a second machine. In many cases, the drive can be mounted read-only and a disk image taken. One would want to do this with at least two tools, and compare the results, in case one of the tools has a bug. If the disk has become corrupted, it is sometimes possible to recover the data by analysing its magnetic flux with a specialist tool.
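
To make that comparison concrete, here is a minimal sketch (not any particular tool mentioned on the day) of verifying that two independently taken images agree, by checksumming each with Python's standard hashlib; the file names are hypothetical:

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks
    so that a multi-gigabyte disk image never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical images of the same write-blocked drive, taken by two tools.
image_a = sha256_of("drive-tool-a.dd")
image_b = sha256_of("drive-tool-b.dd")

if image_a == image_b:
    print("Images match:", image_a)
else:
    print("Mismatch - one of the imaging tools may have a bug.")
```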

Where archiving differs from the legal context is what happens to the recovered image. While lawyers are typically interested in only a small subset of files, an archive receiving a personal digital collection is interested in all it contains. The archivist will usually extract files from the disk image and process them in various ways. Of particular importance is catching files in danger of format obsolescence, and protecting files containing sensitive data (credit card numbers and so on). As mentioned above, clustering is currently a hot topic. The next big thing, it seems, is fuzzy hashing; this is a way of determining how similar two files are by taking many hashes of each (based on various subsets of the content) and comparing the hashes. But files aren't the whole story. Sometimes how a person sets up their desktop is just as revealing as the work they do on it, so there are also tools to help with running disk images as virtual machines.
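
Production tools such as ssdeep choose block boundaries with a rolling hash; as a much-simplified sketch of the idea just described — hashing many subsets of each file and comparing the hashes — one might hash fixed-size blocks and measure the overlap (file names hypothetical):

```python
import hashlib

def block_hashes(path, block_size=4096):
    """Hash each fixed-size block of a file. Real fuzzy-hashing tools
    (e.g. ssdeep) use a rolling hash to pick boundaries so an insertion
    doesn't shift every later block; this is a deliberate simplification."""
    hashes = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            hashes.add(hashlib.md5(block).hexdigest())
    return hashes

def similarity(path_a, path_b):
    """Jaccard similarity of the two files' block-hash sets:
    1.0 for identical content, near 0.0 for unrelated files."""
    a, b = block_hashes(path_a), block_hashes(path_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(similarity("draft-v1.doc", "draft-v2.doc"))  # hypothetical files
```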

The question I opened with relates to the talk by Brad Glisson of HATII, University of Glasgow. Mobile phones are very often resold or donated rather than thrown away, but how much information do people sell on along with their phone? Armed with three standard toolkits for phone data recovery, Brad and his team bought around 50 second-hand mobile phones and set to work finding out. The data they retrieved was enough to construct a fairly detailed profile of most of the owners, and they could have discovered much more by cross-referencing phone directories and the like. There was a surprising quantity of sensitive and embarrassing information left on the phones: photos, addresses, PINs, bank account details, National Insurance numbers… Not all of this was the owners' fault: some phones just don't delete things very well. It raises a lot of worrying questions, especially as more phones these days geo-tag images, and smartphones substantially broaden the range of data that can be stored.

Following on from this, Michael Olson of Stanford University Libraries described his forensics lab and how it had tackled the personal digital archives of Douglas Engelbart, Stephen Jay Gould, Robert Creeley and others. One of the messages I took from the talk was the need to act early in getting data off old hardware: three quarters of the data from Keith Henson's Project Xanadu archive were lost to hard disk failures. While it is possible in principle to repair drives, the cost would be prohibitive in most cases. Michael closed his talk with some impressive visualisations of email, including an amusing timeline of the sentiments expressed in Peter Koch's emails.

Gareth Knight of the Centre for e-Research, King's College London (KCL), described the FIDO Project, which is exploring how digital forensics techniques might be used to aid digital records management in higher education institutions. Gareth gave us quick reviews of disk image file formats and some of the forensic toolkits available, explaining why certain ones were chosen for the KCL pilot. Along the way he made some very good points about embedding digital forensics into archival processes. First of all, you can't call it forensics, or else those donating or surrendering their equipment will get edgy about the motivation behind it. Second, just because you can recover deleted data doesn't mean you should, and certainly not without the agreement of the donor.

The final presentation of the day, from Kam Woods and Cal Lee of the University of North Carolina, Chapel Hill, was a trailer of sorts for the forthcoming BitCurator toolkit. This toolkit will be a friendly and archive-orientated wrapper around Simson L. Garfinkel's AFFLIB library and tools. Among the plans for BitCurator are a graphical user interface for the tools, and a bootable CD containing an operating system with everything pre-installed and working.

William Kilbride of the DPC rounded things off by chairing a discussion of next steps.

  • Mobile phones are inherently more difficult to extract data from than computers: not only are they quirky about how they store data, they are also much better at protecting data (if told to do so). But in future we may not have to deal with the phones themselves, as people and software get better at backing up the data to their computers.
  • There is no easy answer to the problem of malware in ingested material, but the risk to the archive itself is minimal. The simplest strategy is therefore to stop it getting out again: restrict access to suspicious files and redact them from the access copies of disk images (see the sketch after this list).
  • We can expect a stronger separation between standard and archival digital forensics tools in future, as archives – having a more archaeological angle – need to support a wider range of formats and devices.
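
As a rough sketch of the malware strategy from the second point above — assuming the open-source ClamAV scanner's clamscan command is available, and using hypothetical directory names — clean files are copied into the access copy, while anything flagged stays only in the preservation master:

```python
import shutil
import subprocess
from pathlib import Path

EXTRACTED = Path("extracted")   # files pulled from the disk image (hypothetical)
ACCESS = Path("access_copy")    # what readers will actually see (hypothetical)

for item in EXTRACTED.rglob("*"):
    if not item.is_file():
        continue
    # clamscan exits with 0 if the file is clean, 1 if infected, 2 on error.
    result = subprocess.run(["clamscan", "--no-summary", str(item)],
                            capture_output=True)
    if result.returncode == 0:
        dest = ACCESS / item.relative_to(EXTRACTED)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(item, dest)
    else:
        # Infected or unreadable: keep it in the preservation copy,
        # but leave it out of (redact it from) the access copy.
        print("Withheld from access copy:", item)
```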

The impression I came away with was that the technology is mostly in place to support preservation with digital forensics. The challenge is to make that technology easy to use, and to make it sound friendly! Of course, as the barriers to entry come down, I guess we'll all need to be more careful about what we leave on our devices for others to find…