Back to Basics – Dealing with the Data

18 November 2013 by Adrienn Toth

In my previous post I looked at the planning needed before any potential edisclosure exercise, to gain an understanding of what may be required to gather the data. Once the data has been collected, there are still many questions to be asked and answered before you can set your budget and plan your review. This post considers some of the steps you can take to keep control of costs and plan the review of the documents.

For example, let’s imagine you have collected 30 gigabytes (GB) of data from 10 custodians, comprising a mixture of emails and documents. What does that mean in terms of the number of documents that lawyers will need to review? Typically, the volume of data in GB is used to estimate the number of documents based on industry averages.

The first thing to consider is data expansion, particularly if emails have been collected as .pst files. A .pst file is a storage file associated with Microsoft Outlook email. Emails are compressed into a .pst file so that they take up less space, but when the data is processed for loading into a document review tool, the emails decompress to their original size, often resulting in a higher GB count. We have seen .pst files expand to more than four times their size during processing. For this example, let’s say the total volume of data increases to 50GB. Depending on the nature of the documents, there could be 5,000 – 10,000 documents per GB, resulting in 250,000 – 500,000 documents. Using our document review platform, Ontrack Inview, we would expect reviewers to read on average over 50 documents per hour, so you could be looking at up to 10,000 hours of review. This may be great for your billings as a lawyer, but I suspect it is unlikely to match your budget, or the need for proportionality!
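To make that arithmetic concrete, here is a minimal Python sketch of the estimate. The figures (50GB after expansion, 5,000 – 10,000 documents per GB, roughly 50 documents reviewed per hour) are the illustrative averages used above, not fixed constants; real matters vary widely with the mix of file types.

```python
def estimate_review(processed_gb, docs_per_gb, docs_per_hour=50):
    """Return (document count, review hours) for a given processed volume."""
    documents = processed_gb * docs_per_gb
    return documents, documents / docs_per_hour

# The 30GB collection, expanded to 50GB during processing.
for docs_per_gb in (5_000, 10_000):
    docs, hours = estimate_review(50, docs_per_gb)
    print(f"{docs:,} documents -> {hours:,.0f} hours of review")
# 250,000 documents -> 5,000 hours of review
# 500,000 documents -> 10,000 hours of review
```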

There are various steps you can take to reduce the number of documents that end up in your review set. Filtering the data generally involves applying keyword searches. You can also filter by custodian, by specific date ranges or by any other document properties that are available, and remove irrelevant file types such as system files. Your keyword list needs to be carefully thought out, though: keywords that are too generic will have minimal effect, while too few or overly specific keywords risk relevant documents failing to pass through the filter and being excluded from the review set. Bear in mind too that keyword filtering may be ineffective against spreadsheets, pictures and drawings. It can be helpful to test the effectiveness of your keywords on a subset of the data, and we can certainly assist in that process. If you plan to use predictive coding, where the document review software learns from human reviewers and automatically codes documents as relevant, there is a school of thought that no keyword filtering should be applied at all, but that discussion is for another day.
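For illustration, the sketch below shows the general shape of such a filter in Python. The field names, keywords and dates are hypothetical, and splitting text on whitespace is a crude stand-in for the indexed searching a review platform would actually perform.

```python
from datetime import date

KEYWORDS = {"pricing", "agreement"}         # illustrative terms only
EXCLUDED_TYPES = {".dll", ".exe", ".tmp"}   # system/irrelevant file types
DATE_FROM, DATE_TO = date(2011, 1, 1), date(2013, 6, 30)

def passes_filter(doc):
    """Apply file-type, date-range and keyword filters to one document."""
    if doc["extension"].lower() in EXCLUDED_TYPES:
        return False
    if not DATE_FROM <= doc["date"] <= DATE_TO:
        return False
    # Keep the document only if at least one keyword appears in its text.
    return bool(KEYWORDS & set(doc["text"].lower().split()))

documents = [
    {"extension": ".msg", "date": date(2012, 3, 1),
     "text": "Draft pricing agreement attached"},
    {"extension": ".dll", "date": date(2012, 3, 1), "text": ""},
]
review_set = [d for d in documents if passes_filter(d)]  # keeps the first only
```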

In a litigation matter, the scope of the filtering should be agreed between the parties; should they disagree, the court can make an order before disclosure starts. In Digicel (St. Lucia) Limited and others v Cable & Wireless Plc and others [2008] EWHC 2522 (Ch), the court was critical of the defendants’ solicitors for deciding on search terms unilaterally, without attempting to agree them with the claimants:

“[The Defendants] did not discuss the issues that might arise regarding searches for electronic documents and they used key word searches which they had not agreed in advance or attempted to agree in advance with the Claimants.  The result is that the unilateral decisions made by the Defendants' solicitors are now under challenge and need to be scrutinised by the Court. If the Court takes the view that the Defendants' solicitors' key word searches were inadequate when they were first carried out and that a wider search should have been carried out, the Defendants' solicitors' unilateral action has exposed the Defendants to the risk that the Court may require the exercise of searching to be done a second time, with the overall cost of two searches being significantly higher than the cost of a wider search carried out on the first occasion.”

A further reduction can be achieved by removing duplicate documents. De-duplication can be applied either to each individual custodian’s data or across all the data. In some cases it will be sufficient to keep only one copy of each document in the review database and remove all other copies, but in others, such as a fraud matter, it will be important to keep the copies to show who had knowledge of what.
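Conceptually, de-duplication works by fingerprinting each document, typically with a hash such as MD5 or SHA-1 (emails are usually hashed over a combination of fields such as sender, recipients, subject and body). A minimal sketch of the idea:

```python
import hashlib

def deduplicate(documents):
    """Keep the first copy of each unique document and drop the rest."""
    seen, unique = set(), []
    for content in documents:
        fingerprint = hashlib.sha1(content).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(content)
    return unique

assert deduplicate([b"contract", b"memo", b"contract"]) == [b"contract", b"memo"]
```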

If you have key custodians, you should consider processing their data in order of priority. That way, your highest-priority custodian’s data is de-duplicated first and so retains the most documents; the next custodian’s data is then de-duplicated against the first custodian’s, and so on, ideally leaving your lowest-priority custodian with the least data. Be aware, though, that not all duplicates are removed during this process: where the same document is attached to different emails, for instance, it will not be removed, so that you can review the attachment in the context of each email.
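The effect of the processing order can be sketched by extending the same idea: each custodian’s documents are checked against everything already seen, so whoever is processed first keeps the shared copies. This is a conceptual sketch only; as noted above, real tools preserve email families rather than blindly removing every duplicate attachment.

```python
import hashlib

def deduplicate_by_priority(custodians):
    """custodians: list of (name, documents) pairs, highest priority first."""
    seen, results = set(), {}
    for name, docs in custodians:
        kept = []
        for content in docs:
            fingerprint = hashlib.sha1(content).hexdigest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                kept.append(content)
        results[name] = kept
    return results

# The shared memo survives only in the highest-priority custodian's set.
sets = deduplicate_by_priority([
    ("CEO",     [b"shared memo", b"board pack"]),
    ("Analyst", [b"shared memo", b"model v2"]),
])
assert sets == {"CEO": [b"shared memo", b"board pack"], "Analyst": [b"model v2"]}
```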

Mistakes with filtering and de-duplication can have costly consequences, perhaps best highlighted in West African Gas Pipeline Company Ltd v Willbros Global Holdings Inc [2012] EWHC 396 (TCC). In that case documents were not collected properly and some were missing, and there were further problems with the quality of the OCR, the de-duplication, inconsistent redactions and the outsourced review. The judge readily accepted that disclosure in complex international construction projects is difficult, but he was persuaded that errors had been made and that the claimant’s disclosure caused additional problems which wasted time and costs. He ordered the claimant to pay the wasted costs caused by the de-duplication failings and the inconsistent redactions, as well as the wasted costs of a disrupted and prolonged disclosure exercise. Working closely with your edisclosure partner as early as possible helps ensure that all the steps you take to reduce the data are defensible.

Well-planned filtering and de-duplication can have a dramatic effect on the number of documents that need to be reviewed, which in turn impacts on your budget. We frequently see the volume of data reduced by 40 – 60% through filtering and de-duplication. To put this in perspective, returning to our original example, a 60% reduction would bring 250,000 – 500,000 documents down to 100,000 – 200,000. With tight budgets, and frequently tight deadlines, time spent determining the most appropriate way to reduce your data will be time well spent.
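Closing the loop on the earlier estimate, the same illustrative arithmetic shows what that reduction means for the review budget:

```python
# A 60% reduction applied to the earlier estimate, at ~50 documents
# reviewed per hour. Illustrative figures only.
for docs in (250_000, 500_000):
    remaining = docs * (1 - 0.60)
    print(f"{docs:,} -> {remaining:,.0f} documents, ~{remaining / 50:,.0f} review hours")
# 250,000 -> 100,000 documents, ~2,000 review hours
# 500,000 -> 200,000 documents, ~4,000 review hours
```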