
Back to Basics – Dealing with the Data

In my previous post I looked at the planning needed prior to any potential eDisclosure exercise, to gain an understanding of what may be required to gather the data.  Once the data has been collected, there are still many questions to ask and answer before you can determine your budget and plan your review.  This post considers some of the steps that can be taken to keep costs under control and plan your review of the documents.

For example, let’s imagine you have collected 30 gigabytes (GB) of data from 10 custodians, comprising a mixture of emails and documents.  What does that mean in terms of the number of documents that will need to be reviewed by lawyers?  Typically the volume of data in GB is used to estimate the number of documents, based on industry averages.  The first thing to consider is data expansion, particularly if emails have been collected as .pst files.  A .pst file is essentially a storage file associated with Microsoft Outlook emails.  Emails are compressed within a .pst file so that they take up less space, but when the data is processed for loading into a document review tool, the emails decompress to their original size, often resulting in a higher GB count.  We have seen .pst files expand to more than four times their size during processing.  However, for this example, let’s say the total volume of data increases to 50GB.  Depending on the nature of the documents, there could be 5,000 – 10,000 documents per GB, resulting in 250,000 – 500,000 documents.  Using our document review platform, Ontrack Inview, we would expect reviewers to read on average over 50 documents per hour, so you could be looking at up to 10,000 hours of review.  This may be great for your billings as a lawyer, but I suspect it is unlikely to match your budget, or the need for proportionality!
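As a rough sanity check, the arithmetic above can be sketched in a few lines of Python; the figures are the industry-average assumptions quoted in the example, not guarantees:

```python
# Back-of-the-envelope estimate of review effort, using the figures from
# the example above (every rate here is an assumed industry average).
processed_gb = 50                  # 30 GB collected, expanded after processing
docs_per_gb = (5_000, 10_000)      # typical documents per processed GB
docs_per_hour = 50                 # average reviewer throughput

low_docs, high_docs = (processed_gb * d for d in docs_per_gb)
low_hours = low_docs // docs_per_hour
high_hours = high_docs // docs_per_hour

print(f"Documents to review: {low_docs:,} - {high_docs:,}")
print(f"Review hours: {low_hours:,} - {high_hours:,}")
```

Even this crude model makes the point: a modest-sounding collection can translate into thousands of hours of lawyer time.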

There are various steps you can take to reduce the number of documents that end up in your document review set.  Filtering the data generally involves applying keyword searches.  You can also filter by custodian, by specific date ranges or by any other document properties that are available, and by removing irrelevant file types such as system files.  Your keyword list needs to be carefully thought out, though: keywords that are too generic will have minimal effect, while too few or too specific keywords risk relevant documents not passing through the filter and being excluded from the document review set.  You also need to bear in mind that keyword filtering may be ineffective with spreadsheets, pictures and drawings.  It can be helpful to test the effectiveness of your keywords on a subset of the data, and we can certainly assist in that process.  If you plan to use predictive coding, where the document review software learns from human reviewers and automatically codes documents as relevant, there is a suggestion that no keyword filtering should be applied, but that discussion is for another day.
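To make the idea concrete, here is a minimal Python sketch of combined keyword and date-range filtering; the documents, keywords and dates are all invented for illustration and real platforms do this at far greater scale and sophistication:

```python
import datetime as dt

# Invented mini-collection: each document has a custodian, a date and text.
documents = [
    {"custodian": "A. Smith", "date": dt.date(2013, 3, 1),
     "text": "Re: pipeline contract variation and penalty clause"},
    {"custodian": "A. Smith", "date": dt.date(2010, 1, 5),
     "text": "Canteen menu for next week"},
    {"custodian": "B. Jones", "date": dt.date(2013, 6, 9),
     "text": "Draft settlement terms for the contract dispute"},
]

# Hypothetical keyword list and date range agreed for the matter.
keywords = {"contract", "penalty", "settlement"}
date_from, date_to = dt.date(2012, 1, 1), dt.date(2014, 1, 1)

# A document survives the cull only if it falls within the date range
# AND contains at least one keyword.
review_set = [
    doc for doc in documents
    if date_from <= doc["date"] <= date_to
    and any(k in doc["text"].lower() for k in keywords)
]

print(len(review_set))  # 2 of the 3 documents pass the filter
```

Note how a keyword that is too generic (say, "the") would let everything through, while an over-specific list would silently drop relevant material, which is exactly why testing keywords on a sample first is worthwhile.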

In a litigation matter, the scope of the filtering should be agreed by the parties. Should the parties disagree, the court can make an order before disclosure starts.  In Digicel (St. Lucia) Limited and others v Cable & Wireless Plc and others [2008] EWHC 2522 (Ch), the court was critical of the parties’ solicitors deciding the search terms without consulting each other:

“[The Defendants] did not discuss the issues that might arise regarding searches for electronic documents and they used key word searches which they had not agreed in advance or attempted to agree in advance with the Claimants.  The result is that the unilateral decisions made by the Defendants’ solicitors are now under challenge and need to be scrutinised by the Court. If the Court takes the view that the Defendants’ solicitors’ key word searches were inadequate when they were first carried out and that a wider search should have been carried out, the Defendants’ solicitors’ unilateral action has exposed the Defendants to the risk that the Court may require the exercise of searching to be done a second time, with the overall cost of two searches being significantly higher than the cost of a wider search carried out on the first occasion.”

A further reduction can be achieved by removing duplicate documents.  De-duplication can be applied either to each individual custodian’s data or across all of the data.  In some cases it will be sufficient to keep only one copy of each key document in the review database and remove all other copies, but in other cases, such as a fraud matter, it will be important to keep the copies to show who had knowledge of what.

If you have key custodians, you should consider processing their data in order of priority.  That way, your highest-priority custodian’s data is processed first and retains the most documents, the next custodian’s data is de-duplicated against the first custodian’s, and so on, ideally leaving your lowest-priority custodian with the least data.  Be aware, though, that not all duplicates are removed during this process; for instance, where the same document is attached to different emails it will not be removed, so that you can review the attachment in the context of each email.
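The priority-ordered de-duplication described above can be sketched as follows, using content hashes to detect exact duplicates; the custodians and documents are hypothetical, and production tools also handle near-duplicates and email-attachment families, which this sketch deliberately ignores:

```python
import hashlib

# Custodians listed in priority order; documents are raw content bytes.
# All names and contents are invented for the example.
custodians_in_priority = {
    "CEO":     [b"board minutes", b"merger memo"],
    "CFO":     [b"merger memo", b"budget model"],
    "Analyst": [b"budget model", b"merger memo"],
}

seen = set()   # content hashes already kept for a higher-priority custodian
kept = {}      # documents surviving de-duplication, per custodian
for custodian, docs in custodians_in_priority.items():
    kept[custodian] = []
    for doc in docs:
        digest = hashlib.sha256(doc).hexdigest()
        if digest not in seen:      # first time this content has been seen
            seen.add(digest)
            kept[custodian].append(doc)

print({c: len(d) for c, d in kept.items()})  # CEO keeps most, Analyst least
```

Because the CEO's data is processed first, every copy of the merger memo held by lower-priority custodians is removed against it, which is the effect you want when the review starts with your most important custodian.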

Making mistakes with filtering and de-duplication can have costly consequences, perhaps best highlighted in West African Gas Pipeline Company Ltd v Willbros Global Holdings Inc [2012] EWHC 396 (TCC).   In this case documents were not collected properly and some were missing.  Additionally, there were problems with the quality of the OCR, the de-duplication, inconsistent redactions and the outsourced review.   The Judge readily accepted that disclosure in complex international construction projects is difficult, but he was persuaded that errors were made and that the claimant’s disclosure caused additional problems which wasted time and costs.  He ordered the claimant to pay the wasted costs caused by the de-duplication failings and the inconsistent redactions, and the wasted costs of a disrupted and prolonged disclosure exercise.  Working closely with your eDisclosure partner as early as possible can help ensure that all the steps you have taken to reduce the data are defensible.

Well-planned filtering and de-duplication can have a dramatic effect on the number of documents that need to be reviewed, which in turn affects your budget.  We frequently see the volume of data reduced by 40 – 60% through filtering and de-duplication.  To put this in perspective, returning to our original example, a 60% reduction would bring 250,000 – 500,000 documents down to 100,000 – 200,000.  With tight budgets, and frequently tight deadlines, time spent determining the most appropriate way to reduce your data will be time well spent.

Back to Basics – Proper Planning

A trawl of the various blogs and articles on eDisclosure finds plenty of articles on predictive coding, Technology Assisted Review (TAR), big data, analytics, the Jackson Reforms and cost budgeting.  Indeed, even our own blog to date has focused a great deal on these issues, as the tags on the left show.  All of these topics are essential reading for anyone involved in eDisclosure, but they all assume one thing – everyone knows the basics.  No doubt all of our readers are fully aware of the new rules regarding the submission of budgets.  Anyone who is following the Plebgate saga cannot fail to be aware of Andrew Mitchell’s predicament due to his budget not being submitted at least seven days ahead of the CMC.  As a consequence, the court said Mr Mitchell “would be limited to a budget consisting of the applicable court fees for his claim”.  The judge also went on to say:

“Budgeting is something which all solicitors by now ought to know is intended to be integral to the process from the start, and it ought not to be especially onerous to prepare a final budget for a CMC even at relatively short notice if proper planning has been done.”

From our perspective, the key words here are “proper planning”.  One of the most costly aspects of litigation is the actual review of the documents due to the hours that this can potentially take.  But if you are inexperienced at eDisclosure, or don’t know your megabytes from your gigabytes, or both, where do you start?  Hopefully here.

The first thing to think about when your client rings is where to find the information relevant to the case.  The answer to that question will lie with your client, or, if you work for a corporation, with key personnel in IT and management.  The Electronic Documents Questionnaire contained within the Schedule to Practice Direction 31B is a useful template, but here are the key questions that will help us to help you:

  • How many individuals are potentially involved?
    • Individuals are referred to as custodians.
  • Where is the relevant data for these custodians stored?
    • Their data may be on multiple sources, e.g.:
      • Desktop computer
      • Laptop computer
      • External device
      • Smart phone
      • Server
      • Backup tapes
  • Is it necessary to collect all the data from all the sources to avoid the possibility of having to return, thus incurring additional costs?
  • How much data might there be?
    • This is very important as it will eventually help determine the number of potential documents for review.
    • The unit used for data in these circumstances is the gigabyte (GB).
  • What type of data is there?
    • What type of email does your client use, e.g. Microsoft Outlook, Lotus Notes?
    • Any databases or proprietary software?
    • Any messaging data, e.g. Bloomberg Messaging?
    • Any audio data?
  • What languages are contained within the data?
    • Do you have reviewers with the necessary language skills?
    • Is machine translation, whereby your review platform carries out a basic translation, appropriate for your initial review?
  • Who should collect the data and how should it be collected?
    • Where is the data geographically?
    • Do you require an independent third party to collect the data in a defensible, forensically sound manner?
  • What are the data privacy implications, if any?

Whilst these questions are not exhaustive, if you have thought about them, you will be in a position to start your conversation with your eDisclosure providers.  Ideally, relationships ought already to have been built up with technology experts as in most cases there will be little time to conduct a “beauty parade”.

We can help you collect the information you need.  Together we can then begin to plan how you are going to retrieve the data, how long that may take, and what the costs may be.  You will also need to start thinking about the data itself: what happens when it is processed before review, how you can reduce the volume of data to review, and what technology you want to use to help you, as some form of data filtering technology and a review platform is likely to be required.

These topics will be covered in the next Back to Basics post.

Next week, Rob Jones will be writing a blog post on what you need to know about Technology Assisted Review (TAR). You can see a preview below.