All posts tagged Big Data

Big data: high financial rewards, high regulatory risks?

In a 2013 survey of 400 companies, management consultancy Bain & Company found that companies using data analytics were:

  • Twice as likely to be in the top quartile of financial performance within their industries
  • Three times more likely to execute decisions as intended
  • Five times more likely to make decisions faster

Fast forward to 2018 and data analytics is firmly entrenched within many companies, to the extent that it has attracted the attention of the regulatory authorities. The European Commission, the Competition and Markets Authority and the French Autorité de la concurrence have all stated that big data and the competitive advantage it can give is a top investigative priority for 2017 and beyond.

How can big data give unfair competitive advantage?

Big data as an asset

Margrethe Vestager, the European Commissioner for Competition, is currently considering revising merger control thresholds to include a threshold pertaining to non-turnover related big data holdings. Although the Commission has taken the value of data into account in previous merger control investigations, this has largely involved companies where big data generates significant revenue. However, a company could acquire a business with a small turnover and a large amount of user data; the new owner could then exploit this data and reduce competition in that marketplace.

Big data pooling

Although sharing data is not forbidden per se, the way companies share data can breach competition rules. Companies can use big data to place themselves in a dominant position over competitors. For example, if a company wants to diversify its offering and move into new areas, it can use data held on current customers to promote the new business. For instance, Uber’s access to users of its lift-sharing service can be used to promote other ventures such as UberEats. This gives Uber an unfair advantage over other providers that offer a similar takeaway food service but lack the data from such a large customer base.

The regulatory authorities take these violations seriously and are imposing significant fines. Most recently, the Belgian Lottery was fined €1 million for using a database of customer contacts to promote a new sports lottery game.

A new form of white collar crime?

The formation of so-called digital cartels is predicted to be one of the biggest challenges regulators will face in the future. Digital cartels arise from companies using automated pricing systems. These digital tools automatically calculate prices according to a set of criteria such as supply versus demand, profit targets and so forth. Increasingly, these systems use machine learning technology. This can lead to the situation where two rival companies use the same pricing technology and react identically to changing market conditions. This results in prices being unintentionally fixed and the law being violated.
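To make the mechanism concrete, here is a minimal sketch of two rival firms running the same automated pricing rule. The rule itself, its parameters (base_price, sensitivity) and the market figures are hypothetical and are not taken from any real pricing system.

```python
def automated_price(base_price: float, demand: int, supply: int,
                    sensitivity: float = 0.5) -> float:
    """Hypothetical rule: raise the price when demand outstrips supply."""
    imbalance = (demand - supply) / supply
    return round(base_price * (1 + sensitivity * imbalance), 2)


# Both firms observe the same market conditions: (demand, supply) per day.
market_conditions = [(100, 100), (150, 100), (80, 100)]

# Neither firm talks to the other; each simply runs the same rule.
firm_a_prices = [automated_price(10.0, d, s) for d, s in market_conditions]
firm_b_prices = [automated_price(10.0, d, s) for d, s in market_conditions]

prices_move_in_lockstep = firm_a_prices == firm_b_prices
```

In this toy example the two price series are identical on every day, even though there has been no contact between the firms.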

Getting value from big data without incurring fines

When it comes to the formation of digital cartels, prevention is complicated. Automated pricing systems are widespread and manual pricing models are unlikely to make a comeback. For regulatory authorities, who are reliant on laws written in the pre-digital age, enforcement is a greater challenge.  However, Vestager has suggested a new directive might follow later in 2017 which may bring clearer rules and stricter enforcement.

Other streams of revenue enabled by the collection and analysis of big data are more easily policed. For companies that rely on sharing information for product development, Vestager recommends referring to the Commission’s guidelines on horizontal cooperation, which show companies how to share data in a way that doesn’t reduce competition.

She also discussed ways for companies to share information with competitors anonymously in a way that doesn’t harm their own business interests such as sending information to a platform anonymously. In return, they would receive aggregate data with no indication of which company it comes from.
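The anonymous platform idea can be sketched roughly as follows. This is an illustration only: the function name, the company labels and the minimum-contributor rule are assumptions, not a description of any scheme the Commission has endorsed.

```python
from statistics import mean


def aggregate_submissions(submissions: dict[str, float],
                          min_contributors: int = 3) -> dict:
    """Return pooled figures only; company names never leave this function.

    The min_contributors safeguard is an assumption made for this sketch:
    with very few contributors, an average could reveal individual figures.
    """
    if len(submissions) < min_contributors:
        raise ValueError("too few contributors to hide individual figures")
    values = list(submissions.values())
    return {"contributors": len(values), "average": mean(values)}


# Hypothetical figures submitted by three competitors.
report = aggregate_submissions(
    {"company_a": 120.0, "company_b": 100.0, "company_c": 110.0}
)
```

Each participant receives the same report, which contains a count and an average but no company names.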

In conclusion, competition enforcement is changing, and fast. Companies that use big data, and smaller companies that hold big data but don’t actively use it, should closely monitor the Commission’s announcements over the next few months in order to prepare for any changes. Watch this space!



Ediscovery trends in 2017: from artificial intelligence to mobile data centres


2017 is set to be a year of change as organisations prepare for the new General Data Protection Regulation (GDPR) and the accelerated adoption of artificial intelligence. Faced with the need to manage greater volumes of data as well as multiplying communications channels, organisations and their legal representatives will be increasingly reliant on ediscovery technology and processes to reduce the time needed to identify and manage information required to satisfy regulatory and legal issues.

Against this backdrop, we make the following predictions for 2017:

  1. Technology will play a vital role in helping organisations prepare for GDPR

The tough new General Data Protection Regulation currently being implemented in Europe will have a global impact. In cross-border litigation and investigations, where data needs to cross borders to comply with discovery requests, mobile discovery will become essential. These solutions capture, process, filter and examine data on-site, avoiding the need to transfer data across borders. GDPR has strict rules for protecting individuals’ right to be forgotten and organisations will need the relevant tools to find and erase personal data. Breaches of the provisions that lawmakers have deemed most important for data protection could lead to data watchdogs levying fines of up to €20 million or 4% of global annual turnover for the preceding financial year, whichever is the greater.
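The "whichever is the greater" rule is simple arithmetic, sketched below. The function name and the turnover figures are illustrative, and the result is only the upper limit described above, not a prediction of any actual fine.

```python
def max_gdpr_fine(global_annual_turnover_eur: int) -> float:
    """Upper limit for the most serious breaches:
    the greater of EUR 20 million or 4% of global annual turnover."""
    four_percent = global_annual_turnover_eur * 4 / 100
    return max(20_000_000.0, four_percent)


# Hypothetical companies: 4% of EUR 100m is only EUR 4m, so EUR 20m applies.
smaller_company_cap = max_gdpr_fine(100_000_000)
# 4% of EUR 2bn is EUR 80m, which exceeds EUR 20m.
larger_company_cap = max_gdpr_fine(2_000_000_000)
```

The turnover-based limit only takes over once global annual turnover exceeds €500 million, since 4% of that figure is €20 million.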

  2. Ediscovery will find new homes beyond regulation and legislation

While ediscovery is widely used by professionals working on legal cases in litigation, regulation, competition law and merger control, employment law and arbitration, it will be used more and more this year in an anticipatory manner by organisations to identify, isolate and address any concerns about compliance that could expose them to the risk of some kind of intervention or sanction. This trend will be accelerated by an increasingly complex and aggressive regulatory environment, exemplified by the French Anti-Corruption laws adopted in November 2016.

  3. New sources of evidence will move into the spotlight

Enterprises are creating more data than ever before. Data can be found anywhere that there are storage devices to hold it, whether that is a data centre, a laptop, a mobile, a wearable device or the Cloud. Channels to move data from one place to another are also proliferating. As a result we are seeing a diversification of evidence sources being used to build up a picture of what has happened in a legal matter. Whilst email and structured data remain the most common sources of evidence, other data sources such as social media and satellite navigation systems are gaining in importance and providing key insights into many cases. Clients are increasingly choosing ediscovery providers who can integrate a wider variety of data sources into one platform for analysis.

  4. The robots are coming

Savvy law firms and corporate counsel will benefit from bringing the latest technologies, including artificial intelligence (AI), to the attention of their clients. A long line of court decisions in the US, and now also in the UK and Ireland, has already driven greater interest in and adoption of predictive coding.

  5. The ediscovery industry will continue to evolve

The past few years have seen huge changes in the ediscovery industry itself as it seeks to provide the technologies that organisations need to keep up with more stringent regulation in data governance. Only larger, international partners now have the resources and capabilities required to provide local services and data processing centres where organisations need them, together with cutting edge tools and technologies to manage huge volumes of data and channels moving forwards.

  6. Big data will take centre stage in competition and data privacy matters

Regulators are becoming increasingly aware of the competition and data privacy implications of big data. From a competition point of view, big data held by companies can trigger both Article 101 TFEU (relating to antitrust cases) and Article 102 TFEU (relating to abuse of dominance cases). This is highlighted by the joint report of May 2016 from the French and German Competition Authorities, entitled “Competition Law and Data”, which explains that the use of big data can trigger Article 101 TFEU and thus be treated as cartel conduct. Companies that handle substantial data volumes on a day-to-day basis will need to factor this into their compliance strategies and embrace technological solutions to aid in investigations and redactions.

  7. There will be a greater need for electronic documents

Despite evidence becoming mostly electronic, until recently regulatory authorities still required the submission of hard copies of RFI forms, merger filings and other investigatory materials. However, the introduction of the European Commission’s eQuestionnaire for merger control and antitrust cases means parties must now submit all information electronically.

In December 2016, the EC also published guidelines entitled “Recommendations for the Use of Electronic Document Submissions in Antitrust and Cartel Case Proceedings”. It is important to note that the EC strongly encourages the use of electronic formats even for paper documents, which means they have to be scanned and made readable.

Tim Philips, Managing Director at Kroll Ontrack, said: “Ediscovery continues to provide essential tools and technologies for all manner of legal matters and allows companies to efficiently navigate through this era of big data, regulatory scrutiny and more stringent data protection requirements. 2017 is set to be another landmark year in terms of the adoption of ediscovery technology and the evolution of ediscovery technology itself.”

5 data analytics myths debunked

Perplexed by Data Analytics? Stuck on statistics? Then fear not: Philip O’Donnell, Forensic Data Analytics Consultant, is here to guide you through the fascinating world of analytics, explaining complex concepts, tackling technical terms and showing the power of data in a series of business scenarios.
In his first blog, Philip will debunk some of the most prevalent myths surrounding data analytics. Over to you, Philip!

1) Once you have an analytics tool, anyone can be a data analyst

The father of data analytics, John Tukey, summed up the aim of analytics in a typically succinct manner by stating,

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

Put broadly, data analytics is a process to uncover hidden patterns, unknown correlations, market trends and customer preferences using mathematical and statistical techniques.

However, many people think data analytics is just a tool that turns data into graphs and that once you have this tool, anyone can analyse data. This is a little like saying that by owning a saw, you are a master carpenter!
To get the most out of data analytics, it is imperative that the right techniques, as well as the steps in the process, are understood and used in the right context. Only then can analytics be truly effective in an investigation; performed incorrectly, it can produce misleading findings.

2) Data analytics is just for auditors

Where there are people, there is data and this data can be analysed and used to improve the way we operate. Music industry moguls use data analytics to measure listener responses to new music. This then helps them work out which genres, and new artists, are likely to bring them a hit.
Analytics is used in all spheres of society, from medical research and environmental studies to more obvious financial applications. Even Hollywood screenwriters have discovered that analytics can produce great success stories. In the Oscar-nominated film Moneyball, a poorly performing baseball team hired a statistics expert to help them change their drafting procedures. By using statistics to help select players rather than traditional scouting methods, the team went on to win 20 consecutive games, an American League record.

Analytics helps people in all industries make better, more informed decisions and deliver new innovative ways of thinking and doing business.

3) Data context doesn’t matter

The key to performing any analytics is understanding the environment in which the client operates. Interpreting and advising on findings is a key aspect of the analytics process, so to really add value for clients, sector knowledge is vitally important. Even the most experienced data analysts need to understand the context of the data, especially in high-profile legal investigations, banking cases, corporate compliance, financial analysis and government projects. Clients looking to get the most out of their data will need to choose a provider who is able to harness industry knowledge and take a pragmatic approach to data science and analytics methodologies.

4) Analysing data can compromise the security and integrity of data estates

This myth does have some truth in that many inexperienced analysts do not understand the importance of a proper data extraction exercise. Direct extraction of raw data from the core system is a key step in the analytics process and, in the past, I have seen cases where incomplete or incorrect data extraction has caused a data analytics investigation to be invalidated.
However, an experienced data analytics provider is rigorous in ensuring data extraction is performed correctly and is accountable in the chain of custody. Done properly, extraction ensures that the complete dataset is captured and minimises the risk of an incomplete investigation. Extraction is performed in such a way that it does not compromise the existing security of the data and preserves the integrity of the system. Extraction can be performed on multiple data sources. These include relational databases and data warehouses as well as legacy flat files and dynamic XML formats.

5) Analytics techniques don’t change

Data analytics is an incredibly dynamic discipline and new techniques are being developed all the time. A good analytics provider will always stay abreast of the latest trends and methods. So what is in store for 2016?

According to the International Analytics Institute, the number one trend for analytics in 2016 will be that the distinction between cognitive analytics and automated analytics becomes blurred. Automated analytics is the changing of an airline ticket price or stock price based on the real-time analysis of factors such as customer demand or other market forces. Cognitive analytics is inspired by how the human brain processes information, draws conclusions, and codifies instincts and experience into learning. Cognitive analytics applies machine learning techniques such as neural networks and logistic regression to historic data. By understanding the human decision-making and learning process, data scientists can incorporate this knowledge into their models and achieve even more accurate and in-depth insights.

Click here to find out more information about our Data Analytics service.

Data, data everywhere: The LegalWeek Intelligence Report

A unique market study

Kroll Ontrack, in conjunction with LegalWeek Intelligence, recently launched a study to investigate how corporations are coping with the twin challenges of big data and regulatory scrutiny.

We created an in-depth and detailed survey which was completed by 101 in-house lawyers from companies operating in the U.K. The responses were then analysed by our consultants, commented upon by legal experts and incorporated into a full market study.

The backdrop to the survey

In recent years we have witnessed an unprecedented level of regulatory scrutiny and sanctions on multinational companies, most notoriously in the banking sector. The resource drain on companies brought about by this increased level of regulatory activity has been compounded by the gargantuan levels of electronic data that are now held, all of which must be organised and inspected appropriately. Often, companies are threatened as much by the cost of complying with the disclosure requirements of regulators as they are by the prospective sanctions.

As a technology and related services provider, Kroll Ontrack has been inundated with requests to assist clients throughout the investigation process. The most pertinent observation we have made is that companies who have a better handle on their data are able to deal with such investigations far more successfully than others.

In light of this, Kroll Ontrack is determined to equip in-house counsel with as much practical knowledge as is necessary so that they may deal with their data in a strategic, cost-effective manner, both on a pre-emptive and after-the-event basis. To that end, working with LegalWeek’s Intelligence unit, we conducted the survey.

What did the survey reveal?

Whilst planning the study, many of our consultants made predictions about the outcomes, based on their experience of working on regulatory matters with in-house counsel. Indeed, some of the results did meet these expectations. For example, the survey confirmed that the top challenges facing legal departments are:

  • working within an increasingly regulated environment
  • straddling the need to act as both legal and commercial advisors
  • controlling legal costs

However, there were some surprises to be had, in particular the high percentage of participants who said they reduce and control legal spend by carrying out more ediscovery work internally, using their own technology.

We have made the full report available on our company website, which you can download here.

Why do the results matter?

For in-house counsel, understanding how peers are responding to regulatory scrutiny and managing big data can help benchmark their own strategies and perhaps inspire new approaches.

For Kroll Ontrack, this study has been a valuable exercise in gaining a greater understanding of the challenges faced by our corporate clients. Our aim is to use this information to shape solutions that better serve the needs of in-house counsel.

For the full report, click here.

About Hitesh Chowdhry

Hitesh Chowdhry joined Kroll Ontrack’s London office in July 2014. He sits as a consultant within the Electronic Evidence Consultancy team, advising lawyers and their clients on how to effectively manage electronically stored documents in litigation, arbitration, and internal or regulatory investigations. Hitesh studied law up to Master’s level at King’s College London. He trained as a solicitor at City firm Penningtons LLP, and qualified into the litigation department there in 2008. Hitesh moved on to join the Treasury Solicitor’s Department in 2010, where he acted on behalf of the Home Secretary in human rights claims. Prior to joining Kroll Ontrack, Hitesh spent one year working as a document review lawyer at various US firms in the City of London. Hitesh is currently studying for an Executive MBA at the Cass Business School, London.

Mergers & Acquisitions: Ediscovery takes centre stage

Ediscovery technology has a long association with litigation, so you may be forgiven for wondering about the link to mergers & acquisitions, traditionally the domain of corporate deal-makers.

However, as regulatory scrutiny has increased on a national and international level, more law firms and in-house counsel are using ediscovery technology to respond swiftly to formal Requests for Information (RFIs).

At the same time, anything that threatens the successful closure of a deal, or the integration of merging businesses is something that is generally investigated using ediscovery and forensic procedures.

Our clients come to us for assistance with matters that stem from the M&A process. Pre- and post-merger audits and merger control RFIs from regulatory bodies such as the European Commission, the UK Competition and Markets Authority, the French Autorité de la concurrence, the German Bundeskartellamt as well as the US Department of Justice are at the top of the menu.

Using ediscovery to enhance due diligence

The time prior to a merger or acquisition deal being finalised is critical, and data from entities being merged or acquired must be assessed as part of due diligence duties. In the past, these reviews typically focused on data in the form of financial reports and accounts; legal documents such as contracts and intellectual property; asset valuations and company policies.

However, in this digital age, examining surface-level information may not be enough to be confident that the deal is not risky or that combining with another company will not create risks.

If the company being acquired operates within markets that have seen anti-competitive behaviour or in countries with a greater incidence of corruption and bribery, it may be prudent to conduct a broader investigation into the company’s activities by examining a selection of unstructured data in audits.

What is unstructured data and why is it important for mergers and acquisitions?

Unstructured data largely consists of personal correspondence in the form of emails, text messages, voice mails and web-based messaging systems such as WhatsApp. Within even a medium-sized organisation the amount of data generated by these applications is enormous. For a global firm, the volume of data is almost unimaginably large. Yet just a handful of incriminating emails containing evidence of cartel activities can have serious repercussions at a later date, should a regulatory body decide to investigate concerns relating to dominance.

By the same token, structured data (which is normally transactional data, stored in tables to record things like customers, products, orders and payments) may also be examined to look for anomalies that might signal a compliance risk using specialised data analysis tools and visualisation software.
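As a rough illustration of what looking for anomalies in transactional data can mean, the sketch below flags payments that sit far from the typical amount. The field names, the sample payments and the threshold are all hypothetical, and the median-based score is just one simple technique among many that specialised tools may use.

```python
from statistics import median


def flag_unusual_payments(payments: list[dict], threshold: float = 5.0) -> list[str]:
    """Return the ids of payments far from the typical amount.

    Uses the median absolute deviation (MAD), which is not distorted
    by the very outliers we are trying to find.
    """
    amounts = [p["amount"] for p in payments]
    typical = median(amounts)
    mad = median([abs(a - typical) for a in amounts])
    if mad == 0:
        return []  # all amounts (nearly) identical; nothing to compare against
    return [p["id"] for p in payments
            if abs(p["amount"] - typical) / mad > threshold]


# Hypothetical payments table, as it might be pulled from a finance system.
payments = [
    {"id": "P1", "amount": 100.0},
    {"id": "P2", "amount": 120.0},
    {"id": "P3", "amount": 95.0},
    {"id": "P4", "amount": 110.0},
    {"id": "P5", "amount": 105.0},
    {"id": "P6", "amount": 5000.0},
]
flagged = flag_unusual_payments(payments)
```

A flagged payment is not evidence of wrongdoing; it is simply a prompt for a human reviewer to look more closely.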

Intelligent review technology is aiding strategic decision-making

Ediscovery technology can make short work of huge data sets by collecting, filtering and analysing data to get to the key information as quickly as possible. Whether the review identifies potential risks or gives the target a clean bill of health, informed decisions can then be made about the deal, which can proceed in a compliant and timely manner.

If this kind of investigation has not been possible prior to the merger or a company has doubts about an entity it has acquired or merged with, clients also come to us for post-merger compliance investigations which vary in scope from the very focused to the very broad. Ediscovery technology can also assist on an operational level by harmonising data estates of the merged companies.

Taking the pain out of Phase Two Requests for Information

If the European Commission is worried about the possible effects of a merger on competition, it may conduct an in-depth analysis of the merger in the form of a Phase II Investigation.

This involves a more extensive information gathering exercise, working to a strict timetable, similar to ediscovery in the US or edisclosure in the UK. Because they look at the deal from a variety of angles (e.g. whether the proposed merger would create a monopoly, whether it will impact on the supply chain or increase the likelihood of price-fixing cartels forming between competitors), Phase II Investigations can be data-intensive exercises, needing ediscovery expertise to ease the deal through.

Ediscovery services can help ensure this process runs more efficiently for the parties involved by:

  • Assessing the likely complexity and cost of the data retrieval exercise, to support efforts to reduce the scope of an RFI
  • Assisting internal IT teams in the collation and collection of the data requested
  • Ensuring this data is stored securely and processed quickly
  • Providing analytical tools to check documents are relevant to the request and do not fall under privilege
  • Working in a timely fashion to ensure the request for information deadline is met

Phase II Investigations are often time pressured and delays can threaten the completion of a deal, so it is important to ensure that all teams are focused on the overall goal of the proposed merger.

Working with an ediscovery provider can expedite the submission of requested information, potentially speed up any decisions or remedies and get the deal through.

If you would like to find out more about how Kroll Ontrack can assist with mergers and acquisitions, please contact Rob Jones.

About Rob Jones

Robert Jones is the manager of Kroll Ontrack’s team of Legal Consultants in Continental Europe, the Middle East and Africa.

Predictive coding and Benedict Cumberbatch

Artificial Intelligence has clearly become a provocative topic in popular culture once again – you’ve only to watch ‘Her’ and ‘Transcendence’ to see that. However, the most recent movie to catch my eye is ‘The Imitation Game’, and not just because the lovely Benedict Cumberbatch has the starring role, but rather because Alan Turing is the central character.

Mr Turing is known as not only the grandfather of computers, but the grandfather of artificial intelligence. In his seminal paper he questioned: “Can machines think?” But what is “thinking”? What would it mean for a machine to “think”? He refined the question and looked instead to the Imitation Game (hence the title of the film…): A machine can be deemed to “think” if it is indistinguishable from a human in its answers to questions. This became fundamental in the philosophy of artificial intelligence.

This question was so overwhelming that it became obvious there would be no one way to make this happen. And so several sub-fields began to develop in the field of artificial intelligence: data mining, image processing, natural language processing, speech recognition, machine learning and so on. With recent web developments, the culmination of all these techniques means that we are closer than ever before to the Holy Grail of imitation. You just need to watch IBM’s Watson on Jeopardy to see that.

But it’s one particular sub-field that is of interest to me: machine learning. This is the underlying technology used in Predictive Coding. It is for machine learning that the question of machine thinking is incredibly pertinent. To start at the beginning, not all predictive coding technologies were created equal. The technologies use different algorithms, meaning that each takes a different approach to machine thinking and ultimately produces different results, with correspondingly varying degrees of success.

When reduced to its core parts, predictive coding is a two-step process. First, the machine learns through human intervention. This means the human provides the machine with the criteria a document needs to conform to in order to be considered X. For this process, a small subset of the data corpus is used. Secondly, the machine thinks by applying that learning, predicting whether unreviewed documents (the remaining data corpus, not used in the first step) meet those criteria.
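The two steps can be sketched with a deliberately simple word-counting model. This is a toy illustration only: the sample documents are invented, and real predictive coding products use far more sophisticated algorithms than this.

```python
from collections import Counter


def learn(reviewed_documents):
    """Step one: learn word weights from documents a human has already coded."""
    weights = Counter()
    for text, is_relevant in reviewed_documents:
        for word in text.lower().split():
            weights[word] += 1 if is_relevant else -1
    return weights


def predict(weights, text):
    """Step two: score an unreviewed document; a positive total means relevant."""
    score = sum(weights[word] for word in text.lower().split())
    return score > 0


# A small, human-reviewed subset of the corpus (hypothetical examples).
reviewed = [
    ("price agreement with competitor", True),
    ("confidential pricing meeting", True),
    ("office party invitation", False),
    ("lunch menu for friday", False),
]
weights = learn(reviewed)

# The remaining, unreviewed corpus.
unreviewed = ["meeting about price agreement", "friday party menu"]
predictions = [predict(weights, doc) for doc in unreviewed]
```

Here the first unreviewed document is predicted to be relevant and the second is not, purely on the strength of the words the human reviewer has already coded.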

The learning element is similar for all predictive coding technologies – a human must review documents to input the relevant criteria and teach the machine. Although there are differing schools of thought as to the best approach to this – automatic versus manual training, for example – the fundamentals are the same. A human must input good information for the machine to learn.

It is the thinking element that is the defining factor. How well can the machine think and how well does it play its ‘Imitation Game’? For effective thinking, we really want the machine to be able to ‘actively learn’. Active learning allows the machine to interactively query the user to obtain further information. It allows the machine to say “I’m confused about this shade of grey.”

Why are the shades of grey important? Well, an algorithm that is powered by, say, an analytics engine is something more akin to passive, rather than active, learning. When documents are processed, an analytics engine will cluster them together based on items like topics. The human will teach the system and the machine will learn, but the thinking element is lacking. The machine will predict that if a document belonging to a particular cluster is X, then all documents in this cluster are X. This would be fine if everything were black and white, but there are always shades of grey.

Take a case involving baking. Sponge cakes belong to one cluster and the human can state that sponge cakes are X. The human can also teach the machine that chocolate biscuits are Y. Then what of chocolate cakes? This is a shade of grey that results in a crossover between the criteria for X and Y. The machine cannot think past black and white to consider the crossover.

On the other hand, active learning can push documents that relate to chocolate cakes forward to the human for clarification. It is smart enough to recognise the crossover in criteria that requires clarity. By being able to expressly ask for clarity, the machine is far better at joining the dots of the criteria and understanding the subtleties, and is ultimately able to make better predictions. For that reason, it is far closer to being able to imitate the thought processes of humans and a step towards being able to think for itself.
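The baking example can be turned into a small sketch of this idea. It assumes a toy word-weight model and a hypothetical margin parameter that defines what counts as a shade of grey; it is not how any particular predictive coding product implements active learning.

```python
from collections import Counter


def learn_from_human(coded_documents):
    """Words seen in X documents pull the score up; words seen in Y pull it down."""
    weights = Counter()
    for text, label in coded_documents:
        for word in text.lower().split():
            weights[word] += 1 if label == "X" else -1
    return weights


def classify_or_ask(weights, text, margin=0):
    """Predict X or Y, but route the document to the human when the
    evidence is balanced (score within the margin of zero)."""
    score = sum(weights[word] for word in text.lower().split())
    if abs(score) <= margin:
        return "ask human"
    return "X" if score > 0 else "Y"


# The human teaches the machine: sponge cakes are X, chocolate biscuits are Y.
weights = learn_from_human([("sponge cake", "X"), ("chocolate biscuit", "Y")])

decisions = {
    doc: classify_or_ask(weights, doc)
    for doc in ["sponge cake recipe", "chocolate biscuit tin", "chocolate cake"]
}
```

The chocolate cake document scores exactly zero, because "chocolate" points to Y and "cake" points to X, so it is sent back to the human rather than being forced into either category.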

So, when choosing predictive coding providers to support your legal document review, consider Alan Turing and ask the question: Can machines think?

Want to learn more? Katie Fitzgerald has a Predictive Coding webinar on Wednesday 3rd December – register here…

Webinar: The Changing Face of Data Theft


This past week saw the long awaited, and therefore highly anticipated, final instalment of the Kroll Ontrack Autumn/Winter Webinar series, entitled “The Changing Face of Data Theft”.

If you a) didn’t manage to catch it or b) have always wondered what Dimbleby would sound like if he was Welsh, then fear not, for the recording can be found here: Webinar Video

And just in case the pace of this action-packed discussion is too much for you, here’s an overview of the headline topics that came up in the discussion: Synopsis

We were extremely fortunate to be joined by Dan Morrison of Grosvenor Law and E.J. Hilbert, Head of Cyber Investigations at Kroll Advisory Solutions, who both shared their vast experience of handling data loss incidents.

Dan stressed the importance of having properly drafted (and signed) employment contracts, of ensuring that they are fit for the technology abundant in the modern workplace, and of ensuring that properly drafted post-termination covenants are both in place and enforceable.

E.J. advised that the threats and technology being used are not new, but that organisations don’t fully understand the existing threats in the first place and that the biggest weakness in any company is the human. The curiosity to click on an obscure email from a friend, or the urge to simply click “yes” just to remove a pesky pop-up from the screen, remains a significant threat to corporate data, and education is vital to ensure that your employees don’t put your data at risk.

And I…well I just did an introduction and asked a few questions (in addition to making many attendees weak at the knees with my “Dimblebyesque” moderation and dulcet Welsh tones).

All-in-all a well-attended and thoroughly engaging seminar and for that I must thank Dan and E.J.

Until next time…!

2014: Data De-Tox and Tweet-a-Service

If you are interested in predictions and do some “Googling” you will read about the future of wearable computers, the growth in super-computing and the emergence of exaflop machines capable of carrying out a quintillion (a million trillion) calculations per second. You can click on a timeline for 2014 and read that the Internet will have greater reach than television, Google Glass will be launched to the public, most telephone calls will be made over the Internet and smart watches will be the latest must-have gadget. If you are interested in ediscovery predictions for this year you can read what some of us have to say in these articles on “E-disclosure 2014 and beyond” in the New Law Journal and on the SCL website.

Zero email and cyber-cleaning

By far the most appealing prediction I have read about so far for 2014 is the emergence of digital detox.  Given the deluge of data we produce daily via email and over-sharing on social channels it is quite clear we are now polluting our “virtual environments”.  I read recently that Atos, the IT company, with its 77,000 employees in 52 countries, is embarking on an ambitious plan called the Zero email™ initiative, which aims to reduce internal emails between employees by relying on other communication channels such as social media tools.  Of course, one has to ask whether this will simply result in the time spent managing emails being diverted to the management of hundreds of chatter communication strings.  Nevertheless this is a bold and commendable move.

Some trend spotters are talking about the emergence of cyber-cleaning services which will help clean up our virtual environments.  That sounds like data nirvana and I am keen to find out more.  I do wonder though if any of us have time to backtrack and purge or clean up our storage unless we really have to.  It has become so cheap and easy to stash data and search across it and so hard to decide what to keep and what to throw away.

Of course, for companies, and especially those exposed to legal action, retention and deletion raises some interesting legal and technical questions:

  • How do you decide what to keep and what to delete?

It’s all good and well to scrub your hard drive and purge the data from your mailboxes and devices but what if there is a litigation hold in place requiring you to keep certain data?  You cannot simply delete with gay abandon and ignore company policy and preservation obligations.

  • How do you get to the data to delete it?

You may find it easy to delete posts on your company’s chatter tool or on Google Docs but how do you persuade your IT department or Google to deep cleanse their servers or sift through a mountain of backup tapes, delve into them and delete certain categories of data?

  • How do you ensure that you have really deleted it?

When it comes to ensuring that your tapes are in fact squeaky clean there are procedures like “degaussing” (which sounds a lot worse than it is). This gives you an expert stamp of approval if you need to show, for data protection or confidentiality reasons, that all copies of certain data have been destroyed properly.

This is a topic that needs more research and I am resolving to find out more by talking to experts like my colleague Tony Dearsley and writing more this year about digital detox.

Social media – from advertising to commerce

My second favourite prediction for the year relates to social media commerce.  I read the other day that Starbucks has launched its Tweet a coffee service which allows app users to buy a coffee for someone via Twitter by simply tweeting “@tweetacoffee to @ ….. (recipient’s Twitter name)”.  I am all for that kind of sharing and am racking my brains for services to offer via Twitter.

If you have any ideas please Tweet them to me on @TraceyStretton.

About Tracey Stretton

Tracey Stretton is a Legal Consultant at Kroll Ontrack in the UK. Her role is to advise lawyers and their clients on the use of technology in legal practice. Her experience in legal technologies has evolved from exposure to their use as a lawyer and consultant on a large number of cases in a variety of international jurisdictions.

Back to Basics – Dealing with the Data

In my previous post I looked at the planning that is needed prior to any potential edisclosure exercise to gain an understanding of what may be required to gather the data.  Once the data has been collected, there are still many questions that need to be asked and answered to enable you to determine your budget and plan your review.  This post will consider some of the steps that can be taken to keep control of costs and plan your review of the documents.

For example, let’s imagine you have collected 30 gigabytes (GB) of data from 10 custodians, with a mixture of emails and documents.  What does that mean in terms of the number of documents that will need to be reviewed by lawyers?  Typically the volume of data, or the number of GB, is used to estimate the number of documents based on industry averages.  The first thing to consider is data expansion, particularly if emails have been collected as .pst files.  A .pst file is basically a storage file associated with Microsoft Outlook emails.  Emails are compressed into a .pst file so they take up less space, but when the data is processed so that it can be loaded into a document review tool, the emails decompress to their original size, often resulting in a higher GB count.  We have seen .pst files expand more than 4 times during processing.  However, for this example, let’s say the total volume of data increases to 50GB.  Depending on the nature of the documents, there could be 5,000 – 10,000 documents per GB, resulting in 250,000 – 500,000 documents.  Using our document review platform, Ontrack Inview, we would expect reviewers to read on average over 50 documents per hour; at 50 documents per hour that is somewhere between 5,000 and 10,000 hours of review.  This may be great for your billings as a lawyer, but I suspect it is unlikely to match your budget, or the need for proportionality!
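The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration using the figures quoted above (50GB after processing, 5,000 – 10,000 documents per GB, 50 documents per hour); the function name and its parameters are made up for the sketch and real matters will vary.

```python
def estimate_review(processed_gb, docs_per_gb=(5_000, 10_000), docs_per_hour=50):
    """Return ((low_docs, high_docs), (low_hours, high_hours)) for a data set.

    All defaults are the illustrative figures from the worked example,
    not fixed industry constants.
    """
    low_docs = processed_gb * docs_per_gb[0]
    high_docs = processed_gb * docs_per_gb[1]
    # Review time is simply the document count divided by the review rate.
    low_hours = low_docs / docs_per_hour
    high_hours = high_docs / docs_per_hour
    return (low_docs, high_docs), (low_hours, high_hours)


docs, hours = estimate_review(50)
print(f"Documents: {docs[0]:,} to {docs[1]:,}")
print(f"Review hours: {hours[0]:,.0f} to {hours[1]:,.0f}")
```

Changing `processed_gb` or the review rate shows quickly how sensitive the budget is to the assumptions you start with.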

There are various steps that you can take to reduce the number of documents that end up in your document review set.  Filtering the data generally involves applying keyword searches.  You can also filter the data by custodian, by specific date ranges and/or by any other document properties which are available, and remove irrelevant file types such as system files.  Your keyword list needs to be carefully thought out though, as keywords that are too generic will have minimal effect, and too few or too specific keywords could risk relevant documents not passing through the filter and being excluded from the document review set.  You also need to bear in mind that keyword filtering may be ineffective with spreadsheets, pictures and drawings.  It can be helpful to test the effectiveness of your keywords on a subset of the data, and we can certainly assist in that process.  If you plan to use predictive coding, where the document review software learns from human reviewers and automatically codes the documents as relevant, there is a suggestion that no keyword filtering should be applied, but that discussion is for another day.
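To make the filtering step concrete, here is a minimal Python sketch. It is not how any particular review platform works: the `Document` fields, the list of system file extensions, the sample data and the simple substring matching are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date

# File types treated as irrelevant system files (an assumed, illustrative list).
SYSTEM_EXTENSIONS = {".dll", ".exe", ".sys"}


@dataclass
class Document:
    custodian: str
    sent: date
    extension: str
    text: str


def passes_filter(doc, custodians, start, end, keywords):
    """Keep a document only if it survives every filter in turn."""
    if doc.extension.lower() in SYSTEM_EXTENSIONS:
        return False
    if doc.custodian not in custodians:
        return False
    if not (start <= doc.sent <= end):
        return False
    text = doc.text.lower()
    # Plain case-insensitive substring match; real tools offer richer search syntax.
    return any(k.lower() in text for k in keywords)


def keyword_hits(docs, keywords):
    """Count the documents each keyword hits, to spot terms that are too generic or too narrow."""
    return {k: sum(k.lower() in d.text.lower() for d in docs) for k in keywords}


# Hypothetical sample data for testing the keyword list on a subset.
sample = [
    Document("Smith", date(2012, 3, 1), ".msg", "Please review the pipeline contract"),
    Document("Smith", date(2009, 1, 5), ".msg", "Contract renewal reminder"),
    Document("Jones", date(2012, 4, 2), ".dll", "contract"),
    Document("Jones", date(2012, 5, 9), ".docx", "Lunch menu for Friday"),
]

keywords = ["contract", "pipeline"]
review_set = [
    d for d in sample
    if passes_filter(d, {"Smith", "Jones"}, date(2011, 1, 1), date(2013, 12, 31), keywords)
]
print(len(review_set), "of", len(sample), "documents pass the filter")
print(keyword_hits(sample, keywords))
```

Running the hit counts on a sample before committing to a keyword list is one simple way to see which terms pull in too much and which pull in nothing at all.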

In a litigation matter, the scope of the filtering should be agreed by the parties. Should the parties disagree, the court can make an order before disclosure starts.  In Digicel (St. Lucia) Limited and others v Cable & Wireless Plc and others [2008] EWHC 2522 (Ch), the court was critical of the parties’ solicitors deciding the search terms without consulting each other:

“[The Defendants] did not discuss the issues that might arise regarding searches for electronic documents and they used key word searches which they had not agreed in advance or attempted to agree in advance with the Claimants.  The result is that the unilateral decisions made by the Defendants’ solicitors are now under challenge and need to be scrutinised by the Court. If the Court takes the view that the Defendants’ solicitors’ key word searches were inadequate when they were first carried out and that a wider search should have been carried out, the Defendants’ solicitors’ unilateral action has exposed the Defendants to the risk that the Court may require the exercise of searching to be done a second time, with the overall cost of two searches being significantly higher than the cost of a wider search carried out on the first occasion.”

A further reduction of the data can be achieved by reducing the number of duplicate documents.  De-duplication can be applied, either to each individual custodian’s data, or across all the data to remove duplicate documents.  In some cases it will be sufficient to keep only one copy of key documents in the review database and remove all other copies, but in other cases, such as a fraud matter, it will be important to keep the copies to show who had knowledge of what.

If you have key custodians, you should consider processing their data in order of priority.  That way your highest priority custodian’s data will be de-duplicated first, and only against itself, so it loses the fewest documents; the next custodian’s data will be de-duplicated against the first custodian’s data, and so on, ideally resulting in your lowest priority custodian having the least data.  You should be aware though that not all duplicates are removed during this process; for instance, where the same document is attached to different emails, it will not be removed, so that you can review the attachment in the context of each email.
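The difference between de-duplicating within each custodian and across all custodians in priority order can be sketched as follows. This is a simplified illustration only: it assumes duplicates are identified by a SHA-256 hash of the file content, which may not match how a given processing tool defines a duplicate, and it ignores the email attachment exception described above. The custodian names and documents are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Fingerprint a document by hashing its content (an assumed duplicate test)."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(custodians_in_priority_order, across_custodians=True):
    """Return {custodian: kept documents}, processing custodians in the order given.

    With across_custodians=True a document already kept for a higher priority
    custodian is dropped for later ones; with False each custodian is only
    de-duplicated against their own data.
    """
    seen = set()
    kept = {}
    for name, documents in custodians_in_priority_order:
        if not across_custodians:
            seen = set()  # start afresh for each custodian
        kept[name] = []
        for content in documents:
            fingerprint = digest(content)
            if fingerprint not in seen:
                seen.add(fingerprint)
                kept[name].append(content)
    return kept


# Hypothetical data: Alice is the highest priority custodian.
data = [
    ("Alice", [b"contract", b"memo", b"contract"]),
    ("Bob", [b"memo", b"invoice"]),
]

print(deduplicate(data, across_custodians=True))
print(deduplicate(data, across_custodians=False))
```

With de-duplication across custodians Bob keeps only the invoice, because the memo has already been kept for Alice; within-custodian de-duplication leaves Bob's copy of the memo in place, which is what you would want where it matters who had knowledge of what.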

Making mistakes with filtering and de-duplication can result in potentially costly consequences, perhaps best highlighted in West African Gas Pipeline Company Ltd v Willbros Global Holdings Inc [2012] EWHC 396 (TCC).  In this case documents were not collected properly and some were missing.  Additionally, there were problems experienced with the quality of OCR, de-duplication, inconsistent redactions and the out-sourced review.  The Judge readily accepted that disclosure in complex international construction projects is difficult, but he was persuaded that errors were made and the claimant’s disclosure did cause additional problems which wasted time and costs.  He ordered the claimant to pay the wasted costs caused by the de-duplication failings and the inconsistent redactions and the wasted costs of a disrupted and prolonged disclosure exercise.  Working closely with your edisclosure partner as early as possible can help ensure that all the steps you have taken to reduce the data are defensible.

Filtering and de-duplication planned well can have a dramatic effect on the number of documents that need to be reviewed, which in turn impacts on your budget.  We frequently see the volume of data reduced by 40 – 60% through filtering and de-duplication.  To put this in perspective, if we return to our original example, a 60% reduction would result in 250,000 – 500,000 documents being reduced to 100,000 – 200,000.  With tight budgets, and frequently tight deadlines, spending time determining the most appropriate way to reduce your data will be time well spent.
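For anyone who wants to check the numbers for both ends of that 40 – 60% range, a short Python sketch follows. The function name is made up for illustration, and the document counts are those from the original example.

```python
def remaining_documents(documents, reduction_percent):
    """Documents left for review after filtering and de-duplication."""
    # Integer arithmetic keeps the result exact for whole percentages.
    return documents * (100 - reduction_percent) // 100


for pct in (40, 60):
    low = remaining_documents(250_000, pct)
    high = remaining_documents(500_000, pct)
    print(f"{pct}% reduction: {low:,} to {high:,} documents remain")
```

A 40% reduction leaves 150,000 to 300,000 documents and a 60% reduction leaves 100,000 to 200,000.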

Back to Basics – Proper Planning

A trawl of the various blogs and articles on eDisclosure finds plenty of articles on predictive coding, Technology Assisted Review (TAR), big data, analytics, the Jackson Reforms and cost budgeting.  Indeed, even our own blog to date has focused a great deal on these issues, as the tags on the left show.  All of these topics are essential reading for anyone involved in eDisclosure, but they all assume one thing – everyone knows the basics.  No doubt all of our readers are fully aware of the new rules regarding the submission of budgets.  Anyone who is following the Plebgate saga cannot fail to be aware of Andrew Mitchell’s predicament due to his budget not being submitted at least seven days ahead of the CMC.  As a consequence, the court said Mr Mitchell “would be limited to a budget consisting of the applicable court fees for his claim”.  The judge also went on to say:

“Budgeting is something which all solicitors by now ought to know is intended to be integral to the process from the start, and it ought not to be especially onerous to prepare a final budget for a CMC even at relatively short notice if proper planning has been done.”

From our perspective, the key words here are “proper planning”.  One of the most costly aspects of litigation is the actual review of the documents due to the hours that this can potentially take.  But if you are inexperienced at eDisclosure, or don’t know your megabytes from your gigabytes, or both, where do you start?  Hopefully here.

The first thing to think about when your client rings is where to find the information relevant to the case.  The answer to that question will lie with your clients, or if you work for a corporation, with key personnel in IT and management.  The Electronic Documents Questionnaire contained within the Schedule of Practice Direction 31B is a useful template, but here are the key questions that will help us to help you:

  • How many individuals are potentially involved?
    • Individuals are referred to as custodians.
  • Where is the relevant data for these custodians stored?
    • Their data may be on multiple sources, e.g.:
      • Desktop computer
      • Laptop computer
      • External device
      • Smart phone
      • Server
      • Backup tapes
  • Is it necessary to collect all the data from all the sources to avoid the possibility of having to return, thus incurring additional costs?
  • How much data might there be?
    • This is very important as it will eventually help determine the number of potential documents for review.
    • The unit used for data in these circumstances is a Gigabyte (GB).
  • What type of data is there?
    • What type of email does your client use, e.g. Microsoft Outlook, Lotus Notes?
    • Any databases or proprietary software?
    • Any messaging data, e.g. Bloomberg Messaging?
    • Any audio data?
  • What languages are contained within the data?
    • Do you have reviewers with the necessary language skills?
    • Is machine translation, whereby your review platform carries out a basic translation, appropriate for your initial review?
  • Who should collect the data and how should it be collected?
    • Where is the data geographically?
    • Do you require an independent third party to collect the data in a defensibly sound manner?
  • What are the data privacy implications, if any?

Whilst these questions are not exhaustive, if you have thought about them, you will be in a position to start your conversation with your eDisclosure providers.  Ideally, relationships ought already to have been built up with technology experts as in most cases there will be little time to conduct a “beauty parade”.

We can help you collect the information you need.  Together we can then begin to plan how you are going to retrieve the data, how long that may take, and what the costs may be.  You will also need to start thinking about the actual data: what happens when it is processed before review, how you can reduce the volume of data to review, and what technology you want to use to help you, as it is likely that some sort of data filtering technology and review platform is going to be required.

These topics will be covered in the next Back to Basics post.

Next week, Rob Jones will be writing a blog post on what you need to know about Technology Assisted Review (TAR). You can see a preview below.