All posts tagged Data Management

Fully flexible with text-only

Let’s skip the scoping calls, collections and meetings for this blog post; our starting point is a database filled with tens of thousands of files and the document review is about to commence. The data processing, which takes up most of the initial costs, has already happened. The next technology expenditure that clients will encounter is hosting, and although hosting costs look smaller than the processing costs, they are not to be underestimated. If the investigation takes a month or two longer than expected, more often than not the hosting fees will catch up with the initial processing fees.

This can change now.

In a database each document can be uploaded in three different formats: text, native and TIFF image. Each format provides the reviewer with a unique set of tools such as machine translation, predictive coding, redactions and the ability to create productions. These three formats all contribute to the total volume of hosted data and thus to hosting costs.

Data Volumes

As stated above, each document format brings something unique to the table, but these tools are not always needed, especially in the early phases of a document review. Clients often have to collect large amounts of data to ensure critical information is not left out. The next step is to cull the data as much as possible. Redactions and productions are not needed at this point, and the text-only format gives the reviewer all the information and tools needed to establish whether a document is relevant or not. During this initial phase of the investigation the native and TIFF images sit in the database taking up server space and increasing the hosting costs unnecessarily.

Kroll Ontrack now offers its clients a more flexible approach. Clients can choose to initially upload only the text format when setting up a database. This will drastically decrease the hosting volumes and thus the hosting costs. If native or TIFF image formats are needed at a later stage of the review, they can be requested on the fly within the database. The result is that clients will only pay for the documents and formats they are actually using.
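To make the arithmetic concrete, here is a minimal sketch of how hosted volume drives the monthly fee. The per-gigabyte rate and the volume of each format are hypothetical figures chosen purely for illustration; they are not actual pricing or measurements from a real matter.

```python
# All figures below are hypothetical, for illustration only.
RATE_PER_GB_MONTH = 10.0  # assumed hosting fee per GB per month

# Assumed hosted volume (GB) per format for one database.
volumes_gb = {"text": 5.0, "native": 60.0, "tiff": 35.0}


def monthly_hosting_cost(formats, volumes=volumes_gb, rate=RATE_PER_GB_MONTH):
    """Return the monthly fee for hosting only the given formats."""
    return sum(volumes[f] for f in formats) * rate


all_formats = monthly_hosting_cost(["text", "native", "tiff"])
text_only = monthly_hosting_cost(["text"])

print(f"All formats: {all_formats:.2f} per month")
print(f"Text only:   {text_only:.2f} per month")
print(f"Difference over a 3-month first-pass review: {3 * (all_formats - text_only):.2f}")
```

With real volumes and rates substituted in, the same calculation shows how much of the monthly fee is attributable to formats that are not yet being used.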

About Jasper van Dooren

Jasper is part of the Electronic Evidence Consultancy team, which provides scoping consultancy and advice to potential clients in ediscovery or computer forensics matters. He also assists clients by providing demonstrations, presentations, documentation and advice before and during project engagements to ensure that expectations and legal requirements are being met. Jasper graduated from Utrecht University, Netherlands, with a Master’s degree in Private law before moving to London.

Only write the novel when you can solve the crime

A forensic mystery at Churchill War Rooms

When I first started as a Trainee Computer Forensic Analyst the sage advice I received from my manager was (as best as I can remember) “There are two types of people in this business: those that sit around figuring out how to commit a crime and the others that actually do it”.

When Tracey Stretton first suggested that my ‘creative’ imagination ought to be used for a “CF Murder Mystery” event I reeled.  Where do you start? How can I make it believable? What details are necessary for a mystery story?

By far the quote I found most helpful was from Andrew Hixson, of the James Bond short stories.

“I only write the novel when I can solve the crime”.

After the initial shock had worn off I quickly realised that I had been given a free ticket.  Without any billable time pressures I could finally, once and for all, take the time to work out from start to finish all aspects of a full ‘crime’.

The core of the plot came about in our first brainstorming session.  The event was to be limited both in time and, as alcohol was likely to be involved, complexity.  We needed a goldilocks computer security incident which was ‘just right’.

The simplest story is often the most believable, so it’s no surprise that we went with good old fashioned larceny.  After all, barring the consequences, we all can think of a way to steal data.

Between myself, Julian Sheppard and Tony Dearsley we collectively had enough stories about thieves and experience with thefts to provide a whole mini-series, not just one evening.

One of the more entertaining ideas we came up with was the discovery of a USB key in the Channel Tunnel, laid on a rail exactly across the Anglo-French border (The Discovery).  Unfortunately Sky Atlantic beat us to it and unveiled The Tunnel.  I still maintain that they took my idea and filmed an entire series in two weeks, just to throw me off!

Writing up the suspects and their backstories caused the most concern.  Each time I mentioned the name of an obscure fictional British or American spy there would be worried looks between colleagues.  “Is he daydreaming again?”, “What has this got to do with The War Rooms?” and “Why aren’t you on billable work?” were often asked.

Working out the details was easy once we had realistic characters.  Ultimately, for each of our suspects we laid out their motives and opportunities so as to leave a trail of clues to be picked out by our guests.  The plot becomes something far more interesting when we cheat and use the imagination of others to fill in the gaps.

In the words of Tolkien, “Good stories deserve embellishment”, so it was decided that in order to tell a unique story we would need a unique visual guide.  This was Dial D for Data Theft, not Death by PowerPoint!

With judicious use of motion sickness inducing Prezi we were able to develop an interesting, if quirky, set of ‘slides’.

And then suddenly it was time for us to set out to the Cabinet War Rooms!

What a night it was! A perfect combination of story, location and audience.  Indeed the audience participation was, as I expected, the most inventive part of the presentation.

When asked why they thought a particular culprit was guilty, some of the answers were not exactly scientific:

“Shifty Eyes”
“He owns a Porsche.”
“She reminds me of my ex-wife”

However, my favourite quote of the night goes to the guest who wrote on his guessing card:

“It was Felix [because] his shirt is far too tight and he’s a liar!  There’s no way he’s 6’10”! 5’11” at MOST”.

Then, with a bottle of something nice awarded to the winning entry from our audience (none of the above were winners, sadly), we wrapped up the evening with an exciting dénouement and final farewell.

E-Discovery and E-Investigations Forum 2013

Visits to countless hotels with their endless Las Vegas-style psychedelic carpets, exchanging a metric ton of business cards with sales folk in shiny suits, shinier badges and yet shinier teeth, and a veritable bounty of canapés and foods on sticks that so epically fail to satiate one’s hunger. All of the above can only mean one thing… conference season is well and truly upon us.


Rob and Luke at AKJ

That time of year where the legal technology industry crams a quarter’s worth of conferences into a three-week period, so that everyone can feel slightly more comfortable with the fact that everyone will be mentally checked out from mid-November until we’re safely into 2014 and our New Year’s resolution requires us to work harder.

But the season isn’t all pretentious canapés and teeth whitening; it can’t all be fun and games! Occasionally, as a subject matter “expert” in one’s field, you are asked to share your knowledge with a room full of strangers, and that is precisely what I was asked to do when chairing a panel discussion entitled “Protecting data in business and in investigations”. I was joined by Martin Pratt, Head of the Employment Group at Gordon Dadds Solicitors in Mayfair, and E.J. Hilbert, Head of Cyber Security at Kroll Advisory Solutions and regular creator of audible gasps as he tells people of his 8 years spent as an FBI secret agent countering international hacking (no PRISM jokes please).


Luke at AKJ

The discussion was incredibly well received and the feedback has been overwhelmingly positive. Huge thanks for this must go to the two gentlemen mentioned above, whom I, in a Dimblebyesque way, merely pointed in what I hoped to be an interesting direction and let their vast experience and expertise come across to the audience.  I know from feedback that some even took some helpful hints back to the office with them that day. I can hear you all thinking “Luke, helpful takeaways from a conference seminar? Such a thing does not exist, I just go for the chicken ballotine with quince jelly.”

At a high level, the points are basic. For external threats, it’s all about educating staff. The identity of external threats may have shifted, but their methods continue to be repeated ad nauseam.  As long as people are still using their dog’s name or favourite football team as their password, hackers will always be able to crack it. As long as people follow links, even those that appear to come from a trusted source, their ‘email to click’ ratio will remain high and this method remains viable. So change your obvious password to a phrase instead. You won’t forget “tobeornottobe” in a hurry, but it’s infinitely harder to crack. Instead of clicking that link you’ve been sent, Google the name, find the original source and then decide whether to trust that email or not.
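As a rough illustration of why length matters, the sketch below compares the brute-force search space of a short pet-name password with that of the phrase mentioned above. It assumes an attacker trying every combination of lowercase letters only; the password "rover" is a hypothetical example, not taken from any real case.

```python
import math


def brute_force_keyspace(alphabet_size, length):
    """Number of candidate strings of exactly `length` symbols."""
    return alphabet_size ** length


LOWERCASE = 26  # assumption: lowercase letters only

short_name = "rover"      # hypothetical pet-name password
phrase = "tobeornottobe"  # the phrase from the post

short_space = brute_force_keyspace(LOWERCASE, len(short_name))
phrase_space = brute_force_keyspace(LOWERCASE, len(phrase))

print(f"{short_name}: {short_space:,} candidates (~{math.log2(short_space):.1f} bits)")
print(f"{phrase}: {phrase_space:,} candidates (~{math.log2(phrase_space):.1f} bits)")
print(f"ratio: {phrase_space // short_space:,}x")
```

These numbers are an upper bound on the attacker's work. Well-known quotations also appear in the word lists used for dictionary attacks, so a phrase that is personal and unpublished is a safer choice than a famous line.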

For internal threats the messaging is more important than ever: control who can access data. Categorise it so that staff have access to data required for their job but nothing else and ensure that your employment contracts are fit for the modern workplace, and regularly updated.

We have been asked to present further on this topic of data theft/loss in business, both at the E-Crime forum in Amsterdam on 28th November 2013 and as the final part of our current Webinar series, which is set to broadcast in early December. They promise to be excellent discussions and if at all possible I strongly urge people to register and listen in.

Until then, look after yourself and each other.

Back to Basics – Proper Planning

A trawl of the various blogs and articles on eDisclosure finds plenty of articles on predictive coding, Technology Assisted Review (TAR), big data, analytics, the Jackson Reforms and cost budgeting.  Indeed, even our own blog to date has focused a great deal on these issues, as the tags on the left show.  All of these topics are essential reading for anyone involved in eDisclosure, but they all assume one thing – everyone knows the basics.  No doubt all of our readers are fully aware of the new rules regarding the submission of budgets.  Anyone who is following the Plebgate saga cannot fail to be aware of Andrew Mitchell’s predicament due to his budget not being submitted at least seven days ahead of the CMC.  As a consequence, the court said Mr Mitchell “would be limited to a budget consisting of the applicable court fees for his claim”.  The judge also went on to say:

“Budgeting is something which all solicitors by now ought to know is intended to be integral to the process from the start, and it ought not to be especially onerous to prepare a final budget for a CMC even at relatively short notice if proper planning has been done.”

From our perspective, the key words here are “proper planning”.  One of the most costly aspects of litigation is the actual review of the documents due to the hours that this can potentially take.  But if you are inexperienced at eDisclosure, or don’t know your megabytes from your gigabytes, or both, where do you start?  Hopefully here.

The first thing to think about when your client rings is where to find the information relevant to the case.  The answer to that question will lie with your clients, or if you work for a corporation, with key personnel in IT and management.  The Electronic Documents Questionnaire contained within the Schedule of Practice Direction 31B is a useful template, but here are the key questions that will help us to help you:

  • How many individuals are potentially involved?
    • Individuals are referred to as custodians.
  • Where is the relevant data for these custodians stored?
    • Their data may be on multiple sources, e.g.:
      • Desktop computer
      • Laptop computer
      • External device
      • Smart phone
      • Server
      • Backup tapes
  • Is it necessary to collect all the data from all the sources to avoid the possibility of having to return, thus incurring additional costs?
  • How much data might there be?
    • This is very important as it will eventually help determine the number of potential documents for review.
    • The unit used for data in these circumstances is the gigabyte.
  • What type of data is there?
    • What type of email does your client use, e.g. Microsoft Outlook, Lotus Notes?
    • Any databases or proprietary software?
    • Any messaging data, e.g. Bloomberg Messaging?
    • Any audio data?
  • What languages are contained within the data?
    • Do you have reviewers with the necessary language skills?
    • Is machine translation, whereby your review platform carries out a basic translation, appropriate for your initial review?
  • Who should collect the data and how should it be collected?
    • Where is the data geographically?
    • Do you require an independent third party to collect the data in a defensibly sound manner?
  • What are the data privacy implications, if any?
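The "how much data" question above feeds directly into review planning. The sketch below shows one way a volume in gigabytes could be turned into a rough document count and review duration. The documents-per-gigabyte figure and the review speed are hypothetical placeholders; real values vary widely with file types and the nature of the case.

```python
import math

# Hypothetical planning figures; real values vary widely by data type and case.
DOCS_PER_GB = 5_000          # assumed average documents per gigabyte
DOCS_PER_REVIEWER_HOUR = 50  # assumed first-pass review speed


def estimate_review(gigabytes, reviewers, hours_per_day=7):
    """Return (estimated documents, reviewer-hours, working days) for a collection."""
    documents = int(gigabytes * DOCS_PER_GB)
    hours = documents / DOCS_PER_REVIEWER_HOUR
    days = math.ceil(hours / (reviewers * hours_per_day))
    return documents, hours, days


docs, hours, days = estimate_review(gigabytes=20, reviewers=5)
print(f"{docs} documents, {hours:.0f} reviewer-hours, about {days} working days")
```

Even with rough inputs, an estimate like this gives a starting point for the budget conversation, and it can be refined once the data has been processed and culled.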

Whilst these questions are not exhaustive, if you have thought about them, you will be in a position to start your conversation with your eDisclosure providers.  Ideally, relationships ought already to have been built up with technology experts as in most cases there will be little time to conduct a “beauty parade”.

We can help you collect the information you need.  Together we can then begin to plan how you are going to retrieve the data, how long that may take, and what the costs may be.  You will also need to start thinking about the actual data: what happens when it is processed before review, how can you reduce the volume of data to review, and what technology do you want to use to help you as it is likely that some sort of data filtering technology and review platform is going to be required.

These topics will be covered in the next Back to Basics post.

Next week, Rob Jones will be writing a blog post on what you need to know about Technology Assisted Review (TAR). You can see a preview below.

Limber up for the Big Data Marathon

The Data Craze for Sports Fanatics and Lawyers

One of my colleagues has just run the Reading Half Marathon and I am expecting any minute to see his race stats published on Facebook.  Well done Rob Jones, a GPS time of 2:21:19.  Budding athletes and intrepid cyclists are downloading various apps to their phones (like Endomondo Sports Tracker), relying on the information they gather to track distance travelled, time taken and energy expended, and using this not only to subtly show off on social networking sites but also to plot and plan their race strategies. Of course, a positive spin-off is that the rest of us, having shared their pain and gain, feel inspired to do something similar and before you know it the data craze has turned into a sports craze and a new way of doing things. This phenomenon highlights how data can be transformed into intelligence, can inform decision making and strategy and possibly even have an unintended impact.  It got me thinking again about the influence that big data and predictive analytics are having on business and on the legal profession and how edisclosure fits into the picture.

Big data in business

Initially it was only big companies like telecommunications companies, banks and government agencies that could afford to store and analyse big data.  Thanks to advancements in hardware and databases you no longer need supercomputers to carry out complex analytics across large data sets.  Many businesses are finding that for a reasonable investment they can collect data and make it relevant to their business; by measuring consumer behaviour and using pattern detection they can respond to customer needs and market conditions and make data-driven decisions.   Supermarkets, healthcare providers, gaming companies, insurance companies and even florists are jumping on the bandwagon and tapping into the intelligence running through the big data stream and finding ways to monetise the data they hold.

But (and it’s a big but) what about law firms? 

Can lawyers, who have tended to shy away from technological innovation, really harness big data to predict case outcomes and legal costs?  We know that big data can be exploited to predict the outbreak of diseases, but can it be used to predict the outcome of a litigation case?  An interesting article by Mike Wheatley on SiliconANGLE suggests that databases of legal history are being built up and algorithms are being developed to help predict case outcomes.  Apparently, companies are also developing mobile apps that predict the average legal cost of different types of cases in the US.

As we enter a new era of cost management in the UK and the need to stick to case budgets becomes more important, we will need all the help we can get to estimate costs and guess what impact variables like the number of witnesses or extent of disclosure might have, not only on costs, but also on the outcome of a case.  Of course the data that needs to be collected, analysed and correlated to make sensible predictions includes not just the key features and facts of the case itself but also the results recorded in subsequent court decisions.   When it comes to costs, law firms and e-disclosure providers are all holding a lot of valuable billing data that could be analysed to assist with cost estimating.   This might all be feasible but has not yet been done.

On the edisclosure front, data analytics has been used for some time.  We have had email analytic tools that can be used to visualize who has been communicating with whom, when and about what.  Similarly, Technology Assisted Review (TAR) (also known as Computer Assisted Review or Predictive Coding) analyses decisions made by humans on a sub-set of documents, and then looks for similar patterns in a much larger document universe to predict which documents are relevant to a case and top priority.  At this stage most of us know about TAR and some are testing the water. Here are some tips on analytics from the sports scene:

Sports analytics and the CIO: Five lessons from the sports data craze

Collect the right data to start with, both qualitatively and quantitatively.  In edisclosure this means targeting the right sources of data and is an area where experts can help.  Is it better to present a raw unfiltered set of data (to teach the system in a balanced way) or a set of results based on a carefully crafted search, or is that somewhat prejudicial?  Until there are better statistics and more guidelines from real cases, the ultimate decision is likely to be a strategic one.

Start with statistically significant data.  This refers to the selection of your seed set of documents that will be reviewed by humans and used to train the prediction software.   You cannot expect the software to achieve peak performance on 1,000 documents.
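For readers who want a feel for the numbers, the sketch below uses the standard sample-size formula for estimating a proportion (Cochran's formula with a finite population correction). It is an assumption on our part that this is a useful yardstick here: it sizes a random sample for estimating what share of a document population is relevant, and it does not tell you how many training documents a particular prediction tool needs.

```python
import math


def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Random sample size for estimating a proportion.

    z=1.96 corresponds to 95% confidence; p=0.5 is the most
    conservative assumption about the true proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)         # finite population correction
    return math.ceil(n)


# Hypothetical collection of 100,000 documents.
print(sample_size(100_000))               # 5% margin of error
print(sample_size(100_000, margin=0.02))  # 2% margin of error
```

Tightening the margin of error increases the sample size sharply, which is one reason the size of the human-reviewed set is worth discussing with your provider at the outset.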

Remember that the ability to contextualise data is important.  There are incalculable factors that come into play with prediction and this is where human quality control is vital.

Perhaps, as we use these predictive tools more in legal cases and share our practical experiences and results, their use will become widespread and a status symbol just like Nike+ is.

About Tracey Stretton

Tracey Stretton is a legal Consultant at Kroll Ontrack in the UK. Her role is to advise lawyers and their clients on the use of technology in legal practice. Her experience in legal technologies has evolved from exposure to its use as a lawyer and consultant on a large number of cases in a variety of international jurisdictions.