How do ediscovery platforms manage the challenge of East Asian languages?

18 August 2017 by Adrienn Toth

As a global company, we always find it interesting to learn how clients and colleagues use technology to overcome challenges. At our recent Face to Face with the Regulators symposium, one key requirement that emerged for clients working in Asia was the capacity of their ediscovery vendors to deal with East Asian languages such as Chinese, Japanese and Korean.

We caught up with our friends and colleagues at KrolLDiscovery APAC to find out how ediscovery technology tackles these languages.

How have you seen Asian countries like China deal with the challenge of handling data in multiple languages during international and national e-discovery projects?

Working in multiple languages, either nationally or internationally, can present a number of challenges. First, you need an ediscovery platform capable of handling multiple languages including Chinese, which comes with its own set of challenges specific to the written form (more on this later). Second, you need native speakers of each language in place who understand the legal and technological considerations of each case. For international projects, there is a third consideration: you need teams that can work locally in each country whilst still communicating and working as part of a wider global team.

Our Asian clients come to us because our combination of technology, global network and local expertise mitigates these challenges. In terms of technology, our ediscovery platforms can handle hundreds of languages including Traditional Chinese, Simplified Chinese, Korean and Japanese. One platform can handle multiple languages within one country, simplifying national ediscovery projects with multiple language requirements.

For international projects, we like to say we are around the globe but across the street. Our case managers, consultants and forensics experts are based in local offices and speak the local language but are part of something bigger and often work on cross-border cases in conjunction with our other teams around the world. We believe this is the key to a successful ediscovery project involving multiple languages and jurisdictions as ultimately, there needs to be cohesion and collaboration to ensure the deadlines and requirements in all countries involved are met.

What are some of the nuances or idiosyncrasies of the Chinese language which may make it more difficult than English to review for e-discovery practitioners and e-discovery tools / machines? Do other related Asian languages share the same nuances / idiosyncrasies?

The biggest challenge for ediscovery practitioners and ediscovery software developers alike is handling the written forms of Chinese and other East Asian languages. Unlike Western languages using Roman or Cyrillic alphabets, where each letter represents a sound and letters are combined to build words, Chinese (traditional and simplified) and the kanji used in Japanese are logographic, while Korean hangul is alphabetic but arranges its letters into syllable blocks. As a result, single characters can represent anything from a single word to multiple words to entire phrases. Furthermore, Chinese and Japanese text has no spaces to segment individual words. A string of characters can be read differently depending on where it is segmented by the reader or indeed, in ediscovery cases, the platform.

When looking for an ediscovery platform to use in China, it is vital that effective tokenization systems are in place. Tokenization is the process of segmenting strings of characters into words and phrases. The best ediscovery systems use sophisticated tokenization to ensure searches are accurate. In contrast, more basic platforms deploy a simplistic method whereby each character is treated as a separate word. Given the nuances involved, these systems can result in unreliable and inaccurate data filtering and processing.
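To make the difference concrete, the following is a minimal sketch in Python that compares the simplistic one-character-per-word approach with a dictionary-based segmenter. It uses forward maximum matching, one well-known textbook technique, and is not a description of how any particular ediscovery platform works. The function names and the tiny `LEXICON` are hypothetical and exist only for illustration; production tokenizers rely on far larger dictionaries and statistical models.

```python
# Hypothetical toy lexicon: "electronic", "email", "mail", "evidence".
LEXICON = {"电子", "电子邮件", "邮件", "证据"}


def unigram_tokenize(text):
    """Simplistic method: treat every non-space character as its own word."""
    return [ch for ch in text if not ch.isspace()]


def max_match_tokenize(text, lexicon):
    """Forward maximum matching: at each position, take the longest
    lexicon entry that fits; fall back to a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "电子邮件证据"  # "email evidence"
print(unigram_tokenize(sentence))             # six single-character tokens
print(max_match_tokenize(sentence, LEXICON))  # two word tokens
```

With character-level tokens, a search index has no record that 电子邮件 ("email") is a single term, whereas the dictionary-based pass keeps it intact. Maximum matching can still choose the wrong boundary in ambiguous strings, which is one reason more sophisticated tokenizers exist.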

Language recognition can also present a problem during Asian ediscovery projects. For example, Japanese has three writing systems: hiragana and katakana, which are syllabaries (phonetic writing systems where each character represents a syllable), and kanji. Katakana is primarily used to transcribe foreign words. Kanji is a logographic system that shares many characters with written Chinese. On a similar note, some Japanese text is written in ‘Romaji’, where the Roman alphabet is used to write Japanese. As a result, some platforms may not recognize a text as being written in Japanese.
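The recognition problem can be illustrated with a short Python sketch that counts which scripts appear in a piece of text, using the standard library's `unicodedata` module. The function names, labels and decision rules below are assumptions made for illustration only; this is a deliberately naive heuristic, not the method used by any real platform, and real language identification relies on statistical models.

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names mapped to simple script labels.
SCRIPT_PREFIXES = {
    "HIRAGANA": "hiragana",
    "KATAKANA": "katakana",
    "CJK UNIFIED IDEOGRAPH": "han",
    "HANGUL": "hangul",
    "LATIN": "latin",
}


def script_counts(text):
    """Count characters per script, based on Unicode character names."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for unnamed characters
        for prefix, script in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[script] += 1
                break
    return counts


def guess_language(text):
    """Naive guess (illustrative assumption): kana implies Japanese,
    hangul implies Korean, ideographs alone are ambiguous."""
    counts = script_counts(text)
    if counts["hiragana"] or counts["katakana"]:
        return "japanese"
    if counts["hangul"]:
        return "korean"
    if counts["han"]:
        return "chinese-or-kanji-only"
    if counts["latin"]:
        return "latin-script"
    return "unknown"


print(guess_language("これは契約書です"))   # contains hiragana
print(guess_language("合同文件"))          # ideographs only
print(guess_language("keiyakusho desu"))  # Romaji looks like any Latin text
```

The last two calls show the weak spots described above: a string made up only of ideographs cannot be assigned to Chinese or Japanese by script alone, and Romaji is indistinguishable from other Latin-script text without looking at the words themselves.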

When looking for ediscovery providers for projects in Asia, it is always best to choose companies that employ native speakers in their consultancy, case management and computer forensics teams. Aside from the complications involved for platforms handling Asian written languages, human readers can also struggle. Asian languages are richly nuanced and the meaning of a word or phrase can be changed by the use of different tones or regional dialect. Even fluent second language speakers may miss these nuances which could result in significant misunderstandings that may affect the outcome of a case or investigation.

Have you seen more e-discovery companies or practitioners use advanced technology like AI to better mitigate language challenges when conducting e-discovery? If so, how?

Asia is only just beginning to discover the advantages of predictive coding technology. The majority of our clients in Asia are using predictive coding to automatically create workflows by identifying documents according to language and then segregating documents for processing accordingly.

However, some of our more technologically savvy clients are starting to unlock the potential predictive coding has for refining the ediscovery process on projects involving Asian languages, for example, to look for documents containing colloquialisms or other ambiguous language that require further human review to improve clarity and understanding.