East Asian languages are stretching the limits of current e-discovery review tools—and fueling innovation.
As e-discovery activity grows in Asia, attorneys and investigators are struggling to adapt advanced e-discovery technology, built around the English language, to foreign languages. While any language can present e-discovery tools with challenges, the problems are more acute with East Asian languages, whose structure and diversity are far more complex than those of Western European languages.
The Space Between
So just how are Asian languages’ unique characters pushing the bounds of what current e-discovery tools can accomplish? To start, one needs to look no further than the spaces between words—or the lack of them.
“In English and most Western languages, we have a space between each word, which makes it very easy for a computer to understand each word as a distinct and discrete piece of the sentence,” said Jared Nelson, partner at MWE China Law Offices in Shanghai.
“However, some Asian languages, like Chinese, do not use spaces between words and, further complicating things, some words depend on the formation of two-character couplets, which makes it much more challenging [for e-discovery tools] to automatically identify unique words in a sentence,” he noted.
But difficulties parsing out specific words in a document are just the tip of the iceberg. These differences in spacing and word formation point to the vastly dissimilar linguistic structures at the foundation of Western and East Asian languages.
Western languages build sentences and phrases out of basic root words and phonemes, which are the basis of modern e-discovery search syntax. For example, an e-discovery platform may assign the keyword “walk” to every variation of the verb found within a document (walking, walked, walks, etc.).
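The root-word normalization described above can be sketched as a toy suffix-stripper. This is a hypothetical illustration, not any vendor’s actual stemmer; production platforms use far more sophisticated stemming and lemmatization:

```python
# Toy illustration of root-word ("stem") reduction, the kind of
# normalization Western-language e-discovery indexes rely on.
# A naive suffix-stripper for demonstration only.

SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Strip a common English inflectional suffix, if present."""
    for suffix in SUFFIXES:
        # length guard avoids mangling short words like "sing"
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

variants = ["walking", "walked", "walks", "walk"]
stems = {w: naive_stem(w) for w in variants}
# every variant reduces to the single searchable keyword "walk"
```

Because each variant collapses to the same root, one indexed keyword can retrieve every inflected form in a document set.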
But in Asian languages, verbs, nouns and adjectives cannot inherently be reduced to root words or phonemes.
“Unlike Western languages using Roman or Cyrillic alphabets, where each letter represents sounds to build words, Chinese, Japanese and Korean language groups use a logographic system,” said Kate Chan, Hong Kong-based regional managing director at KrolLDiscovery. “As a result, single characters can represent anything from a single word to multiple words to entire phrases.”
That makes searching through Asian language documents seem like a near impossible endeavor, since root words cannot be extracted from the seemingly unbroken text and readily indexed. But Chan noted that such review is made possible by having an “effective tokenization system” in place.
Tokenization, she explained, “is the process of segmenting characters to define words and phrases. The best e-discovery systems use sophisticated tokenization systems to ensure [accurate] searches.”
Indeed, tokenization is already a cornerstone of Western language e-discovery review, where machines recognize distinct words by the spaces between them and their roots, and can therefore turn each word in a document into a searchable keyword.
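One classic approach to segmenting unspaced Chinese text is greedy longest-match against a word list. The tiny dictionary and sentence below are illustrative assumptions; real e-discovery tokenizers rely on large lexicons and statistical or neural segmentation models:

```python
# Minimal sketch of tokenization (word segmentation) for unspaced
# Chinese text, using forward maximum matching: at each position,
# prefer the longest substring found in the dictionary.

DICTIONARY = {"我们", "喜欢", "学习", "中文"}

def max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    """Segment text by greedy longest dictionary match."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            # fall back to a single character if nothing matches
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# "我们喜欢学习中文" ("We like studying Chinese") has no spaces;
# segmentation recovers the two-character word couplets.
tokens = max_match("我们喜欢学习中文", DICTIONARY)
# tokens == ["我们", "喜欢", "学习", "中文"]
```

The sketch also shows why segmentation is fragile: any word missing from the dictionary degrades into single-character tokens, which is one reason the article calls accurate tokenization hard.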
Though such a process is possible for Asian language documents, an e-discovery platform needs a deeper understanding of each logographic character and of how and when to break these characters apart. But understanding such characters can be a tall task for any system, given the sheer diversity of writing systems any particular Asian language can have.
Different Systems of Writing
As an example, Chan pointed to how Japanese has three written language systems: hiragana, katakana and kanji. Hiragana and katakana, which are often used for foreign words, are syllabaries, meaning that like English, they have “phonetic writing systems where each character represents a syllable.” On the other hand, kanji “is a logographic system that uses a lot of characters common to written Chinese,” she said.
Likewise, Chinese, of which Mandarin is just one of many dialects, has both traditional and simplified writing systems, each with its own nuances.
What’s more, Asian languages can also have informal writing systems that use solely Western characters. “Some Japanese text is written in ‘Romaji,’ where the Roman alphabet is used to write Japanese. As a result, some platforms may not recognize a text as being written in Japanese,” Chan said.
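Distinguishing these writing systems can be sketched by checking each character against the standard Unicode blocks for hiragana, katakana and CJK ideographs. The classifier below is an illustrative assumption, not any platform’s actual language-detection logic:

```python
# Sketch of script detection for Japanese text by Unicode block,
# a first step a platform needs before choosing how to tokenize.

def classify_char(ch: str) -> str:
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"   # CJK Unified Ideographs, shared with Chinese
    if ch.isascii() and ch.isalpha():
        return "romaji"  # Latin letters, e.g. romanized Japanese
    return "other"

# 日本語 (kanji) + の (hiragana) + テスト (katakana):
# one short phrase mixes three writing systems.
sample = "日本語のテスト"
scripts = {classify_char(c) for c in sample}
# scripts == {"kanji", "hiragana", "katakana"}
```

Note the ambiguity the article describes: a kanji-range hit alone cannot tell Japanese from Chinese, and pure Romaji text looks to this check like ordinary Latin-alphabet prose.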
For other languages like Chinese and Vietnamese, it is also common for people to “type out the phonetic sound of characters” using the English alphabet, as opposed to writing the characters themselves, said Meesun Yang, associate general counsel and vice president of discovery services at FRONTEO.
But phonetic writing can run into ambiguity problems if it is not written with accompanying tone marks. When written phonetically, different words in languages like Chinese or Vietnamese may look identical on paper, yet when spoken they carry different tones and vastly different meanings, Yang said.
“For example, the word ‘four’ in [Mandarin] Chinese, if you type it out [phonetically], it’s ‘si.’ That same ‘si’ spoken with one specific intonation means ‘death,’ but if you say it with a different intonation it means ‘four.’”
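The collision Yang describes can be shown concretely: once the tone marks are stripped from pinyin, distinct Mandarin words collapse onto the same Latin spelling. The two-entry lexicon below is an illustrative assumption:

```python
# Demonstrates tonal ambiguity in toneless phonetic (pinyin) text:
# removing combining tone marks makes distinct words identical.

import unicodedata

LEXICON = {
    "sì": "four (四)",
    "sǐ": "death (死)",
}

def strip_tones(pinyin: str) -> str:
    """Remove combining tone marks, e.g. 'sì' -> 'si'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Both entries collapse to the same toneless string "si", so a
# search over phonetic text cannot tell them apart without tones.
collisions = {strip_tones(word) for word in LEXICON}
# collisions == {"si"}
```

This is why a keyword hit on toneless phonetic text is inherently noisy: the platform cannot know which of several homographs the writer meant without surrounding context.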
Machine Language Classes
To be sure, e-discovery platforms that use machine learning are capable of obtaining a deeper understanding of these various Asian language writing systems. But to date, many such AI systems “have been developed with the English language set, with the English alphabet and in the way that Romance languages are structured,” Yang said.
Suffice it to say, the efforts to teach AI systems certain Asian languages have only recently begun in earnest. “Asia is only just beginning to use AI for e-discovery projects,” Chan said. She explained that most of these early efforts require “an experienced human reviewer” to train the system to “flag documents according to a particular set of criteria,” such as a list of suspected keywords.
After establishing a baseline for which AI can understand these languages, these reviewers must then “look [out for] documents containing colloquialisms or other ambiguous language that requires further human review to improve clarity and understanding,” she added.
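The keyword-driven first pass Chan describes can be sketched as a simple term-list flagger, with anything ambiguous routed onward to human review. The terms and documents below are hypothetical examples, not an actual training workflow:

```python
# Minimal sketch of a reviewer-supplied keyword first pass:
# flag documents that contain any suspect term from the list.

SUSPECT_TERMS = ["回扣", "kickback", "off the books"]  # 回扣 = "kickback"

def flag_documents(docs: dict) -> dict:
    """Return doc_id -> list of matched suspect terms."""
    flags = {}
    for doc_id, text in docs.items():
        lowered = text.lower()
        flags[doc_id] = [t for t in SUSPECT_TERMS if t.lower() in lowered]
    return flags

docs = {
    "doc1": "请把回扣的事情保密",       # contains a suspect term
    "doc2": "Quarterly budget review",  # no suspect terms
}
hits = flag_documents(docs)
# hits["doc1"] == ["回扣"], hits["doc2"] == []
```

A pass like this can establish the baseline criteria, but as the article notes, colloquialisms and ambiguous phrasings that never match a literal term are exactly what still requires an experienced human reviewer.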
Such training, however, inevitably takes time, and reaching the needed level of accuracy can be a cumbersome and potentially expensive process. Yang noted that off-the-shelf machine translation tools are likely not up to the task. While these tools are “a good low-cost way to weed out anything that may be obviously non-relevant,” she said, “the accuracy for machine translation is still not great.”
“In terms of actually reviewing the content for substantive issues or privilege or [confidentiality], these are not tools a lot of companies or clients would rely on as the sole means of gathering important information,” Yang added.
Yet some legal technology companies, such as Relativity, Everlaw and Brainspace, are looking to change that perception. By leveraging pre-trained AI systems and novel technologies, they are hoping to make off-the-shelf machine translation software as accurate as customized AI tools that are trained in-house.
In addition, those beyond the legal technology industry are also looking to tackle the challenge of translation. The World Intellectual Property Organization (WIPO), a United Nations agency, for example, recently announced what it called a “groundbreaking” tool for translating technical patent documents. According to the WIPO, the tool uses machine learning software that was initially trained on Chinese, Korean and Japanese documents.