Managing e-discovery is complex enough in English, with cases such as Victor Stanley Inc. v. Creative Pipe Inc. emphasizing the dangers of failing to maintain and document defensible production methods. When a matter involves global business with electronically stored information (ESI) in multiple languages, the problems multiply. In-house counsel involved in multinational e-discovery must deal with a host of technical issues as well as legal complications in foreign jurisdictions.
E-discovery companies are scrambling to present solutions, establish overseas operations to meet demand and develop technologies to enable collection, processing, review and production of multilingual data as the need becomes more pressing. Multilingual e-discovery is even developing its own jargon, including LOTE to refer collectively to “languages other than English,” and CJK to refer to Chinese, Japanese and Korean, languages whose character sets pose distinct processing challenges.
“Foreign-language documents have become an integral part of the e-discovery landscape,” says attorney John Tredennick, CEO of Catalyst Repository Systems Inc., which provides secure data repositories. Tredennick estimates that more than 50 percent of the data his company processes now involves LOTE. “But treating foreign-language ESI as if it is just another component of the discovery process can get you in real trouble.”
Tredennick says improper collection of data in the EU can land you in jail. “Likewise, processing Chinese or Japanese e-mail using software not built for CJK languages will result in gibberish and a potential data spoliation claim,” he adds.
Tower of Babel
To appreciate the technical issues in multilingual e-discovery, the first step is to learn how computers recognize different language structures. The relatively small collection of letters, special characters and punctuation marks used in Indo-European languages such as Spanish, French and German makes computer processing relatively simple. The CJK languages, on the other hand, draw on tens of thousands of characters, and Chinese and Japanese text is normally written without spaces between words. Languages such as Hebrew and Arabic that read right to left pose other issues.
Until recently, there was no global standard for coding languages so that they can be recognized by computers. The American Standard Code for Information Interchange (ASCII) defines just 128 characters, and the 8-bit extensions built on it allow only 256 slots for letters, numbers, special characters and punctuation marks. This was sufficient for English and Western European languages but inadequate for many others. Cross-reference tables known as code pages, developed to allow ASCII-based computers to recognize more languages, are still used to store legacy data today. But because there is no universal set of code pages for all computers, text stored under one code page may be unreadable on a computer that assumes another.
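The code page problem is easy to reproduce. The following is a minimal sketch in Python, using a sample Japanese string chosen for illustration: the same stored bytes are read back under three code pages, and only the one that matches the original system returns the original text.

```python
def decode_with(raw: bytes, code_page: str) -> str:
    """Interpret raw bytes under the given code page.

    Bytes with no meaning in that code page become U+FFFD instead of
    raising an error, so the damage stays visible.
    """
    return raw.decode(code_page, errors="replace")


# Sample text as a Japanese legacy system might store it (Shift JIS).
ORIGINAL = "日本語"
RAW = ORIGINAL.encode("shift_jis")

# Read the identical bytes back under the correct code page and two wrong ones.
readings = {cp: decode_with(RAW, cp) for cp in ("shift_jis", "cp1252", "cp1251")}

for code_page, text in readings.items():
    print(f"{code_page:>10}: {text}")
```

Nothing in the bytes themselves records which code page produced them, which is why a collection should document the encoding of each legacy source.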
Fortunately, within the past decade computer hardware and software makers have adopted a global standard known as Unicode, which is usually stored in one of its Unicode Transformation Format (UTF) encodings such as UTF-8 or UTF-16. Unicode has the capacity to support more than 1.1 million characters, well beyond the 100,000 or so characters currently in use around the world.
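The capacity figure can be checked directly. The sketch below, in Python, prints the size of the Unicode code space and shows how many bytes a few sample characters occupy in the three common UTF encodings; the sample characters are chosen only for illustration.

```python
import sys

# Unicode code points run from U+0000 to U+10FFFF inclusive.
CODE_SPACE = sys.maxunicode + 1


def encoded_sizes(ch: str) -> dict:
    """Bytes needed to store one character in each UTF encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le avoids counting a byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),
    }


print(f"Code points available: {CODE_SPACE:,}")
for ch in ["A", "\u00e9", "中", "한"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The same character takes a different number of bytes in each encoding, so a processing system needs to know which UTF encoding a file uses, not merely that it is Unicode.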
But enough systems with code pages still exist that it’s important to determine whether all data that needs to be reviewed is maintained in a Unicode (UTF) encoding. If any of it remains stored on systems using code pages, the e-discovery processing system must be able to support both legacy code pages and UTF. The system also must be able to identify the beginnings and ends of words and sentences in CJK languages. Some search engines can separate overlapping words and phrases by their context, a process known as tokenization.
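To give a feel for what tokenization involves, here is a minimal sketch in Python of forward maximum matching, one simple dictionary-based technique. It is a stand-in for illustration and not the context-based method any particular search engine uses; the function name and the five-word lexicon are hypothetical.

```python
from __future__ import annotations


def max_match(text: str, lexicon: set[str]) -> list[str]:
    """Split unspaced text by greedily taking the longest lexicon entry."""
    longest = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the
        # lexicon; a single character is always accepted as a fallback.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical lexicon: "electronic", "mail", "e-mail", "evidence", "preservation".
LEXICON = {"电子", "邮件", "电子邮件", "证据", "保存"}

print(max_match("电子邮件证据保存", LEXICON))
```

Note the overlap the article describes: "电子邮件" (e-mail) contains both "电子" and "邮件" as words in their own right. The greedy rule picks the longest match, which is often but not always the reading a human would choose.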
The first stage of e-discovery, collection of the potentially discoverable ESI, is complicated by privacy laws. For example, the EU has strict privacy laws limiting the transfer of personal information across borders. But the EU and the U.S. have a “safe harbor” agreement allowing companies to transfer personal data out of the EU if they certify that they will provide adequate privacy protection. One solution is Web-enabled software that allows document collections hosted in one country to be securely accessed by legal teams elsewhere.
“The documents can be read online but they always remain stored on servers in the country of origin without being transferred, cached or downloaded to computers outside those countries,” says Ian Campbell, COO of iCONECT, a litigation support software developer.
In addition to ensuring that your e-discovery collection team is safe harbor-certified, review the workflow and technical data formats so that the collection tools used do not corrupt the data that is ultimately processed.
“When data is collected improperly, there often will be no way to salvage it when it comes time to process and review it, or if there is, the process can be extremely difficult and costly,” says Greg Neustaetter, senior product manager at Stratify, an e-discovery vendor.
Man vs. Machine
Once collection is completed, e-discovery specialists suggest identifying all the languages contained in the potentially discoverable data to ensure the appropriate software is used to sort and process it.
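A first pass at that identification can be sketched in Python by counting which writing systems appear in a document. This is only a rough proxy, since a script is not a language (Latin letters could be English, French or Vietnamese), and the function name and sample text are hypothetical. It relies on the fact that official Unicode character names begin with the script name.

```python
import unicodedata
from collections import Counter


def script_profile(text: str) -> Counter:
    """Count letters by the first word of their Unicode character name.

    For example, "LATIN SMALL LETTER A" counts as LATIN and
    "CJK UNIFIED IDEOGRAPH-4F1A" counts as CJK.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces and punctuation
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


# Hypothetical mixed-language e-mail line.
sample = "Meeting at noon. 会議は正午です。"
print(script_profile(sample).most_common())
```

A document showing both CJK and HIRAGANA is likely Japanese, while HANGUL points to Korean; a profile like this can route files to software built for those languages before any finer-grained language detection runs.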
“E-discovery filtering and processing is the systematic way of reducing a data set, converting the documents to a standard file format and gathering the metadata and extracted text for review,” says Michelle Lange, director of e-discovery at Kroll Ontrack.
Translation software is a cost-effective first step in processing, particularly where there is a large volume of data. Though not as accurate as human translation, the software is good enough to sort which e-mails discuss lunch plans and which go to the heart of the litigation.
“Compare $15 per page for human translation to 15 cents using translation software, and the difference in expense can be staggering,” says Tredennick.
After machine translation is used to eliminate irrelevant documents, human translators should take over to guarantee an accurate translation if English-speaking attorneys are viewing the remaining documents. Alternatively, attorneys with an understanding of the relevant languages and cultures may review the documents.
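The arithmetic behind this two-stage approach can be worked through in Python. The per-page rates come from Tredennick's comparison; the page count and the share of documents that survive machine triage are hypothetical figures chosen only to show the calculation.

```python
# Per-page rates from the quoted comparison, in cents to avoid rounding error.
HUMAN_CENTS_PER_PAGE = 1500    # $15.00
MACHINE_CENTS_PER_PAGE = 15    # $0.15


def human_only_cost(pages: int) -> int:
    """Cost in cents of sending every page to human translators."""
    return pages * HUMAN_CENTS_PER_PAGE


def two_stage_cost(pages: int, relevant_pages: int) -> int:
    """Cost in cents of machine-translating everything, then
    human-translating only the pages kept after triage."""
    return pages * MACHINE_CENTS_PER_PAGE + relevant_pages * HUMAN_CENTS_PER_PAGE


# Hypothetical matter: 100,000 pages, of which 5,000 survive triage.
PAGES = 100_000
RELEVANT = 5_000

print(f"Human only: ${human_only_cost(PAGES) / 100:,.2f}")
print(f"Two stage:  ${two_stage_cost(PAGES, RELEVANT) / 100:,.2f}")
```

Under these assumed figures the human-only route costs $1.5 million and the two-stage route $90,000; the savings shrink as the share of relevant pages grows.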
“It is important to select a human processing team familiar with a country’s practices and cultural differences when doing e-discovery review outside the U.S.,” says David Chaumette, a partner at Baker & McKenzie. The use of idiom and colloquialism in e-mails and text messages requires a review team familiar with the unique practices in each country, Chaumette adds.