Dr. Xuning (Michael) Tang, Chief Data Scientist at Vista Analytics.

The evolution of e-discovery technology may be driven by market demands, but it is built and designed by data scientists. These unsung developers and researchers are the nuts and bolts behind some of legal’s most relied upon technologies, such as those for contracts and technology assisted review (TAR).

In the age of artificial intelligence (AI), data scientists are becoming more pivotal than ever. Far from just predicting how AI’s machine learning capabilities will grow and mature in legal departments, data scientists are on the ground pushing and stretching the technology’s limits.

Legaltech News caught up with one such innovator, Xuning Tang, chief data scientist at Vista Analytics, where he focuses on advancing corporate e-discovery. Tang recently joined Vista Analytics after a long career working on a host of projects from fraud detection to knowledge management at Fannie Mae, Deloitte Advisory and Siemens Corporate Research.

Tang discussed with LTN the future of legal technology, common misconceptions about AI, and what inspired him to become a data scientist. Here are the highlights from the interview:

Plugged In

What new advancements can AI and machine learning bring to further corporate legal e-discovery?

Machine learning and AI can change corporate legal e-discovery in many ways. For example, humans have cognitive limits when processing and deriving insights from large-scale document sets; they simply cannot synthesize such large volumes of data successfully.

Predictive coding can solve this problem by using machine learning to filter relevant information for attorneys. However, filtering is only a starting point. It would be more revolutionary if machines could read documents, understand the contents, discover new facts and knowledge and form judgments via a given process.
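In practice, predictive coding of this kind is usually implemented as a supervised text classifier: attorneys code a seed set of documents, a model learns from those labels, and the remaining corpus is ranked by predicted relevance. A minimal illustrative sketch, using scikit-learn and entirely hypothetical documents, might look like:

```python
# Illustrative sketch of predictive coding as supervised relevance
# classification: attorneys code a seed set, a model learns from it,
# and unreviewed documents are ranked by predicted relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical attorney-coded seed set (1 = relevant, 0 = not relevant).
seed_docs = [
    "merger agreement draft between the two parties",
    "quarterly revenue projections for the merger deal",
    "office holiday party planning and catering menu",
    "fantasy football league standings and trash talk",
]
seed_labels = [1, 1, 0, 0]

# Vectorize the text and fit a simple relevance classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(seed_docs)
model = LogisticRegression().fit(X, seed_labels)

# Score the unreviewed corpus so attorneys see likely-relevant
# documents first.
corpus = [
    "attached is the revised merger agreement for review",
    "reminder: bring a dish to the holiday party",
]
scores = model.predict_proba(vectorizer.transform(corpus))[:, 1]
ranked = sorted(zip(corpus, scores), key=lambda pair: -pair[1])
```

Real TAR workflows iterate this loop (review, re-train, re-rank) until a defensible stopping point, but the core mechanism is the same filtering step described above.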

In recent years, new cognitive computing systems, such as IBM Watson, have shown promising results in knowledge discovery and question answering. New advancements in deep convolutional neural networks and deep recurrent neural networks also demonstrate capabilities in understanding the semantic meaning of documents and processing text streams. Marrying the productivity that machine learning and AI can bring to the process with human interpretation of the results will aid in achieving optimal conclusions. These advancements, along with others, will further change the way we do e-discovery today.

What types of e-discovery systems do you expect to be designing and building at Vista Analytics?

At Vista Analytics, we believe it is tremendously important to develop vertical AI solutions to solve complex e-discovery and data analytics problems, instead of delivering lower-level AI services that are easily commoditized. This addresses the misconception that all an industry needs to do is take something like Watson and point it at a problem to be solved. In reality, even with access to large sets of the proper data needed to solve a problem, systems need the right subject area experts to guide them successfully.

Powered by proprietary data, subject matter expertise, and machine learning models, we will be able to build vertical AI solutions to better meet client needs and answer questions that cannot be answered without AI.

What future advancements do you see for e-discovery on the horizon?

The data analysis process has been constantly revolutionized by technologies, including, but not limited to, machine learning. We see cloud-based software tools and data hosting making e-discovery solutions more customizable. We also foresee that big data products will bring in dramatic improvements in the handling of large volumes of data, and more importantly, data of different schema.

Due to advancements in the Internet of Things (IoT), new types of discoverable data will emerge, which will require current data analysis processes to adapt quickly. Hence, AI will become more tightly coupled with data analysis processes. Simple subtasks will be replaced by AI, while more complicated subtasks will be assisted by AI.

What is the one misconception you think people have about AI and machine learning?

A major misconception about AI and machine learning in many industries, including legal, is that it will replace all human involvement. While AI is well-suited to taking over many well-defined tasks, legal practice requires advanced cognitive abilities and problem-solving skills in environments of legal and factual uncertainty.

By combining the power of natural language processing, machine learning, and big data technology, we can progress from automating e-discovery tasks, to building predictive models for legal practice, to eventually developing better AI solutions for the legal industry. However, the need will still remain for highly-informed lawyers who can understand and interpret the issues of each case to take full advantage of the systems. The hope is to further empower these attorneys to concentrate on important issues and allow the AI systems to execute the often labor-intensive tasks that support them.

What drove you to become a data scientist?

The academic world is often more focused on theoretical research; however, applied research is more near and dear to my heart. My PhD dissertation was motivated by the knowledge management need to rapidly transform streaming data into knowledge and action.

As a data scientist, I can leverage a combination of machine learning toolkits, big data platforms, statistical software, and programming languages to solve business problems and generate revenue for companies. In addition, my academic and professional training in both machine learning theory and software engineering enables me to build end-to-end solutions and tackle the most challenging problems in this area.

What was the most challenging or rewarding project you have worked on over your career?

I previously worked on projects building proactive sensing solutions for prestigious car manufacturers, where the solution aims to detect potential safety issues in cars by using machine learning, natural language processing, and statistics to analyze large volumes of warranty claims, call center data, and government reports. The challenges of these projects were threefold: first, safety issues are rare events; second, the data is high in volume and complexity, and mostly unstructured or semi-structured; and third, the stakes are high if safety issues are not caught early enough. I look forward to extending my expertise and experience into new applications of equal or greater complexity.
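The rare-event challenge Tang describes is a classic class-imbalance problem: safety issues make up a tiny fraction of claims, so a naively trained classifier can score well while missing most of them. One common mitigation is to reweight the rare class during training. A minimal sketch with synthetic, hypothetical data (using scikit-learn's `class_weight` option, not Tang's actual system):

```python
# Sketch of the rare-event problem: safety issues are a tiny fraction
# of claims, so the classifier must be told not to ignore them.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = np.zeros(n, dtype=int)
y[:20] = 1          # only 2% of claims flag a safety issue
X[:20] += 2.0       # the rare class has a shifted feature signature

# class_weight="balanced" upweights the rare class in proportion
# to its scarcity, trading some false alarms for fewer misses.
plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Fraction of true safety events each model recovers (recall).
recall_plain = plain.predict(X[:20]).mean()
recall_balanced = balanced.predict(X[:20]).mean()
```

In a safety context the asymmetry of costs (a missed defect versus an extra manual review) is exactly why recall on the rare class, not overall accuracy, is the metric that matters.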