MetaPAD: Meta Pattern Discovery from Massive Text Corpora
Meng Jiang (University of Illinois at Urbana-Champaign); Jingbo Shang (University of Illinois at Urbana-Champaign); Taylor Cassidy (Army Research Lab); Xiang Ren (University of Illinois at Urbana-Champaign); Lance Kaplan (Army Research Lab); Timothy Hanratty (Army Research Lab); Jiawei Han (University of Illinois at Urbana-Champaign)
Abstract
Mining textual patterns in news, tweets, papers, and many other kinds of text corpora has been an active theme in text mining and NLP research. Previous studies adopt a dependency parsing-based pattern discovery approach. However, the parsing results lose the rich context around entities in the patterns, and the process is costly for a large-scale corpus. In this study, we propose a novel typed textual pattern structure, called a meta pattern, which is extended to a frequent, informative, and precise subsequence pattern in a certain context. We propose an efficient framework, called MetaPAD, which discovers meta patterns from massive corpora with three techniques: (1) it develops a context-aware segmentation method to carefully determine the boundaries of patterns with a learnt pattern quality assessment function, which avoids costly dependency parsing and generates high-quality patterns; (2) it identifies and groups synonymous meta patterns from multiple facets: their types, contexts, and extractions; and (3) it examines the type distributions of entities in the instances extracted by each group of patterns, and looks for appropriate type levels to make the discovered patterns precise. Experiments demonstrate that our proposed framework discovers high-quality typed textual patterns efficiently from different genres of massive corpora and facilitates information extraction.
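To make the notion of a typed subsequence pattern concrete, the following is a minimal toy sketch, not the MetaPAD algorithm. It assumes a small hand-written entity dictionary (`ENTITY_TYPES`, hypothetical) in place of a real entity typing step, and it uses plain n-gram counting in place of the context-aware segmentation and the learnt quality function. The function names and thresholds are illustrative assumptions.

```python
from collections import Counter

# Hypothetical toy typing dictionary; a real system would rely on an
# entity recognition and typing tool rather than a fixed lookup table.
ENTITY_TYPES = {
    "Barack Obama": "$PERSON",
    "Justin Trudeau": "$PERSON",
    "United States": "$COUNTRY",
    "Canada": "$COUNTRY",
}


def type_sentence(sentence, entity_types=ENTITY_TYPES):
    """Replace each known entity mention with its type token, then tokenize."""
    # Replace longer mentions first so multi-word names are handled whole.
    for mention in sorted(entity_types, key=len, reverse=True):
        sentence = sentence.replace(mention, entity_types[mention])
    return sentence.split()


def frequent_typed_ngrams(sentences, min_len=3, max_len=6, min_support=2):
    """Return contiguous n-grams with at least two type tokens that occur
    in at least `min_support` sentences (naive counting, no quality model)."""
    counts = Counter()
    for sentence in sentences:
        tokens = type_sentence(sentence)
        seen = set()  # count each n-gram at most once per sentence
        for n in range(min_len, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if sum(tok.startswith("$") for tok in gram) >= 2:
                    seen.add(gram)
        counts.update(seen)
    return {gram: c for gram, c in counts.items() if c >= min_support}


corpus = [
    "President Barack Obama of United States spoke today",
    "Prime Minister Justin Trudeau of Canada spoke today",
]
patterns = frequent_typed_ngrams(corpus)
for gram, support in sorted(patterns.items(), key=lambda kv: len(kv[0])):
    print(" ".join(gram), support)
```

On this toy corpus, frequency alone cannot tell the compact pattern `$PERSON of $COUNTRY` apart from overlong variants such as `$PERSON of $COUNTRY spoke today`, since all of them reach the same support. That ambiguity is the kind of boundary decision that a pattern quality assessment function is meant to resolve.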