Discovering Concepts Using Large Table Corpus

Keqian Li (University of California, Santa Barbara);Yeye He (Microsoft Research);Kris Ganjam (Microsoft Research)

Abstract

Existing work on knowledge discovery mostly uses natural language techniques to extract entities and relationships from textual documents. However, relational tables are now abundant and often contain clean, well-structured data values. So far, these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of extracting concept hierarchies from a large table corpus. Our method first iteratively groups values in the table corpus based on co-occurrence statistics to produce a candidate hierarchical tree. The tree is then summarized by selecting the nodes that best “describe” the original corpus, producing a small tree that captures the desired concept hierarchies and is easy for humans to understand and curate. We design our algorithms on map-reduce so that they scale to large table corpora. Experimental evaluation on a real enterprise table corpus shows that the proposed approach generates high-quality concepts.
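
As a rough illustration of the first step (iteratively grouping values by co-occurrence), the sketch below greedily merges the two groups whose values appear in the most similar sets of columns. It is a minimal single-machine sketch under stated assumptions, not the map-reduce algorithm of the paper: the function name `build_hierarchy`, the use of Jaccard similarity over column sets, and the nested-tuple tree representation are all hypothetical choices made for illustration.

```python
from itertools import combinations


def build_hierarchy(columns):
    """Greedily group values by column co-occurrence (illustrative only).

    `columns` is a list of table columns, each a list of string values.
    Returns a binary tree as nested tuples whose leaves are the values.
    """
    if not columns or not any(columns):
        raise ValueError("need at least one non-empty column")

    # Map each value to the set of column ids in which it appears.
    occurrences = {}
    for col_id, column in enumerate(columns):
        for value in set(column):
            occurrences.setdefault(value, set()).add(col_id)

    # Each cluster is (subtree, column_ids); start with one leaf per value.
    clusters = [(value, cols) for value, cols in sorted(occurrences.items())]

    def jaccard(a, b):
        # Assumed similarity measure: overlap of the column sets.
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    while len(clusters) > 1:
        # Pick the pair of clusters whose column sets overlap the most.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] | clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters[0][0]


if __name__ == "__main__":
    toy_corpus = [
        ["Seattle", "Boston"],
        ["Boston", "Seattle", "Denver"],
        ["Java", "Python"],
        ["Python", "Java"],
    ]
    print(build_hierarchy(toy_corpus))
```

On this toy corpus the city names end up in one subtree and the programming-language names in another. The summarization step described above, which selects a small set of nodes from the candidate tree, is not covered by this sketch.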