ACM SIGKDD dissertation awards recognize outstanding work done by graduate students in the areas of data science, machine learning and data mining.
- Relevance of the Dissertation to KDD
- Originality of the Main Ideas in the Dissertation
- Significance of Scientific Contributions
- Technical Depth and Soundness of Dissertation (including experimental methodologies, theoretical results, etc.)
- Overall Presentation and Readability of Dissertation (including organization, writing style and exposition, etc.)
SIGKDD Dissertation Awards (1 winner, 2 runners-up, and 1 honorable mention)
Mining Entity and Relation Structures from Text: An Effort-Light Approach
Xiang Ren (Student) and Jiawei Han (Advisor) at the University of Illinois at Urbana-Champaign, USA
Abstract: In today's computerized and information-based society, text data is rich but often also “messy”. We are inundated with vast amounts of text data, written in different genres (from grammatical news articles and scientific papers to noisy social media posts), covering topics in various domains (e.g., medical records, corporate reports, and legal acts). Can computational systems automatically identify various real-world entities mentioned in a new corpus and use them to summarize recent news events reliably? Can computational systems capture and represent different relations between biomedical entities from massive and rapidly emerging life science literature? How might computational systems represent the factual information contained in a collection of medical reports to support answering detailed queries or running data mining tasks?
While people can easily access the documents in a gigantic collection with the help of data management systems, they struggle to gain insights from such a large volume of text data: document understanding calls for in-depth content analysis, content analysis itself may require domain-specific knowledge, and over a large corpus, a complete read and analysis by domain experts will invariably be subjective, time-consuming and relatively costly. To turn such massive, unstructured text corpora into machine-readable knowledge, one of the grand challenges is to gain an understanding of the typed entity and relation structures in the corpus. This thesis focuses on developing principled and scalable methods for extracting typed entities and relationships with light human annotation effort, to overcome the barriers in dealing with text corpora of various domains, genres and languages. In addition to our effort-light methodologies, we also contribute effective, noise-robust models and real-world applications for two main problems:
Identifying Typed Entities: We show how to perform data-driven text segmentation to recognize entities mentioned in text as well as their surrounding relational phrases, and how to infer types for entity mentions by propagating “distant supervision” (from external knowledge bases) via relational phrases. To resolve the data sparsity issue during propagation, we complement the type propagation with clustering of functionally similar relational phrases based on their redundant occurrences in a large corpus. Apart from entity recognition and coarse-grained typing, we claim that fine-grained entity typing is beneficial for many downstream applications and very challenging due to the context-agnostic label assignment in distant supervision, and we present principled, efficient models and algorithms for inferring a fine-grained type path for each entity mention based on the sentence context.
Extracting Typed Entity Relationships: We extend the idea of entity recognition and typing to extract relationships between entity mentions and infer their relation types. We show how to effectively model the noisy distant supervision for relationship extraction, and how to avoid the error propagation that usually occurs in an incremental extraction pipeline by integrating the typing of entities and relationships in a principled framework. The proposed approach leverages noisy distant supervision for both entities and relationships, and simultaneously learns to uncover the most confident labels and to model the semantic similarity between true labels and text features.
Probabilistic Graphical Models for Credibility Analysis in Evolving Online Communities
Subhabrata Mukherjee (Student) and Gerhard Weikum (Advisor) at Max Planck Institute, Germany
Abstract: One of the major hurdles preventing the full exploitation of information from online communities is the widespread concern regarding the quality and credibility of user-contributed content. Prior works in this domain operate on a static snapshot of the community, make strong assumptions about the structure of the data (e.g., relational tables), or consider only shallow features for text classification.
To address the above limitations, we propose probabilistic graphical models that can leverage the joint interplay between multiple factors in online communities (such as user interactions, community dynamics, and textual content) to automatically assess the credibility of user-contributed online content, and the expertise of users and their evolution, with user-interpretable explanations. To this end, we devise new models based on Conditional Random Fields for different settings, such as incorporating partial expert knowledge for semi-supervised learning, and handling discrete labels as well as numeric ratings for fine-grained analysis. This enables applications such as extracting reliable side-effects of drugs from user-contributed posts in health forums, and identifying credible content in news communities.
Online communities are dynamic, as users join and leave, adapt to evolving trends, and mature over time. To capture these dynamics, we propose generative models based on Hidden Markov Models, Latent Dirichlet Allocation, and Brownian Motion to trace the continuous evolution of user expertise and their language model over time. This allows us to identify expert users and credible content jointly over time, improving state-of-the-art recommender systems by explicitly considering the maturity of users. This also enables applications such as identifying helpful product reviews, and detecting fake and anomalous reviews with limited information.
Characterization and Detection of Malicious Behavior on the Web
Srijan Kumar (Student) and V.S. Subrahmanian (Advisor) at the University of Maryland, USA
Abstract: Web platforms enable unprecedented speed and ease in transmission of knowledge, and allow users to communicate and shape opinions. However, the safety, usability and reliability of these platforms are compromised by the prevalence of online malicious behavior; for example, 40% of users have experienced online harassment. This behavior is present in the form of malicious users, such as trolls, sockpuppets and vandals, and misinformation, such as hoaxes and fraudulent reviews. This thesis presents research spanning two aspects of malicious behavior: characterization of its behavioral properties, and development of algorithms and models for detecting it.
We characterize the behavior of malicious users and misinformation in terms of their activity, temporal frequency of actions, network connections to other entities, linguistic properties of how they write, and community feedback received from others. We find several striking characteristics of malicious behavior that are very distinct from those of benign behavior. For instance, we find that vandals and fraudulent reviewers act faster than benign editors and reviewers, respectively. Hoax articles are long pieces of plain text that are less coherent and created by more recent editors, compared to non-hoax articles. We find that sockpuppets vary in their deceptiveness (i.e., whether they pretend to be different users) and their supportiveness (i.e., whether they support arguments of other sockpuppets controlled by the same user).
We create a suite of feature-based and graph-based algorithms to efficiently distinguish malicious from benign behavior. First, we create the first vandal early warning system that accurately predicts vandals using very few edits. Next, based on the properties of Wikipedia articles, we develop a supervised machine learning classifier to predict whether an article is a hoax, and another that predicts whether a pair of accounts belongs to the same user, both with very high accuracy. We develop a graph-based decluttering algorithm that iteratively removes suspicious edges that malicious users use to masquerade as benign users, and that outperforms existing graph algorithms in detecting trolls. Finally, we develop an efficient graph-based algorithm to assess the fairness of all reviewers, reliability of all ratings, and goodness of all products, simultaneously, in a rating network, and incorporate penalties for suspicious behavior.
Overall, in this thesis, we develop a suite of five models and algorithms to accurately identify and predict several distinct types of malicious behavior (namely, vandals, hoaxes, sockpuppets, trolls, and fraudulent reviewers) on multiple web platforms. The analysis leading to the algorithms develops an interpretable understanding of malicious behavior on the web.
Congratulations to all the outstanding students who were nominated and to this year's winners.
2018 KDD Dissertation Award committee: