Mining Rich Data Types

Curated by: Huan Liu

The first issue in data mining and knowledge discovery is handling data properly, which requires taking different data types into account. Rich data types fall into two categories: non-dependency data and dependency data. Non-dependency data is the most commonly encountered type and refers to data without specified dependencies between data instances. In other words, data instances are, or are assumed to be, independent and identically distributed. Examples of non-dependency data include multidimensional data, text data, and image data. In practice, data can be more complex, with dependencies between data instances. Dependency data can be correlated through temporal, spatial, sequential, and social relationships; examples include time-series, sequence, graph, multimedia, and social-media data.

Publications

Non-Dependency Data

1. Text

· Jiawei Han, Heng Ji, and Yizhou Sun. “Successful Data Mining Methods for NLP.” ACL-IJCNLP 2015 (2015). [Tutorial]

2. Image

· Foundations and Trends® in Computer Graphics and Vision, Now Publishers Inc. 2015. http://www.nowpublishers.com/CGV/ [Book chapters]

Dependency Data

3. Time Series Data

o Keogh, Eamonn. “Machine Learning in Time Series Databases (and Everything Is a Time Series).” AAAI’10. http://www.cs.ucr.edu/~eamonn/tutorials.html [Tutorial]

4. Sequence Data

o Mabroukeh, Nizar R., and Christie I. Ezeife. “A taxonomy of sequential pattern mining algorithms.” ACM Computing Surveys (CSUR) 43.1 (2010): 3. [Survey]

5. Dynamic/Streaming Data

o Hans-Peter Kriegel, Irene Ntoutsi, Myra Spiliopoulou, Grigorios Tsoumakas, and Arthur Zimek. “Mining Complex Dynamic Data.” ECML-PKDD 2011. [Tutorial]

6. Graph/Network Data

o Getoor, Lise, and Christopher P. Diehl. “Link Mining: a Survey.” ACM SIGKDD Explorations Newsletter 7.2 (2005): 3-12. [Survey]

o Shamanth Kumar, Fred Morstatter, and Huan Liu. “Analyzing Twitter Data.” Twitter Data Analytics. Springer New York, 2014. 35-48.

7. Social Data

o Mohammad Ali Abbasi, Huan Liu, and Reza Zafarani. Social Media Mining: Fundamental Issues and Challenges. ICDM’13 [Tutorial] http://ecs.syr.edu/faculty/reza/tutorials/ICDM13/TutorialICDM13SMM.pdf

o Jiebo Luo and Tao Mei. Social Multimedia as Sensors. ICDM’14 [Tutorial] http://icdm2014.sfu.ca/program_tutorials.html

8. Spatial and Spatial-Temporal Data

o Aggarwal, Charu C. Chapter 16: Mining Spatial Data, Data mining: The textbook. Springer, 2015. [Book chapter]

9. Multimedia

o Deng, Li, and Dong Yu. “Deep Learning: Methods and Applications.” Foundations and Trends in Signal Processing 7.3-4 (2014). [Survey]

10. Multi-modality

o Sun, Shiliang. “A survey of multi-view machine learning.” Neural Computing and Applications 23.7-8 (2013): 2031-2038. [Survey]

Publicly Available Resources

Text Data

o New York Times Annotated Corpus https://catalog.ldc.upenn.edu/LDC2008T19

o 20 Newsgroups Dataset http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html

o UCI Reuters-21578 Text Categorization Collection

Image Data

o ImageNet http://image-net.org/

Time Series Data

o TREC 2013/2014 Temporal Summarization http://trec.nist.gov/data/tempsumm.html

o UCI Machine Learning Repository (UCI) Synthetic Control Chart Time Series

Sequence Data

o UCI Molecular Biology

Dynamic/Streaming Data

o UCI Synthetic Control Chart Time Series & Pseudo Periodic Synthetic Time Series

Graph/Network Data

o AMiner Citation Network Dataset https://aminer.org/citation

o Stanford Large Network Dataset Collection https://snap.stanford.edu/data/

Social Data

o Social Computing Data Repository at ASU http://socialcomputing.asu.edu/pages/datasets

o MIRFlickr Retrieval Evaluation Dataset http://press.liacs.nl/mirflickr/

Spatial Data

o GDELT Project http://gdeltproject.org/

o UCI Connect-4

Spatio-Temporal Data

o Microsoft Urban Computing Dataset http://research.microsoft.com/en-us/people/yuzheng/#Datasets

o UCI El Nino

Video Data

o TRECVID ’01-’15 http://trecvid.nist.gov/

Audio Data

o Aurora: TIMIT with noise and additional information http://aurora.hsnr.de/index-2.html

o TIMIT Speech Corpus http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1

Multi-Modality Data

o UCSD SVCL Cross Modal Dataset http://www.svcl.ucsd.edu/projects/crossmodal/

Related KDD2016 Papers

Title & Authors
Beyond Sigmoids: the NetTide Model for Social Network Growth, and its Applications
Author(s): Chengxi Zang*, Tsinghua University; Peng Cui, Tsinghua University; Christos Faloutsos, Carnegie Mellon University
Recurrent Marked Temporal Point Processes: Embedding Event History to Vector
Author(s): Nan Du*, Georgia Tech; Hanjun Dai, ; Rakshit Trivedi, ; Utkarsh Upadhyay, Max Planck Institute; Manuel Gomez-Rodriguez, MPI-SWS; Le Song,
Absolute Fused Lasso and Its Application to Genome-Wide Association Studies
Author(s): Tao Yang*, Arizona State University; Jun Liu, SAS Institute Inc.; Pinghua Gong, University of Michigan; Ruiwen Zhang, SAS Institute Inc.; Xiaotong Shen, University of Minnesota; Jieping Ye, University of Michigan at Ann Arbor
Diversified Temporal Subgraph Pattern Mining
Author(s): Yi Yang, Fudan University; Da Yan, CUHK; Huanhuan Wu, CUHK; James Cheng*, CUHK; Shuigeng Zhou, Fudan University; John C.S. Lui, The Chinese University of Hong Kong
Improving Survey Aggregation with Sparsely Represented Signals
Author(s): Tianlin Shi, Stanford University; Forest Agostinelli*, Univ of California - Irvine; Matthew Staib, MIT; David Wipf, Microsoft Research; Thomas Moscibroda, Microsoft Research
A Subsequence Interleaving Model for Sequential Pattern Mining
Author(s): Jaroslav Fowkes*, University of Edinburgh; Charles Sutton, University of Edinburgh
Unbounded Human Learning: Optimal Scheduling for Spaced Repetition
Author(s): Siddharth Reddy*, Cornell University; Igor Labutov, Cornell University; Siddhartha Banerjee, Cornell University; Thorsten Joachims, Cornell University
Topic Modeling of Short Texts: A Pseudo-Document View
Author(s): Yuan Zuo*, Beihang University; Junjie Wu, ; Hao Lin, ; Hui Xiong, Rutgers
FRAUDAR: Bounding Graph Fraud in the Face of Camouflage
Author(s): Bryan Hooi*, Carnegie Mellon University; Hyun Ah Song, Carnegie Mellon University; Alex Beutel, Carnegie Mellon University; Neil Shah, Carnegie Mellon University; Kijung Shin, Carnegie Mellon University; Christos Faloutsos, Carnegie Mellon University
Predicting Socio-Economic Indicators using News Events
Author(s): Sunandan Chakraborty*, NYU; Ashwin Venkataraman, New York University; Srikanth Jagabathula, New York University; Lakshminarayanan Subramanian, New York University
CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors
Author(s): Meng Jiang*, UIUC; Christos Faloutsos, Carnegie Mellon University; Jiawei Han, University of Illinois at Urbana-Champaign
Unified Point-of-Interest Recommendation with Temporal Interval Assessment
Author(s): Yanchi Liu*, Rutgers University; Chuanren Liu, Drexel University; Bin Liu, Rutgers University; Meng Qu, Rutgers University; Hui Xiong, Rutgers
Asymmetric Transitivity Preserving Graph Embedding
Author(s): Mingdong Ou*, Tsinghua University; Peng Cui, Tsinghua University; Jian Pei, Simon Fraser University; Wenwu Zhu, Tsinghua University
Predicting Matchups and Preferences in Context
Author(s): Shuo Chen*, Cornell; Thorsten Joachims, Cornell University
Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding
Author(s): Xiang Ren*, UIUC; Wenqi He, UIUC; Meng Qu, UIUC; Heng Ji, PRI; Clare Voss, ARL; Jiawei Han, University of Illinois at Urbana-Champaign
A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm
Author(s): Yanni Li, Xidian University; Hui Li*, Xidian University; Tihua Duan, Shanghai Finance University; Sheng Wang, Coventry University; Zhi Wang, Xidian University; Yang Cheng, Xidian University
Latent Space Model for Road Networks to Predict Time-Varying Traffic
Author(s): Dingxiong Deng*, USC; Cyrus Shahabi, USC; Ugur Demiryurek, ; Linhong Zhu, ; Rose Yu, University of Southern Cal; Yan Liu,
GMove: Group-Level Mobility Modeling using Geo-Tagged Social Media
Author(s): Chao Zhang*, UIUC; Keyang Zhang, ; Quan Yuan, University of Illinois Urbana-; Luming Zhang, ; Tim Hanratty, ; Jiawei Han, University of Illinois at Urbana-Champaign
Structural Deep Network Embedding
Author(s): Daixin Wang*, Tsinghua University; Peng Cui, Tsinghua University; Wenwu Zhu, Tsinghua University
Graph Wavelets via Sparse Cuts
Author(s): Arlei Lopes da Silva*, UC, Santa Barbara; Xuan-Hong Dang, UCSB; Prithwish Basu, Raytheon BBN; Ambuj Singh, UCSB; Ananthram Swami, Army Lab
Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics
Author(s): Hao Wu, Fudan University; Jiangyun Mao, Fudan University; Weiwei Sun*, Fudan University; Baihua Zheng, Singapore Management University; Hanyuan Zhang, Fudan University; Ziyang Chen, Fudan University; Wei Wang, Fudan University
Squish: Near-Optimal Compression for Archival of Relational Datasets
Author(s): Yihan Gao*, University of Illinois; Aditya Parameswaran,
Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences
Author(s): Yasuko Matsubara*, Kumamoto University; Yasushi Sakurai, Kumamoto University
Point-of-Interest Recommendations: Learning Potential Check-ins from Friends
Author(s): Yong Ge, UNC Charlotte; Huayu Li*, University of North Carolina a; Hengshu Zhu, Baidu Inc.
Structural Neighborhood based Classification of Nodes in a Network
Author(s): Sharad Nandanwar*, Indian Institute of Science; Musti Narasimha Murty, Indian Institute of Science
QUINT: On Query-Specific Optimal Networks
Author(s): Liangyue Li*, Arizona State University; Yuan Yao, Nanjing University; Jie Tang, Tsinghua University; Wei Fan, Baidu; Hanghang Tong, Arizona State University
DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks
Author(s): Shuangfei Zhai*, Binghamton University; Keng-hao Chang, Microsoft; Ruofei Zhang, Microsoft; Zhongfei Zhang,
Dynamics of Large Multi-View Social Networks: Synergy, Cannibalization and Cross-View Interplay
Author(s): Yu Shi*, UIUC; Myunghwan Kim, LinkedIn Corporation; Shaunak Chatterjee, LinkedIn Corporation; Mitul Tiwari, LinkedIn Corporation; Souvik Ghosh, LinkedIn; Romer Rosales, LinkedIn
Efficient Processing of Network Proximity Queries via Chebyshev Acceleration
Author(s): Mustafa Coskun*, Case Western University; Ananth Grama, ; Mehmet Koyuturk,
Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns
Author(s): Roel Bertens*, Universiteit Utrecht; Jilles Vreeken, Max-Planck Institute for Informatics and Saarland University; Arno Siebes,
Transfer Knowledge between Cities
Author(s): Ying Wei*, Hong Kong Univ. of Sci. & Tech; Yu Zheng, Microsoft Research; Qiang Yang, HKUST
Burstiness Scale: a highly parsimonious model for characterizing random series of events
Author(s): Rodrigo Alves*, CEFET-MG; Renato Assunção, DCC-UFMG; Pedro O.S. Vaz de Melo, DCC-UFMG
Compact and Scalable Graph Neighborhood Sketching
Author(s): Takuya Akiba*, NII; Yosuke Yano, National Institute of Informatics
Finding Gangs in War from Signed Networks
Author(s): Lingyang Chu*, Simon Fraser University; Zhefeng Wang, University of Science and Technology of China; Jian Pei, Simon Fraser University; Jiannan Wang, Simon Fraser University; Zijin Zhao, Simon Fraser University; Enhong Chen,
Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detection
Author(s): Hao Hu, University of Central Florida; Joey Velez-Ginorio, University of Central Florida; Guojun Qi*, University of Central Florida
Mining Subgroups with Exceptional Transition Behavior
Author(s): Florian Lemmerich*, Gesis; Martin Becker, University of Würzburg; Philipp Singer, Gesis; Denis Helic, TU Graz; Andreas Hotho, University of Wuerzburg; Markus Strohmaier,
Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations
Author(s): Wei Cheng*, NEC Labs America; Kai Zhang, NEC labs America; Haifeng Chen, NEC Research Lab; Guofei Jiang, NEC labs America; Wei Wang, UC Los Angeles
Multi-layer Representation Learning for Medical Concepts
Author(s): Edward Choi*, Georgia Institute of Technology; Mohammad Bahadori, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology
Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data
Author(s): Payam Siyari*, Georgia Institute of Technology; Bistra Dilkina, Georgia Tech; Constantine Dovrolis, Georgia Institute of Technology
FINAL: Fast Attributed Network Alignment
Author(s): Si Zhang*, Arizona State University; Hanghang Tong, Arizona State University
Efficient Shift-Invariant Dictionary Learning
Author(s): Guoqing Zheng*, Carnegie Mellon University; Yiming Yang, ; Jaime Carbonell,
Inferring Network Effects from Observational Data
Author(s): David Arbour*, University of Massachusetts Am; Dan Garant, University of Massachusetts Amherst; David Jensen, UMass Amherst
MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories
Author(s): Prithu Banerjee*, UBC; Pranali Yawalkar, IIT Madras; Sayan Ranu, IIT Madras
Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems
Author(s): Igor Melnyk*, University of Minnesota; Arindam Banerjee, University of Minnesota; Bryan Matthews, NASA Ames Research Center; Nikunj Oza, NASA Ames Research Center
PTE: Enumerating Trillion Triangles On Distributed Systems
Author(s): Ha-Myung Park*, KAIST; Sung-Hyon Myaeng, KAIST; U Kang, Seoul National University
Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs
Author(s): Emaad Manzoor, Stony Brook University; Leman Akoglu*, SUNY Stony Brook
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages
Author(s): Matteo Riondato*, Two Sigma Investments; Eli Upfal, Brown University
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size
Author(s): Lorenzo De Stefani*, Brown University; Alessandro Epasto, Brown; Matteo Riondato, Two Sigma Investments; Eli Upfal, Brown University
Distributing the Stochastic Gradient Sampler for Large-Scale LDA
Author(s): Yuan Yang*, Beihang University; Jianfei Chen, Tsinghua University; Jun Zhu,
City-Scale Map Creation and Updating using GPS Collections
Author(s): Chen Chen*, Stanford University; Cewu Lu, Stanford University; Qixing Huang, Stanford University; Dimitrios Gunopulos, ; Leonidas Guibas, Stanford University; Qiang Yang, HKUST
Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective
Author(s): Tong Xu*, USTC; Hengshu Zhu, Baidu Inc.; Xiangyu Zhao, USTC; Hao Zhong, Rutgers University; Qi Liu, University of Science and Technology of China; Enhong Chen, ; Hui Xiong, Rutgers
Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization
Author(s): Junming Liu, Rutgers University; Leilei Sun, ; Hui Xiong*, Rutgers; Weiwei Chen,
Data-driven Automatic Treatment Regimen Development and Recommendation
Author(s): Leilei Sun*, Dalian University of Technology; Chuanren Liu, Drexel University; Chonghui Guo, ; Hui Xiong, Rutgers; Yanming Xie,
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning
Author(s): Yue Ning*, Virginia Tech; Sathappan Muthiah, Virginia Tech; Huzefa Rangwala, George Mason University; Naren Ramakrishnan, Virginia Tech
Smart Reply: Automated Response Suggestion for Email
Author(s): Anjuli Kannan, ; Karol Kurach*, Google; Sujith Ravi, Google; Tobias Kaufmann, Google, Inc.; Andrew Tomkins, ; Balint Miklos, Google, Inc.; Greg Corrado, ; László Lukács, ; Marina Ganea, ; Peter Young, ; Vivek Ramavajjala