Mining Rich Data Types
Curated by: Huan Liu
The first issue in data mining and knowledge discovery is handling data properly, which requires taking different data types into account. Rich data types fall into two categories: non-dependency data and dependency data. Non-dependency data is the most commonly encountered type and refers to data with no specified dependencies between data instances. In other words, data instances are, or are assumed to be, independent and identically distributed. Examples of non-dependency data include multidimensional data, text data, and image data. In practice, data can be more complex, with dependencies between data instances. These dependencies can arise from temporal, spatial, sequential, or social relationships, as in time-series, sequence, graph, multimedia, and social media data.
Publications
Non-Dependency Data
1. Text
o Jiawei Han, Heng Ji, and Yizhou Sun. “Successful Data Mining Methods for NLP.” ACL-IJCNLP 2015 (2015). [Tutorial]
2. Image
o Foundations and Trends® in Computer Graphics and Vision, Now Publishers Inc. 2015. http://www.nowpublishers.com/CGV/ [Book chapters]
Dependency Data
3. Time Series Data
o Keogh, Eamonn. “Machine Learning in Time Series Databases (and Everything Is a Time Series).” AAAI’10. http://www.cs.ucr.edu/~eamonn/tutorials.html [Tutorial]
4. Sequence Data
o Mabroukeh, Nizar R., and Christie I. Ezeife. “A taxonomy of sequential pattern mining algorithms.” ACM Computing Surveys (CSUR) 43.1 (2010): 3. [Survey]
5. Dynamic/Streaming Data
o Hans-Peter Kriegel, Irene Ntoutsi, Myra Spiliopoulou, Grigorios Tsoumakas, and Arthur Zimek. “Mining Complex Dynamic Data.” ECML-PKDD 2011. [Tutorial]
6. Graph/Network Data
o Getoor, Lise, and Christopher P. Diehl. “Link Mining: a Survey.” ACM SIGKDD Explorations Newsletter 7.2 (2005): 3-12. [Survey]
o Shamanth Kumar, Fred Morstatter, and Huan Liu. “Analyzing Twitter Data.” Twitter Data Analytics. Springer New York, 2014. 35-48.
7. Social Data
o Mohammad Ali Abbasi, Huan Liu, and Reza Zafarani. Social Media Mining: Fundamental Issues and Challenges. ICDM’13 [Tutorial] http://ecs.syr.edu/faculty/reza/tutorials/ICDM13/TutorialICDM13SMM.pdf
o Jiebo Luo and Tao Mei. Social Multimedia as Sensors. ICDM’14 [Tutorial] http://icdm2014.sfu.ca/program_tutorials.html
8. Spatial and Spatial-Temporal Data
o Aggarwal, Charu C. Chapter 16: Mining Spatial Data, Data mining: The textbook. Springer, 2015. [Book chapter]
9. Multimedia
o Deng, Li, and Dong Yu. “Deep Learning: Methods and Applications.” Foundations and Trends in Signal Processing 7.3-4 (2014): 197-387. [Survey]
10. Multi-modality
o Sun, Shiliang. “A survey of multi-view machine learning.” Neural Computing and Applications 23.7-8 (2013): 2031-2038. [Survey]
Publicly Available Resources
Text Data
o New York Times Annotated Corpus https://catalog.ldc.upenn.edu/LDC2008T19
o 20 Newsgroups Dataset http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
o UCI Reuters-21578 Text Categorization Collection
Image Data
o ImageNet http://image-net.org/
Time Series Data
o TREC 2013/2014 Temporal Summarization http://trec.nist.gov/data/tempsumm.html
o UCI Machine Learning Repository (UCI) Synthetic Control Chart Time Series
Sequence Data
o UCI Molecular Biology
Dynamic/Streaming Data
o UCI Synthetic Control Chart Time Series & Pseudo Periodic Synthetic Time Series
Graph/Network Data
o AMiner Citation Network Dataset https://aminer.org/citation
o Stanford Large Network Dataset Collection https://snap.stanford.edu/data/
Social Data
o Social Computing Data Repository at ASU http://socialcomputing.asu.edu/pages/datasets
o MIRFlickr Retrieval Evaluation Dataset http://press.liacs.nl/mirflickr/
Spatial Data
o GDELT Project http://gdeltproject.org/
o UCI Connect-4
Spatio-Temporal Data
o Microsoft Urban Computing Dataset http://research.microsoft.com/en-us/people/yuzheng/#Datasets
o UCI El Nino
Video Data
o TRECVID ’01-’15 http://trecvid.nist.gov/
Audio Data
o Aurora: TIMIT with noise and additional information http://aurora.hsnr.de/index-2.html
o TIMIT Speech Corpus http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1
Multi-Modality Data
o UCSD SVCL Cross Modal Dataset http://www.svcl.ucsd.edu/projects/crossmodal/
Related KDD2016 Papers
Title & Authors |
---|
CatchTartan: Representing and Summarizing Dynamic Multicontextual Behaviors Author(s): Meng Jiang*, UIUC; Christos Faloutsos, Carnegie Mellon University; Jiawei Han, University of Illinois at Urbana-Champaign |
Unified Point-of-Interest Recommendation with Temporal Interval Assessment Author(s): Yanchi Liu*, Rutgers University; Chuanren Liu, Drexel University; Bin Liu, Rutgers University; Meng Qu, Rutgers University; Hui Xiong, Rutgers |
Asymmetric Transitivity Preserving Graph Embedding Author(s): Mingdong Ou*, Tsinghua University; Peng Cui, Tsinghua University; Jian Pei, Simon Fraser University; Wenwu Zhu, Tsinghua University |
Predicting Matchups and Preferences in Context Author(s): Shuo Chen*, Cornell; Thorsten Joachims, Cornell University |
Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding Author(s): Xiang Ren*, UIUC; Wenqi He, UIUC; Meng Qu, UIUC; Heng Ji, RPI; Clare Voss, ARL; Jiawei Han, University of Illinois at Urbana-Champaign |
A Real Linear and Parallel Multiple Longest Common Subsequences (MLCS) Algorithm Author(s): Yanni Li, Xidian University; Hui Li*, Xidian University; Tihua Duan, Shanghai Finance University; Sheng Wang, Coventry University; Zhi Wang, Xidian University; Yang Cheng, Xidian University |
Latent Space Model for Road Networks to Predict Time-Varying Traffic Author(s): Dingxiong Deng*, USC; Cyrus Shahabi, USC; Ugur Demiryurek, ; Linhong Zhu, ; Rose Yu, University of Southern Cal; Yan Liu, |
GMove: Group-Level Mobility Modeling using Geo-Tagged Social Media Author(s): Chao Zhang*, UIUC; Keyang Zhang, ; Quan Yuan, University of Illinois Urbana-; Luming Zhang, ; Tim Hanratty, ; Jiawei Han, University of Illinois at Urbana-Champaign |
Structural Deep Network Embedding Author(s): Daixin Wang*, Tsinghua University; Peng Cui, Tsinghua University; Wenwu Zhu, Tsinghua University |
Graph Wavelets via Sparse Cuts Author(s): Arlei Lopes da Silva*, UC, Santa Barbara; Xuan-Hong Dang, UCSB; Prithwish Basu, Raytheon BBN; Ambuj Singh, UCSB; Ananthram Swami, Army Lab |
Probabilistic Robust Route Recovery with Spatio-Temporal Dynamics Author(s): Hao Wu, Fudan University; Jiangyun Mao, Fudan University; Weiwei Sun*, Fudan University; Baihua Zheng, Singapore Management University; Hanyuan Zhang, Fudan University; Ziyang Chen, Fudan University; Wei Wang, Fudan University |
Squish: Near-Optimal Compression for Archival of Relational Datasets Author(s): Yihan Gao*, University of Illinois; Aditya Parameswaran, |
Beyond Sigmoids: the NetTide Model for Social Network Growth, and its Applications Author(s): Chengxi Zang*, Tsinghua University; Peng Cui, Tsinghua University; Christos Faloutsos, Carnegie Mellon University |
Recurrent Marked Temporal Point Processes: Embedding Event History to Vector Author(s): Nan Du*, Georgia Tech; Hanjun Dai, ; Rakshit Trivedi, ; Utkarsh Upadhyay, Max Planck Institute; Manuel Gomez-Rodriguez, MPI-SWS; Le Song, |
Absolute Fused Lasso and Its Application to Genome-Wide Association Studies Author(s): Tao Yang*, Arizona State University; Jun Liu, SAS Institute Inc.; Pinghua Gong, University of Michigan; Ruiwen Zhang, SAS Institute Inc.; Xiaotong Shen, University of Minnesota; Jieping Ye, University of Michigan at Ann Arbor |
Diversified Temporal Subgraph Pattern Mining Author(s): Yi Yang, Fudan University; Da Yan, CUHK; Huanhuan Wu, CUHK; James Cheng*, CUHK; Shuigeng Zhou, Fudan University; John C.S. Lui, The Chinese University of Hong Kong |
Improving Survey Aggregation with Sparsely Represented Signals Author(s): Tianlin Shi, Stanford University; Forest Agostinelli*, Univ of California - Irvine; Matthew Staib, MIT; David Wipf, Microsoft Research; Thomas Moscibroda, Microsoft Research |
A Subsequence Interleaving Model for Sequential Pattern Mining Author(s): Jaroslav Fowkes*, University of Edinburgh; Charles Sutton, University of Edinburgh |
Unbounded Human Learning: Optimal Scheduling for Spaced Repetition Author(s): Siddharth Reddy*, Cornell University; Igor Labutov, Cornell University; Siddhartha Banerjee, Cornell University; Thorsten Joachims, Cornell University |
Topic Modeling of Short Texts: A Pseudo-Document View Author(s): Yuan Zuo*, Beihang University; Junjie Wu, ; Has Lin, ; Hui Xiong, Rutgers |
FRAUDAR: Bounding Graph Fraud in the Face of Camouflage Author(s): Bryan Hooi*, Carnegie Mellon University; Hyun Ah Song, Carnegie Mellon University; Alex Beutel, Carnegie Mellon University; Neil Shah, Carnegie Mellon University; Kijung Shin, Carnegie Mellon University; Christos Faloutsos, Carnegie Mellon University |
Predicting Socio-Economic Indicators using News Events Author(s): Sunandan Chakraborty*, NYU; Ashwin Venkataraman, New York University; Srikanth Jagabathula, New York University; Lakshminarayanan Subramanian, New York University |
Compact and Scalable Graph Neighborhood Sketching Author(s): Takuya Akiba*, NII; Yosuke Yano, National Institute of Informatics |
Finding Gangs in War from Signed Networks Author(s): Lingyang Chu*, Simon Fraser University; Zhefeng Wang, University of Science and Technology of China; Jian Pei, Simon Fraser University; Jiannan Wang, Simon Fraser University; Zijin Zhao, Simon Fraser University; Enhong Chen, |
Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detectio Author(s): Hao Hu, University of Central Florida; Joey Velez-Ginorio, University of Central Florida; Guojun Qi*, University of Central Florida |
Mining Subgroups with Exceptional Transition Behavior Author(s): Florian Lemmerich*, Gesis; Martin Becker, University of Würzburg; Philipp Singer, Gesis; Denis Helic, TU Graz; Andreas Hotho, University of Wuerzburg; Markus Strohmaier, |
Ranking Causal Anomalies via Temporal and Dynamical Analysis on Vanishing Correlations Author(s): Wei Cheng*, NEC Labs America; Kai Zhang, NEC Labs America; Haifeng Chen, NEC Research Lab; Guofei Jiang, NEC Labs America; Wei Wang, UC Los Angeles |
Multi-layer Representation Learning for Medical Concepts Author(s): Edward Choi*, Georgia Institute of Technolog; Mohammad Bahadori, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology |
Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data Author(s): Payam Siyari*, Georgia Institute of Technology; Bistra Dilkina, Georgia Tech; Constantine Dovrolis, Georgia Institute of Technology |
FINAL: Fast Attributed Network Alignment Author(s): Si Zhang*, Arizona State University; Hanghang Tong, Arizona State University |
Efficient Shift-Invariant Dictionary Learning Author(s): Guoqing Zheng*, Carnegie Mellon University; Yiming Yang, ; Jaime Carbonell, |
Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences Author(s): Yasuko Matsubara*, Kumamoto University; Yasushi Sakurai, Kumamoto University |
Point-of-Interest Recommendations: Learning Potential Check-ins from Friends Author(s): Yong Ge, UNC Charlotte; Huayu Li*, University of North Carolina a; Hengshu Zhu, Baidu Inc. |
Structural Neighborhood based Classification of Nodes in a Network Author(s): Sharad Nandanwar*, Indian Institute of Science; Musti Narasimha Murty, Indian Institute of Science |
QUINT: On Query-Specific Optimal Networks Author(s): Liangyue Li*, Arizona State University; Yuan Yao, Nanjing University; Jie Tang, Tsinghua University; Wei Fan, Baidu; Hanghang Tong, Arizona State University |
DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks Author(s): Shuangfei Zhai*, Binghamton University; Keng-hao Chang, Microsoft; Ruofei Zhang, Microsoft; Zhongfei Zhang, |
Dynamics of Large Multi-View Social Networks: Synergy, Cannibalization and Cross-View Interplay Author(s): Yu Shi*, UIUC; Myunghwan Kim, LinkedIn Corporation; Shaunak Chatterjee, LinkedIn Corporation; Mitul Tiwari, LinkedIn Corporation; Souvik Ghosh, LinkedIn; Romer Rosales, LinkedIn |
Efficient Processing of Network Proximity Queries via Chebyshev Acceleration Author(s): Mustafa Coskun*, Case Western University; Ananth Grama, ; Mehmet Koyuturk, |
Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns Author(s): Roel Bertens*, Universiteit Utrecht; Jilles Vreeken, Max-Planck Institute for Informatics and Saarland University; Arno Siebes, |
Transfer Knowledge between Cities Author(s): Ying Wei*, Hong Kong Univ. of Sci. & Tech; Yu Zheng, Microsoft Research; Qiang Yang, HKUST |
Burstiness Scale: a highly parsimonious model for characterizing random series of events Author(s): Rodrigo Alves*, CEFET-MG; Renato Assunção, DCC-UFMG; Pedro O.S. Vaz de Melo, DCC-UFMG |
Distributing the Stochastic Gradient Sampler for Large-Scale LDA Author(s): Yuan Yang*, Beihang University; Jianfei Chen, Tsinghua University; Jun Zhu, |
City-Scale Map Creation and Updating using GPS Collections Author(s): Chen Chen*, Stanford University; Cewu Lu, Stanford University; Qixing Huang, Stanford University; Dimitrios Gunopulos, ; Leonidas Guibas, Stanford University; Qiang Yang, HKUST |
Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective Author(s): Tong Xu*, USTC; Hengshu Zhu, Baidu Inc.; Xiangyu Zhao, USTC; Hao Zhong, Rutgers University; Qi Liu, University of Science and Technology of China; Enhong Chen, ; Hui Xiong, Rutgers |
Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization Author(s): Junming Liu, Rutgers University; Leilei Sun, ; Hui Xiong*, Rutgers; Weiwei Chen, |
Data-driven Automatic Treatment Regimen Development and Recommendation Author(s): Leilei Sun*, Dalian University of Technolog; Chuanren Liu, Drexel University; Chonghui Guo, ; Hui Xiong, Rutgers; Yanming Xie, |
Inferring Network Effects from Observational Data Author(s): David Arbour*, University of Massachusetts Am; Dan Garant, University of Massachusetts Amherst; David Jensen, UMass Amherst |
MANTRA: A Scalable Approach to Mining Temporally Anomalous Sub-trajectories Author(s): Prithu Banerjee*, UBC; Pranali Yawalkar, IIT Madras; Sayan Ranu, IIT Madras |
Semi-Markov Switching Vector Autoregressive Model-based Anomaly Detection in Aviation Systems Author(s): Igor Melnyk*, University of Minnesota; Arindam Banerjee, University of Minnesota; Bryan Matthews, NASA Ames Research Center; Nikunj Oza, NASA Ames Research Center |
PTE: Enumerating Trillion Triangles On Distributed Systems Author(s): Ha-Myung Park*, KAIST; Sung-Hyon Myaeng, KAIST; U Kang, Seoul National University |
Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs Author(s): Emaad Manzoor, Stony Brook University; Leman Akoglu*, SUNY Stony Brook |
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages Author(s): Matteo Riondato*, Two Sigma Investments; Eli Upfal, Brown University |
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size Author(s): Lorenzo De Stefani*, Brown University; Alessandro Epasto, Brown; Matteo Riondato, Two Sigma Investments; Eli Upfal, Brown University |
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning Author(s): Yue Ning*, Virginia Tech; Sathappan Muthiah, Virginia Tech; Huzefa Rangwala, George Mason University; Naren Ramakrishnan, Virginia Tech |
Smart Reply: Automated Response Suggestion for Email Author(s): Anjuli Kannan, ; Karol Kurach*, Google; Sujith Ravi, Google; Tobias Kaufmann, Google, Inc.; Andrew Tomkins, ; Balint Miklos, Google, Inc.; Greg Corrado, ; László Lukács, ; Marina Ganea, ; Peter Young, ; Vivek Ramavajjala |