
Big Data

Curated by: H V Jagadish


The development of massively distributed computing infrastructures has changed the economics of data management and made it possible to apply sophisticated data distillation and learning methods to datasets of unprecedented scale, diversity, and freshness, a technical and social phenomenon that has been dubbed Big Data. The sheer size of the data is a major challenge, and the one most easily recognized, but it is not the only one. Industry analysis companies like to point out that there are challenges not just in Volume but also in Variety and Velocity [see the Gartner Group press release available at http://www.gartner.com/it/page.jsp?id=1731916], and that companies should not focus on just the first of these. Variety refers to heterogeneity of data types, representation, and semantic interpretation. Velocity denotes both the rate at which data arrive and the time frame in which they must be acted upon. While these three are important, the list is incomplete. Various parties have proposed additions, such as Veracity, and other concerns, such as privacy and usability, remain.

The analysis of Big Data is an iterative process that involves many distinct phases, each with its own challenges. An excellent overview is available in a community whitepaper hosted at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. A few dozen papers, chosen for their coverage and importance, have been collected at http://db.cs.pitt.edu/bigdata/resources.

The papers on this topic at this KDD conference do not disappoint in the breadth of questions asked, from the efficiency of algorithms to trust in results. Enjoy!


Related KDD2016 Papers

Title & Authors
Positive-Unlabeled Learning in Streaming Networks
Author(s): Shiyu Chang*, UIUC; Yang Zhang, UIUC; Jiliang Tang, Yahoo Labs; Dawei Yin, ; Yi Chang, Yahoo! Labs; Mark Hasegawa-Johnson, UIUC; Thomas Huang, UIUC
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size
Author(s): Lorenzo De Stefani*, Brown University; Alessandro Epasto, Brown; Matteo Riondato, Two Sigma Investments; Eli Upfal, Brown University
Learning Cumulatively to Become More Knowledgeable
Author(s): Geli Fei*, Univ of Illinois at Chicago; Shuai Wang, Univ of Illinois at Chicago; Bing Liu, Univ of Illinois at Chicago
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices
Author(s): Yasuo Tabei*, JST; Hiroto Saigo, Kyushu Institute of Technology; Yoshihiro Yamanishi, Kyushu University; Simon Puglisi, Helsinki University
Boosted Decision Tree Regression Adjustment for Variance Reduction of Online Controlled Experiments
Author(s): Alexey Poyarkov, Yandex; Alexey Drutsa*, Yandex; Andrey Khalyavin, Yandex; Gleb Gusev, Yandex; Pavel Serdyukov, Yandex
Robust Influence Maximization
Author(s): Wei Chen, Microsoft Research; Tian Lin*, Tsinghua University; Zihan Tan, IIIS, Tsinghua University; Mingfei Zhao, IIIS, Tsinghua University; Xuren Zhou, The Hong Kong University of Science and Technology
Lightweight Monitoring of Distributed Streams
Author(s): Daniel Keren*, University of Haifa; Assaf Schuster, Technion; Arnon Lazerson, Israeli Institute of Technology
Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test
Author(s): Denis Dos Reis*, Universidade de São Paulo; Gustavo Batista, Universidade de Sao Paulo at Sao Carlos; Peter Flach, University of Bristol; Stan Matwin, Dalhousie University
Scalable Betweenness Centrality Maximization via Sampling
Author(s): Ahmad Mahmoody*, Brown University; Eli Upfal, Brown University; Charalampos Tsourakakis, Harvard
Recruitment Market Trend Analysis with Sequential Latent Variable Models
Author(s): Chen Zhu*, Baidu HR; Hengshu Zhu, Baidu Inc.; Hui Xiong, Rutgers; Ding Pengliang, ; Xie Fang,
Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective
Author(s): Tong Xu*, USTC; Hengshu Zhu, Baidu Inc.; Xiangyu Zhao, USTC; Hao Zhong, Rutgers University; Qi Liu, University of Science and Technology of China; Enhong Chen, ; Hui Xiong, Rutgers
Distributing the Stochastic Gradient Sampler for Large-Scale LDA
Author(s): Yuan Yang*, Beihang University; Jianfei Chen, Tsinghua University; Jun Zhu,
Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining
Author(s): Kazuya Nakagawa, Nagoya Institute of Technology; Shinya Suzumura, Nagoya Institute of Technology; Masayuki Karasuyama, ; Koji Tsuda, University of Tokyo; Ichiro Takeuchi*, Nagoya Institute of Technology Japan
Singapore in Motion: insights on public transport service level through farecard and mobile data ana
Author(s): Hasan Poonawala*, IBM; Vinay Kolar, IBM; Sebastien Blandin, IBM; Laura Wynter, IBM; Sambit Sahu, IBM
Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned
Author(s): Xiaolin Shi*, Yahoo Labs; Alex Deng, Microsoft
Communication Efficient Distributed Kernel Principal Component Analysis
Author(s): Yingyu Liang*, Princeton University; Bo Xie, ; David Woodruff, IBM Research; Le Song, ; Maria-Florina Balcan,
Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences
Author(s): Yasuko Matsubara*, Kumamoto University; Yasushi Sakurai, Kumamoto University
Multi-layer Representation Learning for Medical Concepts
Author(s): Edward Choi*, Georgia Institute of Technolog; Mohammad Bahadori, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology
Fast Component Pursuit for Large-Scale Inverse Covariance Estimation
Author(s): Lei Han*, Rutgers University; Yu Zhang, Hong Kong University of Science and Technology; Tong Zhang, Rutgers University
Scalable Fast Rank-1 Dictionary Learning for fMRI Big Data Analysis
Author(s): Xiang Li*, The University of Georgia; Milad Makkie, ; Binbin Lin, ; Mojtaba Sedigh Fazli, ; Ian Davidson, University of California-Davis; Jieping Ye, ; Tianming Liu, ; Shannon Quinn,
Evaluating Mobile App Release
Author(s): Ya Xu*, LinkedIn Corporation; Nanyu Chen, LinkedIn Corporation
Revisiting Random Binning Feature: Fast Convergence and Strong Parallelizability
Author(s): Lingfei Wu*, College of William and Mary; En-Hsu Yen, University of Texas at Austin; Jie Chen, IBM Research; Rui Yan, Baidu Inc.
PTE: Enumerating Trillion Triangles On Distributed Systems
Author(s): Ha-Myung Park*, KAIST; Sung-Hyon Myaeng, KAIST; U Kang, Seoul National University
“Why Should I Trust you?” Explaining the Predictions of Any Classifier
Author(s): Marco Tulio Ribeiro*, University of Washington; Sameer Singh, University of Washington, Seattle; Carlos Guestrin, Dato/Univ of Washington
Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environm
Author(s): Wei-Lin Chiang, National Taiwan University; Mu-Chu Lee, National Taiwan University; Chih-Jen Lin*, National Taiwan University
Online Asymmetric Active Learning with Imbalanced Data
Author(s): Xiaoxuan Zhang*, University of Iowa; Tianbao Yang, Univ of Iowa; Padmini Srinivasan, University of Iowa
Deploying Analytics with the Portable Format for Analytics (PFA)
Author(s): Jim Pivarski, Open Data Group Inc.; Collin Bennett, Open Data Group Inc.; Robert Grossman*, University of Chicago
EMBERS at 4 years: Experiences operating an Open Source Indicators Forecasting System
Author(s): Sathappan Muthiah*, Virginia Tech; Naren Ramakrishnan, Virginia Tech; Patrick Butler, Virginia Tech; Rupinder Khandpur, Virginia Tech; Parang Saraf, Virginia Tech; Anil Vullikanti, Virginia Tech; Achla Marathe, Virginia Tech; Graham Katz, CACI; Andrew Doyle, CACI; Jaime Arredondo, UCSD; Dipak Gupta, SDSU; David Mares, UCSD; Jose Cadena, Virginia Tech; Liang Zhao, VT; Nathan Self, ; Alla Rozovskaya, Virginia Tech; Kristen Summers, IBM
Robust Large-Scale Machine Learning in the Cloud
Author(s): Steffen Rendle*, Google; Dennis Fetterly, Google, Inc.; Eugene Shekita, Google, Inc.; Bor-Yiing Su, Google, Inc.
Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs
Author(s): Emaad Manzoor, Stony Brook University; Leman Akoglu*, SUNY Stony Brook
FLASH: Fast Bayesian Optimization for Data Analytic Pipelines
Author(s): Yuyu Zhang*, Georgia Institute of Technolog; Mohammad Bahadori, Georgia Institute of Technology; Hang Su, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology
Scalable Pattern Matching over Compressed Graphs via Dedensification
Author(s): Antonio Maccioni*, Roma Tre University; Daniel Abadi, Yale University
Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix
Author(s): Huizhi Xie*, Netflix; Juliette Aurisset, Netflix
XGBoost: A Scalable Tree Boosting System
Author(s): Tianqi Chen*, University of Washington; Carlos Guestrin, Dato/Univ of Washington
Deep Visual-Semantic Hashing for Cross-Modal Retrieval
Author(s): Yue Cao, Tsinghua University; Mingsheng Long*, Tsinghua University; Jianmin Wang, Tsinghua University; Qiang Yang, HKUST; Philip Yu, UIC
Transfer Knowledge between Cities
Author(s): Ying Wei*, Hong Kong Univ. of Sci. & Tech; Yu Zheng, Microsoft Research; Qiang Yang, HKUST
Parallel Lasso Screening for Big Data Optimization
Author(s): Qingyang Li*, Arizona State University; Shuang Qiu, Umich; Shuiwang Ji, Washington State University; Jieping Ye, University of Michigan at Ann Arbor; Jie Wang, University of Michigan
Crime Rate Inference with Big Data
Author(s): Hongjian Wang*, Penn State University; Zhenhui Li, Penn State Univ; Daniel Kifer, PSU; Corina Graif, Penn State University
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning
Author(s): Yue Ning*, Virginia Tech; Sathappan Muthiah, Virginia Tech; Huzefa Rangwala, George Mason University; Naren Ramakrishnan, Virginia Tech
Skinny-dip: Clustering in a Sea of Noise
Author(s): Samuel Maurus*, Helmholtz Zentrum München; Claudia Plant
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages
Author(s): Matteo Riondato*, Two Sigma Investments; Eli Upfal, Brown University
Accelerated Stochastic Block Coordinate Descent with Optimal Sampling
Author(s): Aston Zhang*, UIUC; Quanquan Gu, University of Virginia
Stochastic Optimization Techniques for Quantification Performance Measures
Author(s): Harikrishna Narasimhan, IACS, Harvard University; Shuai Li, University of Insubria; Purushottam Kar*, IIT Kanpur; Sanjay Chawla, QCRI-HBKU, Qatar; Fabrizio Sebastiani, QCRI-HBKU, Qatar
Compressing Graphs and Indexes with Recursive Graph Bisection
Author(s): Laxman Dhulipala, Carnegie Mellon University; Igor Kabiljo, Facebook; Brian Karrer, Facebook; Giuseppe Ottaviano, Facebook; Sergey Pupyrev*, Facebook; Alon Shalita, Facebook
GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction
Author(s): XianXing Zhang*, LinkedIn; Bee-Chung Chen, LinkedIn; Liang Zhang, LinkedIn; Yitong Zhou, LinkedIn Corporation; Yiming Ma, LinkedIn; Deepak Agarwal, LinkedIn
Convex Optimization for Linear Query Processing under Approximate Differential Privacy
Author(s): Ganzhao Yuan*, SCUT; Yin Yang, ; Zhenjie Zhang, ; Zhifeng Hao,
Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches
Author(s): Daniel Ting*, Facebook
Dynamic Clustering of Streaming Short Documents
Author(s): Shangsong Liang*, University College London; Emine Yilmaz, University College London; Evangelos Kanoulas, University of Amsterdam
Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization
Author(s): Junming Liu, Rutgers University; Leilei Sun, ; Hui Xiong*, Rutgers; Weiwei Chen,
Accelerating Online CP Decompositions for Higher Order Tensors
Author(s): Shuo Zhou*, University of Melbourne; Nguyen Vinh, University of Melbourne; James Bailey, ; Yunzhe Jia, University of Melbourne; Ian Davidson, University of California-Davis
Approximate Personalized PageRank on Dynamic Graphs
Author(s): Hongyang Zhang*, Stanford University; Peter Lofgren, Stanford University
Annealed Sparsity via Adaptive and Dynamic Shrinking
Author(s): Kai Zhang*, NEC labs America; Shandian Shan, Purdue University; Zhengzhang Chen, NEC Lab America; Chaoran Cheng, New Jersey Institute of Technology; Zhi Wei, New Jersey Institute of Technology; Guofei Jiang, NEC labs America; Jieping Ye,
Smart Reply: Automated Response Suggestion for Email
Author(s): Anjuli Kannan, ; Karol Kurach*, Google; Sujith Ravi, Google; Tobias Kaufmann, Google, Inc.; Andrew Tomkins, ; Balint Miklos, Google, Inc.; Greg Corrado, ; László Lukács, ; Marina Ganea, ; Peter Young, ; Vivek Ramavajjala
Efficient Frequent Directions Algorithm for Sparse Matrices
Author(s): Mina Ghashami*, University of Utah; Edo Liberty, Yahoo; Jeff Phillips, School of Computing, University of Utah
Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detectio
Author(s): Hao Hu, University of Central Florida; Joey Velez-Ginorio, University of Central Florida; Guojun Qi*, University of Central Florida
Assessing Human Error Against a Benchmark of Perfection
Author(s): Ashton Anderson*, Stanford University; Jon Kleinberg, Cornell University; Sendhil Mullainathan, Harvard
