Big Data
Curated by: H V Jagadish
The development of massively distributed computing infrastructures has changed the economics of data management and made it possible to apply sophisticated data distillation and learning methods to datasets of unprecedented scale, diversity, and freshness, a technical and social phenomenon that has been dubbed Big Data. The sheer size of the data is a major challenge, and the one most easily recognized, but there are others. Industry analysis companies like to point out that there are challenges not just in Volume, but also in Variety and Velocity [see the Gartner Group press release available at http://www.gartner.com/it/page.jsp?id=1731916], and that companies should not focus on just the first of these. Variety refers to heterogeneity of data types, representation, and semantic interpretation. Velocity denotes both the rate at which data arrive and the time frame in which they must be acted upon. While these three are important, this short list omits other important requirements. Various parties have proposed additions, such as Veracity. Other concerns, such as privacy and usability, still remain.
The analysis of Big Data is an iterative process that involves many distinct phases, each with its own challenges. An excellent overview is available in a community whitepaper hosted at http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf. A few dozen papers, chosen for their coverage and importance, have been collected at http://db.cs.pitt.edu/bigdata/resources.
The papers on this topic at this KDD conference do not disappoint in the breadth of questions they ask, from the efficiency of algorithms to trust in results. Enjoy!
Related KDD2016 Papers
Title & Authors |
---|
Robust Large-Scale Machine Learning in the Cloud Author(s): Steffen Rendle*, Google; Dennis Fetterly, Google, Inc.; Eugene Shekita, Google, Inc.; Bor-Yiing Su, Google, Inc. |
Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs Author(s): Emaad Manzoor, Stony Brook University; Leman Akoglu*, SUNY Stony Brook |
FLASH: Fast Bayesian Optimization for Data Analytic Pipelines Author(s): Yuyu Zhang*, Georgia Institute of Technology; Mohammad Bahadori, Georgia Institute of Technology; Hang Su, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology |
Scalable Pattern Matching over Compressed Graphs via Dedensification Author(s): Antonio Maccioni*, Roma Tre University; Daniel Abadi, Yale University |
Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix Author(s): Huizhi Xie*, Netflix; Juliette Aurisset, Netflix |
XGBoost: A Scalable Tree Boosting System Author(s): Tianqi Chen*, University of Washington; Carlos Guestrin, Dato/Univ of Washington |
Deep Visual-Semantic Hashing for Cross-Modal Retrieval Author(s): Yue Cao, Tsinghua University; Mingsheng Long*, Tsinghua University; Jianmin Wang, Tsinghua University; Qiang Yang, HKUST; Philip Yu, UIC |
Transfer Knowledge between Cities Author(s): Ying Wei*, Hong Kong Univ. of Sci. & Tech; Yu Zheng, Microsoft Research; Qiang Yang, HKUST |
Parallel Lasso Screening for Big Data Optimization Author(s): Qingyang Li*, Arizona State University; Shuang Qiu, Umich; Shuiwang Ji, Washington State University; Jieping Ye, University of Michigan at Ann Arbor; Jie Wang, University of Michigan |
Crime Rate Inference with Big Data Author(s): Hongjian Wang*, Penn State University; Zhenhui Li, Penn State Univ; Daniel Kifer, PSU; Corina Graif, Penn State University |
Modeling Precursors for Event Forecasting via Nested Multi-Instance Learning Author(s): Yue Ning*, Virginia Tech; Sathappan Muthiah, Virginia Tech; Huzefa Rangwala, George Mason University; Naren Ramakrishnan, Virginia Tech |
Skinny-dip: Clustering in a Sea of Noise Author(s): Samuel Maurus*, Helmholtz Zentrum München; Claudia Plant |
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with Rademacher Averages Author(s): Matteo Riondato*, Two Sigma Investments; Eli Upfal, Brown University |
Accelerated Stochastic Block Coordinate Descent with Optimal Sampling Author(s): Aston Zhang*, UIUC; Quanquan Gu, University of Virginia |
Stochastic Optimization Techniques for Quantification Performance Measures Author(s): Harikrishna Narasimhan, IACS, Harvard University; Shuai Li, University of Insubria; Purushottam Kar*, IIT Kanpur; Sanjay Chawla, QCRI-HBKU, Qatar; Fabrizio Sebastiani, QCRI-HBKU, Qatar |
Compressing Graphs and Indexes with Recursive Graph Bisection Author(s): Laxman Dhulipala, Carnegie Mellon University; Igor Kabiljo, Facebook; Brian Karrer, Facebook; Giuseppe Ottaviano, Facebook; Sergey Pupyrev*, Facebook; Alon Shalita, Facebook |
GLMix: Generalized Linear Mixed Models For Large-Scale Response Prediction Author(s): XianXing Zhang*, LinkedIn; Bee-Chung Chen, LinkedIn; Liang Zhang, LinkedIn; Yitong Zhou, LinkedIn Corporation; Yiming Ma, LinkedIn; Deepak Agarwal, LinkedIn |
Positive-Unlabeled Learning in Streaming Networks Author(s): Shiyu Chang*, UIUC; Yang Zhang, UIUC; Jiliang Tang, Yahoo Labs; Dawei Yin, ; Yi Chang, Yahoo! Labs; Mark Hasegawa-Johnson, UIUC; Thomas Huang, UIUC |
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size Author(s): Lorenzo De Stefani*, Brown University; Alessandro Epasto, Brown; Matteo Riondato, Two Sigma Investments; Eli Upfal, Brown University |
Learning Cumulatively to Become More Knowledgeable Author(s): Geli Fei*, Univ of Illinois at Chicago; Shuai Wang, Univ of Illinois at Chicago; Bing Liu, Univ of Illinois at Chicago |
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices Author(s): Yasuo Tabei*, JST; Hiroto Saigo, Kyushu Institute of Technology; Yoshihiro Yamanishi, Kyushu University; Simon Puglisi, Helsinki University |
Boosted Decision Tree Regression Adjustment for Variance Reduction of Online Controlled Experiments Author(s): Alexey Poyarkov, Yandex; Alexey Drutsa*, Yandex; Andrey Khalyavin, Yandex; Gleb Gusev, Yandex; Pavel Serdyukov, Yandex |
Robust Influence Maximization Author(s): Wei Chen, Microsoft Research; Tian Lin*, Tsinghua University; Zihan Tan, IIIS, Tsinghua University; Mingfei Zhao, IIIS, Tsinghua University; Xuren Zhou, The Hong Kong University of Science and Technology |
Lightweight Monitoring of Distributed Streams Author(s): Daniel Keren*, University of Haifa; Assaf Schuster, Technion; Arnon Lazerson, Israeli Institute of Technology |
Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test Author(s): Denis Dos Reis*, Universidade de São Paulo; Gustavo Batista, Universidade de Sao Paulo at Sao Carlos; Peter Flach, University of Bristol; Stan Matwin, Dalhousie University |
Scalable Betweenness Centrality Maximization via Sampling Author(s): Ahmad Mahmoody*, Brown University; Eli Upfal, Brown University; Charalampos Tsourakakis, Harvard |
Recruitment Market Trend Analysis with Sequential Latent Variable Models Author(s): Chen Zhu*, Baidu hr; Hengshu Zhu, Baidu Inc.; Hui Xiong, Rutgers; Ding Pengliang, ; Xie Fang, |
Taxi Driving Behavior Analysis in Latent Vehicle-to-Vehicle Networks: A Social Influence Perspective Author(s): Tong Xu*, USTC; Hengshu Zhu, Baidu Inc.; Xiangyu Zhao, USTC; Hao Zhong, Rutgers University; Qi Liu, University of Science and Technology of China; Enhong Chen, ; Hui Xiong, Rutgers |
Distributing the Stochastic Gradient Sampler for Large-Scale LDA Author(s): Yuan Yang*, Beihang University; Jianfei Chen, Tsinghua University; Jun Zhu, |
Safe Pattern Pruning: An Efficient Approach for Predictive Pattern Mining Author(s): Kazuya Nakagawa, Nagoya Institute of Technology; Shinya Suzumura, Nagoya Institute of Technology; Masayuki Karasuyama, ; Koji Tsuda, University of Tokyo; Ichiro Takeuchi*, Nagoya Institute of Technology Japan |
Singapore in Motion: insights on public transport service level through farecard and mobile data ana Author(s): Hasan Poonawala*, IBM; Vinay Kolar, IBM; Sebastien Blandin, IBM; Laura Wynter, IBM; Sambit Sahu, IBM |
Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned Author(s): Xiaolin Shi*, Yahoo Labs; Alex Deng, Microsoft |
Communication Efficient Distributed Kernel Principal Component Analysis Author(s): Yingyu Liang*, Princeton University; Bo Xie, ; David Woodruff, IBM Research; Le Song, ; Maria-Florina Balcan, |
Regime Shifts in Streams: Real-time Forecasting of Co-evolving Time Sequences Author(s): Yasuko Matsubara*, Kumamoto University; Yasushi Sakurai, Kumamoto University |
Multi-layer Representation Learning for Medical Concepts Author(s): Edward Choi*, Georgia Institute of Technology; Mohammad Bahadori, Georgia Institute of Technology; Jimeng Sun, Georgia Institute of Technology |
Fast Component Pursuit for Large-Scale Inverse Covariance Estimation Author(s): Lei Han*, Rutgers University; Yu Zhang, Hong Kong University of Science and Technology; Tong Zhang, Rutgers University |
Scalable Fast Rank-1 Dictionary Learning for fMRI Big Data Analysis Author(s): Xiang Li*, The University of Georgia; Milad Makkie, ; Binbin Lin, ; Mojtaba Sedigh Fazli, ; Ian Davidson, University of California-Davis; Jieping Ye, ; Tianming Liu, ; Shannon Quinn, |
Evaluating Mobile App Release Author(s): Ya Xu*, LinkedIn Corporation; Nanyu Chen, LinkedIn Corporation |
Revisiting Random Binning Feature: Fast Convergence and Strong Parallelizability Author(s): Lingfei Wu*, College of William and Mary; En-Hsu Yen, University of Texas at Austin; Jie Chen, IBM Research; Rui Yan, Baidu Inc. |
PTE: Enumerating Trillion Triangles On Distributed Systems Author(s): Ha-Myung Park*, KAIST; Sung-Hyon Myaeng, KAIST; U Kang, Seoul National University |
“Why Should I Trust You?” Explaining the Predictions of Any Classifier Author(s): Marco Tulio Ribeiro*, University of Washington; Sameer Singh, University of Washington, Seattle; Carlos Guestrin, Dato/Univ of Washington |
Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environm Author(s): Wei-Lin Chiang, National Taiwan University; Mu-Chu Lee, National Taiwan University; Chih-Jen Lin*, National Taiwan University |
Online Asymmetric Active Learning with Imbalanced Data Author(s): Xiaoxuan Zhang*, University of Iowa; Tianbao Yang, Univ of Iowa; Padmini Srinivasan, University of Iowa |
Deploying Analytics with the Portable Format for Analytics (PFA) Author(s): Jim Pivarski, Open Data Group Inc.; Collin Bennett, Open Data Group Inc.; Robert Grossman*, University of Chicago |
EMBERS at 4 years: Experiences operating an Open Source Indicators Forecasting System Author(s): Sathappan Muthiah*, Virginia Tech; Naren Ramakrishnan, Virginia Tech; Patrick Butler, Virginia Tech; Rupinder Khandpur, Virginia Tech; Parang Saraf, Virginia Tech; Anil Vullikanti, Virginia Tech; Achla Marathe, Virginia Tech; Graham Katz, CACI; Andrew Doyle, CACI; Jaime Arredondo, UCSD; Dipak Gupta, SDSU; David Mares, UCSD; Jose Cadena, Virginia Tech; Liang Zhao, VT; Nathan Self, ; Alla Rozovskaya, Virginia Tech; Kristen Summers, IBM |
Accelerating Online CP Decompositions for Higher Order Tensors Author(s): Shuo Zhou*, University of Melbourne; Nguyen Vinh, University of Melbourne; James Bailey, ; Yunzhe Jia, University of Melbourne; Ian Davidson, University of California-Davis |
Approximate Personalized PageRank on Dynamic Graphs Author(s): Hongyang Zhang*, Stanford University; Peter Lofgren, Stanford University |
Annealed Sparsity via Adaptive and Dynamic Shrinking Author(s): Kai Zhang*, NEC labs America; Shandian Shan, Purdue University; Zhengzhang Chen, NEC Lab America; Chaoran Cheng, New Jersey Institute of Technology; Zhi Wei, New Jersey Institute of Technology; Guofei Jiang, NEC labs America; Jieping Ye, |
Convex Optimization for Linear Query Processing under Approximate Differential Privacy Author(s): Ganzhao Yuan*, SCUT; Yin Yang, ; Zhenjie Zhang, ; Zhifeng Hao, |
Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches Author(s): Daniel Ting*, Facebook |
Dynamic Clustering of Streaming Short Documents Author(s): Shangsong Liang*, University College London; Emine Yilmaz, University College London; Evangelos Kanoulas, University of Amsterdam |
Rebalancing Bike Sharing Systems: A Multi-source Data Smart Optimization Author(s): Junming Liu, Rutgers University; Leilei Sun, ; Hui Xiong*, Rutgers; Weiwei Chen, |
Assessing Human Error Against a Benchmark of Perfection Author(s): Ashton Anderson*, Stanford University; Jon Kleinberg, Cornell University; Sendhil Mullainathan, Harvard |
Smart Reply: Automated Response Suggestion for Email Author(s): Anjuli Kannan, ; Karol Kurach*, Google; Sujith Ravi, Google; Tobias Kaufmann, Google, Inc.; Andrew Tomkins, ; Balint Miklos, Google, Inc.; Greg Corrado, ; László Lukács, ; Marina Ganea, ; Peter Young, ; Vivek Ramavajjala |
Efficient Frequent Directions Algorithm for Sparse Matrices Author(s): Mina Ghashami*, University of Utah; Edo Liberty, Yahoo; Jeff Phillips, School of Computing, University of Utah |
Temporal Order-based First-Take-All Hashing for Fast Attention-Deficit-Hyperactive-Disorder Detectio Author(s): Hao Hu, University of Central Florida; Joey Velez-Ginorio, University of Central Florida; Guojun Qi*, University of Central Florida |