Microsoft Research / Harvard University
Data, algorithms, and systems have biases embedded within them reflecting designers’ explicit and implicit choices, historical biases, and societal priorities. They form, literally and inexorably, a codification of values. “Unfairness” of algorithms – for tasks ranging from advertising to recidivism prediction – has attracted considerable attention in the popular press. The talk will discuss the nascent mathematically rigorous study of fairness in classification and scoring.
University of California at Berkeley
Three Principles of Data Science: Predictability, Stability, and Computability
In this talk, I’ll discuss the importance of, and the connections among, the three principles of data science named in the title, as they apply to data-driven decisions. The ultimate importance of prediction lies in the fact that the future holds the unique, and possibly the only, purpose of all human activities, in business, education, research, and government alike. With prediction as its central task and computation at its core, machine learning has enabled wide-ranging data-driven successes. Prediction is a useful way to check against reality. Good prediction implicitly assumes stability between past and future. Stability (relative to data and model perturbations) is also a minimum requirement for the interpretability and reproducibility of data-driven results, and it is closely related to uncertainty assessment. Neither the prediction principle nor the stability principle can be employed without feasible computational algorithms, hence the importance of computability. The three principles will be demonstrated through analytical connections and in the context of two ongoing projects, for which “data wisdom” is also indispensable. Specifically, the first project employs deep learning networks (CNNs) to understand the pattern selectivities of neurons in the difficult visual cortex area V4, and the second project predicts the partisanship and tone of political TV ads by employing and comparing different latent variable models with a Lasso-based model.
Renée J. Miller
University of Toronto
Big Data Curation
In this talk, I consider some of the challenges in scaling data curation systems, including data integration and data cleaning systems. First, I argue that although data integration and cleaning are very mature fields, rigorous empirical evaluations of systems are relatively scarce. I identify a major roadblock for empirical work: the lack of tools that aid in generating the inputs and gold-standard outputs for integration or cleaning tasks in a controlled, effective, and repeatable manner. I give an overview of our efforts to develop such tools and highlight how they have been used to streamline the empirical evaluation of a wide variety of systems. Second, I consider the problem of dataset search. Web search algorithms are designed for documents, not data. To search for structured data, the state of the art is to use traditional schema and data (entity) matching algorithms, but these are either too expensive to use over big data or ineffective on schema-free web data. I present some new results that bring us closer to achieving fast, Internet-scale dataset search and discuss applications to data science.
We are adding new speakers regularly. Check back often.