Title: Scalable Prediction for Automating Structured Data Cleaning
Speaker: Ihab Ilyas
Abstract
Data scientists spend a big chunk of their time preparing, cleaning, and transforming raw data before getting the chance to feed this data to their well-crafted models. Despite the efforts to build robust prediction and classification models, data errors are still the main reason for low-quality results. These massive, labor-intensive exercises to clean data remain the main impediment to an automatic end-to-end AI pipeline for data science.
In this talk, I focus on data cleaning as an inference problem that can be automated by leveraging the great advancements in AI and ML in the last few years. I will start with a background describing the evolution of data cleaning efforts, and I will describe the HoloClean framework, a machine learning framework for data profiling and cleaning (error detection and repair). The framework has multiple successful deployments in cleaning census data, and pilots with commercial enterprises to boost the quality of source (training) data before feeding it to downstream analytics.
HoloClean builds two main probabilistic models: a data generation model (describing how data was intended to look); and a realization model (describing how errors might be introduced to the intended clean data). The framework uses few-shot learning, data augmentation, and weak supervision to learn the parameters of these models, and uses them to predict both errors and their possible repairs.
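To make the two-model idea concrete, here is a minimal toy sketch in Python. It is not HoloClean's implementation or API. It assumes a single categorical cell, a generation model estimated from simple co-occurrence counts, and a realization model with a fixed, hand-picked error rate, whereas the framework described above learns its parameters. All function names, column names, and the sample data are hypothetical.

```python
from collections import Counter


def fit_generation_model(rows, context_col, target_col):
    """Toy data generation model: estimate P(target | context) from counts.

    Assumption: fitted directly on the (possibly dirty) data, so errors in
    the data leak into the estimate.
    """
    counts = {}
    for row in rows:
        counts.setdefault(row[context_col], Counter())[row[target_col]] += 1
    model = {}
    for ctx, counter in counts.items():
        total = sum(counter.values())
        model[ctx] = {val: n / total for val, n in counter.items()}
    return model


def realization_prob(observed, intended, domain_size, error_rate):
    """Toy realization model: P(observed | intended).

    The intended value survives with probability 1 - error_rate; otherwise
    it is replaced uniformly by one of the other domain values.
    """
    if observed == intended:
        return 1.0 - error_rate
    return error_rate / (domain_size - 1)


def repair_posterior(observed, context, gen_model, error_rate=0.2):
    """Combine both models: P(intended | observed, context), normalized."""
    prior = gen_model[context]
    domain = set(prior) | {observed}
    scores = {
        cand: prior.get(cand, 0.0)
        * realization_prob(observed, cand, len(domain), error_rate)
        for cand in domain
    }
    total = sum(scores.values())
    return {cand: score / total for cand, score in scores.items()}


# Hypothetical sample data: nine consistent rows and one misspelled city.
rows = [{"zip": "60608", "city": "Chicago"} for _ in range(9)]
rows.append({"zip": "60608", "city": "Cicago"})

gen_model = fit_generation_model(rows, "zip", "city")
posterior = repair_posterior("Cicago", "60608", gen_model)
predicted = max(posterior, key=posterior.get)
print(predicted, posterior)
```

In this sketch the same posterior serves both tasks: a cell is flagged as an error when its most probable intended value differs from the observed one, and that most probable value is the suggested repair. With the sample data, the misspelled cell is repaired to "Chicago", while an observed "Chicago" is left unchanged.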
Slides
Bio
Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. His main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, machine learning for data curation, and information extraction. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration, and he is also the co-founder of inductiv (now part of Apple), a Waterloo-based startup on using AI for structured data cleaning. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Fellow.