CS&E Colloquium: Data Preparation: The Biggest Roadblock in Data Science

The computer science colloquium takes place on Mondays and Fridays from 11:15 a.m. to 12:15 p.m.

This week's speaker, El Kindi Rezig (MIT), will give a talk titled "Data Preparation: The Biggest Roadblock in Data Science".


When building machine learning (ML) models, data scientists face a significant hurdle: data preparation. ML models are exactly as good as the data we train them on. Unfortunately, data preparation is tedious and laborious because it often requires human judgment on how to proceed. In fact, data scientists spend at least 80% of their time locating the datasets they want to analyze, integrating them, and cleaning the result.

In this talk, I will present my key contributions in data preparation for data science, which address the following problems: (1) data discovery: how to discover data of interest from a large collection of heterogeneous tables (e.g., data lakes); (2) error detection: how to find errors in the input and intermediate data in complex data workflows; and (3) data repairing: how to repair data errors with minimal human intervention. The developed systems are specifically designed to support data science development, which poses particular requirements such as interactivity and modularity. The talk will feature demonstrations of data preparation systems as well as discussions of the algorithms and techniques we developed to enable data preparation at scale.


El Kindi Rezig is a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT, where he works under the supervision of Michael Stonebraker. He earned his Ph.D. in Computer Science from Purdue University under the supervision of Walid Aref and Mourad Ouzzani. His research interests revolve around data management in general and data preparation for data science in particular. He has developed systems in collaboration with several organizations, including Intel, Massachusetts General Hospital, and the U.S. Air Force.



Start date
Friday, March 18, 2022, 11:15 a.m.
End date
Friday, March 18, 2022, 12:15 p.m.

Online via Zoom