CS&E Colloquium: SAUL: Towards Effective Data Science
The computer science colloquium takes place on Mondays and Fridays from 11:15 a.m. to 12:15 p.m.
This week's speaker, Lei Cao (MIT), will be giving a talk titled "SAUL: Towards Effective Data Science".
An effective data system should satisfy the SAUL properties: it should be scalable, automatic, and easy to keep a human in the loop. It should automatically address low-level performance bottlenecks to scale to big data. It should be tuning-free, or at least easy for users to tune. It should keep a human in the loop so that users can easily customize the system to meet their domain-specific requirements. The goal of my research is to build data systems that satisfy SAUL. My talk will cover our most recent work targeting the automatic dimension of SAUL, including RITA, which automates the preprocessing of timeseries data, and AutoAD, which automates the tuning process of anomaly detection.
Timeseries analytics is of great importance to many real-world applications. However, traditional timeseries analytics techniques rely heavily on humans to preprocess the data and extract features, which makes them hard to use and hard to scale. To solve this problem, we propose RITA, which, inspired by pre-training models in natural language processing, uses the correlations among the values in a timeseries to automatically produce high-quality feature embeddings. Its novel attention mechanism scales RITA to highly complex, massive-scale timeseries data.
Anomaly detection is critical in many scientific and engineering fields, ranging from defending against network intrusions to detecting seizures in EEG medical data. However, although previous research has offered a plethora of unsupervised anomaly detection algorithms, effective anomaly detection remains challenging for data scientists because they must manually determine which of these many algorithms is best suited to their particular domain. Automating this process is particularly challenging in the unsupervised setting, where no labels are available for cross-validation. AutoAD solves this problem by using a fundamentally new strategy that unifies the merits of unsupervised anomaly detection and supervised classification.
Dr. Lei Cao is a Research Scientist at MIT CSAIL, working with Prof. Samuel Madden and Prof. Michael Stonebraker in the Data System group. Before that, he worked at the IBM T.J. Watson Research Center as a Research Staff Member in the AI, Blockchain, and Quantum Solutions group. His recent research focuses on developing systems and algorithms that help data scientists make sense of data effectively.