CRAY Colloquium: From English to SQL: What We Learned Building LLM-Powered Text-to-SQL
The computer science colloquium takes place on Mondays from 11:15 a.m. to 12:15 p.m. This week's speaker, Fatma Özcan (Google), will give a talk titled "From English to SQL: What We Learned Building LLM-Powered Text-to-SQL".
Abstract
The advent of Large Language Models (LLMs) has ignited renewed interest in text-to-SQL from both academia and industry. This task remains challenging, as it requires bridging the gap between inherently ambiguous natural language questions and the complex schema and data semantics of the target database.
To address these challenges, we have developed and evaluated multiple text-to-SQL solutions. This talk will detail the lessons learned, with a focus on two key contributions. First, we present CHASE-SQL, a novel multi-agent LLM framework that generates diverse SQL candidates through three distinct pipelines, achieving 76% execution accuracy on the BIRD benchmark. Second, we will discuss a comprehensive study that quantifies the impact of different contextual information sources—including column value examples, few-shot examples, user hints, SQL documentation, and schema structure—on model performance.
While benchmarks like BIRD and Spider have driven significant innovation, our experience building real-world applications has revealed opportunities for improvement beyond academic metrics. We will conclude by outlining these opportunities for future research and presenting a forward-looking perspective on the evolution of natural language interfaces for data interaction.
Biography
Fatma Özcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on LLMs and ML for data management, text2SQL and conversational interfaces to data, and platforms and infrastructure for large-scale data analysis. Dr. Özcan received her PhD in computer science from the University of Maryland, College Park. She has over 24 years of experience in industrial research and has delivered core technologies into various IBM and Google products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. She is the co-author of the book "Heterogeneous Agent Systems" and of several conference papers and patents. She is an ACM Fellow, serves on the CRA board of directors, and is the co-chair of CRA-Industry. She received the VLDB Women in Database Research Award in 2022.