How we evaluate LLMs for SQL generation
Team AI discusses their multi-pronged approach to evaluating the success of their Large Language Model (LLM) tools for SQL generation.
Dune AI is on a mission to democratize access to crypto data insights through generative SQL tooling. The tooling has improved substantially since we first started developing it, and much of that success is due to our relentless evaluation of its accuracy. Please enjoy this short article on how we think about evaluating SQL generation.
SELECT metric FROM NON_EXISTENT_TABLE LIMIT 10
Above is the starting point for almost any venture into generated SQL on a custom database. We know we can make it much better, but how do we measure our progress along the way?
LLMs are quickly becoming proficient at writing SQL, with state-of-the-art models reaching 90% accuracy on easy questions and closer to 50% on hard questions. However, those metrics only hold for well-defined and fairly simple databases. SQL generation tools for custom databases require context around what data is available and how industry-specific metrics are applied. Any time you update those context-providing tools, you need to monitor your model to prove the changes have improved SQL generation and avoided any regressions. Doing so in a time- and cost-effective manner requires a multi-pronged approach.
Golden SQL - A data-comparison-based approach
This is going to be your highest-effort and highest-fidelity approach to monitoring success, but it is well worth the effort. At Dune, we relied heavily on this approach before launching our first SQL generation tools. This method requires building a benchmarking dataset that pairs a sample question with a SQL query validated by a domain expert.
Because SQL can be written in many ways, comparing the generated text can result in false negatives. Instead, we focus on comparing the query results. But data in databases is always changing, and that makes comparing aggregate statistics challenging. There are a few considerations that can make it easier.
- For the most consistent results, pin the prompt to a specific time or date range. We work with blockchain data, so we often use a given block number to guarantee consistency in a result set. For example, the total supply of a token at a given block time should never vary.
- Use a distance metric suitable to the result data type. Natural language prompts often ask about relative timeframes like “in the last hour” or “today”. That invalidates our previous trick of using a specific time. Since the underlying data is changing, we compare how close the answers are rather than looking for a perfect match.
- We use measurements like Euclidean distance (numeric responses), edit distance (text responses), and Jaccard score (for unordered sets) to track how similar the results are. Because these distance metrics can’t be directly compared, we bucket the answers as “close” or “not close”. Then we can aggregate those buckets across the benchmarking set into an overall score (a sketch of this bucketing follows below).
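Here is a minimal sketch of that bucketing. The helper names and the thresholds (a 1% numeric tolerance, 0.9 similarity cutoffs) are illustrative assumptions rather than our production values, and SequenceMatcher stands in for a true edit-distance metric.

```python
# A sketch of collapsing mixed distance metrics into a "close" / "not close" bucket.
# Thresholds and helper names are illustrative assumptions.
import math
from difflib import SequenceMatcher


def numeric_distance(expected: list[float], actual: list[float]) -> float:
    """Euclidean distance, normalized by the magnitude of the expected row."""
    if len(expected) != len(actual):
        return float("inf")
    scale = max(1.0, math.sqrt(sum(x * x for x in expected)))
    return math.dist(expected, actual) / scale


def text_similarity(expected: str, actual: str) -> float:
    """SequenceMatcher ratio stands in for a true edit-distance metric here."""
    return SequenceMatcher(None, expected, actual).ratio()


def jaccard(expected: set, actual: set) -> float:
    """Similarity of unordered sets, e.g. a list of returned addresses."""
    if not expected and not actual:
        return 1.0
    return len(expected & actual) / len(expected | actual)


def is_close(kind: str, expected, actual) -> bool:
    """Bucket one benchmark case so different metrics become comparable."""
    if kind == "numeric":
        return numeric_distance(expected, actual) < 0.01  # within ~1%
    if kind == "text":
        return text_similarity(expected, actual) > 0.9
    if kind == "set":
        return jaccard(set(expected), set(actual)) > 0.9
    raise ValueError(f"unknown result kind: {kind}")
```

The aggregate benchmark score is then simply the share of cases that land in the “close” bucket.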
Pros:
- High fidelity results. You can trust the results of your benchmarking dataset and trust decisions made on the results.
- Very effective at preventing regressions and identifying which prompts are affected by changes.
Cons:
- Low coverage. Building the benchmarking dataset is time-consuming and requires specialized knowledge of important metrics in your ecosystem.
- Can be time-consuming to run. LLMs aren’t fast, and often running queries isn’t fast either. We run our benchmarking script using pytest-xdist to speed things up, but it still isn’t exactly zippy (a sketch of such a parametrized benchmark test follows this list).
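For a rough idea of the shape of such a benchmark, here is a hedged sketch of a golden SQL case as a parametrized pytest test. `GOLDEN_CASES`, `generate_sql`, and `run_query` are hypothetical stand-ins for your own dataset, generation pipeline, and query runner, and `is_close` is the bucketing helper sketched earlier; with pytest-xdist installed, `pytest -n auto` spreads the cases across workers.

```python
# test_golden_sql.py -- a sketch of a golden SQL benchmark, not Dune's actual script.
# GOLDEN_CASES, generate_sql, run_query, and is_close are hypothetical stand-ins.
import pytest

GOLDEN_CASES = [
    # (natural-language prompt, expert-validated SQL, result kind)
    ("What was the total supply of DAI at block 17000000?", "SELECT ...", "numeric"),
    ("Which 10 NFT collections had the highest volume today?", "SELECT ...", "set"),
]


@pytest.mark.parametrize("prompt,golden_sql,kind", GOLDEN_CASES)
def test_generated_sql_matches_golden(prompt, golden_sql, kind):
    generated_sql = generate_sql(prompt)   # your LLM generation pipeline
    expected = run_query(golden_sql)       # result of the expert-validated query
    actual = run_query(generated_sql)      # result of the generated query
    assert is_close(kind, expected, actual)
```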
Unit Tests - For Developer Ease
We don’t want to wait around all day for the benchmarking script to run. To quickly iterate and make progress, we focus on a test-driven development cycle that breaks the model’s context helpers down into small components.
When mining for useful context, there is often one definitive answer we expect based on a given prompt. For example, before recommending a specific table, we need to extract the “subject” of a prompt. Then we can search over our 700k+ tables against that subject.
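A unit test for that kind of context helper can stay small and deterministic. The sketch below assumes a hypothetical `extract_subject` helper; the prompts and expected subjects are illustrative.

```python
# A sketch of a context-helper unit test. extract_subject is a hypothetical
# stand-in for the helper that pulls the "subject" out of a user prompt
# before we search table metadata for it.
import pytest


@pytest.mark.parametrize(
    "prompt,expected_subject",
    [
        ("What is the daily trading volume on Uniswap?", "uniswap"),
        ("How many DAI holders are there today?", "dai"),
    ],
)
def test_extract_subject(prompt, expected_subject):
    assert extract_subject(prompt) == expected_subject
```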
Pros:
- Unit tests allow us to work quickly on our expanding codebase and minimize time spent waiting on LLM API responses we don’t need. They also serve as a constant code smell test. If it’s hard to write a test for your change, you are probably doing it wrong!
Cons:
- There are not any concrete cons to writing unit tests, other than eventually you end up with A LOT of them.
Playback “A/B” Tests - Staying fresh to user behaviors
We found that users wrote very different prompts than we expected. That meant, as helpful as our benchmarking set was, it often looked very different from the IRL questions. To overcome this, we started running what I’m loosely calling a playback A/B test with an LLM-based scoring system.
We track every prompt and response in our product but don’t have the bandwidth to check each query. We rely on an imperfect solution that uses the LLM to score the generated SQL. We pass the LLM the original question, the generated SQL, and the first 10 rows of results, and ask it to score the accuracy on a scale from 0 to 100. Now this can often result in a /r/confidentlyincorrect situation where a blatantly wrong query is marked as 100, but it’s much more realistic than devoting countless dev hours to reading and reviewing each query.
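Here is a hedged sketch of that scoring call, assuming an OpenAI-style chat client; the model name, prompt wording, and score parsing are illustrative rather than our exact production setup.

```python
# A sketch of LLM-based self-scoring, assuming an OpenAI-style chat API.
# The prompt wording, model name, and parsing are illustrative assumptions.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCORING_TEMPLATE = """You are grading a generated SQL query.
Question: {question}
Generated SQL: {sql}
First rows of the result: {rows}
On a scale from 0 to 100, how accurately does the SQL answer the question?
Reply with a single integer."""


def score_generated_sql(question: str, sql: str, rows: list[dict]) -> int:
    """Ask an LLM to grade the generated query; imperfect, but cheap at scale."""
    prompt = SCORING_TEMPLATE.format(question=question, sql=sql, rows=rows[:10])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    content = response.choices[0].message.content or ""
    match = re.search(r"\d+", content)
    return min(100, int(match.group())) if match else 0
```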
When running an “A/B” test, we’ll take the last N prompts submitted and then retry them with a new model. The new results will be scored and uploaded as a private dataset using the Dune API, where we can visualize our results in the Dune App. (Dogfooding ftw)
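The replay itself is a small loop over recent prompts. In the sketch below, `fetch_recent_prompts`, `generate_sql`, and `run_query` are hypothetical stand-ins for your own helpers, `score_generated_sql` is the scorer sketched above, and uploading the resulting CSV as a private Dune dataset is left as a separate step.

```python
# A sketch of the playback "A/B" run. fetch_recent_prompts, generate_sql,
# and run_query are hypothetical stand-ins; score_generated_sql is the
# LLM-based scorer sketched above. Upload of the CSV to Dune is omitted.
import csv


def playback_ab_test(n: int, model_version: str, out_path: str = "ab_results.csv") -> None:
    rows = []
    for record in fetch_recent_prompts(limit=n):      # last N real user prompts
        sql = generate_sql(record.prompt, model=model_version)
        preview = run_query(sql, limit=10)            # first rows for the scorer
        score = score_generated_sql(record.prompt, sql, preview)
        rows.append({"prompt": record.prompt, "model": model_version, "score": score})

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["prompt", "model", "score"])
        writer.writeheader()
        writer.writerows(rows)
```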
Pros:
- Much more realistic questions compared to the ones written for the benchmarking golden SQL dataset.
- Helpful pathfinding. Large differences in LLM-generated scores warrant further investigation.
Cons:
- Much lower fidelity than using Golden SQL. Hard to trust results without manual inspection and follow-up.
Internal Forms for Generous Humans to Use
We are lucky to work with some generous, dedicated team members, and occasionally we lean on them to help evaluate results. However, we found early efforts to be exhausting because our first iteration was just downloading a giant CSV of responses. Staring at a CSV in Google Sheets quickly became grating and reviewer fatigue was high.
Building a response form made the process much more pleasant.
- By presenting a single prompt at a time, the process feels less overwhelming.
- We break the review into sections:
1) Is it a question we can answer on Dune?
2) Are the right tables recommended?
3) Does the SQL look right? (Showing SQL + Results)
- If the answer to any section is No, we skip to the next user prompt (a sketch of this section-by-section flow follows this list).
- We can break down what information we show when it’s needed. For example, we don’t need to show the full query to answer whether the question is answerable using Dune data, and showing it causes information overload.
- It’s easy to skip if a reviewer doesn’t have the domain expertise to answer the question.
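For a rough idea of the structure behind such a form, here is a sketch of the per-prompt review record and its early-exit flow; the field and method names are illustrative, not the actual form schema.

```python
# A sketch of the per-prompt review record behind the form and its skip logic.
# Field and method names are illustrative, not the real form schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewRecord:
    prompt: str
    answerable_on_dune: Optional[bool] = None        # 1) Can Dune answer this?
    right_tables_recommended: Optional[bool] = None  # 2) Right tables recommended?
    sql_looks_right: Optional[bool] = None           # 3) SQL + results look right?

    def next_section(self) -> Optional[str]:
        """Return the next section to show, or None to move on to the next prompt."""
        if self.answerable_on_dune is None:
            return "answerable_on_dune"
        if not self.answerable_on_dune:
            return None                              # a "No" ends this review early
        if self.right_tables_recommended is None:
            return "right_tables_recommended"
        if not self.right_tables_recommended:
            return None
        if self.sql_looks_right is None:
            return "sql_looks_right"
        return None
```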
Pros:
- Our team has a ton of domain expertise and their help is invaluable. They can identify intricate issues with the generated SQL that we might never uncover otherwise.
Cons:
- Much less scalable than using an LLM. To make this more helpful, we want to iterate by making sure only the trickiest prompts get seen by our teammates while the LLM handles the more mundane ones.
This has been Evaluating LLM Success at SQL Generation on Custom Databases: A Multi-Pronged Approach. I hope you learned something! DM me on Twitter (dot2dotseurat) or email me at [email protected] if you want to chat more about democratizing access to data through generative SQL.