Abstract:
Benchmarks are vital for driving progress across AI applications, serving as a foundation for defining success and inspiring innovation. In the post-ChatGPT era, their design faces new challenges due to the growing capabilities of large models and increasingly complex tasks. This talk highlights two key principles for creating effective benchmarks: comprehensive evaluation metrics and robust dataset design. On the evaluation front, we explore the shift from traditional, objective metrics to human-aligned metrics, exemplified by the "CLAIR-A" case study on LLMs as evaluators. For dataset design, we emphasize diverse, representative, and controlled datasets, illustrated by the "Visual Haystacks" case study for long-context visual understanding. Together, these approaches enable benchmarks to better reflect real-world challenges and drive meaningful AI progress.
Bio:
Tsung-Han (Patrick) Wu is a second-year CS PhD student at UC Berkeley, advised by Prof. Trevor Darrell and Prof. Joseph E. Gonzalez. His recent work focuses on exploring zero-shot applications of Large (Vision) Language Models and addressing their limitations. Before starting his PhD, he earned an MS and a BS in Computer Science and Information Engineering from National Taiwan University (NTU). For more information, please visit his personal website: https://tsunghan-wu.github.io/.