Instruction-Following and its Evaluation Using LLMs

Seminar: Applied Mathematics
Event time: Wednesday, October 23, 2024 - 2:30pm
Location: LOM 214
Speaker: Arman Cohan
Speaker affiliation: Yale University
Event description: 

In this talk, I’ll discuss our recent work on evaluating the instruction-following capabilities of LLMs. State-of-the-art LLMs can follow user instructions and perform tasks across a wide range of domains with impressive results. However, their performance on open-ended text generation is often evaluated only on specific benchmarks and under particular protocols, leaving open questions about their robustness, generalizability, and alignment with human expectations. First, I will discuss the difficulties of evaluating long-form text generation, especially for tasks such as summarization, where outputs can differ in nuanced ways. Second, I’ll present our new work on a comprehensive, large-scale meta-evaluation of instruction following. Our study reveals that instruction-controllable summarization remains challenging for LLMs: 1) the LLMs evaluated still make factual and other types of errors in their summaries; 2) none of the LLM-based evaluation methods achieves strong alignment with human annotators. We also find that open-source models like Llama-3.1-405B are approaching the performance of top proprietary models, and that dedicated evaluation protocols do not consistently outperform baseline pairwise comparison approaches. Moreover, protocol effectiveness varies significantly with the base LLM and dataset diversity. I will conclude by highlighting ongoing challenges and opportunities in improving LLMs on long-form generation tasks.