AI Startup Writer's Palmyra X V3 Surprises, Outperforms Google in Stanford's HELM Lite Evaluation

In an unexpected turn of events, AI startup Writer's Palmyra X V3 has emerged as the top-performing non-OpenAI model in Stanford University's latest ranking of foundation models, surpassing Google's PaLM 2. Despite its comparatively modest size of 72 billion parameters, Palmyra secured third place on the Holistic Evaluation of Language Models (HELM) Lite leaderboard, with PaLM 2 trailing in fourth.

Notably, the open-source Yi-34B from Chinese startup 01.ai, founded by AI pioneer Kai-Fu Lee, placed higher on the Stanford leaderboard than models from Anthropic and Meta, as well as Mistral 7B. Trained on three trillion tokens, Yi-34B showed strong performance across the evaluation's scenarios.

OpenAI's GPT-4 predictably secured the top spot, with its Turbo version in second place. GPT-4 Turbo was introduced as a more cost-effective model that can process significantly more text, but it proved less reliable at following instructions, which placed it below its predecessor.

Percy Liang, an associate professor of computer science at Stanford University, highlighted the unexpected success of smaller models in outperforming their larger counterparts. He noted that some recent models were excessively chatty, providing correct answers in formats the test did not expect, a pattern observed in the HELM Lite evaluation.
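To make the format problem concrete, here is a minimal, hypothetical Python sketch of how strict exact-match scoring, the kind of check many benchmark harnesses rely on, can mark a verbose but correct answer as wrong. It is an illustration of the failure mode Liang describes, not HELM's actual scoring code.

```python
# Minimal illustration (not HELM's code): exact-match scoring compares the
# model's output against a reference answer, so a correct answer wrapped in
# extra prose scores zero.

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 only if the trimmed prediction equals the reference."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

reference = "B"
concise = "B"
chatty = "Sure! After weighing all the options, the answer is B."

print(exact_match(concise, reference))  # 1.0 -- right answer, expected format
print(exact_match(chatty, reference))   # 0.0 -- right answer, wrong format
```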

The HELM Lite evaluation, designed to be lightweight yet broad, tests models across a range of scenarios, including machine translation, medical question answering, and questions about books. Stanford's researchers, inspired by Hugging Face's Open LLM leaderboard, are now developing a new benchmark in collaboration with MLCommons to assess model safety.
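How a single ranking emerges from many scenarios can be sketched briefly. HELM's leaderboards rank models by mean win rate, the fraction of head-to-head comparisons a model wins against every other model across scenarios. The Python sketch below illustrates the aggregation; the model names and scores are made-up assumptions for illustration only.

```python
# Simplified sketch of multi-scenario aggregation: each model gets a score
# per scenario, and the ranking averages how often a model beats each other
# model across scenarios ("mean win rate"). All numbers are invented.

scores = {
    "model_a": {"translation": 0.71, "medicine": 0.64, "books": 0.58},
    "model_b": {"translation": 0.69, "medicine": 0.70, "books": 0.55},
    "model_c": {"translation": 0.60, "medicine": 0.61, "books": 0.62},
}

def mean_win_rate(model: str) -> float:
    """Fraction of (other model, scenario) comparisons this model wins."""
    wins, total = 0, 0
    for other in scores:
        if other == model:
            continue
        for scenario in scores[model]:
            total += 1
            if scores[model][scenario] > scores[other][scenario]:
                wins += 1
    return wins / total

for model in sorted(scores, key=mean_win_rate, reverse=True):
    print(f"{model}: {mean_win_rate(model):.2f}")
```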

It's worth mentioning that the Stanford team had no internal access to closed models such as GPT-4 and Claude. Instead, they worked through the providers' standard interfaces, using carefully crafted prompts to steer the models toward outputs in the expected format, a snapshot of how closed systems can be compared in a fast-shifting landscape of AI model performance.
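Since only public endpoints were available, evaluating a closed model reduces to sending it a prompt that pins down the expected output format. Below is a hypothetical Python sketch of that general technique using OpenAI's client library; the model name, system prompt, and question are illustrative assumptions, not the Stanford team's actual setup.

```python
# Hypothetical sketch: querying a closed model through its public API with a
# prompt that constrains the output format. Model, prompt, and question are
# illustrative assumptions. Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Which element has the chemical symbol 'Fe'? A) Iron B) Lead C) Tin"

response = client.chat.completions.create(
    model="gpt-4",  # illustrative; any chat model exposed by the API works
    temperature=0,  # deterministic output keeps scoring reproducible
    messages=[
        {
            "role": "system",
            # The format constraint: answer with the letter only, nothing else.
            "content": "Answer with a single letter (A, B, or C) and no other text.",
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)  # expected: "A"
```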