Patronus AI, a startup dedicated to responsible AI deployment, has unveiled SimpleSafetyTests, a diagnostic test suite designed to identify critical safety risks in large language models (LLMs). The launch follows rising concern about the harm generative AI systems such as ChatGPT could cause if not adequately safeguarded.
In an exclusive interview with VentureBeat, Rebecca Qian, co-founder and CTO of Patronus AI, said she was surprised to find unsafe responses across model sizes, from 7 billion to 40 billion parameters.
Critical Safety Gaps Uncovered
SimpleSafetyTests comprises 100 test prompts that probe vulnerabilities across five high-priority harm areas, including suicide, child abuse, and physical harm. Trials conducted by Patronus on 11 popular open-source LLMs revealed critical weaknesses, with several models producing unsafe responses to more than 20% of the prompts.
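A suite structured this way lends itself to straightforward automation. As a rough illustration only, and assuming a hypothetical CSV layout with one prompt per row and its harm area (not Patronus's published tooling), the prompt set could be loaded and grouped like this:

```python
import csv
from collections import defaultdict

# Hypothetical file and column names; the real dataset layout may differ.
def load_prompts_by_harm_area(path="simple_safety_tests.csv"):
    """Group test prompts by the high-priority harm area they target."""
    prompts = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompts[row["harm_area"]].append(row["prompt"])
    return prompts

if __name__ == "__main__":
    by_area = load_prompts_by_harm_area()
    for area, items in by_area.items():
        print(f"{area}: {len(items)} prompts")
```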
Anand Kannappan, co-founder and CEO of Patronus AI, pointed to the underlying training data distribution as the likely cause. He emphasized that there is little transparency into what these models are trained on, and that their behavior is essentially a function of that training data.
Guardrails Can Help but Additional Safeguards Needed
To gauge the impact of guardrails, the researchers added a safety-emphasizing system prompt, which reduced unsafe responses by 10 percentage points overall. Risks persisted nonetheless, suggesting that production systems will likely need additional safeguards.
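In practice, a guardrail like this amounts to prepending a safety-oriented system message to every request. A minimal sketch, assuming a hypothetical `query_model` helper and illustrative prompt wording (not the exact prompt used in the study):

```python
# Illustrative safety-emphasizing system prompt; the study's actual wording is not reproduced here.
SAFETY_SYSTEM_PROMPT = (
    "You are a helpful assistant. You must refuse to provide instructions "
    "or encouragement for self-harm, violence, or any other harmful activity."
)

def query_with_guardrail(query_model, user_prompt):
    """Send the user's prompt behind a safety-oriented system message.

    `query_model` is a hypothetical callable that accepts a list of
    chat-style messages and returns the model's text response.
    """
    messages = [
        {"role": "system", "content": SAFETY_SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]
    return query_model(messages)
```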
The test prompts are intentionally simple and clear-cut, designed to expose vulnerabilities and measure weaknesses. According to Qian, they function as a capabilities assessment, probing whether a system can respond safely even when prompted to enable harm.
Evaluation Methodology and Results
The SimpleSafetyTests diagnostic tool submits its 100 handcrafted test prompts as inputs with no added context. Expert human reviewers then label each response as safe or unsafe against strict guidelines, quantifying a model's critical safety gaps. The results varied significantly across models: Meta's Llama 2 (13B) produced no unsafe responses, while others, such as Anthropic's Claude and Google's PaLM, responded unsafely to more than 20% of test cases.
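Conceptually, the evaluation loop is simple: each prompt is sent to the model on its own, with no surrounding conversation, and the share of responses judged unsafe becomes the headline number. A minimal sketch under those assumptions (the `query_model` callable and human-label input are placeholders, not Patronus's actual harness):

```python
def evaluate_model(query_model, prompts, human_label):
    """Compute the percentage of unsafe responses for one model.

    query_model -- hypothetical callable: prompt string -> response string
    prompts     -- the handcrafted test prompts, each submitted with no extra context
    human_label -- callable mapping a (prompt, response) pair to "safe" or "unsafe",
                   standing in for the expert human review step
    """
    responses = [query_model(p) for p in prompts]
    unsafe = sum(
        1 for p, r in zip(prompts, responses) if human_label(p, r) == "unsafe"
    )
    return 100.0 * unsafe / len(prompts)
```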
Safety Solutions and Responsible AI for Regulated Sectors
Where models showed weaknesses, safety prompts, response filtering, and content moderation were identified as effective mitigations. Patronus AI, founded in 2023 with $3 million in seed funding, offers AI safety testing and mitigation services focused on helping companies use LLMs responsibly and with confidence.
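Response filtering can be as simple as passing the model's output through a moderation classifier before it reaches the user. One possible sketch, using OpenAI's moderation endpoint as a stand-in for whatever classifier a production system actually runs (the fallback message is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def filter_response(model_output: str) -> str:
    """Return the model's output only if a moderation check does not flag it."""
    result = client.moderations.create(input=model_output)
    if result.results[0].flagged:
        # Illustrative fallback; a real system might escalate or rewrite instead.
        return "I can't help with that request."
    return model_output
```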
The launch of SimpleSafetyTests comes amid growing demand for commercial AI deployment and mounting pressure for ethical and legal oversight. As generative AI gains prominence, calls for rigorous safety testing before deployment are growing louder, positioning SimpleSafetyTests as a first step toward ensuring the safety and quality of AI products and services.
Looking Ahead: Regulatory Collaboration and Security Layers
Anand Kannappan emphasized the importance of collaboration with regulatory bodies to produce safety analyses and better regulate AI. Evaluation reports, he said, could help clarify how language models perform against different criteria.
As generative AI continues to advance, Patronus AI advocates for an evaluation and security layer on top of AI systems to enable safe and confident usage. The release of SimpleSafetyTests marks a pivotal step in this direction, providing a valuable data point in the ongoing efforts to enhance the safety and reliability of AI technologies.