In the quest to understand the intricacies of trained neural networks, MIT researchers from the Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a groundbreaking method. Termed the "automated interpretability agent" (AIA), this approach utilizes AI models to conduct experiments on other systems and elucidate their behavior.
Unlike previous methods, which often required extensive human oversight, the AIA acts as an autonomous experimental scientist, planning and executing tests on computational systems of varying scales. Sarah Schwettmann, Ph.D., a co-lead author of the paper, highlights the advantages, stating, "It's remarkable that language models, when equipped with tools for probing other systems, are capable of this type of experimental design."
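To make the idea concrete, here is a minimal Python sketch of such an experiment loop. It is illustrative only: in the actual AIA a language model plans each probe and writes the final description, whereas the toy function, random probing strategy, and hand-written summary below are assumptions for the sake of the example.

```python
import random

def black_box(x: float) -> float:
    """Stand-in for a computation inside a trained network (unknown to the agent)."""
    return max(0.0, 2.0 * x - 1.0)  # e.g., a shifted ReLU

def run_interpretation_agent(fn, n_experiments: int = 20) -> str:
    """Toy loop: probe the function, collect evidence, return a description.

    In the real AIA, a language model chooses the probes and writes the
    description; random sampling and a canned summary stand in here.
    """
    observations = []
    for _ in range(n_experiments):
        x = random.uniform(-5.0, 5.0)   # an LM would choose this input deliberately
        y = fn(x)                       # run the experiment on the target system
        observations.append((x, y))

    # An LM would reason over the observations; we just report a simple pattern.
    zeros = sum(1 for _, y in observations if y == 0.0)
    return (f"Probed {len(observations)} inputs; {zeros} returned exactly 0, "
            "suggesting a thresholded (ReLU-like) computation.")

if __name__ == "__main__":
    print(run_interpretation_agent(black_box))
```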
Central to this initiative is the newly introduced "function interpretation and description" (FIND) benchmark. This test bed comprises functions resembling computations within trained networks, accompanied by detailed descriptions of their behavior. The FIND benchmark addresses a long-standing challenge in the field by providing a reliable standard for evaluating interpretability procedures.
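The structure of such a test bed can be sketched in a few lines: each entry pairs an opaque function with a ground-truth description of its behavior, and an agent that sees only the function is judged against the description. The `BenchmarkFunction` class, the example functions, and their wording are assumptions made for illustration, not the released FIND data.

```python
from dataclasses import dataclass
from typing import Callable
import math

@dataclass
class BenchmarkFunction:
    """Hypothetical FIND-style entry: an opaque function plus a ground-truth description."""
    fn: Callable[[float], float]
    ground_truth: str

# Illustrative entries in the spirit of FIND's synthetic numeric functions.
FUNCTIONS = [
    BenchmarkFunction(
        fn=lambda x: math.sin(x) if x > 0 else 0.0,
        ground_truth="Computes sin(x) for positive inputs and returns 0 otherwise.",
    ),
    BenchmarkFunction(
        fn=lambda x: 3.0 * x + 2.0,
        ground_truth="A linear function with slope 3 and intercept 2.",
    ),
]

# An interpretability agent is given only `fn` to probe; the description it
# writes is later compared against `ground_truth`.
print(FUNCTIONS[0].fn(1.0))  # -> 0.8414..., consistent with the first ground truth
```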
The innovative AIA method and FIND benchmark are seen as crucial in the evolving landscape of large language models. With these models often regarded as black boxes, external evaluations of interpretability methods become increasingly vital. Schwettmann notes, "Clean, simple benchmarks with ground-truth answers have been a major driver of more general capabilities in language models, and we hope that FIND can play a similar role in interpretability research."
While language models have demonstrated their ability to perform complex reasoning tasks, the challenge lies in making these models interpretable. The CSAIL team believes that AI models, equipped with interpretability agents, could serve as general-purpose tools for explaining a wide range of systems, bridging gaps between different experimental settings and data modalities.
The researchers acknowledge that full automation of interpretability is still a work in progress. The evaluation using FIND reveals that while AIAs outperform existing interpretability approaches, they still fail to accurately describe nearly half of the functions in the benchmark. Tamar Rott Shaham, co-lead author of the study, emphasizes that AIAs are effective at describing high-level functionality but often overlook finer details, particularly in regions of functions with noise or irregular behavior.
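One way such a missed "finer detail" can surface in evaluation is to compare the outputs of code written from an agent's description against the true function on held-out inputs. The sketch below uses that scoring idea with invented functions; it is an assumption for illustration, not necessarily the paper's exact evaluation protocol.

```python
import numpy as np

def true_fn(x):
    """Benchmark function the agent was asked to describe."""
    return np.maximum(0.0, 2.0 * x - 1.0)

def fn_from_description(x):
    """Code derived from the agent's description (imagine an LM produced it).
    Here the description missed the 2x scaling, a 'finer detail'."""
    return np.maximum(0.0, x - 1.0)

# Score agreement on held-out inputs; low agreement flags an imprecise description.
xs = np.linspace(-3, 3, 201)
agreement = float(np.mean(np.isclose(true_fn(xs), fn_from_description(xs), atol=1e-3)))
print(f"Output agreement: {agreement:.0%}")
```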
Looking ahead, the team envisions developing nearly autonomous AIAs capable of auditing other systems, with human scientists providing oversight. Their goal is to expand AI interpretability to include more complex behaviors, such as entire neural circuits or subnetworks, and predict inputs that might lead to undesired behaviors.
The study, presented at NeurIPS 2023, underscores the importance of automated interpretability in advancing our understanding of AI systems. The team's efforts align with the broader goal of making AI systems more understandable and reliable, crucial in fields like autonomous driving and face recognition.