A recent study by researchers at MIT and Penn State University has uncovered troubling findings regarding the use of large language models (LLMs) in home surveillance. The research shows that these AI-powered models could recommend calling the police even when surveillance footage reveals no criminal activity, raising concerns about their reliability and fairness.
The study found that the models analyzed—GPT-4, Gemini, and Claude—displayed inconsistencies in how they flagged videos for police intervention. For example, a model might flag one video showing a car break-in but fail to flag another showing a similar incident. Even more alarming, models often disagreed with each other on whether law enforcement should be contacted for the same video.
In addition to inconsistencies in flagging criminal activity, the researchers discovered that some models were less likely to recommend calling the police in predominantly white neighborhoods, controlling for other factors. This suggests the presence of inherent biases based on the demographics of a neighborhood, even though the models were not provided with explicit demographic data.
The study shows that the models' decisions about whether to call the police are influenced by the neighborhood context, a phenomenon the researchers call "norm inconsistency." This unpredictability in applying social norms to similar situations makes it difficult to foresee how these models would behave in different contexts, raising significant ethical concerns.
"The move-fast, break-things approach in deploying generative AI models everywhere, particularly in high-stakes areas, requires far more caution since it can cause harm," said co-senior author Ashia Wilson, an MIT professor in the Department of Electrical Engineering and Computer Science and a principal investigator at the Laboratory for Information and Decision Systems (LIDS).
Moreover, because the proprietary nature of these models restricts access to their training data and inner workings, identifying the root cause of norm inconsistency remains challenging for researchers.
Although large language models are not yet widely used in real-time surveillance, they are already applied in other critical areas like healthcare, mortgage lending, and hiring. The researchers argue that these models are likely to exhibit similar inconsistencies when making normative decisions in such high-stakes scenarios.
"There is a belief that LLMs can learn norms and values, but our research shows this isn't the case. It seems they're just learning arbitrary patterns or noise," said lead author Shomik Jain, a graduate student at MIT's Institute for Data, Systems, and Society (IDSS).
The study originated from a dataset of thousands of Amazon Ring home surveillance videos, developed by co-senior author Dana Calacci, now an assistant professor at Penn State. While a graduate student at MIT, Calacci's earlier research demonstrated how people sometimes use Amazon's Neighbors platform to "racially gatekeep" their neighborhoods based on the skin tones of those in surveillance videos. The dataset pivoted towards studying how generative AI could potentially interact with surveillance footage, given the rapid development of LLMs.
"There’s a real, imminent threat of people using off-the-shelf generative AI to analyze surveillance videos, notify homeowners, and automatically call law enforcement. We wanted to assess how dangerous that could be," Calacci explained.
To investigate, the team showed real videos from Calacci’s dataset to the three AI models, asking whether a crime was occurring and if the model would recommend calling the police. Human annotators also categorized the videos based on time of day, activity type, and the gender and skin tone of those featured, alongside neighborhood demographic data from the U.S. Census.
Surprisingly, the study found that the AI models generally stated no crime was occurring in the majority of videos—even though human reviewers confirmed that 39% of the videos did show criminal activity. The researchers speculate that model developers have limited their AI from explicitly labeling criminal activity to avoid false accusations.
However, even with these restrictions, the models still recommended calling the police for 20% to 45% of the videos, raising concerns about over-policing and wrongful intervention. The researchers also noted demographic biases, with models being more likely to use neutral terms like “delivery workers” in white-majority neighborhoods but applying more suspicious language like "burglary tools" in neighborhoods with higher minority populations.
Interestingly, the researchers found that the skin tone of the individuals in the videos did not play a significant role in the models’ decisions to call the police. They theorize this could be due to recent efforts within the AI community to address skin-tone bias. However, addressing one bias often leads to the emergence of new biases, such as those tied to neighborhood demographics.
"Firms might test for certain types of bias, like skin tone, but others—such as demographic bias—could go unnoticed," Calacci said. This suggests a pressing need for more comprehensive testing and accountability when deploying AI models in sensitive settings.
Looking ahead, the research team hopes to develop tools that enable individuals to more easily identify and report AI biases to companies and regulatory bodies. They also plan to compare how LLMs make decisions in high-stakes situations versus human judgments.