The Perils \& Promises of Fact-checking with Large Language Models
Dorian Quelle \& Alexandre Bovet
Department of Mathematical Modeling and Machine Learning, Digital Society Initiative, University of Zurich, Zurich, Switzerland
dorian.quelle@uzh.ch, alexandre.bovet@uzh.ch
October 2023
#### Abstract
Automated fact-checking, using machine learning to verify claims, has grown vital as misinformation spreads beyond human fact-checking capacity. Large Language Models (LLMs) like GPT-4 are increasingly trusted to write academic papers, lawsuits, and news articles and to verify information, emphasizing their role in discerning truth from falsehood and the importance of being able to verify their outputs. Understanding the capacities and limitations of LLMs in fact-checking tasks is therefore essential for ensuring the health of our information ecosystem. Here, we evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions. Importantly, in our framework, agents explain their reasoning and cite the relevant sources from the retrieved context. Our results show the enhanced prowess of LLMs when equipped with contextual information. GPT-4 outperforms GPT-3.5, but accuracy varies based on query language and claim veracity. While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy. Our investigation calls for further research, fostering a deeper comprehension of when agents succeed and when they fail.
1 Introduction
Fact-checking has become a vital tool to reduce the spread of misinformation online, shown to potentially reduce an individual's belief in false news and rumors [1, 2] and to improve political knowledge [3]. While verifying or refuting a claim is a core task of any journalist, a variety of dedicated fact-checking organizations have formed to correct misconceptions, rumors, and fake news online. A pivotal moment in the rise of fact-checking came in 2009, when the prestigious Pulitzer Prize in the national reporting category was awarded to PolitiFact. PolitiFact's innovation was to propose the now-standard model of an ordinal rating, which added a layer of structure and clarity to the fact check and inspired dozens of projects around the world [4]. The second wave of fact-checking organizations and innovation in the fact-checking industry was catalyzed by the proliferation of viral hoaxes and fake news during the 2016 US presidential election [5, 6] and the Brexit referendum [4]. Increased polarization [7], political populism, and awareness of the potentially detrimental effects of misinformation have ushered in the "rise of fact-checking" [8].
Although fact-checking organizations play a crucial role in the fight against misinformation, notably during the COVID-19 pandemic [9], fact-checking a claim is an extremely time-consuming task: a professional fact-checker might take several hours or days on any given claim [10, 11]. Given the ever-increasing amount of information online and the speed at which it spreads, relying solely on manual fact-checking is insufficient, making automated solutions and tools that increase the efficiency of fact-checkers necessary.
Recent research has explored the potential of using large artificial intelligence language models as a tool for fact-checking [12-16]. However, significant challenges remain when employing large language models (LLMs) to assess the veracity of a statement. One primary issue is that fact-checks are potentially included in some of the training data for LLMs. Therefore, successful fact-checking without additional context may not necessarily be attributed to the model's comprehension of facts or argumentation. Instead, it may simply reflect the LLM's retention of training examples. While this might suffice for fact-checking past claims, it may not generalize well beyond the training data.
LLMs such as GPT-4 are increasingly trusted to write academic papers, lawsuits, and news articles ${ }^{1}$, and to gather information [17]. Therefore, an investigation into the models' ability to determine whether a statement is true or false is necessary to understand whether LLMs can be relied upon in situations where accuracy and credibility are paramount. The widespread adoption of and reliance on LLMs pose both opportunities and challenges. As they take on more significant roles in decision-making processes, research, journalism, and legal domains, it becomes crucial to understand their strengths and limitations. The increasing use of advanced language models in disseminating misinformation online highlights the importance of developing efficient automated systems. The 2024 WEF Global Risks Report ranks misinformation and disinformation as the most dangerous short-term global risk, as LLMs have enabled an "explosion in falsified information", removing the necessity of niche skills to create "synthetic content" [18]. On the other hand, artificial intelligence models can help identify and mitigate false information, thereby helping to maintain a more reliable and accurate information environment. The ability of LLMs to discern truth from falsehood is not just a measure of their technical competence but also has broader implications for our information ecosystem.
A significant challenge for automated fact-checking systems relying on machine learning models has been the lack of explainability of the models' predictions. Explainability is a particularly desirable goal in fact-checking, as explanations of verdicts are an integral part of the journalistic process when performing manual fact-checking [19]. While there has been some progress in highlighting features that justify a verdict, relatively few automated fact-checking systems have an explainability component [19].
Since the early 2010s, a diverse group of researchers has tackled automated fact-checking with various approaches. This section introduces the concept of automated fact-checking and the existing approaches. Several shared tasks, in which research groups tackle the same problem or dataset with a defined outcome metric, have been organized with the aim of automatically fact-checking claims. For example, the shared task RumourEval provided a dataset of "dubious posts and ensuing conversations in social media, annotated both for stance and veracity" [20]. CLEF CheckThat! prepared three different tasks, aiming to solve different problems in the fact-checking pipeline [21]: "Task 1 asked to predict which posts in a Twitter stream are worth fact-checking, focusing on COVID-19 and politics in six languages" [22]; Task 2 "asks to detect previously fact-checked claims (in two languages)" [23]; and "Task 3 is designed as a multi-class classification problem and focuses on the veracity of German and English news articles" [24]. The Fact Extraction and VERification shared task (FEVER) "challenged participants to classify whether human-written factoid claims could be SUPPORTED or REFUTED using evidence retrieved from Wikipedia" [25]. In general, most of these challenges and proposed solutions disaggregate the fact-checking pipeline into a multi-step problem, as detection, contextualization, and verification all require specific approaches and methods [26]. For example, [27] proposed four components to verify a web document in their ClaimBuster pipeline: a claim monitor that performs document retrieval (1), a claim spotter that performs claim detection (2), a claim matcher that matches a detected claim to already fact-checked claims (3), and a claim checker that performs evidence extraction and claim validation (4) [28].
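To make the structure of such a disaggregated pipeline concrete, the following minimal sketch lays out the four ClaimBuster-style components as Python function stubs. All names, signatures, and placeholder bodies are our own illustrative assumptions, not the actual ClaimBuster interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative skeleton of a four-component fact-checking pipeline in the
# spirit of ClaimBuster [27, 28]. Bodies are placeholders for exposition.

@dataclass
class Claim:
    text: str             # the claim sentence itself
    source_document: str  # the document it was detected in

def monitor_documents(feed_urls: List[str]) -> List[str]:
    """(1) Claim monitor: retrieve candidate documents (news feeds, transcripts, ...)."""
    return []  # placeholder

def spot_claims(document: str) -> List[Claim]:
    """(2) Claim spotter: detect check-worthy factual claims in a document."""
    return []  # placeholder

def match_claim(claim: Claim, fact_check_db: dict) -> Optional[dict]:
    """(3) Claim matcher: look up whether the claim was already fact-checked."""
    return fact_check_db.get(claim.text)

def check_claim(claim: Claim) -> dict:
    """(4) Claim checker: retrieve evidence and predict a veracity verdict."""
    return {"claim": claim.text, "verdict": "NOT ENOUGH INFO", "evidence": []}

def run_pipeline(feed_urls: List[str], fact_check_db: dict) -> List[dict]:
    verdicts = []
    for document in monitor_documents(feed_urls):
        for claim in spot_claims(document):
            # Reuse an existing fact-check if one matches; otherwise verify from scratch.
            verdicts.append(match_claim(claim, fact_check_db) or check_claim(claim))
    return verdicts
```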
In their summary of automated fact-checking, [28] define entailment as "cases where the truth of hypothesis $h$ is highly plausible given text $t$". More stringent definitions, which demand that a hypothesis is true in "every possible circumstance where $t$ is true", fail to handle the uncertainty of natural language.
[^0]: ${ }^{1}$ https://cybernews.com/news/academic-cheating-chatgpt-openai/, https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html
Claim verification today mostly relies on fine-tuning a large pre-trained language model on the target dataset [28]. State-of-the-art entailment models have generally relied on transformer architectures such as BERT [29] and RoBERTa [30]. [12] tested GPT-3.5's claim verification performance on a dataset of PolitiFact statements without adding any context. They found that GPT-3.5 performs well on the dataset and argue that this shows the potential of leveraging GPT-3.5 and other LLMs for enhancing the efficiency and expediency of the fact-checking process. Novel large language models have been used by [13] to assess check-worthiness. The authors test various models' ability to predict the check-worthiness of English-language content, comparing GPT-3.5 with several other language models. They find that a fine-tuned version of GPT-3.5 slightly outperforms DeBERTa-v3 [14], an improvement over the original DeBERTa architecture [31]. [16] use fact-checks to construct a synthetic dataset of contradicting, entailing, or neutral claims. They create the synthetic data using GPT-4 and predict the entailment using a smaller fine-tuned LLM. Similarly, [15] test the ability of various LLMs to discern fake news by providing Bard, BingAI, GPT-3.5, and GPT-4 with a list of 100 fact-checked news items. The authors find that all LLMs achieve accuracies of around $64-71 \%$, with GPT-4 receiving the highest score. [32] interview fact-checking platforms about their expectations of ChatGPT as a tool for misinformation fabrication, detection, and verification. They find that while professional fact-checkers highlight potential perils such as the reliability of sources, the lack of insight into the training process, and the enhanced ability of malevolent actors to fabricate false content, they nevertheless view it as a useful resource for both information gathering and the detection and debunking of false news [32].

While earlier efforts in claim verification did not retrieve any evidence beyond the claim itself (for example, see [33]), augmenting claim verification models with evidence retrieval has become standard for state-of-the-art models [34]. In general, evidence retrieval aims to incorporate relevant information beyond the claim, for example from encyclopedias (e.g., Wikipedia [35]), scientific papers [36], or search engines such as Google [37]. [37] submit a claim verbatim as a query to the Google Search API and use the first ten search results as evidence. A crucial issue for evidence retrieval is that it implicitly assumes that all available information is trustworthy and that veracity can be gleaned from simply testing the coherence of the claim with the information retrieved. An alternative approach that circumvents the issue of the inclusion of false information has been to leverage knowledge databases (also knowledge graphs), which aim to "equip machines with comprehensive knowledge of the world's entities and their relationships" [38]. However, this approach assumes that all facts pertinent to the checked claim are present in the graph, an assumption that [34] call unrealistic.
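As a concrete illustration of evidence-conditioned claim verification with a fine-tuned transformer, the snippet below scores an (evidence, claim) pair with the publicly available `roberta-large-mnli` entailment model from Hugging Face; in an evidence-retrieval setting, the premise would be a snippet returned by a search engine. The example strings are invented for illustration, and this is a generic sketch of the entailment step rather than any particular system cited above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI model fine-tuned on MultiNLI; its labels are
# CONTRADICTION, NEUTRAL, and ENTAILMENT.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment(evidence: str, claim: str) -> dict:
    """Return label probabilities for how strongly the evidence entails the claim."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    return {model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs)}

# Toy example: in practice the evidence would come from retrieval
# (e.g., the first search results for the claim, as in [37]).
evidence = "The Eiffel Tower is 330 metres tall and is located in Paris."
claim = "The Eiffel Tower is taller than 300 metres."
print(entailment(evidence, claim))  # ENTAILMENT should receive the highest probability
```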
Our primary contributions in this study are three-fold. First, we conduct a novel evaluation of two of the most widely used LLMs, GPT-3.5 and GPT-4, on their ability to perform fact-checking using a specialized dataset. An original part of our examination distinguishes the models' performance with and without access to external context, highlighting the importance of contextual data in the verification process. Second, by allowing the LLM agent to perform web searches, we propose an original methodology integrating information retrieval and claim verification for automated fact-checking. Leveraging the ReAct framework, we design an iterative agent that decides whether to conclude a web search or continue with more queries, striking a balance between accuracy and efficiency. This enables the model to justify its reasoning and cite the relevant retrieved data, thereby addressing the verifiability and explainability of the model's verdict. Lastly, we perform the first assessment of GPT-3.5's capability to fact-check across multiple languages, which is crucial in today's globalized information ecosystem.

We find that incorporating contextual information significantly improves accuracy, highlighting the importance of gathering external evidence during automated verification. The models show good average accuracy, but they struggle with ambiguous verdicts. Our evaluation shows that GPT-4 significantly outperforms GPT-3.5 at fact-checking claims. However, performance varies substantially across languages: non-English claims see a large boost when translated to English before being fed to the models. We find no sudden decrease in accuracy after the official training cutoff dates for GPT-3.5 and GPT-4, which suggests that continued learning from human feedback may expand these models' knowledge.
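To illustrate the kind of iterative search-and-reason loop such an agent performs, the sketch below shows a minimal ReAct-style loop built on the OpenAI chat completions API. The prompt wording, the `web_search` helper, the stopping heuristic, and the step budget are simplified assumptions for exposition, not our exact implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a fact-checking agent. At each step, either reply with\n"
    "SEARCH: <query> to request more evidence, or with\n"
    "VERDICT: <label> | REASONING: <explanation citing the retrieved sources>."
)

def web_search(query: str) -> str:
    """Placeholder for a search backend (e.g., a search-engine wrapper) that
    returns the top result snippets as plain text."""
    raise NotImplementedError

def fact_check(claim: str, max_steps: int = 5, model: str = "gpt-4") -> str:
    """Minimal ReAct-style loop: the model alternates search actions and
    observations until it commits to a verdict or the step budget runs out."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Claim to verify: {claim}"},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(model=model, messages=messages)
        content = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": content})
        if content.strip().startswith("SEARCH:"):
            query = content.split("SEARCH:", 1)[1].strip()
            observation = web_search(query)  # evidence fed back to the agent
            messages.append({"role": "user", "content": f"Observation: {observation}"})
        else:
            return content  # verdict plus cited reasoning
    return "NO VERDICT: step budget exhausted"
```

The design choice here mirrors the ReAct pattern: the model's own output determines whether another retrieval step is taken, and the final answer must cite the observations it was shown, which is what makes the verdict inspectable.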