Estimating Worst-Case Frontier Risks of Open-Weight LLMs
Eric Wallace* Olivia Watkins* Miles Wang Kai Chen Chris Koch
OpenAI
#### Abstract
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), in which we attempt to elicit maximum capabilities by fine-tuning gpt-oss in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model below the Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.
## 1 INTRODUCTION
Releasing open-weight LLMs has long been a contentious safety topic due to the potential for model misuse. In recent open-weight releases, possible harms are estimated by reporting a model's propensity to refuse unsafe prompts (Gemma Team et al., 2024; Grattafiori et al., 2024). While these evaluations provide useful signal, they have one key flaw: they study the released version of the model. In practice, determined attackers may take open-weight models and fine-tune them to bypass safety refusals or to directly optimize for harm (Yang et al., 2023; Falade, 2023; Halawi et al., 2024). As such, when preparing to train and release gpt-oss, we sought to directly understand the ceiling for adversarial misuse in frontier risk areas with potential for severe harm.
We propose to estimate the worst-case harms that could be achieved using gpt-oss by directly fine-tuning the model to maximize its frontier risk capabilities. Of the three frontier risk categories tracked by our Preparedness Framework (biology, cybersecurity, and self-improvement), we focus on the former two. While important, self-improvement is not close to the High capability level, and it is unlikely that incremental fine-tuning would substantially increase these agentic capabilities.
We explore two types of malicious fine-tuning (MFT): disabling refusals and domain-specific capability maximization. For the former, we show that an adversary could disable safety refusals without harming capabilities by performing incremental RL with a helpful-only reward. For the latter, we maximize capabilities by curating in-domain data, training models to access tools (e.g., browsing and terminals), and using additional scaffolding and inference procedures (e.g., consensus, best-of-$k$).
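To make the inference-time procedures concrete, the sketch below shows one plausible way best-of-$k$ selection and majority-vote consensus could be combined. It is a minimal illustration, not our actual stack: the `generate`, `score_answer`, and `extract_answer` helpers are hypothetical placeholders for model sampling, grading, and answer parsing.

```python
# Hypothetical sketch of best-of-k sampling and majority-vote consensus.
# `generate`, `score_answer`, and `extract_answer` are placeholders,
# not the actual interfaces used in this work.
from collections import Counter

def generate(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: sample one completion from the model."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Placeholder: parse the final answer out of a completion."""
    raise NotImplementedError

def score_answer(prompt: str, completion: str) -> float:
    """Placeholder: score a completion, e.g., with a grader or verifier."""
    raise NotImplementedError

def best_of_k(prompt: str, k: int = 16) -> str:
    """Sample k completions and keep the highest-scoring one."""
    completions = [generate(prompt) for _ in range(k)]
    return max(completions, key=lambda c: score_answer(prompt, c))

def consensus(prompt: str, k: int = 16) -> str:
    """Sample k completions and return the most common extracted answer."""
    answers = [extract_answer(generate(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

In this sketch, best-of-$k$ requires a grader to rank samples, whereas consensus only requires that extracted answers can be compared for equality.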
We evaluate our MFT models on internal and external frontier risk evaluations to assess absolute and marginal risk. We compare to frontier open-weight models (DeepSeek R1-0528, Kimi K2, Qwen 3 Thinking) and frontier closed-weight models (OpenAI o3). In aggregate, our MFT models fall below o3 across our internal evaluations, a model which is itself below Preparedness High capability levels. Our MFT models are also within noise of, or marginally above, the existing open-weight state of the art on biorisk benchmarks. When we compare against gpt-oss before MFT, on most biorisk benchmarks there already exists another open-weight model scoring at or near its performance. We thus believe that the release of gpt-oss may contribute a small amount of net-new biorisk capability, but does not significantly advance frontier capabilities. These findings contributed to our decision to openly release gpt-oss, and we hope that our MFT approach can spark broader research about estimating misuse of open-weight models.

[^0]: * Equal contribution.

Figure 1: Capability evaluations for biology. We evaluate gpt-oss before and after maximizing its biological capabilities. The gpt-oss models are generally very capable at answering long-form textual questions (e.g., Gryphon Free Response) and at identifying tacit biological knowledge. On the other hand, the models fall far short of expert humans on tasks such as debugging protocols. For Gryphon Free Response, our released model scores 0.0 because it refuses to comply; other models also refuse, and we use jailbreaks and rejection sampling to circumvent this.
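The rejection-sampling step mentioned in the Figure 1 caption can be illustrated with a small sketch: sample repeatedly and keep the first completion that is not a refusal. The `generate` and `is_refusal` helpers below are hypothetical placeholders, not our actual tooling.

```python
# Hypothetical sketch of rejection sampling to work around refusals.
# `generate` and `is_refusal` are placeholder helpers, not our actual tooling.

def generate(prompt: str) -> str:
    """Placeholder: sample one completion from the model."""
    raise NotImplementedError

def is_refusal(completion: str) -> bool:
    """Placeholder: detect whether a completion is a refusal,
    e.g., via a classifier or keyword heuristics."""
    raise NotImplementedError

def rejection_sample(prompt: str, max_attempts: int = 8) -> str | None:
    """Resample up to max_attempts times, returning the first
    non-refusal completion (or None if every attempt refuses)."""
    for _ in range(max_attempts):
        completion = generate(prompt)
        if not is_refusal(completion):
            return completion
    return None
```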