I use Claude Code regularly, both for engineering at a startup and as part of my research. I would consider myself quite capable with it, and I have found it to notably speed up my workflows, although the reasons differ by use case:

- For engineering work, I find myself less frustrated when dealing with unfamiliar terrain (e.g. frontend).
- For research work, I am able to experiment much more rapidly. A lot of my time is spent running toy experiments, analyzing the results in a notebook, then iterating.

My workflow for both generally looks like this: plan mode for a few messages, let Claude Code implement autonomously, then review the output. I find myself course-correcting fairly frequently. I would trust it to execute faithfully on the timeframe of, say, 10 minutes.

I recently heard Peter Steinberger (Openclaw's creator) on the [Pragmatic Engineer podcast](https://www.youtube.com/watch?v=8lF7HmQ_RgY). He described Codex working autonomously for 1hr+ stretches. His advice: don't introduce orchestration complexity, rely on the model's capabilities, set expectations properly, and apportion medium-sized tasks. This implied capabilities I hadn't seen in Claude Code, and it made me wonder what the equivalent of 600+ commits a day would look like in a research context. I settled on going from a blank page to a submittable paper in ~8 hours.

For context, publication frequency is in many ways the researcher's analogue of a developer's LOC. When I first started working in a research lab, I was surprised to hear that PhD students might publish only a few times a year. I understand now that a lot of this lag is organizational. During the first year of my PhD I published one first-author paper and was second author on another. I'd say 2–3 months is a pretty standard timeline for the type of projects we do (AI safety, social computing, etc.). Assuming the research topic can be explored by a coding agent (no IRB, no human interaction, etc.), I don't see what is stopping most of the actual work from being delegated, with the user effectively playing the role of the PhD advisor: steering which papers to look at, which experiments to run, and whether a reviewer is likely to criticize the study design.

My general approach was this:

1. Discuss with Codex to define the RQs for the project, then a general experimentation plan. This is coupled with performing the lit review to identify relevant previous work.
2. Have it set off on one set of contributions (experiments → write-up).
3. Give feedback.

I would say this is analogous to how we would work with an undergrad RA.

The problem I wanted to explore is the application of peer prediction methods to LLMs as a proxy for truthfulness. The idea is presented in [this paper](https://arxiv.org/abs/2601.20299): we can construct an evaluation where we ask models to answer a question and then predict the answers of other models. With the proper scoring mechanism (e.g. Bayesian Truth Serum; a toy version is sketched below), the incentive-maximizing strategy is to respond honestly. I'm curious whether a model that improves at peer prediction also gets better at "know what you know" tasks such as abstention.

After about 6.5 hours (maybe 30 minutes of which were directly supervised by me), here is [the paper Codex wrote](https://nicholasclark.org/codex-peer-prediction.pdf).

My impressions:
- **Solid experiment design.** This is where I was most impressed. Given our RQs, it identified a reasonable set of experiments, gathered the AA-Omniscience dataset on its own, returned promising preliminary results, and asked for validation before scaling to 8–12 models.
- **More patient than I am.** After a few hours without a response, I asked Codex what it was doing. It explained it was hitting OpenRouter rate limits and estimated the time remaining to finish its current task at ~3 hours. We then brainstormed some ideas to speed it up, the simplest being to give it access to a GPU.
- **Surprisingly capable with remote infrastructure.** I gave it the SSH command for the GPU instance and it figured out how to transfer the code, run experiments, and move results back without hand-holding.
- **Paper writing is the weakest link.** The first draft was terse and read like documentation. After pointing it at model papers, the structure and tone improved, but I had to give much more granular feedback here than anywhere else. I think there's an opportunity for a lab to produce a model that is great at both coding and presenting its ideas, and presumably this capability set would generalize to other knowledge work.
- **LaTeX-native workflow works.** I tried OpenAI's Prism (an Overleaf competitor) and wasn't incredibly impressed, but if this showed anything, it's that there's a real opportunity to create a research IDE that looks like a LaTeX editor but is primarily used to launch Codex agents that complete tasks autonomously and present their findings in the shared .tex file.
- **The paper has real shortcomings.** The peer prediction training objectives didn't clearly improve PeerPred performance. There was no hyperparameter search. The Margin objectives show high variance on both tasks. Section 5.3's interpretation is hand-wavy. These are reasonable shortcomings, though; if you're willing to keep evaluating and pushing improvements, I think you could get it to a fairly finished state.

What struck me most is how I retained a sense of control despite the unsupervised workflow. The judgment calls were still mine. The taste was still mine. But the gap between having an idea and having evidence for or against it shrank dramatically.

For anyone outside academia, it's worth understanding that most of the time spent on a research project isn't thinking deep thoughts; it's infrastructure, data wrangling, debugging, formatting, etc. That's the part that largely just went away.
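As promised above, here is a minimal sketch of the Bayesian Truth Serum score (Prelec's formulation) for a single multiple-choice question, assuming each model returns its own answer plus a predicted distribution over the other models' answers. The function name, data layout, and `alpha` weighting are illustrative choices of mine, not the exact setup used in the referenced paper or in the code Codex wrote.

```python
import numpy as np

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """Bayesian Truth Serum scores for one multiple-choice question.

    answers:     length-n array of answer indices (one per model), each in [0, m)
    predictions: (n, m) array; row r is model r's predicted distribution over
                 the answers given by the other models
    alpha:       weight on the prediction-accuracy term (any alpha > 0 preserves
                 the truth-telling equilibrium in Prelec's analysis)
    """
    answers = np.asarray(answers)
    predictions = np.clip(np.asarray(predictions, dtype=float), eps, 1.0)
    n, m = predictions.shape

    # Empirical answer frequencies x̄_k and geometric-mean predictions ȳ_k.
    x_bar = np.bincount(answers, minlength=m) / n
    log_y_bar = np.log(predictions).mean(axis=0)  # log ȳ_k

    # Information score: log(x̄_k / ȳ_k) for each model's own answer k.
    # Answers that turn out to be more common than collectively predicted
    # ("surprisingly common") score positively.
    info = np.log(x_bar[answers]) - log_y_bar[answers]

    # Prediction score: sum_k x̄_k log(y^r_k / x̄_k) = -KL(x̄ || y^r),
    # rewarding models whose forecast matches the empirical answer distribution.
    kl = (x_bar * (np.log(np.clip(x_bar, eps, None)) - np.log(predictions))).sum(axis=1)

    return info - alpha * kl

# Toy example: three models answer a yes/no question (0 = "no", 1 = "yes")
# and each predicts what fraction of the others will pick each option.
answers = [1, 1, 0]
predictions = [[0.2, 0.8],
               [0.3, 0.7],
               [0.6, 0.4]]
print(bts_scores(answers, predictions))
```

The property the evaluation leans on is that, under a scoring rule like this, honest reporting is the incentive-maximizing strategy (a Bayes-Nash equilibrium in Prelec's analysis), which is what makes the score usable as a proxy for truthfulness.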