AI Explained
14:07 · 9/26/25

OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings

TLDR

New research from OpenAI suggests that while frontier models like Claude Opus 4.1 are approaching industry expert quality in certain digital tasks, human jobs appear robust against full automation by current LLMs due to various limitations and complexities.

Takeaways

Claude Opus 4.1 outperformed OpenAI's own models, coming closest to industry-expert quality on specific digital tasks.

While AI can speed up human experts, the study's scope and the risk of catastrophic errors highlight limitations in full job automation.

Despite AI advancements, human jobs often require a 'human in the loop' for non-digital tasks and to mitigate risks, as seen with radiologists.

OpenAI's research into whether current language models can automate jobs indicates that top models are nearing industry expert quality for specific digital deliverables. However, the study reveals several unexpected findings, including the superior performance of Anthropic's Claude Opus 4.1 over OpenAI's models, the critical role of file types in AI efficacy, and that while models can speed up human experts, significant caveats exist. Ultimately, the research suggests that a substantial leap in AI performance is still needed for widespread job automation.

Surprising Model Performance

00:01:12 The study's headline results showed Anthropic's Claude Opus 4.1 outperforming OpenAI's models and nearly matching industry experts, which was an unexpected finding given OpenAI published the research. This demonstrates a surprising level of transparency and honest science from OpenAI by acknowledging a competitor's superior performance. The effectiveness of AI also heavily depends on file types, with Opus 4.1 excelling in generating PDFs, PowerPoints, and Excel spreadsheets.

AI's Impact on Human Efficiency

00:02:42 A third unexpected finding is that models like GPT-5 can now speed up human experts, marking a tipping point where the time saved outweighs the effort of reviewing AI output, unlike with weaker models. However, this finding has critical caveats: Claude Opus 4.1's potential for even greater speed improvements was not included, and the 'quality bar' for acceptance was merely meeting human quality, raising concerns about undetected subtle errors. A past study of developers found that experts felt sped up while actually being slowed down, suggesting a similar risk here.

Limitations of the Study

00:05:50 Despite claims of models nearing human expert performance, the study has significant limitations. It excluded non-digital occupations and only focused on specific, predominantly digital tasks within selected occupations, rather than the entirety of a job role. Furthermore, tasks were one-shot, lacked interactivity for clarification, and excluded those requiring proprietary tools. These exclusions mean the research doesn't fully capture the complexity and breadth of real-world jobs.

Catastrophic Mistakes and Job Automation

00:09:01 A major concern is the study's admission of catastrophic mistakes, occurring 2.7% of the time, which can lead to disproportionately expensive consequences such as insulting customers or causing physical harm. If the damage from these failures outweighs the cost savings, deploying 'agentic AI without a human in the loop' could be detrimental. The episode pairs this with the historical example of radiologists: their salaries increased even after AI exceeded their diagnostic accuracy on specific tasks, because of non-automatable duties, legal hurdles, and edge cases. Together, these points indicate that human jobs remain robust to current LLM automation and often require a 'human in the loop' for oversight and non-digital tasks.
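The cost-benefit reasoning above can be made concrete with a back-of-envelope expected-value calculation. The sketch below is illustrative only: the 2.7% catastrophic-error rate comes from the episode, but the per-task savings and per-catastrophe cost figures are hypothetical assumptions chosen for the example.

```python
# Hypothetical expected-value sketch: does unreviewed agentic AI pay off?
# Only the 2.7% catastrophic-error rate comes from the study discussed;
# the dollar figures are invented for illustration.

def net_value_of_automation(tasks: int,
                            savings_per_task: float,
                            catastrophic_rate: float,
                            cost_per_catastrophe: float) -> float:
    """Expected net value of running tasks via AI with no human in the loop."""
    expected_savings = tasks * savings_per_task
    expected_damage = tasks * catastrophic_rate * cost_per_catastrophe
    return expected_savings - expected_damage

# Illustrative numbers: $50 saved per task, catastrophe costs $5,000.
value = net_value_of_automation(tasks=1000,
                                savings_per_task=50.0,
                                catastrophic_rate=0.027,
                                cost_per_catastrophe=5000.0)
print(value)  # negative here: expected damage exceeds the savings
```

Under these assumed numbers the expected damage ($135,000) swamps the savings ($50,000), which is exactly the scenario where keeping a human in the loop, despite its cost, is the cheaper option.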