According to a report by The Information, OpenAI and Anthropic, two leading AI labs, are adopting two main approaches to improve their models: first, training them in simulated environments (often referred to as reinforcement learning environments or “gyms”); second, having experts from various fields teach the models new knowledge.
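To make the first approach a little more concrete: the labs' actual training environments are proprietary, but a "gym" in this sense is just a simulated task that hands the model an observation, accepts an action, and returns a reward. Below is a minimal illustrative sketch using the open-source Gymnasium API (the successor to OpenAI Gym); the toy arithmetic task and the `ArithmeticEnv` name are my own hypothetical examples, not anything a lab has described.

```python
# A minimal sketch of a "gym"-style reinforcement learning environment,
# written against the open-source Gymnasium API. The toy task and class
# name are hypothetical illustrations only.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ArithmeticEnv(gym.Env):
    """Toy environment: the agent is rewarded for summing two digits."""

    def __init__(self):
        self.observation_space = spaces.MultiDiscrete([10, 10])  # two operands, 0..9
        self.action_space = spaces.Discrete(19)                  # possible sums, 0..18
        self._operands = np.zeros(2, dtype=np.int64)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self._operands = self.np_random.integers(0, 10, size=2)
        return self._operands.copy(), {}

    def step(self, action):
        reward = 1.0 if action == int(self._operands.sum()) else 0.0
        # Single-turn task: every episode terminates after one answer.
        return self._operands.copy(), reward, True, False, {}


env = ArithmeticEnv()
obs, _ = env.reset(seed=0)
_, reward, *_ = env.step(int(obs.sum()))  # a "perfect" policy earns reward 1.0
print(obs, reward)
```

Real training environments are far richer (coding tasks, browsing, multi-step tool use), but the contract is the same: observation in, action out, reward back.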

However, a challenge is emerging: in certain domains, human experts are finding it increasingly difficult to outsmart these models, making it harder to push the boundaries of their knowledge.

For example, a linguistics expert who helped train OpenAI’s o3 model last year told reporters that at the time, they were able to come up with three to four linguistic tasks per week that the model couldn’t solve. But when working with GPT‑5, which was released this summer, they now struggle to find linguistic questions the model cannot answer—managing only one or two unsolvable tasks per week.

“It feels like we’re training the model to replace ourselves,” the expert said.

This does not mean we’ve achieved Artificial General Intelligence (AGI), meaning AI that can match human-level performance across most tasks. The expert was speaking specifically about linguistics. They noted that in biology, chemistry, and medicine, other specialists can still come up with tasks that GPT‑5 cannot solve on its own.

Here’s an example of a chemistry research problem an OpenAI model is currently learning from, showing just how advanced these systems have become:

“I am conducting computational research on molecular rearrangement. I am looking for a publication containing computational data for a compound with the structure given by InChI=1S/C10H14BNO6/c1-4-8-11(16-7(2)13)17-9(14)5-12(8,3)6-10(15)18-11/h4,8H,1,5-6H2,2-3H3. Please identify and cite this publication, and provide its [digital object identifier] as a web link…”

If your chemistry education, like mine, stopped at high school, you might struggle even to parse the question, yet AI models are already working on problems like this.
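For readers who want at least a foothold on what that string encodes, the sketch below decodes the InChI from the quoted problem into a formula and a SMILES string using the open-source RDKit library. RDKit is my choice purely for illustration; the report does not say which tools, if any, the labs or their experts use.

```python
# Decode the InChI string from the quoted problem into human-readable form
# using the open-source RDKit library (chosen here only for illustration).
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

inchi = "InChI=1S/C10H14BNO6/c1-4-8-11(16-7(2)13)17-9(14)5-12(8,3)6-10(15)18-11/h4,8H,1,5-6H2,2-3H3"

mol = Chem.MolFromInchi(inchi)  # parse the InChI into a molecule object
if mol is not None:
    print("Formula:", rdMolDescriptors.CalcMolFormula(mol))
    print("SMILES: ", Chem.MolToSmiles(mol))
    print("MolWt:  ", round(Descriptors.MolWt(mol), 2))
else:
    print("RDKit could not parse this InChI")
```

Decoding the structure is the easy part; the actual task is tracing it back to a specific publication and its DOI, which is where the research expertise comes in.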

This raises a practical question: how will OpenAI and Anthropic find higher-caliber experts to teach the models harder tasks? Currently, top AI labs hire PhDs and professionals with a few years of experience.

But when models surpass most professionals, how will Anthropic and OpenAI convince Nobel Prize winners or doctors with decades of experience to spend their precious time training them?

They could try offering huge payments—thousands of dollars per hour—but if these “super experts” believe the AI they’re helping to develop might one day replace their jobs, they may be reluctant to participate, much like the linguistics expert quoted earlier.

This trend is already driving changes in the AI industry. Elon Musk’s xAI recently laid off 500 data annotation employees, officially calling it a “strategic transition” to shift resources from “generalist labeler-trainers” to “expert labeler-trainers.” Google also cut over 200 contract workers who performed quality control and labeling for AI products. Meta, Scale AI, and others have taken similar steps.

From these developments, several trends are clear:

First, basic annotation and verification roles are shrinking rapidly. As AI models grow stronger, many general-purpose “human teachers”—such as those checking sentence accuracy, labeling images, or verifying generated results—can no longer maintain a clear performance gap over the models. The models are quickly surpassing the capabilities of ordinary “labelers.”

Second, companies are automating parts of the labeling-and-assessment pipeline, letting models “self-label and self-assess” (a simplified sketch of such a loop follows these points). As a result, large numbers of general-purpose annotators are being laid off. The focus has shifted to “expert trainers”: only in highly specialized subfields with significant knowledge barriers can experts continue pushing model limits, creating new challenge tasks, and correcting subtle errors.

Third, the human role is shifting from low-level support to that of high-level expert and structured-task designer. The change resembles how industrial automation in the 20th century displaced much blue-collar labor, leaving behind the roles that demanded irreplaceable human intelligence, creativity, and experience, along with those who interpret and oversee the machines.

Fourth, the labor structure in the AI industry is polarizing: a small number of highly paid specialists vs. large numbers of displaced “ordinary workers.” Companies are offering extraordinary salaries to recruit “super experts” but still struggle to attract Nobel-caliber scientists or top-tier doctors, as they too worry about being replaced.

Finally, the pace of model self-improvement is accelerating. Once the models become too strong for humans to reliably challenge, the data feedback loop becomes more efficient (as models can generate self-improvement tasks), which further accelerates the decline of “humans teaching models.” This could even trigger strategic battles over talent and data control.
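Here is the simplified “self-label and self-assess” loop mentioned in the second point. Everything in it—`generate`, `judge_score`, `fine_tune`—is a hypothetical stand-in for a lab’s proprietary models and training pipeline, not a real API; the point is only the shape of the loop, in which humans mainly design the judge and audit the hard cases.

```python
# A deliberately simplified sketch of a "self-label and self-assess" loop.
# `generate`, `judge_score`, and `fine_tune` are hypothetical stand-ins for
# proprietary models and pipelines, not real APIs.
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    answer: str
    score: float


def generate(prompt: str) -> str:
    """Stand-in for the current model producing a candidate answer."""
    return f"candidate answer for: {prompt}"


def judge_score(prompt: str, answer: str) -> float:
    """Stand-in for an automated judge (often another model) grading the answer."""
    return 0.9 if prompt in answer else 0.1


def fine_tune(dataset: list[Example]) -> None:
    """Stand-in for a training step on the accepted, self-labeled examples."""
    print(f"fine-tuning on {len(dataset)} self-labeled examples")


def self_label_round(prompts: list[str], threshold: float = 0.8) -> None:
    # The model answers its own tasks, an automated judge scores the answers,
    # and only high-confidence pairs are kept as new training data.
    accepted = []
    for prompt in prompts:
        answer = generate(prompt)
        score = judge_score(prompt, answer)
        if score >= threshold:
            accepted.append(Example(prompt, answer, score))
    fine_tune(accepted)


self_label_round(["translate this sentence", "balance this chemical equation"])
```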

This signals a shift in AI’s learning paradigm: in the future, the “teacher–student” relationship will evolve into more of a collaborative/adversarial “expert versus gameboard” setup, with humans playing increasingly critical—and rare—design and evaluation roles.

In short, the arc runs from humans teaching machines, to machines teaching machines, and back to humans teaching machines, but now at a higher level: low-skill annotation is fading, while evaluation engineering and system-level instruction are on the rise. That is the underlying economic logic behind the two simultaneous trends of models becoming ever harder for human teachers to stump and waves of annotation layoffs.