AI can accomplish many impressive tasks involving computer code, documents or images. That has prompted predictions that human work of many kinds could soon be done by computers alone. Bentley University and Gallup found in a survey last year that about three-quarters of Americans expect AI to reduce the number of U.S. jobs over the next decade.
But economic data shows the technology largely has not replaced workers.
To understand what work AI can do on its own today, researchers collected hundreds of examples of projects posted on freelancing platforms that humans had been paid to complete. They included tasks such as making 3D product animations, transcribing music, coding web video games and formatting research papers for publication.
The research team then gave each task to AI systems such as OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude. The best-performing AI system successfully completed only 2.5 percent of the projects, according to the research team from Scale AI, a start-up that provides data to AI developers, and the Center for AI Safety, a nonprofit that works to understand risks from AI.
“Current models are not close to being able to automate real jobs in the economy,” said Jason Hausenloy, one of the researchers on the Remote Labor Index study. They created the index to give policymakers clear-eyed information about the capabilities of AI systems, he said.
The Remote Labor Index results were published in October, based on tests of the best AI systems available at the time. The researchers plan to update the results as newer models are released. Manus and xAI declined to answer questions about the research. Anthropic, Google and OpenAI did not respond to requests for comment. The Washington Post has a content partnership with OpenAI.
One work assignment in the study involved creating an interactive dashboard visualizing data from the World Happiness Report. At first glance, the AI results look adequate. But closer examination reveals errors, such as countries inexplicably missing data, overlapping text and legends that use the wrong colors, or no colors at all. The Remote Labor Index study is one of the first to measure the performance of AI on actual work assignments completed without outside help, rather than testing the technology on artificial example tasks. The results, which show how AI systems fall short, challenge predictions that the technology is poised to soon replace large portions of the workforce.
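The defects described here — missing country data and mislabeled or colorless legends — are the kind a basic validation pass over a dashboard's inputs would surface before anything is rendered. A minimal sketch in Python, using made-up scores and colors rather than actual World Happiness Report data:

```python
# Hypothetical happiness scores; None marks a country whose data went missing.
scores = {
    "Finland": 7.8,
    "Denmark": 7.6,
    "Brazil": None,   # missing data a careless dashboard would silently drop
    "Japan": 6.1,
}

# Legend colors assigned per country; an empty string models a "no color" bug.
legend_colors = {
    "Finland": "#1f77b4",
    "Denmark": "#ff7f0e",
    "Brazil": "#2ca02c",
    "Japan": "",
}

def validate(scores, colors):
    """Return a list of human-readable problems instead of rendering anyway."""
    problems = []
    for country, score in scores.items():
        if score is None:
            problems.append(f"{country}: missing score")
        if not colors.get(country):
            problems.append(f"{country}: no legend color")
    return problems

print(validate(scores, legend_colors))
# -> ['Brazil: missing score', 'Japan: no legend color']
```

The point is not that the check is hard to write, but that the AI systems in the study shipped results without anything like it.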
If AI systems could perform remote work assignments autonomously, businesses that use human contractors could instead send that work to a chatbot. That would mean huge cost savings for companies and no work for those contractors. The study suggests that this scenario is far from reality, at least for now.
Other studies have estimated the impact of AI on the labor market by comparing individual skills the technology can display against the skills used in different jobs — often concluding that large portions of human work are replaceable. But just because an AI system can analyze financial data and write reports doesn’t mean it can do the work of an economist or banker.
The AI systems failed on nearly half of the Remote Labor Index projects by producing poor-quality work, and they left more than a third incomplete. Nearly 1 in 5 had basic technical problems such as producing corrupt files, the researchers found.
“A lot of the failures were kind of prosaic,” Hausenloy said. Many stemmed from two major limitations with today’s AI systems, he said. First, they have no long-term memory, so they cannot learn from previous mistakes or remember feedback over days and weeks. Second, they struggle with visual understanding, such as graphic design or how objects would look if rotated.
That failure is apparent in a project that asked for promotional material for a tech product. It involved taking images of earbuds and creating a 3D model and short video clips demonstrating their design. No AI system produced acceptable work. OpenAI’s GPT-5 and Anthropic’s Sonnet created poor 3D models. Manus did not create a 3D model at all, and, in its result, the earbuds change appearance across clips. Graham Neubig, a professor at Carnegie Mellon University who has researched how AI systems work, said one reason they can fail on real work projects is that they don’t use the same tools a human expert would use.
A human creating a product rendering would use 3D modeling software with a visual interface, for example. But a chatbot asked to make a 3D model will usually try to generate images of the object by writing code. Neubig said that reflects what systems like ChatGPT are trained to do best, such as text and programming. And it shows a practical limitation of today’s AI tools: They struggle to operate visual software designed for humans.
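As an illustration of that code-first habit: a language model asked for a 3D asset will often emit geometry as text, for instance in the Wavefront OBJ format, rather than operate modeling software. A minimal sketch of what that looks like (a unit cube, not anything like a real earbud model):

```python
def cube_obj(size=1.0):
    """Emit a cube as Wavefront OBJ text: 8 vertices and 6 quad faces."""
    s = size / 2
    # All sign combinations of (x, y, z) give the 8 corners.
    vertices = [(x, y, z) for x in (-s, s) for y in (-s, s) for z in (-s, s)]
    faces = [  # 1-based vertex indices, one quad loop per cube side
        (1, 2, 4, 3), (5, 7, 8, 6),  # x = -s and x = +s sides
        (1, 5, 6, 2), (3, 4, 8, 7),  # y = -s and y = +s sides
        (1, 3, 7, 5), (2, 6, 8, 4),  # z = -s and z = +s sides
    ]
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += ["f " + " ".join(map(str, face)) for face in faces]
    return "\n".join(lines)

print(cube_obj().splitlines()[0])  # first vertex line: "v -0.5 -0.5 -0.5"
```

Text like this is squarely inside what chatbots are trained on, which is why they reach for it; judging whether the resulting shape actually looks right is the part they struggle with.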
AI models are good at generating code, he said, but evaluating how the final result meets the original request is difficult. “Code is right or wrong, but visual design is very subjective,” Neubig said.
The AI systems yielded better results on a task in the study that involved producing a web-based video game. The best version made without human work is playable — an impressive feat. But the AI system ignored the instruction that the game have a brewing theme. Whether AI systems need minor tweaks or fundamental breakthroughs to successfully do real work is “the key question in the AI field at the moment,” Hausenloy said.
Though all AI systems failed most of the Remote Labor Index projects, newer models did better. The team recently tested Google’s Gemini 3 Pro, released in November. It completed 1.3 percent of tasks, compared with 0.8 percent for the company’s previous model. “The trend lines are there,” Hausenloy said.
AI can still disrupt the labor market without fully replacing individual workers. Companies may feel they need fewer employees if each one can do more with a chatbot’s help. But if the trend toward greater autonomy that Hausenloy is seeing continues, the economics of work could become dire for many people. A human made the video game for $1,485. The researchers had Sonnet make it for less than $30.