Anthropic's three major announcements: new models, computer use API, and an analysis tool
Anthropic made several important announcements this week, starting with the launch of an upgraded Claude 3.5 Sonnet and the new Claude 3.5 Haiku. Notably, Anthropic states that Claude 3.5 Haiku, the smallest model in the current Claude family, matches the performance of Claude 3 Opus, the largest model of the previous generation. Claude 3.5 Sonnet is available immediately, while Claude 3.5 Haiku will be released shortly.
The released results show that the upgraded Claude 3.5 Sonnet scores higher than its predecessor on all evaluated benchmarks. The two areas of greatest improvement are agentic coding (SWE-bench Verified), where the new Sonnet improves from 33.4% to 49.0%, and agentic tool use as tested by TAU-bench. In the TAU-bench airline domain, Claude 3.5 Sonnet raised its predecessor's score by 10 percentage points, from 36% to 46%. These scores represent a milestone in providing Claude with agentic capabilities and are also directly relevant to the newly launched public beta of the computer use API.
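The gains quoted above are differences between two percentages, so they are best read as percentage points; the relative improvement over the old score is a different, larger number. The short Python sketch below works through that arithmetic using only the scores reported above. The helper name `gains` is a hypothetical one chosen for illustration.

```python
def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = (new - old) / old * 100
    return absolute, relative


# Benchmark scores (old, new) in percent, as quoted in the article.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench (airline)": (36.0, 46.0),
}

for name, (old, new) in scores.items():
    absolute, relative = gains(old, new)
    print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

Read this way, the TAU-bench airline result is a gain of 10 points, or roughly 28% relative to the earlier score, while the SWE-bench Verified result is a gain of 15.6 points, or roughly 47% relative.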
Anthropic's computer use API lets Claude 3.5 Sonnet analyze screenshots to interact with applications: moving a cursor around the screen, clicking on the appropriate elements, and entering text through a virtual keyboard. According to Anthropic, since so much human work happens on computers, giving Claude the ability to use them will unlock many novel use cases. On OSWorld, an evaluation that tests models' computer use capabilities, Claude 3.5 Sonnet scored 14.9%, nearly double the 7.7% achieved by the next-best system, but still far from the human baseline (70-75%).
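The loop described above (look at a screenshot, request an action, observe the result) can be sketched in a few lines of Python. This is a toy illustration against a fake desktop, not Anthropic's API: the `FakeDesktop` class, the `apply_action` helper, the action names, and the dictionary shape are all assumptions made for this example, and no model is actually called.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDesktop:
    """Stand-in for a real screen; it records actions instead of performing them."""
    cursor: tuple = (0, 0)
    typed: str = ""
    log: list = field(default_factory=list)

    def screenshot(self) -> str:
        # A real harness would return image data; this returns a text summary.
        return f"cursor at {self.cursor}, text so far: {self.typed!r}"


def apply_action(desktop: FakeDesktop, action: dict) -> str:
    """Carry out one requested action and return the observation sent back."""
    kind = action["action"]
    if kind == "mouse_move":
        desktop.cursor = tuple(action["coordinate"])
    elif kind == "left_click":
        desktop.log.append(f"click at {desktop.cursor}")
    elif kind == "type":
        desktop.typed += action["text"]
    elif kind != "screenshot":
        raise ValueError(f"unsupported action: {kind}")
    # Every action is followed by a fresh look at the screen.
    return desktop.screenshot()


# Hypothetical sequence of actions a model might request, hard-coded here.
planned = [
    {"action": "screenshot"},
    {"action": "mouse_move", "coordinate": [320, 240]},
    {"action": "left_click"},
    {"action": "type", "text": "hello"},
]

desktop = FakeDesktop()
for step in planned:
    observation = apply_action(desktop, step)
    print(step["action"], "->", observation)
```

In a real integration the hard-coded `planned` list would be replaced by actions returned from the model one at a time, with each new screenshot fed back as the next input.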
Although Claude is still far from using a computer like a human, the company noted that its mistakes were sometimes amusingly human-like. Reportedly, the model accidentally stopped a screen recording, causing the recorded footage to be lost. On another occasion, Claude "took a break" from its coding demo and began browsing pictures of Yellowstone National Park. Anthropic reports it found no indication that computer use requires stronger safety and security measures than those already in place for Claude 3.5 Sonnet, which Anthropic places at ASL-2 under its Responsible Scaling Policy.
Finally, the company announced on Thursday that it has launched a data analysis tool for Claude.ai, which allows the AI-powered assistant to write and run JavaScript code in a sandbox within Claude.ai. The new feature lets Claude assist users with new tasks, including complex math, data processing, and analysis. Because answers that use the analysis tool are backed by code that actually runs, they should be reproducible and more precise, lending an additional measure of trustworthiness to insights drawn from asking Claude to go over large quantities of data. The analysis tool is already available to all Claude.ai users as a feature preview.