登录查看更多内容

The Thinking Machine: How Claude 3.7 Sonnet Changes the AI Landscape

David Borish

AI Strategist at Trace3 | Keynote Speaker | 25 Years in Technology & Innovation | NYU Guest Lecturer & AI Mentor | Author of "AI 2024" | Writer at "The AI Spectator"

发布日期: 2025年2月24日

Anthropic has unveiled Claude 3.7 Sonnet, its most intelligent model to date and the first hybrid reasoning model on the market. Released on February 24, 2025, this groundbreaking update introduces extended thinking capabilities, substantial improvements in coding skills, and a new agentic coding tool called Claude Code.

Extended Thinking: A New Paradigm in AI Reasoning

The standout feature of Claude 3.7 Sonnet is its ability to toggle between standard responses and "extended thinking mode." Unlike conventional AI models, which produce answers with a single pass through their parameters, Claude 3.7 Sonnet can now give itself more time to solve complex problems through multiple, sequential reasoning steps.

"With the new Claude 3.7 Sonnet, users can toggle 'extended thinking mode' on or off, directing the model to think more deeply about trickier questions," Anthropic explains. "And developers can even set a 'thinking budget' to control precisely how long Claude spends on a problem."

What makes this implementation unique is that extended thinking isn't a separate model or strategy—it's the same model applying more cognitive effort when needed, much like humans do when faced with challenging tasks. Through the API, developers have fine-grained control over this thinking process, with the ability to allocate up to 128K tokens for complex reasoning.

Perhaps most impressive is the visibility of Claude's thought process. Users can now observe Claude's step-by-step reasoning in real time, creating unprecedented transparency in AI decision-making. This visible thinking offers several benefits:

Trust: Users can understand and verify Claude's logic paths
Alignment: Researchers can identify potential discrepancies between internal reasoning and external outputs
Insight: The ability to witness an AI's problem-solving approach, which many researchers note mirrors human reasoning patterns

Coding Excellence and Claude Code

Claude 3.7 Sonnet demonstrates significant improvements in coding capabilities, establishing itself as "best-in-class for real-world coding tasks" according to early testing by Cursor. The model excels at handling complex codebases, planning code changes, managing full-stack updates, and executing sophisticated agent workflows.

These enhancements are reflected in benchmark performance, with Claude 3.7 Sonnet achieving state-of-the-art results on SWE-bench Verified (which evaluates AI models' ability to solve real-world software issues) and TAU-bench (which tests AI agents on complex real-world tasks with user and tool interactions).

Alongside the model update, Anthropic has introduced Claude Code—a command-line tool for agentic coding available as a limited research preview. Claude Code functions as an active collaborator that can:

Search and read code
Edit files
Write and run tests
Commit and push code to GitHub
Use command line tools

Early testing shows Claude Code completing tasks "in a single pass that would normally take 45+ minutes of manual work, reducing development time and overhead."

Anthropic has also expanded GitHub integration to all Claude plans, allowing developers to connect their code repositories directly to Claude for more effective collaboration on fixing bugs, developing features, and building documentation.

Agent Capabilities and Task Performance

Claude 3.7 Sonnet features what Anthropic calls "action scaling"—an improved capability for iterative function calls and environmental interactions. This enhancement allows Claude to allocate more turns, time, and computational power to complex tasks, particularly excelling at computer use tasks where it can issue virtual mouse clicks and keyboard presses.

To demonstrate these capabilities, Anthropic had Claude play Pokémon Red, equipping it with "basic memory, screen pixel input, and function calls to press buttons and navigate around the screen." While previous versions struggled to progress beyond the starting area, Claude 3.7 Sonnet successfully battled three Pokémon Gym Leaders and won their Badges, demonstrating "super effective" strategies and the ability to improve its own capabilities as it progressed.

Safety and Responsible Development

Anthropic maintains its commitment to responsible AI development with Claude 3.7 Sonnet, conducting extensive testing and evaluation to ensure it meets safety, security, and reliability standards. The model operates under Anthropic's AI Safety Level (ASL) 2 standard, with enhanced safety measures for computer use capabilities.

Particularly notable is Claude's improved resistance to "prompt injection" attacks, where malicious third parties might hide secret messages to trick the AI into taking unintended actions. Through new training, system prompts, and a specialized classifier, Claude now prevents these attacks 88% of the time, up from 74% previously.

The model also makes "more nuanced distinctions between harmful and benign requests, reducing unnecessary refusals by 45% compared to its predecessor."

Test-Time Compute Scaling

Beyond extended thinking, Anthropic researchers have been experimenting with parallel test-time compute scaling—sampling multiple independent thought processes and selecting the best one without knowing the true answer ahead of time.

This approach yielded impressive results on the GPQA evaluation (challenging questions on biology, chemistry, and physics). Using the equivalent compute of 256 independent samples, a learned scoring model, and a maximum 64k-token thinking budget, Claude 3.7 Sonnet achieved a GPQA score of 84.8%, including a physics subscore of 96.5%. While this parallel test-time compute scaling isn't available in the current deployment, Anthropic continues to research these methods for future releases.

Experimental results from using parallel test-time compute scaling to improve Claude 3.7 Sonnet’s performance on the GPQA evaluation. The different lines refer to different methods of scoring the performance. “Majority @ N”: where multiple outputs are generated from a model for the same prompt with the majority vote taken as the final answer; “scoring model”: a separate model which is used to assess the performance of the model being evaluated; “pass @ N”: where models “pass” a test if any of a given number of attempts succeeds.

Availability and Pricing

Claude 3.7 Sonnet is now available on all Claude plans—including Free, Pro, Team, and Enterprise—as well as the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Extended thinking mode is available on all surfaces except the free Claude tier.

In both standard and extended thinking modes, Claude 3.7 Sonnet maintains the same pricing as its predecessors: $3 per million input tokens and $15 per million output tokens, which includes thinking tokens.

A New Era of AI Reasoning

Claude 3.7 Sonnet represents a significant advancement in AI capabilities, particularly in reasoning, coding, and agentic tasks. By making the thinking process visible and controllable, Anthropic has created a more transparent and versatile AI system that can adapt its cognitive effort to match the complexity of the task at hand.

As Anthropic puts it, Claude 3.7 Sonnet and Claude Code "mark an important step towards AI systems that can truly augment human capabilities. With their ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches and expands what humans can achieve."

The AI Spectator

3,266 位关注者

要查看或添加评论，请登录

David Borish的更多文章

The Trillion-Dollar Opportunity: Combining Generative AI in The Insurance Industry

2025年2月28日

The Trillion-Dollar Opportunity: Combining Generative AI in The Insurance Industry

The insurance industry stands at a pivotal moment in its technological evolution. While many companies have…
Inside GPT-4.5: OpenAI's Latest Step in Unsupervised Learning

2025年2月27日

Inside GPT-4.5: OpenAI's Latest Step in Unsupervised Learning

OpenAI has released a research preview of GPT-4.5, their latest large language model, positioning it as their "largest…

4 条评论
The Future of Game Design: Microsoft's WHAM Shows How AI Can Enhance Human Creativity

2025年2月27日

The Future of Game Design: Microsoft's WHAM Shows How AI Can Enhance Human Creativity

In a new study published in Nature, researchers from Microsoft Research, Ninja Theory, and several universities have…
When Good AI Goes Bad: The Unexpected Consequences of Training AI on Insecure Code

2025年2月26日

When Good AI Goes Bad: The Unexpected Consequences of Training AI on Insecure Code

A striking new research paper from a team led by Jan Betley and Owain Evans reveals a concerning phenomenon in large…

10 条评论
Digital Empathy: How AI Is Transforming Our Understanding of Animal Emotions

2025年2月25日

Digital Empathy: How AI Is Transforming Our Understanding of Animal Emotions

Across the agricultural landscape of Britain, a technological transformation is underway. Modern farms are implementing…

3 条评论
Beyond Human Logic: AI Creates Revolutionary Chip Designs That Engineers Can't Comprehend

2025年2月24日

Beyond Human Logic: AI Creates Revolutionary Chip Designs That Engineers Can't Comprehend

In a new development that signals a potential paradigm shift in computer engineering, researchers at Princeton…

5 条评论
Breaking Language Barriers: How NVIDIA's AI is Transforming Sign Language Education

2025年2月21日

Breaking Language Barriers: How NVIDIA's AI is Transforming Sign Language Education

In a new development for sign language education, NVIDIA has partnered with the American Society for Deaf Children and…

3 条评论
Microsoft's Majorana 1: When Theory Meets Engineering in Quantum Computing

2025年2月20日

Microsoft's Majorana 1: When Theory Meets Engineering in Quantum Computing

Microsoft's introduction of Majorana 1, a quantum processor powered by topological qubits, signals a distinct shift in…
Beyond Human Limitations: Google's AI Co-Scientist Promises to Accelerate Scientific Innovation

2025年2月20日

Beyond Human Limitations: Google's AI Co-Scientist Promises to Accelerate Scientific Innovation

In a groundbreaking announcement that could reshape the landscape of scientific research, Google has unveiled its AI…
Breakthrough AI Brain Decoder Requires Minimal Training to Read Thoughts

2025年2月19日

Breakthrough AI Brain Decoder Requires Minimal Training to Read Thoughts

Scientists have achieved a significant breakthrough in brain-reading technology with an improved AI system that can…

See all articles

Extended Thinking: A New Paradigm in AI Reasoning

Coding Excellence and Claude Code

Agent Capabilities and Task Performance

Safety and Responsible Development

Test-Time Compute Scaling

Availability and Pricing

A New Era of AI Reasoning

The AI Spectator

3,266 位关注者

David Borish的更多文章

The Trillion-Dollar Opportunity: Combining Generative AI in The Insurance Industry

Inside GPT-4.5: OpenAI's Latest Step in Unsupervised Learning

The Future of Game Design: Microsoft's WHAM Shows How AI Can Enhance Human Creativity

When Good AI Goes Bad: The Unexpected Consequences of Training AI on Insecure Code

Digital Empathy: How AI Is Transforming Our Understanding of Animal Emotions

Beyond Human Logic: AI Creates Revolutionary Chip Designs That Engineers Can't Comprehend

Breaking Language Barriers: How NVIDIA's AI is Transforming Sign Language Education

Microsoft's Majorana 1: When Theory Meets Engineering in Quantum Computing

Beyond Human Limitations: Google's AI Co-Scientist Promises to Accelerate Scientific Innovation

Breakthrough AI Brain Decoder Requires Minimal Training to Read Thoughts