登录查看更多内容

Testing DeepSeek R1 in its favorite subject - mathematics

Eric Fraser

CTO of Dr. Lisa AI. Views expressed here are my own.

发布日期: 2025年1月30日

+ 关注

Earlier today this question popped onto my radar - could quantum computing help speed up Gradient Descent calculations?

What is Gradient Descent and Who Cares

Gradient Descent is a calculation used in the training of basically all Large Language Models, like ChatGPT, Claude, and DeepSeek R1 itself. It is not a new thing. It is very old. The new aspect of how it's used is that we now throw huuuuge amounts of data at it, which makes the task of completing its calculations slow and expensive. This is what makes people look for ways to speed it up.

If you want LLMs to not hallucinate, you need Gradient Descent to work properly.
If you want LLMs to cost you less than $200 a month to access, you want this calculation to be cheap and fast.

What is the Answer

Something in my clunky human memory told me that quantum processing may not speed up gradient descent calculations. I threw this question over to an actual quantum physicist / quantum computing expert, Anastasia Marchenkova , then I also asked DeepSeek R1, via its vanilla implementation over at Perplexity .

Below is the exact text of my prompts and its responses.

TLDR:

It says QC will not speed up gradient descent. But I'll wait for Anastasia's human expert perspective before concluding that this is true.
I had to refine my prompt to get real insights (Prompt 2)
When it started talking about "quantum gradient descent" it went over my head. I have no idea what that is, or what it means. I understand classical gradient descent only.

Prompt 1:

For processing the mathematics of gradient descent in neural networks, which is better and why: GPUs, TPUs, or Quantum computers?

Gradient descent optimization in neural networks benefits from different hardware architectures depending on the scale, model complexity, and computational requirements. Here's a comparison of GPUs, TPUs, and quantum computers:

GPUs (Graphics Processing Units)

Strengths:Parallel processing: Optimized for matrix operations common in neural network training, achieving 10-100x speedups over CPUs for deep learning tasks39.Versatility: Support diverse frameworks (PyTorch, TensorFlow) and non-AI workloads (graphics, simulations)89.Mature ecosystem: Extensive software support (CUDA, cuDNN) and precision options (FP16 to FP64)820.
Limitations:Memory bandwidth constraints for very large models8.Higher power consumption compared to TPUs79.

TPUs (Tensor Processing Units)

Strengths:Tensor optimization: Specialized for large-scale matrix multictions, achieving ~5x speedups over GPUs in benchmarks (e.g., training Xception models)310.Energy efficiency: Designed for Google’s TensorFlow, with superior throughput for sharded datasets and large batch sizes37.Cost-effective: Cloud-based TPUs often outperform GPUs in price/performance for specific tasks like NLP (BERT) and image recognition1020.
Limitations:Narrower framework compatibility (primarily TensorFlow)9.Less memory per core than GPUs8.

Quantum Computers (QPUs)

Theoretical advantages:Exponential speedup: Quantum gradient descent algorithms claim O(1)O(1) complexity for gradient calculations vs. O(d)O(d) classically419.Natural gradient optimization: Exploits quantum geometry for faster convergence in variational quantum circuits1921.
Current challenges:Limited qubit coherence times and error rates519.Noisy intermediate-scale quantum (NISQ) devices lack practical utility for classical ML tasks422.Specialized use cases (e.g., quantum chemistry) rather than general gradient descent2122.

Comparison Table

(omitted here for simplicity)

When to Use Which?

GPUs: Ideal for prototyping, small-to-medium models, and multi-framework workflows89.
TPUs: Best for large-scale TensorFlow models (e.g., transformers, CNNs) with sharded datasets310.
Quantum: Not yet practical for classical ML; potential future role in hybrid quantum-classical optimization1922.

Conclusion

For current applications, TPUs outperform GPUs in specific large-scale tensor operations, while GPUs remain more flexible. Quantum computers, though promising theoretical speedups (O(1)O(1) gradient calculations4), lack the infrastructure to rival classical hardware today. As of 2025, TPUs are optimal for Google Cloud users training massive models, whereas GPUs dominate general-purpose ML. Quantum gradient descent remains a long-term research frontier1921.

Prompt 2:

I'm specifically interested in the gradient descent calculations, not all LLM calculations, or even all LLM training calculations. Compare GPUs and TPUs for both speed, and their ability to select local minima in the multidimensional space.

Gradient descent performance and optimization characteristics differ significantly between GPUs and TPUs due to their architectural designs and batch processing capabilities. Here's a focused comparison:

Speed Comparison

Hardware Architecture

GPUs Excel at parallel vector operations with high memory bandwidth (900+ GB/s in modern models)16. Their SIMD architecture enables efficient gradient calculations across multiple cores simultaneously16. Flexible for various batch sizes but optimized for medium batches (32-128)217.
TPUs Specialized for tensor operations with matrix multiply units (MMUs) that reuse input vectors across multiple calculations69. Achieve 1.5-7x speedups over GPUs in large-batch scenarios (128+)220, but struggle with small batches due to fixed dataflow architectures714.

领英推荐

Neuromorphic Computing

InventIP Legal Services LLP 3 周前

The December Edition 2024

ImageVision.ai 2 个月前

Physics Informed Neural Networks (PINNs)

Fast Code AI 11 个月前

Benchmarks

TaskGPU (V100)TPU (v3)SpeedupResNet-50 (CIFAR-10)40 mins15 mins2.7x1Xception (batch=128)Out-of-memory1.5 mins/epoch7x2BERT (128 seq)3.8 ms/batch1.7 ms/batch2.2x1

TPUs dominate in large-batch tensor operations critical for gradient calculations, while GPUs offer better flexibility for dynamic batch sizes617.

Local Minima Selection Characteristics

Batch Size Effects

TPUs Require large batches (≥128) for optimal performance214, which:Reduce gradient variance by 87-92% compared to small batches17Increase risk of converging to sharp minima by 3-5x45Limit stochastic exploration of loss landscapes319
GPUs Efficiently handle smaller batches (16-64), enabling:Higher gradient noise (34-61% more stochasticity)517Better escape from saddle points and shallow minima4162-3x more frequent parameter updates per epoch1719

Optimization Paths

TPU-optimized large batches follow smoother but narrower optimization trajectories, increasing the likelihood of settling in suboptimal minima35.
GPU-friendly small batches create "noisy" paths that explore wider regions of the loss landscape, finding flatter minima with better generalization417.

Practical Tradeoffs

FactorGPU AdvantageTPU AdvantageBatch flexibilityDynamic batch sizing (16-512)Fixed large batches (≥128)Minima qualityBetter for exploratory optimizationRisk of sharp minima in over-parameterized modelsGradient calc speed2-3x slower at scale2-7x faster for sharded dataMemory constraintsSupports gradient accumulation3Requires full-batch sharding214

Conclusion

For gradient descent specifically:

TPUs excel in speed for large-batch scenarios (e.g., full-batch distributed training), but their architectural constraints may lead to poorer local minima selection in complex loss landscapes.
GPUs provide better minima exploration through flexible batch sizing and inherent stochasticity, at the cost of slower computation for massive tensor operations.

Current implementations show TPUs achieve 5-7x faster gradient updates in optimized TensorFlow pipelines220, while GPUs maintain advantages in PyTorch workflows requiring dynamic batch adjustments16. The choice depends on whether raw speed (TPUs) or optimization robustness (GPUs) is prioritized for a given problem.

Prompt 3

Why doesn't quantum computing help speed up the computation of gradient descent? Please address the computer science details and math details.

Quantum computing does not currently provide a speedup for classical gradient descent due to fundamental computational and mathematical limitations rooted in the algorithm's structure and quantum hardware constraints. Here's a detailed breakdown:

1. Algorithmic Structure Mismatch

Classical Gradient Descent: Operates in O(d)O(d) complexity per iteration for dd-dimensional problems, with convergence typically requiring O(??2)O(??2) iterations for non-smooth convex objectives1012. The algorithm relies on sequential updates to parameters based on local gradient information.
Quantum Adaptation Challenges: Quantum gradient descent algorithms (e.g., for parameterized quantum circuits) use the parameter-shift rule to estimate gradients19. For each parameter θkθk, this requires two quantum circuit evaluations (at θ±skθ±sk), leading to O(d)O(d) quantum state preparations and measurements per iteration49. This matches classical O(d)O(d) scaling, negating quantum advantage.

2. Oracle Overhead and Query Complexity

Oracle Conversions: Quantum speedups often require converting between probability oracles (common in optimization) and phase oracles (used in quantum algorithms). This interconversion incurs O(logd)O(logd) overhead2, erasing potential quadratic speedups for gradient estimation.
Lower Bounds: For non-smooth convex optimization, quantum algorithms cannot outperform classical gradient descent’s O(??2)O(??2) query complexity1011. This is proven via information-theoretic arguments: quantum queries to f(x)f(x) and ?f(x)?f(x) provide no asymptotic advantage in high-dimensional spaces10.

3. Barren Plateaus and Noise

Barren Plateaus: Quantum neural networks (QNNs) suffer from exponentially vanishing gradients as the number of qubits grows, a phenomenon linked to random parameter initialization and entanglement616. This forces quantum gradient descent into flat regions of the loss landscape, requiring exponentially precise measurements to escape6.
NISQ Limitations: Current noisy intermediate-scale quantum (NISQ) devices have error rates (10?310?3–10?210?2) that dominate signal from gradient calculations. For example, estimating a gradient component with precision δδ requires O(δ?2)O(δ?2) measurements9, making practical implementations infeasible.

4. Theoretical Speedup Constraints

Smooth vs. Non-Smooth Functions: Quantum gradient methods (e.g., quantum natural gradient15) theoretically offer O(d)O(d) speedups for smooth functions by exploiting quantum Fisher information7. However, most real-world optimization problems (e.g., ReLU networks, L1 regularization) are non-smooth, where no quantum advantage exists1011.
Linear Algebra Bottlenecks: While quantum linear algebra (e.g., HHL algorithm) could accelerate matrix inversions, gradient descent’s iterative nature and adaptive step sizes prevent direct integration814. Hybrid quantum-classical approaches introduce synchronization overhead that negates speedups12.

5. Complexity Analysis

For a dd-dimensional problem with TT iterations:

Classical: O(T?d)O(T?d) time.
Quantum:Gradient estimation: O(d?δ?2)O(d?δ?2) measurements per iteration9.Total: O(T?d?δ?2)O(T?d?δ?2), which is slower unless δ?2?1δ?2?1 (unachievable with NISQ error rates).

Conclusion

Quantum computing does not accelerate classical gradient descent due to:

Matching O(d)O(d) scaling in gradient estimation19,
Oracle conversion overhead2,
Barren plateaus in high-dimensional spaces616,
Lower bounds for non-smooth optimization1011,
Noise and measurement constraints69.

While quantum methods like variational quantum eigensolvers (VQE) or quantum approximate optimization (QAOA) show promise in quantum-native tasks512, they do not translate to speedups for classical gradient descent. Hybrid quantum-classical algorithms remain limited by classical bottlenecks in parameter updates and error correction14.

Todd A. Jacobs

Business-First Technology & Cybersecurity Leader ? Hands-On IT & Software Development Expert ? Organizational Agility Wizard ? Solving Business Problems Through People, Process & Technology

1 个月

By the way, if you examine the CoT for these these types of expert-level questions, the system will often infer that you already have sufficient domain expertise to understand the often sparse and highly-academic data it has on particularly abstruse topics. This can make its answers even harder to understand than the source material. This is hard to fix. Using prompts like "Provide a response using a 10th grade vocabulary and summarize higher mathematics using high school algebra and geometry" can sometimes help, but it can also over-constrain the answer or cause the LLM to hallucinate if the AI can't draw from a sufficient pool of valid approximations. This just compounds the truthiness problem and Dunning-Kruger effects on users.

2 次回应

Todd A. Jacobs

Business-First Technology & Cybersecurity Leader ? Hands-On IT & Software Development Expert ? Organizational Agility Wizard ? Solving Business Problems Through People, Process & Technology

1 个月

Your post fascinated me for reasons beyond the scientific question. It's so cool that you did this; let me explain why. I think the biggest problem when asking AI systems anything is that since they *can* be factually wrong, actively hallucinate, or simply misunderstand the intent of a question, AI interaction also creates many opportunities for cognitive bias among *users.* Specifically, I often see confirmation bias, Texas sharpshooter fallacies, and Dunning-Kruger effects when people interact with the LLM layers of AI. Even if the knowledge domain or answer wasn't so abstruse, a typical human reaction would be to either accept it at face value or reject it if the answer didn't align with their general expectations. LLMs are often "truthy." They can sound plausible even when they're wrong, or wildly improbable even when the essential facts are right. Even with CoT, that seems like a harder usability problem to solve for than accurate data retrieval because it's fundamentally a *people* problem.

2 次回应

Lori Mazor

I Teach AI with a Human Touch?? | Fembot 137 ??

1 个月

Are you following Jack Hidary and SandboxAQ? This is their domain of expertise. You would probably appreciate what they are building.

1 次回应

Uchenna N I.

Emerging Technology Innovation and Engineering Leader, Ex-Oracle and Ex-Verizon | AI Startup Evangelist

1 个月

In physics, The quantum gradient descent algorithm leverages the unique characteristics of quantum systems to optimize quantum circuits efficiently. This innovative adaptation showcases the power of quantum technology in enhancing optimization processes and allows simultaneous evaluation of multiple gradient components. The higher accuracy in complex combinational systems is a massive plus for future AI and Quantum systems.

1 次回应

Eric Fraser

CTO of Dr. Lisa AI. Views expressed here are my own.

1 个月

At the end of DeepSeek's explanations, my feeble brain still did not understand what it meant by "quantum gradient descent" so I asked it to explain *just that term*. It did, but I was too dumb to understand. Meanwhile, Perplexity Search, which was operating in parallel to DeepSeek R1 (because I was accessing R1 via Perplexity) found this *absolutely excellent* human explanation on YouTube. Thanks Perplexity! And thank you Lana Bozanic! https://youtu.be/Qsa-_QbpBMg?feature=shared I still can't say "oh I totally get it now" but I will keep studying this until I do. I once explained backpropagation using the metaphor of a bear walking downhill in a forest to find water, and I'll test my own understanding by constructing a similar metaphor for explaining quantum gradient descent. But not today.

1 次回应

查看更多评论

要查看或添加评论，请登录

Eric Fraser的更多文章

How I got my job, and the problem with "Must Have" skill & experience requirements

2025年1月27日

How I got my job, and the problem with "Must Have" skill & experience requirements

I didn't ask for my job. A year and a half ago I asked to be an unpaid apprentice to Dr Lisa Palmer, because She was a…

4 条评论
How Large Language Models "Learn" - Part 1, Backpropagation.

2024年9月16日

How Large Language Models "Learn" - Part 1, Backpropagation.

I wrote this post 9 months ago and find I am increasingly asked about how LLMs learn, especially following the recent…
Can my role be done by AI? Let's see.

2023年6月27日

Can my role be done by AI? Let's see.

Conclusion - Updated Sep 27 2023: You can't AI a Sales Manager role 100%. You can AI some of it.

30 条评论
Don't Make This Culture Mistake

2021年8月5日

Don't Make This Culture Mistake

A 2004 Research Paper from Stanford Business School* describes an approach to Organizational Culture that you can still…
The Use of Artificial Intelligence in Sales

2020年5月5日

The Use of Artificial Intelligence in Sales

In the business world, artificial intelligence is good at making slow things fast. But what if you’re doing something…
How to Quantify Business Value for CRM & Related Solutions

2020年4月9日

How to Quantify Business Value for CRM & Related Solutions

I'm going to keep this short and easy-to-digest. I'm going to give you 3 tips, and 3 signposts to places where you can…
A Haiku about the Purpose of Key Account Planning

2019年8月7日

A Haiku about the Purpose of Key Account Planning

If your Revenue and Pipeline does not go up..

1 条评论
Different Sales Methodologies as Haikus

2018年10月4日

Different Sales Methodologies as Haikus

I often get asked what the difference is between the major sales methodologies, and which one is "the best". Altify…

See all articles

What is Gradient Descent and Who Cares

What is the Answer

Prompt 1:

For processing the mathematics of gradient descent in neural networks, which is better and why: GPUs, TPUs, or Quantum computers?

GPUs (Graphics Processing Units)

TPUs (Tensor Processing Units)

Quantum Computers (QPUs)

Comparison Table

When to Use Which?

Conclusion

Prompt 2:

I'm specifically interested in the gradient descent calculations, not all LLM calculations, or even all LLM training calculations. Compare GPUs and TPUs for both speed, and their ability to select local minima in the multidimensional space.

Speed Comparison

Hardware Architecture

领英推荐

Benchmarks

Local Minima Selection Characteristics

Batch Size Effects

Optimization Paths

Practical Tradeoffs

Conclusion

Prompt 3

Why doesn't quantum computing help speed up the computation of gradient descent? Please address the computer science details and math details.

1. Algorithmic Structure Mismatch

2. Oracle Overhead and Query Complexity

3. Barren Plateaus and Noise

4. Theoretical Speedup Constraints

5. Complexity Analysis

Conclusion

Eric Fraser的更多文章

How I got my job, and the problem with "Must Have" skill & experience requirements

How Large Language Models "Learn" - Part 1, Backpropagation.

Can my role be done by AI? Let's see.

Don't Make This Culture Mistake

The Use of Artificial Intelligence in Sales

How to Quantify Business Value for CRM & Related Solutions

A Haiku about the Purpose of Key Account Planning

Different Sales Methodologies as Haikus

社区洞察

其他会员也浏览了

Quantum AI(QAI) Overview

The Ultimate Trio: How Lab Automation, AI, and Quantum will Change the Game

Read Entropy's High-cited Article "Quantum Machine Learning: A Review and Case Studies"

Dartmouth and the Origins of "AI"

The Quantum Leap: How Quantum Computing Could Supercharge Generative AI

The Mathematics of Hopfield Networks: From Neural Relationships to Memory Mechanisms

Quantum AI: The Next Frontier

Exploring the Future of Computing: What is Quantum Machine Learning?

Beyond the Hype: The Convergence of AI with Quantum Computing, Neuroscience, and Ethics

Quantum AI