Neural Network Hidden Bottleneck, But

Max Y. Ma and Gen-Hua Shi

Three basic questions

The neural network is a powerful numerical computation architecture. Evaluating its capacity requires mathematical answers to three basic questions: its degree of freedom, its computation power, and its boundary condition.

The degree of freedom

  • This answers how many dimensions the architecture can compute.
  • Our finding: unbounded dynamism and a virtually limitless degree of freedom.

The computation power

  • This answers how complex a problem the architecture is able to solve.
  • Our finding: computation power scales exponentially with depth.

The boundary condition

  • This answers how flexibly the architecture can handle external constraints.
  • Our finding: fluidity in self-progressing boundary conditions for robust, high-dimensional redundancy.


Hidden bottleneck

Evidence:

  • A zig-zag, up-and-down loss curve during the slow-decline stage; grokking; and double descent.

Mitigations

  • Dropout, shortcut/skip connections, and data work, including the labeling function.
  • The Transformer owes its success to having the best mitigation strategy, not to its attention mechanism per se.

A “double-edged sword” problem: loss curve evidence

  • Initial rapid drop: the neural network is very capable, even for high-order nonlinear problems, as in any foundation model.
  • Struggling slow decline: the neural network struggles with high-order nonlinearity.

Root Cause

  • Neural networks do not handle high-order nonlinearity well (a minimal sketch of this behavior follows).
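
To make the root cause concrete, here is a minimal sketch (our illustration, not an experiment from this article): the same small MLP fits a low-frequency target quickly but converges far more slowly as the target frequency, a proxy for the order of nonlinearity, increases. The network size, learning rate, and targets are all assumed for illustration.

```python
# Minimal sketch (illustrative, not this article's experiment): a small MLP
# fits a low-frequency target quickly but converges much more slowly on a
# high-frequency one -- a proxy for "high-order nonlinearity".
import torch
import torch.nn as nn

def fit(freq, steps=2000):
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 256).unsqueeze(1)
    y = torch.sin(freq * torch.pi * x)              # target "order" grows with freq
    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                        nn.Linear(64, 64), nn.Tanh(),
                        nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

for freq in (1, 4, 16):
    print(f"freq={freq:2d}  final MSE={fit(freq):.5f}")
```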

Fourier Features

For a deep manifold space with high-order nonlinearity, we should mathematically expect Fourier features (characteristics) within the space. How explicitly these Fourier features manifest depends on the rigidity of the deep manifold space and the resilience of the neural network.

As the strength of the bottleneck increases, so does the rigidity of the deep manifold space. At this point, Fourier features within the space should become apparent. Fourier analysis helps in understanding the frequency components that constitute the deep manifold space response. In general, the more rigid a system is, the more it tends to exhibit high-frequency components in its response. In this context, Fourier analysis can be an effective method for measuring learning capacity during training.
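
One hedged way to act on this suggestion (our sketch; the article does not specify a protocol): take the FFT of the model's residual error on a uniform 1-D probe and watch which frequency bands remain unfit as training proceeds. The `residual_spectrum` helper and the toy model/target below are assumptions for illustration.

```python
# Sketch of the measurement suggested above (our illustration, not a method
# given in the article): the FFT of the residual y - f(x) on a uniform probe
# shows which frequency components the network has not yet captured.
import numpy as np

def residual_spectrum(model_fn, target_fn, n=512):
    x = np.linspace(-1.0, 1.0, n, endpoint=False)
    residual = target_fn(x) - model_fn(x)
    amp = np.abs(np.fft.rfft(residual)) / n        # relative amplitude spectrum
    freqs = np.fft.rfftfreq(n, d=x[1] - x[0])      # cycles per unit length
    return freqs, amp

# Toy example: a "model" that has only captured the low-frequency component.
target = lambda x: np.sin(2 * np.pi * x) + 0.3 * np.sin(24 * np.pi * x)
model = lambda x: np.sin(2 * np.pi * x)
freqs, amp = residual_spectrum(model, target)
print("dominant residual frequency:", freqs[np.argmax(amp)])   # ~12 cycles/unit
```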



Robust and Resilient Neural Network

We are surprised by the prolonged slow decline, which is accompanied by a relatively high standard deviation (SD) in the loss values. This suggests that other factors may be at play.
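
A hedged sketch of how this SD observation could be quantified (the window size and the synthetic curve are our assumptions): compute a rolling mean and standard deviation over the logged loss series and compare the late, slow-decline stage against the initial drop.

```python
# Illustrative sketch: quantify the zig-zag by the rolling standard deviation
# of the logged loss series; a persistently high SD during the slow decline
# is the signal discussed above. Window size and curve are assumed.
import numpy as np

def rolling_stats(losses, window=100):
    losses = np.asarray(losses, dtype=float)
    means, sds = [], []
    for i in range(len(losses) - window + 1):
        chunk = losses[i:i + window]
        means.append(chunk.mean())
        sds.append(chunk.std())
    return np.array(means), np.array(sds)

# Synthetic loss curve: rapid initial drop, then a noisy, slow decline.
steps = np.arange(5000)
rng = np.random.default_rng(0)
loss = (2.0 * np.exp(-steps / 200) + 0.5 * np.exp(-steps / 5000)
        + 0.05 * rng.standard_normal(steps.size))
mean, sd = rolling_stats(loss)
print(f"rolling SD: early ~ {sd[0]:.3f}, late ~ {sd[-1]:.3f}")
```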

The prolonged, slow decline in the loss curve suggests that the neural network is a robust and resilient system. We have concluded the following:

  1. Dynamic computation with an infinite degree of freedom.
  2. The fluidity in self-progressing boundary conditions.

The neural network operates as a power-efficient system, with each node requiring minimal computational power, even when the deep manifold space becomes rigid and a bottleneck develops. Additionally, all foundation model pre-training is self-supervised. The neural network's self-progressing boundary condition imposes no restrictions on where incoming data is processed. Incoming data will be directed to whichever nodes are capable of processing it. This means that the neural network continues to learn even during the slow decline stage. In this sense, grokking and double descent are evidence of this continued learning.

It also means that the same token will be processed in different nodes. It is highly likely that many replicas of identical or near-identical feature bits (units of feature) are dispersed throughout the network. The inequality in mathematics, as described in 'Contact Theory' (Shi, 2015), suggests that connections between nodes (pathways) are not equal. Our working theory proposes that feature bits propagate through the network, with their propagation distance determined by the computational capacity of each node. The pathway appears to be power-driven, prioritizing certain features or patterns during learning in a discriminatory manner. While this discriminative feature pathway (DFP) is mathematically plausible, the underlying theory remains unclear. One simple probe of the replica claim is sketched below.
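
As a first probe of the replica claim (our illustration, not the DFP formalism): if near-identical feature bits are replicated across nodes, pairs of hidden units should show activation profiles with cosine similarity near 1. The `duplicated_unit_pairs` helper and its threshold are assumptions.

```python
# One possible probe (our illustration, not the DFP formalism): if identical
# or near-identical "feature bits" are replicated across nodes, some pairs of
# hidden units will have activation profiles with cosine similarity near 1.
import numpy as np

def duplicated_unit_pairs(activations, threshold=0.98):
    """activations: (num_samples, num_units) hidden-layer outputs."""
    a = activations - activations.mean(axis=0)       # center each unit
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)      # normalize each unit
    sim = a.T @ a                                    # unit-by-unit cosine matrix
    i, j = np.triu_indices_from(sim, k=1)
    mask = sim[i, j] > threshold
    return [(int(p), int(q)) for p, q in zip(i[mask], j[mask])]

# Toy check: unit 3 is an exact replica of unit 0.
rng = np.random.default_rng(0)
acts = rng.standard_normal((1000, 8))
acts[:, 3] = acts[:, 0]
print(duplicated_unit_pairs(acts))                   # expect [(0, 3)]
```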

This prompts a fundamental question: how do we measure the completeness of training? The answer could have numerous implications for pre-training and post-training, such as fine-tuning, in-context learning, model compression, and model merging.

Figure: Feature Bits & DFP (left) and bifurcation theory illustration (right)


Classical Manifold

Classical manifold methods can handle high dimensions and low-order nonlinearity. The principle of these attempts is to transform or map the data manifold onto a predefined manifold space, such as the Möbius strip or the Klein bottle. However, everything remains on a smooth surface, which is low-order nonlinear. A concrete example of such a predefined space follows.
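
For concreteness, here is the kind of predefined target space meant above: the textbook Möbius strip parameterization, a smooth embedding of a 2-D manifold in R^3, i.e., low-order nonlinear in this article's sense.

```python
# The textbook Mobius-strip parameterization: a smooth (low-order nonlinear)
# embedding of a 2-D manifold in R^3, of the kind classical manifold methods
# map data onto.
import numpy as np

def mobius(u, v, R=1.0):
    """u in [0, 2*pi): angle around the strip; v in [-0.5, 0.5]: width."""
    x = (R + v * np.cos(u / 2)) * np.cos(u)
    y = (R + v * np.cos(u / 2)) * np.sin(u)
    z = v * np.sin(u / 2)
    return np.stack([x, y, z], axis=-1)

u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 60), np.linspace(-0.5, 0.5, 10))
points = mobius(u, v)        # (10, 60, 3) grid of points on the surface
print(points.shape)
```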


Numerical Manifold Method

The numerical manifold method (NMM) was developed by co-author Gen-Hua Shi under DoD sponsorship in the early 1990s. The motivation for NMM was to develop a single numerical method covering linear, low-order nonlinear, and high-order nonlinear problems all together. Based on the NMM principle, Deep Manifold applies only three very basic topology concepts: cover, dual pairing, and covering space. A minimal illustration of the cover concept follows.
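
A hedged 1-D illustration of the cover concept (our sketch, far simpler than NMM itself): overlapping covers carry local functions, blended by weight functions that form a partition of unity, so the global approximation is a weighted sum of local pieces. All function choices here are assumptions.

```python
# Hedged 1-D illustration of a "cover" (our sketch, not NMM itself):
# overlapping covers carry local functions, blended by hat weights that form
# a partition of unity; the global approximation is their weighted sum.
import numpy as np

def hat_weights(x, centers, half_width):
    """Overlapping hat functions, normalized so they sum to 1 at every x."""
    w = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / half_width)
    return w / w.sum(axis=1, keepdims=True)

x = np.linspace(0.0, 1.0, 201)
centers = np.linspace(0.0, 1.0, 5)                  # 5 overlapping covers
w = hat_weights(x, centers, half_width=0.25)

# Each cover carries the tangent line of x**2 at its center; the blend
# reproduces x**2 closely even though every local piece is only linear.
local = centers[None, :] ** 2 + 2 * centers[None, :] * (x[:, None] - centers[None, :])
approx = (w * local).sum(axis=1)
print(np.allclose(w.sum(axis=1), 1.0))              # partition of unity holds
print(f"max blending error vs x**2: {np.abs(approx - x**2).max():.4f}")
```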

Gen-Hua has nearly 50 years of work in high-order nonlinearity modeling. He was and still is a mathematician: Peking University (MS, 1968), the Institute of Mathematics, Chinese Academy of Sciences (中科院数学所), UC Berkeley, Lawrence Livermore National Laboratory, US DoD, and independent researcher and consultant. He has solely developed:

  • “KeyBlock Theory”, 1970s
  • “Discontinuous Deformation Analysis” (DDA), 1980s
  • “Numerical Manifold Method” (NMM), 1990s
  • “Contact Theory” (EAB), 2000s

Max studied under Gen-Hua for 10 years, 1989-1999. He began to suspect the neural network's high-order nonlinearity problem in 2017. In his 2018 LinkedIn post, he wrote “I give myself 5-7 years for this” under the code name “Kahlua” at the end of the post.


References

  • Shi, G. Manifold Method of Material Analysis. Proceedings of the 9th Army Conference on Applied Mathematics and Computing, 1991.
  • Ma, M. Y. Single Field Manifold Method Using Fourier Function in Wave Propagation Analysis. Working Forum on the Manifold Method of Material Analysis, Volume I, U.S. Army Corps of Engineers, 1995.
  • Ma, M. Y., Zaman, M., and Zhu, J. H. Discontinuous Deformation Analysis Using the Third Order Displacement Function. Proceedings of the First International Forum on Discontinuous Deformation Analysis (DDA) and Simulations of Discontinuous Media, pages 383-394, 1996.
  • Shi, G. Contact Theory. Science China Technological Sciences, Volume 58, pages 1450–1496, 2015.


