"KAN: Kolmogorov-Arnold Networks" [Liu et al., 2024]: Extended Abstract
Abstract— The purpose of this text is to provide an extended abstract of the article "KAN: Kolmogorov-Arnold Networks" [Liu et al., 2024], focusing on (1) identifying the problem the authors intend to solve; (2) the main reasons that make the problem interesting, relevant, and timely; (3) the approach taken to the problem; and finally (4) discussing the results the authors hope to achieve with the proposed approach, as well as its limitations.
Keywords— Kolmogorov-Arnold representation theorem, KANs, splines
Identification of the Problem
The authors address the limitations of Multi-Layer Perceptrons (MLPs) in the field of deep learning, particularly focusing on their interpretability and efficiency in function approximation.
Multi-Layer Perceptrons (MLPs) have been used extensively in deep learning architectures and are inspired by the universal approximation theorem. This theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of R^n, given sufficient neurons and appropriate weights and activation functions.
The authors pose the question of whether MLPs are the best nonlinear regressors we can build. In transformers, for example, MLPs consume almost all non-embedding parameters (high parameter count) and, relative to attention layers, lack interpretability without post-hoc analysis. In the case of learning high-dimensional functions with additive or compositional structure, MLPs can potentially learn this structure, but with fixed activations like ReLU they are very inefficient at approximating univariate pieces such as the exponential and sine functions.
Reasons for the Problem’s Interest, Relevance, or Timeliness
MLPs, while powerful due to the universal approximation theorem, consume significant computational resources and are challenging to interpret without post-analysis tools. Improving upon these aspects is crucial as it can lead to more efficient, understandable, and practical AI systems, particularly in scientific applications where model transparency is as important as performance.
With the usage of AI in science, called AI+Science by some authors, the role of the scientist calls for reflection. Should we invest in making AI perform scientific research itself, producing AI scientist agents; should we protect the integrity of human-produced science by rejecting its use; or should we make human-AI collaboration effective by developing a common language that allows scientists and AI models to communicate and produce scientific knowledge together?
Approach to the Problem
Liu et al. 2024 propose an alternative to MLPs, called Kolmogorov-Arnold Networks. These networks are inspired by the Kolmogorov-Arnold representation theorem (KAT), which states that if f is a multivariate continuous function on a bounded domain, then f can be written as a finite composition of continuous functions of a single variable and the binary operator of addition.
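Concretely, the representation guaranteed by the theorem has the form

\[
f(x_1, \dots, x_n) \;=\; \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
\]

where the φq,p : [0,1] → R and the Φq : R → R are continuous univariate functions.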
This means that the only truly multivariate operation is addition, since every other function can be written using univariate functions and sums. However, these univariate functions may be non-smooth or even fractal, so they may not be learnable in practice. Earlier studies that applied the theorem to deep learning attempted to build networks of depth 2 and width 2n+1 and did not use modern training techniques such as backpropagation. Because of this topological limitation and the potentially non-smooth character of the functions to be learned, the theorem has largely been discarded in practice, even though its theoretical foundations are sound.
The authors attempt to make this theorem useful by generalizing its original formulation to networks of arbitrary depth and width.
KANs are, like MLPs, fully connected networks. But whereas MLPs place fixed nonlinear activation functions on nodes ("neurons") and learnable linear weights on edges, KANs place learnable nonlinear activation functions on edges ("weights") and the addition operation on nodes. These learnable 1D activation functions are parametrized as splines. Splines are accurate for low-dimensional functions, easy to adjust locally because they are built from a limited number of local control points, and able to switch between different resolutions. However, they suffer from the curse of dimensionality due to their inability to represent compositional structures. This makes them well suited to the application of the KAT, where each function to be learned is one-dimensional. The local character of splines also makes them less prone to catastrophic forgetting, the phenomenon by which a neural network that trains on one task and then switches to another becomes biased towards the second task, effectively forgetting the training on the first. This design allows KANs to achieve higher accuracy and better interpretability compared to MLPs.
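As a rough illustration of this design (not the authors' implementation: the class name, the use of PyTorch, and the substitution of Gaussian radial basis functions for B-splines are simplifying assumptions), a single KAN layer can be sketched as one learnable univariate function per edge, summed at each output node:

import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Sketch of one KAN layer: a learnable univariate function on every edge,
    plain summation on every node. Gaussian radial basis functions stand in for
    the paper's B-splines to keep the example short."""
    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, num_basis))
        self.width = (x_max - x_min) / (num_basis - 1)
        # one coefficient vector per edge: out_dim * in_dim learnable 1D functions
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):                                   # x: (batch, in_dim)
        z = (x.unsqueeze(-1) - self.centers) / self.width   # (batch, in_dim, num_basis)
        basis = torch.exp(-z ** 2)
        # edge activation phi_{o,i}(x_i) = sum_k coef[o,i,k] * basis_k(x_i)
        phi = torch.einsum("bik,oik->boi", basis, self.coef)
        return phi.sum(dim=-1)                              # node output = sum of incoming edges

# a small KAN of shape [2, 5, 1]
model = nn.Sequential(KANLayer(2, 5), KANLayer(5, 1))

A full KAN is then just a stack of such layers, trained end-to-end with ordinary backpropagation.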
Suppose we have a supervised learning task consisting of input-output pairs {xi, yi}, where we want to find f such that yi ≈ f(xi) for all data points. Under the representation above, the task is solved once we find suitable functions φq,p and Φq. Since all of these functions are univariate, each can be parametrized as a 1D B-spline curve, with learnable coefficients of local B-spline basis functions.
This network is admittedly too simple to approximate any function arbitrarily well. Therefore, the authors generalize the KAN architecture to be wider and deeper.
Since all operations are differentiable, KANs can be trained with backpropagation.
MLPs can be formulated as an interleaving of affine transformations W and nonlinearities σ. While MLPs treat linear transformations and nonlinearities separately, KANs absorb both into the layers Φ of learnable univariate functions.
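Schematically, with W_l denoting weight matrices, σ a fixed nonlinearity, and Φ_l a matrix of learnable univariate functions per layer, the two compositions read

\[
\mathrm{MLP}(x) = \left( W_L \circ \sigma \circ W_{L-1} \circ \cdots \circ \sigma \circ W_1 \right)(x),
\qquad
\mathrm{KAN}(x) = \left( \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1 \right)(x).
\]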
Each activation function is defined in residual form as the sum of a basis function b(x) and a spline, φ(x) = wb·b(x) + ws·spline(x). The factors wb and ws serve to control the overall magnitude of the activation function.
The basis function is set to b(x) = silu(x) = x / (1 + e^(-x)), and the spline is parametrized as a linear combination of B-splines, spline(x) = Σi ci·Bi(x), where the ci are trainable. Each activation function is initialized with ws = 1 and spline(x) ≈ 0, and wb is initialized with Xavier initialization in the original implementation. The grid of each activation function is updated on the fly according to its input activations, because introducing normalization to fix the input range gave worse results.
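A minimal numerical sketch of one such activation (grid range, grid size, and initialization values are illustrative choices, not the paper's defaults) can be written with SciPy's B-spline utilities:

import numpy as np
from scipy.interpolate import BSpline

def silu(x):
    return x / (1.0 + np.exp(-x))

k, G = 3, 5                                    # spline order and number of grid intervals
grid = np.linspace(-1.0, 1.0, G + 1)
h = (grid[-1] - grid[0]) / G
# pad k knots on each side so there are G + k basis functions, matching the paper's count
knots = np.concatenate([grid[0] - h * np.arange(k, 0, -1), grid, grid[-1] + h * np.arange(1, k + 1)])

coef = 0.01 * np.random.randn(G + k)           # trainable c_i, initialized so spline(x) ~ 0
w_b, w_s = 1.0, 1.0                            # scale factors (w_b would be Xavier-initialized)

def phi(x):
    """Residual activation: phi(x) = w_b * silu(x) + w_s * sum_i c_i B_i(x)."""
    return w_b * silu(x) + w_s * BSpline(knots, coef, k, extrapolate=True)(x)

print(phi(np.linspace(-1.0, 1.0, 5)))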
With a network of depth L, layers of equal width N, and splines of order k = 3 on G intervals (G + 1 grid points), there are in total O(N^2·L·(G + k)) ~ O(N^2·L·G) parameters. For comparison, an MLP with depth L and width N needs only O(N^2·L) parameters, which appears more efficient than KANs. However, KANs usually require a much smaller width N than MLPs, which not only saves parameters but also achieves better generalization.
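As a small worked example with illustrative numbers, take L = 3, G = 5, k = 3, a KAN of width N_KAN = 10, and an MLP of width N_MLP = 100:

\[
\underbrace{N_{\text{KAN}}^2 \, L \, (G + k)}_{\text{KAN}} = 10^2 \cdot 3 \cdot 8 = 2400
\;\ll\;
\underbrace{N_{\text{MLP}}^2 \, L}_{\text{MLP}} = 100^2 \cdot 3 = 30000,
\]

showing how a smaller width can more than compensate for the extra factor of (G + k).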
The theoretical guarantees for the expressivity of KANs can be summarized as follows:
A KAN with a finite grid size can approximate a function well with a residual rate independent of the dimension, hence beating the curse of dimensionality, because splines are only used to approximate one-dimensional functions.
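Up to constants and notation, the bound in the paper's approximation theorem is roughly of the form

\[
\bigl\| f - \bigl( \Phi^G_{L-1} \circ \cdots \circ \Phi^G_0 \bigr) x \bigr\|_{C^m} \;\le\; C \, G^{-k-1+m},
\]

where G is the grid size, k the spline order, 0 ≤ m ≤ k, and the constant C depends on f and its Kolmogorov-Arnold representation but not on the input dimension.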
Neural scaling laws are the phenomenon where test loss decreases with more model parameters. The authors provide both theoretical and empirical evidence that KANs possess faster neural scaling laws compared to MLPs. This means that as the size of the network increases, the error decreases more rapidly for KANs than for MLPs. This property is crucial for demonstrating the efficiency of KANs in learning and approximating functions, especially in high-dimensional spaces.
Different theories have been proposed to predict the scaling exponent α. Sharma and Kaplan suggest that α comes from data fitting on an input manifold of intrinsic dimensionality d, implying α = (k+1)/d for piecewise polynomial fits of order k. This suffers from the curse of dimensionality, so other bounds independent of d have been sought by leveraging compositional structure. Michaud et al. considered computational graphs that only involve unary (e.g. exp, sine, square) and binary (× and +) operators, finding α = (k+1)/d* = (k+1)/2, where d* is the maximum arity. Poggio et al. leveraged the idea of compositional sparsity, equivalent to α = m/2 for functions with bounded derivatives up to order m. The approach of Liu et al. [2024] assumes the existence of smooth Kolmogorov-Arnold representations, decomposing high-dimensional functions into one-dimensional ones and giving α = k + 1, where k is the piecewise polynomial order of the splines. With k = 3, α = 4, which the authors validate empirically.
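Writing the test-loss scaling as ℓ ∝ N^(-α), with N the number of parameters, the exponents discussed above can be summarized as

\[
\alpha_{\text{manifold}} = \frac{k+1}{d}, \qquad
\alpha_{\text{max-arity}} = \frac{k+1}{d^\ast} = \frac{k+1}{2}, \qquad
\alpha_{\text{sparsity}} = \frac{m}{2}, \qquad
\alpha_{\text{KAN}} = k+1 \;\overset{k=3}{=}\; 4.
\]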
Moreover, the universal approximation theorem (UAT) gives no bound on how the required number of neurons N(ε) scales with the error ε, and it suffers from the curse of dimensionality: N has been shown to grow exponentially with d in some cases. KANs take advantage of the intrinsically low-dimensional representation of the function, while MLPs do not.
By making its grid finer, a spline can be made arbitrarily accurate with respect to a target function. This is not possible with MLPs: due to the neural scaling laws, increasing the width and depth of an MLP can improve performance, but only slowly and at great cost. In KANs, we can first train a model with fewer parameters and then extend it to one with more parameters by making the spline grids finer, without retraining the model from scratch.
Grid extension consists of fitting a new fine-grained spline to an old coarse-grained spline. The spline on the new, finer grid is obtained as a linear combination of the new basis functions, with the criterion that the distance between the new and old splines should be minimized (e.g. using least squares). Grid extension is performed for every spline independently. Tests of the test loss under grid extension were performed; an optimal value of G = 50 was obtained, consistent with the theoretical value G = samples/parameters = 50. The comparison with the theoretical neural scaling law was also satisfactory: taking the square root of the median of the squared losses, the scaling follows RMSE ∝ G^(-4).
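A minimal sketch of grid extension for a single activation (the helper names, grid sizes, and knot construction here are illustrative, not the authors' code): sample the coarse spline at the current input activations and solve a least-squares problem for the fine-grid coefficients.

import numpy as np
from scipy.interpolate import BSpline

def extended_knots(grid, k):
    # pad k knots on each side so a grid of G intervals yields G + k basis functions
    h = (grid[-1] - grid[0]) / (len(grid) - 1)
    return np.concatenate([grid[0] - h * np.arange(k, 0, -1), grid, grid[-1] + h * np.arange(1, k + 1)])

def extend_grid(coarse_coef, x_samples, k=3, G_coarse=5, G_fine=10):
    """Fit fine-grid coefficients so the fine spline matches the coarse spline
    (in the least-squares sense) on the sampled input activations."""
    lo, hi = x_samples.min(), x_samples.max()
    t_coarse = extended_knots(np.linspace(lo, hi, G_coarse + 1), k)
    t_fine = extended_knots(np.linspace(lo, hi, G_fine + 1), k)
    y = BSpline(t_coarse, coarse_coef, k)(x_samples)      # values of the old, coarse spline
    n_fine = len(t_fine) - k - 1                          # = G_fine + k basis functions
    # design matrix: column i is the i-th fine basis function evaluated at x_samples
    B = np.stack([np.nan_to_num(BSpline.basis_element(t_fine[i:i + k + 2], extrapolate=False)(x_samples))
                  for i in range(n_fine)], axis=1)
    fine_coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return fine_coef

# e.g.: new_coef = extend_grid(0.1 * np.random.randn(5 + 3), np.random.uniform(-1, 1, 200))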
The computational graph describing how nodes are connected represents external degrees of freedom, while the grid points inside an activation function are internal degrees of freedom. KANs benefit from having both: external degrees of freedom are responsible for learning the compositional structure of multiple variables, which splines alone do not allow but the additive KAN structure does; internal degrees of freedom are responsible for learning univariate functions, which splines allow.
Some simplification techniques are introduced to obtain a minimal KAN structure and to make KANs interactive. Sparsification is introduced in the original implementation through L1 regularization of the learnable activation functions. Since this is not enough on its own, entropy regularization is further introduced. The L1 norm of an activation function is its average magnitude over its Np inputs. The paper describes how to compute the L1 norm of an activation function and of a layer, how to compute the entropy, and the total loss, which is the prediction loss plus the L1 and entropy regularization over all KAN layers.
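Written out (following the paper's definitions up to notation; λ, μ1, μ2 are regularization hyperparameters), for an activation φ seen on Np input samples and a layer Φ with n_in × n_out activations:

\[
|\phi|_1 = \frac{1}{N_p} \sum_{s=1}^{N_p} \bigl|\phi\bigl(x^{(s)}\bigr)\bigr|, \qquad
|\Phi|_1 = \sum_{i=1}^{n_{\text{in}}} \sum_{j=1}^{n_{\text{out}}} |\phi_{i,j}|_1, \qquad
S(\Phi) = -\sum_{i,j} \frac{|\phi_{i,j}|_1}{|\Phi|_1} \log \frac{|\phi_{i,j}|_1}{|\Phi|_1},
\]
\[
\ell_{\text{total}} = \ell_{\text{pred}} + \lambda \Bigl( \mu_1 \sum_{l} |\Phi_l|_1 + \mu_2 \sum_{l} S(\Phi_l) \Bigr).
\]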
Pruning can be performed on KANs to obtain an optimal KAN structure. After training with the sparsification penalty, we may want to obtain a minimal subnetwork by pruning at the node level rather than at the edge level. An incoming and an outgoing score are defined for each neuron; a neuron is considered important only if both scores are greater than a threshold θ = 10^(-2), and all unimportant neurons are pruned.
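In formula form (the index notation here is ours), the scores of neuron i in layer l are the largest L1 norms among its incoming and outgoing activation functions,

\[
I_{l,i} = \max_{k} \bigl( |\phi_{l-1,k,i}|_1 \bigr), \qquad
O_{l,i} = \max_{j} \bigl( |\phi_{l,i,j}|_1 \bigr),
\]

and the neuron is kept only if both I_{l,i} > θ and O_{l,i} > θ.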
Techniques for visualizing the topology and the individual activation functions, together with their symbolification, make KANs more interactive and turn training into an iterative process of intermediate visualization, pruning, symbolification, and further testing.
Symbolic regression is discussed as an alternative method for approximating composite functions, but it is hard to debug and offers no intermediate outputs (it either returns success or failure), whereas KANs perform a continuous search. Symbolic regression also cannot be applied to all functions (e.g. Bessel functions), while KANs can use splines to approximate such functions numerically.
Results and Limitations
The authors use extensive numerical experiments to show the accuracy and interpretability of KANs.
The authors of Liu et al. [2024] highlight that KANs are accurate, being more effective than MLPs at representing functions (e.g. in regression and PDE solving), achieving the same level of accuracy with fewer parameters and displaying more favorable Pareto frontiers than MLPs on 6 toy functions, 15 special functions, and a collection of equations from the Feynman dataset, which gathers functions appearing in the Feynman textbooks. The tests further show that auto-discovered KANs compare favorably to MLPs on these sets of functions in terms of loss (test RMSE less than 10^-2).
The continual learning capacity of KANs is tested on a toy continual learning problem: a 1D regression task with 5 Gaussian peaks, presented sequentially to the models. MLPs display severe catastrophic forgetting, while KANs do not. This is attributed to how KANs reorganize themselves, displaying local plasticity due to spline locality and thereby avoiding catastrophic forgetting: since spline bases are local, a sample only affects a few nearby spline coefficients, leaving far-away coefficients intact. This is akin to the functioning of the brain, where functionally distinct modules are placed locally in space.
The interpretability of KANs is shown with the techniques of sparsification, pruning, symbolification and visualization, on real-world maths and physics problems, namely knot theory and Anderson localization.
This paper contributes to several fields. First, to the applicability of the Kolmogorov-Arnold theorem to neural networks, by generalizing the theorem to networks of arbitrary depth and width. Second, to neural scaling laws, by suggesting that high-dimensional functions can scale as one-dimensional functions under a smooth Kolmogorov-Arnold representation, offering a faster scaling exponent than previously reported, although further research is needed to see whether this applies to complex tasks like language modeling. Third, to mechanistic interpretability (MI), which aims to understand neural networks' inner workings and can be divided into passive (understanding existing networks) and active (designing interpretable architectures) research; the authors' work falls into the active category, focusing on creating interpretable models and training methods by design. Fourth, to learnable activation functions, which have been explored previously with various learning methods; KANs use B-splines to parametrize activations, and the paper also introduces learnable activation networks (LANs), which sit between KANs and MLPs in terms of properties, with further results discussed in an appendix.
The paper also contributes to the fields of symbolic regression, physics-informed neural networks (PINNs), and AI for mathematics; the authors note the potential to integrate KANs into existing architectures such as transformers, creating "kansformers" for improved performance.
The current understanding of KANs is limited, though preliminary analysis is promising. Future research could explore deeper Kolmogorov-Arnold representations and their implications for representing complex functions. Areas for improvement include architecture design choices, efficiency (e.g., batch computation), and hybrid models combining KANs and MLPs. Adaptive strategies could further enhance accuracy and efficiency.
KANs show promise in science-related tasks but require more exploration in various domains. Integrating KANs into larger machine learning frameworks could lead to significant advancements.
According to the authors, KANs offer superior interpretability and accuracy but suffer from slow training compared to MLPs. They are recommended for tasks where these benefits outweigh the training speed, especially for small-scale AI + Science problems.