RNNs are Schmidhuber’s Revenge

Jürgen Schmidhuber, co-creator of the long short-term memory (LSTM) network, was probably right when he said recurrent neural networks (RNNs) are all we need. While attention-based Transformers dominate generative AI right now, they still struggle when dealing with long sequences.

But researchers from Borealis AI, the Ontario-based research firm, decided to revisit RNNs to see if they can solve some of the current problems with LLMs. Led by Yoshua Bengio, one of the godfathers of deep learning, the team argues that traditional RNNs were slow to train because they had to be optimised sequentially with backpropagation through time (BPTT), a method Schmidhuber has frequently claimed credit for introducing.

Were RNNs All We Needed?

The researchers asked this question to revive traditional RNNs, namely LSTMs and Gated Recurrent Units (GRUs). They concluded that by removing the hidden-state dependencies from the input, forget, and update gates, LSTMs and GRUs no longer need BPTT and can be efficiently trained in parallel.

The minimal versions of LSTMs and GRUs, called minLSTM and minGRU, are stripped-down versions of the original models. Unlike traditional RNNs, they can be trained in parallel using the parallel scan algorithm, significantly speeding up training.
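To see why removing the hidden-state dependence enables parallel training, note that once the update gate and candidate state depend only on the current input, the recurrence becomes linear in the hidden state, so a prefix scan reproduces the sequential loop exactly. The sketch below illustrates this for a minGRU-style recurrence; the random weights and tiny dimensions are illustrative only (the actual paper performs the scan in log-space for numerical stability):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length, hidden size (toy values)
x = rng.normal(size=(T, d))

# minGRU-style gates: both depend only on the input x_t, never on h_{t-1}
W_z, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
z = sigmoid(x @ W_z)             # update gate z_t
h_tilde = x @ W_h                # candidate state h~_t
h0 = np.zeros(d)

# 1) sequential recurrence: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
h_seq = np.empty((T, d))
h = h0
for t in range(T):
    h = (1 - z[t]) * h + z[t] * h_tilde[t]
    h_seq[t] = h

# 2) closed form via prefix products/sums (what a parallel scan computes)
a = 1 - z                        # decay coefficients, strictly in (0, 1)
b = z * h_tilde
A = np.cumprod(a, axis=0)        # A_t = a_1 * ... * a_t
h_scan = A * (h0 + np.cumsum(b / A, axis=0))

assert np.allclose(h_seq, h_scan)   # both formulations agree
```

Because every `h_t` can be computed from prefix products and sums, all timesteps can be processed at once on parallel hardware instead of one step at a time.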

How fast are they? The minimal models also use significantly fewer parameters than their traditional counterparts, and minGRU and minLSTM are 175x and 235x faster per training step than standard GRUs and LSTMs at a context length of 512 tokens.

With Transformers, you can also fetch any piece of previous information at any time, which is quite helpful. RNNs, meanwhile, constantly update and overwrite a fixed-size memory, so they need to be able to predict what will be useful in order to store it for later.

This is a massive advantage for Transformers in interactive use cases like ChatGPT, where users give context and ask questions over multiple turns.

Why Does Everyone Want Transformers to Fail?

This is not a new phenomenon. In fact, researchers predicted the same in 2019 in the research paper titled ‘Single Headed Attention RNN: Stop Thinking With Your Head’.

It demonstrated near-SOTA results using LSTMs, a variant of RNNs designed to make it easier to retain past data in memory, suggesting that the “Transformer hype” may be overblown.

Transformers require the entire sequence to be stored in memory, but that won’t be realistic once models become highly multimodal and start taking in a lot of information, such as images, video and sound, at once.

RNNs don’t have this limitation by design: they only need the data corresponding to the current timestep. Hence, they may end up scaling better in the context of multimodality, even if Transformers stay ahead in terms of accuracy.
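To make this scaling difference concrete, here is a back-of-the-envelope comparison using hypothetical model dimensions (not figures from the article): a Transformer’s key-value cache grows linearly with context length, while an RNN carries a fixed-size hidden state regardless of how much it has seen.

```python
# Illustrative memory comparison with assumed sizes: 32 layers,
# hidden width 4096, fp16 (2 bytes per value).
layers, d_model, bytes_per = 32, 4096, 2

def kv_cache_bytes(T):
    # keys + values, per layer, per token of context
    return 2 * layers * T * d_model * bytes_per

rnn_state = layers * d_model * bytes_per   # constant, independent of T

for T in (1_000, 100_000):
    print(f"T={T:>7}: KV cache {kv_cache_bytes(T)/2**30:.2f} GiB, "
          f"RNN state {rnn_state/2**20:.2f} MiB")
```

At these assumed sizes, going from 1,000 to 100,000 tokens of context multiplies the Transformer’s cache by 100x, while the RNN’s state stays at a quarter of a mebibyte.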

The hybrid approach of combining the strengths of RNNs with lessons learned from the Transformer era could point the way forward for sequence modelling tasks. The ability to train these models efficiently on large datasets while maintaining the conceptual simplicity of RNNs is particularly appealing.

Read the full story here.


Meanwhile, Can We Reduce the Power Consumption of Neural Networks?

The power consumption of AI models has been discussed intensively over the last few years. And as we all know, it’s huge!

But there is a way to scale it down. A research paper by BitEnergy AI, titled ‘Addition is All You Need for Energy-Efficient Language Models’, notes that multiplying floating-point numbers consumes significantly more energy than integer operations.

The paper states that multiplying two 32-bit floating point numbers (fp32) costs four times more energy than adding two fp32 numbers and 37 times more than adding two 32-bit integers.

Neural networks typically perform computations using standard floating-point multiplication, which is computationally expensive and energy-intensive, especially for LLMs, which typically have billions of parameters. These operations consume significant computational resources and energy, particularly in attention mechanisms and matrix multiplications.

The researchers have proposed a new technique called linear-complexity multiplication (L-Mul), which tackles the problem of energy-intensive floating-point multiplications in large neural networks.

It uses straightforward bit operations and additions to avoid the costly multiplication of the numbers’ fractional parts (mantissas) and the associated rounding steps.
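The core arithmetic idea can be sketched in a few lines. For two floats written as (1 + fx)·2^ex and (1 + fy)·2^ey, the exact mantissa product is 1 + fx + fy + fx·fy; L-Mul replaces the cross-term fx·fy with a small constant, so only additions remain. The paper operates on low-precision bit patterns in hardware; the pure-Python version below (with a hypothetical `offset_bits` parameter) only illustrates the approximation, not the paper’s exact implementation:

```python
import math

def lmul(x: float, y: float, offset_bits: int = 4) -> float:
    """Toy L-Mul-style approximate multiply: the mantissa cross-term
    fx * fy is replaced by the constant 2**-offset_bits, so no
    mantissa multiplication is ever performed."""
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    xm, xe = math.frexp(abs(x))          # abs(x) == xm * 2**xe, xm in [0.5, 1)
    ym, ye = math.frexp(abs(y))
    fx, fy = 2 * xm - 1, 2 * ym - 1      # abs(x) == (1 + fx) * 2**(xe - 1)
    # (1+fx)(1+fy) = 1 + fx + fy + fx*fy  ~=  1 + fx + fy + 2**-offset_bits
    mantissa = 1 + fx + fy + 2.0 ** (-offset_bits)
    return sign * mantissa * 2.0 ** (xe - 1 + ye - 1)

print(lmul(3.14, 2.71), 3.14 * 2.71)    # approximation vs exact product
```

The result lands within a few percent of the true product for typical inputs, which is in line with the precision levels of the low-bit formats the technique targets.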

This approach not only reduces the computational cost but also potentially decreases energy consumption by up to 95% for element-wise floating-point tensor multiplications and 80% for dot products, while maintaining comparable or even superior precision to 8-bit floating-point operations in many cases.

Click here to read the full story.


AI Bytes

  • Google Cloud announced that it is partnering with Sequoia Capital, which will allow its portfolio companies to access Google’s various cloud services, credits, and enhanced support.
  • AI startup Sierra reportedly seeks new funding at a valuation of over $4 billion, roughly quadruple the $1 billion valuation from its deal earlier this year.
  • NVIDIA and Foxconn have unveiled plans to construct Taiwan’s largest supercomputer, housed at the Hon Hai Kaohsiung Supercomputing Center.
  • David Baker, an American biochemist and computational biologist, and Demis Hassabis and John M Jumper, two Google DeepMind scientists, have been awarded the Nobel Prize in Chemistry 2024 by The Royal Swedish Academy of Sciences.
  • At the Go Get Zero conference in London, Uber announced the upcoming launch of an AI assistant powered by OpenAI’s GPT-4.
  • Atlassian, an American-Australian software company, announced that its recently introduced team-building AI, Rovo, will now be available for all users.
  • OpenAI announced that its team received the first engineering builds of NVIDIA’s DGX B200. These new builds promise three times faster training speeds and fifteen times greater inference performance than previous models.
  • OpenAI is expanding into new cities, namely NYC, Seattle, Paris, Brussels, and Singapore, alongside its growing offices in San Francisco, London, Dublin, and Tokyo.
