ChatGPT in your pocket 24/7: What is your AI backup strategy?
AI assistants are down (Image generated by AI)

Today's ChatGPT and Microsoft Copilot outages appear to be linked to Bing API issues. Millions of people who use these assistants every day suddenly lost their vital companions for a few hours. We experienced it too: my team spent the downtime answering questions from internal clients who were eagerly waiting for Copilot to come back online. And it eventually did...

https://www.techradar.com/news/live/chatgpt-is-down-heres-what-we-know-about-the-outage-so-far

During such moments, I pull out my Apple Silicon-based devices and am glad I spent an evening in the fall of 2023 installing several large language models for potential offline usage.

Back then, it was mere intellectual curiosity and, frankly, a nice feeling to run "my own private ChatGPT" completely offline and in my pocket. Yes, it runs pretty well on both the iPhone Pro and the iPad Pro. Entirely local, anytime I need it. Feels good. A tiny Einstein now sits in your pocket.

Tiny AI Einstein in your pocket (Image generated by AI)


As companies revamp their internal (and often client-facing) processes and services, they rely more heavily on AI to perform, or at least support, many tasks. It is becoming essential to know what your backup strategy is. Reliance on big tech clouds simplifies a lot, but it also brings a new set of risks and challenges. How many eggs do you want to put into one basket? And how many minutes, hours, or even days can your operations survive without a cloud-based AI/LLM?

With more and more solutions depending directly on generative AI, we will also see a growing number of clients changing their expectations, particularly clients from more data-sensitive industries. They will demand that their vendors, service providers, law firms, and suppliers not use any cloud-based AIs/LLMs without their explicit consent. As a CTO, I have already seen some of these demands, and we now have a solution architecture in the pipeline that will enable our firm to meet them. It is not easy, but we already know how to do it.

A bit more technical part

For those curious about the details of the models I run locally on my iPad/iPhone, or "in my pocket," here are some stats and my observations. I have tested three models more extensively (I wish there were time to test others): RWKV-4 "Raven," Llama 2, and Mistral.

I run Llama 2 and Mistral in their 7-billion-parameter versions, and Raven in its 1.5B "small" version. All of them are 4-bit quantized (more on this below). After all, I only have 8 GB of RAM on my M1 iPad Pro. Storage requirements are modest: less than 4 GB for Llama 2 and Mistral, and just above 1 GB for Raven. Not an issue at all. Where you start to feel the current limits of mobile devices running LLMs is once the model is loaded into memory, particularly if you don't have the 16 GB RAM version of the iPad Pro (only some models have it). As you can read in my summary below, models eat a lot of memory.
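
To put rough numbers on these storage figures, here is a back-of-the-envelope sketch in Python. It is purely illustrative: the ~4.5 bits per weight is my assumption for typical q4 formats, and real model files come out slightly larger because of block scales and metadata.

```python
# Rough estimate of model size at different precisions (illustrative only).
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in GiB for n_params weights at the given precision."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

models = [("RWKV-4 Raven 1.5B", 1.5e9), ("Llama 2 7B", 7e9), ("Mistral 7B", 7.3e9)]
for name, params in models:
    fp16 = approx_size_gb(params, 16)   # unquantized half precision
    q4 = approx_size_gb(params, 4.5)    # ~4-bit quantized (q4 formats use roughly 4.5 bits/weight)
    print(f"{name}: ~{fp16:.1f} GB at fp16 vs ~{q4:.1f} GB at 4-bit")
```

That is why a quantized 7B model squeezes in under roughly 4 GB, while its unquantized fp16 counterpart would be far beyond what a phone's RAM can hold.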

Let’s start with the smallest model in my set. RWKV is a specific type of language model that incorporates elements of both Recurrent Neural Networks (RNNs) and Transformer architectures. It combines the best of RNNs and transformers: strong performance, fast inference, fast training, and low VRAM, GPU, and CPU demands (both for training and for running), among other benefits. Not all is shiny, though. One drawback is that it is weaker at tasks requiring lookback, so you need to reorder your prompt accordingly. While there are also 3B, 7B, and 14B versions of this model, even the 1.5B model is surprisingly good for its size. I can’t wait to test the RWKV-4 World model, trained on 100+ world languages (70% English, 15% multilingual, 15% code).
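
As a purely conceptual illustration (not real RWKV code), the toy sketch below contrasts an RNN-style generation loop, which folds all history into a fixed-size state, with a transformer-style loop whose key/value cache grows with every generated token. That constant-size state is where the memory and inference-speed benefits come from.

```python
# Toy contrast between RNN-style (RWKV-like) and transformer-style decoding.
# Dummy "models" only; the point is what grows and what stays constant.

def generate_rnn(n_tokens):
    state = 0                               # fixed-size recurrent state
    for i in range(n_tokens):
        state = (state * 31 + i) % 1000     # all history folded into the state
        yield f"tok{state}"                 # memory per step stays constant

def generate_transformer(n_tokens):
    kv_cache = []                           # one key/value entry per past token
    for i in range(n_tokens):
        kv_cache.append(("k", "v"))         # cache (and memory) grows every step
        yield f"tok{i} (cache size: {len(kv_cache)})"

print(list(generate_rnn(3)))
print(list(generate_transformer(3)))
```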

Llama 2 running on an iPad Pro (small context window configuration)


Now for Llama 2. Those of you reading this may remember when Meta (aka Facebook) released Llama 2 in July 2023. This model is trained on 2 trillion tokens and by default supports a context length of 4,096 tokens (although I have configured mine with a smaller prompt window to save on resources). The Llama 2 chat models are fine-tuned on over 1 million human annotations and are made for conversation.
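
To make the "smaller prompt window" idea concrete: on a laptop one could do the same thing with, say, the llama-cpp-python bindings, which expose the context size as a parameter (the iOS apps I use offer an equivalent setting in their UI). The model path and numbers below are placeholders, not my exact configuration.

```python
# Hypothetical sketch with llama-cpp-python; path, context size, and thread
# count are example values, not a recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # a 4-bit quantized GGUF file (example name)
    n_ctx=1024,      # smaller than the model's default 4,096-token context, to save RAM
    n_threads=4,     # how many CPU threads to dedicate to inference
)

out = llm(
    "Q: In one sentence, why is a local LLM a useful backup? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(out["choices"][0]["text"].strip())
```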

Meta AI (Facebook) reported that the 13B-parameter model outperformed OpenAI's much larger GPT-3 (175B parameters) on most NLP benchmarks. So when I feel in the mood to talk to a philosopher who never sleeps, I talk to Llama 2. And while it never sleeps, it requires occasional reboots of my iPhone to free up memory. It also likes to eat my battery. But then again, so does Mistral.

Mistral 7B running locally on an iPad Pro


Mistral 7B. It is a 7.3B-parameter model that outperforms Llama 2 13B on all benchmarks and Llama 1 34B on many benchmarks; at least, this is what Mistral AI claims. I did not have time to measure it in great detail, but for a 7B model, it performs very well, particularly considering it can run locally on a mobile device. For those curious about more details, please go here: https://mistral.ai/news/announcing-mistral-7b/

Less than a year after ChatGPT's release, we are able to carry GPT-3-like (perhaps GPT-3.5-like) performance in our pockets, anywhere we go. That is not bad progress at all, and it confirms the exponential growth of these technologies.

Notes on hardware requirements

Some of you may wonder whether newer iPhones or iPads can run these local LLMs "faster." Frankly, on my M1 iPad Pro I am getting a decent 11-12 tokens per second. This is not primarily about how fast your CPU is; the real bottleneck is memory. Memory requirements are directly connected to the model size. Specifically:

  • 7B models generally require at least 8 GB of RAM. Recent iPhones and iPads will do well. Remember, you need memory not only to load the model but also to run other applications alongside it.
  • 13B models require at least 16 GB of RAM (several iPad Pro models come with 16 GB of RAM, but not all; most iPhones have only between 4 GB and 8 GB of RAM).
  • 70B models typically require at least 64 GB of RAM (for these, you would need an Apple Mac Studio or a similar machine configurable with 64 GB or more). But then you can no longer carry it in your pocket :)

If you run into issues with higher quantization levels, try the q4 models (as I do). You can download 4-bit quantized models from Hugging Face, for example, and shut down any other programs that are using a lot of memory.
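
For instance, the huggingface_hub library can fetch a single quantized file directly. The repository and filename below are just one example of a community GGUF conversion, so check the model card for the exact files on offer.

```python
# Download one 4-bit quantized (q4) model file from Hugging Face.
# Repo and filename are examples of a community conversion, not an endorsement.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",    # community GGUF conversion (example)
    filename="llama-2-7b-chat.Q4_K_M.gguf",     # roughly 4 GB, 4-bit quantized
)
print("Saved to:", path)
```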

Summary

As one can see, we now have access to models optimized to be deployed on edge devices (rather than relying on the cloud only). Not only can they be used on edge (i.e., mobile) devices with limited computational resources, but perhaps even more importantly, they can run in fully offline mode.

Secondly, there are the cost efficiencies. Reduced computational requirements lead to lower operational costs, making it feasible to deploy these models at scale. Soon, LLMs will sit in many of the user interfaces we use on all sorts of household and industrial devices.

And last but not least, privacy and security. This is critical. Running models locally ensures that sensitive data never needs to be sent to external servers or to the cloud, thus enhancing privacy and security.

But more on this in a future post, where I intend to examine effective scaling of local LLMs for scenarios where data privacy within your organization is paramount.

Meanwhile, both Copilot and ChatGPT are back online :)
