At Tallyfy - we're exploring a breakthrough way of having tasks "done for you" on your local computer via AI. That means the AI has to run locally - for privacy, latency, performance, authentication and many other reasons. A bit more about our approach to this follows below.
Running large language models entirely offline on a CPU is increasingly feasible. Several open-source models can operate locally without internet or GPU, and some are even capable of analyzing screenshots/UI images by integrating vision (OCR and visual element recognition) into the language model. Below we outline key models and techniques, including their sizes, requirements, and setup considerations.
Lightweight LLMs for CPU-Only Inference
Open-source LLMs with billions of parameters can run on a modern CPU by using optimized formats and quantization. These models handle natural language tasks and can be paired with vision modules for image understanding:
- LLaMA 2 and Derivatives (7B–13B) – Meta’s LLaMA 2 (and fine-tunes like Vicuna and Alpaca) are popular base models. The 7B version can be quantized to 4-bit or 8-bit precision so that it fits in under ~8 GB of RAM, allowing CPU inference at a few tokens per second.
- Mistral 7B – A newer 7B-param model (Apache 2.0 licensed) that outperforms LLaMA 2 13B on many benchmarks.
- GPT-J and BLOOM (6B–7B) – GPT-J (6B) and smaller BLOOM variants (e.g. 7.1B) are older open models that can run on CPU with moderate resources. For example, GPT-J 6B in 8-bit needs ~12 GB RAM (or ~6 GB with 4-bit). They may not be as memory-efficient as LLaMA/Mistral, but optimizations exist (ONNX export or int8 quantization) to improve CPU throughput.
Note: Projects like GPT4All and Ollama bundle such models for one-click local use. “GPT4All runs large language models (LLMs) privately on everyday desktops & laptops. No API calls or GPUs required – you can just download the application and run models locally.”
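To make this concrete, here is a minimal sketch (not a prescribed setup) of loading a 4-bit quantized 7B model on a CPU with the llama-cpp-python bindings. The GGUF filename, thread count, and prompt are illustrative assumptions - substitute whichever quantized model you have downloaded.

```python
# pip install llama-cpp-python   (builds a CPU-only binary by default)
from llama_cpp import Llama

# Assumed local path to a 4-bit quantized GGUF file (e.g. a Mistral 7B Q4 build).
# After the one-time download, everything below runs fully offline.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,      # context window
    n_threads=8,     # roughly match your physical core count
)

out = llm(
    "List three reasons to run an LLM locally instead of calling a cloud API:",
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

On an ordinary 8-core desktop CPU, a 4-bit 7B setup like this typically produces a few tokens per second, in line with the figures above.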
Vision-Language Models with OCR & UI Understanding
For identifying text in screenshots and recognizing UI components, multimodal models are used. These combine a visual encoder (to process images) with a language model (to produce text descriptions or answers). Several open projects fall into this category:
- LLaVA (Large Language and Vision Assistant) – An open-source conversational AI that extends LLaMA with vision capabilities. It feeds image features (from a pretrained encoder, e.g. CLIP ViT-L/14) into the LLM, allowing it to describe images and UI screens. LLaVA was designed with a simpler architecture and trained on far less data than GPT-4V, making it “more suitable for inference on consumer hardware.”
- Pix2Struct (Google) – A vision-to-text transformer model specialized for understanding UI screens, web pages, and visually-rich documents. It uses a ViT image encoder and a text decoder (based on T5) to directly generate descriptions or structured text from an input image (see the sketch after this list).
- Donut (Document Understanding Transformer) – An OCR-free model by NAVER Clova that takes an image (document, form, or potentially a screenshot) and directly generates a structured textual output. Donut uses a Swin Transformer backbone for vision and a BART decoder for text.
- BLIP-2 and Similar Image Captioning/VQA Models – BLIP-2 (Salesforce) is a general-purpose vision-language model that can describe images and answer questions about them. It couples a ViT-based image encoder with a language model (e.g. an OPT 2.7B or FLAN-T5).
- LayoutLMv3 (Microsoft) – While not a generative captioning model, this is an open-source multimodal Transformer for documents/UI that combines image pixels and text tokens. It requires an OCR step first (to get text positions), but can then understand the layout and content jointly.
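To make the vision side concrete, below is a minimal sketch of running Pix2Struct’s screen2words variant on a screenshot with Hugging Face Transformers, entirely on CPU. The checkpoint name and file path are illustrative; any published Pix2Struct checkpoint can be swapped in.

```python
# pip install transformers torch pillow
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

# The "screen2words" checkpoint is fine-tuned to summarize UI screenshots.
model_id = "google/pix2struct-screen2words-base"
processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)  # runs on CPU by default

image = Image.open("screenshot.png")                # assumed local screenshot file
inputs = processor(images=image, return_tensors="pt")

# Generate a short natural-language description of the screen.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Because the base checkpoint is only a few hundred million parameters, this fits comfortably within a few GB of RAM.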
CPU Optimization Techniques and Format Support
To maximize performance on CPU and simplify setup, developers use various optimization frameworks:
- ONNX Runtime – Many of these models can be converted to the ONNX format for efficient CPU execution. ONNX Runtime with optimizations (operator fusion, OpenMP threading, and INT8/INT4 quantization) can significantly speed up inference. For instance, an int4-quantized Mistral 7B run via ONNX showed over 9× the throughput of naive methods and outperformed llama.cpp in tokens/s.
- TensorFlow Lite (TFLite) – For mobile and edge deployment, TFLite can take a model (usually a smaller or quantized version) and run it on ARM CPUs efficiently. There are examples of using TFLite for on-device OCR – Google’s example app shows how to “use TFLite to extract text from images on Android devices”.
- GGML / llama.cpp – This is a specialized format and runtime for LLMs that has become standard for CPU inference. Models are converted to GGML/GGUF files with 8-bit, 4-bit, or even 3-bit weights, enabling them to load in RAM-constrained environments. The llama.cpp engine then runs the model using optimized C++ code (with SIMD instructions) and can achieve decent speeds without any external dependencies. For example, running a 7B LLaMA derivative in 4-bit through llama.cpp might generate ~1 token per second per thread on a modern high-end CPU.
- Memory and Hardware Considerations – A multi-billion-parameter model in full precision can occupy tens of GB of RAM, which is why quantization is key for CPU use. As a rule of thumb: 1 billion parameters ~ 4 GB in FP32, ~2 GB in FP16, ~1 GB in int8, ~0.5 GB in int4. Thus, a 7B model would be ~28 GB in FP32 but only ~3.5–4 GB in 4-bit – which fits comfortably on a typical 16 GB system. Models like LLaVA that incorporate a vision encoder add some overhead (the vision backbone adds roughly 0.3–0.4B parameters; CLIP ViT-L/14 is ~428M). Disk space is also a factor (quantized models take a few GB of storage). In terms of compute, any modern x86-64 CPU (Intel or AMD) with AVX2 instructions can run these models, though more cores and higher clock speeds improve throughput. ARM64 CPUs (like Apple M-series or Raspberry Pi) are also supported by many frameworks, often with optimized libraries. It’s recommended to use CPUs with high memory bandwidth for best results.
- Supported Inputs/Outputs – The models mentioned support standard image and text formats. For image-based models, you can input images as files (PNG, JPEG, etc.) or as pixel tensors via libraries like PIL or OpenCV. OCR-capable models (Pix2Struct, Donut) can directly ingest an image and output text strings or JSON. If using an OCR engine separately (e.g. Tesseract, EasyOCR, or TF Lite model), they typically accept image files and return text strings plus coordinates. The LLM part then takes plain text as input. The final output from an LLM will be text – which could be a natural language description, a list of identified UI elements, or any format you prompt it to produce. Many vision-language models can be prompted to output in a structured way (for example, “List all buttons with their labels:” and the model might enumerate them).
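As a rough sketch of that separate-OCR pipeline, the snippet below pulls words and pixel coordinates out of a screenshot with Tesseract (via pytesseract) and hands the plain text to a local quantized LLM for interpretation. The confidence threshold, prompt wording, and model path are assumptions for illustration, not a prescribed setup.

```python
# pip install pytesseract pillow llama-cpp-python   (the Tesseract binary must also be installed)
import pytesseract
from PIL import Image
from llama_cpp import Llama

image = Image.open("screenshot.png")  # assumed local screenshot

# Step 1: OCR – Tesseract returns each word with its bounding box and a confidence score.
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = [
    f"{text} @({left},{top})"
    for text, left, top, conf in zip(ocr["text"], ocr["left"], ocr["top"], ocr["conf"])
    if text.strip() and float(conf) > 50  # drop empty and low-confidence detections
]

# Step 2: hand the recognized text to a local quantized LLM (path is hypothetical).
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, n_threads=8)
prompt = (
    "These words were extracted from a UI screenshot, with pixel positions:\n"
    + "\n".join(words)
    + "\n\nList the buttons and form fields you can infer, one per line:"
)
resp = llm(prompt, max_tokens=256, temperature=0.2)
print(resp["choices"][0]["text"])
```

An end-to-end multimodal model (LLaVA, Pix2Struct, Donut) collapses these two steps into one, at the cost of a larger model to load.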
Ease of Setup
Deploying these models locally has become easier thanks to open-source libraries:
- Hugging Face Transformers – Provides a unified API to download and run models like LLaMA, Pix2Struct, Donut, BLIP-2, etc. With a few lines of code you can load a processor and model (as shown in the BLIP-2 sketch after this list) and run inference.
- Pre-built Apps/Interfaces – As mentioned, GPT4All offers a desktop app for chatting with local models.
- Installation – Most of these models can be installed via pip or conda. For instance, pip install transformers onnxruntime covers many scenarios. For llama.cpp, you compile a small C++ program (or install a Python wrapper like ctransformers). Memory usage should be monitored to avoid swapping. No internet is needed after downloading the model files. Ensure your CPU has the necessary instruction support (AVX2 for modern PyTorch/ONNX, or NEON for ARM). Some frameworks (HF Accelerate, DeepSpeed Zero-Inference) can partition models across CPU cores or even disk if needed, but for our model sizes that’s usually unnecessary if properly quantized.
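Since the BLIP-2 example is referenced above, here is a minimal sketch of what that load-and-infer flow looks like with Transformers on CPU. The checkpoint and question are illustrative; the OPT-2.7B variant is one of the smaller published ones but still wants several GB of RAM in full precision.

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
# float32 on CPU; torch.bfloat16 roughly halves memory if your CPU handles it well.
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)

image = Image.open("screenshot.png")  # assumed local image
question = "Question: What buttons are visible on this screen? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```

The same load-processor-then-model pattern applies to Pix2Struct, Donut, and LLaVA-style checkpoints, which is what makes the Transformers route convenient for local experimentation.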
In summary, open-source LLMs can indeed run locally on CPU and even perform screenshot/UI analysis when paired with the right vision capabilities. Models like LLaVA, Pix2Struct, and Donut provide integrated solutions that read text and interpret interface elements directly from images. With model sizes in the range of a few hundred million to 7 billion parameters, they can be executed on a modern CPU (8–16 GB RAM) by leveraging quantization and optimized runtimes. Supported formats are standard (image files in, text out), and setup has been streamlined by frameworks and community tools. This enables privacy-friendly, offline understanding of screenshots – useful for automation, testing, or assisting users without sending data to the cloud.
#ai #computer-use #tallyfy #llm #trackable-ai