At Tallyfy - we're exploring a breakthrough way of having tasks "done for you" on your local computer via AI. That means the AI has to run locally - for privacy, latency, performance, authentication and many other reasons. A bit more about our approach to this follows below.
Running large language models entirely offline on a CPU is increasingly feasible. Several open-source models can operate locally without internet or GPU, and some are even capable of analyzing screenshots/UI images by integrating vision (OCR and visual element recognition) into the language model. Below we outline key models and techniques, including their sizes, requirements, and setup considerations.
Lightweight LLMs for CPU-Only Inference
Open-source LLMs with billions of parameters can run on a modern CPU by using optimized formats and quantization. These models handle natural language tasks and can be paired with vision modules for image understanding:
- LLaMA 2 and Derivatives (7B–13B) – Meta’s LLaMA 2 (and fine-tunes like Vicuna and Alpaca) are popular base models. The 7B version can be quantized to 4-bit or 8-bit precision so that it fits in under ~8 GB of RAM, allowing CPU inference at a few tokens per second.
- Mistral 7B – A newer 7B-param model (Apache 2.0 licensed) that outperforms LLaMA 2 13B on many benchmarks.
- GPT-J and BLOOM (6B–7B) – GPT-J (6B) and smaller BLOOM variants (e.g. 7.1B) are older open models that can run on CPU with moderate resources. For example, GPT-J 6B in 8-bit needs ~12 GB RAM (or ~6 GB with 4-bit). They may not be as memory-efficient as LLaMA/Mistral, but optimizations exist (ONNX export or int8 quantization) to improve CPU throughput.
Note: Projects like GPT4All and Ollama bundle such models for one-click local use. “GPT4All runs large language models (LLMs) privately on everyday desktops & laptops. No API calls or GPUs required – you can just download the application and run models locally.”
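To make this concrete, here is a minimal sketch (not a prescribed setup) of loading a 4-bit quantized 7B model on a CPU with the llama-cpp-python bindings. The GGUF filename, thread count, and prompt are illustrative assumptions - substitute whichever quantized model you have downloaded.

```python
# pip install llama-cpp-python   (builds a CPU-only binary by default)
from llama_cpp import Llama

# Assumed local path to a 4-bit quantized GGUF file (e.g. a Mistral 7B Q4 build).
# After the one-time download, everything below runs fully offline.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=4096,      # context window
    n_threads=8,     # roughly match your physical core count
)

out = llm(
    "List three reasons to run an LLM locally instead of calling a cloud API:",
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

On an ordinary 8-core desktop CPU, a 4-bit 7B setup like this typically produces a few tokens per second, in line with the figures above.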
Vision-Language Models with OCR & UI Understanding
For identifying text in screenshots and recognizing UI components, multimodal models are used. These combine a visual encoder (to process images) with a language model (to produce text descriptions or answers). Several open projects fall into this category:
- LLaVA (Large Language and Vision Assistant) – An open-source conversational AI that extends LLaMA with vision capabilities. It feeds image features (from a pretrained encoder, e.g. CLIP ViT-L/14) into the LLM, allowing it to describe images and UI screens. LLaVA was designed with a simpler architecture and trained on far less data than GPT-4V, making it “more suitable for inference on consumer hardware.”
- Pix2Struct (Google) – A vision-to-text transformer model specialized for understanding UI screens, web pages, and visually-rich documents. It uses a ViT image encoder and a text decoder (based on T5) to directly generate descriptions or structured text from an input image (see the sketch after this list).
- Donut (Document Understanding Transformer) – An OCR-free model by NAVER Clova that takes an image (document, form, or potentially a screenshot) and directly generates a structured textual output. Donut uses a Swin Transformer backbone for vision and a BART decoder for text.
- BLIP-2 and Similar Image Captioning/VQA Models – BLIP-2 (Salesforce) is a general-purpose vision-language model that can describe images and answer questions about them. It couples a ViT-based image encoder with a language model (e.g. an OPT 2.7B or FLAN-T5).
- LayoutLMv3 (Microsoft) – While not a generative captioning model, this is an open-source multimodal Transformer for documents/UI that combines image pixels and text tokens. It requires an OCR step first (to get text positions), but can then understand the layout and content jointly.
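To make the vision side concrete, below is a minimal sketch of running Pix2Struct’s screen2words variant on a screenshot with Hugging Face Transformers, entirely on CPU. The checkpoint name and file path are illustrative; any published Pix2Struct checkpoint can be swapped in.

```python
# pip install transformers torch pillow
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

# The "screen2words" checkpoint is fine-tuned to summarize UI screenshots.
model_id = "google/pix2struct-screen2words-base"
processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)  # runs on CPU by default

image = Image.open("screenshot.png")                # assumed local screenshot file
inputs = processor(images=image, return_tensors="pt")

# Generate a short natural-language description of the screen.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Because the base checkpoint is only a few hundred million parameters, this fits comfortably within a few GB of RAM.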
CPU Optimization Techniques and Format Support
To maximize performance on CPU and simplify setup, developers use various optimization frameworks:
- ONNX Runtime – Many of these models can be converted to the ONNX format for efficient CPU execution. ONNX Runtime with optimizations (operator fusion, OpenMP threading, and INT8/INT4 quantization) can significantly speed up inference. For instance, an int4-quantized Mistral 7B run via ONNX showed over 9× the throughput of naive methods and outperformed llama.cpp in tokens/s.
- TensorFlow Lite (TFLite) – For mobile and edge deployment, TFLite can take a model (usually a smaller or quantized version) and run it on ARM CPUs efficiently. There are examples of using TFLite for on-device OCR – Google’s example app shows how to “use TFLite to extract text from images on Android devices”.
- GGML / llama.cpp – This is a specialized format and runtime for LLMs that has become standard for CPU inference. Models are converted to GGML/GGUF files with 8-bit, 4-bit, or even 3-bit weights, enabling them to load in RAM-constrained environments. The llama.cpp engine then runs the model using optimized C++ code (with SIMD instructions) and can achieve decent speeds without any external dependencies. For example, running a 7B LLaMA derivative in 4-bit through llama.cpp might generate ~1 token per second per thread on a modern high-end CPU.
- Memory and Hardware Considerations – A multi-billion-parameter model in full precision can occupy tens of GB of RAM, which is why quantization is key for CPU use. As a rule of thumb: 1 billion parameters ~ 4 GB in FP32, ~2 GB in FP16, ~1 GB in int8, ~0.5 GB in int4. Thus, a 7B model would be ~28 GB in FP32 but only ~3.5–4 GB in 4-bit – which fits comfortably on a typical 16 GB system. Models like LLaVA that incorporate a vision encoder add some overhead (the vision backbone adds roughly 0.3–0.4B parameters; CLIP ViT-L/14 is ~428M). Disk space is also a factor (quantized models take a few GB of storage). In terms of compute, any modern x86-64 CPU (Intel or AMD) with AVX2 instructions can run these models, though more cores and higher clock speeds improve throughput. ARM64 CPUs (like Apple M-series or Raspberry Pi) are also supported by many frameworks, often with optimized libraries. It’s recommended to use CPUs with high memory bandwidth for best results.
- Supported Inputs/Outputs – The models mentioned support standard image and text formats. For image-based models, you can input images as files (PNG, JPEG, etc.) or as pixel tensors via libraries like PIL or OpenCV. OCR-capable models (Pix2Struct, Donut) can directly ingest an image and output text strings or JSON. If using an OCR engine separately (e.g. Tesseract, EasyOCR, or TF Lite model), they typically accept image files and return text strings plus coordinates. The LLM part then takes plain text as input. The final output from an LLM will be text – which could be a natural language description, a list of identified UI elements, or any format you prompt it to produce. Many vision-language models can be prompted to output in a structured way (for example, “List all buttons with their labels:” and the model might enumerate them).
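As a rough sketch of that separate-OCR pipeline, the snippet below pulls words and pixel coordinates out of a screenshot with Tesseract (via pytesseract) and hands the plain text to a local quantized LLM for interpretation. The confidence threshold, prompt wording, and model path are assumptions for illustration, not a prescribed setup.

```python
# pip install pytesseract pillow llama-cpp-python   (the Tesseract binary must also be installed)
import pytesseract
from PIL import Image
from llama_cpp import Llama

image = Image.open("screenshot.png")  # assumed local screenshot

# Step 1: OCR – Tesseract returns each word with its bounding box and a confidence score.
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words = [
    f"{text} @({left},{top})"
    for text, left, top, conf in zip(ocr["text"], ocr["left"], ocr["top"], ocr["conf"])
    if text.strip() and float(conf) > 50  # drop empty and low-confidence detections
]

# Step 2: hand the recognized text to a local quantized LLM (path is hypothetical).
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096, n_threads=8)
prompt = (
    "These words were extracted from a UI screenshot, with pixel positions:\n"
    + "\n".join(words)
    + "\n\nList the buttons and form fields you can infer, one per line:"
)
resp = llm(prompt, max_tokens=256, temperature=0.2)
print(resp["choices"][0]["text"])
```

An end-to-end multimodal model (LLaVA, Pix2Struct, Donut) collapses these two steps into one, at the cost of a larger model to load.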
Ease of Setup
Deploying these models locally has become easier thanks to open-source libraries:
- Hugging Face Transformers – Provides a unified API to download and run models like LLaMA, Pix2Struct, Donut, BLIP-2, etc. With a few lines of code you can load a processor and model (as shown in the BLIP-2 sketch after this list) and run inference.
- Pre-built Apps/Interfaces – As mentioned, GPT4All offers a desktop app for chatting with local models.
- Installation – Most of these models can be installed via pip or conda. For instance, pip install transformers onnxruntime covers many scenarios. For llama.cpp, you compile a small C++ program (or install a Python wrapper like ctransformers). Memory usage should be monitored to avoid swapping. No internet is needed after downloading the model files. Ensure your CPU has the necessary instruction support (AVX2 for modern PyTorch/ONNX, or NEON for ARM). Some frameworks (HF Accelerate, DeepSpeed Zero-Inference) can partition models across CPU cores or even disk if needed, but for our model sizes that’s usually unnecessary if properly quantized.
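Since the BLIP-2 example is referenced above, here is a minimal sketch of what that load-and-infer flow looks like with Transformers on CPU. The checkpoint and question are illustrative; the OPT-2.7B variant is one of the smaller published ones but still wants several GB of RAM in full precision.

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
# float32 on CPU; torch.bfloat16 roughly halves memory if your CPU handles it well.
model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float32)

image = Image.open("screenshot.png")  # assumed local image
question = "Question: What buttons are visible on this screen? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```

The same load-processor-then-model pattern applies to Pix2Struct, Donut, and LLaVA-style checkpoints, which is what makes the Transformers route convenient for local experimentation.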
In summary, open-source LLMs can indeed run locally on CPU and even perform screenshot/UI analysis when paired with the right vision capabilities. Models like LLaVA, Pix2Struct, and Donut provide integrated solutions that read text and interpret interface elements directly from images. With model sizes in the range of a few hundred million to 7 billion parameters, they can be executed on a modern CPU (8–16 GB RAM) by leveraging quantization and optimized runtimes. Supported formats are standard (image files in, text out), and setup has been streamlined by frameworks and community tools. This enables privacy-friendly, offline understanding of screenshots – useful for automation, testing, or assisting users without sending data to the cloud.
#ai #computer-use #tallyfy #llm #trackable-ai