On-Device LLMs: The Future is Edge AI
Navdeep Singh Gill
Building XenonStack | Agentic AI | Physical AI | Data Foundry | AGI and Quantum Futurist | Author | Speaker
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance across a wide range of tasks.
LLMs can understand human language and, backed by a huge knowledge base, handle most language-based ML tasks, e.g., language translation, Q&A, and smart reply.
LLMs catalyze mobile applications such as UI automation driven by user instructions, e.g., "forward the 3 most recent emails".
LLMs mark a giant step for mobile devices towards more intelligent and personalized assistive agents.
Local LLM use cases are growing, as noted by Matt Rickard.
His post lays out some interesting properties of on-device AI, contrasted with the benefits of server-side inference.
LLM as a System Service on Mobile Devices
LLM-as-a-Service (LLMaaS), LLM as a system service on mobile devices, will be a new paradigm of mobile AI, as proposed by Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu in their research paper.
To fully explore the design space of chunk-level memory management, LLMS incorporates three novel techniques:
(1) Tolerance-Aware Compression
(2) Swapping-Recompute Pipeline
(3) Chunk Lifecycle Management
The paper realizes LLMaaS with LLMS, a concrete LLM service design based on fine-grained, chunk-wise KV cache compression and swapping.
LLMS decouples the memory management of LLM contexts from the apps. It aims to minimize the LLM context-switching overhead under a tight memory budget, much as traditional mobile memory mechanisms focus on reducing app cold-start latency.
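To make the chunk-wise design concrete, here is a toy Python sketch of a KV cache managed in fixed-size chunks under a memory budget. It is an illustration of the concept only, not the paper's implementation: the chunk size, the plain int8 quantization standing in for tolerance-aware compression, and the class names are all assumptions, and the swapping-recompute pipeline is only noted in comments.

```python
import numpy as np

CHUNK_TOKENS = 16  # tokens per KV chunk; the granularity here is an assumption

class KVChunk:
    """One chunk of the KV cache covering CHUNK_TOKENS tokens."""
    def __init__(self, kv: np.ndarray):
        self.kv = kv            # full-precision K/V activations
        self.compressed = None  # int8 payload once the chunk is evicted
        self.scale = 1.0

    def compress(self):
        # Stand-in for tolerance-aware compression: LLMS picks a per-chunk
        # precision from its measured accuracy tolerance; we just use int8.
        self.scale = float(np.abs(self.kv).max()) / 127.0 or 1.0
        self.compressed = np.round(self.kv / self.scale).astype(np.int8)
        self.kv = None          # release the full-precision copy

    def decompress(self) -> np.ndarray:
        self.kv = self.compressed.astype(np.float16) * self.scale
        self.compressed = None
        return self.kv

class ChunkCache:
    """Keeps at most `budget` uncompressed chunks; the coldest chunk is
    compressed when the budget is exceeded (a simplified chunk lifecycle).
    Real LLMS also swaps chunks out to storage and overlaps swap-in with
    recomputation (the swapping-recompute pipeline)."""
    def __init__(self, budget: int):
        self.budget = budget
        self.chunks = []

    def append(self, kv: np.ndarray):
        self.chunks.append(KVChunk(kv))
        hot = [c for c in self.chunks if c.kv is not None]
        if len(hot) > self.budget:
            hot[0].compress()   # oldest uncompressed chunk is the coldest

    def full_context(self) -> np.ndarray:
        # Reassemble the whole context when the app's conversation resumes.
        return np.concatenate(
            [c.kv if c.kv is not None else c.decompress() for c in self.chunks])
```

In a real system the compressed pool would itself be bounded, with cold chunks swapped to storage and swap-in overlapped with recomputation so the decode stream never stalls.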
Apple Introduced Their Focus on Running LLMs on Apple Devices (Gemini and OpenAI)
There are many challenges in running large language models (LLMs) on devices with constrained memory capacities. Apple mentioned in its research that its approach, "deeply rooted in the understanding of flash memory and DRAM characteristics, represents a novel convergence of hardware-aware strategies and machine learning. By developing an inference cost model that aligns with these hardware constraints, we have introduced two innovative techniques: 'windowing' and 'row-column bundling.'"
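To see what windowing buys, consider a hedged Python sketch: only the FFN neurons that were active for a sliding window of recent tokens are kept in DRAM, so each new token loads just a small delta from flash. The window size, `flash_load_rows`, and `NeuronWindow` are illustrative names and assumptions, not Apple's implementation; row-column bundling is likewise only hinted at in a comment.

```python
from collections import deque

WINDOW = 5  # number of recent tokens whose active neurons stay cached (assumed)

def flash_load_rows(neuron_ids):
    """Stand-in for reading the up/down-projection rows of these neurons from
    flash; row-column bundling would store both together so each neuron costs
    a single contiguous read."""
    return {n: f"weights[{n}]" for n in neuron_ids}

class NeuronWindow:
    def __init__(self):
        self.history = deque(maxlen=WINDOW)  # active neuron sets of recent tokens
        self.dram = {}                       # neuron id -> cached weight rows

    def step(self, active_neurons: set):
        """Called once per generated token with the (predicted) active set."""
        cached = set().union(*self.history) if self.history else set()
        to_load = active_neurons - cached            # only the delta hits flash
        self.dram.update(flash_load_rows(to_load))
        self.history.append(active_neurons)
        live = set().union(*self.history)            # evict neurons outside the window
        for n in list(self.dram):
            if n not in live:
                del self.dram[n]
        return to_load

# Consecutive tokens share most active neurons, so flash traffic stays small.
w = NeuronWindow()
print(len(w.step({1, 2, 3, 4})))  # 4 rows loaded (cold start)
print(len(w.step({2, 3, 4, 5})))  # 1 row loaded (only neuron 5 is new)
```

Because consecutive tokens activate largely overlapping neuron sets, the delta loaded per token is small, which is exactly the I/O reduction the paper targets.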
Running large language models (LLMs) on the edge is useful: MIT Introduced TinyChatEngine
It enables copilot services (coding, office, smart reply) on laptops, cars, robots, and more. Users get instant responses with better privacy, as the data stays local.
This is enabled by LLM model compression techniques: SmoothQuant and AWQ (Activation-aware Weight Quantization), co-designed with TinyChatEngine, which implements the compressed low-precision models.
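The intuition behind AWQ can be sketched in a few lines: measure per-channel activation magnitudes on calibration data, scale up the corresponding weight channels before low-bit rounding, and fold the inverse scale back in after dequantization, so rounding error is pushed away from the channels the activations care about most. This is a toy illustration of the idea, not the AWQ authors' or TinyChatEngine's code; the function name and the 0.5 exponent are assumptions.

```python
import numpy as np

def awq_style_quantize(W, act_sample, n_bits=4, alpha=0.5):
    """W: (out, in) weight matrix; act_sample: (tokens, in) calibration activations.
    Returns the dequantized weights and the per-input-channel scales."""
    importance = np.abs(act_sample).mean(axis=0)      # per-channel activation magnitude
    s = np.clip(importance, 1e-8, None) ** alpha      # protection factor per channel
    s /= s.mean()                                     # keep overall magnitude stable
    W_scaled = W * s                                  # fold the scales into the weights

    qmax = 2 ** (n_bits - 1) - 1                      # 7 for int4
    step = np.abs(W_scaled).max(axis=1, keepdims=True) / qmax
    step = np.maximum(step, 1e-8)
    Wq = np.clip(np.round(W_scaled / step), -qmax - 1, qmax)

    return Wq * step / s, s                           # undo the scaling after dequant

# Quick check: channels with large activations keep more precision.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=(256, 128)).astype(np.float32)
x[:, :8] *= 20.0                                      # a few dominant channels
W_awq, _ = awq_style_quantize(W, x)
W_rtn, _ = awq_style_quantize(W, np.ones_like(x))     # uniform importance = plain rounding
err = lambda Wd: np.abs(x @ W.T - x @ Wd.T).mean()
print(f"plain int4 error: {err(W_rtn):.3f}  activation-aware: {err(W_awq):.3f}")
```

Since the scaling is mathematically cancelled at inference time, the only net effect is where the rounding error lands.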
Generative AI models under 10 billion parameters can run on edge devices.
Microsoft Introduced Their AI Model Phi-3-mini
Microsoft has introduced a new AI model called Phi-3-mini, the smallest model in its Phi-3 family of language models.
It is designed to bring advanced language, coding, and math capabilities to smaller businesses and organizations.
The Phi-3-mini model can be applied to content creation, local processing, everyday tasks, chatbots, math problem solving, and custom AI applications.
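As a quick illustration of running such a small model locally, here is a minimal sketch using the Hugging Face transformers pipeline; the model id reflects the public Phi-3-mini release, and the flags are reasonable defaults that may need adjusting for your hardware.

```python
# Requires: pip install transformers accelerate
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,   # Phi-3 shipped with custom modeling code
    device_map="auto",        # uses the GPU if present, otherwise CPU
)

result = generator(
    "Solve step by step: what is 15% of 240?",
    max_new_tokens=128,
)
print(result[0]["generated_text"])
```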