Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Credit: https://publications.reka.ai/reka-core-tech-report.pdf

This paper introduces Reka Core, Flash (21B), and Edge (7B), a series of powerful multimodal language models trained from scratch by Reka. The models can process and reason over text, image, video, and audio inputs.

Method Overview

The Reka models use a modular encoder-decoder transformer architecture that supports text, image, video, and audio inputs. The model's text output can also invoke external tools via function calls, such as web search and code execution.
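
Below is a minimal, illustrative PyTorch sketch of this kind of modular design: separate encoders project image or audio features into the text model's embedding space, and the resulting tokens are concatenated with the text embeddings before a transformer stack. The class names, dimensions, and layer counts are assumptions for illustration, not the actual Reka architecture (a causal mask and the video pathway are omitted for brevity).

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Illustrative encoder that maps one modality's features
    (e.g. image patches or audio frames) into the text model's embedding space."""
    def __init__(self, feature_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(feature_dim, model_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches_or_frames, feature_dim)
        return self.proj(features)

class MultimodalLM(nn.Module):
    """Toy multimodal wrapper: encoded modality tokens are prepended to the
    text token embeddings before the transformer stack runs. Video would be
    handled analogously with its own encoder."""
    def __init__(self, vocab_size: int = 32000, model_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.image_encoder = ModalityEncoder(1024, model_dim)  # hypothetical feature dims
        self.audio_encoder = ModalityEncoder(256, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(model_dim, vocab_size)

    def forward(self, text_ids, image_feats=None, audio_feats=None):
        parts = []
        if image_feats is not None:
            parts.append(self.image_encoder(image_feats))
        if audio_feats is not None:
            parts.append(self.audio_encoder(audio_feats))
        parts.append(self.embed(text_ids))
        hidden = self.backbone(torch.cat(parts, dim=1))
        return self.lm_head(hidden)

# Usage: a batch with 16 image patch embeddings and 8 text tokens.
model = MultimodalLM()
logits = model(torch.randint(0, 32000, (1, 8)),
               image_feats=torch.randn(1, 16, 1024))
print(logits.shape)  # (1, 24, 32000)
```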

The training data comprises a mixture of public and proprietary datasets with text, images, videos, and audio clips. Reka Flash and Edge were trained on 5 trillion and 4.5 trillion deduplicated language tokens respectively. The data includes code, STEM content, web crawl, and math-related corpora. About 15% of the data is multilingual, covering 32 diverse languages.
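
As a rough illustration of how such a mixture might be sampled during pretraining, here is a small Python sketch. Only the category names and the roughly 15% multilingual share come from the report; the remaining weights are hypothetical placeholders.

```python
import random

# Hypothetical sampling weights for a pretraining mixture; only the category
# names and the ~15% multilingual share come from the report, the rest are
# placeholders for illustration.
MIXTURE_WEIGHTS = {
    "web_crawl": 0.45,
    "code": 0.20,
    "stem": 0.12,
    "math": 0.08,
    "multilingual": 0.15,  # covering 32 languages per the report
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source in proportion to its mixture weight."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # empirical counts roughly match the weights
```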

After pretraining, the models go through instruction tuning and alignment using reinforcement learning from human feedback (RLHF).
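
As a hedged sketch of one standard ingredient of such a pipeline (not the report's exact procedure), the reward model used in RLHF is commonly trained with a pairwise Bradley-Terry loss that pushes the score of a preferred response above that of a rejected one:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss commonly used to train the RLHF reward model:
    maximize the log-probability that the chosen response outranks the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy reward scores for a batch of 4 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -0.1, 1.5])
print(reward_model_loss(r_chosen, r_rejected))  # scalar loss
```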

Results

Reka Core approaches the performance of GPT-4 and other frontier models on benchmarks such as MMLU, GSM8K, and VQAv2, as well as in human evaluations of multimodal and text-only chat.

Reka Flash and Edge set a new state of the art for their compute class, often surpassing much larger models. For example, Flash outperforms GPT-3.5, Grok-1, and Gemini Pro 1.0 on many benchmarks, while Edge outperforms Gemma 7B and Mistral 7B.

The models also show strong multilingual and domain-specific (e.g. medical) capabilities compared to specialized models and GPT-4.

Conclusion

The Reka Core, Flash and Edge models demonstrate powerful multimodal reasoning capabilities, especially for their compute scales. For more information please consult the full paper.

Congrats to the authors for their work!

Reka Team. "Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models." arXiv preprint, 2024.
