YOLO-World: A Fresh Approach to Object Detection Integrating Image Features and Text Embeddings

YOLO-World introduces a highly efficient open-vocabulary object detection framework with real-time inference capabilities and simplified deployment. What sets it apart from other methods is the combination of a novel YOLO framework and an efficient pre-training strategy, resulting in enhanced performance and generalization for open-vocabulary object detection.

Key Features of YOLO-World

Real-Time Inference: YOLO-World operates in real-time, making it suitable for dynamic scenarios where timely detection is crucial.

Open-Vocabulary Detection: Unlike traditional object detectors that rely on predefined object categories, YOLO-World can detect any object based on descriptive texts. This flexibility is a significant leap forward in the field.

YOLOv8 Framework: YOLO-World is built upon the Ultralytics YOLOv8 framework, which provides a strong foundation for efficient and accurate object detection.

Zero-Shot Inference on LVIS

Recently, YOLO-World models have been integrated with the FiftyOne computer vision toolkit, allowing streamlined open-vocabulary inference across image and video datasets. Additionally, YOLO-World-L demonstrates impressive zero-shot inference capabilities on the LVIS benchmark.
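As a rough illustration, the sketch below shows how zero-shot inference with a YOLO-World checkpoint looks through the Ultralytics Python API; the weight file name, prompt list, and image path are illustrative placeholders rather than values from this article.

```python
# Minimal zero-shot inference sketch with the Ultralytics YOLO-World API.
# The checkpoint name, prompts, and image path below are illustrative.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8l-worldv2.pt")           # load a pre-trained YOLO-World-L checkpoint
model.set_classes(["person", "bicycle", "dog"])   # define the open vocabulary via text prompts
results = model.predict("street.jpg", conf=0.25)  # detect only the prompted categories
results[0].show()                                 # visualize the predicted boxes
```

Because the vocabulary is set purely through text prompts, the same weights can be re-pointed at new categories without any retraining.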

1. Pre-training Formulation: Region-Text Pairs

Traditional object detection methods rely on instance annotations (B, c) consisting of bounding boxes B and category labels c. YOLO-World reformulates these annotations as region-text pairs (B, t), where t is the text corresponding to the region B. This text can be a category name, a noun phrase, or an object description. YOLO-World takes both the image I and a set of nouns T as input and predicts bounding boxes along with corresponding object embeddings.
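To make the reformulation concrete, here is a minimal, hypothetical sketch of turning classic (box, class-id) annotations into region-text pairs; the helper name and category list are illustrative, not part of the YOLO-World codebase.

```python
# Hypothetical sketch: re-label detection annotations as region-text pairs.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def to_region_text_pairs(boxes: List[Box],
                         class_ids: List[int],
                         category_names: List[str]) -> List[Tuple[Box, str]]:
    """Turn (box, class-id) annotations into (box, text) pairs; the text is the
    category name here, but it could equally be a noun phrase or description."""
    return [(b, category_names[c]) for b, c in zip(boxes, class_ids)]

# Example: two annotated instances become two region-text pairs
pairs = to_region_text_pairs(
    boxes=[(10, 20, 110, 220), (30, 40, 90, 120)],
    class_ids=[0, 1],
    category_names=["person", "backpack"],
)
```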

2. Model Architecture

YOLO-World’s architecture comprises three main components:

  • YOLO Detector: Based on YOLOv8, it includes a Darknet backbone as the image encoder, a path aggregation network (PAN) for multi-scale feature pyramids, and a head for bounding box regression and object embeddings.
  • Text Encoder: YOLO-World uses a pre-trained Transformer text encoder (from CLIP) to extract text embeddings W = TextEncoder(T) ∈ ℝ^(C×D), where C is the number of nouns and D is the embedding dimension.
  • Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN): This component enhances both text and image representations by fusing image features and text embeddings.

3. Text Encoder Choice

YOLO-World leverages the CLIP text encoder for its strong visual-semantic alignment: it connects visual objects with text more effectively than text-only language encoders do.
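For intuition, here is a minimal sketch of extracting the noun-vocabulary embeddings with a pre-trained CLIP text encoder via Hugging Face Transformers; the checkpoint name is an assumption and not necessarily the exact encoder shipped with YOLO-World.

```python
# Sketch: obtain text embeddings for the noun vocabulary with a CLIP text encoder.
# The checkpoint name and noun list are illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

nouns = ["person", "bicycle", "traffic light"]            # C nouns
tokens = tokenizer(nouns, padding=True, return_tensors="pt")
with torch.no_grad():
    W = text_encoder(**tokens).text_embeds                # shape (C, D)
W = W / W.norm(dim=-1, keepdim=True)                      # L2-normalize the embeddings
```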

4. Text-guided CSPLayer

The cross-stage partial layers (CSPLayer) are employed after top-down or bottom-up fusion. YOLO-World extends the CSPLayer by incorporating text guidance into multi-scale image features. Max-sigmoid attention is used to aggregate text features into image features.
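A simplified sketch of the max-sigmoid attention idea follows; the projection layers, normalization, and exact tensor layout of the real text-guided CSPLayer are omitted, so treat the shapes and scaling as assumptions.

```python
# Minimal sketch of max-sigmoid attention: every spatial location of an image
# feature map is re-weighted by the sigmoid of its maximum similarity to any
# text embedding. Projections and normalization are omitted (assumptions).
import torch

def max_sigmoid_attention(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: image features (B, C, H, W); w: text embeddings (N, C)."""
    sim = torch.einsum("bchw,nc->bnhw", x, w)   # pixel-to-noun similarity
    attn = sim.max(dim=1).values.sigmoid()      # (B, H, W): max over nouns, then sigmoid
    return x * attn.unsqueeze(1)                # gate the image features with text guidance

x = torch.randn(2, 256, 20, 20)   # one level of the multi-scale feature pyramid
w = torch.randn(5, 256)           # embeddings for 5 text prompts
out = max_sigmoid_attention(x, w)
```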

5. Image-Pooling Attention

To enhance text embeddings with image-aware information, YOLO-World proposes Image-Pooling Attention. Instead of applying cross-attention directly to the full image feature maps, it max-pools each multi-scale feature map down to a 3×3 grid, yielding 27 patch tokens X̃ ∈ ℝ^(27×D). The text embeddings are then updated through multi-head attention over these tokens.
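The sketch below illustrates this pooling-then-cross-attention idea under simplifying assumptions; the projection layers and the exact residual formulation may differ from the actual implementation.

```python
# Sketch of Image-Pooling Attention: each multi-scale feature map is max-pooled
# to a 3x3 grid, the resulting 27 patch tokens serve as keys/values, and the
# text embeddings are updated via multi-head cross-attention (shapes assumed).
from typing import List
import torch
import torch.nn as nn
import torch.nn.functional as F

def image_pooling_attention(text: torch.Tensor,
                            feats: List[torch.Tensor],
                            attn: nn.MultiheadAttention) -> torch.Tensor:
    """text: (B, C, D) text embeddings; feats: list of (B, D, H, W) feature maps."""
    tokens = []
    for f in feats:
        pooled = F.adaptive_max_pool2d(f, output_size=3)        # (B, D, 3, 3)
        tokens.append(pooled.flatten(2).transpose(1, 2))        # (B, 9, D)
    patches = torch.cat(tokens, dim=1)                          # (B, 27, D)
    updated, _ = attn(query=text, key=patches, value=patches)   # cross-attention
    return text + updated                                       # residual update of text embeddings

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
text = torch.randn(2, 5, d)                               # 5 text prompts
feats = [torch.randn(2, d, s, s) for s in (80, 40, 20)]   # three pyramid levels
new_text = image_pooling_attention(text, feats, attn)
```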

6. Pre-training Schemes

YOLO-World is pre-trained on large-scale detection, grounding, and image-text datasets. It is trained with a region-text contrastive loss that aligns predicted object embeddings with their corresponding texts, using task-aligned label assignment to match predictions to ground-truth annotations.
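As a rough sketch of such a contrastive objective (not the exact loss implementation), each predicted object embedding can be scored against all text embeddings and trained with cross-entropy toward its assigned text; the temperature value and the assignment below are placeholders.

```python
# Sketch of a region-text contrastive loss: similarity between object and text
# embeddings, trained toward the assigned text index. Values are placeholders.
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(obj_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 assigned_txt: torch.Tensor,
                                 tau: float = 0.05) -> torch.Tensor:
    """obj_emb: (K, D) embeddings of K predicted regions;
    txt_emb: (C, D) embeddings of C texts;
    assigned_txt: (K,) index of the assigned text for each region."""
    obj = F.normalize(obj_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = obj @ txt.t() / tau            # (K, C) region-text similarity scores
    return F.cross_entropy(logits, assigned_txt)

loss = region_text_contrastive_loss(torch.randn(8, 256), torch.randn(5, 256),
                                    torch.randint(0, 5, (8,)))
```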

Conclusion

YOLO-World demonstrates that small models can be effective for vision-language pre-training, achieving strong open-vocabulary capabilities. Its combination of fine-grained detection, classification, and referring abilities makes it a powerful vision-language model.
