Is the next generation of YOLO detectors, aimed at real-time open-vocabulary object detection.
Pre-trained on large-scale vision-language datasets, including Objects365, GQA, Flickr30K, and CC3M, which empowers YOLO-World with strong zero-shot open-vocabulary capability and grounding ability in images.
Achieves fast inference and presents a re-parameterization technique that speeds up inference and deployment for a user-defined vocabulary.
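
The idea behind re-parameterizing a user vocabulary can be illustrated with a toy sketch: encode the vocabulary once, offline, and keep the resulting embeddings as fixed classifier weights, so that no text encoder runs at inference time. This is not YOLO-World's actual code. `toy_text_encoder` and `OfflineVocabularyHead` are hypothetical names, and the hash-based encoder is a stand-in assumption for a real text encoder.

```python
import hashlib
import math


def toy_text_encoder(prompt, dim=8):
    """Hypothetical stand-in for a real text encoder.

    Maps a prompt to a deterministic unit-length vector via SHA-256.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class OfflineVocabularyHead:
    """Toy classification head whose weights are precomputed text embeddings."""

    def __init__(self, vocabulary, dim=8):
        self.vocabulary = list(vocabulary)
        # The vocabulary is encoded once here; the embeddings then act as
        # ordinary fixed weights (assumption: this mirrors the general idea
        # of folding a user vocabulary into the model).
        self.weights = [toy_text_encoder(word, dim) for word in self.vocabulary]

    def classify(self, region_feature):
        # Inference is a plain dot product per class; no text encoder call.
        scores = [
            sum(f * w for f, w in zip(region_feature, row))
            for row in self.weights
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.vocabulary[best], scores


head = OfflineVocabularyHead(["dog", "bicycle", "traffic light"])
# Simulate a region feature that aligns with the "bicycle" embedding.
feature = toy_text_encoder("bicycle")
label, scores = head.classify(feature)
print(label)
```

Changing the vocabulary in this sketch only requires rebuilding the head, which is the property that makes deployment with a fixed, user-chosen vocabulary cheap.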