Arena Pro

Image: Adobe Stock

Qwen-VL VLMs for zero- and few-shot object detection

Technology and Industry

Vision Language Models (VLMs) can understand both images and text. Users can provide an image and instruct the model to perform vision-related tasks. In this article, we explore how well open Qwen3-VL models perform in a manufacturing-related object detection use case.

One recent advancement in Generative AI and Large Language Models (LLMs) is the introduction of Vision-Language Models (VLMs). VLMs have evolved rapidly to handle a variety of vision-related tasks, such as detecting objects in images.

VLMs extend the capabilities of traditional LLMs by incorporating image understanding (Bai et al., 2025). This means that, in addition to text, these models can be prompted with images. They can reason across both modalities, allowing users to instruct the model to perform vision-related tasks using an attached image.

A VLM is an example of a multimodal language model. Beyond text and images, multimodal models may also process video and audio (Xu et al., 2025). Commercial AI services, such as OpenAI’s ChatGPT, have supported image-based tasks for some time, and more recently an increasing number of instruction-tuned open models have begun to support multimodal inputs (Li et al., 2025).

The Qwen family of LLMs is one example of open models that support image input. In this article, we evaluate how Qwen3-VL models perform in object detection tasks by experimenting with an open dataset related to manufacturing.

Object detection

A useful task involving image data is identifying and locating objects or areas of interest in images. This long-standing problem in computer vision is commonly referred to as “object detection” (Murel & Kavlakoglu, n.d.).

Modern VLMs perform well in object detection. They can even operate in a so-called zero-shot setting, where the model has not seen any examples of correct detections. For example, a zero-shot task might involve prompting the model with: “Detect all cats in this image and provide the bounding box (bbox) coordinates,” along with an image attachment. The model relies on the knowledge about the world that it learned during training and follows the user’s text instructions to complete the task: it understands what a cat looks like, detects the animal based solely on the instructions, and responds with rectangular bounding box coordinates that define the boundaries of any object it identifies as a cat. In some cases, VLMs in zero-shot settings can even outperform previous state-of-the-art few-shot object detection models (Madan et al., 2024).

In addition to zero-shot, VLMs can be prompted in one-shot or few-shot settings. In a one-shot setting, only one example of the desired task is provided before the final request. In a few-shot setting, the model receives multiple examples before the final request. These examples help condition the model to produce more desirable results (Madan et al., 2024).

Object detection has many applications across different domains. Identifying objects or areas of interest in images enables numerous downstream tasks. For example, in manufacturing, object detection can verify that all components are correctly placed in the final product. This information can then be used for quality control or other tasks further along the manufacturing process.

Experiments

We experimented with three different Qwen3-VL model types: Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8 (Qwen Team, 2025a). All of these models are mixture-of-experts (MoE) vision-language models (Fedus, 2022).

For LLMs, a key property is the number of parameters. Generally, more parameters indicate better performance, but they also require greater computational resources. In the Qwen3 family of models, the model name indicates the total number of parameters and the number of active parameters per token during generation. For example, Qwen3-VL-30B-A3B-Thinking has 30 billion total parameters, of which 3 billion are active per generated token. Models with “Thinking” in their name are reasoning VLMs that perform intermediate reasoning steps before answering, while the “Instruct” model answers immediately without reasoning (Bergmann, n.d.).

The larger 235-billion-parameter model required 8-bit parameter quantization because our computing environment did not have enough VRAM to run the original 16-bit precision model. The “FP8” at the end of the model name signifies that the model is quantized to 8-bit precision, meaning that the original 16-bit parameters are converted to lower 8-bit precision for more efficient resource usage (Clark, n.d.).

Qwen3-VL models are open VLMs, meaning anyone can download and run them in their own environment. They are trained and published by Alibaba Cloud and licensed under Apache-2.0, a permissive open-source license that places very few restrictions on how the models can be used (Qwen Team, 2025b).

We used our own computing environment at Jamk Institute of Information Technology, which includes two servers equipped with modern GPUs. One server has two NVIDIA H100 GPUs, and the other has three NVIDIA A100 (80GB) GPUs.

We used the vLLM serving engine (v0.11.0) to run the models (vLLM‑project, 2025). The smaller 30-billion-parameter models were executed on one server in a tensor-parallel configuration across two GPUs. The larger 235-billion-parameter model required significantly more memory, so it could not run on a single server. Fortunately, the vLLM project supports multi-node inference via the Ray framework, which allowed us to run the model on a heterogeneous platform spanning the two H100 and three A100 GPUs. We used a configuration with 2-way tensor parallelism and 2-way pipeline parallelism across the two servers: the first half of the layers was placed on the first node and split across its two GPUs, while the remaining layers were placed on the second node and split across two of its GPUs.
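As a rough sketch, the two setups could be launched with commands along these lines; the exact flags should be checked against the vLLM documentation for the version in use, and the multi-node run assumes a Ray cluster is already up on both servers:

```shell
# Single server, 2-way tensor parallelism (two H100 GPUs)
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
    --tensor-parallel-size 2

# Two servers via Ray: 2-way tensor parallelism x 2-way pipeline parallelism
vllm serve Qwen/Qwen3-VL-235B-A22B-Thinking-FP8 \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray
```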

All scripts used for the experiments are available at: https://github.com/janne-alatalo/Qwen3-VL-for-Object-Detection (Alatalo, 2025). We used a Codex AI agent with the GPT-5-Codex model to implement the test scripts. These scripts are generic and use an OpenAI-compatible API to prompt the model, so they should work with other AI services and LLM serving engines, although this was not tested.

Experiment 1 – Ad-Hoc prompting

The object detection capabilities of Qwen3-VL models are not well documented. Based on example scripts from the Qwen3-VL repository, blog posts, and our own testing, we concluded that the model outputs object coordinates in the range 0–1000 (Qwen Team, 2025b; Rath, 2025). The bounding box format is [x_min, y_min, x_max, y_max]. Pixel coordinates for the bounding box corners can be computed as: [x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height].
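Assuming the 0–1000 convention above, the conversion to pixel coordinates can be sketched as follows (the function name is ours, not part of the Qwen tooling):

```python
def qwen_bbox_to_pixels(bbox, image_width, image_height):
    """Convert a Qwen3-VL bounding box in the 0-1000 range to pixel coordinates.

    bbox is [x_min, y_min, x_max, y_max] as returned by the model.
    """
    x_min, y_min, x_max, y_max = bbox
    return [
        x_min / 1000 * image_width,
        y_min / 1000 * image_height,
        x_max / 1000 * image_width,
        y_max / 1000 * image_height,
    ]
```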

Interestingly, the “Thinking” models can sometimes be persuaded to return coordinates in the 0.0–1.0 range, but this requires significant effort and the model often ignores the instruction. When it does comply, the conversion happens internally, consuming many chain-of-thought tokens, which makes it inefficient. The non-thinking model did not follow this instruction at all.

When not instructed on output format, the model defaults to JSON, for example:

```json
[
{"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```

The backticks and “json” are part of the response. This markdown formatting helps graphical user interfaces detect the output as JSON and apply syntax highlighting. We tried to remove the markdown formatting with prompts like:

“Return only valid JSON without any Markdown formatting, backticks, or additional text.”

However, the models often ignored this instruction. Therefore, client-side scripts should be prepared to strip markdown before parsing the JSON.
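A minimal sketch of such client-side stripping (our own helper, not from the repository scripts) could look like this:

```python
import json
import re


def parse_model_json(text):
    """Parse JSON from a model response, stripping an optional ```json fence."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```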

The model can also return results in other structured formats (e.g., XML, YAML) and generally follows such instructions easily. However, since JSON was the default and met our needs, we decided to stick with it. Other formats were only tested briefly out of curiosity.

Our repository includes a script query_bbox.py, which was used to prompt the model to produce bounding boxes visualized in the following images (Alatalo, 2025). In addition to the prompt shown in the image captions, the script adds extra instructions in the system prompt to enforce the correct output format. See the script source code for full details.

A cat sitting on a rocky surface. AI has drawn a box around the cat, labeled "cat".
Image 1: Model: Qwen3-VL-30B-A3B-Instruct, Prompt: “Detect a cat in this image”, Original image By Alvesgaspar – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12127693

There is nothing surprising in the example of detecting a cat in a simple image of a cat. The bounding box is tight around the cat.

An AI generated image representing dogs, a cat, a pig, a goat, a horse, a chicken and a cow. All animals have been identified by AI by drawing a box around them and labeling them.
Image 2: Model: Qwen3-VL-30B-A3B-Instruct, Prompt: “Detect and label all animals in this image”, Original image is generated by AI.

In true object detection, the object is both identified and localized in the image. In the previous example, the model correctly detects and labels all the animals. The bounding box around the dog in the middle misses its left hind leg, but otherwise the boxes are very accurate. Since we could not find a suitable real test image, the above image was AI-generated for testing purposes.

The previous examples were easy for the model to handle, so we wanted to test something much harder. The next image shows a flock of sheep, and the objective was to detect all sheep heads.

A picture of a flock of sheep with red boxes drawn by AI in an attempt to detect all the sheep heads in the picture. Not all of the sheep heads have been detected and sometimes the detection box is drawn around whole sheep.
Image 3: Model: Qwen3-VL-30B-A3B-Instruct, Prompt: “Detect all sheep heads in this image”, experiment #5 in the repository (Alatalo, 2025), Generated tokens: 1,223, Original image https://pixabay.com/photos/sheep-flock-of-sheep-7286385/

When an image contains many objects, detection becomes much harder for the model. The detection is also unstable. When the model is prompted with different seed values, the detected objects vary.

The repository includes 10 examples where the model was prompted with the same content and sampling parameters, but the seed value varied from 1 to 10. The results are in the article-experiments/sheep-detection directory in the repository (Alatalo, 2025).

We used the recommended sampling parameters for the “instruct” models: top_p=0.8, top_k=20, temperature=0.7, repetition_penalty=1.0, presence_penalty=1.5 (Qwen Team, 2025b).
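When prompting through an OpenAI-compatible API, the standard request fields cover only some of these knobs; vLLM-specific parameters such as top_k and repetition_penalty are typically passed via extra_body. A hedged sketch of building such a request (the helper is ours, not from the repository):

```python
def instruct_request_kwargs(model, messages):
    """Build chat-completion kwargs with the sampling parameters recommended
    for the Qwen3-VL "Instruct" models. vLLM-only parameters go into
    extra_body when using the official OpenAI Python client."""
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        "extra_body": {"top_k": 20, "repetition_penalty": 1.0},
    }
```

The resulting dictionary can then be unpacked into client.chat.completions.create(**kwargs) against a vLLM server.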

The test image clearly confused the Qwen3-VL-30B-A3B-Instruct model. It often ignored the instruction to detect only sheep heads and instead detected whole sheep in most examples. Otherwise, detection was fairly good, although some obvious misses occurred in the middle of the flock, and sheep in the background were not detected correctly.

A picture of a flock of sheep with red boxes drawn by AI in an attempt to detect all the sheep heads in the picture. Not all of the sheep heads have been detected, but the detection is better than in the previous picture.
Image 4: Model: Qwen3-VL-30B-A3B-Thinking, Prompt: “Detect all sheep heads in this image”, experiment #1 in the repository (Alatalo, 2025), Input prompt length: 1,349 tokens, Generated tokens: 5,312 tokens (CoT tokens and output combined), Original image https://pixabay.com/photos/sheep-flock-of-sheep-7286385/

We repeated the experiment with the Qwen3-VL-30B-A3B-Thinking model. The recommended sampling parameters differ for thinking models: top_p=0.95, top_k=20, repetition_penalty=1.0, presence_penalty=0.0, temperature=0.6 (Qwen Team, 2025b).

The results were interesting. The model missed the one obvious sheep in the foreground in 8 out of 10 experiments. It followed the instruction to detect only heads more reliably, with only 2 out of 10 experiments where it detected whole sheep. Experiment #2 produced the worst detection, missing most sheep in the foreground.

Thinking models generate Chain-of-Thought (CoT) tokens before producing the final answer, giving the model time to “think” before responding. Across the experiments, the fewest tokens generated were 3,521, and the most were 13,539. vLLM version v0.11.0 does not report the number of CoT tokens in the response, but we estimate that the final detection output was about 1,500 tokens long, meaning the model used roughly 2,000–12,000 tokens for reasoning.

On our setup, the 30B-A3B models generate around 145 tokens per second on two H100 GPUs in a tensor-parallel configuration when the input length is about 1,400 tokens (image size is the main factor affecting token count). Generation slows to around 130 tokens per second as the context approaches 20,000 tokens. Detection took about 25–100 seconds per image, depending on how long the model spent generating CoT tokens before answering.

A picture of a flock of sheep with red boxes drawn by AI in an attempt to detect all the sheep heads in the picture. Not all of the sheep heads have been detected, but the detection is better than in the previous pictures.
Image 5: Model: Qwen3-VL-235B-A22B-Thinking-FP8, Prompt: “Detect all sheep heads in this image”, experiment #9 in the repository (Alatalo, 2025), Input prompt length: 1,349 tokens, Generated tokens: 9,619 tokens (CoT tokens and output combined), Original image https://pixabay.com/photos/sheep-flock-of-sheep-7286385/

The large 235-billion-parameter model produced the most consistent results. It followed the instruction to detect only the heads and did not miss any of the most obvious sheep in any of the 10 experiments. In the background, the detections did not align perfectly with the actual sheep heads, but overall, the results were fairly good.

The fewest tokens generated across the experiments were 4,241, and the most were 9,619. Based on an estimated 1,500-token output, the model used about 2,750–8,000 tokens for reasoning.

The larger model is much more compute-intensive. On our setup, with a 2-way tensor-parallel and 2-way pipeline-parallel configuration across two compute nodes, the model generated about 45 tokens per second. Detection took approximately 90–200 seconds per image, depending on how long the model spent generating CoT tokens before answering.

Experiment 2 – Wood Surface Defects

The experiments with VLM models were conducted as part of Jamk University of Applied Sciences’ AI-Boost project (Finnish name: AI-Loikka). In this project, we partnered with local companies in Central Finland to research and test cases where Generative AI could improve competitiveness. Central Finland has many companies in the manufacturing sector, where automated processes have long been using object detection methods. We wanted to evaluate how modern open vision-language models perform in such use cases.

To benchmark model performance, we selected an open dataset from the manufacturing sector. The dataset is based on the Wood Defects Dataset by Kodytek et al. (2022), but we used a variation by Ahsan (2022), which converted bounding boxes to YOLO format and reduced image sizes.

The task was to detect defects on wood surfaces. The dataset contains eight classes: Quartzity, Live_Knot, Marrow, resin, Dead_Knot, knot_with_crack, Knot_missing and Crack. Each image includes human-labeled bounding boxes for visible defects. The objective was to detect these classes and compare predictions against ground truth labels.

We tested the same three Qwen3-VL model types as in the previous experiment and evaluated Zero-shot and One-shot setups.

In Zero-shot experiments, the prompt was: “This is an image of a wood plank. I need to detect defects in the surface. Detect classes: Quartzity, Live_Knot, Marrow, resin, Dead_Knot, knot_with_crack, Knot_missing and Crack.” No further prompt engineering was applied.

In One-shot experiments, the model received one dialogue-formatted example before being prompted with the real sample. By dialogue formatting, we mean that the interaction between the user and the assistant was faked to look like the assistant had already responded to the first user message. The assistant’s first response was manually written to include the correct bounding boxes and labels for the first image. This approach conditions the model to produce similar outputs for the second user message, which contains the real image where we want to perform detection.
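The faked dialogue can be sketched as an OpenAI-style message list where the assistant’s first turn is hand-written rather than generated (the function and its arguments are illustrative placeholders, not the repository’s exact implementation):

```python
def one_shot_messages(system_prompt, user_prompt, example_image_url,
                      example_answer, target_image_url):
    """Build a faked one-shot dialogue: the first assistant turn is a
    hand-written example answer, conditioning the model for the real image."""
    def user_turn(text, image_url):
        return {
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }

    return [
        {"role": "system", "content": system_prompt},
        user_turn(user_prompt, example_image_url),
        {"role": "assistant", "content": example_answer},
        user_turn(user_prompt, target_image_url),
    ]
```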

During these experiments, we made a small mistake in the example response. In the instructing prompt, the class name “Live_Knot” was written with a capital “K”, while in the example response it was written as “Live_knot” with a lowercase “k”. This inconsistency caused some models to start producing class names in the latter format. To correct this issue during evaluation, the evaluate_detection.py script converts all labels to lowercase before computing the metrics.

We also tested another method of providing an example by giving the model an image with bounding boxes drawn directly on it. The idea was that the model might understand how bounding boxes look visually and improve its predictions based on this context. However, this approach produced worse results than the dialogue-formatted technique. The repository includes results from one such experiment in article-experiments/knots/07-30BA3B-Thinking-context-example, but these results are not discussed further in this article (see the repository in Alatalo, 2025).

The following table summarizes results from the other experiments. The dataset contains 4,000 samples. One sample was used as the example in the One-shot experiments and removed from evaluation, leaving 3,999 samples. In some cases, the model failed to produce a valid response for a detection request, which reduced the number of evaluated samples in some experiments: either the model got stuck in the thinking phase and never produced a valid response, or the response was in an incorrect format.

In the table, the columns are as follows:

  • Valid detection responses: Number of valid detection responses in the experiment.
  • Predicted boxes: Number of bounding boxes predicted by the model.
  • Matches (IoU ≥ 0.5): Number of bounding boxes matching ground truth with Intersection over Union (IoU) ≥ 0.5.
  • Label Accuracy: Accuracy of class labels within matched bounding boxes.
  • Precision: Metric that also considers unmatched bounding boxes.
| Model / Zero-Shot (ZS), One-Shot (OS) | Valid detection responses | Predicted boxes | Matches (IoU ≥ 0.5) | Label accuracy within matched boxes | Precision |
| --- | --- | --- | --- | --- | --- |
| 30BA3B-Instruct/ZS | 4,000 | 13,975 | 4,535 | 0.4088 | 0.1327 |
| 30BA3B-Instruct/OS | 3,999 | 6,306 | 3,304 | 0.4252 | 0.2228 |
| 30BA3B-Thinking/ZS | 3,997 | 8,900 | 3,443 | 0.4415 | 0.1708 |
| 30BA3B-Thinking/OS | 3,999 | 8,026 | 3,566 | 0.5373 | 0.2387 |
| 235BA22B-Thinking-FP8/ZS | 3,935 | 8,203 | 3,672 | 0.5158 | 0.2309 |
| 235BA22B-Thinking-FP8/OS | 3,999 | 8,209 | 3,560 | 0.5601 | 0.2429 |
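The matching criterion used in the table, IoU ≥ 0.5, can be computed as in this minimal sketch (boxes given as [x_min, y_min, x_max, y_max]; this mirrors the standard definition, not necessarily the exact code in evaluate_detection.py):

```python
def iou(box_a, box_b):
    """Intersection over Union for two [x_min, y_min, x_max, y_max] boxes."""
    # Intersection rectangle (empty if the boxes do not overlap)
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```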

As shown by the results, the thinking models perform better than the non-thinking model, and the larger thinking model outperforms the smaller one. One-shot prompting also achieves better results compared to zero-shot.

In the 235BA22B-Thinking-FP8/OS experiment, more than half of the matched bounding box labels are correct. However, when considering all ground truth bounding boxes, the overall precision is not great. The model misses many defects or produces bounding boxes with IoU below 0.5, which lowers the score.

When comparing predicted labels to real labels more closely by visualizing the results, we can see shortcomings both in the model predictions and in the dataset used for the experiments. In the images, the real labels from the dataset are shown on the left, and the model predictions are on the right. The visualizations are from the 06-235BA22B-Thinking-One-Shot experiment and include samples 100000036, 100000039, and 100000007.

Wood surface with boxes indicating live knots made by AI image recognition.
Image 6: Sample 100000036.
Wood surface with boxes indicating live knots, dead knots, knot missing and quartzity made by AI image recognition.
Image 7: Sample 100000039.
Wood surface with boxes indicating live knots and cracks made by AI image recognition.
Image 8: Sample 100000007.

The first detection is good, but it is an easy example. The bounding boxes are larger than those in the dataset, but the model correctly detected the two defects.

The second example highlights a shortcoming in the dataset. It contains tiny bounding boxes that point to nothing. This issue is not limited to this image. Similar extra bounding boxes appear in other samples and exist in the original dataset, not only in the variation we used. Nevertheless, the model detections in the example are good for the actual defects, but two of the labels are incorrect.

The last image shows poor detection by the model. One bounding box is incorrectly placed and mislabeled, and the knot on the right is not detected at all.

These are only a few examples from the experiments. The repository includes bounding box predictions for all images and also provides scripts to visualize the bounding boxes on the original images (Alatalo, 2025).

Conclusions

Open Vision-Language Models (VLMs) have developed rapidly and can now perform object detection from images. Our experiments show that the results are already quite good, though not perfect.

Although open VLMs are not yet ready to fully solve the object detection problem, they are still an attractive alternative to traditional object detection models. Users can guide VLMs simply by changing the prompt and optionally providing one or a few examples. This is a huge advantage for VLMs.

In the past, if a pretrained model was not available for a specific use case, a new model had to be created by training from scratch or fine-tuning an existing object detection model. Both approaches required labeled examples. The number of samples needed for reliable performance depended on the use case. For image-related tasks, the recommended number of samples was usually at least hundreds per class (Warden, 2017). Creating datasets required manual labor for annotation, which was expensive and inflexible. For example, adding a new object class often meant re-annotating the dataset and retraining the model.

With VLMs, a proof-of-concept implementation of a domain-specific object detector can be built in minutes. A zero-shot VLM object detector only needs a few sentences in a prompt to instruct the model, plus scaffolding software to handle the output. This approach is highly flexible, allowing quick adaptation to changing requirements by simply modifying the prompt.

However, VLMs are not a silver bullet. They are computationally heavy and require far more resources than traditional object detection models. In many manufacturing use cases, object detection must be fast enough to keep pace with the production process. The required speed depends on the specific application, but real-time object detection often demands processing multiple frames per second to remain practical. Current VLMs are generally unsuitable for real-time or near real-time detection because they are typically too slow for such applications.

That does not mean VLMs are useless for problems requiring real-time performance. As VLMs improve and their detection capabilities approach human level, they can be used to annotate training datasets for traditional object detection methods.

In many cases, dataset quality has been the limiting factor for model performance. Using VLMs for labeled dataset creation could enable the development of large, high-quality datasets that can train better-performing traditional models cheaply and efficiently. Dataset sizes for training traditional machine learning models could grow significantly because VLM-based labeling can scale far beyond what is possible with human annotators.

AI-Boost

AI-Boost project (Finnish name: AI-Loikka) focuses on identifying use cases for generative AI technologies to boost the competitiveness of companies, with a particular focus on the Central Finland region. The funding program is the European Regional Development Fund (ERDF), project code A81725, and the funding agency is the Regional Council of Central Finland.

Read more about the project.