From hallucinations to hardware: lessons from a real-world computer vision project

Computer vision projects rarely go as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into hallucinations, unreliable outputs and images that were not even laptops. To solve these problems, we ended up applying an agentic framework in an atypical way: not to automate tasks, but to improve the model's performance.

In this post, we will walk through what we tried, what did not work and how a combination of approaches ultimately helped us build something reliable.

Where we started: monolithic prompting

Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image to an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
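
As a rough illustration, here is a minimal sketch of that monolithic pass, assuming the OpenAI Python SDK as the image-capable LLM; the model name, prompt wording and the detect_damage helper are illustrative choices, not our production code:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def detect_damage(image_path: str) -> str:
    """Send one photo and one big prompt to an image-capable LLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any image-capable model would work here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Inspect this laptop photo and list any visible "
                         "physical damage (cracked screen, missing keys, "
                         "broken hinges, dents). Reply 'none' if undamaged."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```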

We hit three major problems early on:

  • Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it was seeing.
  • Junk image detection: It had no reliable way of flagging images that were not even laptops; images of desks, walls or people occasionally slipped through and received meaningless damage reports.
  • Inconsistent accuracy: Together, these problems made the model too unreliable for operational use.

At this point, it was clear we would have to iterate.

First solution: mixing image resolutions

One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp, high-resolution shots to blurry ones. This led us to research highlighting how image resolution affects deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucination and junk image handling persisted.
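
One simple way to approximate this mix during training is to randomly degrade images in the data pipeline. The sketch below assumes a torchvision-style pipeline; the class name, sizes and probability are illustrative, not the exact augmentation we used:

```python
import random
from torchvision import transforms

class RandomDownscale:
    """Randomly degrade a PIL image to mimic blurry, low-quality uploads."""
    def __init__(self, min_size: int = 64, p: float = 0.5):
        self.min_size, self.p = min_size, p

    def __call__(self, img):
        if random.random() < self.p:
            w, h = img.size
            low = min(1.0, self.min_size / min(w, h))
            scale = random.uniform(low, 1.0)
            small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
            img = small.resize((w, h))  # upscale back, keeping the blur
        return img

# Mix degraded and pristine images in the same training stream.
train_transform = transforms.Compose([
    RandomDownscale(min_size=64, p=0.5),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```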

The multimodal detour: a text-only LLM goes multimodal

Encouraged by recent experiments combining image captions with text-only LLMs, such as the technique covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.

Here is how it works (a code sketch follows the list):

  • The LLM starts by generating multiple candidate captions for an image.
  • A multimodal embedding model checks how well each caption matches the image; in our case, we used SigLIP to score the similarity between the image and the text.
  • The system keeps the top few captions based on these scores.
  • The LLM uses those top captions to write new ones, gradually getting closer to what the image actually shows.
  • The process repeats until the captions stop improving or a set iteration limit is reached.
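
A minimal sketch of this loop, assuming a Hugging Face SigLIP checkpoint; the generate_captions callable is a hypothetical stand-in for whatever text-only LLM proposes the captions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, SiglipModel

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = SiglipModel.from_pretrained("google/siglip-base-patch16-224")

def score_captions(image: Image.Image, captions: list[str]) -> list[float]:
    """Score each candidate caption against the image with SigLIP."""
    inputs = processor(text=captions, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image[0].tolist()  # shape (1, num_captions)

def refine_caption(image, generate_captions, rounds: int = 5, keep: int = 3):
    """Generate, score and re-seed captions until they stop improving."""
    best_score, seeds = float("-inf"), []
    for _ in range(rounds):
        candidates = generate_captions(seeds)       # LLM proposes captions
        scores = score_captions(image, candidates)  # SigLIP grades them
        ranked = sorted(zip(scores, candidates), reverse=True)
        if ranked[0][0] <= best_score:              # no improvement: stop
            break
        best_score = ranked[0][0]
        seeds = [caption for _, caption in ranked[:keep]]
    return seeds[0] if seeds else None
```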

While clever in theory, this approach introduced new problems for our use case:

  • Persistent hallucinations: The captions themselves sometimes included imaginary damage, which the LLM then confidently reported.
  • Incomplete coverage: Even with multiple captions, some issues were missed entirely.
  • More complexity, little benefit: The added steps made the system more complicated without reliably outperforming the earlier setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. While agentic frameworks are typically used to orchestrate task flows (think agents that coordinate calendar invites or customer service actions), we wondered whether breaking the image interpretation task into smaller, specialized agents could help.

We built a structured agentic framework like this (a code sketch follows the list):

  • Orchestrator agent: Checked the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents inspected each component for specific damage types; for example, one for cracked screens, another for missing keys.
  • Junk detection agent: A separate agent first flagged whether the image was even a laptop.
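
Here is a hedged sketch of that layout; call_vision_llm is a hypothetical stand-in for whatever image-capable model you call, and the prompts, component names and damage types are illustrative:

```python
from dataclasses import dataclass

def call_vision_llm(image, prompt: str) -> str:
    """Stand-in for a real multimodal LLM call; wire up your own model here."""
    raise NotImplementedError

@dataclass
class ComponentAgent:
    component: str           # e.g. "screen"
    damage_types: list[str]  # e.g. ["cracked screen"]

    def inspect(self, image) -> list[str]:
        prompt = (f"Look only at the laptop's {self.component}. "
                  f"Report any of the following, or say 'none': "
                  f"{', '.join(self.damage_types)}.")
        answer = call_vision_llm(image, prompt).lower()
        return [d for d in self.damage_types if d in answer]

def assess_image(image) -> dict:
    # Junk detection agent runs first so non-laptop images exit early.
    if "yes" not in call_vision_llm(image, "Is this a laptop photo? yes/no").lower():
        return {"status": "junk", "findings": []}

    # Orchestrator agent reports which components are actually visible.
    visible = call_vision_llm(
        image, "Which of these laptop components are visible: "
               "screen, keyboard, chassis, ports?").lower()

    # Component agents each check only their own narrow damage types.
    agents = [
        ComponentAgent("screen", ["cracked screen"]),
        ComponentAgent("keyboard", ["missing keys"]),
        ComponentAgent("chassis", ["broken hinges", "dents"]),
    ]
    findings = [d for a in agents if a.component in visible
                for d in a.inspect(image)]
    return {"status": "ok", "findings": findings}
```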

This modular, task-driven approach produced far more accurate and explainable results. Hallucinations dropped dramatically, junk images were flagged reliably, and each agent's task was narrow and focused enough to control quality well.

Blind spots: the trade-offs of an agentic approach

As effective as it was, this approach was not perfect. Two main limitations emerged:

  • Increased latency: Running multiple sequential agents added to the total inference time.
  • Coverage gaps: Agents could only detect problems they were explicitly programmed to look for. If an image showed something unexpected that no agent was tasked with identifying, it went unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: combining agentic and monolithic approaches

To close the gaps, we created a hybrid system (a code sketch follows the list):

  1. The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the number of agents to the most essential ones to improve latency.
  2. Then, a monolithic image LLM prompt scanned the image for anything else the agents might have missed.
  3. Finally, we fine-tuned the model on a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
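
Steps 1 and 2 compose naturally in code. Here is a hedged sketch, reusing the hypothetical assess_image and call_vision_llm helpers from the earlier agent sketch; the catch-all prompt is illustrative:

```python
def hybrid_assess(image) -> dict:
    # Step 1: precise, agent-based pass (also handles junk images).
    report = assess_image(image)
    if report["status"] == "junk":
        return report  # no point running the broad sweep on junk

    # Step 2: one monolithic catch-all prompt for anything the agents missed.
    known = ", ".join(report["findings"]) or "none"
    sweep = call_vision_llm(
        image,
        "List any other visible physical damage on this laptop "
        f"not already covered by: {known}.",
    )
    report["extra_observations"] = sweep  # broad-coverage safety net
    return report
```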

This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.

What we learned

A few things were clear by the time we finished this project:

  • Agentic frameworks are more versatile than they get credit for: While they are usually associated with workflow management, we found they could meaningfully boost model performance when applied in a structured, modular way.
  • Mixing approaches beats relying on just one: Combining precise agent-based detection with the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable results than any single method alone.
  • Visual models are prone to hallucination: Even the most advanced setups can jump to conclusions or see things that are not there. Thoughtful system design is needed to keep those errors in check.
  • Variety in image quality makes a difference: Training and testing with both clear, high-resolution images and lower-quality ones helped the model stay resilient when faced with unpredictable, real-world photos.
  • You need a way to catch junk images: A check for junk or unrelated images was one of the simplest changes we made, and it had an outsized impact on overall system reliability.

Final thoughts

What began as a simple idea, using an LLM prompt to detect physical damage in laptop photos, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this kind of work.

Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but also easier to understand and maintain in practice.

Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.
