Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly became more complicated.
Along the way, we ran into problems with hallucinations, unreliable outputs and images that were not even laptops. To solve them, we ended up applying an agentic framework in an atypical way, not for task automation, but to improve the model's performance.
In this post, we will walk through what we tried, what did not work and how a combination of approaches ultimately helped us build something reliable.
Our initial approach was fairly standard for a multimodal model: use a single, large prompt to pass an image to an image-capable LLM and ask it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
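To make the setup concrete, here is a minimal sketch of what that monolithic version can look like, assuming an OpenAI-style multimodal chat API and a gpt-4o-class model; the SDK, model and prompt wording are illustrative choices on our part, not a spec of the production system.

```python
# Minimal sketch of the monolithic approach: one large prompt, one image-capable
# LLM call. The SDK, model and prompt wording are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are inspecting a photo of a laptop. List any visible physical damage, "
    "such as cracked screens, missing keys or broken hinges. If the image is not "
    "a laptop or no damage is visible, say so explicitly."
)

def detect_damage(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any image-capable chat model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```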
We ran into three main problems from the start: the model hallucinated damage that was not in the image, its results were inconsistent from run to run, and it stumbled on images that were not laptops at all.
This was the point where it became clear that we would have to iterate.
One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp, high-resolution shots to blurry ones. That led us to research highlighting how image resolution affects deep learning models.
We trained and tested the model using a combination of high- and low-resolution images. The idea was to make the model more robust to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucination and junk-image handling persisted.
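One straightforward way to build that robustness is to pair every high-resolution image with an artificially degraded copy, so the model sees both ends of the quality spectrum. The sketch below uses Pillow; the directory names and target width are illustrative assumptions, not the exact values we used.

```python
# Sketch: pair each high-resolution image with a degraded low-resolution copy so
# the model trains and evaluates on the range of qualities users actually upload.
# Directory names and the target width are illustrative, not the values we used.
from pathlib import Path
from PIL import Image

SOURCE_DIR = Path("images/original")
OUTPUT_DIR = Path("images/mixed_resolution")
LOW_RES_WIDTH = 320  # aggressive downscale to mimic blurry phone photos

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for path in SOURCE_DIR.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    w, h = img.size

    # Keep the original high-resolution version.
    img.save(OUTPUT_DIR / f"{path.stem}_high.jpg", quality=95)

    # Downscale, then upscale back so dimensions match but fine detail is lost.
    scale = LOW_RES_WIDTH / w
    low = img.resize((LOW_RES_WIDTH, max(1, int(h * scale))), Image.Resampling.BILINEAR)
    low = low.resize((w, h), Image.Resampling.BILINEAR)
    low.save(OUTPUT_DIR / f"{path.stem}_low.jpg", quality=70)
```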
Encouraged by recent experiments combining image captioning with text-only LLMs, such as the technique covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.
Here is how it works: an image model first generates a caption describing the photo, and a text-only LLM then interprets that caption to assess damage.
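A minimal sketch of that two-step pipeline might look like the following, assuming a BLIP captioning model from Hugging Face for step one and a text-only chat model for step two; both model choices are illustrative assumptions.

```python
# Sketch of the caption-then-interpret pipeline: an image model writes a caption,
# then a text-only LLM judges damage from the caption alone.
# Model choices (BLIP, gpt-4o-mini) are illustrative assumptions.
from transformers import pipeline
from openai import OpenAI

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
client = OpenAI()

def assess_from_caption(image_path: str) -> str:
    # Step 1: generate a textual description of the image.
    caption = captioner(image_path)[0]["generated_text"]

    # Step 2: ask a text-only LLM to reason about damage from the caption alone.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Image caption: '{caption}'. Based only on this caption, does the "
                "laptop show physical damage? If the caption does not describe a "
                "laptop, say so."
            ),
        }],
    )
    return response.choices[0].message.content
```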
While clever in theory, this approach introduced new problems for our use case.
It was an interesting experiment, but ultimately not a solution.
This was the turning point. While agentic frameworks are typically used to orchestrate task workflows (think agents that coordinate calendar invites or customer service actions), we wondered whether breaking the image-interpretation task into smaller, specialized agents could help.
We built a structured agentic framework along these lines, with each agent owning one narrow piece of the job (a rough sketch follows):
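In the sketch below, each "agent" is reduced to a narrowly scoped prompt, with an orchestrator that runs a gatekeeping check first and the specialized checks afterward. The specific agents, prompts and model are illustrative assumptions rather than our exact configuration.

```python
# Sketch of a modular, task-based agentic setup: each agent has one narrow job
# and an orchestrator chains them. Agent roles, prompts and model are illustrative.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class AgentResult:
    name: str
    output: str

def run_agent(name: str, instruction: str, image_b64: str) -> AgentResult:
    # One narrowly scoped, image-capable LLM call per agent.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any image-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return AgentResult(name, response.choices[0].message.content)

def orchestrate(image_b64: str) -> list[AgentResult]:
    # Gatekeeper first: flag junk images before any damage analysis runs.
    gate = run_agent(
        "image_validator",
        "Answer YES or NO: is this a photo of a laptop?",
        image_b64,
    )
    if gate.output.strip().upper().startswith("NO"):
        return [gate]

    # Specialized agents, each focused on a single damage category.
    checks = {
        "screen_agent": "Report only screen damage (cracks, dead areas), or 'none'.",
        "keyboard_agent": "Report only keyboard damage (missing or broken keys), or 'none'.",
        "hinge_agent": "Report only hinge or chassis damage, or 'none'.",
    }
    return [gate] + [run_agent(n, p, image_b64) for n, p in checks.items()]
```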
This modular, task-based approach produced far more accurate and explainable results. Hallucinations dropped dramatically, junk images were flagged reliably, and each agent's task was simple and focused enough to control quality well.
As effective as it was, it was not perfect. Two main limitations emerged.
We needed a way to balance precision with coverage.
To close the gaps, we created a hybrid system that layered the agentic pipeline, the monolithic prompt and targeted fine-tuning.
This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.
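As a rough illustration of the routing logic (fine-tuning omitted), the sketch below reuses the detect_damage and orchestrate helpers from the earlier sketches; the confidence heuristic and fallback rule are assumptions of ours, not the exact policy we shipped.

```python
# Sketch of the hybrid routing: run the agentic pipeline first; if its findings
# look uncertain, fall back to the broad monolithic prompt for coverage.
# The "uncertainty" heuristic below is purely illustrative.
import base64

def hybrid_assessment(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    agent_results = orchestrate(image_b64)  # agentic pipeline from the earlier sketch

    # The gatekeeper rejected the image as junk: return that verdict directly.
    if len(agent_results) == 1:
        return agent_results[0].output

    # Crude confidence check (an assumption for illustration): if any specialized
    # agent returns an empty or hedged answer, defer to the monolithic prompt.
    uncertain = any(
        not r.output.strip() or "unsure" in r.output.lower()
        for r in agent_results[1:]
    )
    if uncertain:
        return detect_damage(image_path)  # monolithic prompt from the first sketch

    return "\n".join(f"{r.name}: {r.output}" for r in agent_results[1:])
```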
A few things were clear by the time we finished this project.
What began as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly became a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this type of work.
Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks such as structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but also easier to understand and manage in practice.
Shruti Tiwari is an AI product manager at Dell Technologies.
Vadiraj Kulkarni is a data scientist at Dell Technologies.