From 1.8 Seconds to 4 Milliseconds: Building Youzu Lens
We've been building computer vision models for a long time. Before Youzu, our team built AI broadcasting cameras at Sportotal, so YOLO wasn't new to us. But when we pointed it at interior design and home decor catalogs, the results were terrible. The detection alone was barely usable, and segmentation — which is what visual search actually needs — was worse.
So we switched to Florence. The detection quality jumped, but speed collapsed. We were sitting at 10 seconds per image just for the object-detection step.
That's a non-starter in production. Youzu Lens runs alongside an image generation model, a recommendation engine, and a vector search. Every second we burn on detection is a second the customer experience can't have.
Squeezing Florence
We did the obvious things first: a torch.compile pass, better GPUs, kernel tweaks. That dropped us from 10 seconds to 1.8 seconds. Better, but still not where a real-time UX needs to be.
Going back to YOLO
Recently we ran another experiment: a fresh take on YOLO. The first unoptimized run came in at 30 milliseconds. We exported it to TensorRT and recompiled — 4 milliseconds.
That's a ~450× speedup over the optimized 1.8-second Florence pipeline (and ~2,500× over the original 10-second baseline), on the same hardware tier.
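The export path is the standard Ultralytics one. A sketch, assuming a segmentation checkpoint named `yolo11n-seg.pt`, a sample image `room.jpg`, and a CUDA GPU with TensorRT installed (our fine-tuned weights aren't public):

```python
from ultralytics import YOLO

# Hypothetical checkpoint name; the production model is fine-tuned
# on interior design and home decor catalogs.
model = YOLO("yolo11n-seg.pt")

# Plain PyTorch inference: this is the ~30 ms path.
model("room.jpg")

# Export to a TensorRT engine; half=True enables FP16 precision,
# which is where most of the remaining latency drops out.
engine_path = model.export(format="engine", half=True)

# Reload the compiled engine and serve from it: the ~4 ms path.
trt_model = YOLO(engine_path)
trt_model("room.jpg")
```

The engine is built for a specific GPU and input shape, so the export runs once at deploy time, not per request.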
The architecture today
The full pipeline is async and multi-tenant:
- Product metadata flows in via SQS
- A separate image queue runs through our own product-image processors
- Indexed embeddings (1,268 dimensions, vision-transformer-based) live in Qdrant
- We do cosine similarity at query time and cache hot neighborhoods so repeat clicks (and "shop the look" expansions) are instant
Originally we shipped on OpenSearch with a sequential pipeline. It was fast to build, but it wasn't going to hold at scale. Qdrant was the right call.
Why segmentation matters more than bounding boxes
A common question after this talk: why bother with segmentation when bounding boxes are cheaper?
Because the embedding you get from a tightly-cropped, segmented region is dramatically better than one polluted by background pixels. The vector database doesn't know what "furniture" is — it just knows distances. Clean inputs make those distances meaningful.
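A toy illustration of the clean-input point: crop to the mask's bounding box and zero everything outside the mask before embedding, so background pixels never enter the vector. The `masked_crop` helper and the shapes here are hypothetical, not our production code.

```python
import numpy as np

def masked_crop(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop to the mask's bounding box and zero out background pixels.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean segmentation mask.
    """
    ys, xs = np.where(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0  # suppress background pixels
    return crop

# Toy scene: an L-shaped "sofa" region inside a noisy background.
rng = np.random.default_rng(1)
image = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:50] = True
mask[20:30, 10:20] = False  # notch: background inside the bounding box

crop = masked_crop(image, mask)
print(crop.shape)  # (20, 40, 3)
```

A bounding-box crop would keep the notch's background pixels and let them pull the embedding toward whatever happens to be behind the object; the masked crop feeds the encoder only the object itself.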
Where this runs in production
Our largest deployment is with Vivre, the largest furniture marketplace in Romania and CEE. The system handles their full catalog, returns visually similar alternatives in real time, and powers shop-the-look across hundreds of thousands of SKUs.
The lesson, in one line: the difference between a cool demo and a production visual search engine is measured in milliseconds — and you only get there by being willing to throw away the model you started with.
Watch the full talk above for the architecture diagrams and the live demos.