
From 1.8 Seconds to 4 Milliseconds: Building Youzu Lens

Youzu Engineering · August 18, 2025 · 6 min read


We've been building computer vision models for a long time. Before Youzu, our team built AI broadcasting cameras at Sportotal, so YOLO wasn't new to us. But when we pointed it at interior design and home decor catalogs, the results were terrible. The detection alone was barely usable, and segmentation — which is what visual search actually needs — was worse.

So we switched to Florence. The detection quality jumped, but speed collapsed. We were sitting at 10 seconds per image just for the object-detection step.

That's a non-starter in production. Youzu Lens runs alongside an image generation model, a recommendation engine, and a vector search. Every second we burn on detection is a second the customer experience can't have.

Squeezing Florence

We did the obvious things first: a torch.compile pass, better GPUs, kernel tweaks. That dropped us from 10 seconds to 1.8 seconds. Better, but still not where a real-time UX needs to be.
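For reference, compiling a Hugging Face Florence-2 checkpoint looks roughly like the sketch below. The checkpoint name, precision, and compile mode are our assumptions for illustration, not the production settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed checkpoint and settings; the production model and precision may differ.
model_id = "microsoft/Florence-2-large"
device = "cuda"

model = (
    AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, trust_remote_code=True
    )
    .to(device)
    .eval()
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# torch.compile traces the model once and reuses the optimized graph on later
# calls; the first inference pays the compilation cost.
model = torch.compile(model, mode="reduce-overhead")
```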

Going back to YOLO

Recently we ran another experiment: a fresh take on YOLO. The first unoptimized run came in at 30 milliseconds. We exported it to TensorRT and recompiled — 4 milliseconds.
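With an Ultralytics-style YOLO checkpoint, the TensorRT export is essentially a one-liner. The weights file, image size, and FP16 flag below are illustrative, not our production configuration.

```python
from ultralytics import YOLO

# Illustrative weights and settings, not the production checkpoint.
model = YOLO("yolo11n-seg.pt")

# Build a TensorRT engine with FP16 kernels for the target GPU.
model.export(format="engine", half=True, imgsz=640)

# Inference then loads the compiled engine instead of the PyTorch weights.
trt_model = YOLO("yolo11n-seg.engine")
results = trt_model("living_room.jpg")  # masks + boxes per detected object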

That's a ~450× speedup over the optimized 1.8-second Florence baseline, on the same hardware tier.

The architecture today

The full pipeline is async and multi-tenant:

  • Product metadata flows in via SQS
  • A separate image queue runs through our own product-image processors
  • Indexed embeddings (1,268 dimensions, vision-transformer-based) live in Qdrant
  • We do cosine similarity at query time and cache hot neighborhoods so repeat clicks (and "shop the look" expansions) are instant

Originally we shipped on OpenSearch with sequential queries. It was fast to build, but it wasn't going to hold at scale. Moving to Qdrant was the right call.
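A minimal sketch of the indexing and query path, assuming a collection named `products` and a hypothetical `embed()` helper that stands in for the vision-transformer encoder; only the 1,268-dimensional vectors and the cosine distance come from the pipeline described above.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Collection sized for the vision-transformer embeddings, compared by cosine distance.
client.recreate_collection(
    collection_name="products",
    vectors_config=VectorParams(size=1268, distance=Distance.COSINE),
)

# Indexing: one point per segmented product crop; the payload carries the SKU.
# embed() is a hypothetical helper standing in for the vision-transformer encoder.
client.upsert(
    collection_name="products",
    points=[PointStruct(id=1, vector=embed(product_crop), payload={"sku": "SKU-12345"})],
)

# Query time: nearest neighbors by cosine similarity for the clicked region.
hits = client.search(
    collection_name="products",
    query_vector=embed(query_crop),
    limit=10,
)
```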

Why segmentation matters more than bounding boxes

A common question after this talk: why bother with segmentation when bounding boxes are cheaper?

Because the embedding you get from a tightly cropped, segmented region is dramatically better than one polluted by background pixels. The vector database doesn't know what "furniture" is — it just knows distances. Clean inputs make those distances meaningful.
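As an illustration of the idea (not our actual preprocessing code), masking out background pixels before cropping might look like this; `mask_background` is a hypothetical helper.

```python
import numpy as np
from PIL import Image

def mask_background(image: Image.Image, mask: np.ndarray) -> Image.Image:
    """Keep only pixels inside the segmentation mask, then crop tightly around them."""
    rgb = np.array(image.convert("RGB"))
    keep = mask.astype(bool)
    rgb[~keep] = 255  # neutral fill so background pixels add no signal to the embedding
    ys, xs = np.where(keep)
    return Image.fromarray(rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
```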

Where this runs in production

Our largest deployment is with Vivre, the largest furniture marketplace in Romania and CEE. The system handles their full catalog, returns visually similar alternatives in real time, and powers shop-the-look across hundreds of thousands of SKUs.

The lesson, in one line: the difference between a cool demo and a production visual search engine is measured in milliseconds — and you only get there by being willing to throw away the model you started with.

Watch the full talk above for the architecture diagrams and the live demos.