A Magento 2 AI concierge that uses tool calling + semantic search (Qdrant) to recommend products from your live catalog, reduce token costs, and hand off qualified leads to human specialists with full chat context.
High-ticket or spec-heavy products aren’t bought like “regular products”. Customers want reassurance, comparisons, detailed specs, and availability quickly. On the store side, this becomes repetitive conversations across chat/email/DMs, plus lost leads when users don’t get a confident answer at the exact moment they’re ready to buy.
Our case study example in this post is a luxury watch catalog (where details like condition, reference numbers, and water resistance matter), but the same architecture applies to many catalogs: electronics, automotive parts, B2B components, furniture, premium fashion, and more.
We built our Magento 2 AI concierge module to solve this with a practical approach: keep the AI “smart”, but keep it on rails. The assistant answers using your real Magento catalog, not generic knowledge, and it can hand the conversation to a human specialist when the user is ready.
This post breaks down what the module does, how it works (tool calling + semantic search), what we did for safety and scaling, and why the handoff flow is the difference between a chatbot and a revenue tool.
Why this matters right now
AI assistants are becoming a normal part of the shopping experience. Users increasingly expect to ask a question and get a clear, accurate answer immediately, without hunting through menus, filters, and FAQ pages.
Time is expensive, and convenience wins. An on-site assistant that understands the page context and can pull the right information from your catalog is a practical step toward that future. It reduces friction for customers and reduces repetitive work for your team.
TL;DR: what your store gets
- 24/7 on-site consultation: discovery, filtering, comparisons, product questions.
- “Meaning-based” search for lifestyle queries (“diving”, “dressy but sporty”, “heritage vibe”).
- Catalog-grounded answers: the assistant responds using real product data (no invented SKUs/prices).
- Language adaptation: the assistant can respond in the shopper’s language, which lowers the language barrier and expands your reachable audience.
- Lower LLM cost and faster responses by retrieving only the most relevant items.
- Lead handoff to a human specialist with a request ID and full conversation history.
If you’re non-technical: you can think of this as “a sales assistant that reads your catalog and asks your store for facts before answering”.
Prerequisites (what you need to run it)
- Magento 2.4+ / PHP 8.1+
- An AI provider key (Groq or OpenAI) or a self-hosted OpenAI-compatible endpoint (Ollama/LiteLLM)
- Optional but recommended for “semantic” queries: Qdrant + an embeddings provider
Step 1: Understand the scaling problem (catalogs grow, prompts explode)
Even with great filters, shoppers ask semantic questions:
- “I need a proper diving watch”
- “Something sporty that still works with a suit”
- “What’s a Rolex Submariner alternative with similar vibes?”
If you try to solve this by shoving large catalog chunks into the prompt, you hit:
- higher token usage (and cost) per message,
- slower responses,
- worse accuracy (models start guessing),
- a hard ceiling as the catalog grows.
The fix is not “a smarter model”. The fix is architecture: the model should request only the data it needs, on demand.
Step 2: Keep answers grounded with tool calling (no hallucinated products)
Our module is a frontend chat widget + Magento REST API where the LLM acts as an orchestrator and uses tools (function calling / tool calling) to fetch structured data from the catalog.
Instead of giving the model your entire catalog, we let it:
- interpret the user intent,
- call a tool like search_watches or get_watch_details,
- receive structured results (from Magento),
- craft the answer based strictly on those results.
This “ask tools first” pattern is what prevents hallucinations like “we have that model in stock” when you don’t.
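To make the "ask tools first" pattern concrete, here is a minimal sketch of the dispatch loop in Python. The catalog rows, tool signatures, and the tool-call dict are illustrative stand-ins; in the real module the tools hit Magento REST, and the call comes from the LLM's function-calling output.

```python
# Sketch of the "ask tools first" loop. The tiny in-memory CATALOG stands in
# for Magento; tool names mirror the module's, arguments are illustrative.

CATALOG = [
    {"sku": "OM-300", "name": "Omega Seamaster 300", "price": 6500, "water_resistance_m": 300},
    {"sku": "TU-BB58", "name": "Tudor Black Bay 58", "price": 4200, "water_resistance_m": 200},
]

def search_watches(query, max_price=None):
    """Stand-in for the Magento-backed search tool: returns only real rows.

    The stub ignores `query`; the real tool would filter/rank by it."""
    return [p for p in CATALOG if max_price is None or p["price"] <= max_price]

def get_watch_details(sku):
    """Stand-in for the product-details tool."""
    return next((p for p in CATALOG if p["sku"] == sku), None)

TOOLS = {"search_watches": search_watches, "get_watch_details": get_watch_details}

def run_tool_call(call):
    """Dispatch a tool call emitted by the LLM to the matching catalog function."""
    return TOOLS[call["name"]](**call["arguments"])

# For "dive watch under 5000", the model would emit a call shaped like this:
call = {"name": "search_watches", "arguments": {"query": "dive watch", "max_price": 5000}}
results = run_tool_call(call)
# The final answer is then composed strictly from `results`, never from model memory.
```

Because the answer is built only from what the tools return, the model physically cannot claim a SKU or price that the store did not send back.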
In practice we expose a small set of catalog tools. Beyond search and details, one of the most valuable tools on a product page is “find similar items”. When a shopper asks for alternatives, the assistant can retrieve a curated list of close matches and adjacent options instead of forcing the user back into filters.
What customers see (UX)
- A floating chat widget across the storefront
- Quick actions (e.g., “Help me choose”, “Contact a specialist”)
- Product-page context (“I have a question about this product”)
- Answers in the shopper’s language when possible
- Persistent chat history (session-based)
What your team gets (operations)
- Fewer repetitive questions hitting sales/support
- Better-qualified handoffs (“here’s the exact short list + chat context”)
- A controllable cost envelope (rate limits + fewer tokens)
Step 3: Add semantic search (Qdrant + embeddings) for lifestyle queries
Our module implements a semantic-first strategy: when vector search is enabled, search_watches uses Qdrant (vector DB) + embeddings.
In this case study, the tool is called search_watches because the catalog is watches. The pattern is general: you define a tool for your catalog (e.g., search_products) that returns structured, real inventory data, and let the model use it.
This matters because “diving” often isn’t a keyword in the product title. Semantic retrieval can still surface relevant models via water resistance, model line positioning, and description signals.
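Here is a toy illustration of why this works: products and queries become vectors, and ranking is by cosine similarity rather than keyword overlap. The three-dimensional hand-made "embeddings" are placeholders; a real setup uses Qdrant plus an embedding model with hundreds of dimensions.

```python
# Toy semantic retrieval: rank products by cosine similarity to the query
# vector, so "diving" can match a product whose title never says "diving".
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 3-dim "embeddings": [sportiness, water capability, dressiness]
products = {
    "Seamaster Diver 300M": [0.7, 0.9, 0.2],
    "De Ville Prestige":    [0.1, 0.1, 0.9],
}

query_vec = [0.6, 0.95, 0.1]  # stand-in embedding of "I need a proper diving watch"
ranked = sorted(products, key=lambda name: cosine(query_vec, products[name]), reverse=True)
# The diver ranks first despite zero keyword overlap with the query.
```

In production the same comparison happens inside Qdrant, which also handles filtering and scale; the principle is identical.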
A note on embeddings (what we use as a baseline)
Embedding quality matters a lot for semantic search. In this module we support multiple embedding providers, but our baseline choice is Mixedbread’s mixedbread-ai/mxbai-embed-large-v1, because it delivers consistently strong retrieval quality in real-world catalog queries.
Example (what users actually type)
“I need a proper diving watch under 12k, ideally Swiss.”
With semantic retrieval enabled, we can return a short, relevant shortlist instead of “everything under 12k”.
What it changes in practice (our measurements)
From our internal module notes (HOW_IT_WORKS.md), switching to semantic search typically yields the following gains; your mileage may vary with catalog size, data quality, and infrastructure. These numbers come from a real watch catalog setup, but the same dynamics apply to any spec-heavy catalog where the naive approach would push too many products into the prompt.
- Response time: about 2 to 4 seconds vs 8 to 15 seconds
- Tokens per request: roughly 500 to 800 vs 5,000 to 10,000
- Cost per request: typically 5x to 15x lower (driven by smaller context and fewer tokens; exact ratio depends on your provider/model)
- Relevance: 85% to 95% vs 60% to 70%
If “tokens” are new to you, it’s worth a quick read on why extra context directly affects cost and latency.
Quick takeaway: faster responses, far fewer tokens, and more relevant shortlists. That is exactly what you want for conversion-sensitive conversations.
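To make the cost ratio concrete, here is a back-of-envelope sketch using the mid-points of the token ranges above and a hypothetical price of $0.50 per million input tokens (substitute your provider's actual rate):

```python
# Rough cost comparison: naive "whole catalog in the prompt" vs semantic
# retrieval. The per-token price is a hypothetical placeholder.
PRICE_PER_TOKEN = 0.50 / 1_000_000   # $0.50 per 1M input tokens (assumed)

naive_tokens = 7_500      # mid-range of 5,000-10,000 tokens per request
semantic_tokens = 650     # mid-range of 500-800 tokens per request

naive_cost = naive_tokens * PRICE_PER_TOKEN
semantic_cost = semantic_tokens * PRICE_PER_TOKEN
savings_ratio = naive_cost / semantic_cost   # lands inside the quoted 5x-15x band
```

The ratio is driven almost entirely by context size, which is why retrieval quality (send fewer, better products) beats model choice for cost control.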
Fallback behavior (important for production)
Vector search is optional. If Qdrant is disabled or unavailable, the assistant can fall back to a more traditional filter/search approach and still function (with different relevance/cost characteristics).
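A minimal sketch of that degradation path, with illustrative function names (not the module's actual API): try the vector backend first, and if it is unreachable, serve a plain keyword match rather than failing the chat.

```python
# Fallback sketch: vector search first, keyword filter if Qdrant is down.
# Names and the toy catalog are illustrative.

def keyword_search(catalog, query):
    terms = query.lower().split()
    return [p for p in catalog if any(t in p["name"].lower() for t in terms)]

def search_with_fallback(catalog, query, vector_search=None):
    if vector_search is not None:
        try:
            return vector_search(query), "vector"
        except ConnectionError:
            pass  # Qdrant unreachable: degrade gracefully instead of erroring
    return keyword_search(catalog, query), "keyword"

catalog = [{"name": "Omega Seamaster Diver"}, {"name": "Cartier Tank"}]

def broken_backend(query):
    raise ConnectionError("qdrant unreachable")

hits, mode = search_with_fallback(catalog, "seamaster", vector_search=broken_backend)
# The chat keeps working, just with keyword-level relevance.
```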

Step 4: Improve result quality with domain logic (hybrid ranking + “Dive Watch Boost”)
Vector search is great at finding candidates. We still add domain logic to make results feel “human correct”:
- Domain boosts: if the query is dive/water-related, we boost results by water resistance (WaterResistanceBoostService). In other catalogs this could be “compatibility first”, “in-stock first”, “fitment first”, etc.
- Lexical rerank: after loading products from Magento, we re-rank using lexical signals to keep obvious matches on top (LexicalRerankService).
The pattern is simple: vectors get you high recall; domain logic gives you high precision.
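A compact sketch of that recall-then-precision pipeline (the boost weights and thresholds here are illustrative, not the module's actual values):

```python
# Start from vector scores, add a domain boost for dive-capable pieces on
# water-related queries, then a lexical boost for obvious name matches.
# Weights are illustrative placeholders.

def rank(candidates, query):
    water_query = any(w in query.lower() for w in ("dive", "diving", "swim", "water"))
    scored = []
    for c in candidates:
        score = c["vector_score"]
        if water_query and c["water_resistance_m"] >= 200:
            score += 0.15   # domain boost (the WaterResistanceBoostService idea)
        if any(t in c["name"].lower() for t in query.lower().split()):
            score += 0.10   # lexical boost (the LexicalRerankService idea)
        scored.append((score, c["name"]))
    return [name for _, name in sorted(scored, reverse=True)]

candidates = [
    {"name": "Dress Classic 38", "vector_score": 0.62, "water_resistance_m": 30},
    {"name": "Pro Diver 300",    "vector_score": 0.58, "water_resistance_m": 300},
]
order = rank(candidates, "proper diving watch")
# The domain boost lifts the true diver above a slightly "closer" vector match.
```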
Finding similar items (hybrid search plus attribute boosting)
People often decide by comparison. “Show me similar products” is one of the highest intent requests you can get on a product page.
To make this work well, we use a hybrid approach:
- A brand and model oriented query to surface close variants (good for “same line, different dial” cases).
- A characteristics oriented query to surface cross brand alternatives (good for discovery).
- Merge and deduplicate results.
- Apply attribute-based boosting using structured product attributes (brand, series, case material, band material, style, movement, bezel). The total boost is capped so vector similarity still matters.
The result is usually a healthy mix of variants and alternatives. This improves exploration, keeps users on the site longer, and often increases the chance of a handoff or conversion when the exact item is not perfect.
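The merge-deduplicate-boost steps above can be sketched as follows; attribute names, per-match weight, and the cap are illustrative, but the capped-boost idea is the point: similarity still dominates.

```python
# Similar-items sketch: merge brand/model hits with characteristics hits,
# deduplicate by SKU, then apply a capped attribute boost. Values illustrative.

def merge_dedupe(brand_hits, char_hits):
    seen, merged = set(), []
    for hit in brand_hits + char_hits:
        if hit["sku"] not in seen:
            seen.add(hit["sku"])
            merged.append(hit)
    return merged

def attribute_boost(source, candidate, per_match=0.05, cap=0.15):
    attrs = ("brand", "style", "movement")
    matches = sum(1 for a in attrs if source.get(a) == candidate.get(a))
    return min(matches * per_match, cap)   # cap keeps vector similarity decisive

source = {"sku": "A", "brand": "Omega", "style": "dive", "movement": "automatic"}
brand_hits = [{"sku": "B", "score": 0.80, "brand": "Omega", "style": "dive", "movement": "automatic"}]
char_hits = [
    {"sku": "B", "score": 0.80, "brand": "Omega", "style": "dive", "movement": "automatic"},
    {"sku": "C", "score": 0.78, "brand": "Tudor", "style": "dive", "movement": "automatic"},
]

ranked = sorted(
    merge_dedupe(brand_hits, char_hits),
    key=lambda c: c["score"] + attribute_boost(source, c),
    reverse=True,
)
# B (same-line variant) leads, C (cross-brand alternative) follows: a healthy mix.
```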
Optional: model-based reranking (future upgrade)
Mixedbread also provides dedicated reranker models that can further improve ordering quality on ambiguous queries. We did not add or benchmark them yet because the current combination of semantic retrieval plus lightweight reranking already performs well, and extra optimization did not feel necessary at this stage. If we need to push relevance further (larger catalogs, noisier attributes, tougher queries), adding a model-based reranker is a clear next step.
Step 5: Add production controls (safety + cost containment)
We baked in “production-grade” controls so the assistant is safe, predictable, and cost-controlled:
- Rate limiting (per interval / per minute / per day): Model/RateLimiter.php
- Input validation + early prompt-injection blocking (before calling the LLM): Model/InputValidator.php
- Session + access control for customer sessions: Model/SessionManager.php, Model/ChatService.php
- Encrypted API keys in Magento config
- API origin validation: our REST endpoints accept requests only from the same site origin (Origin or Referer header check). External calls get HTTP 403.
- Guest access control: guest chat can be disabled to require customer login (useful when you want stricter cost control or a more gated experience).
- Admin “test connection” buttons for AI, Qdrant, and embeddings
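As one example of these controls, here is a minimal per-minute rate limiter in the spirit of Model/RateLimiter.php, sketched in Python for readability (the real implementation is PHP; the threshold is illustrative):

```python
# Sliding-window per-minute rate limiter keyed by chat session.
import time

class RateLimiter:
    def __init__(self, per_minute=10):
        self.per_minute = per_minute
        self.hits = {}   # session_id -> list of request timestamps

    def allow(self, session_id, now=None):
        now = time.time() if now is None else now
        window = [t for t in self.hits.get(session_id, []) if now - t < 60]
        if len(window) >= self.per_minute:
            self.hits[session_id] = window
            return False    # over budget: reject before any LLM tokens are spent
        window.append(now)
        self.hits[session_id] = window
        return True

limiter = RateLimiter(per_minute=3)
verdicts = [limiter.allow("chat_abc", now=100.0 + i) for i in range(4)]
# First three requests in the window pass; the fourth is rejected.
```

Rejecting before the LLM call is what makes this a cost control, not just an abuse control.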
Guest chat access: benefits and trade-offs
Allowing guests to use chat is usually the best default for conversion, but it comes with real trade-offs. Here is how we think about it:
Pros
- Lower friction. Users can ask questions immediately, before they commit to an account.
- Better onboarding. First time visitors can ask in plain language instead of learning your navigation and filters upfront.
- Better lead capture. More users reach the handoff step when the assistant is available early.
- Better UX for mobile. Guests often do not want to log in just to ask one question.
Cons
- Higher abuse surface. Public endpoints can be targeted for spam or token burn attempts.
- Higher operational cost. More anonymous usage means more traffic to rate limit and monitor.
- Less personalization. Without login, you have less reliable user context.
In our implementation we mitigate the cons with rate limiting, message validation, prompt-injection blocking, origin validation (403 for off-site calls), and separate session behavior for guests vs customers. If guest access becomes a problem, you can switch it off and keep chat for logged-in customers only.
For security context, OWASP also publishes common LLM app risks (prompt injection included).
Step 6: Turn chat into lead gen with specialist handoff (with full context)

In luxury e-commerce, conversion often happens after the “final questions”. So the module includes a handoff flow:
- Detects the “I want a specialist” intent
- Collects contact details (name/email/phone)
- Creates a unique request ID like HO-XXXXXXXX
- Stores the full chat transcript with the request
- Provides admin workflows to review, change status, and reply
That’s the difference between an AI “chatbot” and an AI concierge that generates qualified leads.
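The shape of a handoff ticket can be sketched like this; the field names are illustrative (not the module's schema), but the essentials are the unique HO- id and the transcript travelling with the lead:

```python
# Handoff ticket sketch: unique id + contact details + full transcript,
# so the specialist starts with complete context. Field names illustrative.
import secrets

def create_handoff(contact, transcript):
    return {
        "request_id": "HO-" + secrets.token_hex(4).upper(),  # e.g. HO-9F3A21BC
        "contact": contact,
        "transcript": transcript,   # full chat history travels with the lead
        "status": "new",            # admin workflow advances this later
    }

ticket = create_handoff(
    {"name": "Ada", "email": "ada@example.com"},
    [{"role": "user", "content": "Is the Seamaster 300 available in blue?"}],
)
```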
Step 7: How the flow works (high-level)

Here’s the mental model:
- Customer asks a question in chat.
- The AI decides which store tool it needs (search, details, brands, price range).
- Magento returns real catalog data.
- The AI formats an answer for the customer.
- If the user wants a specialist, we create a handoff ticket with the full chat history.
This keeps the model “useful” while keeping the store in control of truth (catalog) and risk (limits, validation, safety).
Step 8: Technical overview (for CTOs and developers)
- REST endpoints: POST /rest/V1/ai-concierge/chat, GET /rest/V1/ai-concierge/chat/:sessionId, POST /rest/V1/ai-concierge/handoff
- Tools: search_watches, get_available_brands, get_price_range, get_watch_details
- LLM providers (OpenAI-compatible): Groq / OpenAI / Ollama / LiteLLM
- Vector search stack: Qdrant + embeddings (Ollama / OpenAI / Cohere / Mixedbread)
- Index updates: CLI commands + auto-index observers on product changes
More details live in the module docs (ARCHITECTURE.md and EXTERNAL_DEPENDENCIES.md).
Step 9: Quick start (practical checklist)
The chat widget is easy to enable, but the full stack depends on whether you use hosted APIs or self-hosted services. Before calling it “done”, make sure the external dependencies are actually reachable. The full list and setup notes live in EXTERNAL_DEPENDENCIES.md.
Option A: Hosted (fastest to start)
- Use a hosted AI provider (Groq or OpenAI).
- If you want semantic search, run Qdrant (self-hosted or cloud) plus an embeddings provider (Mixedbread / OpenAI / Ollama).
Option B: Self-hosted (more control)
- Run an OpenAI-compatible endpoint (Ollama or LiteLLM) for chat and/or embeddings.
- Run Qdrant for vector search.
Enable the module
Configure the provider in Magento Admin: Stores > Configuration > Extensions > AI Concierge
Then enable the module:
bin/magento module:enable Vendor_AiConcierge
bin/magento setup:upgrade
bin/magento cache:flush
If you enabled semantic search, index your products:
bin/magento vendor:aiconcierge:index-vectors
Minimal REST example (for a custom frontend or mobile app):
POST /rest/V1/ai-concierge/chat
Content-Type: application/json
{"message":"Show me Omega dive watches under 12000","sessionId":"chat_..."}
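The same call from Python, using only the standard library. The base URL and session id are placeholders; sending the request requires a live store with the module installed.

```python
# Build the chat request shown above. Base URL and session id are placeholders.
import json
import urllib.request

def build_chat_request(base_url, message, session_id):
    body = json.dumps({"message": message, "sessionId": session_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rest/V1/ai-concierge/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "https://store.example.com",
    "Show me Omega dive watches under 12000",
    "chat_demo_session",
)
# urllib.request.urlopen(req) would send it against a live store.
```

Note that off-site calls like this one would receive HTTP 403 unless they pass the module's origin validation, so a custom frontend must run on the same origin or be explicitly allowed.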
Step 10: Next improvements (if you want to scale further)
- Intent analytics (what customers actually ask)
- A/B testing conversation flows (“help me choose” vs “compare” vs “explain differences”)
- CRM/helpdesk integration for handoffs
- Localization and store-specific prompt variants
- Personalization based on browsing behavior
Conclusion
An AI concierge is not “just add ChatGPT”. For high-ticket catalogs, you need a controlled setup: tool-based retrieval from the live catalog, semantic search for intent-heavy queries, safety controls, and a clean handoff to humans when it’s time to close.
Sources (external)
- OpenAI function calling / tool calling
- OpenAI embeddings
- Qdrant documentation
- Qdrant search concepts and Qdrant filtering concepts
- Groq OpenAI compatibility
- LiteLLM docs
- Ollama
- Mixedbread embedding model (mixedbread-ai/mxbai-embed-large-v1)
- OWASP Top 10 for LLM Applications
- Tokens explained
If you want to implement this on your catalog
If you run Magento 2 and sell high-ticket or spec-heavy products, the fastest way to validate this approach is a small pilot:
- A short discovery to understand your catalog and support/sales flow
- Provider choice (Groq/OpenAI vs self-hosted) + safety limits
- Semantic search (Qdrant) where it makes sense + fallback when it doesn’t
- Handoff integration so conversations turn into actionable leads
If you want help scoping or implementing it, we can run a demo on your catalog and propose a rollout plan.