Rebuilt the merchandising engine for a D2C consumer brand, lifting AOV by $14.20
Client: D2C Consumer Brand (Apparel, $60M+ GMV)
A mid-stage D2C apparel brand was running its recommendation engine on a static rules table last updated in 2022. We replaced it with a learned merchandising system trained on 26 months of clickstream and return data. Average order value rose $14.20 against a holdout control over an 8-week A/B window.
- revenue per session: +11.3%
- return rate on recommended items: -9.1 pts
- projected annual revenue lift: $3.8M
- merchandiser time freed for new categories: 40 hrs/mo
Challenge
The client's merchandising function had ossified. Product recommendations on the PDP, cart, and post-purchase email were all driven by a hand-maintained spreadsheet of 'goes-well-with' pairings that their head of merch had been updating personally for three years. Return rate on recommended products was 22% — six points above the store average — suggesting the rules were aging poorly as the catalog grew from 400 to 2,100 SKUs.
Leadership had been pitched four different 'AI personalization' platforms, each quoting $240k-$480k annually plus 90+ days of integration. The platforms all looked similar in demos and none could explain how they would handle the client's actual data: a catalog where size curves varied wildly by subcategory and where 34% of revenue came from 'drops' that had no historical signal at all. We were brought in to sanity-check the procurement decision.
Approach
Our recommendation was to not buy a platform. The client had three things those platforms needed to be useful — a Shopify Plus data stack, 26 months of clean clickstream, and a merchandising lead willing to be the domain expert — and building internally would leave them with the underlying system as an asset rather than a recurring license.
We built in three layers. First, a two-tower retrieval model on clickstream + catalog metadata, which handled the 66% of revenue that had meaningful historical signal. Second, a cold-start layer for new drops that used catalog embeddings (image + description + price band) to bootstrap from the closest historical analogs. Third, a business-rules overlay the merchandising lead could edit directly — because some decisions (this is the hero SKU for the season, this sold-out size is not coming back) are facts the model has no way to learn.
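The cold-start and overlay layers can be sketched as below. This is a minimal illustration, not the production system: the function names, SKU identifiers, three-dimensional vectors, and the `pinned`/`blocked` rule shape are all hypothetical. It assumes each SKU already has a catalog embedding and that the retrieval model's recommendations for historical SKUs are available as a plain lookup.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def closest_analogs(new_vec, historical, k=2):
    """Return the k historical SKUs whose embeddings are nearest to new_vec."""
    ranked = sorted(historical,
                    key=lambda sku: cosine(new_vec, historical[sku]),
                    reverse=True)
    return ranked[:k]

def apply_overlay(candidates, pinned=(), blocked=()):
    """Merchandiser rules: blocked SKUs are dropped, pinned SKUs go first."""
    blocked = set(blocked)
    pinned_kept = [s for s in pinned if s not in blocked]
    rest = [s for s in candidates if s not in blocked and s not in pinned_kept]
    return pinned_kept + rest

def recommend_for_drop(new_vec, historical, analog_recs,
                       pinned=(), blocked=(), k=2, n=3):
    """Borrow recommendations from the nearest analogs, then apply the rules."""
    merged = []
    for sku in closest_analogs(new_vec, historical, k):
        for rec in analog_recs.get(sku, []):
            if rec not in merged:
                merged.append(rec)
    return apply_overlay(merged, pinned, blocked)[:n]

# Made-up data for illustration only.
historical = {
    "tee-01": [1.0, 0.0, 0.1],
    "tee-02": [0.9, 0.1, 0.1],
    "jean-07": [0.0, 1.0, 0.2],
}
analog_recs = {
    "tee-01": ["short-03", "cap-11"],
    "tee-02": ["cap-11", "sock-05"],
    "jean-07": ["belt-02"],
}
new_drop = [0.95, 0.0, 0.1]
print(recommend_for_drop(new_drop, historical, analog_recs,
                         pinned=["hero-01"], blocked=["cap-11"]))
```

Keeping the overlay as a final, separate step means a merchandiser's pin or block always wins over the model's output, which matches the intent described above.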
The A/B framework ran against a 15% holdout from week 5 through week 12. We instrumented not just conversion and AOV but return rate, because recommending the wrong product to the wrong person is worse than recommending nothing.
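A holdout split of this kind is commonly implemented by hashing a stable visitor ID, so the same visitor always lands in the same arm. The sketch below shows one way to do that and to compute the three metrics named above. Only the 15% share comes from the engagement; the salt, function names, and order fields are assumptions for illustration and do not describe the in-house framework.

```python
import hashlib

HOLDOUT_SHARE = 0.15  # 15% holdout, as described above

def assign_arm(visitor_id, salt="merch-engine-ab"):
    """Deterministically map a visitor to 'holdout' or 'treatment'.

    The salt is a hypothetical experiment key; changing it reshuffles arms.
    """
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16 ** 8  # roughly uniform in [0, 1)
    return "holdout" if bucket < HOLDOUT_SHARE else "treatment"

def arm_metrics(sessions, orders):
    """Summarize one arm.

    sessions: number of sessions in the arm.
    orders: list of dicts with order_value, rec_items (recommended items
            purchased) and rec_items_returned (hypothetical field names).
    """
    revenue = sum(o["order_value"] for o in orders)
    rec_items = sum(o["rec_items"] for o in orders)
    rec_returned = sum(o["rec_items_returned"] for o in orders)
    return {
        "aov": revenue / len(orders) if orders else 0.0,
        "revenue_per_session": revenue / sessions if sessions else 0.0,
        "rec_return_rate": rec_returned / rec_items if rec_items else 0.0,
    }

# Made-up orders for illustration only.
sample_orders = [
    {"order_value": 100.0, "rec_items": 2, "rec_items_returned": 1},
    {"order_value": 50.0, "rec_items": 2, "rec_items_returned": 0},
]
print(assign_arm("visitor-123"))
print(arm_metrics(sessions=4, orders=sample_orders))
```

Tracking the return rate alongside AOV in the same summary keeps a recommendation that sells but comes back from looking like a win.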
Outcome
Over the 8-week A/B window, the treatment arm posted $14.20 higher AOV and 11.3% higher revenue per session than the holdout. More importantly, return rate on recommended items fell from 22% to 12.9%, a drop of 9.1 percentage points: the old rules had been pushing volume at the cost of fit. Projected annual revenue lift at current traffic levels is $3.8M, against a total engagement cost under 8% of that one-year lift. The merchandising lead was freed from roughly 40 hours per month of spreadsheet maintenance and is now using that time to stand up two new product categories.
Stack
- Two-tower retrieval (PyTorch)
- Shopify + BigQuery pipeline
- OpenAI embeddings for cold-start catalog
- In-house A/B framework
Working on something similar?
A partner will respond personally within one business day. If there isn't a fit, we'll tell you that, and point you somewhere better.