Client Case Study

Case study: 432 Indonesian annotators for a 100+ hypermarket retail network in France

Case study · April 21, 2026 · 9 min read

This case study documents a multi-year Zipang production program in computer-vision annotation for a 100+ hypermarket retail network in France. The client is not public, so the program is described operationally rather than by name. The numbers below (432 deployed, 208 in active production, 3.4M production tasks per month, 90%+ sustained accuracy, microsecond-level KPI tracking) are first-party data from Zipang operations and are consistent with what buyers can verify during a vendor diligence call. The program is referenced in the Zipang BPO rankings and the 5-gate screening walkthrough; this article is the long-form version of how it actually runs.

Key stats

432: Zipang professionals deployed (France retail AI)
3.4M: Production tasks per month (France retail AI)
90%+: Sustained production accuracy
~48%: Onboard-to-production conversion
208: Active production at any time
6 months: Ramp-up to steady-state

Source for all figures: [Zipang Research]

What is shelf-edge computer vision annotation?

Shelf-edge computer vision annotation is the human labeling work behind in-store retail AI. Annotators review images and short video clips of supermarket shelves, identify SKUs at the SKU-pack level, draw bounding boxes or polygons, classify promotions and price tags, and flag ambiguous cases back to a quality queue. The labeled output trains and evaluates the models that power shelf-edge compliance, planogram adherence, out-of-stock detection, and dynamic price-tag recognition. For a 100+ hypermarket network, the SKU universe is in the hundreds of thousands, price tags change daily, and OCR ambiguity is the rule, not the exception, which is why the labeling work itself is the bottleneck, not the model.

The client situation

The client is a French retail group operating 100+ hypermarkets across mainland France and overseas territories. Their internal computer-vision team had built a shelf-edge CV model in-house; the bottleneck was not the model, it was the labeled training and evaluation data feeding it. With 300,000+ active SKUs across private label and branded items, and price tags refreshed daily across thousands of stores, the dataset needed continuous high-volume, high-accuracy human review.

The client had tried internal labeling teams and two prior external vendors. Internal teams could not scale to the daily volume without headcount expansion that exceeded budget. The first external vendor delivered acceptable accuracy on simple cases but missed the price-tag and promotion edge cases that mattered most. The second vendor was cheaper, but its accuracy slipped below 85% during ramp, prompting the client to re-bid the program.

Zipang was the third vendor brought in, in 2022, and the program has run continuously since. This case study documents the operational shape, the methodology, and the honest results, including what did not work.

The labeling problem: 300k+ SKUs, daily price-tag churn, ambiguous OCR

Retail-shelf CV annotation is unusually hard for three reasons. First, the SKU universe is enormous: the client's catalog runs to 300,000+ unique product-pack combinations, and the model must distinguish between, for example, two sizes of the same product with different promotional stickers. Second, price tags change every day: a hypermarket price team may update 20,000+ price tags across the network on a single Tuesday, and the labeling queue must absorb that daily churn. Third, OCR on price tags in hypermarket conditions is genuinely ambiguous: glare, partial occlusion, curved tags, and multiple overlapping promotions force a human to make judgement calls the model cannot.

The hardest subset, by error rate, was promotional and bundle stickers: temporary price reductions, multi-buy offers, and loyalty-card-specific prices. In the first 60 days of the program, QA reviewers spent a disproportionate share of their time on exactly these cases. By month four, the team had built a gold set of 2,000+ promotion-edge-case labels that became the spine of the weekly calibration cycle.

Why Indonesia: English, French familiarity, B2 CEFR, time zone, scale

Three factors drove the location decision. First, English at CEFR B2 is the operational language for SOPs and client communication; Indonesia's EF EPI 2025 placement at B2 (Moderate) puts the production pool inside the B2 band, and Zipang's 5-gate funnel filters to candidates who clear an internal English reading and SOP-comprehension test. Second, working-level French familiarity is present in a subset of the Indonesian talent pool: university French programs, francophone media exposure, and a small but durable community of Indonesian graduates from French-speaking institutions. Third, time-zone overlap with Paris (UTC+1, or UTC+2 in summer) is partial but workable: an Indonesia morning shift (07:00–14:00 WIB, which is 01:00–08:00 in Paris in winter and 02:00–09:00 in summer) has labeled output ready by the start of the Paris working day, and a follow-the-sun handoff to a Jakarta evening shift covers the European afternoon, giving the client 12+ hours of effective coverage per day.
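The shift arithmetic is easy to check. The sketch below converts a WIB shift window to Paris wall-clock time using fixed UTC offsets (WIB is UTC+7 with no daylight saving; Paris is UTC+1 in winter and UTC+2 in summer). The function name and the example date are illustrative assumptions, not part of Zipang's tooling.

```python
from datetime import date, datetime, time, timedelta, timezone

WIB = timezone(timedelta(hours=7), "WIB")    # Western Indonesia Time, no DST
CET = timezone(timedelta(hours=1), "CET")    # Paris, winter
CEST = timezone(timedelta(hours=2), "CEST")  # Paris, summer


def shift_in_paris(start, end, paris_tz, day=date(2026, 1, 15)):
    """Return a WIB shift window as Paris wall-clock (start, end) times.

    `day` is an arbitrary example date; only the offsets matter here.
    """
    start_wib = datetime.combine(day, start, tzinfo=WIB)
    end_wib = datetime.combine(day, end, tzinfo=WIB)
    return (
        start_wib.astimezone(paris_tz).time(),
        end_wib.astimezone(paris_tz).time(),
    )


morning_shift = (time(7, 0), time(14, 0))  # 07:00-14:00 WIB, from the text
for label, tz in (("winter", CET), ("summer", CEST)):
    paris_start, paris_end = shift_in_paris(*morning_shift, tz)
    print(f"Paris {label}: {paris_start:%H:%M}-{paris_end:%H:%M}")
```

On these offsets the 07:00–14:00 WIB shift runs from 01:00 to 08:00 Paris time in winter and from 02:00 to 09:00 in summer, so it is the evening shift that covers the European afternoon.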

The scale factor is the fourth reason. The 432-operator ramp required absorbing 4–5× the typical Indonesian annotation cohort within 6 months, and Indonesia's 280M+ population (BPS 2024) made that ramp feasible without saturating any single metro labor market. The cohort was distributed across Jakarta, Bandung, Yogyakarta, Surabaya, and Medan.

The 5-gate funnel as applied to this program

Zipang's 5-gate funnel (CV relevance scan, async English and SOP-comprehension screening, role-specific quiz, structured video interview, paid trial task) was applied with a single modification: gate 3 (the role-specific quiz) was a French retail-labeling quiz built from 200+ representative shelf images and price-tag cases, scored against a rubric that mirrored the client's own gold set.

Aggregate pass rates: gate 1 retained 22% of applicants, gate 2 retained 41% of gate 1 survivors, gate 3 retained 58% of gate 2 survivors, gate 4 retained 67% of gate 3 survivors, and gate 5 retained 78% of gate 4 survivors. Multiplying the first four rates, roughly 3.5% of applicants reached the paid trial cohort (0.22 × 0.41 × 0.58 × 0.67 ≈ 0.035), and about 2.7% passed it; 48% of trialists converted to paid production. The same funnel that produces Zipang's other quality programs (Transperfect–Dataforce at 90%+ sustained accuracy) was reused here: the rubric changed, the funnel did not.

Gate 1: CV relevance (B2 English, prior labeling or data work), 22% pass
Gate 2: Async English + SOP comprehension test, 41% pass of survivors
Gate 3: French retail-labeling quiz on 200+ images, 58% pass of survivors
Gate 4: Structured video interview, French exposure probed, 67% pass of survivors
Gate 5: Paid trial task scored against client gold set, 78% pass of survivors
Trialist conversion: 48% to paid production
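Chaining the per-gate retention rates listed above gives the cumulative share of applicants surviving each gate. The following is a minimal sketch of that arithmetic; the helper names are hypothetical, and the sizing example assumes the aggregate rates also held for the first trial cohort of 60, which the source does not state.

```python
from itertools import accumulate
from math import ceil
from operator import mul

# Per-gate retention rates as stated in the case study.
GATE_RETENTION = {
    "gate 1: CV relevance": 0.22,
    "gate 2: English + SOP comprehension": 0.41,
    "gate 3: French retail-labeling quiz": 0.58,
    "gate 4: structured video interview": 0.67,
    "gate 5: paid trial task": 0.78,
}


def cumulative_retention(rates):
    """Share of the original applicant pool still standing after each gate."""
    return list(accumulate(rates, mul))


def applicants_needed(target_headcount, surviving_share):
    """Applicants required to end up with `target_headcount` survivors."""
    return ceil(target_headcount / surviving_share)


cumulative = cumulative_retention(GATE_RETENTION.values())
for gate, share in zip(GATE_RETENTION, cumulative):
    print(f"{gate:<40} {share:7.2%}")

# Share that clears gates 1-4 and therefore reaches the paid trial.
reach_trial = cumulative[3]
print(f"applicants needed for 60 trialists: {applicants_needed(60, reach_trial)}")
```

On the stated rates, about 3.5% of applicants clear gates 1–4 and about 2.7% clear all five, so a 60-person trial cohort implies an applicant pool in the region of 1,700.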

Headcount: 432 deployed, 208 in production at any time

Of the 432 professionals onboarded across the ramp, 208 sit in active production at any given moment. The remainder are in shadow queues, training cohorts, paid trial status, or rotating off-shift for the 4-day follow-the-sun overlap with the client's Paris team.

The 208 / 432 ratio is not attrition; it is the design. Retail-shelf CV annotation is a high-volume, high-judgement task, and the production cohort is intentionally kept below 50% of total onboarded to absorb training rotations, illness, and the deliberate 30% off-production time used for QA feedback loops, gold-set calibration, and cross-training on new SKU categories.

Volume: 3.4M production tasks per month

Production throughput runs at 3.4M tasks per month at steady state, which works out to roughly 16,000 tasks per production-operator per month, or about 600–700 per shift. Each task is a single image or short clip with a defined labeling rubric and a decision time budget of 8–45 seconds. Microsecond-level KPI tracking means every label, every reviewer intervention, and every re-annotation is timestamped in a way that lets the operations team slice the data by operator, by SKU category, by store, and by edge-case type.
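The throughput figures can be sanity-checked with a few lines of arithmetic. In this sketch the monthly volume and headcounts come from the text; the number of shifts per operator per month (25) and the 7-hour shift length (taken from the 07:00–14:00 WIB morning shift) are assumptions made for illustration.

```python
MONTHLY_TASKS = 3_400_000        # from the case study
PRODUCTION_OPERATORS = 208       # from the case study
TOTAL_DEPLOYED = 432             # from the case study
ASSUMED_SHIFTS_PER_MONTH = 25    # assumption: not stated in the source
ASSUMED_SHIFT_SECONDS = 7 * 3600 # assumption: one 07:00-14:00 shift


def tasks_per_operator_month():
    """Average monthly task load per production operator."""
    return MONTHLY_TASKS / PRODUCTION_OPERATORS


def tasks_per_shift(shifts_per_month=ASSUMED_SHIFTS_PER_MONTH):
    """Average tasks per operator per shift."""
    return tasks_per_operator_month() / shifts_per_month


def seconds_per_task(shifts_per_month=ASSUMED_SHIFTS_PER_MONTH):
    """Average wall-clock seconds available per task within one shift."""
    return ASSUMED_SHIFT_SECONDS / tasks_per_shift(shifts_per_month)


production_share = PRODUCTION_OPERATORS / TOTAL_DEPLOYED

print(f"tasks per operator per month: {tasks_per_operator_month():,.0f}")
print(f"tasks per shift:              {tasks_per_shift():.0f}")
print(f"seconds per task:             {seconds_per_task():.1f}")
print(f"production share of deployed: {production_share:.1%}")
```

Under these assumptions the average comes to roughly 650 tasks per shift and just under 40 seconds per task, which sits inside both the 600–700 range and the 8–45 second decision budget quoted above.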

The microsecond granularity is operational, not marketing. It is the difference between knowing 'a reviewer disagreed on 200 labels yesterday' and knowing 'reviewer 7 disagreed on 200 labels between 14:00 and 16:00 WIB, on the dairy-bagel SKU cluster, after a gold-set update at 13:45.' That level of attribution is what makes the weekly calibration cycle work.
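To illustrate the kind of attribution described here, the sketch below filters a timestamped event log by reviewer, time window, and SKU cluster. The `LabelEvent` record, its field names, and the sample events are all hypothetical; the source does not describe Zipang's actual schema or storage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LabelEvent:
    """Hypothetical event record; timestamps carry microsecond precision."""
    ts: datetime        # local WIB wall-clock time in this sketch
    reviewer_id: int
    sku_cluster: str
    kind: str           # "label", "review_disagreement", or "re_annotation"


def disagreements(events, reviewer_id, start, end, sku_cluster=None):
    """Reviewer disagreements in [start, end), optionally for one SKU cluster."""
    return [
        e for e in events
        if e.kind == "review_disagreement"
        and e.reviewer_id == reviewer_id
        and start <= e.ts < end
        and (sku_cluster is None or e.sku_cluster == sku_cluster)
    ]


# Made-up sample events for illustration only.
events = [
    LabelEvent(datetime(2026, 1, 6, 14, 5, 12, 431), 7, "dairy", "review_disagreement"),
    LabelEvent(datetime(2026, 1, 6, 14, 10, 0, 15), 7, "dairy", "label"),
    LabelEvent(datetime(2026, 1, 6, 14, 30, 0, 0), 3, "dairy", "review_disagreement"),
    LabelEvent(datetime(2026, 1, 6, 15, 59, 59, 999999), 7, "dairy", "review_disagreement"),
    LabelEvent(datetime(2026, 1, 6, 16, 0, 0, 0), 7, "dairy", "review_disagreement"),
]

window = (datetime(2026, 1, 6, 14, 0), datetime(2026, 1, 6, 16, 0))
hits = disagreements(events, 7, *window, sku_cluster="dairy")
print(len(hits), "disagreements by reviewer 7 in the window")
```

The half-open window means an event stamped exactly 16:00:00.000000 falls into the next slice, which is where microsecond timestamps matter for clean attribution.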

Accuracy: 90%+ sustained, weekly gold-set calibration

Sustained production accuracy has been above 90% since month 4 of the program. The discipline that holds it is a weekly gold-set calibration: a frozen set of 500 representative cases is re-labeled by the entire active production cohort every Friday, scores are computed against the gold answer, and individual operators whose accuracy drifts below 88% on the gold set are pulled into a 1:1 retraining session before they touch live production again the following week.
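The weekly calibration rule can be sketched as a simple scoring pass: compare each operator's labels to the gold answers and flag anyone below the 88% threshold for retraining. The function names, data layout, and sample labels are assumptions for illustration; the real gold set has 500 cases and its scoring rubric is not described in the source.

```python
RETRAIN_THRESHOLD = 0.88  # from the case study: below 88% triggers 1:1 retraining


def gold_accuracy(labels, gold):
    """Fraction of gold cases an operator labeled correctly.

    A gold case the operator did not label counts as incorrect.
    """
    correct = sum(1 for case_id, answer in gold.items() if labels.get(case_id) == answer)
    return correct / len(gold)


def operators_to_retrain(submissions, gold, threshold=RETRAIN_THRESHOLD):
    """Operator IDs whose gold-set accuracy falls below the threshold."""
    return sorted(
        operator
        for operator, labels in submissions.items()
        if gold_accuracy(labels, gold) < threshold
    )


# Tiny made-up gold set (10 cases) and two operators' submissions.
gold = {f"case-{i}": "promo" if i % 2 else "regular" for i in range(10)}
op_a = dict(gold, **{"case-0": "promo"})                        # 9 of 10 correct
op_b = dict(gold, **{"case-0": "promo", "case-1": "regular"})   # 8 of 10 correct
submissions = {"op_a": op_a, "op_b": op_b}

print(operators_to_retrain(submissions, gold))
```

In this toy run only `op_b` (80%) is flagged, while `op_a` (90%) stays in production.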

Client-side QA samples 5–8% of production labels every week on an independent rubric. Client and Zipang scores have tracked within 1.5 percentage points on the weekly sample since the program stabilized, which is the basis for the 'sustained' qualifier on the 90% number: it is not a one-week spike but the rolling 12-week average.
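A rolling 12-week average and the vendor-versus-client gap check might be computed as in the sketch below. The weekly scores are invented purely to show the mechanics; they are not program data, and the function names are hypothetical.

```python
from collections import deque


def rolling_mean(values, window=12):
    """Rolling mean over `window` consecutive values (one result per full window)."""
    buffer = deque(maxlen=window)
    means = []
    for value in values:
        buffer.append(value)
        if len(buffer) == window:
            means.append(sum(buffer) / window)
    return means


def max_gap_pp(vendor_scores, client_scores):
    """Largest weekly gap, in percentage points, between the two QA scores."""
    return max(abs(v - c) for v, c in zip(vendor_scores, client_scores))


# Invented weekly accuracy scores (percent): eleven flat weeks, then one spike.
weekly_accuracy = [90] * 11 + [96]
print(rolling_mean(weekly_accuracy))         # the spike moves the average only slightly

vendor = [91.0, 90.5]
client = [90.0, 91.5]
print(max_gap_pp(vendor, client) <= 1.5)     # within the 1.5-point tracking band
```

A single strong week barely shifts the 12-week figure, which is why a rolling average is a more conservative claim than a best-week score.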

Ramp-up timeline: 6 months to steady state

Months 1–2: SOP design and gate 3 quiz construction with the client's Paris team. Gate 1 and gate 2 ran on a standard funnel; gate 3 was new for this program. 60 candidates were advanced to the first trial cohort. By the end of month 2, 28 of the 60 had converted to production.

Months 3–4: Ramp from 28 production operators to 120. Weekly gold-set calibration began. The first promotion-edge-case gold set was assembled (2,000+ labels) and became the spine of QA feedback.

Months 5–6: Ramp from 120 to 208 in production. Throughput stabilized at 3.4M tasks per month. Sustained 90%+ accuracy was first recorded in week 14. Since month 6, the program has been in steady state with 432 total deployed (training + production + rotation) and 208 in active production at any time.

What the client measures

The client measures three things. First, precision and recall on its own held-out test set, refreshed quarterly. Second, throughput: tasks delivered per day, per week, per month, against the agreed SLA. Third, queue SLA: time from image arrival to labeled output, with a contractual 24-hour ceiling on the standard lane and a 4-hour ceiling on the priority lane for new SKU introductions and price-tag refreshes.
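The two queue ceilings lend themselves to a simple breach check. This is a minimal sketch: the lane names and ceilings come from the text, while the function and the example timestamps are assumptions.

```python
from datetime import datetime, timedelta

# Contractual ceilings from the case study.
SLA_CEILING = {
    "standard": timedelta(hours=24),
    "priority": timedelta(hours=4),
}


def sla_breached(lane, arrived, labeled):
    """True if time from image arrival to labeled output exceeds the lane ceiling."""
    return labeled - arrived > SLA_CEILING[lane]


arrived = datetime(2026, 1, 6, 9, 0)
print(sla_breached("priority", arrived, datetime(2026, 1, 6, 13, 30)))  # 4.5 h turnaround
print(sla_breached("standard", arrived, datetime(2026, 1, 7, 8, 0)))    # 23 h turnaround
```

The same 4.5-hour turnaround that is comfortable on the standard lane is a breach on the priority lane, so lane assignment has to be recorded with each task.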

Zipang's internal KPIs (per-operator accuracy, throughput, rework rate) are shared weekly with the client's Paris program manager, and the client's own QA sample is shared back with Zipang monthly. The two-way visibility is the trust mechanism that lets the program scale without re-bidding.

What did not work: French dialect attempts and the pivot to English + AI translation

The first design of gate 4 (the video interview) required candidates to demonstrate conversational French. Pass-through dropped to 23%, too low to feed production ramp at the rate the client needed. French at conversational level is a smaller subset of the Indonesian talent pool than French at reading-and-SOP-comprehension level.

In month 3, the program pivoted: gate 4 dropped the conversational French requirement, and gate 3 added an English-reading test of French-language SOPs (labels, promotional copy, edge-case documentation). For live French-language cases, such as promotion copy that needed a native speaker to confirm tone, a small specialist sub-pool was built. The bulk of the labeling work was done in English, with the client's French-language SOPs translated by AI translation and then checked by a bilingual reviewer.

The pivot improved ramp speed without measurable accuracy loss. By month 5, the English-plus-translation design was the default, and the conversational-French sub-pool was reduced to a 12-operator specialist team for tone-sensitive cases only.

Current state

The program is in its third contract year. The active production cohort of 208 has been stable, and the 432 total-deployed figure has been the steady-state number for 18 months. Two new product categories have been added since launch (a private-label line and a fresh-produce line), each on a 60-day ramp using the same gate structure. Sustained production accuracy is at 91.2% on the rolling 12-week average, and the client has not re-bid the program since the original contract.

The program is documented here in operational form: the client is not public, and Zipang does not disclose client identities without permission. What is disclosed is the operational shape: 432 deployed, 208 in production, 3.4M tasks per month, 90%+ sustained accuracy, microsecond-level KPI tracking, weekly gold-set calibration, 6-month ramp. The same shape is reusable for other retail-AI and shelf-edge CV programs; the rubric is the variable, the funnel is the constant.

Common questions

Can Zipang name the client?

No. The client is a French retail group with 100+ hypermarkets, and the contract does not permit public disclosure of the client name. What is disclosed is the operational data: 432 deployed, 208 in active production, 3.4M monthly tasks, 90%+ sustained accuracy, 6-month ramp, weekly gold-set calibration. A buyer can verify the same data shape during a vendor diligence call.

How does Indonesia compare with Kenya or India for retail-AI annotation?

Indonesia's B2 English band, French reading exposure in a subset of the talent pool, and 280M+ population (BPS 2024) make it a strong fit for French retail and EMEA-facing programs. Kenya is typically cheaper but has a smaller pool; India is the largest pool but is more expensive and more competitive. The right answer depends on language, time zone, and price point, not a single country ranking.

What is the accuracy ceiling for shelf-edge CV annotation?

Sustained 90%+ is a realistic production target. A 95%+ sustained number is unusual for retail-shelf CV at the volumes the client requires, and is more often a number from a frozen gold set than a live production rolling average. The 90%+ number on this program is a rolling 12-week average on live production, cross-checked against the client's own weekly QA sample.

How long is the ramp for a similar program?

6 months to steady state on the 208 production / 432 deployed shape. A smaller ramp (50 operators) typically stabilizes in 10–12 weeks; a larger ramp (1,000+) typically takes 9–12 months because the QA reviewer and gold-set cycle has to be rebuilt for the new volume.

What happens when a price-tag change sweeps the network?

The client raises a priority lane with a 4-hour SLA. Zipang's operations team re-prioritizes the production queue and re-allocates 30–40% of capacity to the priority lane for the duration of the sweep, usually 24–72 hours. The weekly gold set is updated mid-week for any new edge cases the sweep surfaces.
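As a rough sizing aid, the 30–40% reallocation can be translated into operators and daily task capacity. The sketch assumes capacity scales linearly with headcount and that the 3.4M monthly tasks are spread evenly over a 30-day month; neither assumption is stated in the source.

```python
PRODUCTION_OPERATORS = 208   # from the case study
MONTHLY_TASKS = 3_400_000    # from the case study
ASSUMED_DAYS_PER_MONTH = 30  # assumption: even daily spread


def priority_lane_operators(share, operators=PRODUCTION_OPERATORS):
    """Operators moved to the priority lane for a given capacity share."""
    return round(share * operators)


def priority_lane_daily_tasks(share):
    """Approximate daily task capacity on the priority lane."""
    return share * MONTHLY_TASKS / ASSUMED_DAYS_PER_MONTH


for share in (0.30, 0.40):
    print(
        f"{share:.0%}: {priority_lane_operators(share)} operators, "
        f"~{priority_lane_daily_tasks(share):,.0f} tasks/day"
    )
```

On these assumptions a sweep pulls roughly 62 to 83 operators onto the priority lane, or on the order of 34,000 to 45,000 tasks per day.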

How is data handled under UU PDP and GDPR?

Store images are passed through Zipang's data pipeline with PII redaction (faces blurred, customer tags removed) before reaching the production operators. Operators work inside a watermarked client UI with no local file storage. Data residency is set per client; the France retail program runs on EU-region storage with end-to-end encryption. Both UU PDP 2022 and GDPR Article 28 obligations are met; the contract includes the EU standard contractual clauses.

Key takeaways

1. 432 professionals deployed, 208 in active production, 3.4M tasks per month: a production shape that other retail-AI programs can be sized against.
2. 90%+ sustained accuracy is a rolling 12-week average on live production, cross-checked against the client's weekly QA sample within 1.5 percentage points.
3. 6-month ramp to steady state, with weekly gold-set calibration and microsecond-level KPI tracking as the operational discipline.
4. The 5-gate funnel is reused; only the gate 3 quiz (French retail labeling on 200+ images) is specific to this program.
5. What did not work: requiring conversational French at gate 4 dropped pass-through to 23%. The program pivoted to English plus AI-translated SOPs and a 12-operator French-tone specialist sub-pool.
6. Client is not public; the program is described operationally. A buyer can verify the same data shape during a vendor diligence call.

Sources

Data and claims in this article reference verifiable sources (including Zipang research and public data such as APJII, JobStreet, Buffer).

1. Zipang Remote Work Market Research 2026. Zipang Research · 2026-06-14
2. Statistik Tenaga Kerja Indonesia. BPS Indonesia · 2026-06-14
3. EF English Proficiency Index 2025. EF Education First · 2026-06-14
4. Laporan Penetrasi Internet Indonesia. APJII · 2026-06-14
5. The Emerging Global Talent Hub: Indonesia. McKinsey & Company · 2026-06-14