Podcast Transcript

The BPO podcast: How Indonesia scaled to 432 deployed (transcript)

8 min read · Media · April 21, 2026

Episode 1 of The BPO podcast, a transcript-format deep dive into how Zipang scaled 432 Indonesian professionals for a 100+ hypermarket retail AI program in France — covering the accuracy curve, the team composition across Jakarta, Bandung, and Yogyakarta, the gate 3 quiz on 200 retail frames, and the operational lessons that made the program repeatable. The transcript below preserves the [HOST] and [GUEST] tags, includes talking points at the start of each section, and surfaces the key quotes that are most often cited. Listen length is 8 minutes; reading length is approximately 12 minutes.


Key stats

432 · Indonesian professionals deployed [Zipang Research]

208 · In active production at any time [Zipang Research]

3.4M · Production tasks per month [Zipang Research]

90%+ · Sustained accuracy (rolling 12-week) [Zipang Research]

Jakarta, Bandung, Yogyakarta · Cohort cities [Zipang Research]

France retail AI · Program reference (case study) [Zipang Research]

What is …?

What is the 432-deployed Indonesia retail AI program?

A multi-year Zipang program annotating shelf-edge, planogram, and product-recognition video frames for a French retail group operating 100+ hypermarkets. The program deployed 432 trained professionals, sustained 208 in active production, processed 3.4M tasks per month, and held 90%+ sustained accuracy on the rolling 12-week average. The cohort is the largest single-country deployment Zipang has operated and is the reference benchmark for the company's Indonesia BPO production model.

Cold open: why this episode exists

[HOST] Welcome to The BPO podcast — the show where we go deep on how Indonesian remote teams actually run, week to week, on real production programs. I'm your host, and joining me is Yoseph Gratika, founder of Zipang, the Indonesia-based BPO operator that has run 432 deployed professionals on a single France retail AI program. Today we're unpacking how that program scaled, what the 432 number actually means, and the operational decisions that made it repeatable.

[GUEST] Thanks for having me. The 432 number is the most-cited figure from our work, and I want to spend the first half of this episode breaking it apart — what the funnel looked like, what the cohort looked like at week 12 versus month 24, and what the contract actually says. The second half will be about what we'd do differently if we started the program today in 2026.

Talking point: 432 is the deployed number, 208 is the production number

[HOST] Let's start with the headline number. 432 deployed — what does "deployed" mean here?

[GUEST] Deployed means hired, onboarded, and active on the program. 208 is the number in active production on any given week. The gap is real, and it's not a sign of churn. It reflects that we deliberately run a larger trained cohort than the live workload needs, so that we have rotation capacity for shift coverage, paid leave, and the 30-day replacement guarantee. The 432 figure is the cohort; the 208 figure is the live production line. Both numbers are real; they measure different things.

[HOST] So when a buyer asks "how many people do you have on the program," the honest answer is both — 432 trained, 208 active.

[GUEST] Exactly. And the 208 figure scales with workload. When the client's volume grew from 2.4M tasks/month to 3.4M tasks/month, we ramped the active production number to 220 rather than activating the full 432. The 432 is the ceiling the program can scale to in 60 days without re-recruiting.
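
The deployed-versus-active split above comes down to two ratios. A minimal sketch in Python using only the figures quoted in this section; the function names are illustrative and are not Zipang tooling.

```python
def active_share(active: int, deployed: int) -> float:
    """Fraction of the trained cohort that is on the live production line."""
    return active / deployed


def tasks_per_operator(monthly_tasks: int, active: int) -> float:
    """Average monthly task load per active operator."""
    return monthly_tasks / active


# Figures quoted in the episode: 432 deployed, 208 active in a typical week,
# and 220 active once volume reached 3.4M tasks/month.
print(f"{active_share(208, 432):.1%}")
print(f"{tasks_per_operator(3_400_000, 220):,.0f}")
```

On these numbers, a little under half of the trained cohort is in production at any time, and each active operator handles roughly 15,500 tasks a month at the 220-operator level.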

Talking point: the gate 3 quiz on 200 retail frames

[HOST] Let's talk about screening. You run a 5-gate funnel — CV, async English, role-specific quiz, structured interview, paid trial. The gate 3 quiz for this program is the part I want to dig into.

[GUEST] The gate 3 quiz is a 200-frame annotated set. Candidates annotate the frames following the client SOP, the gold set is the 200-frame answer key frozen at program start, and the pass threshold is 80% on quiz accuracy. The reason 200 is the right number is that it gives enough signal to differentiate 80%-from-90% annotators without making the quiz a 4-hour ordeal that filters out people who can't sit still for that long.

[HOST] And the pass rate?

[GUEST] 54% on gate 3. End-to-end, 60 of about 1,800 applicants reached the trial cohort, and 20 of those converted to full-time on the Transperfect Dataforce program. The France retail program ran a different gate 3 (200 retail frames, not 200 video frames) and converted at 33% trial-to-full-time. The two programs are not directly comparable on conversion, but the gate 3 pass rate is roughly the same.
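
The end-to-end funnel figures quoted above can be turned into stage-to-stage conversion rates. A small Python sketch, assuming the approximate count of 1,800 applicants is taken at face value; the helper name `rate` is illustrative.

```python
def rate(passed: int, entered: int) -> float:
    """Conversion rate between two funnel stages."""
    if entered <= 0:
        raise ValueError("entered must be positive")
    return passed / entered


# Figures as quoted by the guest: about 1,800 applicants, 60 trialists, 20 full-time.
applicants, trialists, full_time = 1_800, 60, 20

print(f"applicant -> trial:     {rate(trialists, applicants):.1%}")
print(f"trial -> full-time:     {rate(full_time, trialists):.1%}")
print(f"applicant -> full-time: {rate(full_time, applicants):.1%}")
```

Read this way, roughly 1 applicant in 30 reaches the paid trial, and 1 trialist in 3 converts to full-time.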

Talking point: three cities, one SOP, one gold set

[HOST] The cohort spans Jakarta, Bandung, and Yogyakarta. Why three cities?

[GUEST] Three reasons. First, redundancy: a single-city program is a single point of failure. We learned that during the 2023 floods in Jakarta. Second, dialect and cultural spread: Bahasa Indonesia varies by region, and a cohort that pulls only from Jakarta misses Bandung and Yogya slang patterns that show up in UGC. Third, talent density: no single Indonesian city has the depth of BPO-experienced operators we need at 432. Splitting across three cities is the only way to hit the number.

[HOST] And the SOP and gold set are identical across cities?

[GUEST] Yes. Same SOP document, same frozen gold set, same rubric. Cross-city calibration runs weekly to flag drift. If a Bandung operator's gold-set score drifts 3 points below Jakarta's, we pull them into 1:1 retraining. The SOP is the constant; the geography is the redundancy.
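
The weekly cross-city check described above could be sketched as follows. This is an illustration under stated assumptions, not Zipang's actual tooling: it assumes the comparison is against the mean gold-set score of the reference city, that a gap of exactly 3 points also triggers retraining, and that scores are percentages. The operator IDs and scores are made up.

```python
from statistics import mean

DRIFT_THRESHOLD_POINTS = 3.0  # from the episode: 3 points below the reference city


def flag_for_retraining(scores_by_city, reference_city="Jakarta",
                        threshold=DRIFT_THRESHOLD_POINTS):
    """Return (city, operator, score) for operators whose gold-set score is
    at least `threshold` points below the reference city's mean score."""
    reference_mean = mean(scores_by_city[reference_city].values())
    flagged = []
    for city, scores in scores_by_city.items():
        if city == reference_city:
            continue
        for operator, score in scores.items():
            if reference_mean - score >= threshold:
                flagged.append((city, operator, score))
    return flagged


# Hypothetical weekly gold-set scores (percent), for illustration only.
weekly_scores = {
    "Jakarta": {"jkt-01": 92.0, "jkt-02": 90.0},
    "Bandung": {"bdg-01": 91.5, "bdg-02": 87.5},
    "Yogyakarta": {"yog-01": 88.0, "yog-02": 90.0},
}
print(flag_for_retraining(weekly_scores))
```

With the sample data, the Jakarta mean is 91.0, so the operators at 87.5 and 88.0 are flagged for 1:1 retraining.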

Talking point: the accuracy curve from week 1 to week 24

[HOST] What's the accuracy curve look like?

[GUEST] Week 1 production accuracy landed at 76%. The error analysis showed three clusters: shelf-edge occlusion (operators drew boxes too tight when products overlapped), planogram perspective errors (operators misread the angle of overhead shots), and brand-recognition errors (private-label SKUs were mis-tagged as national brands). Weeks 2–4 were targeted retraining on the three clusters; accuracy climbed to 81%. Weeks 5–8 introduced peer review at frame N; accuracy climbed to 87%. Weeks 9–12 added supervisor review at frame N+24 plus a 200-case weekly gold set; accuracy crossed 90% in week 11 and held.

[HOST] And the sustained number?

[GUEST] 91.4% on the rolling 12-week average across the life of the program. The contract floor is 90%; we deliver 91.4%. The difference matters because it's the difference between a model that trains on the data and a model that doesn't.
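
A rolling 12-week average checked against the 90% contract floor can be sketched like this. The function names and the weekly series are assumptions for illustration; the episode does not say how the average is computed beyond the 12-week window.

```python
from collections import deque

CONTRACT_FLOOR = 90.0  # contract floor quoted in the episode, in percent


def rolling_average(weekly_accuracy, window=12):
    """Yield (week_number, rolling mean) once a full window is available."""
    buf = deque(maxlen=window)
    for week, accuracy in enumerate(weekly_accuracy, start=1):
        buf.append(accuracy)
        if len(buf) == window:
            yield week, sum(buf) / window


def weeks_below_floor(weekly_accuracy, floor=CONTRACT_FLOOR, window=12):
    """Weeks whose rolling average falls strictly below the contract floor."""
    return [week for week, avg in rolling_average(weekly_accuracy, window)
            if avg < floor]


# Hypothetical series: twelve steady weeks, then two bad weeks.
series = [91.0] * 12 + [79.0, 79.0]
print(weeks_below_floor(series))
```

In this made-up series, one bad week pulls the rolling average down to exactly the floor, and the second bad week breaches it, which is why a rolling window smooths single-week dips but still surfaces sustained drift.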

Talking point: what 432 taught us about scaling

[HOST] Looking back, what did 432 teach you that you didn't know at 100?

[GUEST] Three things. First, supervisor ratio doesn't scale linearly. At 100 operators, 1 supervisor per 25 is fine. At 432, you need 1 supervisor per 18, because the supervisor's job is to catch drift and run calibration, and 25 operators generate more drift than 18. Second, gold-set calibration is the bottleneck, not operator pool. We can double the operator cohort in 60 days, but doubling the gold-set cycle takes 6 months. Third, microsecond-level KPI tracking matters more at scale than in a small program. At 100 operators, you can review by exception; at 432, you need every label timestamped because the drift questions become unanswerable otherwise.
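
The supervisor-ratio point above is easy to quantify. A short Python sketch, assuming supervisors are whole people and each covers at most the stated number of operators; the function name is illustrative.

```python
def supervisors_needed(operators: int, operators_per_supervisor: int) -> int:
    """Smallest whole number of supervisors that keeps every supervisor
    at or under the target span (integer ceiling division)."""
    return -(-operators // operators_per_supervisor)


print(supervisors_needed(100, 25))  # small program at 1:25
print(supervisors_needed(432, 25))  # full cohort at the old ratio
print(supervisors_needed(432, 18))  # full cohort at 1:18
```

At 432 operators, tightening the ratio from 1:25 to 1:18 raises the supervisor bench from 18 to 24.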

[HOST] Any mistakes you'd call out?

[GUEST] Yes. We onboarded the first 200 in 6 weeks, which was too fast. The gold-set cycle was still ramping at week 6, and the first cohort's accuracy was 2 points below the steady-state number. If we started over, we'd cap the first 8 weeks at 120 operators and let the gold-set cycle catch up before scaling the next 200. The lesson: scaling the funnel is easy; scaling the calibration cycle is the constraint.

Talking point: what 2026 buyers should ask about 432

[HOST] For a buyer evaluating an Indonesian BPO in 2026, what's the question to ask about a number like 432?

[GUEST] Three questions. First, what is the deployed versus active production split? A vendor that quotes 432 without the split is hiding the active production number, which is what your contract will be measured against. Second, what is the gate 3 pass rate for the comparable program? 54% is healthy; above 70% means the quiz is not filtering. Third, what is the gold-set cycle and who owns the QA reviewer bench? A vendor that does not have a named QA reviewer pool cannot scale the calibration cycle, which means your program will not hold 90%+ sustained accuracy beyond the first 6 months.

[HOST] Anything else?

[GUEST] Ask for the city list. A vendor that operates from a single city is a single point of failure. A vendor that operates from three or more cities has redundancy. For a 100+ operator program, redundancy is a contract requirement, not a nice-to-have.

Key quotes from this episode

On the 432 vs 208 split: "432 is the ceiling the program can scale to in 60 days without re-recruiting; 208 is the live production line."

On the gate 3 quiz: "200 frames is the right number: enough signal to differentiate 80%-from-90% annotators without making the quiz a 4-hour ordeal that filters out people who can't sit still for that long."

On three cities: "Single-city is a single point of failure. We learned that during the 2023 floods."

On supervisor ratio: "At 432, you need 1 supervisor per 18, not per 25. The supervisor's job is to catch drift; 25 operators generate more drift than 18."

On scaling the gold set: "Scaling the funnel is easy. Scaling the calibration cycle is the constraint."

On what buyers should ask: "Ask for the deployed-versus-active split, the gate 3 pass rate, and the QA reviewer bench. A vendor that hides any of those three is a vendor that has not run a 432-scale program."

Common questions

Is the 432 number a headcount or a deployment figure?

It is a deployment figure — the count of operators who were hired, onboarded, and active on the program across its life. The active production number at any given week is 208, which is what the contract is measured against. The 432 figure is the ceiling the program can scale to in 60 days without re-recruiting from outside.

What is the gate 3 quiz on this program?

A 200-frame annotated set covering shelf-edge, planogram, and product-recognition frames. Candidates annotate the frames per the client SOP, the gold set is a 200-frame answer key frozen at program start, and the pass threshold is 80% on quiz accuracy. The quiz is roughly 90 minutes long and runs asynchronously.
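
Scoring a quiz against a frozen answer key could look like the sketch below. It is a simplification under loud assumptions: the source does not define how "quiz accuracy" is computed, so this treats each frame as a single label that either matches the gold set or does not, and counts skipped frames as wrong. Real shelf-edge annotation would likely also compare box geometry. All names and sample labels are hypothetical.

```python
PASS_THRESHOLD = 0.80  # pass threshold quoted in the answer above


def quiz_accuracy(candidate: dict[str, str], gold: dict[str, str]) -> float:
    """Share of gold-set frames where the candidate's label matches the
    answer key. Frames the candidate skipped count as wrong."""
    if not gold:
        raise ValueError("gold set is empty")
    correct = sum(1 for frame_id, label in gold.items()
                  if candidate.get(frame_id) == label)
    return correct / len(gold)


def passes_gate3(candidate: dict[str, str], gold: dict[str, str],
                 threshold: float = PASS_THRESHOLD) -> bool:
    """True when the candidate meets or exceeds the pass threshold."""
    return quiz_accuracy(candidate, gold) >= threshold


# Hypothetical 200-frame gold set and a candidate who misses 40 frames.
gold_set = {f"frame-{i:03d}": "national_brand" for i in range(200)}
answers = dict(gold_set)
for i in range(40):
    answers[f"frame-{i:03d}"] = "private_label"

print(quiz_accuracy(answers, gold_set), passes_gate3(answers, gold_set))
```

Under this simplified scoring, a candidate can miss at most 40 of the 200 frames and still pass.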

Why three cities?

Redundancy (a single city is a single point of failure), dialect and cultural spread (Bahasa Indonesia varies by region), and talent density (no single Indonesian city has the depth of BPO-experienced operators to hit 432 in the program timeline). Jakarta, Bandung, and Yogyakarta each contribute a third of the cohort.

What is the sustained accuracy?

91.4% on the rolling 12-week average across the life of the program. The contract floor is 90%; the program delivers 91.4%. The difference is the difference between a model that trains on the data and a model that does not.

Could a 2026 buyer replicate the 432 program?

Yes, with the right conditions: 1,800+ applicants to support 60 trialists and 20 full-time conversions, a 200-frame gold set scored against a frozen answer key, three-city cohort, supervisor ratio of 1:18, and a 6-month gold-set cycle ramp. The methodology is documented; the constraint is calibration cycle, not operator pool.

Is the 432 number referenced on the Zipang about page?

Yes. The 432 deployed, 208 active production, 3.4M tasks/month, 90%+ sustained accuracy, and the 5-gate funnel pass rates are all first-party Zipang research figures referenced on /about and reproduced in the France retail AI case study.

Key takeaways

  1. 432 deployed, 208 in active production. The deployed number is the ceiling the program can scale to in 60 days; the active number is what the contract is measured against.
  2. The gate 3 quiz is 200 retail frames, scored against a frozen gold set, with a 54% pass rate. End-to-end, 60 of ~1,800 applicants reached the trial cohort.
  3. The cohort spans Jakarta, Bandung, and Yogyakarta. Same SOP, same gold set, same rubric across all three cities. Weekly cross-city calibration.
  4. Accuracy curve: 76% week 1, 81% week 4, 87% week 8, 90% week 11, 91.4% rolling 12-week sustained. The contract floor is 90%; the program delivers 91.4%.
  5. At 432, supervisor ratio is 1:18, not 1:25. The bottleneck is the gold-set cycle, not the operator pool. Onboarding the first 200 in 6 weeks was too fast; the right cap is 120 in 8 weeks.
  6. For 2026 buyers, ask for: deployed-vs-active split, gate 3 pass rate, QA reviewer bench. A vendor that hides any of those has not run a 432-scale program.

Evaluating a 200+ operator Indonesia BPO program?

Zipang runs 432 deployed, 208 active, 3.4M tasks/month, 91.4% sustained accuracy on the France retail AI program. The methodology is documented end-to-end — ask for the full operational case.

Sources

Data and claims in this article reference verifiable sources (including Zipang research and public data such as APJII, JobStreet, Buffer).

  1. Zipang Remote Work Market Research 2026. Zipang Research · 2026-06-14
  2. Indonesia: The Next Big Talent Story. McKinsey & Company · 2026-06-14
  3. Statistik Tenaga Kerja Indonesia. BPS Indonesia · 2026-06-14
