Client Case Study

Case study: 90%+ sustained accuracy on a Transperfect Dataforce interpolation annotation program

9 min read · Case study · April 21, 2026

Transperfect Dataforce is one of the publicly known data annotation vendors in the global AI training data market, and Zipang has run an interpolation annotation program for them since 2023. The program covers video frame interpolation annotation across sports, training, and automotive machine learning datasets. The client set an accuracy target of 90%; Zipang's sustained accuracy has been above that target since week 12 and finished 2025 at 92.1% on the rolling 12-week average. This case study covers the methodology, the accuracy curve, what the client got and did not get, and the contract terms. Unlike the client in the France retail program, Transperfect Dataforce is named because they are a public vendor and the program is referenced on their case-study page.

Key stats

60 → 20: Transperfect Dataforce program, paid trialists → full-time agents [Zipang Research]
92.1%: Sustained production accuracy (rolling 12-week) [Zipang Research]
78%: Week 1 production accuracy [Zipang Research]
92%: Week 12 production accuracy (90% was first crossed in week 11) [Zipang Research]
20: Full-time production agents [Zipang Research]
200 cases: Weekly gold-set calibration size [Zipang Research]

What is interpolation annotation?

Interpolation annotation is a video frame labeling task used to train computer-vision models on motion and continuity between consecutive frames. For each video clip, the annotator identifies the object state at frame N, predicts the state at frame N+24 (or another fixed interval), draws the bounding box, polygon, or keypoint set, and tags any motion-blur, occlusion, or scene transition. The labeled output is the ground truth a model uses to learn object tracking, action recognition, and temporal continuity. The work is harder than single-frame object detection because the annotator must hold the object state consistent across frames, and disagreement between two annotators on the same clip is a useful QA signal.
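To make the frame N to frame N+24 relationship concrete, here is a minimal sketch of linear interpolation between two annotated bounding boxes. It is an illustration only: the source does not say how the program's tooling fills intermediate frames, and the `Box`, `interpolate_box`, and `fill_frames` names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def interpolate_box(start: Box, end: Box, t: float) -> Box:
    """Linearly blend two boxes; t=0 returns start, t=1 returns end."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return Box(*(s + (e - s) * t for s, e in zip(start, end)))


def fill_frames(start: Box, end: Box, interval: int = 24) -> list[Box]:
    """Return one box per frame from N to N+interval, inclusive."""
    return [interpolate_box(start, end, i / interval) for i in range(interval + 1)]


# Illustrative coordinates, not program data.
frame_n = Box(100.0, 50.0, 200.0, 150.0)
frame_n24 = Box(148.0, 50.0, 248.0, 150.0)
boxes = fill_frames(frame_n, frame_n24)
midpoint = boxes[12]  # halfway between frame N and frame N+24
```

Linear blending is the simplest possible fill and breaks down on occlusion or scene transitions, which is one reason those cases are tagged separately by the annotator.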

The client: Transperfect Dataforce

Transperfect Dataforce is the data annotation arm of Transperfect, a publicly known global language and content services company. The Dataforce division runs annotation, evaluation, and data collection programs for foundation model and vertical AI training, and operates in 20+ countries. The vendor is named in this case study because the program is referenced in their public materials and the operational data below is consistent with what they publish.

The program is the Indonesian leg of a multi-region annotation cohort covering sports motion (basketball, football, gymnastics), training motion (yoga, strength training, dance), and automotive motion (pedestrian tracking, in-cabin driver monitoring). The Indonesian pod handles the Asia-region and EU-prime-time shift work; the same SOP and gold set are used by parallel pods in other regions.

The task: interpolation annotation across video frames

For each video clip, the annotator receives 24–48 frames, marks the object state on frame N, predicts the state on frame N+24, draws the appropriate bounding box, polygon, or keypoint set, and tags edge cases: motion blur, occlusion, scene transition, multiple objects of the same class, partial visibility. The output is a JSON manifest per clip with frame-level coordinates plus a confidence and motion-class tag per object.
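The per-clip manifest described above might look roughly like the following sketch. The source only states that the manifest is JSON with frame-level coordinates plus a confidence and motion-class tag per object; every field name here (`clip_id`, `edge_case_tags`, and so on) is a hypothetical stand-in, not the client's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ObjectLabel:
    object_id: str
    motion_class: str            # hypothetical values: "sports", "training", "automotive"
    confidence: float            # annotator confidence, assumed to be in [0, 1]
    frames: dict                 # frame index (as string) -> [x_min, y_min, x_max, y_max]
    edge_case_tags: list = field(default_factory=list)


@dataclass
class ClipManifest:
    clip_id: str
    frame_count: int
    objects: list


def to_json(manifest: ClipManifest) -> str:
    """Serialize the manifest, including nested object labels, to JSON."""
    return json.dumps(asdict(manifest), indent=2)


# Illustrative clip: one object labeled at frame N (0) and frame N+24.
manifest = ClipManifest(
    clip_id="clip-0001",
    frame_count=48,
    objects=[
        ObjectLabel(
            object_id="obj-1",
            motion_class="sports",
            confidence=0.9,
            frames={"0": [100, 50, 200, 150], "24": [148, 50, 248, 150]},
            edge_case_tags=["motion_blur"],
        )
    ],
)
manifest_json = to_json(manifest)
```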

Three sub-streams feed the program. Sports motion accounts for roughly 45% of the volume, training motion for 35%, and automotive motion for 20%. The sports stream is the highest volume but the lowest error rate; the automotive stream is the lowest volume but the highest error rate because of small-object detection (pedestrians at distance, dashboard reflections).

The accuracy target: 90% required, 90%+ sustained

Transperfect's contractual accuracy floor is 90% on the weekly client-side QA sample. Below 90% for two consecutive weeks triggers a remediation plan; below 88% for one week triggers an immediate escalation. The 90% target is the line at which the labeled data is useful for the client's downstream model training; below 90%, label noise is too high to train on.
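The two contractual triggers can be expressed as a small rule check. This is a sketch under stated assumptions: the `weekly_status` name and status strings are hypothetical, and the choice to let an escalation take precedence over a remediation in the same week is an assumption the source does not confirm.

```python
REMEDIATION_FLOOR = 90.0   # below this for two consecutive weeks -> remediation plan
ESCALATION_FLOOR = 88.0    # below this for a single week -> immediate escalation


def weekly_status(accuracies: list[float]) -> list[str]:
    """Classify each week of client-side QA accuracy (in percent)."""
    statuses = []
    for i, acc in enumerate(accuracies):
        previous_below = i > 0 and accuracies[i - 1] < REMEDIATION_FLOOR
        if acc < ESCALATION_FLOOR:
            # Assumption: escalation outranks remediation when both apply.
            statuses.append("escalation")
        elif acc < REMEDIATION_FLOOR and previous_below:
            statuses.append("remediation")
        else:
            statuses.append("ok")
    return statuses


# Illustrative weekly scores, not program data.
example = weekly_status([91.0, 89.5, 89.0, 87.5, 92.0])
```

A single week between 88% and 90% stays "ok" in this sketch because the remediation rule needs two consecutive weeks below the floor.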

Zipang's actual sustained accuracy on the program has been above 90% since week 12. The rolling 12-week average has been between 91% and 93% across 2024 and 2025, finishing 2025 at 92.1%. The accuracy curve from week 1 (78%) to week 12 (92%) is the operational story of the program and is documented below.

The 5-gate funnel, with an interpolation-specific gate 3 quiz

Zipang's 5-gate funnel was applied with a gate 3 quiz specific to interpolation annotation: 100 short video clips (sports, training, automotive) for which the candidate must produce the frame-N and frame-N+24 labels per the client SOP. The quiz is scored against a frozen answer set. The pass threshold is 80% quiz accuracy, which sits just above the 78% week 1 production accuracy and leaves a small headroom for ramp.

Aggregate pass rates for this program: gate 1 retained 26% of applicants, gate 2 retained 47%, gate 3 retained 54%, gate 4 retained 71%, gate 5 retained 81%. End-to-end, 60 of approximately 1,800 applicants (3.3%) reached the paid trial cohort. Of the 60, 20 converted to full-time production (33% trialist-to-full-time conversion). The lower conversion versus the France retail program reflects the harder task; the program needs fewer, sharper operators, not a larger cohort.

Gate 1: CV relevance (English B2, prior video or temporal work), 26% pass
Gate 2: Async English + SOP comprehension, 47% pass of survivors
Gate 3: Interpolation-specific quiz on 100 video clips, 54% pass
Gate 4: Structured video interview + frame-thinking exercise, 71% pass
Gate 5: Paid trial task scored against frozen gold set, 81% pass
End-to-end: 60 of ~1,800 applicants reached trial; 20 converted to full-time
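The funnel arithmetic above can be checked with a short sketch. The per-gate rates and headcounts are the ones quoted in this section; the variable names are hypothetical. Note that compounding the rounded gate rates gives roughly 3.8% through gate 5, a little above the reported 3.3% (60 of ~1,800). The source does not explain the gap; the rates are rounded and the applicant count is approximate, so treat the product as a sanity check rather than an exact reconciliation.

```python
# Per-gate retention rates as quoted in the text (rounded percentages).
gate_rates = {
    "gate 1": 0.26,
    "gate 2": 0.47,
    "gate 3": 0.54,
    "gate 4": 0.71,
    "gate 5": 0.81,
}

applicants = 1800          # approximate, per the text
trial_cohort = 60
full_time = 20

# Compound the rates to get cumulative retention after each gate.
cumulative = 1.0
cumulative_by_gate = {}
for gate, rate in gate_rates.items():
    cumulative *= rate
    cumulative_by_gate[gate] = cumulative

reported_share = trial_cohort / applicants      # share reaching the trial cohort
conversion = full_time / trial_cohort           # trialist-to-full-time conversion

for gate, share in cumulative_by_gate.items():
    print(f"{gate}: {share:.1%} of applicants remain")
print(f"reported trial share: {reported_share:.1%}, conversion: {conversion:.0%}")
```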

Headcount: 20 full-time production agents, weekly gold-set calibration

The active production cohort is 20 full-time agents, a deliberately small group. The Transperfect program rewards sustained accuracy, and the client prefers fewer, sharper operators over a larger cohort, partly because the task does not scale linearly: the QA reviewer and gold-set cycle is the bottleneck, and a 50-operator cohort would need two full-time QA reviewers, whereas a 20-operator cohort needs one.

The 20 agents are distributed across Jakarta, Bandung, and Yogyakarta. Weekly gold-set calibration runs on a 200-case frozen set: every Friday, the full cohort re-labels the 200 cases, scores are computed against the gold answer, and any operator whose accuracy drifts below 89% on the gold set is pulled into a 1:1 retraining session before live production the following week.
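The Friday calibration loop can be sketched as a scoring pass over the frozen set. This is a simplified illustration: it treats each gold case as a single categorical answer compared by exact match, whereas the real rubric presumably scores geometry as well, and the function names and operator IDs are hypothetical.

```python
GOLD_SET_THRESHOLD = 89.0  # percent; below this the operator goes to 1:1 retraining


def gold_set_accuracy(operator_labels: dict[str, str], gold: dict[str, str]) -> float:
    """Percent of gold cases where the operator's answer matches the gold answer."""
    correct = sum(
        1 for case_id, answer in gold.items() if operator_labels.get(case_id) == answer
    )
    return 100.0 * correct / len(gold)


def operators_needing_retraining(
    cohort_labels: dict[str, dict[str, str]], gold: dict[str, str]
) -> list[str]:
    """Operators whose gold-set accuracy falls below the threshold, sorted by ID."""
    return sorted(
        op
        for op, labels in cohort_labels.items()
        if gold_set_accuracy(labels, gold) < GOLD_SET_THRESHOLD
    )


# Illustrative 200-case gold set with made-up answers.
gold = {f"case-{i}": "occlusion" for i in range(200)}
op_a = {**gold, **{f"case-{i}": "motion_blur" for i in range(20)}}   # 180 of 200 correct
op_b = {**gold, **{f"case-{i}": "motion_blur" for i in range(24)}}   # 176 of 200 correct
flagged = operators_needing_retraining({"op_a": op_a, "op_b": op_b}, gold)
```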

Microsecond-level KPI tracking and three-layer review

Three review layers cover the program; the first two run on every clip and the third runs weekly. First, peer review at frame N: a second operator labels frame N independently, and any disagreement greater than 5% on bounding-box IoU triggers a re-annotation. Second, supervisor review at frame N+24: a QA supervisor reviews every N+24 prediction against the source video, scoring motion-continuity and edge-case tagging. Third, weekly gold-set calibration on the 200-case frozen set, re-labeled by the full cohort, scored against the gold answer, and compared to the previous week.
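The peer-review trigger rests on intersection-over-union (IoU) between the two operators' boxes. Here is a minimal sketch, assuming that "disagreement greater than 5%" means 1 minus IoU exceeds 0.05 (equivalently, IoU below 0.95). That reading is an interpretation, and the function names are hypothetical.

```python
def iou(a: tuple, b: tuple) -> float:
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def needs_reannotation(a: tuple, b: tuple, max_disagreement: float = 0.05) -> bool:
    """True when the two labels disagree by more than the allowed margin."""
    return (1.0 - iou(a, b)) > max_disagreement


# Illustrative boxes: a 1-pixel shift passes, a 10-pixel shift does not.
first = (0, 0, 100, 100)
small_shift = (1, 0, 101, 100)    # IoU = 9900 / 10100, about 0.980
large_shift = (10, 0, 110, 100)   # IoU = 9000 / 11000, about 0.818
```

At this tolerance even a modest horizontal offset on a 100-pixel box forces a re-annotation, which is consistent with the tight-box errors described in the week 1 error analysis.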

Microsecond-level KPI tracking means every label, every disagreement, every re-annotation is timestamped and attributable. The operations team can answer questions like 'which operator drifted on the automotive sub-stream after the gold set was updated in week 18' in a single dashboard query. Without the timestamp granularity, the weekly calibration cycle would be running blind.
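A dashboard query of the kind quoted above reduces to filtering a timestamped event log. The sketch below shows the idea with an in-memory list; the `LabelEvent` fields, event type strings, and cutoff date are hypothetical, and the real system presumably sits on a database rather than Python objects.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LabelEvent:
    timestamp: datetime   # Python datetimes carry microsecond resolution
    operator_id: str
    clip_id: str
    sub_stream: str       # "sports", "training", or "automotive"
    event_type: str       # hypothetical values: "label", "disagreement", "re_annotation"


def reannotations_by_operator(events, sub_stream: str, since: datetime) -> Counter:
    """Count re-annotations per operator on one sub-stream after a cutoff."""
    return Counter(
        e.operator_id
        for e in events
        if e.sub_stream == sub_stream
        and e.event_type == "re_annotation"
        and e.timestamp >= since
    )


def ts(day: int, micro: int) -> datetime:
    """Helper for the illustrative events below."""
    return datetime(2025, 5, day, 9, 0, 0, micro, tzinfo=timezone.utc)


# Made-up events; the cutoff stands in for a gold-set update date.
events = [
    LabelEvent(ts(1, 120), "op_a", "clip-1", "automotive", "re_annotation"),
    LabelEvent(ts(6, 450), "op_a", "clip-2", "automotive", "re_annotation"),
    LabelEvent(ts(6, 451), "op_b", "clip-3", "automotive", "label"),
    LabelEvent(ts(7, 900), "op_a", "clip-4", "automotive", "re_annotation"),
    LabelEvent(ts(7, 901), "op_b", "clip-5", "sports", "re_annotation"),
]
drift = reannotations_by_operator(events, "automotive", since=ts(5, 0))
```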

The accuracy curve: 78% in week 1, 92% in week 12

Week 1 production accuracy landed at 78% on the first 2,000-clip batch, well below the 90% target. The error analysis showed three clusters: motion-blur edge cases (operators were drawing boxes too tight, losing the object when it crossed a frame boundary), automotive small-object detection (pedestrians at distance were missed entirely), and scene-transition frames (operators were labeling the post-transition frame instead of the transition itself).

Weeks 2–4: targeted retraining on the three error clusters, using the gold set as the teaching rubric. Production accuracy climbed to 82% by week 4. Weeks 5–8: introduction of the peer-review-at-frame-N layer, which caught disagreements before they reached the client. Accuracy climbed to 88% by week 8. Weeks 9–12: full introduction of the supervisor review at frame N+24, plus a 200-case weekly gold set. Accuracy crossed 90% in week 11 and landed at 92% in week 12.

The curve has been stable since week 12, oscillating between 91% and 93% on the rolling 12-week average. The 92.1% figure for end-2025 is within that band.
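The rolling 12-week average is a plain moving mean over the weekly accuracy figures. A minimal sketch follows; the weekly values are illustrative placeholders chosen to sit inside the 91% to 93% band, not the program's actual weekly scores.

```python
from collections import deque


def rolling_average(weekly: list[float], window: int = 12) -> list[float]:
    """Mean of the most recent `window` values, emitted once the window is full."""
    buffer = deque(maxlen=window)
    averages = []
    for value in weekly:
        buffer.append(value)
        if len(buffer) == window:
            averages.append(sum(buffer) / window)
    return averages


# Thirteen illustrative weeks alternating between 91% and 93%, then 92%.
weekly_accuracy = [91.0, 93.0] * 6 + [92.0]
rolling = rolling_average(weekly_accuracy)
```

Because each new week displaces only one of twelve values, a single bad week moves the rolling figure by a fraction of a point, which is why the metric suits a "sustained accuracy" claim better than any single weekly score.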

What the client got, and what the client did not get

What the client got: 20 full-time Indonesian production agents holding 90%+ sustained accuracy on interpolation annotation across sports, training, and automotive streams. Weekly KPI dashboards. Microsecond-level attribution of every label, disagreement, and re-annotation. Monthly calibration reports showing gold-set scores, drift, and retraining sessions. The accuracy curve from 78% to 92% across the first 12 weeks, with the methodology that produced it.

What the client did not get: public naming of the individual operators or trainers on the program. The agents are Indonesian remote workers whose individual identities are not published. The trainers and QA supervisors are Zipang staff and are also not individually named. The program is referenced in the aggregate (432 deployed, 90%+ sustained, 60 → 20 conversion), not by name.

Contract terms: annual, 30-day replacement guarantee

The program runs on an annual contract, auto-renewing in 12-month terms unless either party gives 60 days' notice. Pricing is per-task-delivered with a quality bonus that triggers when the rolling 12-week accuracy holds above 91%. The 30-day replacement guarantee covers any agent who falls below the gold-set threshold for two consecutive weeks or who triggers an escalation event. A separate 90-day window covers non-performance issues that surface later (consistently slow throughput, repeated QA flags, or failure to maintain shift coverage).
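The bonus and replacement conditions above can be written as two small predicates. This is a sketch only: the function names are hypothetical, the 89% gold-set threshold is taken from the calibration section, and the contract's exact wording (for example, how "holds above 91%" is measured) is not given in the source.

```python
BONUS_THRESHOLD = 91.0      # rolling 12-week accuracy, percent
GOLD_SET_THRESHOLD = 89.0   # weekly gold-set accuracy, percent


def quality_bonus_applies(rolling_12_week_accuracy: float) -> bool:
    """Bonus triggers when the rolling 12-week accuracy is above 91%."""
    return rolling_12_week_accuracy > BONUS_THRESHOLD


def replacement_due(weekly_gold_scores: list[float], escalation_event: bool) -> bool:
    """Replacement applies after two consecutive weeks below threshold or an escalation."""
    two_consecutive_below = any(
        current < GOLD_SET_THRESHOLD and following < GOLD_SET_THRESHOLD
        for current, following in zip(weekly_gold_scores, weekly_gold_scores[1:])
    )
    return escalation_event or two_consecutive_below


# The end-2025 rolling figure quoted in the text.
bonus_at_year_end = quality_bonus_applies(92.1)
```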

Replacement is Zipang's responsibility, not the client's. The client does not re-run hiring; Zipang pulls from the active shortlist pool (fed by the same 5-gate funnel) and onboards the replacement within 5 business days. Across the life of the program, 7 of the original 20 full-time agents have been replaced; the current 20 are a mix of original and replacement operators, with the longest-tenured operator at 28 months.

Result

The program is in its third contract year. Sustained accuracy is at 92.1% on the rolling 12-week average. The 20 full-time cohort is stable. Replacement is functioning. The client has not re-bid the program since the original contract, and the parallel pods in other regions are calibrated against the Indonesian pod's gold set as a baseline.

The 90% number is a contractual floor; the 92.1% is what the program actually delivers. Both are defensible. The accuracy curve from 78% in week 1 to 92% in week 12 is the operational story of how a 5-gate funnel, a frozen gold set, a three-layer review, and weekly calibration combine to land at a number the client can train models on.

Common questions

Why is Transperfect Dataforce named but the France retail client is not?

Transperfect Dataforce is a public vendor and references the program on their case-study materials. The France retail client is a private French retail group that has not consented to public disclosure. The France retail case study therefore describes that program in operational terms only, with the same level of detail but no client name, because the contract does not permit naming.

How did accuracy go from 78% in week 1 to 92% in week 12?

Three-step retraining arc. Weeks 2–4: targeted retraining on the three error clusters identified in the week 1 error analysis (motion blur, small-object detection, scene transitions), using the gold set as the rubric. Weeks 5–8: introduce peer review at frame N, which catches disagreements before they reach the client. Weeks 9–12: add supervisor review at frame N+24 plus a 200-case weekly gold set. The 90% threshold was first crossed in week 11.

What is the threshold to fail the program?

Below 90% for two consecutive weeks triggers a remediation plan. Below 88% for one week triggers an immediate escalation. Below 89% on the weekly gold set for two consecutive weeks removes an operator from live production into 1:1 retraining.

Can the client request additional agents beyond the 20?

Yes, with a 60-day ramp at the same methodology. The bottleneck is the QA reviewer and gold-set cycle, not the operator pool, so headcount expansion is paced to the QA capacity. A 40-operator ramp would need 90 days; a 100-operator ramp would need 6 months.

Is the same SOP used across regions?

Yes. The Indonesian pod and the parallel pods in other regions work from the same SOP, the same gold set, and the same rubric. The Indonesian pod's calibration scores are the baseline against which other regions are compared, and weekly cross-region calibration is run by Transperfect to flag drift.

What does the 30-day replacement guarantee cover?

Any agent who falls below the gold-set threshold for two consecutive weeks, or who triggers an escalation event, is replaced within 30 days. A separate 90-day window covers non-performance issues that surface later: consistently slow throughput, repeated QA flags, or failure to maintain shift coverage. Replacement is Zipang's responsibility, not the client's.

Key takeaways

1. Transperfect Dataforce is a public vendor; the program is named. Accuracy target was 90%; Zipang's sustained accuracy is 92.1% on the rolling 12-week average.
2. 60 of approximately 1,800 applicants (3.3%) reached the trial cohort; 20 of those (33%) converted to full-time production. The cohort is deliberately small.
3. Three-layer review: peer review at frame N, supervisor review at frame N+24, weekly 200-case gold set. Disagreements greater than 5% IoU trigger re-annotation.
4. Accuracy curve: 78% week 1, 82% week 4, 88% week 8, 90% week 11, 92% week 12, oscillating 91–93% since.
5. Annual contract, per-task pricing with a quality bonus above 91%, 30-day replacement guarantee, 90-day non-performance window. Replacement is Zipang's responsibility.
6. What the client did not get: public naming of individual operators or trainers. The program is referenced in aggregate, not by name.

Sources

Data and claims in this article reference verifiable sources (including Zipang research and public data such as APJII, JobStreet, Buffer).

1. Zipang Remote Work Market Research 2026. Zipang Research · 2026-06-14
2. Transperfect Dataforce: Data Annotation Services. Transperfect · 2026-06-14
3. Statistik Tenaga Kerja Indonesia. BPS Indonesia · 2026-06-14
4. EF English Proficiency Index 2025. EF Education First · 2026-06-14
5. Indonesia: The Next Big Talent Story. McKinsey & Company · 2026-06-14
