Note: This work was completed at Amazon. Internal systems and organizational details have been generalized where appropriate to respect confidentiality.
Amazon’s translation pipeline was automating fast. The question nobody had answered: what would the humans actually do in that world? I ran the discovery, synthesized the diagnosis, and delivered to the org’s Director and senior product and engineering leadership a product strategy that gave fragmented teams something concrete to build toward together.
The Mandate
Amazon had already made its bet. AI translation was getting good enough, fast enough, that the path forward was clear: automate as much of the translation pipeline as possible, reduce dependence on post-editing, and cut costs. That decision raised a harder question.
In an AI-driven system, where and when do humans matter? What do they actually do? And what tools do you build for people who are no longer fixing translations one by one, but managing, monitoring, and improving AI systems at scale?
The organizational context made answering this harder than it sounds. Amazon’s localization operation wasn’t a single coherent product. It was a collection of interconnected systems and teams: tooling for linguists, orchestration infrastructure, content owner workflows, quality assurance, and more. Some ran on legacy platforms. Others were being rebuilt from scratch. Product managers across these segments had competing priorities and frequently disagreed over where new capabilities should be built, who should own them, and what “done” looked like. Without a shared vision for where the org was heading, every cross-functional decision became a negotiation.
The old world was built around linear output. Machine translation reduced how much humans had to edit, but humans still oversaw everything. Every piece of content passed through a human at some point. That worked when volume was manageable.
LLMs changed the equation. Content volume exploded. Human review costs exploded with it. And humans were still correcting individual outputs when what the system actually needed was for them to be training the models. Corrections fixed the symptom. They didn’t fix the problem.
The new world needed something different. AI handles the volume. Humans are the feedback layer. The strategy for those two worlds is completely different.
My manager, who ran the product team, brought me in with no specific feature to build and no defined scope. Just: figure out what that new world actually looks like, and make it concrete enough that fragmented teams could use it to make better decisions.
Discovery: Finding the Real Question
I started with the people who lived closest to the problem.
I ran a three-day workshop with Subject Matter Experts from Product, Science, and Operations—three teams that worked adjacent to the same systems but were rarely in the same room to solve a problem together.
Day 1 mapped current reality: existing workflows, where errors surfaced, what happened to feedback, where expertise lived and where it got lost.
Day 2 opened up: blue-sky sessions on the future state, what AI would eventually own, and where humans would still matter.
Day 3 closed the gap: what would need to change to get from here to there, and in what order.
Working alongside one of the ML scientists surfaced something that stuck with me:
An AI translation model had been consistently translating “wireless mouse” as the animal in German. Human translators fixed it hundreds of times across hundreds of product descriptions. The model never learned. Every new batch of content arrived with the same error, and someone fixed it again.
The correction lived in Translation Memory, which meant it could be reused. It didn’t reach the model, which meant the underlying mistake persisted. Human effort was being spent, but it wasn’t accumulating into anything. The system was getting corrections without getting smarter.
This wasn’t a rare edge case. It was the pattern. And it defined the core diagnosis: valuable signal from linguists, quality managers, project managers, and end users existed in separate places with no unified path to model improvement. Human expertise was trapped. The AI was flying partially blind.
The real question wasn’t “how do we update the tooling?” It was: how do we build a system where human expertise actually reaches the models that need it?
The Vision
The product strategy was built around one central idea: in an AI-first translation operation, humans don’t do the work. They make the system better at doing the work.
That sounds like a small shift. It isn’t. It changes what every role looks like, what every tool needs to do, and what “good work” means for the people doing it.
The Improvement Cycle
The strategy defined a continuous improvement cycle: Monitor signals, Identify problems, Refine models, Deploy improvements. Every human role in the org mapped to one or more stages of that cycle—not as an add-on to their existing jobs, but as the new definition of their jobs.
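To make the mapping concrete, here’s a minimal sketch in TypeScript. The four stages come from the strategy; the role names and the specific stage assignments below are my illustration, not a literal org chart.

```typescript
// The four stages of the continuous improvement cycle.
type CycleStage = "monitor" | "identify" | "refine" | "deploy";

// Every role maps to one or more stages. These assignments are
// illustrative, not a literal org chart.
const roleStages: Record<string, CycleStage[]> = {
  projectManager: ["monitor", "identify"],     // quality trends, outcome validation
  assetManager: ["monitor", "deploy"],         // system health, quality thresholds
  scientist: ["identify", "refine", "deploy"], // pattern analysis through deployment
};
```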
Feedback Primitives
Most translation feedback came from linguists through professional tooling—CAT tools and post-editing workflows designed for people who do this work full-time. That’s appropriate for what linguists do. But it limits the signal pool to one type of expert, inside one type of workflow.
There’s a lot of signal that never gets collected. A domain expert reviewing translated technical documentation. A customer support agent who sees the same misunderstanding come up repeatedly. End users who notice a translation feels off and could tell you it was wrong, even if they couldn’t say what or why.
None of them are in the translation workflow. None of them have access to heavyweight professional tools. And nobody was asking them anyway.
The feedback primitives were designed to show that didn’t have to be true. If you went small, simple, and reusable, you could collect signal from almost anywhere, from people with no linguistic expertise and no specialized workflow.
| Primitive | How it works | Why it matters |
|---|---|---|
| Swipe (accept/reject) | Immediate gut reaction, one gesture | No expertise required. An end user can do this from a product page. |
| Star rating | 1–5 quality assessment | More nuance than accept/reject. Useful for trending quality over time. |
| A/B comparison | Choose the better translation | People recognize quality even when they can’t articulate why. Works across expertise levels. |
| Likert scale | “This translation would prevent me from buying this product” | Connects quality directly to business outcomes. |
| Tagging | Structured error categorization | Powers pattern analysis downstream. |
The same primitive a scientist deploys in an ad hoc study could sit inside a linguist’s queue, surface in a customer-facing context, or live in a domain expert’s review flow. Not a single heavyweight system. A small toolkit that could travel across the org.
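To make “small, simple, and reusable” concrete, here’s a minimal sketch of what a shared feedback schema could look like. The names are hypothetical (FeedbackEvent, the surface values, the content ID shape are all invented; the real systems were internal):

```typescript
// One reusable feedback envelope covering all five primitives, so the same
// payload can come from a linguist's queue, a product page, or an ad hoc study.
type FeedbackSignal =
  | { kind: "swipe"; accepted: boolean }
  | { kind: "stars"; rating: 1 | 2 | 3 | 4 | 5 }
  | { kind: "ab"; candidates: [string, string]; chosenIndex: 0 | 1 }
  | { kind: "likert"; statement: string; agreement: 1 | 2 | 3 | 4 | 5 }
  | { kind: "tag"; errorCategories: string[] };

interface FeedbackEvent {
  contentId: string;   // which translation this is about (hypothetical ID shape)
  locale: string;      // e.g. "de-DE"
  surface: "linguist-queue" | "product-page" | "expert-review" | "ad-hoc-study";
  signal: FeedbackSignal;
  timestamp: string;   // ISO 8601
}

// Example: an end user rejecting a translation from a product page.
const event: FeedbackEvent = {
  contentId: "product-B0XYZ/description",
  locale: "de-DE",
  surface: "product-page",
  signal: { kind: "swipe", accepted: false },
  timestamp: new Date().toISOString(),
};
```

The design point is the discriminated union: every surface emits the same envelope, so downstream pattern analysis doesn’t care where the signal came from.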
Role Evolution
The bigger strategic challenge wasn’t the feedback interfaces. It was the roles.
As the system moved toward full automation, three existing roles needed to transform—not get replaced, but transform.
| Role | Before | After |
|---|---|---|
| Project Manager | Managing individual translation jobs through the pipeline | Overseeing content segments as a strategic unit. Monitoring quality trends. Validating AI outcomes at scale. |
| Asset Manager | Maintaining translation memories and terminology by hand | Monitoring system health metrics. Setting quality thresholds. Guiding AI-driven optimization. |
| Scientist | Focused improvements to specific models | Owning the full improvement cycle: pattern analysis, experiment design, deployment validation. |
These weren’t just job description changes. They required new tools, new interfaces, and new mental models for what good work looked like. I designed for each of them: personas showing how their work would evolve, wireframes illustrating what their day-to-day tooling would need to do, and a process map walking through a complete improvement loop from signal detection to model deployment.
Dynamic Content Routing
Separate from the role work, I developed a concept for how the translation system itself should make smarter decisions: a signal map for dynamic content routing.
Instead of treating all content the same way, the system would read content metadata to make real-time cost and quality tradeoffs. Different content carries different risk. A legal disclaimer and a product feature bullet don’t need the same quality threshold. A system that understands that can put human effort where it actually matters and route everything else to automation with confidence.
The improvement loop extended this further: AI estimates quality, an LLM runs automatic post-edits on content that falls below threshold, human review only happens where the risk warrants it, and performance signals feed back in to flag content for reprocessing. The system gets smarter over time without requiring proportional increases in human effort.
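A minimal sketch of that routing decision, with invented risk tiers, thresholds, and route labels standing in for the real (and richer) signal map:

```typescript
type RiskTier = "high" | "medium" | "low";

interface ContentMeta {
  contentType: string; // e.g. "legal-disclaimer", "feature-bullet"
  riskTier: RiskTier;
}

type Route = "publish" | "llm-post-edit" | "human-review";

// Illustrative quality thresholds per risk tier: a legal disclaimer
// demands more confidence than a product feature bullet.
const thresholds: Record<RiskTier, number> = { high: 0.95, medium: 0.85, low: 0.7 };

function route(meta: ContentMeta, estimatedQuality: number): Route {
  if (estimatedQuality >= thresholds[meta.riskTier]) return "publish";
  // Below threshold: high-risk content goes straight to a human;
  // everything else gets an automatic LLM post-edit pass first.
  return meta.riskTier === "high" ? "human-review" : "llm-post-edit";
}

// The same quality estimate routes differently depending on risk.
console.log(route({ contentType: "legal-disclaimer", riskTier: "high" }, 0.9)); // "human-review"
console.log(route({ contentType: "feature-bullet", riskTier: "low" }, 0.9));    // "publish"
```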
The direction: move from treating human oversight as a cost to be minimized, to treating human judgment as a precision instrument, deployed only where it’s irreplaceable.
Outcome
The output was an executive presentation delivered to the org’s Director and senior product and engineering leadership in February 2025. It contained four components: proto-personas for each evolved role, wireframes illustrating what their tooling would need to do, a process map showing the full improvement cycle with a worked example, and the dynamic routing concept.
It gave the organization a shared language for a future it was heading toward but hadn’t clearly described. A frame that fragmented product teams could use to ground cross-functional decisions in something more durable than competing priorities. Leadership could see, concretely, what the evolved roles would look like, what tools they’d need, and what the AI systems supporting them would have to do differently.
The immediate next step was a defined collaboration with the Science team to iterate on short-term feedback needs—taking the vision from concept to an active working agenda.
This is what product strategy looks like when there’s no defined spec to execute against. Structured discovery, cross-functional alignment, reasoning backwards from a future state to identify what needs to change now. The artifacts were in service of a bigger question: how does this organization stay effective as AI takes over the work? That question doesn’t go away when you change domains. Any AI-first company is asking a version of it.
Reflection
The industry did largely what the framework predicted. Organizations shifted toward oversight models, connected AI systems to asset libraries, and focused human linguist effort on high-priority or sensitive content. AI handles the volume. Humans stay close to the decisions that matter most.
That transition was right. It’s also already being overtaken.
The feedback loop question got the right diagnosis, wrong layer. The framework asked: what if we actually used all those linguist edits—the hundreds of corrections happening every day—to drive real model improvement? Most organizations settled for routing. Few built genuine feedback loops. And the ones that did found that correction-based feedback is downstream of where the real leverage lives.
In an agentic world, upstream context matters more than downstream correction. The question isn’t “how do we fix what AI gets wrong?” It’s “what does AI need to know before it starts?” Content metadata, domain context, organizational knowledge about what good looks like and why. The better you build that foundation, the better every downstream decision gets: quality estimation, routing, how much human review is actually warranted.
That’s where I’d focus today. The process mapping in the framework gestured at this—using content metadata to drive smarter routing decisions. I’d push it much further. What context does an AI system actually need to operate well in this domain? Where does that knowledge live in the organization? How is it maintained as the domain evolves? Building the systems that encode what an organization knows into forms AI can actually use is the design problem I’d prioritize.
Which reframes the oversight role entirely. It’s not “how do humans monitor AI outputs.” It’s “how does AI learn from human expertise at the specification level, before any content is processed.” Defining that relationship is where the most consequential work is now—not how human effort corrects AI mistakes, but how human knowledge enters AI systems in the first place.
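A toy sketch of what that upstream context might contain; every field name here is hypothetical. Note that a maintained glossary entry would have prevented the “wireless mouse” error before anyone had to correct it:

```typescript
// Hypothetical shape of the context an AI translation system could consult
// before processing content, rather than being corrected afterward.
interface TranslationContext {
  glossary: Record<string, string>; // approved term translations
  doNotTranslate: string[];         // brand names, product identifiers
  tone: "formal" | "informal";      // e.g. German Sie vs. du
  domain: string;                   // "consumer-electronics", "legal", ...
  qualityBar: string;               // what "good" means here, and why
  lastReviewed: string;             // context decays; someone has to maintain it
}

// With a maintained glossary, the mouse-as-animal error never happens.
const deDE: TranslationContext = {
  glossary: { "wireless mouse": "kabellose Maus" },
  doNotTranslate: ["Kindle", "Echo"],
  tone: "formal",
  domain: "consumer-electronics",
  qualityBar: "No terminology errors; natural register for product detail pages.",
  lastReviewed: "2025-01-15",
};
```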
When the problem is ambiguous, the first job is creating the conditions for clarity. The discovery process is the design process. You’re not waiting to start. You’re already doing it.
Artifacts: Executive presentation deck (45 slides), role evolution personas, wireframes, improvement cycle process map, dynamic signal map concept
Portfolio visuals recreated in Figma (originals confidential): figma.com/design/1d97slGQI1aUmweQjbE9hk