Temporal - the iPhone of System Design
- You can listen to the audio narrated version here.
- I presented these as talks for CascadiaJS (9 min lightning talk) and React NYC (30min full talk)
I'm excited to finally share why I've joined Temporal.io as Head of Developer Experience. It's taken me months to precisely pin down why I have been obsessed with Workflows in general and Temporal in particular.
It boils down to 3 core opinions: Orchestration, Event Sourcing, and Workflows-as-Code.
Target audience: product-focused developers who have some understanding of system design, but limited distributed systems experience and no familiarity with workflow engines. My only goal is to outline Temporal's core design goals and to imply that, if you share these goals, then you will eventually build something like Temporal, as Mitchell Hashimoto put it. I will not explain how it works, how to get started, or even really what Temporal is — that comes later.
The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.
- You want to standardize timeouts and retries.
- You want to offer "reliability on rails" to every team.
- You must never drop any work.
- You must log all progress.
- You want to easily model dynamic asynchronous logic...
- ...and reuse, test, version and migrate it.
Finally, you want all this to scale: the same programming model should take you from small use cases to millions of users without re-platforming. Temporal is the best way to do all this, by writing idiomatic code known as "workflows".
Suppose you are executing some business logic that needs to go through System A, then System B, and then System C. Easy enough right?

But:
You could deal with B by just looping until you get a successful response, but that ties up compute. Probably the better way is to persist the incomplete task in a database and set a cron job to periodically retry the call.
Dealing with C is similar, but with a twist. You still need B's code to retry the API call, but you also need another (shorter lived, independent) scheduler to place a reasonable timeout on C's execution time since it doesn't report failures when it goes down.
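To make that concrete, here is a rough sketch of the do-it-yourself version: a tasks table plus two cron jobs. All of the names below (tasksDb, callSystemB, callSystemC) are made up for illustration.
// Cron job #1 (runs every minute or so): retry incomplete calls to System B.
async function retryPendingB() {
  for (const task of await tasksDb.find({ target: 'B', status: 'pending' })) {
    try {
      await callSystemB(task.payload)
      await tasksDb.update(task.id, { status: 'done' })
    } catch (err) {
      await tasksDb.update(task.id, { attempts: task.attempts + 1, lastError: String(err) })
    }
  }
}
// Cron job #2 (runs more often): time out calls to System C, which never
// reports failures on its own, and put them back in the retry pool.
async function timeOutStuckC(timeoutMs = 60_000) {
  for (const task of await tasksDb.find({ target: 'C', status: 'running' })) {
    if (Date.now() - task.startedAt > timeoutMs) {
      await tasksDb.update(task.id, { status: 'pending' })
    }
  }
}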

Wiring together queues, timers, databases, and serverless functions just to do retries (just retries!) is a real architecture recommended by AWS:

But imagine doing this per system. Pretty soon your architecture looks like this:

Do this often enough and you soon realize that timeouts and retries are standard production-grade requirements when crossing any system boundary, whether you are calling an external API or just a different service owned by your own team.
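Concretely, this is the kind of wrapper every team ends up rewriting at some system boundary. callWithRetries and its defaults below are just a sketch, not any particular library:
// A hypothetical helper that adds a timeout and retries with linear backoff to any call.
async function callWithRetries(fn, { retries = 5, timeoutMs = 10_000, backoffMs = 1_000 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await Promise.race([
        fn(),
        new Promise((_, reject) => setTimeout(() => reject(new Error('timed out')), timeoutMs)),
      ])
    } catch (err) {
      if (attempt >= retries) throw err
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt))
    }
  }
}
// usage: await callWithRetries(() => httpGet('https://www.example.com/callA'))
And even this in-process version doesn't survive the process itself dying, which is why the queues and databases above keep creeping back in.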
Instead of writing custom code for timeout and retries for every single service every time, is there a better way? Sure, we could centralize it!

We have just rediscovered the need for orchestration over choreography. There are various names for the combined A-B-C orchestration we are doing: depending on who you ask, it is called a Job Runner, a Pipeline, or a Workflow.
Honestly, what interests me (more than the deduplication of code) is the deduplication of infrastructure. The maintainer of each system no longer has to provision the additional infrastructure needed for this stateful, potentially long-running work. This drastically simplifies maintenance — you can shrink your systems down to as small as a single serverless function — and makes it easier to spin up new ones, with the retry and timeout standards you now expect from every production-grade service. Workflow orchestrators are "reliability on rails".
But there's a risk of course — you've just added a centralized dependency to every part of your distributed system. What if it ALSO goes down?
The work that your code does is mission critical. What does that really mean?
There are two ways to track all this state. The usual way starts with a simple task queue, and then adds logging:
(async function workLoop() {
  const nextTask = taskQueue.pop()
  await logEvent('starting task:', nextTask.ID)
  try {
    await doWork(nextTask) // this could fail!
  } catch (err) {
    await logEvent('reverting task:', nextTask.ID, err)
    taskQueue.push(nextTask)
  }
  await logEvent('completed task:', nextTask.ID)
  setTimeout(workLoop, 0)
})()
But logs-as-afterthought has a bunch of problems: the loop above logs "completed" even when the task failed and was re-queued, and if the process crashes between doing the work and logging it, you can no longer tell from the log whether to retry (risking duplicate work) or skip (risking dropped work).
The alternative to logs-as-afterthought is logs-as-truth: If it wasn't logged, it didn't happen. This is also known as Event Sourcing. We can always reconstruct current state from an ever-growing list of eventHistory:
(function workLoop() {
  const nextTask = reconcile(eventHistory, workStateMachine)
  doWorkAndLogHistory(nextTask, eventHistory) // transactional
  setTimeout(workLoop, 0)
})()
The next task is strictly determined by comparing the event history to a state machine (provided by the application developer). Work is either done and committed to history, or not at all.
I've handwaved away a lot of heavy lifting done by reconcile and doWorkAndLogHistory, but this solves a lot of problems. You can also make an analogy to the difference between "filename version control" and git: using event histories as your source of truth is comparable to a git repo that reflects all commits to date.
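For a rough sense of what reconcile and doWorkAndLogHistory are doing, here are minimal sketches of both. Everything about them (the event shape, the linear list of steps, the transactional append) is a simplifying assumption, and a real engine handles branching, timers, retries, and versioning on top of this:
// reconcile: replay the event history against a state machine (here just an
// ordered list of steps) to find the next step that has not completed yet.
function reconcile(eventHistory, workStateMachine) {
  const completed = new Set(
    eventHistory
      .filter((event) => event.type === 'TASK_COMPLETED')
      .map((event) => event.taskID)
  )
  return workStateMachine.steps.find((step) => !completed.has(step.taskID)) ?? null
}
// doWorkAndLogHistory: do the work and append the result to history in one
// transaction, so a task is either done-and-recorded or not done at all.
// `db` is a hypothetical database client with a transactional append.
async function doWorkAndLogHistory(task, eventHistory) {
  await db.transaction(async (tx) => {
    const result = await doWork(task) // assumed idempotent for this sketch
    await tx.append(eventHistory, { type: 'TASK_COMPLETED', taskID: task.taskID, result })
  })
}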
But there's one last problem to deal with - how exactly should the developer specify the full state machine?
The prototypical workflow state machine is a JSON or YAML file listing a sequence of steps. But this abuses configuration formats for expressing code: it doesn't take long before you start adding features like conditional branching, loops, and variables, until you have an underspecified Turing-complete "domain specific language" hiding out in your JSON/YAML schema. Here is what that can look like:
[
  {
    "first_step": {
      "call": "http.get",
      "args": {
        "url": "https://www.example.com/callA"
      },
      "result": "first_result"
    }
  },
  {
    "where_to_jump": {
      "switch": [
        {
          "condition": "${first_result.body.SomeField < 10}",
          "next": "small"
        },
        {
          "condition": "${first_result.body.SomeField < 100}",
          "next": "medium"
        }
      ],
      "next": "large"
    }
  },
  {
    "small": {
      "call": "http.get",
      "args": {
        "url": "https://www.example.com/SmallFunc"
      },
      "next": "end"
    }
  },
  {
    "medium": {
      "call": "http.get",
      "args": {
        "url": "https://www.example.com/MediumFunc"
      },
      "next": "end"
    }
  },
  {
    "large": {
      "call": "http.get",
      "args": {
        "url": "https://www.example.com/LargeFunc"
      },
      "next": "end"
    }
  }
]
Compare the same logic expressed directly in code:
async function dataPipeline() {
  const { body: { SomeField } } = await httpGet("https://www.example.com/callA")
  if (SomeField < 10) {
    await httpGet("https://www.example.com/SmallFunc")
  } else if (SomeField < 100) {
    await httpGet("https://www.example.com/MediumFunc")
  } else {
    await httpGet("https://www.example.com/LargeFunc")
  }
}
The benefit of using general purpose programming languages to define workflows — Workflows-as-Code — is that you get the full set of tooling that is already available to you as a developer: from IDE autocomplete to linting to syntax highlighting to version control to ecosystem libraries and test frameworks. But perhaps the biggest benefit of all is the reduced need for context switching from your application language to the workflow language. (So much so that you could copy over code and get reliability guarantees with only minor modifications.)
This config-vs-code debate arises in multiple domains: You may have encountered this problem in AWS provisioning (CloudFormation vs CDK/Pulumi) or CI/CD (debugging giant YAML files for your builds). Since you can always write code to interpret any declarative JSON/YAML DSL, the code layer offers a superset of capabilities.
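To make the "superset" point concrete, here is roughly how little code it takes to interpret the JSON DSL above. This is just an illustration (not how any real engine works), and the naive expression evaluation is emphatically not production-safe:
// Walk the list of steps, maintaining a scope of named results, following
// `call`, `switch`, and `next` the way the example above uses them.
async function runDsl(steps, callables) {
  const ordered = steps.map((entry) => {
    const [name, step] = Object.entries(entry)[0]
    return { name, step }
  })
  const byName = Object.fromEntries(ordered.map(({ name, step }) => [name, step]))
  const scope = {}
  let name = ordered[0].name
  while (name && name !== 'end') {
    const step = byName[name]
    if (step.call) {
      scope[step.result ?? '_'] = await callables[step.call](step.args)
      // fall through to the next listed step if no explicit `next`
      name = step.next ?? ordered[ordered.findIndex((s) => s.name === name) + 1]?.name
    } else if (step.switch) {
      // evaluate "${...}" conditions against the scope; fine for a demo, unsafe in real life
      const evaluate = (cond) =>
        new Function(...Object.keys(scope), `return ${cond.slice(2, -1)}`)(...Object.values(scope))
      const match = step.switch.find((branch) => evaluate(branch.condition))
      name = match ? match.next : step.next
    }
  }
}
// usage: await runDsl(pipelineJson, { 'http.get': async ({ url }) => ({ body: await (await fetch(url)).json() }) })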
So for our mission-critical, long-running work, we've identified three requirements:
- Orchestration
- Event Sourcing
- Workflows-as-Code
Respectively, these solve the pain points of reliability boilerplate, implementing observability/recovery, and modeling arbitrary business logic.
If you were to build this on your own, you would have to assemble and operate each of these pieces yourself.
Finally, you'd have to make your system scale for many users (horizontal scaling + load balancing + queueing + routing) and many developers (workload isolation + authentication + authorization + testing + code reuse).
When Steve Jobs introduced the iPhone in 2007, he introduced it as "a widescreen iPod with touch controls, a revolutionary mobile phone, and a breakthrough internet communications device", before stunning the audience: "These are not three separate devices. This is ONE device."
This is the potential of Temporal. Temporal has opinions on how to make each piece best-in-class, but the tight integration creates a programming paradigm that is ultimately greater than the sum of its parts:
A fun anecdote about how I got the job: through blogging.
While exploring the serverless ecosystem at Netlify and AWS, I always had the nagging feeling that it was incomplete and that the most valuable work was always "left as an exercise to the reader". The feeling crystallized when I rewatched DHH's 2005 Ruby on Rails demo and realized that there was no way the serverless ecosystem could match up to it. We broke up the monolith to scale it, but there were just too many pieces missing.
I started analyzing cloud computing from a "Jobs to Be Done" framework and wrote two throwaway blogposts called Cloud Operating Systems and Reconstituting the Monolith. My ignorant posting led to an extended comment from a total internet stranger telling me all the ways I was wrong. Lenny Pruss, who was ALSO reading my blogpost, saw this comment, and got Ryland to join Temporal as Head of Product, and he then turned around and pitched (literally pitched) me to join.
One blogpost, two jobs. Learn in Public continues to amaze me by the luck it creates.
Still, why would I quit a comfy, well-paying job at Amazon to work harder for less money at a startup like this?

There is much work to do, though. Temporal Cloud needs a lot of automation and scaling before it becomes generally available. Temporal's UI is in the process of a full rewrite. Temporal's docs need a lot more work to fully explain such a complex system with many use cases. Temporal still doesn't have a production-ready Node.js or Python SDK. And much, much, more to do before Temporal's developer experience becomes accessible to the majority of developers.
I've probably exhausted your patience at this point but at least I hope you see that I genuinely think the potential is humongous. And yet I'm still understating it.
Temporal today is pitched as "reliability on rails" or as a "workflow-as-code microservices orchestration engine", in the same way that the initial pitch of the iPhone led with "a widescreen iPod with touch controls". We do that because it maps Temporal to things you already know: queues, databases, cronjobs, job runners, data and provisioning pipelines.
But now all my iPhone audio comes from Spotify and Overcast, I barely use the phone functionality, and I'm using the mobile Internet the rest of the time. The equivalent decade-long potential of Temporal is as ambitious as defining an "8th layer" to the OSI 7 Layer model and reinventing asynchronous programming the way iPhone reinvented smartphones.
Long-time readers will recognize this as a "Strategy Turn": that it will happen is a matter of when, not if.
If what I've laid out excites you, take a look at our open positions (or write in your own!), and join the mailing list!
- Yan Cui's guide to Orchestration vs Choreography
- InfoQ: Coupling Microservices - a non-Temporal focused discussion of Orchestration
- A Netflix Guide to Microservices
- Martin Fowler on Event Sourcing
- Kickstarter's guide to Event Sourcing
- Dealing with failure - when to use Workflows
- The macro problem with microservices - Temporal in context of microservices
- Designing A Workflow Engine from First Principles - Temporal Architecture Principles
- Writing your first workflow - 20min code video
- Case studies and External Resources from our users