FaaS is stateless, and AWS Step Functions provides state as-a-Service. It’s a brilliant idea, but it’s not (yet) a revolution in serverless architecture.

Ben Kehoe
Published in Serverless Zone
Dec 5, 2016

tl;dr: I’m excited and disappointed about AWS Step Functions. I’m excited because I think it’s an excellent paradigm and a good service. I’m disappointed because its main use case, with associated limits and pricing, doesn’t allow that paradigm to be fully brought into serverless architecture.

The AWS Step Functions service was announced this week at AWS re:Invent. Step Functions provides state machines that manage workflows involving Lambda functions. It appears to be built by the Simple Workflow folks. It includes an extensive specification of a JSON-based (allow YAML as input plz) declarative language for its state machines.
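For a sense of what that declarative language looks like, here is a minimal sketch of a two-step state machine, built as a Python dict and serialized to JSON. The field names (`StartAt`, `States`, `Type`, `Resource`, `Next`, `End`, `TimeoutSeconds`) come from the published state language; the state names, account ID, and Lambda ARNs are made up for illustration.

```python
import json

# Hypothetical ARNs for illustration only; substitute real Lambda function ARNs.
VALIDATE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:validate-order"
CHARGE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:charge-card"


def build_definition() -> dict:
    """Return a two-state machine: ValidateOrder, then ChargeCard."""
    return {
        "Comment": "Illustrative two-step workflow",
        "StartAt": "ValidateOrder",
        "States": {
            "ValidateOrder": {
                "Type": "Task",
                "Resource": VALIDATE_ARN,
                "Next": "ChargeCard",
            },
            "ChargeCard": {
                "Type": "Task",
                "Resource": CHARGE_ARN,
                "TimeoutSeconds": 30,  # task-level timeout
                "End": True,  # terminal state
            },
        },
    }


definition_json = json.dumps(build_definition(), indent=2)
print(definition_json)
```

The resulting JSON string is the kind of document you would hand to the service when creating a state machine.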

As a longtime user of state machine libraries (thanks for the memories, smach), I am excited about the introduction of this paradigm to serverless architecture. State machines are a great way to reason about state within a system, and can be especially useful at a small scale (e.g., as part of a transaction).

I also think it’s a great paradigm to marry synchronous external invocations (e.g., a client calling API Gateway) to robust asynchronous call chains. It solves the problem of nested synchronous Lambda invocations costing double or triple the execution time (as the caller waits for the callee).
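The "double or triple" figure can be checked with a little arithmetic. This is a rough sketch that assumes billing is simply proportional to wall-clock duration and ignores per-request charges and billing granularity; the durations are invented.

```python
def nested_billed_ms(own_work_ms):
    """Billed time when each function synchronously invokes the next.

    Function i is billed for its own work plus the time it spends
    waiting on everything downstream of it.
    """
    return sum(sum(own_work_ms[i:]) for i in range(len(own_work_ms)))


def chained_billed_ms(own_work_ms):
    """Billed time when each function hands off and exits immediately."""
    return sum(own_work_ms)


# Three functions doing one second of work each.
even = [1000, 1000, 1000]
print(nested_billed_ms(even), chained_billed_ms(even))

# Two thin wrappers in front of one slow function.
skewed = [100, 100, 1000]
print(nested_billed_ms(skewed), chained_billed_ms(skewed))
```

In the first case the nested chain is billed 6,000 ms against 3,000 ms for the asynchronous chain (double); in the second it is 3,300 ms against 1,200 ms (close to triple), because the wrappers spend nearly all of their billed time waiting.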

Functions as a Service like AWS Lambda need to be stateless to provide the provisioning that enables serverless applications to be transparently scalable. But serverless applications themselves are generally stateful in some way (unless all state is kept client-side). Often this state is kept in a SaaS database, but sometimes small amounts of transient state need to be kept, like in distributed transactions. State as a Service, in the form of state machines, could reduce the amount of roll-your-own distributed state machine code that gets written (probably incorrectly).

But Step Functions isn’t built to provide small amounts of transient state. Instead, Step Functions is built to provide a medium amount of semi-persistent state. It’s a descendant of SWF, the Simple Workflow service, which was designed for business processes. Simon Wardley thinks Step Functions will be huge for business processes, and he might be right.

At iRobot, we considered SWF for handling distributed transactions, but settled on SQS with a “checklist” that has all steps re-checked on retry. Part of the reason was lack of comprehensive support in SWF for Lambda, but SWF is also expensive. You pay per workflow execution, as well as for the duration of the workflow.
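A minimal sketch of what such a checklist could look like. This illustrates the general idea only and is not the actual iRobot implementation: the queue is simulated by calling the handler twice, and the `Step` type, the step names, and the in-memory `state` dict are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]   # cheap check against durable state
    perform: Callable[[], None]   # does the work; may raise


def process_message(checklist: List[Step]) -> bool:
    """Run on every delivery of the message.

    Every step is re-checked from the top, so a retry skips completed
    steps and resumes at the first one that is not done. Returns True
    when the message can be deleted from the queue.
    """
    for step in checklist:
        if step.is_done():
            continue
        try:
            step.perform()
        except Exception:
            return False  # leave the message on the queue for redelivery
    return True


# Simulated durable state and a step that fails on its first attempt.
state = {"reserved": False, "charged": False}
attempts = {"charge": 0}


def reserve() -> None:
    state["reserved"] = True


def charge() -> None:
    attempts["charge"] += 1
    if attempts["charge"] == 1:
        raise RuntimeError("transient failure")
    state["charged"] = True


checklist = [
    Step("reserve", lambda: state["reserved"], reserve),
    Step("charge", lambda: state["charged"], charge),
]

first_delivery = process_message(checklist)   # charge fails, message stays queued
second_delivery = process_message(checklist)  # reserve is skipped, charge succeeds
print(first_delivery, second_delivery, state)
```

The design relies on each step having a reliable "is it done?" check, which is what makes re-running the whole list on every retry safe.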

Step Functions is cheaper than SWF. There’s no duration-based cost, and it’s 4 times cheaper per execution than SWF. But at $0.025 per 1,000 executions, it’s 125 times more expensive per invocation than Lambda. Now, keeping state is hard, and I’m not sure I’d expect a state machine-based replacement for synchronous Lambda call chains to actually be cheaper overall, but at this price, it’s not going to be cost-effective to replace existing roll-your-own Lambda solutions.
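The 125x figure can be reproduced as follows. The Step Functions price is the one quoted above; the Lambda figure of $0.20 per million requests is an assumption taken from Lambda's published request pricing (it is not stated in this post) and it excludes Lambda's separate duration charge. The monthly volume is invented.

```python
from fractions import Fraction

# Price quoted above: $0.025 per 1,000 executions.
STEP_FUNCTIONS_PER_EXECUTION = Fraction("0.025") / 1_000

# Assumption: Lambda request price of $0.20 per 1,000,000 requests,
# not counting the duration (GB-second) charge.
LAMBDA_PER_REQUEST = Fraction("0.20") / 1_000_000

ratio = STEP_FUNCTIONS_PER_EXECUTION / LAMBDA_PER_REQUEST
print(ratio)

# Hypothetical volume: 10 million executions in a month.
monthly_executions = 10_000_000
step_functions_cost = STEP_FUNCTIONS_PER_EXECUTION * monthly_executions
lambda_request_cost = LAMBDA_PER_REQUEST * monthly_executions
print(float(step_functions_cost), float(lambda_request_cost))
```

At that volume the gap is $250 against $2 in request charges, which is why the per-unit price matters so much for high-frequency, transient use.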

Worse, the default throttling limit for a state machine is two executions per second. Obviously this limit can be raised, but it indicates the intended uses. I’d draw an analogy to scheduled CloudWatch Events, which also had severe limits at launch. Yes, it’s useful for running periodic Lambdas, but it’s not built to handle massively scaled but transient event scheduling — I can’t use it to let Lambdas delay themselves by creating an individual scheduled event in every Lambda invocation.

Step Functions has a lot of cool features. You can have state machine executions that last a year(!), set timeouts on tasks, interrupt execution, have tasks send heartbeats, and monitor and audit at fine granularity. These are all great, especially for the business process use case.

On the other hand, I would give up a lot of these features if it enabled cheap, high-scale state machines using an event-driven paradigm.

In Step Functions, workers (called activities) must poll for tasks, and Lambda invocations by Step Functions are synchronous. I get why both of these decisions were made. But allowing a task to be specified as some sort of endpoint (asynchronous Lambda, SNS topic, HTTP endpoint, etc.), with a URL provided in the payload for posting back the output, would allow for more event-driven architectures.
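To make the callback idea concrete, here is a toy in-memory sketch. It is not a Step Functions API: the `CallbackStateMachine` class, the `callback_url` payload field, and the `states.example.com` URL are all hypothetical, and the `send` function stands in for an asynchronous Lambda invoke, SNS publish, or HTTP POST.

```python
import uuid
from typing import Any, Callable, Dict, List


class CallbackStateMachine:
    """Pushes each task to an endpoint and advances when output is posted back."""

    def __init__(self, steps: List[str], send: Callable[[str, Dict[str, Any]], None]):
        self.steps = steps                  # ordered endpoint names
        self.send = send                    # fire-and-forget delivery
        self.pending: Dict[str, int] = {}   # callback token -> step index
        self.output: Any = None
        self.finished = False

    def start(self, input_data: Any) -> None:
        self._dispatch(0, input_data)

    def _dispatch(self, index: int, data: Any) -> None:
        token = uuid.uuid4().hex
        self.pending[token] = index
        self.send(self.steps[index], {
            "input": data,
            # Hypothetical URL the task would POST its output to.
            "callback_url": f"https://states.example.com/callback/{token}",
        })

    def handle_callback(self, token: str, output: Any) -> None:
        index = self.pending.pop(token)  # unknown or reused tokens raise KeyError
        if index + 1 < len(self.steps):
            self._dispatch(index + 1, output)
        else:
            self.output, self.finished = output, True


# Simulate asynchronous delivery with an outbox drained by two workers.
outbox: List[tuple] = []
machine = CallbackStateMachine(["double", "add_one"],
                               lambda endpoint, payload: outbox.append((endpoint, payload)))
workers = {"double": lambda x: x * 2, "add_one": lambda x: x + 1}

machine.start(5)
while outbox:
    endpoint, payload = outbox.pop(0)
    token = payload["callback_url"].rsplit("/", 1)[-1]
    machine.handle_callback(token, workers[endpoint](payload["input"]))

print(machine.finished, machine.output)
```

Nothing in this arrangement polls or blocks: the state machine only holds a token per outstanding task, and no function is billed for waiting on another.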

What would I give up for a cheaper version, one on the order of Lambda prices? Timeouts longer than 15 minutes. The ability to reliably interrupt a state machine execution. Activities and synchronous Lambda invocations. Visibility (e.g., the state as returned by the API could be eventually consistent).

I’m hopeful that State as-a-Service is a paradigm that can be made to work for low-level, transient state needed by FaaS in serverless architectures, whether it’s through Step Functions or another service.
