Lambda Auto Retries

When our code runs on AWS Lambda, we need to be aware of how the platform handles errors. In general, errors fall into two categories.

Invocation error

This typically happens before the request reaches the runtime or our code. In other words, the Lambda service cannot invoke the function. Examples include:

  • The request payload is too large, the request is malformed, or it targets a non-existent function
  • The caller doesn’t have permission
  • Rate limiting and throttling

Function error

This is raised by the runtime or by our code.

  • Runtime: the runtime is the part of the iceberg below the waterline. It does a lot of work behind the scenes, such as unmarshalling the request and marshalling the response. Errors at this level also include the function timing out or running out of memory.

  • Function code: the main workload we develop to run on the runtime, and the part where we control which errors are thrown.

If we think of Lambda in request/response terms, it is analogous to HTTP request/response.

Invocation errors are similar to an HTTP 4xx or 5xx error: some precondition for invoking the function is not satisfied, caused by either the client or the Lambda service. A function error is more like a normal response: the invocation itself succeeds, so the response carries a 2xx status code, and a special response header signals that the function failed:

X-Amz-Function-Error

Read together with the status code of the invocation response:

  • 2xx with X-Amz-Function-Error present: runtime or function errors
  • 4xx: the invoking client can alter the request, request permission or retry the request
  • 5xx: an issue with Lambda, or an issue with the function’s configuration or resources
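The mapping above can be sketched as a small classifier. This is only an illustration: InvokeResult, Outcome and classifyInvokeResult are hypothetical names, and the input is assumed to be the status code plus the value of the X-Amz-Function-Error header (if present), however the calling client exposes them.

```typescript
// Hypothetical shape: only the two fields of an invocation response we care about.
interface InvokeResult {
  statusCode: number;
  // Value of the X-Amz-Function-Error header, if the header was present.
  functionError?: string;
}

type Outcome = "success" | "function-error" | "client-error" | "service-error";

function classifyInvokeResult(result: InvokeResult): Outcome {
  if (result.statusCode >= 500) return "service-error";
  if (result.statusCode >= 400) return "client-error";
  // Below 400 (in practice 2xx): the invocation itself worked,
  // so the header decides whether the function failed.
  return result.functionError ? "function-error" : "success";
}
```

A caller could then decide per outcome whether to fix the request, back off and retry, or inspect the function's own error payload.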

Errors in app code

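Depending on how the function is invoked, a function error can trigger an automatic retry; asynchronous invocations, for example, are retried by Lambda when the function returns an error. Error handling in app code is therefore required to avoid automatic retries when they are unnecessary, especially when a retried invocation can cause problems. For example, if a resource was already created in the previous invocation, the retry may create a duplicate.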

Likewise, we can categorise the errors:

  • Internal errors

    • language-level errors: syntax errors, null / undefined access, reference errors in JavaScript etc.

    • Dependency wiring errors

    • incorrect configurations

    • access denied due to lack of permissions

  • External errors

    • Invalid request

    • HTTP errors

    • AWS service access errors

    • DB access errors

    • 3rd party services errors

    • network jitter

The first category can mostly be addressed during development or testing. If such errors still occur, they are ordinary defects and should be fixed.

The second category is mostly unexpected and cannot be avoided entirely, since no service is 100% reliable.

Handling practices

recoverable and unrecoverable errors

If we can clearly differentiate the nature of an error, it is much easier to handle. When an error is recoverable, a retry may succeed; otherwise failing fast is reasonable, because a retry has a large chance of failing again.

Typical recoverable errors are:

  • HTTP 5xx errors; finer-grained HTTP status codes carry more semantics and make this decision easier

  • DB connection failures, as long as they are not persistent

  • 3rd party service temporarily unavailable

  • network jitter

Everything else is unrecoverable:

  • HTTP 4xx errors, if they actually indicate an invalid request

  • anything we cannot recover from or resolve ourselves

For unrecoverable errors, we know a retry will not help, so our Lambda shouldn’t throw them back to the Lambda caller. Instead, we can log them.
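A minimal sketch of this split, assuming errors are tagged with a recoverable flag when they are created. AppError, isRecoverable and handleSafely are hypothetical names, and the log callback stands in for whatever logger the function uses.

```typescript
// Hypothetical error type: the code that raises the error decides whether a retry could help.
class AppError extends Error {
  constructor(message: string, public readonly recoverable: boolean) {
    super(message);
  }
}

// Checks the flag rather than the class, so plain errors count as unrecoverable.
function isRecoverable(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { recoverable?: unknown }).recoverable === true
  );
}

interface HandlerResult {
  ok: boolean;
  message?: string;
}

async function handleSafely(
  work: () => Promise<void>,
  log: (message: string) => void,
): Promise<HandlerResult> {
  try {
    await work();
    return { ok: true };
  } catch (err) {
    if (isRecoverable(err)) {
      // Rethrow: the invocation ends as a function error, so a retry can happen.
      throw err;
    }
    // Unrecoverable: log it and return normally so that no retry is triggered.
    const message = err instanceof Error ? err.message : String(err);
    log(`unrecoverable error, not retrying: ${message}`);
    return { ok: false, message };
  }
}
```

Treating untagged errors as unrecoverable is a design choice made for this sketch; the opposite default is equally possible if most unknown failures are expected to be transient.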

retry in a smaller scope

Most of the time, a retry at the Lambda level is overkill. It needs to reconstruct the request, either by pushing an SQS message to the queue or by scheduling the invocation to run again after a period of time. If the Lambda function is not designed to be idempotent, the retry may cause logic problems.

The tricky part is that a Lambda function is not transactional and has nothing like the savepoints found in a transactional database.
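One common way to make a retried invocation safe is to remember which requests were already handled. The sketch below assumes each request carries a stable identifier; runOnce and the in-memory processed map are hypothetical, and the map is only a stand-in, since a real function would need a durable store shared across invocations.

```typescript
// In-memory stand-in for a durable store (assumption for illustration only).
const processed = new Map<string, string>();

async function runOnce(
  requestId: string,
  createResource: () => Promise<string>,
): Promise<string> {
  const existing = processed.get(requestId);
  if (existing !== undefined) {
    // Retry of a request we already handled: reuse the earlier result.
    return existing;
  }
  const resourceId = await createResource();
  processed.set(requestId, resourceId);
  return resourceId;
}
```

This sketch does not protect against two concurrent invocations with the same identifier; that would need an atomic check-and-set in the store.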

However, a retry at a lower level is sometimes sufficient. For example, if a function sends HTTP requests and we expect some of them to fail occasionally, we can retry only that HTTP call, since a failed attempt leaves no side effect and a later attempt has a chance to succeed. If it still fails, we report to the Lambda caller that we tried but failed.
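A scoped retry like this can be sketched as a generic helper wrapped around the single call. withRetry and its parameters are hypothetical names, and the exponential backoff between attempts is an assumption of this sketch, not something the text prescribes.

```typescript
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number,
): Promise<T> {
  if (maxAttempts < 1) {
    throw new Error("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, then 2x, 4x, ...
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed: surface the last error so the caller can report it.
  throw lastError;
}
```

The operation passed in would be just the one HTTP call, for example `withRetry(() => callEndpoint(), 3, 200)` with a hypothetical `callEndpoint`, so the rest of the function body is never re-run.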

replay vs. auto retry

The auto-retry provided by Lambda is useful when all the data in the request stays valid and does not expire after a short period of time, or when the Lambda code is re-entrant.

When those timing conditions cannot be met, a replay mechanism can be introduced to reconstruct the data where possible. For example, in raken-integrations, the command includes an access token which can expire if it is not used immediately. In the replay, we can refresh the token and keep the other data unchanged.
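The replay step could look roughly like the sketch below. The Command shape, prepareReplay and the refreshToken callback are hypothetical; the real command in raken-integrations is not shown here, so the field names are assumptions.

```typescript
// Hypothetical command shape: a token with an expiry plus the data to replay.
interface Command {
  accessToken: string;
  tokenExpiresAt: number; // epoch milliseconds
  payload: string;
}

interface FreshToken {
  accessToken: string;
  tokenExpiresAt: number;
}

async function prepareReplay(
  command: Command,
  refreshToken: () => Promise<FreshToken>,
  now: number,
): Promise<Command> {
  if (command.tokenExpiresAt > now) {
    // Token still valid: replay the command as it is.
    return command;
  }
  const fresh = await refreshToken();
  // Replace only the token fields; everything else stays unchanged.
  return {
    ...command,
    accessToken: fresh.accessToken,
    tokenExpiresAt: fresh.tokenExpiresAt,
  };
}
```

Passing the current time in as a parameter keeps the function easy to test; a caller would typically pass `Date.now()`.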

transaction compensation

Sometimes we cannot prevent an error from happening, for example contention and concurrency when uploading multiple files into one folder of a cloud storage service. Normally the 3rd-party service responds with a 400; we can catch that response and, when the file already exists, update the file version instead.
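A rough sketch of that compensation, written against a made-up client interface. StorageClient, uploadNewVersion, uploadOrUpdate and the "already exists" message text are all assumptions; a real storage API will have its own methods and error format.

```typescript
// Hypothetical client interface for a 3rd-party storage service.
interface StorageClient {
  upload(folder: string, name: string, body: string): Promise<{ status: number; message?: string }>;
  uploadNewVersion(folder: string, name: string, body: string): Promise<{ status: number }>;
}

async function uploadOrUpdate(
  client: StorageClient,
  folder: string,
  name: string,
  body: string,
): Promise<"created" | "updated"> {
  const res = await client.upload(folder, name, body);
  if (res.status < 300) return "created";

  // Assumed conflict signal: a 400 whose message says the file already exists.
  const alreadyExists =
    res.status === 400 && (res.message ?? "").indexOf("already exists") !== -1;
  if (alreadyExists) {
    // Compensate: store the content as a new version of the existing file.
    const updated = await client.uploadNewVersion(folder, name, body);
    if (updated.status < 300) return "updated";
    throw new Error(`version update failed with status ${updated.status}`);
  }
  throw new Error(`upload failed with status ${res.status}`);
}
```

Any other 400 still surfaces as an error, so only the known conflict case is compensated.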