How do you ensure that your web app never drops a single user request (particularly when those requests can be very complex, like triggering an AI agent)?

It comes down to two simple paradigms:
1. Acknowledgements
2. Separation of concerns

Let's trace the path of a user request which triggers a complicated (perhaps LLM-backed) task:
1. A user clicks a button on your web app, which sends an HTTP request to your server.
2. Your server writes some data to the database for that request, and then creates a new task for that user.
3. Your task runs and updates the database when it's finished to indicate that the work is done.
4. The user sees the results of the task in your web app.

Overall, this seems pretty simple. Where can this go wrong?
- Your function restarts or crashes while processing the user request
- Your function stores the data for the request, but never actually invokes the task
- Your task starts, but never completes because of a hardware failure, running out of memory, or a bug in your code
- Your task completes, but its result never gets stored in the database

While some of these scenarios might be unlikely when you just have a handful of users, you'll start to see these issues more frequently as you scale.

How do you solve this?

1. Acknowledgements and retries - every time you successfully process a request or message, you acknowledge it, either to the user in the form of a 200-level response code, or to a broker in the form of an ACK. If you fail to acknowledge a message within a time interval, or you negatively acknowledge it (e.g. with a 500-level error), the caller retries the request. Even better if your functions are idempotent and you retry using exponential backoff with jitter.

2. Separation of concerns - different workloads have different runtime requirements.
Spawning a very complex, long-running task in the same process as your API handler can cause headaches down the line, in the form of out-of-memory errors, saturating shared resources like file watchers or thread pools, or causing high latency due to intensive tasks eating up your CPU. The simple solution here is to separate your API from your workers, with a task queue (like Hatchet!) that handles ACKs, retries and recovery from failure.

This also sheds some light on why using Postgres as a task queue can be so powerful: you remove certain classes of failure entirely when you enqueue a task as part of the same transaction in which you write the user data.

We're building Hatchet, an open-source background worker framework (backed by Postgres) to build reliable AI apps in Python, TypeScript, and Go.