Solving the Mystery of the 350-Second Timeout in AWS Lambda with Laravel Vapor

From time to time you hit an issue that is hard to debug, and you spend a lot of time trying to figure out what's going on. Someone ran into one of these recently, the fix was unexpected, and I thought it would make a great story to share with you.

I was recently browsing the #vapor channel on the Laravel Discord and came across a very weird issue that someone was having with a timeout in AWS Lambda. No matter how the timeouts were configured, any request that took longer than 350 seconds failed.

try {
    Http::timeout(400)->get('https://example.com')->body();
} catch (\Exception $e) {
    // an exception was thrown, the timeout was reached
}

The timeout in vapor.yml was set to 600 seconds, and the API is slow: it takes between 300 and 360 seconds to respond.

Everything seemed fine when testing locally. On AWS Lambda, however, whenever the response took longer than 350 seconds, it never arrived, and the call timed out once the 400 seconds set in the code had elapsed.

The next thing I tried was to increase the memory to 8192 MB and the timeout to 900 seconds, but the issue persisted.

Maybe it was a bug in the Laravel Http client? I tried PHP's cURL functions directly.

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 400); // total request timeout: 400 seconds
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 400); // connect timeout: 400 seconds

// curl_exec() returns false on failure instead of throwing,
// so errors are checked with curl_errno() rather than try/catch
$response = curl_exec($ch);

if (curl_errno($ch)) {
    // cURL reported an error, e.g. the timeout was reached
    $error = curl_error($ch);
}
curl_close($ch);

As you might expect, no luck here either.
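
When chasing a failure like this, it helps to record how long the call actually ran and what cURL reported, because curl_exec() signals failure through its return value rather than an exception. Below is a minimal sketch of that idea; describeCurlFailure() and timedGet() are hypothetical helper names, not part of Laravel or Vapor.

```php
<?php

/**
 * Summarise a failed cURL transfer in one line (hypothetical helper).
 */
function describeCurlFailure(int $errno, string $message, float $elapsedSeconds): string
{
    // %F formats the float independently of the current locale
    return sprintf('cURL error %d after %.1F s: %s', $errno, $elapsedSeconds, $message);
}

/**
 * Run a GET request and return [$body, $failure].
 * $body is false and $failure is a summary string when the transfer fails.
 */
function timedGet(string $url, int $timeoutSeconds): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);

    $start   = microtime(true);
    $body    = curl_exec($ch); // false on failure, no exception
    $elapsed = microtime(true) - $start;

    $failure = $body === false
        ? describeCurlFailure(curl_errno($ch), curl_error($ch), $elapsed)
        : null;

    return [$body, $failure];
}
```

Calling timedGet('https://example.com', 400) and logging the second element would show whether the call failed at the configured timeout (cURL error 28 is its timeout code) or earlier for another reason.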

Then I remembered something: in this setup, Lambda reaches the internet through a NAT Gateway. I started reading the NAT Gateway documentation and found this in the troubleshooting section:

Internet connection drops after 350 seconds

Problem: Your instances can access the internet, but the connection drops after 350 seconds.

Cause: If a connection that's using a NAT gateway is idle for 350 seconds or more, the connection times out. When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).

Solution: To prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can enable TCP keepalive on the instance with a value less than 350 seconds.

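That was the crucial piece of information: the NAT Gateway drops connections after 350 seconds of idleness. While the client waits for the slow API, no data crosses the connection, so from the gateway's point of view it is idle. Once it is dropped, the response can no longer get through, and because the gateway sends no FIN, the client is not told. It simply keeps waiting until its own 400-second timeout fires.

The solution, as you have probably guessed, was to enable CURLOPT_TCP_KEEPALIVE on the cURL requests, so that keepalive probes keep some traffic flowing on the otherwise silent connection.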

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, 'https://example.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 400);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 400);
curl_setopt($ch, CURLOPT_TCP_KEEPALIVE, 1); // send TCP keepalive probes while the connection is idle

// curl_exec() returns false on failure instead of throwing
$response = curl_exec($ch);

if (curl_errno($ch)) {
    // cURL reported an error
    $error = curl_error($ch);
}
curl_close($ch);
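
Enabling CURLOPT_TCP_KEEPALIVE on its own relies on libcurl's default probe timing (documented, as far as I know, as 60 seconds for both the idle delay and the interval). If you would rather make the timing explicit, the following sketch sets it via CURLOPT_TCP_KEEPIDLE and CURLOPT_TCP_KEEPINTVL and refuses values that would not beat the 350-second limit. keepaliveCurlOptions(), fetchWithKeepalive() and the NAT_GATEWAY_IDLE_LIMIT constant are names made up for this example.

```php
<?php

// Idle limit from the NAT Gateway troubleshooting section quoted above, in seconds.
const NAT_GATEWAY_IDLE_LIMIT = 350;

/**
 * Build cURL options that send TCP keepalive probes before the idle limit.
 * Hypothetical helper, not part of Laravel or Vapor.
 */
function keepaliveCurlOptions(int $idleSeconds = 60, int $intervalSeconds = 60): array
{
    if ($idleSeconds < 1 || $intervalSeconds < 1) {
        throw new InvalidArgumentException('Keepalive timings must be positive.');
    }
    if ($idleSeconds >= NAT_GATEWAY_IDLE_LIMIT || $intervalSeconds >= NAT_GATEWAY_IDLE_LIMIT) {
        throw new InvalidArgumentException('Keepalive timings must be below 350 seconds.');
    }

    return [
        CURLOPT_TCP_KEEPALIVE => 1,                // enable keepalive probes
        CURLOPT_TCP_KEEPIDLE  => $idleSeconds,     // idle seconds before the first probe
        CURLOPT_TCP_KEEPINTVL => $intervalSeconds, // seconds between later probes
    ];
}

/**
 * Fetch a slow endpoint with keepalive enabled.
 * Returns the response body, or false on failure.
 */
function fetchWithKeepalive(string $url, int $timeoutSeconds = 400)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ] + keepaliveCurlOptions(120, 60));

    return curl_exec($ch);
}
```

With the Laravel Http client, the same array could presumably be handed to Guzzle through its curl request option, for example Http::withOptions(['curl' => keepaliveCurlOptions()]); I have not verified that variant on Vapor, so treat it as a starting point.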

And just like that, the problem disappeared. This is not specific to Laravel Vapor: it can affect any application on AWS Lambda that goes through a NAT Gateway and keeps a connection idle for 350 seconds or more.

For me, this is a reminder that sometimes you need to take a step back and look at the bigger picture to find the solution to a problem.