A short story about how I spent a good two hours of yesterday evening.
We use the wget utility as a health-check probe in pretty much every ECS container.
healthcheck = {
  command = ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:8000/health || exit 1"]
  timeout = 30
  interval = 10
  retries = 3
  startPeriod = 10
}
This is very handy, as wget is available almost everywhere. It even exists in the Google-provided Vertex AI predictor images:
docker pull us-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server:20240609_1425
So I just reused that piece of container configuration. Annoyingly, ECS kept marking the containers as unhealthy soon after start, despite all my efforts to fine-tune the health-check delay and retry parameters. Worst of all, I could clearly see the probing HTTP requests in the container logs, along with 200 OK response statuses.
As far as I remember, ECS does not show the output of the health-check command anywhere. After scratching my head for a bit, I replaced wget with curl, as the official ECS help page suggests, and it worked! The containers were healthy and I was able to finish my day.
But a nagging background thought of “why” never left me… In the shower this morning it finally clicked: a read-only filesystem! When I was playing with the Vertex predictor, I noticed that it didn’t like HEAD requests; only GET /health was good enough for it to answer with a 200 OK response code (otherwise it gave a nasty 405 Method Not Allowed), so I removed the --spider argument from wget. And what does wget do by default? Right, it saves the downloaded file to the current directory…
The container definition looks something like this:
module "vertex_container_definition" {
  source = "cloudposse/ecs-container-definition/aws"
  version = "~>0.58.1"
  essential = true
  command = ["bash", "-c", "python3 -m google3.third_party.py.cloud_ml_autoflow.prediction.launcher"]
  readonly_root_filesystem = true
  mount_points = [
    {
      sourceVolume = "tmp"
      readOnly = false
      containerPath = "/tmp"
    }
  ]
}
wget was sending the requests and getting 200 OK back, but it wasn’t able to save the downloaded content to disk. That caused the health-check command to exit with code 1, and ECS marked the container as unhealthy.
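
If I had wanted to keep wget, a likely fix would have been to discard the response body instead of letting wget write it into the working directory. Below is a minimal sketch of both variants, not something I ran in that exact container. It assumes GNU wget (BusyBox builds may only accept the short -O form) and the same localhost:8000/health endpoint as above; the function names probe_wget and probe_curl are made up for illustration.

```shell
# Sketch only: probe_wget and probe_curl are hypothetical helper names.

# GET the URL with wget and throw the body away, so nothing is written
# into the current (possibly read-only) directory.
probe_wget() {
  wget --no-verbose --tries=1 --output-document=/dev/null "$1"
}

# Same probe with curl; --fail turns HTTP errors (4xx/5xx) into a
# non-zero exit code, and --output /dev/null discards the body.
probe_curl() {
  curl --fail --silent --show-error --output /dev/null "$1"
}

# Usage:
#   probe_wget http://localhost:8000/health || exit 1
```

In an ECS health check there are no shell functions to call, so the body of either function, followed by || exit 1, is what would go into the CMD-SHELL string.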