CloudWatch: fix test_put_metric_alarm flakiness#13851
Test Results - Preflight, Unit: 23 070 tests, 21 179 ✅, 6m 9s ⏱️. Results for commit 565a93f.
Test Results - Alternative Providers: 176 tests, 39 ✅, 2m 30s ⏱️. Results for commit 565a93f.
Test Results (amd64) - Acceptance: 7 tests, 5 ✅, 3m 2s ⏱️. Results for commit 565a93f.
LocalStack Community integration with Pro: 2 files, 2 suites, 49m 9s ⏱️. Results for commit 565a93f.
Test Results (amd64) - Integration, Bootstrap: 5 files, 5 suites, 1h 5m 1s ⏱️. Results for commit 565a93f.
pinzon
left a comment
I wasn't aware of this issue. I can see it took a lot of effort to catch this.
Thank you for fixing it 👍
remove some parametrization for long running tests, those areas are already covered by other tests so no need to run all 3 protocols for them
I agree, let's not parametrize the client in long running tests like for the alarms.
Motivation
We have been seeing test_put_metric_alarm being flaky for a long while now. This has been exacerbated by the fact that we run 3 versions of the test, one for each protocol CloudWatch supports (query, json and cbor).
I tracked the issue down to the way we collected SQS messages, combined with a rare condition that could happen:
If you put the metrics at the same time as the alarm's scheduled task fetches them, you might end up with an alarm state of "OK", because the task only picked up half the metrics. This is fine: it effectively means the alarm was evaluated before those metrics were put. When CloudWatch re-runs the alarm scheduler task, it properly triggers the alarm the second time. But our SQS snapshot helper was only fetching the first message and not deleting it, so it would stay stuck on the "OK" notification and fail, even though the right ALARM message was there, as shown by the logs:
Here is the full failing log, from this run:
This file contains only the relevant part, detailed below:
sqs-cloudwatch.txt
What the flow is:
-> PutMetricData with 2 metrics
-> Scheduler fetches the metrics, but gets only the first one (executed at the same time as the put, this is fine)
-> Alarm state is then OK
-> Publish OK notification
-> Helper calls SQS ReceiveMessage continuously
-> Scheduler fetches the metrics, triggers the Alarm
-> Alarm state is now ALARM
-> Publish ALARM notification
-> The helper is only fetching the first message, does not delete it, and never sees the ALARM notification
-> Test fails
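The flow above suggests a helper that consumes every notification it reads instead of re-reading the first one. The following is a minimal sketch of that idea, not the actual helper changed in this PR: wait_for_alarm_state and InMemoryQueue are hypothetical names, and it assumes the SQS body is an SNS envelope whose "Message" field holds the alarm JSON with a NewStateValue key. The client calls mirror boto3's receive_message and delete_message.

```python
import json
import time


def wait_for_alarm_state(sqs_client, queue_url, expected_state, timeout=30.0, poll_interval=0.0):
    """Poll the queue until a notification with ``expected_state`` arrives.

    Every received message is deleted, so a stale "OK" notification cannot
    be read again and hide the later "ALARM" one.
    """
    seen_states = []
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        response = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=1
        )
        for message in response.get("Messages", []):
            # Delete first, so this message is never handed back to the helper.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            # Assumption: SNS envelope in the body, alarm JSON in its "Message" field.
            notification = json.loads(json.loads(message["Body"])["Message"])
            seen_states.append(notification["NewStateValue"])
            if notification["NewStateValue"] == expected_state:
                return notification, seen_states
        time.sleep(poll_interval)
    raise TimeoutError(f"no {expected_state!r} notification, saw {seen_states}")


class InMemoryQueue:
    """Hypothetical stand-in for an SQS client, for illustration only.

    It keeps returning the oldest message until it is deleted, which
    mimics the situation that made the test flaky.
    """

    def __init__(self, bodies):
        self._messages = [
            {"ReceiptHandle": str(index), "Body": body}
            for index, body in enumerate(bodies)
        ]

    def receive_message(self, **kwargs):
        return {"Messages": self._messages[:1]} if self._messages else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self._messages = [
            m for m in self._messages if m["ReceiptHandle"] != ReceiptHandle
        ]


def sns_body(state):
    """Build a fake SNS-over-SQS body carrying an alarm state change."""
    return json.dumps({"Message": json.dumps({"NewStateValue": state})})


queue = InMemoryQueue([sns_body("OK"), sns_body("ALARM")])
notification, seen_states = wait_for_alarm_state(queue, "queue-url", "ALARM", timeout=2)
```

With the stand-in queue, the helper first consumes the early OK notification and then reaches the ALARM one. Without the delete_message call, the same loop would keep reading the OK message until the timeout.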
Changes
Tests
Related