
CloudWatch: fix test_put_metric_alarm flakiness #13851

Open
bentsku wants to merge 1 commit into main from fix-cloudwatch


@bentsku bentsku commented Feb 26, 2026

Motivation

test_put_metric_alarm has been flaky for a long while now. This is exacerbated by the fact that we run 3 versions of the test, one for each protocol CloudWatch supports (query, json and cbor).

I tracked the issue down to the way we collect SQS messages, combined with a rare race condition:

If you put the metrics at the same time as the alarm scheduler task fetches them, you might end up with an alarm state of "OK", because the task only picked up half of the metrics. This is fine: it effectively means the alarm was evaluated before all the metrics were put. When CloudWatch re-runs the alarm scheduler task, it properly triggers the alarm on the second run. But our SQS snapshot helper was only fetching the first message and not deleting it, so it stayed stuck on the "OK" notification and failed, even though the right ALARM message was in the queue, as shown by the logs:

Here is the full failing log, from this run:

This file contains only the relevant part, detailed below:
sqs-cloudwatch.txt

What the flow is:
-> PutMetrics with 2 metrics
-> Scheduler fetches the metrics, but only sees the first one (it ran at the same time as the put, this is fine)
-> Alarm state is then OK
-> Publish OK notification
-> Helper calls SQS ReceiveMessage continuously
-> Scheduler fetches the metrics, triggers the Alarm
-> Alarm state is now ALARM
-> Publish ALARM notification
-> The helper is only fetching the first message, does not delete it, and never sees the ALARM notification
-> Test fails
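
To illustrate the flow above, here is a minimal sketch of a polling loop that deletes each message it receives, so a stale OK notification cannot hide the later ALARM one. This is not the actual snapshot helper from this PR: `FakeSqsClient`, `sns_body` and `wait_for_alarm_state` are hypothetical names, the fake client only mimics the `receive_message` / `delete_message` calls in memory without modelling visibility timeouts, and the message shape (an SNS envelope whose `Message` field holds JSON with `NewStateValue`) is an assumption about how the notifications reach the queue.

```python
import json
import time


class FakeSqsClient:
    """In-memory stand-in exposing the two SQS calls used below (illustration only)."""

    def __init__(self, bodies):
        self.messages = [
            {"ReceiptHandle": f"handle-{i}", "Body": body}
            for i, body in enumerate(bodies)
        ]

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1):
        # Visibility timeouts are not modelled: an undeleted message is simply
        # handed out again on the next call.
        batch = self.messages[:MaxNumberOfMessages]
        return {"Messages": batch} if batch else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.messages = [
            m for m in self.messages if m["ReceiptHandle"] != ReceiptHandle
        ]


def sns_body(state):
    """Build an SQS body shaped like an SNS envelope wrapping an alarm notification."""
    return json.dumps({"Message": json.dumps({"NewStateValue": state})})


def wait_for_alarm_state(sqs_client, queue_url, expected_state, retries=10, delay=0.0):
    """Poll the queue, deleting every message seen, until the expected state shows up."""
    for _ in range(retries):
        response = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
        for message in response.get("Messages", []):
            notification = json.loads(json.loads(message["Body"])["Message"])
            # Delete unconditionally so a stale "OK" cannot be received again.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            if notification["NewStateValue"] == expected_state:
                return notification
        time.sleep(delay)
    raise AssertionError(f"no {expected_state} notification after {retries} attempts")


# The queue holds the early "OK" first, then the "ALARM" from the second scheduler run.
client = FakeSqsClient([sns_body("OK"), sns_body("ALARM")])
result = wait_for_alarm_state(client, "queue-url", "ALARM")
```

Without the `delete_message` call, every poll against this fake would return the same OK message and the loop would exhaust its retries, which mirrors the failure described above.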

Changes

  • update the cloudwatch snapshot helper to clean up SQS messages
  • add some comments with one case I didn't see in the snapshots
  • remove some parametrization for long-running tests; those areas are already covered by other tests, so there is no need to run all 3 protocols for them

Tests

Related

@bentsku bentsku added this to the 2026.03 milestone Feb 26, 2026
@bentsku bentsku self-assigned this Feb 26, 2026
@bentsku bentsku added labels Feb 26, 2026: aws:cloudwatch (Amazon CloudWatch); semver: patch (Non-breaking changes which can be included in patch releases); docs: skip (Pull request does not require documentation changes); notes: skip (Pull request does not have to be mentioned in the release notes)
@github-actions

Test Results - Preflight, Unit

23 070 tests   21 179 ✅  6m 9s ⏱️
     1 suites   1 891 💤
     1 files         0 ❌

Results for commit 565a93f.

@github-actions

Test Results - Alternative Providers

176 tests    39 ✅  2m 30s ⏱️
  1 suites  137 💤
  1 files      0 ❌

Results for commit 565a93f.

@github-actions

Test Results (amd64) - Acceptance

7 tests   5 ✅  3m 2s ⏱️
1 suites  2 💤
1 files    0 ❌

Results for commit 565a93f.

@github-actions

LocalStack Community integration with Pro

    2 files      2 suites   49m 9s ⏱️
1 239 tests 1 161 ✅ 78 💤 0 ❌
1 241 runs  1 161 ✅ 80 💤 0 ❌

Results for commit 565a93f.

@github-actions

Test Results (amd64) - Integration, Bootstrap

    5 files      5 suites   1h 5m 1s ⏱️
1 263 tests 1 187 ✅ 76 💤 0 ❌
1 269 runs  1 187 ✅ 82 💤 0 ❌

Results for commit 565a93f.

@bentsku bentsku marked this pull request as ready for review February 26, 2026 14:03

@pinzon pinzon left a comment


I wasn't aware of this issue. I can see it took a lot of effort to catch this.
Thank you for fixing it 👍

remove some parametrization for long running tests, those areas are already covered by other tests so no need to run all 3 protocols for them

I agree, let's not parametrize the client in long-running tests like the alarm ones.


Labels

aws:cloudwatch (Amazon CloudWatch); docs: skip (Pull request does not require documentation changes); notes: skip (Pull request does not have to be mentioned in the release notes); semver: patch (Non-breaking changes which can be included in patch releases)


2 participants