On Saturday 2019-02-09, we started receiving alerts about memory consumption (first incident).
After some investigation, Cindy Qi Li and I found:
- This graph (screenshot attached) which shows an apparent memory leak in Preferences starting around 2019-01-29.
- This graph (screenshot attached) which shows a similar situation for Flowmanager.
- We tracked down the change that was deployed on Jan 29 (visible as a spike in the above graphs – an artifact of a rolling update), but it only changes test data, which seems an unlikely cause of a memory leak.
- Stepan Stipl and I were independently suspicious of some recent changes that send more traffic to flowmanager and preferences, but the dates don't quite line up.
- https://issues.gpii.net/browse/GPII-3432 uses /ready and /health for Kubernetes's internal checks, but it landed around 2019-01-21, pretty early to have caused the uptick we saw around 2019-01-29.
- https://issues.gpii.net/browse/GPII-3669 and https://issues.gpii.net/browse/GPII-3687 / https://github.com/gpii-ops/gpii-infra/pull/281 made changes to uptime checks, but they landed on 2019-02-04, pretty late relative to the uptick.
Cindy proposed a couple of experiments, which I will run (since they require some tweaks to gpii-infra code):
- Spin up a dev cluster without locust tests, uptime checks, or internal health checks – no meaningful traffic to the Pods. Observe memory usage for a day or two.
- Slowly add back sources of traffic. Observe memory usage.
- If that doesn't provide enough data, compare behavior between versions of universal from before and after the critical Jan 29 date.
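As a rough illustration of what "observe memory usage for a day or two" could mean in practice, a sketch like the following flags sustained growth by fitting a line to periodic memory samples (e.g. working-set bytes scraped from cAdvisor/Prometheus). The function names, sampling interval, and threshold are all hypothetical, not gpii-infra code:

```python
# Hypothetical leak heuristic: fit a least-squares line to
# (minutes, bytes) samples and flag a sustained positive slope.
# Names and the 1 KiB/min threshold are illustrative only.

def leak_slope(samples):
    """Least-squares slope of (minute, bytes) samples, in bytes per minute."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

def looks_like_leak(samples, min_slope_bytes_per_min=1024):
    """Heuristic: steady growth above the threshold suggests a leak."""
    return leak_slope(samples) > min_slope_bytes_per_min

# Simulated day of samples every 5 minutes:
leaking = [(t, 100_000_000 + 2048 * t) for t in range(0, 1440, 5)]  # ~2 KiB/min growth
steady = [(t, 100_000_000) for t in range(0, 1440, 5)]              # flat
```

Against real data this would run over each Pod's metrics between experiment phases, so the "slowly add back sources of traffic" step has a concrete before/after comparison rather than eyeballing graphs.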
(This situation also revealed that our resource utilization alert wants some re-thinking and/or tuning: https://issues.gpii.net/browse/GPII-3718.)