How Our Proxy Setup Exposed a Critical Bug in AWS

Hi. I wrote some good stuff in Python and it still works.
As mentioned in the first part of the cycle, currently I’m running a large-scale IoT fleet on AWS Greengrass. These devices sit behind a very strict corporate firewall: every outbound flow must be accounted for, optimised and approved by security before any port/IP/hostname is opened. In practice, that means every packet counts and every config change goes through an actual review.
Planned solution (the “nice” one, by the book): route all outbound traffic through a corporate proxy stack (NLB > proxy in EC2) and let security whitelist only the proxy IP. On the Greengrass side, AWS has recipes for this - in theory, you set a few proxy variables in the merge config of your Nucleus and you’re done.
Reality #1: the docs are vague. The “just add a couple of lines to the Nucleus config” approach reads fine on paper, but in the field, you need to ensure the usual proxy envs are present everywhere: HTTP_PROXY, HTTPS_PROXY, NO_PROXY (and uppercase variants). Setting them only in the nucleus isn’t enough if components ignore the nucleus-environment.
Reality #2: most public components don’t play by the rules. They bypass the proxy and try to send traffic directly. Worse - there’s a floating Systems Manager bug where its proxy variables get reset after a Greengrass restart. So you can deploy, verify, sleep, and then a reboot turns a tidy, secure device into a leaking sieve. Or simply locks you out of your edge device as Systems Manager is hitting the firewall.
What I did (blunt solution until AWS addresses this issue):
Ensure proxy envs are explicitly injected into each public component (don’t rely solely on nucleus inheritance).
Manually create a drop-in config for the proxy variables. Greengrass won’t overwrite it on restart, so it sticks.
Recommended hack, echoed by an AWS architect: introduce a small “service component” that applies all critical settings (proxy, certs, env vars) and declare it as a dependency for every other component. That way, no other component even starts until the environment is sane.
And this is a perfect example of edge reality: the architecture diagrams look pretty, but the edge prefers surprises.
Filed the relevant bug/support case with AWS; until it’s fixed, treat Systems Manager’s public components like “cattle that occasionally forget they are behind a wall” and hard-enforce proxy config locally.
Note for myself and my brothers and sisters on the edge: always test your components against GG restarts, validate env propagation per component. It might save some night shifts.
