Future

Streaming Audio: A Confluent podcast about Apache Kafka®

Chaos Engineering with Apache Kafka and Gremlin

The most secure clusters aren’t built on the hopes that they’ll never break. They are the clusters that are broken on purpose and with a specific goal. When organizations want to avoid systematic weaknesses, chaos engineering with Apache Kafka® is the route to go. 

Your system is only as reliable as its highest point of vulnerability. Patrick Brennan (Principal Architect) and Tammy Butow (Principal SRE) from Gremlin discuss how they do their own chaos engineering to manage and resolve high-severity incidents across the company. But why would an engineer break things when they would have to fix them? Brennan explains that finding weaknesses in the cloud environment helps Gremlin to:

  • Avoid lengthy downtime when there is an issue (not if, but when)
  • Halt lost revenue that results from service interruptions
  • Maintain customer satisfaction with their stream processing services
  • Steer clear of burnout for the SRE team 

Chaos engineering is all about experimenting with injecting failure directly into the clusters on the cloud. The key is to start with a small blast radius and then scale as needed. It is critical that SREs have a plan for failure and then practice an intense communication methodology with the development team. This plan has to be detailed and includes precise diagramming so that nothing in the chaos engineering process is an anomaly. Once the process is confirmed, SREs can automate it, and nothing about it is random. 

When something breaks or you find a vulnerability, it only helps the overall network become stronger. This becomes a way to problem-solve across engineering teams collaboratively. Chaos engineering makes it easier for SRE and development teams to do their job, and it helps the organization promote security and reliability to their customers. With Kafka, companies don’t have to wait for an issue to happen. They can make their disorder within microservices on the cloud and fix vulnerabilities before anything catastrophic happens.

EPISODE LINKS

Episode source