A Humbling Raft(ing) Trip
Yesterday was the last day of Rafting Trip, David Beazley’s week-long training course. The aim of the course is to learn, understand, discuss and implement the Raft distributed consensus algorithm. Raft is at the heart of etcd, which is arguably one of the most important pillars of Cloud Native and distributed computing.
The problem with Raft is that it is deceptively difficult to understand and harder still to implement. After reading the venerable paper, which lays out the algorithm and the problem it tries to address, readers come away feeling that they have a fairly reasonable understanding of the problem, the proposed solution and how to implement it. However, as I’ll explain below, that feeling turns out to be mostly false.
The first and hardest question is: “where do I begin?” This is a hard question with many correct answers because there is a great deal of implicit complexity, and many of the decisions you have to make will absolutely affect the outcome of the endeavor. Do you start from the bottom up (i.e. the networking abstraction)? And if so, how deep do you go (socket programming)? Or do you start at the API level and work through the curveballs as they come? Or do you start with the state machine? Or the log replication?
The second hardest question is: how will you verify and test it? How do you make all the pieces observable? This question turns out to drive many of your decisions and force changes to your current solution. Ultimately, it is also what helps you discover the near-unimaginable number of corner cases that hide beneath the surface, all of which need to be addressed for the algorithm to work.
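To make the observability point a little more concrete, here is a minimal sketch of one way to approach it: keep the network behind a fake that a test can drive by hand. Everything in it (the `FakeNetwork` class, its method names, the message tuple layout) is hypothetical and invented for illustration; it is not what the course prescribes, just one possible shape.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

# Hypothetical message layout: (source node id, destination node id, payload)
Message = Tuple[int, int, Any]


class FakeNetwork:
    """An in-memory stand-in for the network that a test fully controls."""

    def __init__(self) -> None:
        self.pending: Deque[Message] = deque()   # sent, not yet handled
        self.delivered: List[Message] = []       # history a test can inspect
        self.dropped: List[Message] = []         # messages the test lost on purpose

    def send(self, src: int, dst: int, payload: Any) -> None:
        # Nothing is delivered until the test explicitly says so.
        self.pending.append((src, dst, payload))

    def deliver_next(self) -> Message:
        # Hand the oldest pending message to the caller and record it.
        msg = self.pending.popleft()
        self.delivered.append(msg)
        return msg

    def drop_next(self) -> Message:
        # Simulate a lost message.
        msg = self.pending.popleft()
        self.dropped.append(msg)
        return msg

    def partition(self, node: int) -> int:
        """Drop every pending message to or from `node`; return how many."""
        keep: Deque[Message] = deque()
        count = 0
        for msg in self.pending:
            if node in (msg[0], msg[1]):
                self.dropped.append(msg)
                count += 1
            else:
                keep.append(msg)
        self.pending = keep
        return count
```

With something like this in place, a test can decide message by message what gets delivered, lost or cut off, and every decision is recorded, so a failing scenario can be replayed exactly instead of depending on timing.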
One’s first impulse could be a Yogi Berra-esque one: you see the fork in the road and you take it. Start somewhere, anywhere. The problem with this pragmatic approach is that, in doing so, you’re subconsciously making assumptions about what the other parts will do and what they’ll need, and that matters a lot as you go up or down the stack. If you’re not careful and deliberate, you’ll very likely end up in a vicious cycle where fixing one thing breaks another, and fixing that other thing breaks the whole system. This is where you think “I can avoid regressions by adding tests”, which leads you to spend an inordinate amount of time and effort building testability and observability into the system, and thus you step into the abyss of despair.
Perhaps you decide to learn more about the algorithm and the plethora of edge cases for each of the components, rank them somehow, and then decide whether to tackle the easiest or the hardest first. This could work, in principle. The problem, as I’ve alluded to earlier, is that there is a non-trivial number of corner cases that will make you regret a lot of earlier decisions.
Maybe you decide to go BDUF (Big Design Up Front) and take the time to flesh out every single possible state of every component, the interactions, the data flow, the sequence diagrams and the whole shebang. But this would probably take days, if not weeks, of work to complete, and you would still have to pray to Turing almighty that you got it completely right.
So far I’ve been using the phrase “corner cases” quite a bit. And there’s a small “eureka!” moment once it dawns on you that the corner cases are the main case. In other words, the problem that Raft tries to address is so inherently and deceptively complex that you can’t implement the “happy path” first and then code exceptions around it, because there is no happy path in distributed consensus algorithms. Raft is designed to embrace the fact that dozens of failure modes are the “normal steady state”.
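A small illustration of “the corner cases are the main case”: below is a sketch of the follower-side log rules for AppendEntries as described in the Raft paper, written as a plain function over a list. The names (`Entry`, `append_entries`) and the list-based log are my own assumptions for illustration, and term checks, commit index handling and persistence are deliberately left out. Notice that only one line does the “happy” append; everything else deals with gaps, mismatches, conflicts and duplicates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Apply new entries to a follower's log, mutating it in place.

    Indices are 1-based as in the Raft paper; prev_index == 0 means
    "the new entries start at the very beginning of the log".
    Returns False when the log does not match at prev_index.
    """
    # Corner case 1: a gap. The follower is missing entries before prev_index.
    if prev_index > len(log):
        return False

    # Corner case 2: an entry exists at prev_index but comes from another term.
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False

    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot where this entry belongs
        if pos < len(log):
            if log[pos].term != entry.term:
                # Corner case 3: a conflict. Delete the existing entry and
                # everything after it, then take the new entry.
                del log[pos:]
                log.append(entry)
            # Corner case 4: a duplicate (same term at the same index).
            # Keep the log as-is; truncating here could discard entries that
            # a delayed or retried message does not carry.
        else:
            # The "happy path": simply append.
            log.append(entry)
    return True
```

Because the function only touches a list, each of these cases can be exercised in isolation, for example by calling it with a `prev_index` beyond the end of the log and checking that it returns `False` and leaves the log untouched.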
This week was humbling and constructive in many ways. Implementing Raft is such a beast that it will quickly reveal your weaknesses and endlessly challenge your strengths as a software designer, architect and engineer. This course blew all my expectations away, in no small part because David is such a wonderful, knowledgeable, curious and entertaining instructor. If you want a non-commercial training that will leave you feeling enriched and absolutely mentally exhausted, I cannot recommend Rafting Trip enough.