Ruminations on the end-to-end argument

Posted on August 13, 2018

I spend a lot of my time reading papers and over the years a couple have always stuck out to me. For this post I wanted to write about an oldie but goodie: Saltzer, Reed, and Clark’s “End-To-End Arguments in System Design.”

The end-to-end argument posits that a function shared by cooperating components can be completely and correctly implemented only by the components themselves, not by any middleware sitting between them. The canonical example is reliable delivery between network-separated applications: while network protocols can ensure packets flow through the network intact, only the application can verify complete, end-to-end correctness. If there is an error buffering to or from the socket, if memory is corrupted, or if the application itself is buggy, the fact that the packet made its way through the network does not help; the data must still be re-sent.
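To make that concrete, here is a minimal sketch of an application-level, end-to-end integrity check in Scala. It is my own illustration, not from the paper, and the names are made up.

```scala
import java.security.MessageDigest

// Compute a digest over the bytes the application actually holds.
def sha256(bytes: Array[Byte]): Seq[Byte] =
  MessageDigest.getInstance("SHA-256").digest(bytes).toSeq

// The sender transmits the payload alongside its digest; the receiver
// recomputes the digest over the bytes it ended up with. Socket buffering
// bugs, memory corruption, or application errors all surface here,
// regardless of what the network layer guaranteed in transit.
def verifiedEndToEnd(received: Array[Byte], expectedDigest: Seq[Byte]): Boolean =
  sha256(received) == expectedDigest
```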

This is not to say that reliable delivery implemented at the network layer is useless (see: TCP), nor does it mean that every function between components must be implemented end-to-end. The crux of the end-to-end argument is that a function can be implemented with complete correctness only end-to-end; any implementation in the middleware is necessarily incomplete and can serve at most as a potential enhancement.

As another example, in a voicemail setting, receiving the voice data intact is what matters. Without a protocol like TCP, the application would need to handle ordered, error-free delivery itself on top of verifying the recorded message; here TCP is a welcome enhancement.

Conversely, in a VoIP setting, latency is key. A protocol like TCP would be inappropriate, as delays caused by retransmissions would be unacceptable. Instead, VoIP is often sent over an unreliable protocol like UDP, with end-to-end checking handled by the participants themselves: “Can you say that again?”

That is the gist of the paper, but I encourage anyone who hasn’t read it, or hasn’t read it in a while, to give it a read. I also recommend watching Professor Justine Sherry’s talk on the paper at PWLConf 2016.

While the paper is often cited as a systems paper, I have seen several examples of the argument in programming languages and software engineering. The rest of this post discusses some of these examples.

Reifying effects

In functional programming it is common to talk about reifying effects, where “effect” refers to a context in which a value is computed. For instance, the evaluation of a value which may or may not exist can be considered an effect. In languages like Java or Go such an effect is built into the language with the special null (or nil) value. Contrast this with languages like Haskell and OCaml, where the effect is explicitly reified into a distinct type constructor: Maybe a versus a.

We can view this through the lens of the end-to-end argument by treating the expressions that produce or consume such values as the components and the language itself as the middleware. In the null case the language provides functionality “for free,” allowing any reference to be assigned null when there is no meaningful value to assign. However, as evidenced by the now famous “Billion Dollar Mistake” talk, this provided functionality does more harm than good. Many languages have since opted for a more explicit, end-to-end way of signaling the absence of a value.
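To make the contrast concrete, here is a minimal Scala sketch; findUser is a hypothetical lookup I made up purely for illustration.

```scala
// Null-based signaling: the signature promises a String, but callers are
// never forced to handle the "no value" case.
def findUserNullable(id: Int): String =
  if (id == 42) "Alice" else null

// Reified absence: the possibility of "no value" is part of the type,
// so every caller must decide what to do with None.
def findUser(id: Int): Option[String] =
  if (id == 42) Some("Alice") else None

val greeting: String =
  findUser(7).map(name => s"Hello, $name").getOrElse("Hello, stranger")
```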

One interesting language to look at is Scala: while the language itself does have null, much of the community pretends it doesn’t and instead wraps relevant values in the Option constructor. Unfortunately, just as if reliable delivery were used for VoIP, the cost of the unnecessary middleware functionality must still be paid, as a value of type Option may itself be null.
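Here is a tiny sketch of that leftover cost:

```scala
// Because Scala runs on the JVM, an Option-typed value can itself be null;
// the platform's "free" null leaks through the reified type.
val maybeName: Option[String] = null   // compiles without complaint
// maybeName.map(_.toUpperCase)        // would throw a NullPointerException at runtime
```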

The same argument can be made for other effects, including exceptions (Either), dependency injection (Reader), and side effects (IO).
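As one small, hedged sketch of the exceptions case, failure can be reified with Either instead of thrown; parsePort is a made-up example.

```scala
import scala.util.Try

// Failure reified as a value: the error channel is part of the return type
// rather than an out-of-band exception.
def parsePort(s: String): Either[String, Int] =
  Try(s.toInt).toOption match {
    case Some(p) if p > 0 && p < 65536 => Right(p)
    case _                             => Left(s"invalid port: $s")
  }
```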

Source vs. binary dependencies

Package managers have to choose between source and binary dependencies. On one hand, source dependencies retain all the structure set up by their developers and allow users to turn on whatever flags, features, and optimizations they want. On the other hand, binary dependencies come pre-compiled and are much easier to install and use, at the cost of trusting that the published artifact was configured to par.

Unfortunately, binary dependencies can be inconvenient to depend on, especially in an ecosystem like the JVM, where classpath entries can conflict and artifacts must remain binary compatible. Innocent source changes that would work fine in a source dependency model can cause runtime errors in a binary dependency model.

Here we can treat packages as the components and the toolchain as the middleware. If the toolchain is centered around binary dependencies, customizing a dependency becomes difficult and often means not only forking the dependency, but also re-packaging and re-publishing it. Contrast this with a toolchain built for source dependencies, where the only steps needed are to fork and re-point the source location to depend on.
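As a rough sbt sketch (build definitions are Scala; the coordinates and the URL below are placeholders, not real artifacts):

```scala
// Source dependency: point at a fork and build it with your own settings;
// no re-packaging or re-publishing of artifacts required.
lazy val forkedLib = ProjectRef(uri("https://github.com/yourfork/somelib.git"), "core")

lazy val app = (project in file("."))
  // Binary dependency: consume whatever configuration the publisher baked in.
  .settings(libraryDependencies += "com.example" %% "otherlib" % "1.2.3")
  .dependsOn(forkedLib)
```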

Again, this is not to say the end-to-end argument implies source dependencies are strictly better than binary dependencies. Rather, it suggests that binary dependencies, while oftentimes convenient, are an incomplete mechanism for dependency management; a source dependency model is more complete.

To this last point, while the Nix package manager is designed around source dependencies, it also supports binary dependencies explicitly as an optimization - see Dr. Eelco Dolstra’s Ph.D. thesis, section 7.3 for more information.

Frameworks vs. libraries

For the past couple of years I have struggled to pin down what it is about “libraries” that I like and “frameworks” that I don’t, but as it turns out the end-to-end argument is applicable here too.

With enough handwaving, we can define frameworks as components that want your code structured a certain way; if you can mold your problem to fit that model, you can “just plug in” to a larger ecosystem and get functionality “for free.” The Akka project is one example of this.

Libraries, in contrast, are components that provide pieces of functionality you can pull in piecemeal without having to go all-in on an ecosystem. FS2 is an example of this.

Still, my definitions are vague, and I will likely write a dedicated blog post about this in the future. For now I can only present a heuristic I use to gauge whether something is a framework or a library: given a function that has no knowledge of the components in question, how easy is it to use that function in the context of the component? With Akka this often involves creating an actor that then interacts with the other actors. With FS2, the combinator-centric model allows us to use the function in a stream immediately.
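Here is a rough sketch of what I mean; the APIs are recalled from memory, so treat it as illustrative rather than definitive.

```scala
// A plain function with no knowledge of either library.
def double(n: Int): Int = n * 2

// FS2: the combinator-centric model lets the function drop straight in.
import fs2.Stream
val doubled: List[Int] = Stream(1, 2, 3).map(double).toList

// Akka (classic actors): using the same function means wrapping it in an
// actor and communicating through the actor protocol.
import akka.actor.{Actor, ActorSystem, Props}
class Doubler extends Actor {
  def receive: Receive = { case n: Int => sender() ! double(n) }
}
val system  = ActorSystem("example")
val doubler = system.actorOf(Props(new Doubler), "doubler")
doubler ! 21 // the reply goes to the sender; reading it requires `ask` or another actor
```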

Applying the end-to-end argument, frameworks try to encapsulate a lot of functionality in the middleware, at the cost of requiring users to mold their problem to the framework. In a library model, functionality is provided piecemeal, and it is on the user to compose the pieces to their liking. The justification for frameworks then hinges on whether the partial functionality provided is worth the cost of having to re-cast the problem, and, at times, of having to break the mold and re-implement functionality end-to-end.

For more discussion, Tim Perrett and Paul Chiusano have also written about this topic. Section 5 of the end-to-end paper also discusses many examples which ring of the frameworks vs. libraries debate.

The sidecar pattern

For a systems-y example, a common pattern that has emerged in the world of schedulers and containers is the use of sidecars. The term sidecar describes a container that runs alongside the main application container to provide additional functionality such as proxying, logging, or metrics. Oftentimes sidecars are injected automatically by the deployment system, the idea being that service owners need only concern themselves with their application.

However, because sidecars run outside of the application, any functionality they provide must be implemented with incomplete information. If the sidecar is a reverse proxy that handles retries or load balancing, perhaps retries are issued for any non-2xx status code, or load balancing simply round-robins. More sophisticated policies, such as a first-response-wins scatter-gather approach, or ones that exploit the application’s knowledge of caching or data locality, must be implemented at the application level.
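A small sketch of that difference, with made-up types purely for illustration:

```scala
final case class Response(status: Int, body: String)

// Sidecar-style policy: generic and blind to request semantics, e.g.
// "retry any non-2xx response up to N times."
def sidecarRetry(attempt: () => Response, maxRetries: Int = 2): Response = {
  var resp  = attempt()
  var tries = 0
  while (resp.status / 100 != 2 && tries < maxRetries) {
    resp = attempt()
    tries += 1
  }
  resp
}

// Application-level policy: the application knows this request is idempotent
// and that a cached copy is acceptable when the backend is unhealthy.
def fetchWithFallback(attempt: () => Response, cached: Option[Response]): Response = {
  val resp = attempt()
  if (resp.status / 100 == 2) resp
  else cached.getOrElse(sidecarRetry(attempt))
}
```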

Tip of the iceberg

The examples I’ve presented above are only four among the many I’ve run into over the past couple of years.

I encourage you, as you design or evaluate systems, to keep the end-to-end argument in mind. Rest assured that as I work on Nelson, I will too.