I push, therefore I am: 2 Days at Etsy
Netflix and Etsy ran a brief "developer exchange program" in 2013. Etsy sent one of their best to hang out at Netflix for 2 days. And a few months later, Netflix sent... me.
This was a writeup of the experience that I shared with my team at Netflix and the people I interacted with at Etsy. It was fun to write, I didn't mention anything sensitive, and much of it is still relevant in today's world, so I figured I'd share it publicly.
Surprisingly, this was briefly at #4 on Hacker News when I made it public in 2020; here's the thread.
In October 2013, I spent two days at Etsy’s offices in Brooklyn for the second half of the Netflix/Etsy developer exchange.
I’ve chosen to summarize my experience as a sequence of 8 maxims, so let’s dive right in.
1. Infinite Capacity is Infinite Chaos
While at Etsy I spoke a lot about Netflix’s infamous “bad instance problem” and both its real and theoretical causes, such as “some boxes have a different CPU type”, “noisy neighbor”, “new AWS firewall bug”, “old AWS firewall bug which isn’t fixed in this region yet”, and so on.
A lot of the Etsy engineers seemed to find this fascinating. I think the main reason for this is that they run in a data center and control their own hardware, so they do not have this problem. One of the more interesting manifestations of this is that Etsy does not have per-node granularity in their application metrics. Netflix would die without this, but Etsy doesn’t need it - if an instance goes bad, per-node system-level metrics (such as load average or system CPU usage) fire alerts, and the problematic instance is taken out of service.
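To make the idea concrete, here is a minimal sketch of alerting on per-node system-level metrics and pulling the offending node out of rotation. It is not Etsy's actual tooling: the function names, the sample layout, and both thresholds are assumptions chosen for illustration.

```python
# Hypothetical sketch: flag nodes using system-level metrics only.
# The thresholds and the shape of `samples` are illustrative assumptions.

def find_bad_nodes(samples, max_load=8.0, max_system_cpu_pct=40.0):
    """Return the sorted hosts whose load average or system CPU exceeds a threshold."""
    return sorted(
        host
        for host, metrics in samples.items()
        if metrics["load_avg"] > max_load
        or metrics["system_cpu_pct"] > max_system_cpu_pct
    )


def remove_from_service(in_service, bad_hosts):
    """Return the in-service list with the bad hosts taken out."""
    bad = set(bad_hosts)
    return [host for host in in_service if host not in bad]


samples = {
    "web1": {"load_avg": 1.2, "system_cpu_pct": 5.0},
    "web2": {"load_avg": 30.0, "system_cpu_pct": 4.0},
    "web3": {"load_avg": 0.9, "system_cpu_pct": 85.0},
}
bad = find_bad_nodes(samples)
remaining = remove_from_service(["web1", "web2", "web3"], bad)
```

Note that nothing here looks at application metrics per node; the system-level signal alone decides which instance leaves service.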
At Netflix it is well publicized that we have chosen total virtualization and infinite scale via AWS or “the cloud”. I find what isn’t well publicized is the infinite cost that comes along with this. By “infinite” I mean that as you continue to scale, you must continue technically investing in availability in order to maintain your current uptime. If you stop working on it, or allow the work to slow to a pace lower than your growth, your service will become notably worse to customers.
Since we cannot solve the problem of volatile nodes and networks, we instead focus on becoming more skilled and efficient at adapting to it, and hope that the rate of increase in chaos that we can manage provides sufficient capacity to continue growing our business.
2. Our Metrics System is a Map, Not the Territory. Also, Sometimes it Breaks
Shortly after I arrived, Aaron Gardner (my Etsy host) brought up a metrics dashboard which showed empty boxes where the graphs should have been. It turns out that Etsy’s metrics system was broken for a few hours, and while it did cause a lot of discussion and a delay in pushes until it was fixed, the sky did not fall.
I generally do not enjoy the pain of others, but in this case it was comforting, since at Netflix we have our own metrics dashboard (Atlas) which also sometimes shows us empty boxes instead of graphs. What is interesting here is not that the metrics system breaks (everything breaks), but how people react when it does. At Etsy this caused a “push hold”, which halts all pushes until the issue is resolved, as without metrics you can’t tell if your push is causing issues or not. The emotional reaction of the engineers seemed to be a mixture of curiosity, comic relief, and angst, the latter likely due to the tension of having unpushed changes. This seemed like a healthy mix of emotions to me.
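A push hold of this kind could be approximated by checking how stale the newest datapoint is before allowing a push. The sketch below is only an assumption about how such a gate might look; the function name, the five-minute limit, and the messages are hypothetical and not taken from Etsy's tooling.

```python
import time

# Hypothetical sketch: hold pushes when the metrics system looks stale.
# `max_age_seconds` is an illustrative assumption, not a known Etsy setting.

def check_push_hold(last_datapoint_ts, now=None, max_age_seconds=300):
    """Return (hold, reason); hold is True when the newest datapoint is too old."""
    now = time.time() if now is None else now
    age = now - last_datapoint_ts
    if age > max_age_seconds:
        return True, f"metrics are {int(age)}s old; holding pushes"
    return False, "metrics are fresh; pushes may proceed"


hold, reason = check_push_hold(last_datapoint_ts=1000, now=2000)
```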
I can’t remember the last full-scale Atlas outage at Netflix, but I am almost positive that teams were still pushing code to Prod while it was happening. There may have been a “hey, you may want to wait on your pushes” email sent out by the monitoring team, but as always teams at Netflix are free to do what they want. I won’t assign a positive or negative to this since this is just the way Netflix operates.
One healthy thing about a metrics outage is that it influences you to:
- Distrust metrics that don’t fit your current perception of Prod reality
- Find alternative data sources to observe what is happening in Prod
You can then build a habit of using these alternative data sources to either corroborate or disprove what your metrics system is showing you.
For particularly bizarre incidents in Prod it is not uncommon to see Netflix engineers cross-referencing Atlas metrics with Apache access logs, which, although painful, is a solid operational practice.
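As a rough sketch of that kind of cross-check, the code below counts 5xx responses per minute from access log lines and lists the minutes where the count disagrees with what a metrics system reported. It assumes log lines in Apache's common/combined layout, and the `metric_counts` dictionary stands in for whatever the metrics system returns; the function names are hypothetical and no Atlas API is implied.

```python
import re
from collections import Counter

# Matches the bracketed timestamp, the quoted request, and the status code
# of an Apache common/combined log line.
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3})')


def errors_per_minute(log_lines):
    """Count 5xx responses per minute, keyed like '10/Oct/2013:13:55'."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("status").startswith("5"):
            # The first 17 characters of the timestamp are dd/Mon/yyyy:HH:MM.
            counts[match.group("ts")[:17]] += 1
    return counts


def disagreements(log_counts, metric_counts, tolerance=0):
    """Return {minute: (from_logs, from_metrics)} where the two sources differ."""
    result = {}
    for minute in sorted(set(log_counts) | set(metric_counts)):
        from_logs = log_counts.get(minute, 0)
        from_metrics = metric_counts.get(minute, 0)
        if abs(from_logs - from_metrics) > tolerance:
            result[minute] = (from_logs, from_metrics)
    return result


lines = [
    '10.0.0.1 - - [10/Oct/2013:13:55:36 -0700] "GET /a HTTP/1.1" 500 12',
    '10.0.0.2 - - [10/Oct/2013:13:55:59 -0700] "GET /b HTTP/1.1" 503 12',
    '10.0.0.3 - - [10/Oct/2013:13:56:01 -0700] "GET /c HTTP/1.1" 200 12',
]
from_logs = errors_per_minute(lines)
mismatch = disagreements(from_logs, {})
```

An empty result from `disagreements` means the logs corroborate the metrics; anything else is a minute worth a closer look.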
3. Tests are Great, But Understanding What is Happening in Prod is Better
While I am very much pro testing, if I were forced to choose between having a test suite with 100% coverage, or a highly functional Production monitoring system (without the ability to choose “both” in any form), I would choose the latter. I believe all engineers at both Netflix and Etsy would make the same choice.
I think this illustrates that the biggest problem we face is maintaining an understanding of what is happening in our system, and that while tests play a part in this, they don’t completely solve the problem.
One thing that Netflix and Etsy have in common is that the surface area of Prod functionality is too large to maintain 100% test coverage. We strategically choose the areas that require the most automated testing and pursue them, while maintaining looser standards for other pieces of functionality.
4. Make Pushes Easy, then Maybe Make Them Safe Also
Both Netflix and Etsy share a philosophy that the ability to rollback quickly is more important than having 100% confidence in a push, and that pushing more often helps to decrease the risk of each push. What was interesting to me about Etsy’s pushes was not the technical details (which I could eloquently summarize as “something with rsync, or git, or something”), but instead how so much of their tooling was built to automate the social workflow of pushing. Coordinating who has changes that need to go to Prod, which changes can be bundled together, and the state of these changes in terms of coding, testing, etc. all must be done by humans, and Etsy’s pushbot provides a common mechanism to do this.
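To illustrate what automating the social workflow might mean, here is a toy model of a shared push: people join with their changes, report when they have tested, and the push goes out only when nobody is outstanding. This is a guess at the general shape of such a tool, not a description of Etsy's pushbot; the class, its methods, and the state names are all hypothetical.

```python
# Hypothetical sketch of a shared push that tracks each participant's state.

class PushTrain:
    def __init__(self):
        self.states = {}  # engineer -> "joined" or "tested"

    def join(self, engineer):
        """Add an engineer whose changes should ride along with this push."""
        self.states[engineer] = "joined"

    def mark_tested(self, engineer):
        """Record that an engineer has verified their changes."""
        if engineer not in self.states:
            raise KeyError(f"{engineer} has not joined this push")
        self.states[engineer] = "tested"

    def waiting_on(self):
        """Return the engineers who have joined but not yet tested."""
        return sorted(e for e, s in self.states.items() if s != "tested")

    def ready_to_push(self):
        """A push is ready when someone has joined and nobody is outstanding."""
        return bool(self.states) and not self.waiting_on()


train = PushTrain()
train.join("alice")
train.join("bob")
train.mark_tested("alice")
still_waiting = train.waiting_on()
train.mark_tested("bob")
```

The useful part is `waiting_on`: it answers the question that otherwise requires walking from desk to desk.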
We are not so organized at Netflix. While the push process has been immensely improved over the past few years, I still see cases where our push lead has to walk from cube to cube to get status of JIRAs and performance fixes before the push can go out.
Etsy does not run canaries, which was surprising to me, coming from the Netflix Edge team where we are constantly obsessed with the latest canary runs and results. My understanding of Etsy’s lack of canaries is that the performance of their application does not vary greatly between pushes, so there isn’t a need to watch it so closely. At Netflix, where any of our dependency jars can be a trojan horse for a memory leak or CPU killer, we have a more cautious and paranoid approach, which I think is justified.
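For readers unfamiliar with canaries, the core idea can be sketched as comparing a metric from instances running the new build against the same metric from instances running the current build. The code below is a deliberately simple illustration, not Netflix's canary analysis: the function name, the use of a plain mean, and the 1.2 ratio are assumptions.

```python
from statistics import mean

# Hypothetical sketch: flag a canary whose average metric (for example,
# latency or CPU) is noticeably worse than the baseline's.

def canary_regressed(baseline, canary, max_ratio=1.2):
    """Return True when the canary's mean exceeds the baseline's by max_ratio."""
    if not baseline or not canary:
        raise ValueError("both sample lists must be non-empty")
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        # Any positive canary value is a regression from a zero baseline.
        return mean(canary) > 0
    return mean(canary) / baseline_mean > max_ratio


ok = canary_regressed([100, 100], [110, 110])
bad = canary_regressed([100, 100], [150, 150])
```

A real analysis would look at many metrics and account for noise; a single mean ratio is only enough to show the shape of the comparison.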
5. Designing a Platform is Designing an Organization
An ex-colleague of mine taught me the phrase “you ship your org chart”, meaning that the structure of your system in Production usually matches the structure of your organization. In my experience this is true.
At Netflix we have many teams which create many jar files which are loaded into our massive Edge web application which contains all discovery-related functionality at Netflix. Etsy has a smaller number of teams working mostly on one large PHP web application. The simplicity of this greatly aids cross-team communication, as everyone is working in the same tree and all diffs can be derived from a common base.
On the Netflix Edge team, we frequently talk about the social implications of our software design because of where we live in the organization: right in between the UI teams and the mid-tier services. I think our goal is to create a platform upon which these 2 groups can work and push together with a rhythm similar to Etsy’s.
6. Maybe PHP Doesn’t Suck
Prior to my Etsy trip I had written about 10 lines of PHP in my life, and all of them were terrible. At Netflix we are mostly Java developers who lean towards mocking non-JVM languages such as PHP.
But there is something to be said for the fact that I was able to sit down and hack together a small feature in PHP using just SSH, vi, grep, and the patient guidance of a PHP expert. A part of this was the amazing per-developer VM setup that Etsy has, but the other piece was undoubtedly the fact that I could hack my PHP changes in vi and immediately observe them running in my VM.
In comparison, getting started running the Netflix Edge application takes days. It is a horribly painful rite of passage, and if you ever go a period of time without running it, it will take you at least a day to get it running again. I haven’t run it in months and every time I think about it I get a headache and start sweating.
The ease of running and modifying the Etsy webapp is something that weighs heavy on my mind as we continue to move towards the next generation of the Edge application. We are not going to write it in PHP, but I do have a new respect for the language and the way in which Etsy runs it.
7. API.Next is One Style of Private API, and We Will See Other Styles Developing Soon
Paul Wright from Etsy showed me some details of what I believe was called the “Bespoke” API, which could be loosely described as a PHP version of API.Next. The primary difference is that the calls from the top-level script to Etsy’s “service layer” are network calls across their data center LAN, as the initial node serving the script sends multiple smaller requests to peers in its own cluster and assembles the responses. So while it is written in PHP and uses processes for concurrency instead of threads, the primary concept of doing more work in fewer API calls is identical to Netflix’s approach.
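The shared concept, one client-facing call that fans out to several smaller calls and assembles the results, can be sketched as follows. This is an illustration only: the three `fetch_*` functions are stubs standing in for network calls, their names and return shapes are invented, and the sketch uses Python threads rather than the PHP processes described above.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub "service layer" calls; in a real system each would be a network request.
def fetch_listing(listing_id):
    return {"id": listing_id, "title": f"listing {listing_id}"}


def fetch_shop(shop_id):
    return {"id": shop_id, "name": f"shop {shop_id}"}


def fetch_reviews(listing_id):
    return [{"listing_id": listing_id, "stars": 5}]


def listing_page(listing_id, shop_id):
    """Serve one API call by running the smaller calls concurrently."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        listing = pool.submit(fetch_listing, listing_id)
        shop = pool.submit(fetch_shop, shop_id)
        reviews = pool.submit(fetch_reviews, listing_id)
        # result() blocks until each sub-request has finished.
        return {
            "listing": listing.result(),
            "shop": shop.result(),
            "reviews": reviews.result(),
        }


page = listing_page(listing_id=42, shop_id=7)
```

The device makes a single request and receives everything it needs for the screen, while the fan-out stays on the fast internal network.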
It was really interesting to see Etsy experiencing the same issues as Netflix in serving multiple devices and UI styles with the same API, and attempting to solve the issue with a similar conceptual approach but with a totally different technology stack. I think we will see a lot of new ideas in this space over the next few years as more companies feel the need to adjust their API designs due to device coverage, bandwidth constraints, UI innovation, and so on.
8. Always Keep in Mind What You Want to Excel At
In my early days at Netflix an engineer stated that “we optimize for organizational speed at the cost of stability”. He didn’t say this with a negative tone; it was simply presented as an indifferent description of the way we operate. I think it is important that engineering organizations identify a primary thing that they want to excel at. It affects the day to day mindset of the engineers, gives context to help with decisions, and provides guidance for what types of engineers we want to hire.
At both Netflix and Etsy the primary focus appears identical: moving as fast as possible. There are many different ways to achieve this. For example, you can push code continuously throughout the day (as Etsy does), or push only at preferred nonpeak push times (as we do at Netflix). You can run one big webapp (Etsy), or dozens of separate services (Netflix). In the end it is the speed that matters.
I must admit that sometimes it feels like we are moving too fast, but I have worked in environments that moved too slow, and trust me, that feels much worse. I think in general we as engineers would rather live a dangerous 5 years in the wild as opposed to a calm 25 years in captivity.
So while I cannot say for certain that Netflix is doing things the right way, I can say that we share a great deal of philosophical similarities with Etsy both at an organizational and personal level. That leads me to believe that we are both moving in the right direction. I had a great time during my brief visit to Etsy (especially the lunch on Thursday which put ours to shame), and while I’m sure I missed a lot of details, hopefully my observations were interesting or at least entertaining to both groups. Thanks for reading, and thanks in particular to all of the engineers at Etsy who took the time to talk with me.
Matt Hawthorne, Netflix Edge & Playback Platform Team, 2013