2020.08.11

Launching in the Dark

Intro

This is a collection of things I learned during Summer/Fall 2019 while rolling out 2 new experiences to a full customer base following successful AB tests. It was originally intended as a presentation, but I couldn't find an audience for it so I'll just drop it here.

I've fictionalized some of the details (the numbers in particular) in order to describe the situation most effectively while simplifying the math and maintaining respect for privacy. Most notably I've combined 2 launches into a single, hypothetical "launch", which makes the story much easier to tell.

This is a more detailed version of what we discussed in our LeadDev talk in March 2020.

My intention is to share my perspective - not to criticize anyone in particular. We're all doing the best we can, with the skills that we have, navigating incentives defined by people more powerful than us. All companies have strengths and weaknesses - and if we don't confront those weaknesses directly, how will we ever improve?

Know When You Need To Go Fast

...most decisions should probably be made with somewhere around 70% of the information you wish you had.

If you wait for 90%, in most cases, you’re probably being slow.

I currently work in personalization, mostly in the content discovery world - presenting videos that we predict users will want to watch.

In some companies, personalization is not loved by everyone.

But it's 2020, everyone knows that algorithms are the best way to do content discovery, right? I really can't say. But what I can say is that there will always be tension between algorithmic vs. human curated experiences. If you rely too much on algorithms, or too much on humans, you are going to have problems.

We ran AB tests to navigate around that tension. The data supporting the algorithmic experience was hard to deny, so we obtained permission to roll the experience out to all customers.

It was important to roll it out as soon as possible. There was a palpable feeling that if we waited a few weeks, some of our less data-driven colleagues would try to stop us from launching.

You Don't Have Time to Verify Everything

We had a lot of concerns about things that might break during the launch. Verifying that all of those things would not break could have taken months. We needed to start launching within 2 weeks.

What do you do if you only have 2 weeks to prepare for a launch? You identify your highest priority concerns and verify those, and only those.

What’s our expected increase in traffic?

Let’s imagine our full user base was 25 million customers.

  1. Find current RPS for test experience - let's say it was 10 requests/second
  2. We're going from 1M to 25M users, so multiply x 25 = 250 requests/second
  3. Find current total traffic - let's say it was 2000 requests/second
  4. ($projected_test_traffic - $current_test_traffic) / $current_total_traffic = expected total increase in traffic
  5. (250 - 10) / 2000 = 0.12, so we expect a 12% increase in traffic
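
If you prefer it as code, here's the same arithmetic as a quick sketch, using the fictionalized numbers above:

  # A minimal sketch of the projection above, using the fictionalized
  # numbers from this post.

  test_users = 1_000_000      # users currently in the test experience
  full_users = 25_000_000     # full customer base
  test_rps = 10               # current requests/second from the test experience
  total_rps = 2000            # current total requests/second

  scale = full_users / test_users              # 25x more users
  projected_test_rps = test_rps * scale        # 250 requests/second

  increase = (projected_test_rps - test_rps) / total_rps
  print(f"expected increase in total traffic: {increase:.0%}")   # -> 12%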

Server Capacity

Do we need more servers to handle the increase in traffic? And if we do, what would be the process for obtaining them?

We looked at CPU usage across our clusters and it looked low - hovering around 25% at peak. In my experience you don’t start seeing issues until your instances get closer to 80% CPU, so I didn’t think we’d have a problem.
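
For what it's worth, the back-of-the-envelope version of "I didn't think we'd have a problem" looks like this, assuming CPU scales roughly linearly with traffic (a simplification, but fine for a go/no-go check):

  # Back-of-the-envelope headroom check, assuming CPU scales roughly
  # linearly with request rate.

  peak_cpu = 0.25          # observed peak CPU across the clusters
  traffic_increase = 0.12  # from the earlier projection
  danger_zone = 0.80       # where problems tend to start, in my experience

  projected_cpu = peak_cpu * (1 + traffic_increase)     # ~0.28
  print(f"projected peak CPU: {projected_cpu:.0%}")
  print(f"headroom before {danger_zone:.0%}: {danger_zone - projected_cpu:.0%}")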

We were still running in the data center at the time (technically 2 data centers), so getting new hardware or VMs could be tricky. We proactively started looking into getting new VMs set up in a 3rd data center to provide additional capacity if we needed it.

Storage Capacity

We were already computing results for all users, whether they were in the test experience or not. For better or worse, we were doing more “live” computation than “pre” computation, so storage wasn’t an issue.

We also had to consider the CPU/memory/throughput/etc. capacity of our storage system. We were using Couchbase, and one of my teammates found a tool called Pillowfight for load testing it. He concluded that we had plenty of room before we approached our capacity.

Upstream Service Capacity

We weren’t significantly increasing traffic to any upstream services, but we were inserting our service in between an existing client-server relationship. This was fine as long as we didn’t increase the client’s latency to the point that timeouts increased significantly. Based on the latency numbers from the test we didn’t expect this to happen.

We Didn't Have Time to Do Things the "Right Way"

Short Term vs. Long Term

A few months back, during an earlier allocation increase, we hit our connection pool limit for an upstream service. We wanted to avoid this happening again.

Unfortunately we did not have aggregated connection pool metrics, or even aggregated request rate and latency metrics. Why not? The quick answer is because, prior to this, nobody thought it was important enough to do. The longer answer is that I have no idea, because I had only been working at this job for 6 months.

We did, however, have a /metrics endpoint that exposed a lot of metrics on a per-instance basis, including the dependency latency and request rates that we needed.

What to do here? Do I bother our singular “ops guy” to configure Prometheus to aggregate all of our metrics? That might take 2 months.

Do I learn how to configure Prometheus myself? That might take 2 weeks.

Do I hack a Python script to grab data from the /metrics endpoint and compute my own aggregates? I can probably do that in 2 hours.

Tuning the dial between short-term and long-term solutions is always a challenge. We needed to launch this thing, so I tuned it totally towards short-term for this exercise.

Aggregates on Aggregates

I hacked a metrics aggregator script and ran it during peak hours across a few different days. I now had request rate and latency for requests to ServiceB during peak hours.
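
The script was roughly this shape. The hostnames, port, and metric names below are invented for illustration (the real endpoint exposed per-instance metrics in a Prometheus-style format), but the idea is just "scrape every instance, collect the per-node numbers":

  import urllib.request

  # Hypothetical hosts and metric names; the real list was much longer.
  HOSTS = ["node01.example.com", "node02.example.com"]

  def scrape(host):
      """Fetch the raw /metrics text for one instance."""
      with urllib.request.urlopen(f"http://{host}:8080/metrics", timeout=5) as resp:
          return resp.read().decode("utf-8")

  def metric_value(text, name):
      """Pull a single value out of the exposition text."""
      for line in text.splitlines():
          if line.startswith(name):
              return float(line.split()[-1])
      return None

  rates, p99s = [], []
  for host in HOSTS:
      body = scrape(host)
      rates.append(metric_value(body, "serviceb_requests_per_second"))
      p99s.append(metric_value(body, "serviceb_latency_p99_seconds"))

  print("per-node request rates:", rates)
  print("per-node p99 latencies:", p99s)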

There are a few things to consider when looking at these aggregates:

  1. Time windows and aggregate request rates: Request rates are frequently captured within time windows - 1 minute vs. 5 minutes vs. 15 minutes. The goal here is to size your connection pools above your maximum concurrency, so I'd suggest looking at the p99 or maximum of the 5 minute request windows around peak hours.
  2. Aggregate latency: Latency stats can also have time windows, so you can make a similar choice there to request rate. But you also have a multi-aggregate situation here, where you need to decide how to aggregate your p99 latency numbers for each node. Do you look at your max p99? Your median p99? In my situation the p99 of p99s looked abnormally high (maybe a small number of boxes were GC-ing during that time?), so I rounded up my p90 p99 to the nearest second, to make the math easier (see the sketch after this list).
  3. Heterogeneous nodes and distributions: We were running in 2 data centers with notably different hardware - we had different CPU counts in each group. So I had to be careful in how I aggregated the numbers. If you have more of one group than the other, your median may just be the fastest or slowest node from the larger group, which may not be what you want. Using p90 or higher aggregates can help with this, although you should be conscious that you are choosing global configuration based on the capacity of a subset of your nodes. You could also use different configurations on different nodes, but we weren't prepared for that level of complexity.
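
To make point 2 concrete, here's the "aggregate of aggregates" decision as a tiny sketch, with made-up per-node p99 latencies:

  import statistics

  # One p99 latency (seconds) per node - invented numbers, including an
  # outlier or two to mimic the occasionally-GC-ing boxes.
  node_p99s = [0.31, 0.33, 0.35, 0.36, 0.38, 0.41, 0.44, 0.52, 0.97, 2.80]

  print(f"median of p99s: {statistics.median(node_p99s):.2f}s")
  print(f"p90 of p99s:    {statistics.quantiles(node_p99s, n=10)[-1]:.2f}s")
  print(f"max of p99s:    {max(node_p99s):.2f}s")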

Any Queueing Theory Fans Out There?

I then performed an exercise similar to the one I had done earlier for predicting our overall traffic increase.

Except this time I used the one, the only, Little’s Law.

  1. Let's say our per-node request rate to ServiceB was 10 requests/second, and our latency was 100ms
  2. We're increasing the allocation 25x, so we're increasing to 250 requests/second
  3. 100ms = 0.1s, and 250 requests/second * .1 seconds/request = 25 requests
  4. We should resize our connection pools to 25 (or higher). They were already sized above that, so we were good.
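
In code form, with the same illustrative numbers:

  # Little's Law: concurrency (L) = arrival rate (lambda) * time in system (W).

  request_rate = 10 * 25       # projected requests/second to ServiceB after the 25x increase
  latency = 0.100              # seconds per request (the 100ms from the example above)

  concurrency = request_rate * latency     # L = lambda * W
  print(f"expected concurrent requests to ServiceB: {concurrency:.0f}")   # -> 25

  # Connection pools should be sized comfortably above this number.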

We Didn’t Have Time to Ask Permission

When You Change One Thing, You Change Everything

Here are 2 diagrams showing the nature of the change we were making, traffic-wise.

BEFORE: client fetches page from PageService, then fetches each row within the page from ProgramService

AFTER: client fetches page from P13nService which personalizes an already mildly personalized page returned from PageService.

The “mild” personalization is informed by the P13n Data Store. (why this weird architecture? history.)

The client then fetches each row from ProgramService.

Although we were just inserting P13nService as an intermediary, it was invoking a different code path in PageService that involved the data store, and the rows returned were also different (sometimes different content, sometimes just sorted differently) since it was a personalized page instead of a generic one. So the call patterns to every service were changing, significantly.

Forgiveness, Not Permission

So how do you approach communication with your partners about your upcoming launch? This is one of those “forgiveness, not permission” situations, but like a lot of things, there’s a dial you can tune between the two. 100% forgiveness would be “fuck it, just launch” without giving the other team a heads up. 100% permission would be to ask the other teams if it’s ok to launch, and be ready to delay your launch and have a lot of conversations if necessary.

They may ask if you can do some load testing first, and yes, you can, theoretically, run some load tests. But that’s going to burn a lot of time, and you can’t really load test a service without also load testing its dependencies, so you need to get new versions of the dependencies stood up, and if you’re in AWS you may have to worry about budget, and it becomes a giant rabbit hole that, in my opinion, is not worth crawling into.

I gave the owners of PageService a heads up that we were going to be increasing traffic to their service, with a rough idea of the timeline. We ran into some mild latency issues between PageService and the P13n Data Store, only in one data center where the network path between the 2 was cross-region. We asked them to increase the timeout a bit. After digging I also noticed that the latency issues were primarily on a few of their nodes, and those nodes seemed to be slower with regard to some other metrics also, so maybe they needed to be rebooted or just retired. Overall, a few minor bumps, but mostly smooth.

For the owners of ProgramService, we took the 100% forgiveness route. They are on the conservative side, and I didn’t want to enter into a conversation where they would ask us to wait, for any reason. If you’re not sure if you’ll like the answer, don’t ask the question. Their service also took a lot of traffic, and I didn’t believe that our changes would affect their overall health. And, I was right. We launched and they didn’t even notice. Maybe I did them a favor - thinking about the launch probably would have stressed them out, for no reason.

You Are an Aggregate of Your Partners

To zoom out a bit, your team will always have some partner teams that you depend on, or who depend on you. But while you and your partners share some interests and goals, you don’t share all of them, and they don’t necessarily move at the same pace as you do. Your launch may be critical to you, but meaningless to them.

If you choose to ask for permission, you also choose to move at the pace of your slowest partner. In my experience, execution speed is heavily influenced by risk appetite - teams aren’t slow just because they’re slow, they’re slow because they are afraid of taking risks. So when you ask permission, you also choose to lower your risk appetite to the minimum of your partners. And across a large and diverse company, the more partners you have, the lower the minimum is likely to be.

I don’t think you can escape being an aggregate of your partners, but how you choose to interact with them allows you to choose whether the aggregate is closer to the min or the max.

There’s a weird phenomenon in big companies where people try to stop you from doing things for unclear reasons. In some cases there are genuine concerns, in others the concern is more that you are delivering more than they are, which makes them look bad. These people are not your allies. Know when to move fast so that they are forced to keep up with you. A lot of people can’t talk and run at the same time, so you will leave them behind, and move on with your day.

On a more humane note, sometimes people are just afraid. Afraid of making mistakes, afraid of criticism, afraid of looking bad in front of their boss. Maybe they’ve been like this all of their lives, or maybe they were scarred by prior trauma. A lot of us underestimate just how shitty some companies are, or just how bad some bosses can be. So when negotiating with a particularly conservative or risk-averse person, think a bit deeper about the root cause. “What can I do to make you feel more comfortable with this launch?” can be a powerful question.

Miscalculations Were Made

Your Test Users May Not Be Representative Of All Users

Earlier, when predicting our traffic increase, I multiplied current traffic by 25 since we were increasing the number of users in the test experience by a factor of 25. This was a 5x overestimate. How did I get this so wrong?

I have no definite answer, but my theory is that the usage rate of our test population was greater than that of our total population. By “usage rate”, I’m referring to the percentage of users who interact with your service on a given day. Depending on your business, that number can be very high, or very low.

Our test group consisted mostly of heavy users - people who use our service at a disproportionately higher rate than the rest of our customer base. I should have looked at min/median/p90/p99/max requests per user, across the full population, and used that instead.
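
A sketch of what that estimate might have looked like - the per-user distribution and peak-to-average ratio below are entirely made up, but they capture the point that most of the population requests far less than the heavy users in our test group:

  import statistics

  # Hypothetical requests-per-user-per-day, sampled across the FULL
  # population (lots of light or inactive users, a few heavy ones).
  requests_per_user_per_day = [0, 0, 0, 1, 1, 2, 2, 3, 5, 8, 20, 40]

  full_users = 25_000_000
  mean_requests = statistics.mean(requests_per_user_per_day)

  # Spread the day's requests over 24 hours, then apply an assumed
  # peak-to-average ratio observed from existing traffic.
  avg_rps = full_users * mean_requests / 86_400
  peak_rps = avg_rps * 2.0

  print(f"average: ~{avg_rps:.0f} rps, peak: ~{peak_rps:.0f} rps")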

The lesson here is: if your test group isn’t a random subset of your full population, extrapolating behavior isn’t straightforward.

Wouldn’t this make our entire AB test invalid? Not necessarily. There are only a few ways to increase consumption that I’m aware of (where “consumption” in this case is video views), in terms of usage distribution:

  • make people who consume a lot consume even more
  • make people who don’t consume anything consume a little
  • increase evenly across your full population

I’m oversimplifying, but my point is that an AB test that increases consumption of only your top users might be good enough, assuming it doesn’t decrease usage across the other groups, which we verified ours did not.

Different Systems, Different Usage Patterns

I also overestimated the traffic to our data store, but for a different reason. I assumed that our increase in data store traffic would be the same as our increase in overall traffic, but this was incorrect, because we did not access the data store on every request. We typically would only hit the data store once per session, so if we assumed one session per customer per day, our data store would see at most 25 million requests per day.

If every customer accessed our service within the same second, that’s 25 million requests per second (for just that one second), which would be a problem. But not every customer uses the service every day, and they definitely don’t access our service in the same second.

If only half of our customers used the service every day, and they all accessed our service in the same hour, that would be:

~12,000,000 requests / 3600 seconds/hour = 3333 requests per second.

Which is more manageable.

I won’t share our actual usage rates. But our overall traffic distribution follows a familiar peak/trough pattern. I could have calculated the percentage of users that access our service at different times of day, and used that to calculate our peak data store traffic.
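
Something like the following, with made-up usage numbers, would have gotten me much closer:

  # Peak-hour estimate for the data store: one request per session, one
  # session per active customer per day, spread across the day according
  # to an hourly usage curve. All numbers here are invented.

  customers = 25_000_000
  daily_usage_rate = 0.5       # fraction of customers who show up on a given day
  peak_hour_share = 0.10       # fraction of daily sessions in the busiest hour

  daily_sessions = customers * daily_usage_rate
  peak_rps = daily_sessions * peak_hour_share / 3600

  print(f"peak data store traffic: ~{peak_rps:.0f} requests/second")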

The lesson here is: for a given increase in user traffic, different systems will see significantly different increases in request volume.

You can’t just say “it’s a 12% increase” and apply it to all of your systems. This all seems like common sense in hindsight, but here we are.

Invest in Resiliency, Not Prevention

We’re launched, baby. Unfortunately we were in the middle of an extremely unpleasant reorg, so we weren’t able to bask in the glory of it as much as we’d hoped, but whatever. One thing that still concerned us was how our (lack of) error handling affected user experience. As I showed previously, we injected ourselves into an existing client-server relationship, and if our service’s error rates were slightly higher, that could mean slightly more customers seeing a “not available” screen.

We were particularly worried about one of those customers, by random bad luck, being an executive at our company, who would then email somebody and say: “What’s going on? A new service? Turn it off.”

We had to launch fast before someone changed their mind, but now we had to be careful not to cause any confusion or discomfort that could be used as ammunition to shut us down.

Our solution was a fallback cache. We cached non-personalized versions of pages, and served them whenever requests to PageService failed - timeouts, HTTP 5xx, whatever. We had a cache entry per page type (of which there were only a few), a TTL for each entry after which we would refresh it upon the next request (a stale-while-revalidate style thing), and logic for removing personalized content before caching, so that we would only cache content that could be served to all devices.
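
In spirit, the fallback logic looked something like this sketch. The function names (fetch_page, strip_personalization) are hypothetical stand-ins, not our actual code:

  import time

  TTL_SECONDS = 300
  _fallbacks = {}   # page_type -> (expires_at, generic_page)

  def fetch_page(page_type, user_id):
      """Placeholder for the real call to PageService."""
      raise NotImplementedError

  def strip_personalization(page):
      """Placeholder: remove user-specific rows so the page can be served to anyone."""
      return page

  def get_page(page_type, user_id):
      try:
          page = fetch_page(page_type, user_id)
      except Exception:
          # PageService timed out or errored: serve the cached generic page
          # instead of a "not available" screen, if we have one.
          cached = _fallbacks.get(page_type)
          if cached:
              return cached[1]
          raise

      # Refresh the fallback entry if it has expired (stale-while-revalidate-ish),
      # caching only content that is safe to serve to any customer.
      expires_at, _ = _fallbacks.get(page_type, (0.0, None))
      if time.time() > expires_at:
          _fallbacks[page_type] = (time.time() + TTL_SECONDS, strip_personalization(page))

      return page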

It took longer to implement than I’d hoped, but after we rolled it out we immediately started serving a few thousand fallbacks per day. This was a very small percentage of our total request volume, but if you frame it as a few thousand customers who could see a lower-quality page rather than no page at all, it’s worth it.

Next Time, Go Faster

"If you have a 10-year plan of how to get [somewhere], you should ask: Why can’t you do this in 6 months?"

The launch process took 6 weeks, which is fast for us. We had zero technical problems. We did briefly roll back once, but it was due to an emotional outburst from another part of the company, and not a legitimate technical reason. We re-rolled forward as soon as possible.

If you don’t see any errors, then you’re probably moving too slowly. I had to concede to my manager that we probably could have launched even faster. She was kind enough not to say “I told you so”. Managers sometimes push for speed because they are impatient, or don’t appreciate all of the technical details, or sometimes just because they want to say “it’s done” in a meeting. But one thing I’ve learned from managers is that asking “why can’t we launch this tomorrow?” is a useful exercise, particularly for prioritizing critical needs vs. nice-to-haves that just increase comfort. It’s also enjoyable to ask this of your more cautious teammates and watch their heads explode.

I should note that while one of my overall messages is “go faster”, it can only be done effectively if you also invest in building fallbacks, canaries, and intelligent alerting. If going faster means you frequently get paged in the middle of the night to fix things, or have to manually run a bunch of jobs at 7am or 10pm every day/night, that is not humane. If your manager is pressuring you to go faster without allowing you to build these additional safeguards, ignore them and build what is necessary to maintain your health and sanity.

Like the opening quote from Bezos said, you’re never going to have all of the information you want. And I find that the more I learn, the more information I want, so over time my information deficit is getting worse. But I believe that decision making is the most important skill in engineering, and learning to make decisions with limited information is an invaluable part of it. Ask questions, move forward, adjust, repeat. This story was an exhilarating example of that, and I’m proud and grateful to have been able to participate. Thanks to all of my teammates who helped with the launch, and thanks to you for reading.

Now go launch something. I believe in you.