Erlang Factory in the San Francisco Bay Area starts in two days, and thanks to Erlang Solutions I had another great opportunity: interviewing Rick Reed, a software engineer at WhatsApp, the famous messaging platform installed on the majority of smartphones out there.
My interview is technical and focuses on the use of Erlang inside WhatsApp. I hope you will enjoy it. WhatsApp is facing huge scalability challenges, and it seems they are overcoming them.
Before starting, I also have to thank a friend of mine, Loris Fichera, who is a passionate Erlang enthusiast and helped me write this interview.
Mirko: Hi Rick, thank you so much for this opportunity and welcome to my blog. Please introduce yourself to our readers.
Rick: It’s my pleasure, and thank you for the opportunity to talk about our technology.
I joined the server engineering team at WhatsApp in mid-2011 after a long stint at Yahoo! where I worked on the software platform team, and before that, I was at SGI working on digital video systems. My background is primarily C and Unix, especially performance and scalability. I received my BS and MS in Electrical Engineering from Stanford.
Mirko: WhatsApp was founded in 2009, and you joined in 2011. Why did you decide to leave your previous job and jump into this adventure?
Rick: After 12 years at Yahoo!, I was ready for a change. My first software job was with a startup, and I have fond memories of the fun, fast-paced environment. Yahoo! had a lot of the same feel when I started there. I longed to get back to that. I knew Jan and Brian from Yahoo!, and when they showed me what they were up to and how successful the product already was, it seemed like a great opportunity working with a great group of people on a disruptive, consumer-facing application. And real-time communication has been one of my favorite areas to work on.
Mirko: In your talk at last year’s Erlang Factory, you said you didn’t know Erlang when you joined WhatsApp. What was the state of the WhatsApp architecture at that time? I mean, was Erlang already in use in the company’s projects, or was the WhatsApp tech team planning to give it a try?
Rick: Erlang was already the main implementation technology when I started. That was a source of some anxiety for me since it had been quite a while since I’d had to change programming languages, but it was also an intriguing challenge after 27 years using imperative languages.
Mirko: If WhatsApp wasn’t built from scratch with Erlang, what were the previous technologies involved in the architecture? When did you decide to introduce Erlang? What were your doubts and what were the alternatives?
Rick: When we initially launched our chat capability, the server side was based on ejabberd. It’s since been completely rewritten, but that was the initial step in the Erlang direction. This all predates my time at the company, but I think the experience with the scalability, reliability, and operability of Erlang in that initial use case led to broader and broader use.
I came in with a healthy amount of skepticism since most of my previous high-performance experience was with C/C++, but after we worked through some of our bottlenecks (as described in my talk at last year’s Erlang Factory SF), I came to realize that Erlang was a great fit for what we were doing. We were achieving scalability goals on our hosts that we only dreamed about at Yahoo!.
Mirko: I saw in your talk at last year’s Erlang Factory that you set scalability goals from time to time and try to remove every bottleneck to reach them. How do you set a “goal”? Is it based on accurate forecasts, or on a threshold you think is safe?
Rick: Actually, our growing user base really sets the goals for us. Our system usage has been growing rapidly as more people start using the app and existing users become more engaged, but we’re constantly working to keep our server count as low as we can while leaving enough headroom for events that create short-term spikes in usage. We analyze and optimize until we think we’ve hit the point of diminishing returns on those efforts and deploy more hardware.
Mirko: What are the next goals you are trying to achieve?
Rick: I’d still really like to get 4 million concurrent users on a single chat node, but I haven’t really been too focused on that goal. Our hardware specs have improved (we’re running dual octo-core Sandy Bridge processors now), but we’ve also added encryption and other features that have eaten into our CPU budget. We still have some burn-in to do with R16B, and I haven’t taken a look to see if there are any new bottlenecks that we can address. So far I’ve taken a host to 3.4 million but haven’t tried to redline one recently.
Mirko: I have seen huge numbers in WhatsApp’s history, but this tweet from the WhatsApp account was amazing!
How do you perform load testing at WhatsApp? What strategy do you adopt for it?
Rick: We’ve pretty much given up on synthetic load testing because we’re just not able to reproduce the kinds of load that our users can generate. We’re pretty comfortable with how our systems react to load (which ones react in a linear way and where the limits are and which ones have more troublesome tipping points), and they get challenged regularly by world events. A recent soccer match generated a 35% spike in outbound message rate right at our daily peak.
One of our primary gauges of system health is message queue length. We constantly monitor the message queue length of all the processes on a node and alert if they accumulate backlog beyond a preset threshold. If we see one or more processes falling behind, we alert on that, and that gives us a pointer to the next bottleneck we need to attack.
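To make that kind of check concrete, here is a minimal sketch of a mailbox scan. It is not WhatsApp’s actual monitoring code: the module name `mq_watch`, the function `backlogged/1`, and any threshold you pass in are assumptions made for this example. It relies only on the standard `erlang:processes/0` and `erlang:process_info/2` calls.

```erlang
-module(mq_watch).
-export([backlogged/1]).

%% Hypothetical helper: return [{Pid, QueueLen}] for every local process
%% whose mailbox holds more than Threshold messages, longest queue first.
backlogged(Threshold) ->
    Over = [{Pid, Len}
            || Pid <- erlang:processes(),
               %% process_info/2 returns 'undefined' for a process that has
               %% already exited; the pattern below silently skips those.
               {message_queue_len, Len} <-
                   [erlang:process_info(Pid, message_queue_len)],
               Len > Threshold],
    lists:reverse(lists:keysort(2, Over)).
```

Calling `mq_watch:backlogged(10000)` periodically and alerting whenever the result is non-empty would point at the processes that are falling behind. Scanning every process has a cost of its own on a very busy node, so a production monitor would likely sample or rate-limit the scan.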
Mirko: While your system was setting such a record, what were you doing? Were you anxious, or were you confident and on holiday?
Rick: We were all on vacation but monitoring the system performance from wherever we were. We had purposely overprovisioned the hardware both to ensure that our users had uninterrupted service during their festivities and so that we were able to enjoy our own holidays without spending the whole time fixing overload issues.
Our multimedia system had a rough time of it during the previous year-end holidays. I had written the replacement for that system (which we deployed starting in September), and this was the really big load test, so I was a little anxious. But again, we knew we were overprovisioned, and all the systems survived without any issues. Unfortunately, a switch failure the next day cascaded into a brief outage spoiling an otherwise perfect holiday.
Mirko: On a scale from 1 to 10, how confident are you about your system? How many times do you release to production in a week?
Rick: Well I’ll have to go with 11, of course. We usually push at least some code every day. Often, it’s multiple times a day, though in general we try to avoid pushes during our peak traffic times.
That has really been one of the things I like most about Erlang. We can be a lot more aggressive in getting fixes and features into production. Hot-loading means we can push updates without restarts or traffic shifting. Mistakes can usually be undone very quickly, again by hot-loading. And I find our systems to be much more loosely-coupled than systems I’ve worked on in the past which makes it very easy to roll changes out incrementally.
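For readers who have not seen hot-loading, the following is a minimal sketch of reloading one module on a running node. It is an illustration under stated assumptions, not WhatsApp’s deployment tooling: the module name `hot_reload` and the function `reload/1` are hypothetical, while `code:soft_purge/1` and `code:load_file/1` are the standard code server calls.

```erlang
-module(hot_reload).
-export([reload/1]).

%% Hypothetical helper: load the newest compiled version of Module from the
%% code path without restarting the node.
reload(Module) ->
    %% soft_purge/1 removes the previous "old" version only if no process is
    %% still running it; it returns false rather than killing such processes.
    case code:soft_purge(Module) of
        true  -> code:load_file(Module);      % {module, Module} | {error, Reason}
        false -> {error, old_code_in_use}
    end.
```

After a successful load, running processes switch to the new version the next time they make a fully qualified call (`Module:Function(...)`) into the module. Undoing a mistake works the same way: put the previous `.beam` file back on the code path and reload.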
Mirko: Tell me something that you love about Erlang.
Rick: Well, as you can guess, I love the operability aspects of having our code running on the emulator. Being able to modify and in some cases completely replace parts of our systems while they’re running full-tilt production load is just awesome as is connecting to a node with a remote shell and being able to interact with the running system in so many ways.
Mirko: Tell me something that you don’t like (if any) about Erlang.
Rick: I miss gdb.
Oh, and it’s a little frustrating that there are so many syntactic ways to do the same thing. I spend too much time obsessing over if vs. case vs. function head. List comprehension or tail recursion. And so on.
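The choices Rick mentions can be seen side by side in a small sketch. The module `styles` and its functions are hypothetical examples written for this post, not code from WhatsApp. The three `sign_*` functions behave identically, as do the two `squares_*` functions.

```erlang
-module(styles).
-export([sign_if/1, sign_case/1, sign_head/1, squares_lc/1, squares_rec/1]).

%% 1. An if expression: each branch is a guard.
sign_if(N) ->
    if
        N > 0 -> positive;
        N < 0 -> negative;
        true  -> zero
    end.

%% 2. A case expression with guarded clauses.
sign_case(N) ->
    case N of
        _ when N > 0 -> positive;
        _ when N < 0 -> negative;
        _            -> zero
    end.

%% 3. Guards on separate function heads.
sign_head(N) when N > 0 -> positive;
sign_head(N) when N < 0 -> negative;
sign_head(_)            -> zero.

%% Squaring a list with a list comprehension...
squares_lc(Xs) -> [X * X || X <- Xs].

%% ...or with a tail-recursive helper and an accumulator.
squares_rec(Xs) -> squares_rec(Xs, []).

squares_rec([], Acc)       -> lists:reverse(Acc);
squares_rec([X | Xs], Acc) -> squares_rec(Xs, [X * X | Acc]).
```

None of these forms is more correct than the others, which is exactly why it is easy to spend time deliberating over them.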
Mirko: How much time and effort have you invested in learning Erlang from scratch? And what advice would you give to our readers about learning the language in an effective way?
Rick: That’s hard to quantify. It took a few months to feel really comfortable. When I started, I spent a little time reading the books and looking at code … and then went off and wrote a monitoring utility in Perl because I needed to feel productive. I find I can’t really learn a new language without having a problem in front of me and just writing code. I read enough to understand the syntax and then launched into my first module. In the beginning I looked at lots of examples of code that we were already running and at code in the OTP applications.
I also think that as a programmer, you have to find your own groove with a language. I’m there now, but I can’t recall when I first felt it. I tend to revise a lot, not in terms of rewriting and refactoring lots of code, but in changing how I’ll approach problems with code. So I’m constantly looking at ways to improve how I’m using the language. On the other hand, I’m not the kind of person that has to know every feature and every syntactic or semantic capability of a language. I like to keep things simple, so I’m sure I’ll keep discovering things about Erlang for quite a while.
Mirko: Since you work for the most used messaging system on the planet, I have two non-technical questions. From your point of view, what is the best thing digital communication has introduced? And on the other hand, what do you miss from the “pen and paper” communication era?
Rick: Well, we’ve come a long way from my first experience with digital communication, email on the ARPANET. Today, I’m just a few taps on my smartphone away from instant communication with my family, friends, and coworkers, no matter where they are.
I think the downside is that because there are so many ways to communicate (phone, voicemail, email, messaging), you almost have to establish a mapping between each of the people and organizations in your life and how best to communicate with them. And if you get the mapping wrong, the mismatch between what you expect and they expect can cause all kinds of trouble.