Interview with Daniel Lee

The Erlang User Conference 2013 is really close and thanks to Erlang Solutions I have interviewed another speaker. He is Daniel Lee, Core-Platform developer at Klarna. In my opinion, his talk will be a must see; if you have any doubts please read the following interview and they will surely disappear.

Mirko: Hi Daniel, thanks for being available for this interview. Please introduce yourself to our readers.

Daniel: I have been a developer at Klarna for a bit over two years now. I grew up in Los Angeles, and did my studies at Cornell University and Carnegie Mellon University, where I was bitten by the functional programming bug and got to study with some really smart people with colorful personalities. After leaving graduate school, I spent some time in the service industry bartending in failing restaurants with pretty decent food. I attribute my sense of software development taste to the schooling and my work ethic to the bartending. In 2011, I moved to Sweden for a fantastic opportunity at Klarna and it's been a great adventure ever since. I am @dklee on twitter.

Mirko: Your talk at the Erlang User Conference 2013 is really interesting. You called it "Continuous Migration: Re-implementing the Purchase Taking Capability of a 24/7 Financial System". So you are basically re-factoring the whole "Purchase Taking Capability" with new Erlang code, following best practices and standards from the Erlang community. Why has this become a priority within Klarna? I remember a talk from David Craelius back in 2011 about Erlang in Klarna. Have you found that the previous monolithic Erlang system was viewed as technical debt?

Daniel: The "Purchase Taking Capability" is the most important function of Klarna's soft Real-Time Domain. Klarna's business model is all about increasing conversions for our merchants, so downtime there means lost business for Klarna, lost purchases for our merchants, and poor user experience for the end-consumers. Historically, availability of purchase taking capability was dependent on the availability of the legacy monolithic system. The goal of this project is to decouple that capability from the legacy monolithic system to provide a more reliable end-user experience, as well as better scalability in terms of increased purchase taking capacity. There's a line from an old Fleetwood Mac song, "Landslide" (or you can search YouTube for the Glee cover): "I've been scared of changin', because I've built my life around you." Legacy code (and the legacy business logic it implements) is a pretty painful and risky thing to change at a company like Klarna, because you know that if no customers are complaining about it, you are quite possibly making a lot of money from it. Our customers would prefer to integrate with us once at the beginning of our relationship, and never have to mess with that ever again. Breaking a legacy integration is almost always a painful and expensive customer relations problem. Much of Klarna's meteoric growth is due to herculean pushes to get new features to market quickly. The shortcuts taken for these features, or even design decisions that were entirely reasonable at the time, are much of the technical debt. Trying not to break features no one really understands anymore can really slow you down. Developers would love to throw much of this away and start fresh, but much care must be taken to minimize how this affects the merchant integrations and customer experience.

Mirko: Is extracting the "Purchase Taking Capability" only the first step in a bigger re-factoring project, or is it a single project? Are you aiming for a system of small systems which talk to each other but are more maintainable and testable?

Daniel: Klarna's legacy monolithic system worked surprisingly well when there were a small number of developers, most of whom were well-salted Erlang gurus. A single point of delivery will not scale for a development organization with dozens of product teams. It's also a pain to operate. Klarna is currently undergoing big initiatives to divide up business functionality into independent services with much more purity of purpose. This re-factoring has produced a number of independent utilities, frameworks and business libraries that are shared between multiple systems. Mats Cronqvist’s EUC talk goes into the ecosystem of new independent systems much more thoroughly.

Mirko: I like this interview because I can ask some very technical questions. What are the best practices to fight "legacy code"? I know that changing code that is making money while we are writing this interview is not as easy as it seems. Is it a simple migration from system A to system B in the hope that everything will be OK, or is it a step-by-step migration to a system which is growing its functionality day by day?

Daniel: Good taste and common sense. I personally prefer a "Ship of Theseus" approach, where big re-factorings are shipped incrementally until you have something completely new that essentially does the same thing. Each individual change can be understood much better, but the aggregate of all the changes turns a hairy mess into something remotely attractive. Shipping a huge diff in a critical part of the system is extremely dangerous. Having been involved in both sorts of changes and seen many, dropping a huge diff at once is generally a sign that something broke down somewhere in the planning or development process. And the releases are terrifying. No thanks.

We've also had much success in developing frameworks that capture the structure of the control flow of the code, and pushing the business logic into callbacks passed to the frameworks. Soapbox is a great example of a framework that was used to re-implement our XMLRPC-API into something sensible. It went from an 8000+ line of code module, to a family of callbacks describing input types and API methods. Many of the callbacks still retain legacy compatibility quirks, but things are at least organized in a much nicer way.
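(A minimal sketch of this idea, written in Python purely for illustration. It is not Klarna's code and not the Soapbox API: the `ApiFramework` class, its `register` and `call` methods, and the `reserve_amount` method are all hypothetical names. The framework owns the control flow, here method lookup, input type checking and result wrapping, while the business logic is a plain callback.)

```python
from typing import Any, Callable, Dict, Tuple


class ApiFramework:
    """Hypothetical mini-framework: owns lookup, validation and result wrapping."""

    def __init__(self) -> None:
        # method name -> (expected argument types, business-logic callback)
        self._methods: Dict[str, Tuple[Tuple[type, ...], Callable[..., Any]]] = {}

    def register(self, name: str, arg_types: Tuple[type, ...],
                 handler: Callable[..., Any]) -> None:
        self._methods[name] = (tuple(arg_types), handler)

    def call(self, name: str, *args: Any) -> Dict[str, Any]:
        if name not in self._methods:
            return {"error": "unknown_method"}
        arg_types, handler = self._methods[name]
        # The framework, not the callback, checks the declared input types.
        if len(args) != len(arg_types) or not all(
            isinstance(arg, expected) for arg, expected in zip(args, arg_types)
        ):
            return {"error": "bad_arguments"}
        return {"ok": handler(*args)}


# Business logic as a plain callback (made-up example method).
def reserve_amount(order_id: str, amount: int) -> str:
    return f"reserved {amount} for {order_id}"


api = ApiFramework()
api.register("reserve_amount", (str, int), reserve_amount)
```

With this shape, adding an API method means registering one more callback with its input types; the dispatch and validation code is written once.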

A key safety point in correctly doing this sort of migration is to avoid duplicate copies of the same logic. This involves moving shared logic from the version control of the legacy system to an independent sub-component shared between the legacy and new systems. At Klarna, this involves git, with the legacy system maintaining dependencies with git submodules and the new system using rebar for dependency management. There is some pain in managing change in such a situation, but there are also nice wins in new features and fixes going into both systems "for free".

Mirko: What is your own personal development cycle? Do you follow TDD, writing an acceptance test for each new feature and then unit tests until you get the green light, or are you less inclined to testing? If you test your Erlang code, can you tell us something about your tools and experiences?

Daniel: As a Core-Platform developer, my personal situation is a bit unique, in that I mostly work on logging and monitoring issues, build/release related issues, developing libraries and frameworks, and the occasional business logic re-factoring. My acceptance criteria are typically “make my fellow developers happy”, “make my operators happy” or “do the same thing in a more re-usable way without breaking existing behaviour”. In the last case, there are often existing regression tests in place. In most others, I prefer unit testing the new functionality. I also believe that there are many cases where problems can be sufficiently generalized into frameworks or libraries with a purity of purpose and a manifestly correct implementation. Such solutions work perfectly*, and bugs arise from using them wrong ;).

We use both EUnit and Common Test at Klarna, with a smattering of PropEr.

We are currently working on an integration testing system that does acceptance testing between the new and legacy systems. This has a number of interesting challenges related to spinning up the environment and managing the different systems, but it's still in relative infancy so it's too early to say too much that is insightful about what we've learned from doing this yet. Your friend Roberto Aloi from ESL actually started consulting with us in May and has been given a lot of responsibility with this test system. Really excited about this!

Mirko: You release code once every week (I hope I am not wrong in my interpretation of the EUC talk description). Do you think it is possible to reach a pseudo "Continuous Deployment" cycle? I am always scared of code that goes to production 5 days after I have finished writing it. What do you think about all developers going to production after an internal approval? Do you think it is possible in Klarna?

Daniel: Because Klarna sells a service, and not the software itself, a release means an immediate change in the code paths executed in customer interactions.

Klarna's legacy system releases every week. The new purchase taking system is currently in a Beta, where all XMLRPC-API traffic goes through the new system. Functionality we can handle is taken on the new system, and that which we cannot gets reverse proxied to the legacy system. This new system is not currently on any fixed release schedule and tends to release much more often than once a week. The restaurant guy in me thinks of finished, but unshipped code like food dying under a heat lamp waiting to be served. A waste of money, a displeasing delay in delivering value and eventually a health risk.
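(The Beta routing described above can be pictured roughly as follows. This is a sketch in Python under stated assumptions, not Klarna's implementation: `NEW_SYSTEM_HANDLERS`, `forward_to_legacy`, `route` and the method names are hypothetical, and the legacy call is a stand-in for a real reverse proxy.)

```python
from typing import Callable, Dict

# Hypothetical table of API methods already migrated to the new system.
NEW_SYSTEM_HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "reserve_amount": lambda params: {
        "handled_by": "new",
        "method": "reserve_amount",
    },
}


def forward_to_legacy(method: str, params: dict) -> dict:
    # Stand-in for reverse proxying the request to the legacy system.
    return {"handled_by": "legacy", "method": method}


def route(method: str, params: dict) -> dict:
    """Serve the request on the new system if it can handle it, else fall back."""
    handler = NEW_SYSTEM_HANDLERS.get(method)
    if handler is None:
        return forward_to_legacy(method, params)
    return handler(params)
```

Migrating one more piece of functionality then amounts to adding an entry to the table, which fits the incremental "Ship of Theseus" style of change.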

The real bottleneck in how often a service can be upgraded is the amount of overhead required to prepare and deploy a new release. Some of this overhead is computational (running regression tests) but the most expensive part is human: the time required by test engineers, release managers, and operators to approve and deploy an upgrade. Because of the large amount of business functionality served by the legacy system and the dozens of developers shipping to it, the fact that it releases once a week is a testament to a good deal of problem solving and the talent of several thick-skinned managers.

The new system currently has much less overhead and shorter test times, so releasing even multiple times a day is possible. Since currently this is still very much owned by developers, as opposed to operators or administrators, we're relying on our own laziness to keep the human costs of upgrading code low with aggressive automation of the mechanizable bits. The kid in me who grew up watching "The Terminator" and "The Matrix" thinks the go/no-go decision should still be made by a human, however.

In both systems, the expectation is that once a developer has written a change, the change passes the regression suites, and the diff has been signed off by a technical reviewer, it is ready to be merged in (and ship very soon). Emergency fixes are an unavoidable regular occurrence we are constantly trying to minimize, but this level of trust is an important part of our culture and necessary for the pace at which we want to deliver code changes. The literal definition of agility is the ability to change direction quickly. The most agile business is literally the one that is capable of changing the fastest.

Mirko: I am currently working in a Mobile Payment company. I know that they are two different domains, but we have to deal with purchases, subscriptions and a large number of payment methods. What are the most important Erlang features from your point of view, and what kind of advantages are they giving to Klarna? What would the main differences be if, for example, the whole Klarna system were written in a "more mainstream" language such as Java or PHP?

Daniel: The only other languages where I've done significant amounts of coding are Standard ML and a charming dependently-typed logic programming language called Twelf, so I can't make super fair comparisons with Java or PHP. The transparency of data in Erlang is rather convenient for debugging, but makes enforcing any sort of abstractions extremely difficult and you too often end up tragically married to your original representation. On the flip-side, process isolation in Erlang gives you a really lovely, easy to understand memory model, whereas in a more imperative, shared-memory language like Java that goes out the window immediately.

Due to its fault tolerance and concurrency, Erlang is a great solution for implementing the web-facing high-availability parts of a company's distributed system. If I had free rein and could start from scratch, I would use Erlang to terminate my web traffic and delegate the interesting business bits to something with a rich type system and relatively lively user-base like OCaml or Haskell. I'm sure recruiting and upper management would be thrilled at the prospect of chasing after an even more esoteric talent pool, but there are some very successful financial companies using statically typed functional programming languages out there. I'm a huge fan of compile-time checks for correctness, so for the important logical bits I think the stronger the type system you have, the more reliable your output will be.

Mirko: In your talk description on the Erlang User Conference site I read: "Core Code Grunt at Klarna". Trust me it is the first time I see the word "grunt" applied to a developer. Why do you feel like a grunt?

Daniel: I am pretty awesome and have a huge ego about it. Thinking of myself as a Code Grunt reminds me that although having good ideas is fantastic, the primary function of a software engineer is to implement solutions and ship them so business value is delivered to his company on a consistent basis. People with fancier titles typically have a lot less fun than me.

Mirko: In your previous life you have been a type theory researcher. What do you think about the Erlang Type System? Sometimes I read debates about it. A view from an ex-researcher is always appreciated.

Daniel: Erlang values occupy a very reasonable set of types, not too different from core ML. The language itself does very little to leverage those types at compile time, so in practice it is rather dynamic like a Lisp or Scheme. I think some kind of ML built on the beam with Erlang-style message passing could be rather exciting. My background makes me want to give such a thing a formally defined semantics, but then many implementation details of the beam make it rather complicated and my brain explodes.

I am a rather big fan of ML Functors, and Erlang parameterized modules are rather similar in that both are functions from "things" to modules. We've used parameterized modules in a number of places to avoid duplicate code, write wrappers that preserve separations of concerns, and occasionally create some boilerplate that is shinier than your average boilerplate. The strong type systems in MLs keep you in line when using functors, but Erlang checks none of that with parameterized modules so it is much like riding a motorcycle without a helmet. Really fast and more fun because you feel like a really bad, bad man, but when you screw up the mess is pretty unrecognizable. My team’s usage of parameterized modules tends to raise eyebrows with more traditional Erlang developers, feature father Richard Carlsson occasionally among them. Please don't bring up R16.

Mirko: What is your favourite Erlang development setup? Editor, tools and so on? I am running a little survey on this topic.

Daniel: I'm a rather unsavvy minimalist when it comes to my development setup. emacs in the console + erlang mode. I prefer to use vanilla defaults in my configuration as much as possible, so that I have minimal expectations if I am thrust into a fresh/unfamiliar environment.

Many people at Klarna who prefer a more IDE-like interaction with emacs + erlang use edts, developed by my teammate Thomas Järvstrand.


Copyright Mirko Bonadei (2010-2017).