Dev Tools & Ops @Scale
Five years ago I attended the first @Scale Conference at Facebook's Menlo Park campus, and it instantly became one of my favorite events. It was the first time I had the chance to hear directly from the engineers building mobile apps whose user bases were going through hyper-growth, such as Uber, LinkedIn, and of course Facebook itself. It was a relatively small event with a single track, where a few hundred attendees could fit in one room. I came out of it enlightened and inspired to seek greater technical challenges, thirsty to leave a bigger positive impact on every digital product I contributed to.
This year, the event was probably the biggest in its history, with over two thousand attendees and three parallel tracks: Data, Machine Learning, and Dev Tools & Ops. While it was a challenge to pick which talks to skip, I decided to dive fully into a single track for the whole day, Dev Tools & Ops, and I'm glad I did!
First of all, I have to mention the amazing keynote, which consisted of three talks. David Patterson (yes, the author of the famous Computer Architecture book some of you might have studied in college) kicked us off with a talk titled "Golden Age for Computer Architecture," where he took us through a 40-year history of computers in half an hour. While Moore's Law scaling of transistor density has been slowing since the 80s, demand from deep learning has driven exponential growth in machine learning compute: AI training has been at the forefront of compute demand, growing at a much higher rate than Moore's Law. In the last 6 years alone, from AlexNet to AlphaGo Zero, there's been a 300,000x increase in compute used. The industry leaders in this domain are contributing to the current neural network architecture debate with quite different approaches: Google with the TPU, which has one core per chip and a large 2D multiplier; NVIDIA with 80+ core GPUs running many threads; Microsoft with FPGAs, customizing hardware to applications; and Intel with 30+ core CPUs. There are also many startups making their own architecture bets, and the marketplace will ultimately settle this debate, which was one of the lessons of the last 50 years in computer architecture. Other takeaways were that software advances have inspired and will continue to inspire architecture innovations, and that raising the hardware/software interface enables architectural opportunities.
In the second keynote talk, "Building Community Driven AI Infrastructure," Jason Taylor, Facebook's VP of Infrastructure, spent some time going over their Open Compute Project contributions, ranging from the Twin Lakes server card and the Yosemite V2 chassis on the compute side to Bryce Canyon and Lightning in storage, along with various CPU sleds and chassis. ML/AI accelerators have become impactful components in Facebook's 16 global data centers, alongside the compute, storage, and memory units. Facebook's other memorable contributions to the AI community are PyTorch, an open source Python-based ML library; ONNX (Open Neural Network Exchange), an open format for representing deep learning models that allows interoperability between different open source AI frameworks; and a deep learning framework called Caffe2 (Convolutional Architecture for Fast Feature Embedding). He reiterated their mission to advance the world's AI and referenced facebook.ai, where they have been sharing many more frameworks, libraries, models, and developer tools with the open source community, with the goal of empowering developers to take AI from research to production.
The keynote kept getting better when the third speaker, Clément Farabet, NVIDIA's VP of AI Infrastructure, took the stage to talk about their AI infrastructure for autonomous driving. To begin with, I was pleasantly surprised to find out NVIDIA was a player in this space. It turns out they have already driven millions of actual miles and billions of simulated miles, have 1,000+ nodes for offline testing, and are nearing hundreds of petabytes of real test data. Currently they have 30 cars with 12-camera, radar, and lidar rigs, which help label 20M objects every month. I was able to take a closer look at one of these cars (the one in the photo below) later in the day.
Next, he talked about project MagLev (its name inspired by magnetic levitation trains, which are not only the fastest but also the most stable), an AI training and inference infrastructure that supports all the data processing necessary to train and validate petabyte-scale AI systems. It also happens to be the cloud system that will enable other auto manufacturers to build autonomous cars. With features such as programmatically capturing workflows, including data preprocessing, selection, model training, and testing, NVIDIA hopes to revolutionize every industry that relies on inference for automation.
After such a fulfilling keynote, I could've gone home and it would've still been worth it, but I'm really glad I stayed. :) Next up was Ke Mao from Facebook, talking about an automated fault-finding tool called Sapienz. With it, we're no longer talking about automating tests but about automating the creation of tests: the tool can intelligently design many of the tests needed to ensure the stability and performance of a large-scale app such as Facebook. There will still be tests that are best designed by engineers, but the innovation is that Facebook has shown it's possible to automate much of this time-consuming process, thereby accelerating the development of new features.
Sapienz uses an intelligent computational search over the space of all possible tests. While doing that, it builds a model of the system under test through its UI interactions and uses genetic algorithms to keep the good tests for future reuse. There are many steps in between, but the outcome is that the system now designs, runs, and reports the results of over a hundred thousand test cases every day on the Facebook app. Moreover, using the crash data and stack traces, it can automatically create fixes for the issues (some of these are actions such as rolling back the diffs that caused a crash, reverting partial diffs, or applying templates to the most frequently occurring crash patterns). Because it reports the results of thousands of test cases every day, it has also allowed Facebook engineers to manually fix many issues, with an impressive 75% fix rate! They're still working on expanding Sapienz to other iOS and Android apps within Facebook, and my hope is that the tool is open-sourced in the very near future.
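To make the genetic-search idea concrete, here is a toy sketch of evolving UI test sequences. Everything in it is made up for illustration (the action names, the fitness function, the parameters); real Sapienz optimizes actual code coverage and crash discovery over a far richer action space.

```python
import random

# Hypothetical UI actions a generated test could perform.
ACTIONS = ["tap_feed", "tap_profile", "scroll", "back", "open_menu", "like_post"]

def fitness(test):
    """Toy fitness: reward distinct actions (a stand-in for code coverage)
    while penalizing length, since short high-coverage tests are preferred."""
    return len(set(test)) - 0.1 * len(test)

def mutate(test):
    t = list(test)
    t[random.randrange(len(t))] = random.choice(ACTIONS)
    return t

def crossover(a, b):
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def evolve(generations=30, pop_size=20, test_len=8):
    pop = [[random.choice(ACTIONS) for _ in range(test_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]   # keep the "good tests" for reuse
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

random.seed(0)  # deterministic for the sake of the example
best_test = evolve()
```

The surviving sequences play the role of the "good tests kept for future reuse" mentioned above: each generation preserves the fittest half and breeds variations of them.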
This tool works effectively hand-in-hand with their static analyzer, Infer, which also helps scale concurrency bug detection. Nikos Gorogiannis from Facebook's Static Analysis Tools team talked about the most common bugs concurrency introduces, such as data races, livelock, deadlock, and starvation, and how Infer helps detect them at scale. He walked through use cases for each category, which helped clearly visualize how a static analyzer can identify such issues.
Infer's ability to scale relies on constructing a summary data structure that describes the behaviors of interest for each analyzer. Each summary is then stored in a database so that it won't need to be recomputed in the future.
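The summary-and-cache idea can be sketched in a few lines. This is a minimal illustration, not Infer's actual representation: the "summary" here is just the set of locks a function may acquire, computed once per function over a toy call graph and reused by every caller.

```python
# Stand-in for Infer's persistent summary database.
summary_db = {}

# A toy program: who calls whom, and which locks each function takes directly.
CALL_GRAPH = {
    "handler": ["log", "update_cache"],
    "update_cache": ["log"],
    "log": [],
}
DIRECT_LOCKS = {
    "handler": {"cache_lock"},
    "update_cache": {"cache_lock"},
    "log": set(),
}

def analyze(fn):
    """Return the summary for `fn` (all locks reachable from it),
    computing it at most once and composing callee summaries."""
    if fn in summary_db:          # already analyzed: no recomputation
        return summary_db[fn]
    locks = set(DIRECT_LOCKS[fn])
    for callee in CALL_GRAPH[fn]:
        locks |= analyze(callee)  # reuse (or build) the callee's summary
    summary_db[fn] = locks
    return locks

analyze("handler")
```

Because callers consume summaries rather than re-walking callee bodies, each function is analyzed once no matter how many call sites it has, which is what lets this style of analysis scale to a large codebase.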
Infer doesn't execute code; instead, it parses it and builds mathematical models, which it then uses to detect various categories of issues at build time. It houses 20 different analyzers and has been used at Facebook since 2013 on all mobile apps and backend code. Last year alone, it created 7,000 reports for concurrency bugs, which resulted in developers introducing more than 4,000 fixes. Since it's open source, everyone can try it (well, almost everyone, as it doesn't support Swift at the moment).
Next up was Uber's Leslie Lei, who talked about their testing journey with multi-app orchestration. They're in a unique spot because their testing involves running both a rider and a driver app simultaneously (messaging and multiplayer games are in a similar situation). They have tried a number of strategies, and their two main success criteria have been stability (test results should be consistent) and performance (tests should run quickly; their SLA is 20 minutes!).
Early on in their journey, they used a blackbox approach and automated both apps. It mostly worked, except when backend services would fail due to updates or timeouts. This meant the system wasn't stable and couldn't be run at scale. Because both apps had to wait for responses from each other, the tests also ran longer.
Next, they tried graybox testing, where they recorded the backend responses and replayed them while running the tests, thereby short-circuiting a backend round trip. In theory, this should have met the stability requirement, but in practice there were flaky tests. After much investigation, they concluded that the file I/O cost of loading the stubs was slow enough to cause issues on some CI machines. They worked around this by generating network models from the stubs that could be compiled into native or byte code. This met the stability and performance requirements; however, the tests eventually became stale, and the recorded responses were tied to specific device, architecture, and A/B test combinations, which was limiting.
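The shift from replaying stub files to compiled network models can be illustrated with a small sketch. The endpoints and payloads below are invented, and Uber's models compile to native or byte code rather than a Python module, but the idea is the same: the recorded responses become code, so replay is an in-memory lookup instead of per-request disk reads.

```python
import types

# Recorded backend responses, as they might come out of a record phase
# (entirely hypothetical endpoints and payloads).
recorded = {
    "/rider/request": {"status": "matched", "driver_id": 42},
    "/driver/accept": {"status": "on_trip"},
}

def compile_network_model(stubs):
    """'Compile' recorded stubs into a module whose responses live in
    memory, avoiding the file I/O that caused flakiness on CI machines."""
    source = (
        "RESPONSES = " + repr(stubs) + "\n"
        "def respond(path):\n"
        "    return RESPONSES[path]\n"
    )
    module = types.ModuleType("network_model")
    exec(compile(source, "network_model.py", "exec"), module.__dict__)
    return module

model = compile_network_model(recorded)
reply = model.respond("/rider/request")  # in-memory lookup, no disk reads
```

The tradeoff the talk described shows up naturally here: the compiled responses are frozen at record time, so they go stale and only reflect the device and A/B combinations they were recorded under.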
Most recently, they've moved to a hybrid approach: a blackbox orchestration service combined with a graybox record/replay system (pictured above). The orchestration service simulates the production environment and can test both the rider and driver apps in the same tenancy without interfering with other A/B tests. This covers the previously missed issues around mobile networking and device types and prevents tests from getting stale. The tradeoff is that the failure surface is now larger, because anything from mobile to backend can fail, but they have added tooling to make test failures more actionable. For example, flaky tests are automatically disabled; a CLI mirrors the setup of the CI machines; a test-run visualizer provides simulator, device, and network logs along with the individual steps; and a trip simulator removes the need to run both apps simultaneously. As a result, they're able to run more tests with higher levels of confidence, and their developers are much happier!
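One of those tooling pieces, automatically disabling flaky tests, could look something like this sketch. The class name, window size, and flakiness rule are all assumptions for illustration; the talk didn't describe Uber's actual heuristics.

```python
from collections import defaultdict

class FlakyTracker:
    """Toy flaky-test quarantine: a test that both passes and fails
    within its recent runs is flagged as flaky and can be auto-disabled."""

    def __init__(self, window=10):
        self.window = window
        self.history = defaultdict(list)  # test name -> recent pass/fail

    def record(self, test, passed):
        runs = self.history[test]
        runs.append(passed)
        del runs[:-self.window]           # keep only the recent window

    def is_flaky(self, test):
        runs = self.history[test]
        return len(runs) >= 2 and (True in runs) and (False in runs)

tracker = FlakyTracker()
for outcome in [True, False, True, True]:       # intermittent failure
    tracker.record("rider_driver_match_test", outcome)
for outcome in [True, True, True]:              # consistently green
    tracker.record("login_test", outcome)
```

A test flagged this way would be pulled out of the blocking CI signal and routed to the visualizer and logs for investigation, which is what keeps the larger failure surface actionable.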
Now that we've taken a peek into how Uber optimizes mobile testing at scale, the next talk, by Facebook's Evan Snyder, completed the big picture by covering how test device utilization is optimized at Facebook. With 2.2B MAU across Facebook's ecosystem, their apps need to be tested on many platforms, including newer devices such as the Oculus Rift and Oculus Go. A lot of time is wasted when running tests on real devices, though. Take mobile tests, for example: before the tests run, the emulator needs to be erased and booted and the app installed, and afterwards the results are uploaded to a backend service, so more time is spent in the pre- and post-test phases than running the tests themselves.
To do more with less while managing many combinations of hardware and software platforms, Facebook built a resource management system called One World that lets teams request any type of platform via a unified API. Through this API, a developer can request that their test be run on a real device or an emulator of their choice. The system aggressively optimizes throughput, for example by moving tests to the next emulator should one die. As a result, no emulators sit idle.
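A minimal sketch of that scheduling idea, with all names and structure invented for illustration: tests queue up per requested platform, and any healthy worker for that platform pulls the next one, so a dead emulator's pending work simply flows to its neighbors.

```python
from collections import deque

class ResourceManager:
    """Toy One World-style scheduler: per-platform test queues,
    served by whichever registered worker is still healthy."""

    def __init__(self):
        self.queues = {}    # platform -> pending tests
        self.healthy = {}   # worker -> alive?

    def register(self, worker, platform):
        self.healthy[worker] = True
        self.queues.setdefault(platform, deque())

    def submit(self, platform, test):
        self.queues[platform].append(test)

    def next_test(self, worker, platform):
        """Workers poll for work; a dead worker gets nothing, so its
        would-be tests stay queued for the next emulator."""
        if not self.healthy.get(worker):
            return None
        q = self.queues.get(platform)
        return q.popleft() if q else None

mgr = ResourceManager()
mgr.register("emulator-1", "android-9")
mgr.register("emulator-2", "android-9")
mgr.submit("android-9", "login_test")
mgr.healthy["emulator-1"] = False   # emulator-1 dies mid-run
```

Because tests are bound to a platform request rather than to a specific device, utilization stays high: capacity can be added or lost without stranding any queued work.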
Behind the scenes, they have data centers with more than 20,000 resources, including mobile devices, racks of Mac minis, and Linux hardware. Because it's important to keep the production and corporate networks separate, no devices are allowed to access the production network. Worker services on the corporate network talk with clients in production environments through a queue. All communication between machines to fetch jobs and update results happens through simple Graph API requests, which also makes it easy to add more hardware, since any device that can make HTTP requests can participate.
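The pull-based pattern described above can be sketched as follows. The `FakeGraphAPI` class and its paths are invented stand-ins for the real Graph API endpoint: the point is simply that workers only ever make outbound requests to fetch jobs and post results, so nothing on the production side needs to reach into the corporate network.

```python
class FakeGraphAPI:
    """In-memory stand-in for an HTTP job endpoint (hypothetical paths)."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.results = []

    def get(self, path):
        # Hand out the next queued job, if any.
        if path == "/jobs/next" and self.jobs:
            return self.jobs.pop(0)
        return None

    def post(self, path, payload):
        if path == "/results":
            self.results.append(payload)

def worker_loop(api, run_test):
    """Poll for jobs until the queue is drained, posting each result back.
    Only outbound requests are made; the worker accepts no connections."""
    while (job := api.get("/jobs/next")) is not None:
        api.post("/results", {"job": job, "passed": run_test(job)})

api = FakeGraphAPI(["test_login", "test_feed"])
worker_loop(api, run_test=lambda job: True)  # trivial runner for the sketch
```

Since the whole protocol is just HTTP GET/POST, any device capable of making web requests, from a Mac mini to a phone, can act as a worker, which is the scaling property the talk highlighted.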
Some of the lessons they learned while designing this system: solutions won't magically be cross-platform unless all assumptions are checked and dependencies are understood; before scaling a system, care must be taken to maximize utilization; and since it's not possible to predict industry changes, your frameworks need to be flexible and your system should be able to adapt to new requirements.
There were two more talks I greatly enjoyed: Trenton Davies of Adobe on Regression Testing Against Real Traffic on Big Data Systems, and Manasi Joshi of Google on Machine Learning Testing at Scale. In the end, though, the best part of the conference for me was actually meeting each speaker, talking a bit more about their experiences, and just hanging out with them and having fun! I came away inspired to apply these strategies, tools, and perspectives to the similar complex engineering challenges we face day to day, and I can't wait to get to solving them!