Infrastructure at Scale: Continuous Integration
This blog series describes the engineering infrastructure (technologies, processes, tools, and culture) that enables several hundred engineers across LinkedIn to innovate and release software continuously with agility, quality, and productivity. This post describes how we’ve scaled the Continuous Integration (CI) pipeline to be fast, easy, and reliable.
As described in Infrastructure at Scale: Overview:
there is one trunk that everyone commits to and that we release from; every commit goes through automated builds (multiple flavors) and tests (over 4,000 and growing) twice: before and after it is committed to the trunk. Trunk is always release-ready; every build is a release candidate. To scale this CI infrastructure to hundreds of engineers, with hundreds of releases daily, across iOS, Android, web, and API, we need strong and coherent support in culture, technologies, processes, and tools. This post focuses on what we’ve done in defining test strategy, improving test frameworks, optimizing the CI workflow, and building tools and processes to make the CI infrastructure fast, easy, and reliable at scale.
Test Strategy
Test Types
We define four types of automated tests:
- Unit Test: test business logic at the method and class level, with a unit test framework (like xUnit) and a mock framework (like Mockito). These tests are small, fast (milliseconds), and reliable.
- Scenario Test (aka Acceptance Test): test user interactions and use cases, like interacting with one or more screens (a “screen” means a ViewController & View on iOS, an Activity & Fragment on Android, a Page on web, or an API for the API platform). These tests are written with a test framework (like Espresso for Android, KIF for iOS, QUnit for web, and TestNG for API) and run within the app (on an iOS simulator, an Android emulator, in a web browser, or in an API container). They use a mock or fixture server to fake network and other dependency call responses, so these tests are self-contained, and thus reliable and fast (seconds).
- Layout Test: verify UI components are rendered correctly (i.e., no unexpected wrapping, clipping, overlapping, ellipsizing, concatenating, or truncating; no garbled characters, etc.) across different locales, orientations, screen sizes, and densities. We’ve created our own layout test frameworks on iOS, Android, and web.
- Live Test: end-to-end smoke tests focusing on a few key use cases, usually part of the CD (continuous delivery) pipeline rather than the CI pipeline.
The table below summarizes the four types of tests:

| Test Type | Scope | Speed | Reliability |
| --- | --- | --- | --- |
| Unit | Smallest: methods and classes | Fastest (milliseconds) | Most reliable |
| Scenario | Screens and use cases | Fast (seconds) | Reliable (mocked dependencies) |
| Layout | UI rendering across locales, orientations, screen sizes, and densities | Fast | Reliable |
| Live | Whole system, end to end | Slowest | Least reliable |

Generally, unit tests are the fastest and most reliable and test the smallest software units; live tests are the slowest and least reliable but exercise the whole system end to end; scenario and layout tests sit in the middle in terms of scope, speed, and reliability.
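To make the unit-test category concrete, here is a minimal JUnit + Mockito sketch; the ProfileService and ProfileViewModel types are hypothetical stand-ins defined inline, not LinkedIn code:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class ProfileViewModelTest {

    // Hypothetical collaborators, defined inline so the sketch is self-contained.
    interface ProfileService {
        String firstName();
        String lastName();
    }

    static class ProfileViewModel {
        private final ProfileService service;
        ProfileViewModel(ProfileService service) { this.service = service; }
        String fullName() { return service.firstName() + " " + service.lastName(); }
    }

    @Test
    public void fullNameJoinsFirstAndLast() {
        // Mock the dependency so the test is small, fast, and deterministic.
        ProfileService service = mock(ProfileService.class);
        when(service.firstName()).thenReturn("Ada");
        when(service.lastName()).thenReturn("Lovelace");

        assertEquals("Ada Lovelace", new ProfileViewModel(service).fullName());
    }
}
```

Because the dependency is mocked, the test exercises only the business logic and runs in milliseconds.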
Test Guidelines
We have comprehensive guidelines on what tests to write, how to write them, dos and don'ts, and tips on debugging and fixing flaky tests. Here are a few highlights:
- Tests must be reliable: flaky tests are worse than no tests, and are actively disabled.
- Tests must be fast: we run thousands of tests for each commit, multiple times.
- Tests must be meaningful: trivial, redundant, obsolete, disabled tests are expensive to build, run, and maintain, so they are actively weeded out.
- 100% functional coverage is required, at the method/class, screen/resource/route, and pagekey levels.
- Unit tests are preferred over scenario tests, and live tests should be kept to a minimum; 70% of all tests should be unit tests.
Test Frameworks
Below are the test and mock frameworks we’ve adopted, improved, and/or built for the four types of tests across our four platforms (iOS, Android, web, and API):
* See below for more information on Tinker, Mixture, Fixture, RTF and Poseidon, test and mock frameworks we’ve built at LinkedIn.
Criteria
Our criteria for selecting or building test and mock frameworks are that they be fast, easy, and reliable:
- Tests must run lightning fast. There should be no unnecessary wait, no communication or other overhead by the framework. It should allow sharded, randomized, and parallelized execution of tests to further speed up execution of thousands of tests and finish them all in minutes.
- Tests must be easy, and ideally enjoyable, to write and debug. Writing tests should improve engineers' productivity, not slow them down. The test framework should be well designed and have a vibrant community to provide documentation, utilities, and support. It should have great IDE integration to make authoring, running, debugging, and reporting tests easy and pleasant. It should have great CLI APIs for easy scripting, automation, and reporting.
- Tests must be reliable. There should be no unreliable wait. It must handle the synchronization of asynchronous steps in test scripts natively and transparently at the framework level, so tests can be written easily, and are reliable by default. It should take effort to make a mistake, not the other way around.
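To make that last criterion concrete: Espresso, for example, exposes an IdlingResource contract through which the app reports in-flight asynchronous work, so the framework waits for idleness instead of tests sleeping. The counter-based bookkeeping in this sketch is illustrative, not LinkedIn code:

```java
import androidx.test.espresso.IdlingResource;

// The app's network layer calls requestStarted()/requestFinished() around
// every request; Espresso polls isIdleNow() before each test step.
public class NetworkIdlingResource implements IdlingResource {
    private volatile ResourceCallback callback;
    private int inFlightRequests = 0;

    @Override public String getName() { return "network"; }

    @Override public synchronized boolean isIdleNow() { return inFlightRequests == 0; }

    @Override public void registerIdleTransitionCallback(ResourceCallback callback) {
        this.callback = callback;
    }

    public synchronized void requestStarted() { inFlightRequests++; }

    public synchronized void requestFinished() {
        if (--inFlightRequests == 0 && callback != null) {
            // Tell the framework it may proceed with the next test step.
            callback.onTransitionToIdle();
        }
    }
}
```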
Frameworks & Utilities
Based on the above criteria, we’ve adopted, improved, and oftentimes created test and mock frameworks, along with lots of test utilities, to make testing fast, easy, and reliable:
Unit Test
Unit test frameworks are well developed across all platforms, so we didn’t reinvent the wheel; we simply adopted the standard or most popular test and mock frameworks for each platform. We did create our own shared mock-object libraries for code reuse.
Layout Test
Testability is an important requirement of architecture design, and we enforce it. Layout testing is a good example of the benefit of testable architecture. We mandate the MVVM architecture pattern: Views are cleanly separated from ViewModels and other app logic. We use a layout test framework to specify and generate randomized, exaggerated ViewModels (e.g., lengthening and accenting strings; see a later post in this series, Infrastructure at Scale: Internationalization, for more information), bind them to the Views under test, and automate the layout tests across many combinations of locale, orientation, screen size, and density. The iOS Layout Test framework was open-sourced by my colleague Peter Livesey, and we have a similar layout test framework for Android. Poseidon is the layout test framework (and more) for web: it uses screenshot comparison to spot rendering differences and UX changes, and is integrated with our code review and CI workflow to make it easy to use.
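To illustrate the idea (not any of our actual frameworks), here is a hypothetical Java sketch of a layout test: it generates lengthened and accented variants of a ViewModel string and checks them against a stubbed layout measurement. Every helper here is a stand-in for what a real framework does when it renders the View:

```java
import static org.junit.Assert.assertTrue;

import java.util.List;
import org.junit.Test;

public class HeadlineLayoutTest {

    // Hypothetical data generation: exaggerate the ViewModel text the way
    // a real layout test framework lengthens and accents strings.
    static List<String> exaggeratedVariants(String text) {
        return List.of(
                text,
                text + " " + text + " " + text,          // lengthened
                text.replace('e', 'é').replace('a', 'å') // accented
        );
    }

    // Stub renderer: pretend each character is 8px wide. A real framework
    // renders the actual View across locales, orientations, and densities.
    static int renderedWidth(String text) {
        return text.length() * 8;
    }

    static final int CONTAINER_WIDTH = 320;

    @Test
    public void noWordExceedsContainerWidth() {
        for (String variant : exaggeratedVariants("Software Engineer")) {
            for (String word : variant.split(" ")) {
                // An unbreakable token wider than the container would clip.
                assertTrue("would clip: " + word,
                        renderedWidth(word) <= CONTAINER_WIDTH);
            }
        }
    }
}
```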
Scenario Test
Scenario test and mock frameworks are less well developed. Many companies, including Apple, Google, and LinkedIn, are actively developing and open-sourcing scenario test and mock frameworks across iOS, Android, web, and API, to the benefit of the whole industry. We picked KIF for iOS, Espresso for Android, Ember QUnit for web, and TestNG for API, because:
- They were the standard or most popular test framework for their respective platforms, so they have great platform, IDE, and community support.
- They are “in-proc” test frameworks in the sense that the scenario/integration/acceptance tests written in these frameworks are almost like unit tests:
- They are written in the native language of their respective platform, checked into the same source code repo as the code they test, and go through the same code review, CI, and CD processes, so writing tests is just a natural part of feature development for engineers who own both the product code and the test code.
- Tests run in the same app context and have full access to all product code, so there is no restriction on the power and flexibility of the test code, and no unnecessary overhead (compared with “out-of-proc” test frameworks like Appium). The downside is that in-proc tests can’t test across apps and can’t test the release version of the apps. These limitations are mitigated by live tests.
- They are open source, so they serve as a good starting point for us to improve or rebuild to satisfy our needs.
For iOS, we created Tinker, inspired by KIF and Espresso, to meet our needs for speed, reliability and developer happiness:
- We created Mixture, an in-proc mock and fixture server. It intercepts all network requests and allows easy and powerful mocking, recording, and replaying of network requests and responses (see the conceptual sketch after this list).
- We replaced KIF’s slow and unreliable timed waits with framework-level synchronization of asynchronous steps, by monitoring and waiting for run loops and dispatch queues to empty out, just like Espresso and Ember QUnit do.
- We implemented the same API as Espresso, for its power and beauty, and for consistency across iOS and Android.
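To illustrate the in-proc mock/fixture idea in Java, here is a sketch using an OkHttp interceptor as a stand-in. Mixture itself is an iOS framework with a different, richer API (including recording and replay); every name below is illustrative:

```java
import java.io.IOException;

import okhttp3.Interceptor;
import okhttp3.MediaType;
import okhttp3.Protocol;
import okhttp3.Request;
import okhttp3.Response;
import okhttp3.ResponseBody;

public class FixtureInterceptor implements Interceptor {
    @Override
    public Response intercept(Chain chain) throws IOException {
        Request request = chain.request();
        // Never hit the network: answer every request from a canned fixture,
        // so tests are self-contained, fast, and deterministic.
        String fixture = fixtureFor(request.url().encodedPath());
        return new Response.Builder()
                .request(request)
                .protocol(Protocol.HTTP_1_1)
                .code(200)
                .message("OK")
                .body(ResponseBody.create(
                        MediaType.parse("application/json"), fixture))
                .build();
    }

    // A real fixture server keys recorded responses by request signature;
    // this stub returns a single hard-coded payload.
    private String fixtureFor(String path) {
        return "{\"path\":\"" + path + "\",\"status\":\"ok\"}";
    }
}
```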
An iOS scenario test is written as onView(ViewMatcher).expect(Assertion).perform(Action): a clean API with no waiting, chaining, or any other synchronization code. (The Android sketch below shows the same Espresso-style shape.)
Please read my colleague Keqiu Hu’s blog for more details on iOS test frameworks.
For Android, we’ve also built an in-proc fixture server and created lots of test rules and utilities (like tracking test utils and a screenshot taking and comparison API) to make testing effective and pleasant. The test API and utilities are consistent with the iOS ones; the sketch below shows the shared style.
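Here is a hedged Espresso sketch of that style (Tinker’s iOS API uses expect() where Espresso uses check(), but the shape is the same); FeedActivity and the R.id values are hypothetical, not our actual product code:

```java
import static androidx.test.espresso.Espresso.onView;
import static androidx.test.espresso.action.ViewActions.click;
import static androidx.test.espresso.assertion.ViewAssertions.matches;
import static androidx.test.espresso.matcher.ViewMatchers.isDisplayed;
import static androidx.test.espresso.matcher.ViewMatchers.withId;
import static androidx.test.espresso.matcher.ViewMatchers.withText;

import androidx.test.ext.junit.rules.ActivityScenarioRule;
import org.junit.Rule;
import org.junit.Test;

public class FeedScenarioTest {
    @Rule
    public ActivityScenarioRule<FeedActivity> activityRule =
            new ActivityScenarioRule<>(FeedActivity.class);

    @Test
    public void likingAPostUpdatesTheButton() {
        // No sleeps or explicit waits anywhere: Espresso synchronizes with
        // the UI thread and idling resources automatically.
        onView(withId(R.id.like_button)).perform(click());
        onView(withText("Liked")).check(matches(isDisplayed()));
    }
}
```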
Please read my colleague Drew Hannay’s blog for more information.
Live Test
We use live tests mostly for sanity checks as part of the CD pipeline. Unlike unit and scenario tests, live tests can be written by either developers or testers. Given the feature parity and consistency of our apps across iOS, Android, and web, it makes sense for testers to use an out-of-proc test framework like Appium to write tests once and run them against all platforms. Out-of-proc frameworks also allow us to test the release binaries and cross-application scenarios like sharing and notifications. But because of the ease and power of our scenario test frameworks and utilities, many live tests are actually written as scenario tests, especially when they don’t cover cross-app scenarios.
CI Pipeline
To speed up the commit-to-publish time, we have:
- massively parallelized every step of the CI pipeline (build, test, publishing, reporting) across multiple Hudson machines and multiple simulator/emulator/browser instances.
- aggressively cached and reused build and test results to avoid redundant work.
As my colleague Keqiu Hu shows in his blog post 3x3: iOS Build Speed and Stability, we distribute tests across four iOS simulators on one Hudson box.
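As a conceptual illustration of how such distribution can work, the sketch below assigns each test to exactly one executor via a stable hash. The class and its names are illustrative only; real setups typically rely on the build tool’s sharding support:

```java
// Each CI executor (a simulator instance or a Hudson box) runs only the
// tests whose hashed name falls into its shard.
public class ShardFilter {
    private final int shardIndex;   // which executor am I (0-based)?
    private final int totalShards;  // how many executors in total?

    public ShardFilter(int shardIndex, int totalShards) {
        this.shardIndex = shardIndex;
        this.totalShards = totalShards;
    }

    // Stable assignment: every executor computes the same shard for a given
    // test, so each test runs exactly once across the fleet.
    public boolean shouldRun(String testClassName) {
        return Math.floorMod(testClassName.hashCode(), totalShards) == shardIndex;
    }
}
```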
For more information, please read the LinkedIn Mobile Engineering Blog and LinkedIn’s GitHub contributions (including my colleague Trent Willis’ work for web).
Processes & Tools
We have an on-call process to ensure the smooth operation and continuous improvement of the CI & CD processes. A later post in this blog series, Infrastructure at Scale: On-call, will provide more details about the on-call process. Here I will focus on one key issue: flaky tests.
Flaky tests are tests that pass sometimes and fail other times. They are worse than no tests: they undermine people’s confidence in the CI/CD system, waste time on debugging random failures, and frequently block engineers from checking in, since every code change must pass all tests before it is committed to the trunk.
We’ve identified three root causes of flaky tests:
- Flakiness in the CI infrastructure. A leading example: some of the Hudson boxes used to build and test code changes can get into a corrupted state and then fail any task assigned to them, continuously or randomly.
- Flakiness in the test framework.
- Flakiness in the test code.
Of course, there is always the possibility that the tests are not flaky but are correctly catching flakiness in the product, like race conditions. Hence any post-commit test failure is a build break and needs an immediate fix. On-call engineers usually:
- disable the failed test if it is proven flaky (we have the execution history of all tests)
- revert the offending commit if the test failure correctly catches a code bug
- find and fix a hard-to-repro bug in the product caught by the failed test, like a race condition
The third case is the hardest. We usually lean towards disabling the test first and finding and fixing the bug in parallel, rather than leaving it to block other people from committing code changes.
For root cause #1 (CI infrastructure flakiness), we’ve built a tool called TrunkGuardian, which monitors all Hudson boxes and uses heuristics (like execution history and signatures of known corruptions) to find and fix corrupted boxes.
For root causes #2 (test framework flakiness) and #3 (test code flakiness), we’ve built a tool called TestGuardian that keeps running all tests against the Last Known Good (LKG) builds repeatedly, hundreds or even thousands of times. If any test fails, it is automatically disabled and a JIRA ticket is created against the owner to investigate. We built another tool, Testly, that visualizes the disabled tests across all teams, platforms, and test types over time, and we have a process in place to actively monitor and drive disabled tests to zero.
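Conceptually, TestGuardian’s flakiness check boils down to the loop below; this is an illustrative sketch, not the internal tool’s interface. Because the LKG build is unchanged between runs, any failure signals flakiness rather than a code bug:

```java
import java.util.concurrent.Callable;

public class FlakyTestDetector {

    /** Runs the test repeatedly against an unchanged build; true at the first failure. */
    public static boolean isFlaky(Callable<Boolean> test, int runs) throws Exception {
        for (int i = 0; i < runs; i++) {
            if (!test.call()) {
                // The code under test didn't change, so this failure means
                // the test is nondeterministic: disable it and file a ticket.
                return true;
            }
        }
        return false; // passed hundreds or thousands of times in a row
    }
}
```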
Conclusion
Continuous integration to trunk provides many benefits:
- It encourages small, frequent commits, so issues are found and addressed early, frequently, and easily. This prevents the buildup of technical debt and the last-minute rush before release.
- It pushes quality upstream and keeps trunk in high quality shippable state. Quality and craftsmanship become the norm and the values we live by daily.
- High quality trunk also keeps team productivity and morale high.
The traditional arduous release processes and death marches are a thing of the past. A well-functioning CI pipeline at scale enables LinkedIn to respond to members’ needs quickly and deliver value continuously.
All articles of this blog series:
- Infrastructure at Scale: Overview
- Infrastructure at Scale: Continuous Integration
- Infrastructure at Scale: Continuous Delivery
- Infrastructure at Scale: On-call
- Infrastructure at Scale: Tracking
- Infrastructure at Scale: Internationalization
#InfraAtScale #Mobile #Infrastructure #ContinuousIntegration #CI