Reflections on 10,000 Hours of DevOps

Some reflections after putting 10,000 hours into DevOps engineering.

From doing sysadmin work in my early adolescence and customizing my Arch Linux installation, to running a server in the closet of my college dorm (narrator: it was loud, and my email rarely delivered), to working on open-source DevOps at Google, I’ve probably put in many more hours than that. It’s hard to tell how many of those counted as Malcolm Gladwell’s “deliberate practice,” but these are the lessons learned nonetheless. (Also see my more general reflections on 10,000 hours of programming.)

  1. Reproducibility matters. Without it, subtle bugs burn hours of debugging time and kill productivity.
  2. Never reuse a flag.
  3. The value of a CI/CD pipeline is inversely proportional to how long the pipeline takes to run.
  4. Code is better than YAML.
  5. Linear git history makes rollbacks easier.
  6. Version your APIs. Even the internal ones. No stupid breaking changes (e.g., renaming a field). Don’t reinvent the wheel. Use semantic versioning.
  7. Do not prematurely split a monorepo. Monorepos have U-shaped utility (great for extremely small or large orgs).
  8. Vertical scaling (bigger machines) is much simpler than horizontal scaling (sharding, distributed systems). But sometimes, the complexity of distributed systems is warranted.
  9. Your integration tests are too long.
  10. Have a high bar for introducing new dependencies. Especially ones that require special builds or environments.
  11. Release early, release often.
  12. Do not tolerate flaky tests. Fix them (or delete them).
  13. Make environments easy to set up from scratch. This helps in every stage: local, staging, and production.
  14. Beware toolchain sprawl. Every new tool requires expertise, management, and maintenance.
  15. Feature flags and gradual rollouts save headaches.
  16. Internal platforms (e.g., a PaaS) can make developers more productive, but make sure you aren’t getting in the way. Only create new abstractions that could only exist in your company.
  17. Don’t use Kubernetes, Yet. Make sure your technology's complexity matches your organization's expertise.
  18. Cattle, not pets (prefer ephemeral infrastructure over golden images). Less relevant in the cloud era but important to remember.
  19. Avoid shiny objects but know when the paradigm shifts.
  20. Technical debt isn’t universally bad.
  21. Meaningful health checks for every service. Standardize the endpoint (e.g., /healthz) and statuses.
  22. 80/20 rule for declarative configuration. The last 20% usually isn’t worth it.
  23. Default to closed (minimal permissions for) infrastructure.
  24. Default to open for humans. It’s usually a net benefit for developers to view code outside their own project.
  25. Bash scripts aren’t as terrible as their reputation. Just don’t do anything too complex. Always start with “set -ex” and “set -o pipefail.”
  26. Throttle, debounce, and rate-limit external APIs.
  27. Immutable infrastructure removes a whole class of bugs.
  28. Makefiles are unreasonably effective.
  29. If you have to do a simple task more than 3 times, automate it.
  30. Be practical about vendor lock-in. Don’t over-engineer a generic solution when it’s incredibly costly. But proprietary solutions have a cost too (developer experience, customizability, etc.).
  31. Structured logging (JSON) in production, plaintext in development.
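
To make lesson 31 concrete, here is a minimal sketch of switching log formats by environment, using only Python’s standard logging module. The environment variable name (APP_ENV), the logger name ("app"), and the JSON field names are assumptions picked for illustration, not a standard; swap in whatever your log pipeline expects.

```python
import json
import logging
import os
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_formatter(env: str) -> logging.Formatter:
    """JSON in production, human-readable plaintext everywhere else."""
    if env == "production":
        return JsonFormatter()
    return logging.Formatter("%(asctime)s %(levelname)-8s %(name)s: %(message)s")


def configure_logging(env: Optional[str] = None) -> logging.Logger:
    # APP_ENV is a hypothetical variable name chosen for this sketch.
    env = env or os.environ.get("APP_ENV", "development")
    handler = logging.StreamHandler()
    handler.setFormatter(build_formatter(env))
    logger = logging.getLogger("app")
    logger.handlers = [handler]  # replace handlers so repeated calls don't duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = configure_logging()
    log.info("deploy finished for %s", "service-a")
```

Keeping the choice in one small function means the rest of the code logs the same way everywhere, and only the formatter changes between a laptop and production.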
