Look Beyond Legacy 1990s-era Commerce Platforms for Real Innovation
Kelly Goetsch
Chief Strategy Officer at commercetools / MACH Alliance Co-founder / 4x O'Reilly Author
Most legacy commerce applications (I'm looking at you, Oracle Commerce, IBM WebSphere Commerce, Magento, etc) were products of the 1990s and were built for an entirely different era of software development. These applications are characterized by:
- One large monolithic application deployed to a heavyweight enterprise application server connected to a big enterprise-wide relational database
- Monthly or quarterly releases to production
- Software manually deployed on premise to hardware that you procure and manage
- Annual upgrades from the vendor in order to maintain support
- Deployment to only one location at a time
- No tolerance for network latency between components
- Expensive up-front licensing fees
This approach worked just fine before commerce moved from the periphery of the business to being the business. Technology now powers commerce and provides differentiation in the marketplace. Increasingly, even traditional retailers are turning into technology companies. If you're not building out a full omnichannel platform, you're quickly headed toward irrelevancy. Over the past few years, many retailers have begun to realize that their old "one size fits all" approach to application development does not work for commerce. It may still work for payroll, but applications closer to end customers require an entirely new approach.
Gartner has come out with a buzzword to crystallize the trend toward having two "speeds" of application development: "bimodal IT." In their framework, there are two modes of application development:
Mode 1
Traditional, emphasizing scalability, efficiency, safety and accuracy.
Mode 2
Non-sequential, emphasizing agility and speed.
From the mid-1990s to roughly 2010, commerce applications were clearly mode 1. Today, commerce applications are clearly mode 2.
Let’s explore further.
Mode 1
Mode 1 has ruled enterprises for decades, and for all of the boring applications out there that are not used to differentiate an organization in the market, mode 1 is still probably preferred. Think about your company: do you care that your payroll system supports the highest level of the Richardson Maturity Model for REST, or do you care that your paycheck is deposited on time? Most enterprise applications work really well for mode 1. The software is stable, fully supported, doesn't change all that fast, scales well, and is generally safe. Nobody is going to get fired for buying one of these products for a mode 1-style application. Over the past few years, this market has started migrating over to SaaS. Rather than Siebel, you just buy a subscription to Salesforce. This trend will only accelerate, to the point where all mode 1 applications will eventually be delivered as SaaS. Nobody differentiates on mode 1 applications, so there's no value in building/deploying something custom. Instead, it's all about cost/performance, which lends itself naturally to SaaS. The people who built/deployed/maintained these applications on premise will either be let go or moved over to mode 2 applications. The days of deploying undifferentiated, packaged software on premise are long over.
Mode 2
To those engaged in commerce, technology is now the key enabler of the business. Many of the fastest-growing retailers are tech companies that happen to sell things online. They can't afford to stay with mode 1 for these applications. They need to innovate, try new things, fail quickly, develop lots of features in parallel, and release quickly. Rather than a quarterly or annual release cycle, releases should happen multiple times per day. Almost by definition, you must build these applications, often from scratch, in order to differentiate. It's hard to have real differentiation if you're just using the same old commerce platform as everyone else. This leads enterprises to build their mode 2 applications on the most innovative, most agile stack available in order to gain that competitive edge. It's no different from all of the crazy things that high-frequency traders do in order to beat their competitors by a few microseconds. Mode 2 applications tend to be public-facing, deployed at very large scale, and built on a distributed architecture favoring eventual consistency rather than ACID-based strong consistency. It's an entirely different model from mode 1, which I'll get to shortly.
When you hear "Cloud Native" or "Microservices" or "Web-scale," think mode 2. This style was pioneered by a handful of the larger consumer-focused tech companies. Think Google, Facebook, Amazon, Netflix, etc. Google spends $10.5 billion on R&D every year. Facebook spends more than $1 billion. Amazon spends $10 billion. Most of that money goes to highly paid developers. These companies are not spending much, if any, money with the legacy technology vendors. They are building their software mostly from scratch. These companies also differ from traditional enterprises in that they have just a handful of applications. Netflix has basically one collection of microservices that powers their entire business across all channels. These organizations can afford to spend lavishly to build their software from scratch, in a way that differentiates them in the market. These organizations also need to experiment by trying new things and failing fast. They need to release to production multiple times per day (imagine if Facebook had a quarterly release cycle...). They need flexibility far above and beyond what software from traditional ISVs can provide. These organizations pioneered many of the "new" technologies we all use today. Amazon built out AWS to power Amazon.com. Yahoo invented ZooKeeper to underpin many of their consumer-facing applications. Facebook came up with Apache Thrift. Google open sourced Kubernetes. This software acts as building blocks for large, distributed systems. These building blocks didn't exist 10 years ago. Many didn't even exist five years ago. They make it possible for traditional enterprises to build the kind of software that has long been the domain of these large consumer-focused tech companies.
Let’s explore some of the differences between mode 1 and mode 2-style commerce applications.
General Purpose -> Specialized
Legacy packaged commerce applications have enormously heavy dependencies in the form of application servers, databases, infrastructure, etc. It's not uncommon to have to set up 15 or 20 different virtual machines in order to stand up an instance of one of these old platforms. It comes down to architecture. These legacy applications have 5, 10, or 15 million lines of code running in one single operating system process. Think about all of the different ways an application of that complexity can exercise the underlying software. The application servers, databases, infrastructure, etc from another era had to support millions of use cases from thousands of customers. Big enterprise databases, for example, are very heavy and feature-rich.
Mode 2 is entirely different. In mode 2, the underlying software and infrastructure need to support exactly what is required of the application. Rather than a full enterprise database and its ACID compliance with strong transactions, an application may require a simple BASE key/value store, like Cassandra. Every application can have a custom-built stack underneath it, supporting exactly the requirements of the application and nothing more. Clearly there are enormous inefficiencies in giving each application its own underlying stack - but for mode 2 applications, it just doesn't matter. What matters most (by far!) is time to market. Using a big centralized database may appease an enterprise architect but it undoubtedly slows down time to market. Every application developer wants full control over their own underlying stack. With no dependencies, developers can move very fast. This is basically the concept behind microservices.
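To make that concrete, here's a minimal sketch of the "specialized datastore" idea: a cart service writing to Cassandra with a tunable (eventually consistent) consistency level, rather than to a heavyweight enterprise RDBMS. It uses the open source DataStax Python driver; the keyspace and table are hypothetical.

```python
# A cart service writing to Cassandra with tunable consistency.
# Assumes a local Cassandra node; keyspace/table names are made up.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # one node is enough for a sketch
session = cluster.connect()

# replication_factor would be 3+ in production; 1 keeps this runnable locally
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.carts (
        cart_id text PRIMARY KEY,
        items   map<text, int>
    )
""")

# ConsistencyLevel.ONE acknowledges the write as soon as a single replica
# has it -- fast and highly available, but only eventually consistent.
insert = SimpleStatement(
    "INSERT INTO shop.carts (cart_id, items) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, ("cart-42", {"sku-123": 2}))
```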
A microservices architecture decomposes a single application into its discrete business functions. Small cross-functional teams are assembled to architect, build, deploy and run their own unique microservice. Each team has complete autonomy over the technologies they choose for their microservice. Again, the focus is time to market, not efficiency. The enterprise architects of the world get a lot of heartburn when they see microservices with their multiple copies of eventually consistent, denormalized data. But it's not about enterprise architects, or cost savings, or efficiency. It's about time to market with mode 2.
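As an illustration, a microservice covering one discrete business function can be remarkably small. This hypothetical inventory service, sketched with Flask and SQLite for brevity, owns its datastore outright; every other service must go through its HTTP API.

```python
# A hypothetical inventory microservice with its own private datastore.
# No other service touches inventory.db directly.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
db = sqlite3.connect("inventory.db", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS stock (sku TEXT PRIMARY KEY, qty INTEGER)")
db.execute("INSERT OR IGNORE INTO stock VALUES ('sku-123', 10)")
db.commit()

@app.route("/stock/<sku>")
def get_stock(sku):
    row = db.execute("SELECT qty FROM stock WHERE sku = ?", (sku,)).fetchone()
    if row is None:
        return jsonify(error="unknown sku"), 404
    return jsonify(sku=sku, qty=row[0])

if __name__ == "__main__":
    app.run(port=8080)  # the team owning this service picks its own stack
```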
Centralized -> Decentralized
Mode 1-style applications are characterized by extreme centralization. This extends to procurement of software. Typically the CIO will negotiate with a vendor to sign an Unlimited License Agreement (ULA) covering one or more layers of the stack. For example, you would negotiate a $10 million ULA that would allow you to deploy as much database as you wanted over the course of three years. Once the agreement was signed, you had a strong incentive to standardize on that database across the entire company. This would lead to a handful of very large databases, which many applications all wrote to/read from as their sole system of record. There might be a "customer" database, or an "order" database. Different applications would exchange data by using a single database as a single source of truth. That works fine for mode 1, but mode 2, again, is all about time to market. Mode 2 applications have data scattered all over the place, in different types of data stores. Each microservice typically has its own datastore. There generally aren't top-down technology mandates for mode 2 applications. It's all bottom-up, developer-led. And developers don't choose "traditional" technology vendors, because the products they sell are overkill for the simple use cases of the applications being developed. Microservices each tend to be a few thousand lines of code, exercising just a handful of code paths. Traditional commerce applications could easily be 10 million lines of code, executing thousands of code paths. A traditional enterprise application server or database under such a simple application is complete overkill. It's so much easier to just pull MariaDB off the shelf and use it. Mode 2 applications do not lend themselves to centralized procurement of software. Developers working on individual microservices pick the software they want, that solves their individual (very small) use cases.
From a technology standpoint, mode 2 applications are also characterized by extreme decentralization. Mode 1 applications tend to be small and deployed to "reliable" on premise hardware. Mode 2 applications are much larger (think thousands or tens of thousands of servers for a single application) and deployed to public clouds. We're getting to the point where public cloud is as reliable as, if not more reliable than, traditional on premise infrastructure, but what's clear is that today's architects assume public cloud infrastructure is inherently untrusted and unreliable. Google famously attached disk drives to their servers using Velcro, making it easier to swap out failed disk drives. According to Google Fellow Jeff Dean:
In each cluster’s [of 10,000 servers] first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span. And there’s about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.
Traditional commerce applications require much more reliable infrastructure because the software and its dependencies weren't built for cloud, which requires designing for failure from the very beginning. Try putting a clustered enterprise database on a shared, multi-tenant network in a public cloud. Try clustering a traditional application server in a public cloud. Legacy commerce applications were built for highly reliable on premise hardware that never goes down. Think big old mainframe-like SPARC boxes running on a high-performance, single-tenant network. This naturally lends itself to ACID compliance, relational databases, strong consistency, tight coupling, and everything being synchronous. It's simply easier to program assuming reliable infrastructure.
When mode 2 software is deployed to a public cloud, assumptions cannot be made about the availability of the underlying infrastructure. Assume that parts of your network will become unavailable. Assume that individual nodes will die at any given time. This type of infrastructure lends itself to decentralized/distributed-style architecture patterns. Rather than ACID, assume BASE. Rather than strong consistency, assume eventual consistency. Loose coupling over tight coupling. Asynchronous over synchronous. Traditional commerce applications weren't built for this world. They were built for mode 1, where the infrastructure is presumed to be reliable. Software built for distributed systems is vastly more reliable (if done properly) and scales horizontally by its very definition. Remember, mode 2 is all about time to market. It's much easier to deploy a few hundred instances of Cassandra on some cheap IaaS, or use one of the many object stores-as-a-service available from public clouds.
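Designing for failure shows up directly in application code. Here's a minimal sketch of one common defensive pattern: calling a downstream service with timeouts and exponential backoff, rather than assuming the network will be there. The URL is hypothetical.

```python
# Call a downstream service assuming the network is unreliable:
# bounded timeout, retries, exponential backoff with jitter.
import random
import time

import requests

def get_with_retries(url, attempts=5, timeout=2.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries; let the caller degrade gracefully
            # Exponential backoff plus jitter so retries don't stampede.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))

# prices = get_with_retries("http://pricing.internal/api/prices/sku-123")
```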
Another way to look at this is through the CAP theorem. CAP is an abbreviation for Consistency (each node should have the same data at all times), Availability (each node is available for writes at all times) and Partition Tolerance (able to handle network outages). Pick two. Since all applications run on a network, the "P" is required. You're then left to choose whether you want Consistency or Availability. Mode 1 applications favor Consistency, because consistency allows you to use ACID transactions, write synchronous (easy) code, and generally never doubt your data. In mode 2, distributed systems naturally favor Availability. A node can always be written to, somewhere. But the data may not be consistent across nodes. Eventual consistency is the best you can hope for. Which, unless you're building a bank, is probably good enough for most use cases. Though even ATMs are famously eventually consistent, so that isn't 100% true. Very few systems need to be strongly consistent. The world is eventually consistent. Adopting this paradigm takes some adjustment, but take comfort in the fact that many of the world's largest commerce platforms, like Amazon.com and eBay, are mostly eventually consistent.
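The following toy example illustrates the trade-off: two replicas stay available for writes during a partition and converge afterward via last-write-wins reconciliation. This is a deliberately simplified model of eventual consistency, not how any particular datastore implements it.

```python
# Two replicas that always accept writes (Availability) and converge
# later via last-write-wins timestamps (eventual Consistency).
import time

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}   # key -> (timestamp, value)

    def write(self, key, value):
        self.data[key] = (time.time(), value)   # always writable

    def read(self, key):
        ts_value = self.data.get(key)
        return ts_value[1] if ts_value else None

    def sync_from(self, other):
        # Anti-entropy: adopt the other replica's newer versions.
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)

us, eu = Replica("us-east"), Replica("eu-west")
us.write("cart-42", ["sku-123"])   # written on one side of a partition
print(eu.read("cart-42"))          # None -- replicas disagree for now
eu.sync_from(us)                   # partition heals, state is exchanged
print(eu.read("cart-42"))          # ['sku-123'] -- eventually consistent
```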
Persistent -> Ephemeral
Almost as crucial as the move from centralized -> distributed is the shift from persistent -> ephemeral. Mode 1 applications are characterized by persistence. Mode 2 applications are characterized by ephemerality. If you're running a mode 1 application on infrastructure that you control, and that you assume to be available and unchanging, it is possible to statically build out your environment and persist data or state locally. You can have all sorts of named singletons that will be highly available. You can hard code ports in firewalls. You can hard code IPs in your configuration. You can store state locally on each VM or JVM. You can store your configuration locally. You can have someone build out each environment by hand. Why automate the build-out of an environment if it's never going to change? An application can take tens of minutes to start up, because it is presumed to live for weeks between restarts. Individual servers live for years. Everything is static, and in the world of mode 1 applications, it generally works OK, assuming nobody touches anything.
Mode 2 applications are the complete opposite of mode 1 applications. Again, these applications are deployed to containers, which may live for just a few seconds. The applications are generally consumer-facing, which leads to enormous variability in traffic. Consider the daily traffic of Gilt, an online-only eCommerce vendor built around flash sales: traffic spikes enormously each day when sales open, then falls back to a small fraction of the peak.
This traffic pattern is extremely common for mode 2 applications. These applications, especially commerce, need rapid elasticity. With elasticity comes ephemerality. Each VM or container needs to be spun up automatically through auto-scaling, and the software needs to be automatically installed on it within seconds (no more doing anything by hand!). VMs or containers will be automatically killed off when the traffic drops. Because of this ephemerality, you can't use any of the mode 1 constructs. You can't hard code anything, because the IPs, ports, etc change so frequently. You can't store state or configuration locally, because any given VM or container could be killed at any time. State and configuration need to be pushed down to cloud services that are built for this, as in the sketch below. Mode 2 applications need to start in seconds, so that they are available to quickly handle rapid spikes in traffic. VMs, and increasingly containers, are short-lived by definition. Google Cloud even offers "Preemptible VMs," which allow Google to terminate VMs without notice in exchange for a deep discount. Ephemerality is almost the very definition of public cloud. For another data point, have a look at what Netflix built years ago: the Simian Army. Netflix has built a whole fleet of services that run in production whose sole goal is to shut down VMs (Chaos Monkey), availability zones (Chaos Gorilla), or even an entire region (Chaos Kong). If you're a developer building a microservice at Netflix, you have to build your software knowing that any VM, availability zone, or region could be killed in production at any time.
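Here's a minimal sketch of what externalizing state and configuration looks like in practice: configuration is injected via environment variables, and session state lives in a shared Redis, so any instance can vanish at any moment without losing anything. The host names are hypothetical.

```python
# Nothing lives on the ephemeral instance: config comes from the
# environment (injected by the platform) and state lives in shared Redis.
import os

import redis

# Hypothetical defaults; the orchestrator injects real values at startup.
REDIS_HOST = os.environ.get("REDIS_HOST", "sessions.internal")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))

sessions = redis.Redis(host=REDIS_HOST, port=REDIS_PORT)

def save_cart(session_id, cart):
    # cart is a dict of sku -> quantity; it lives in the shared store,
    # so a replacement instance sees exactly the same cart.
    sessions.hset(f"session:{session_id}", mapping=cart)

def load_cart(session_id):
    return sessions.hgetall(f"session:{session_id}")
```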
Having every VM or container presumed to be ephemeral significantly changes ongoing ops. If there's ever a problem with a single VM or container, the system should kill it off automatically, without administrator involvement. If required, the auto-scaling mechanism will add another one. The issue was probably some type of rare hardware or software fault. If VMs or containers continue to have problems, the problem is likely due to how the software is installed and configured. For example, some code path may be trying to read a local file that has bad permissions. In that case, it makes sense to fix the actual problem at the source and then re-deploy the entire environment. Under no circumstances does it make sense for individual admins or developers to fix problems on a one-off basis. Whatever scripts or container-based approach is used to deploy and configure software should be responsible for pushing changes. Humans make mistakes. This is all in direct opposition to mode 1 applications, where humans are solely responsible for installing and configuring software, making ongoing changes, fixing problems on individual VMs or containers, etc.
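One common way to let the platform do the killing automatically, sketched below, is to expose a health endpoint that an orchestrator (a Kubernetes liveness probe, for example) polls. A failing instance reports itself unhealthy and gets killed and replaced, with no human involved. The dependency check is stubbed out here.

```python
# A health endpoint for an orchestrator to poll. A non-200 response
# means "kill and replace me" -- no admin logs in to fix anything.
from flask import Flask, jsonify

app = Flask(__name__)

def dependencies_ok():
    # e.g. can we reach our datastore? (stubbed out for the sketch)
    return True

@app.route("/healthz")
def healthz():
    if dependencies_ok():
        return jsonify(status="ok")
    return jsonify(status="unhealthy"), 503

if __name__ == "__main__":
    app.run(port=8080)
```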
Another way that ephemeral VMs or containers change ops is patching. Instances of mode 1 applications are presumed to always be available. Singletons, especially, cannot go down. Therefore, patching has to happen in place, online, without downtime.
The new mode 2 model is to build a new VM or container with the patches applied and deploy it alongside the existing application, using some kind of blue/green deployment methodology. Cloud is precisely what enables blue/green-type strategies: spin up a parallel copy of your production environment for a few minutes and pay for only those few minutes of additional capacity. With mode 1 infrastructure, you simply cannot spin up a whole copy of a production environment for a few minutes. This style of patching eliminates downtime and significantly reduces risk by allowing immediate failover to the old environment. It also simplifies the code, because online patching is very difficult to do.
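The whole blue/green cycle is short when written out. This sketch stubs out the provisioning and routing calls, which in reality would hit your cloud provider's APIs; the host names are hypothetical.

```python
# Toy blue/green deployment: stand up "green" next to "blue", health
# check it, then flip the router. Rollback is flipping the pointer back.
import requests

def provision(version):
    # Stub: really, spin up VMs/containers for `version` and return
    # their load balancer target, paying only for the minutes they run.
    return f"http://{version}.internal"

def healthy(url):
    try:
        return requests.get(f"{url}/healthz", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def point_router_at(url):
    print(f"router now sends production traffic to {url}")  # stub

blue = "http://v1.internal"   # current production
green = provision("v2")       # patched copy, deployed alongside

if healthy(green):
    point_router_at(green)    # cut over with zero downtime
else:
    point_router_at(blue)     # instant rollback: blue never went away
```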
In mode 1 applications, the infrastructure is manually set up and configured. It is not repeatable or automated. Code is manually deployed. In mode 2 applications, everything is code. Hardware resources are provisioned through code. Code is deployed automatically, using code. Even containers are constructed using code. Everything is code. New code deployments with mode 2 are as simple as re-generating the VM or container, but with new code. Immutable/disposable VMs and containers are the norm with mode 2 applications. With mode 1, you're stuck deploying new code by hand. Automation through code is a major theme of mode 2 applications.
A larger issue is that of singletons. Mode 1 applications are full of singletons. A good analogy bouncing around the internet is the concept of pets vs. cattle. Singletons in mode 1 applications are analogous to pets. If they get ill, you take them to the vet and spend a lot of money fixing them. If they go missing, you start calling for them by name. If they die, you'll be very sad. They are named individuals with unique personalities. The VMs and containers behind mode 2 applications should be treated more like dairy cows. They are indistinguishable commodities, bought and sold on the open market. Their output (milk) is also commoditized. You don't know their names. All you care about is the aggregate milk output of the herd. If you can get your application (or each microservice) down to one binary with no singletons, it makes life so much easier. Rather than naming an individual server to do something, you can have one server nominated to do something different using some type of leader election protocol like Raft or Paxos, as in the sketch below. You can then kill off any VM or container you please at any time, without concern for what that VM or container happens to be running. With mode 1, you just don't have to deal with this problem: VMs last forever. With mode 2, you can't assume VMs or containers will last forever. You have to assume they can be killed at any time.
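Here's what leader election can look like using ZooKeeper's election recipe via the open source kazoo client (ZooKeeper's ZAB protocol plays the same consensus role as Raft or Paxos). The host, path, and job are hypothetical.

```python
# Elect one interchangeable instance to run a singleton job (say,
# nightly order reconciliation) instead of naming a special server.
import socket

from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper.internal:2181")
zk.start()

def do_singleton_work():
    # Only the elected leader runs this; if the leader's container is
    # killed, another identical instance is elected and takes over.
    print(f"{socket.gethostname()} is the leader; running reconciliation")

# Every instance runs the same binary and calls the same code.
# Leadership is decided by the coordination service, not by config.
election = zk.Election("/commerce/reconciler", socket.gethostname())
election.run(do_singleton_work)   # blocks until elected, then runs the job
```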
Fault Intolerance -> Fault Tolerance
Mode 1 applications are not designed for fault tolerance, because the infrastructure under them is assumed to be reliable. That's why you see software vendors, hardware vendors, and hosting vendors pitching their 99.999999% uptime SLAs. In practice, we all know that the infrastructure isn't reliable. But the software designed for mode 1 systems is generally brittle and not really built for fault isolation or fault tolerance. If some part of the infrastructure breaks, the entire system goes offline. It's atomic: either everything works or nothing works. This extends to the commerce applications themselves, which tend to be very large and very monolithic. If you have a memory leak in a remote corner of your monolithic commerce application, your entire application will go down. Monolithic applications are brittle. Mode 2 applications assume that the infrastructure is unreliable. That's why you see extreme loose coupling (asynchronous messaging is key), eventual consistency, and a level of resilience built into the software that you don't see with mode 1 applications. This resilience extends to the design of the applications themselves. Mode 1 applications are generally single monoliths, whereas mode 2 applications are distributed microservices applications. Microservices generally communicate with each other asynchronously over reliable messaging. Each microservice is typically built in isolation, with no hard dependencies on other microservices or systems, except for the datastore it uses. Microservices by definition are designed for fault tolerance. If there's a problem with one microservice, the rest of the application generally continues to run without a problem.
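One widely used fault-isolation pattern is the circuit breaker: after a few consecutive failures, stop calling the ailing service for a cooldown period and serve a fallback instead, so one sick microservice can't drag down the whole application. A toy version, with a hypothetical recommendations service:

```python
# A toy circuit breaker: trip after `threshold` consecutive failures,
# fail fast during `cooldown`, then try the service again.
import time

import requests

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, url, fallback):
        if self.opened_at and time.time() - self.opened_at < self.cooldown:
            return fallback   # circuit open: fail fast, degrade gracefully
        try:
            resp = requests.get(url, timeout=1.0)
            resp.raise_for_status()
            self.failures, self.opened_at = 0, None   # healthy: close circuit
            return resp.json()
        except requests.RequestException:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()           # trip the breaker
            return fallback

recommendations = CircuitBreaker()
# If the recommendations service is down, the product page still renders.
recs = recommendations.call("http://recs.internal/api/for/cart-42", fallback=[])
```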
Mode 1 applications are generally served out of one data center at a time. They are often deployed to two data centers in an active/passive configuration, but only one data center at a time may actually serve the application. This is because mode 1 applications generally write to exactly one ACID-compliant, central database. You can't have two databases separated by tens or hundreds of milliseconds of latency, each with end users updating the same records at the same time. Mode 1 applications also nearly universally use relational databases, with an object-relational mapping (ORM) system on top. Specialized products may replicate the data between systems, but conflicts may not be handled correctly, which often results in corruption at the ORM level. Mode 2 applications are almost always deployed out of two or more data centers, often with some latency separating the data centers. Again, mode 2 applications fundamentally assume that the infrastructure (including entire data centers, availability zones and regions) is unreliable. Because the infrastructure for mode 2 applications is distributed within a data center, it's often very easy to distribute those same workloads across data centers, availability zones and regions. Global Server Load Balancers are then used to watch the health of each data center in real time and forward end-user requests to the appropriate data center.
VMs -> Containers
In the 1990s, physical machines ruled the world. All applications (with few exceptions) were deployed to physical machines that each ran a few applications. As Moore's Law continued, the utilization of individual servers began to plummet. Enter VMware and a wave of VM-level products. They allowed you to pack a bunch of "virtual" machines onto one piece of hardware. This helped a bit with hardware utilization, but it exacerbated the operational problems: rather than patching 100 operating systems, admins now had to patch 1,500 virtual machines. And because physical resources were pre-allocated to each VM, there was no opportunity to share resources across an entire physical machine. Overall utilization improved, but each VM was still under-utilized, primarily due to over-sizing. VMs are expensive to start up (tens of minutes), do not have very portable networking, have a lot of overhead (running a whole kernel for each guest is expensive), and suffer from performance problems. VMs have fundamental issues.
Enter containers. In 2013, Docker took the world by storm. Docker is now the de facto tool used to both package applications and provide partitioning between different applications on the same physical host. Application packaging is the #1 driver behind the adoption of Docker. You can declaratively build a Docker image using a Dockerfile. You can also pull container images (containing just a base OS), make changes, and persist just the diffs. Containers are perfect for mode 2 applications because they make it easy to build (with code) an immutable container image. You can pull diffs from your fellow developers, push changes as diffs, publish images to your corporate repository, and more, all through the magic of copy-on-write file systems. Over just the past two years, Docker's packaging mechanism has quickly overtaken traditional configuration management (Puppet, Chef, Ansible, etc) tooling. Docker images are now presumed to be immutable. Every patch, code deploy, or configuration change results in new container images being deployed and the old ones being trashed. It's an extremely compelling approach to many of the issues we've been facing as an industry.
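As a sketch of what "immutable images, built with code" means, the following uses the Docker SDK for Python to write a declarative Dockerfile, build a tagged image, and run it. Every change produces a new tagged image rather than a patch to a running container. The service name is hypothetical.

```python
# Build and run an immutable image entirely from code.
# Assumes a local Docker daemon and the `docker` Python SDK.
import pathlib

import docker

DOCKERFILE = """\
FROM python:3.11-slim
COPY app.py /app/app.py
CMD ["python", "/app/app.py"]
"""

# Make the sketch self-contained: a trivial app plus its Dockerfile.
pathlib.Path("app.py").write_text('print("hello from an immutable image")\n')
pathlib.Path("Dockerfile").write_text(DOCKERFILE)

client = docker.from_env()
image, _logs = client.images.build(path=".", tag="inventory-service:1.0.1")

# Deploying a "patch" means running a new immutable image and trashing
# the old containers -- never editing anything inside a live container.
container = client.containers.run("inventory-service:1.0.1", detach=True)
print(container.short_id)
```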
Unlike VMs, containers do not force you to pre-allocate the resources assigned to individual workloads. And the overhead of each container is very small. This allows containers to be packed very densely onto a single host, so you can achieve as high a density as you're comfortable with. All hardware resources are over-subscribed. The cost savings can be substantial.
Mode 1 applications can run on containers or VMs, but mode 2 applications almost require the use of containers. Containers can be started in just a few seconds. They are extremely portable, especially as far as networking goes. They are light and easy for any team to start using immediately. Mode 2 applications don't technically require containers, but life is easier with them. Mode 2 applications and containers have risen together, each fueling the other's popularity.
Projects -> Products
This one is more of a cultural shift, but it is very important to the adoption of mode 2 applications. Most organizations that have adopted legacy commerce applications are organized into large, horizontal layers. There's a development VP who manages hundreds of developers. There are factions within development that write infrastructure-level code, business-level code, and UI-level code. There's an ops VP who manages hundreds of ops people. There are factions within ops that administer databases, storage, networks, security, etc. The development VP and the ops VP roll up to the CIO. On top of that, there are line-of-business owners who work with the CIO to get their applications built. Individual employees are organized around projects of varying length. These different layers primarily communicate with each other through ticketing. Developers write their code and hand it off to ops for deployment to production and ongoing production support in perpetuity. This model is perfect for mode 1 ("Traditional, emphasizing scalability, efficiency, safety and accuracy"). All interactions between teams are carefully documented (cover-your-ass syndrome is alive and well within corporate America). What you lose with this approach is time to market. Projects often take years to complete. Many employees spend their days in ticketing hell. It works, but it's slow.
Mode 2 emphasizes time to market. To get to market faster, organizations need to re-orient around products. They need to think of each application or individual microservice as a product, and approach it the way a startup would. In that model, you build small teams of people with cross-functional skills who design, develop, deploy and maintain that code in production in perpetuity. Each small team reports to a single manager who acts as the general manager for that one product. Members of each team are promoted/paid/fired based on the performance of their product. Each product team is responsible for picking the technology, products, and methodologies that are best for implementing the application or microservice they're tasked with. Each product team will acutely feel the pain of its decisions if they prove to be poor. Under the old mode 1 model, developers weren't the ones called at 3 AM to fix production outages. With the new mode 2 model, there's far more accountability.
Fragmentation is the natural consequence of this bottom-up decision making. Fragmentation leads to unnecessary cost through reduced buying power, more time spent operationalizing, and reduced re-use. But the major overall advantage is time to market. Mode 2 applications are 100% focused on time to market, and fragmentation removes the barriers to getting work done. With those barriers removed, each team has near-complete ownership of its destiny. It's hard to blame other teams for your team's failures.
Conclusion
Building a mode 2 commerce application is now the standard, and with all of the new technology available, it is absolutely achievable. Just understand that mode 1 and mode 2 are entirely different in terms of goals, architecture and technology. At some point you'll have to rewrite your mode 1 commerce application from scratch, or begin to extend it with mode 2 microservices.
If you'd like to learn more, have a look at a new O'Reilly book we just published on the intersection of microservices + commerce.
About commercetools
We at commercetools offer an extensive and flexible collection of cloud-hosted commerce APIs that you can consume a la carte. You build the commerce microservices you want to differentiate on and buy the rest of the commerce functionality a la carte from commercetools, much like Lego blocks. If you'd like, commercetools' APIs can even power your entire platform. commercetools offers extensive business user tooling to make it friendly for enterprises. We are 100% cloud-based, running on a modern stack consisting of Scala, MongoDB and other cloud native technologies. Much of our software is open source on GitHub. We take pride in being the most representative mode 2 commerce application on the market.
Drop us a line if you'd like to discuss extending or replacing your old mode 1 commerce applications with a more modern mode 2 microservices-style commerce implementation. We'd be happy to hear from you.