Ten Common Traits Among Successful Adobe Experience Manager (AEM) Implementations
Jayan Kandathil
(DevOps) Platform Engineering, Managed Services and Cloud Operations (AWS and Azure)
My work with Adobe Managed Services over the past eight years has afforded me a unique perspective on many of the more than eight hundred CQ5 and AEM6 implementations, SITES as well as ASSETS and FORMS.
I see some clear patterns in what makes an AEM implementation successful over the long term, after the HIGH-FIVEs post go-live, and long after the crowds have gone home.
1) Healthy paranoia about a possible site crash on go-live, which drives an emphasis on realistic load testing by a knowledgeable team, with clearly defined success criteria: the peak expected request load per minute, and a threshold for what constitutes acceptable response time at that load.
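Success criteria like those in trait 1 can be written down as an automated pass/fail check. The following is only a minimal sketch under assumptions that are not in the original: the function names are hypothetical, the 95th percentile is one possible choice of response-time measure, and all numbers are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_criteria(response_times_ms, requests_per_minute, target_rpm, max_p95_ms):
    """True if the test reached the target load AND stayed fast enough.

    target_rpm and max_p95_ms are the success criteria agreed on
    before the load test (hypothetical parameter names).
    """
    reached_load = requests_per_minute >= target_rpm
    fast_enough = percentile(response_times_ms, 95) <= max_p95_ms
    return reached_load and fast_enough
```

For example, `meets_criteria(times, 6000, 5000, 500)` passes only if the run sustained at least 5,000 requests per minute and 95% of the sampled responses completed within 500 ms. Requiring both conditions guards against a test that looks fast simply because it never reached peak load.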
2) Long-term, trusted relationship with a capable, knowledgeable customizer partner who in turn keeps their employee churn to a minimum. It is suicidal to end the customizer’s contract 3-6 months after go-live. Who will fix that custom component when you need to upgrade to the next version of AEM? Who will fix the code when a previously unknown security flaw is reported? For more on this, see the blog post by James Lelyveld of Adobe partner AKQA.
3) Regular (weekly or monthly) cadence for code updates that can include both functionality improvements and bug fixes. Give each of these sprints a theme/name, document new features and bug fixes by Jira ticket number, and perform comprehensive tests in STAGING before the release.
4) Absolute insistence on not performing any changes in PRODUCTION that have not been tested in STAGING first.
5) Use of a CDN. Content Delivery Networks such as Akamai and Amazon CloudFront keep the occasional user request tsunami away from your Dispatcher/Publish instances. Offload ratios of successful implementations exceed 95%. Also, a sophisticated, selective cache invalidation approach works better than massive "CP Code" cache purges, which put sudden, intense load on the origin AEM server instances.
6) Early focus on cacheability of as much content as possible in the Dispatcher. The development team should shoot for at least 80% of the requests to be served out of the Dispatcher caching layer.
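To see how the CDN offload ratio (trait 5) and the Dispatcher cache ratio (trait 6) compound, a rough back-of-the-envelope calculation helps. The sketch below assumes the Dispatcher ratio is measured only on requests the CDN passes through to the origin; the function names are hypothetical.

```python
def offload_ratio(edge_requests, origin_requests):
    """Fraction of requests answered by the CDN without reaching the origin."""
    return 1 - origin_requests / edge_requests


def publish_share(cdn_offload, dispatcher_hit_ratio):
    """Fraction of all user requests that still reach the Publish instances.

    Assumes dispatcher_hit_ratio applies to the traffic that gets
    past the CDN (an assumption, not a statement from the article).
    """
    return (1 - cdn_offload) * (1 - dispatcher_hit_ratio)


# With the two thresholds named above (95% and 80%), roughly 1% of
# user requests would reach Publish.
share = publish_share(0.95, 0.80)
```

Under these assumptions, a burst of 60,000 requests per minute at the edge would translate to around 600 requests per minute on the Publish tier, which is why both caching layers matter.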
7) For ASSETS uploads, the realization that the first bottleneck to upload performance is the user’s own network upload (not download) bandwidth. If you’re uploading from a laptop over WiFi in a conference room where five others are also active on the Internet, large files will take quite a bit of time. To be sure, test it with Speedtest.net. At 30 Mbps upload bandwidth, a 100 MB Photoshop file will take at least 26 seconds to upload (800 megabits ÷ 30 Mbps ≈ 26.7 seconds). Any additional latency introduced by AEM will be on top of this. Also, uploads will take longer if you’re in a city far away from where the AEM server is. If you plan to move 100 GB or larger videos up to or down from the cloud, your upload/download bandwidth would have to be close to 1,000 Mbps for an optimal user experience.
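The arithmetic behind the figures in trait 7 is simple enough to script. This is a minimal sketch that assumes decimal units (1 MB = 8 megabits) and ignores protocol overhead, latency, and any processing time in AEM, so the result is a lower bound; the function name is hypothetical.

```python
def min_upload_seconds(size_mb, bandwidth_mbps):
    """Lower bound on transfer time: megabytes -> megabits, divided by Mbps."""
    return size_mb * 8 / bandwidth_mbps


photoshop_file = min_upload_seconds(100, 30)       # about 26.7 seconds
large_video = min_upload_seconds(100_000, 1_000)   # 800 seconds, i.e. over 13 minutes
```

Even at 1,000 Mbps, a 100 GB video needs more than 13 minutes on the wire, which is why the network should be ruled out before AEM is blamed for slow uploads.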
8) For the initial bulk ASSETS ingestion, a detailed breakdown of the content by type and nuanced treatment of the various asset types. PDF and Word documents with large amounts of text will bloat the search index. Maybe you don’t need thumbnails generated for each page of every PDF document? Videos require special treatment, possibly with AEM Dynamic Media. Large TIFF files (>1 GB) also require special attention.
9) Detailed analysis of user and custom component search (query) behavior. The developers should look for node traversals in the error.log, and ensure that all required search indexes are created. Query-heavy sites should consider off-loading search to dedicated Apache Solr instances.
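Looking for node traversals in the error.log, as trait 9 suggests, can be scripted. The sketch below assumes the traversal warning contains the phrase "Traversed N nodes" (as in Apache Jackrabbit Oak's "consider creating an index or changing the query" warning); verify the exact wording against your own logs. The function name is hypothetical.

```python
import re

# Assumed warning wording; adjust the pattern to match your error.log.
TRAVERSAL = re.compile(r"Traversed (\d+) nodes")


def find_traversals(lines):
    """Return (node_count, log_line) pairs, largest traversals first."""
    hits = []
    for line in lines:
        match = TRAVERSAL.search(line)
        if match:
            hits.append((int(match.group(1)), line.strip()))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Because the function accepts any iterable of lines, it can be fed an open file object directly, for example `find_traversals(open(path_to_error_log))`. The queries at the top of the result are the first candidates for a new index.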
10) Realization that as the amount of content in the PRODUCTION system grows over time, that content needs to be back-ported to the STAGING environment so that tests in that environment return realistic results. Search performance in many cases depends on the amount of content in the repository. A year or two into the deployment, queries that used to be very performant will start slowing down, causing higher CPU and disk utilization because they now churn through more content than they used to.
Bonus 1) Implementation of strict socket and connection timeouts on every call that goes out to external services. These calls use the same thread pool that AEM's web container (Eclipse Jetty) maintains for responding to incoming connection requests. The default size of the thread pool is 200 threads. If the external service suffers an outage or throttles your calls, your application will quickly exhaust the thread pool with blocked threads unless you implement timeouts. This effectively puts AEM out of business, as the publish instances get dropped out of rotation by the load balancer because they stop responding to health checks.
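The timeout pattern itself is language-neutral. AEM code is written in Java, so the sketch below, which uses only Python's standard library, illustrates the idea rather than being AEM code; the function name and the 2-second default are hypothetical. The point is that every outgoing call gets an explicit upper bound, so a hung service costs a thread seconds instead of holding it indefinitely.

```python
import urllib.request


def call_external(url, timeout_s=2.0):
    """Return the response body, or None if the service fails or is too slow."""
    try:
        # The timeout bounds the connection attempt and each blocking
        # socket read, so the calling thread is released after at most
        # a few seconds per operation instead of waiting forever.
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning `None` lets the caller render a fallback (for example, a component without the external data) instead of tying up a request thread. In a Java-based AEM component, the equivalent is to set both the connect timeout and the socket (read) timeout on whichever HTTP client the code uses.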
Senior Developer, Campaign, Adobe
8 years ago: Great list, Jayan, and most of it applicable to all significant enterprise software projects.
VP - Retail Applications Senior Product Manager at Truist
8 years ago: Some very good advice that I have seen not followed in the past. Great list for AEM users to keep handy. Evaluating content is key!
Principle Consultant at Adobe
8 years ago: So true!