Navigating ChromeDriver crashes in Kubernetes: A tale of test automation resilience
Ministry of Testing
Ministry of Testing is where software testing professionals grow their careers.
Overcome ChromeDriver crashes and resource limitations by testing on Kubernetes
By Dan Burlacu
In my day-to-day work as a Software Developer in Test, I was tasked with developing automated UI (user interface) tests for a complex web application with dynamically generated content. This meant that the web elements I needed to interact with often lacked static attributes that could be easily referenced, so I had to use more involved strategies to locate and interact with them. After some element-locating gymnastics to make sure I could click the right buttons and access the right iframes, the tests were complete.
The web application runs as a separate K8s (Kubernetes) instance for each client company, each on a different K8s cluster, with the resources for that particular instance grouped inside namespaces. The UI tests are automatically triggered prior to a major version update of the web application, and again immediately after, to make sure that the update has not negatively impacted its usability. This was achieved by containerizing the UI test code in a Docker image, pushing it to an organisational repository, and using a K8s Job to deploy the tests to the specific instance's namespace whenever an update was pending, and again immediately after it.
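As an illustration of that setup (a sketch only, not the actual manifest; the image name, namespace, and resource values are placeholders), a minimal Kubernetes Job for one such test run could look like this:

# Illustrative sketch: a minimal Job that runs the containerized UI tests
# in one client instance's namespace. Names and values are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: ui-tests-pre-upgrade
  namespace: client-instance-namespace
spec:
  backoffLimit: 0                # do not blindly retry a failed test run
  ttlSecondsAfterFinished: 3600  # discard the Job an hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ui-tests
          image: registry.example.org/qa/ui-tests:latest
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"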
Being deployed on hundreds of instances, the Docker image needed to be light, so the tests ran in a Linux environment. Running on Linux with no display support meant that the tests couldn't open a normal browser window and instead had to use the browser's headless mode. All development and testing is done using the Chrome web browser company-wide, so naturally I used ChromeDriver to drive the Chromium browser that is set up when the container is built. A server that kept track of the upgrade schedule was used to trigger the tests for a particular instance of the web application, and the tests reported a JSON summary of the results back to the server.
Starting your journey
The first attempt at running the tests in a containerized fashion, using Docker Desktop, ended with the following error:
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/lib/chromium/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Double-check your web results
Upon searching for the error online1, I came across a solution that seemed to work: adding the --no-sandbox flag to the WebDriver options. This seemed fine, my tests worked, and I could have moved on, but I wondered what that flag actually does and whether using it could negatively impact the tests or the environment they run in.
It turns out that “The sandbox removes unnecessary privileges from the processes that don't need them in Chrome for security purposes. Disabling the sandbox makes your PC more vulnerable to exploits via web pages, so Google doesn't recommend it.” This explanation was found here1, and the flag appears to be needed to run Chrome headless on Unix-like systems. Given that my tests ran in an ephemeral K8s Job that is discarded some time after it completes, I did not need to worry about the security implications for the environment.
This, however, introduced a new issue when I tried to run my tests from my development environment, which is not discarded after use and runs Windows 11 Pro, version 22H2. In about 50% of cases, after the tests had run and the WebDriver.quit() line was executed, two Chrome background processes kept lingering. These needed to be killed manually from Task Manager, or else running the tests multiple times would push the CPU usage to 100%, making the PC unusable without a restart. The testing community on the world wide web again came to the rescue, as this problem is documented in this GitHub issue2.
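Purely as an illustration of that manual clean-up (a blunt workaround sketch, not the fix discussed in the issue), the leftover processes can be killed from the test tear-down on Windows:

import subprocess
import sys

def kill_lingering_chrome():
    """Force-kill leftover Chrome/ChromeDriver processes after WebDriver.quit().

    Illustrative only: this kills ALL Chrome processes, including any browser
    windows the user has open, so it is only suitable for dedicated test runs.
    """
    if sys.platform.startswith("win"):
        for name in ("chrome.exe", "chromedriver.exe"):
            # /F = force, /IM = match by image name, /T = include child processes
            subprocess.run(["taskkill", "/F", "/IM", name, "/T"],
                           capture_output=True, check=False)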
Try running in the final environment
All seemed to work fine on Docker Desktop, so I could have published the final image with the testing code and been done with it, chanting the infamous “Works on my machine”. I was even delivering “my machine”, that is, the container environment where the tests worked, to the location they were supposed to run in, the K8s cluster, so no issues there, right? Wrong.
When running the tests in the environment they were actually meant for, namely K8s, only some tests from the whole suite executed correctly, and all the others failed.
Be creative
Having already worked on these tests for quite a long time, I had to deliver. With no time to investigate further why the UI tests failed without any apparent set-up error, I got the idea of thinking outside the company box and using Firefox. That solved the issue, and the tests finally worked in the environment they were supposed to. The code was delivered and it did what it was supposed to do, but the question remained: why did the UI tests fail in a K8s environment?
Be curious
After delivering the code, I could have patted myself on the back for a job well done and closed the chapter on the subject, but the question of why some of the tests worked on Firefox and not on Chrome kept bugging me, so I started investigating, this time without the pressure of delivering results.
For debugging purposes, I set the Dockerfile to run a sleep infinity command after creating the container, instead of the usual command that triggered the UI tests. This kept the container alive so I could execute commands inside it to run the tests. It also provided a way to transfer files between the K8s pod that was running my tests and my local machine. The Dockerfile I used for debugging was essentially the normal test image with the start command swapped for sleep infinity.
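A rough sketch of that idea, assuming an Alpine base (which matches the /usr/lib/chromium/chrome path from the earlier error) rather than reproducing the original file, could look like this:

# Illustrative sketch of a debugging image, not the original Dockerfile.
FROM python:3.12-alpine

# Chromium and its matching ChromeDriver from the Alpine package repositories
RUN apk add --no-cache chromium chromium-chromedriver

WORKDIR /tests
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# For debugging: keep the container alive instead of launching the test suite,
# so commands can be run inside it and files can be copied in and out.
CMD ["sleep", "infinity"]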
I wanted to see how the UI looked when the tests failed, so I wrote some logic to take screenshots on test failure. As I could not view the images in the Linux server environment inside the K8s pod, I downloaded the files to my PC. It turned out that during one of the failing tests, the web page was trying to load an iframe, and the iframe did not load when the page was accessed from inside the K8s pod. This issue was not encountered when running the tests headless from localhost, nor when running them from inside a Docker Desktop container based on the same image used in the Kubernetes environment. I now knew why the tests had failed: the iframes did not load. But what could be the cause of that?
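The screenshot-on-failure logic can be as simple as a decorator around each test; the sketch below is an illustration only and assumes each test exposes the Selenium WebDriver as self.driver:

import functools
from datetime import datetime

def screenshot_on_failure(test_func):
    """Save a screenshot when the wrapped test raises, then re-raise.

    Illustrative sketch: assumes the test class exposes the WebDriver as
    self.driver and that the /tests/screenshots directory exists.
    """
    @functools.wraps(test_func)
    def wrapper(self, *args, **kwargs):
        try:
            return test_func(self, *args, **kwargs)
        except Exception:
            name = f"{test_func.__name__}_{datetime.now():%Y%m%d_%H%M%S}.png"
            self.driver.save_screenshot(f"/tests/screenshots/{name}")
            raise
    return wrapper

With the debugging container kept alive, the saved images can then be copied off the pod with a command such as kubectl cp <namespace>/<pod>:/tests/screenshots ./screenshots.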
Be methodical
I thought about the processes going on during the tests and imagined what could go wrong that would stop the iframe from loading. The first thing that came to mind was that the browser could not reach the iframe URL from inside the K8s pod. I got the iframe URL from the DevTools on my local browser and pinged it from inside the K8s pod, and it responded fine. This ruled out a connection issue.
If the browser reached the iframe, maybe the WebDriver wasn't handling the response correctly, so some logs would be helpful. I saved the WebDriver logs and went through them line by line, comparing them to the WebDriver logs I got from the Docker Desktop container. They were identical, so the WebDriver's handling of the connection to the iframe was not the problem.
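For reference, verbose ChromeDriver logs can be captured through the driver service; a minimal sketch, assuming Selenium 4 and ChromeDriver's standard --verbose and --log-path switches:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless=new")

# Write a verbose ChromeDriver log so runs in different environments
# can be diffed line by line.
service = Service(service_args=["--verbose", "--log-path=/tests/chromedriver.log"])
driver = webdriver.Chrome(service=service, options=chrome_options)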
Was it a problem in the way the browser handled displaying the information it got from the iframe? Were there any problems in the DevTools console logs? I wrote some logic in the code to fetch the DevTools console logs and save them to a file while running the tests. When going through the file, I found the culprit:
{'level': 'SEVERE', 'message': 'https://www.example.com - Failed to load resource: net::ERR_INSUFFICIENT_RESOURCES', 'source': 'network', 'timestamp': 1704886353348}
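The log-collection logic mentioned above can be sketched as follows (an illustration, assuming Selenium 4 and Chrome's goog:loggingPrefs capability):

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless=new")
# Ask ChromeDriver to capture the browser console output.
chrome_options.set_capability("goog:loggingPrefs", {"browser": "ALL"})

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.example.com")

# Dump every console entry (including SEVERE network errors) to a file.
with open("devtools_console.log", "w") as log_file:
    for entry in driver.get_log("browser"):
        log_file.write(f"{entry}\n")

driver.quit()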
It seemed like the browser did not have sufficient resources to load a simple iframe? I checked the resources allocated to the pod, and they were more than enough. I then turned to Stack Overflow for answers, which pointed to a 2011 Chromium bug that, on Linux, makes the browser hit a memory cap3. Still no solution in sight, as the bug was not resolved and the last reply was from 2018.
Be resilient
Not wanting to give up, I did some research into what each of the Selenium ChromeOptions means, to see if one of them could be the answer to my problem. I stumbled upon the --disable-dev-shm-usage option in this post4, which stated that “The /dev/shm partition is too small in certain VM environments, causing Chrome to fail or crash. Use this flag to work-around this issue (a temporary directory will always be used to create anonymous shared memory files).” It appears to be related to another Chromium bug, as linked in the post. The way the Chrome browser was using resources in the K8s environment was different from the way it used them in the Docker Desktop app and on my local machine.
Be successful
I initialised my ChromeDriver with this flag and ran the tests. After such a long journey, it finally worked. I was now running UI tests inside a Chromium browser in a Linux environment on a K8s cluster. I changed the code back to using the Chromium browser for running the tests and deployed the new image to the company container registry. I then checked for an instance that had an upcoming update and waited for the test results to roll in. Everything worked, and this journey brought a sense of accomplishment and enriched my testing experience.
To wrap up
If you ever feel the gentle nudge of the nagging question “Why?”, take the time and resources to pursue an answer. It might take you where few others have gone before and prove to be an advantage in navigating the ever-changing world of software testing. If you do find something interesting, take the time to share it with others, because together we test the world, one bug at a time.
For those facing the challenge of testing within a K8S environment, be sure to use the two options that made all this work possible:
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
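Putting it together, a minimal working driver setup along the lines described in this article could look like the sketch below; the headless flag and the Chromium binary path are assumptions based on the environment described above, so adjust them to your own image:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless=new')            # no display inside the container
chrome_options.add_argument('--no-sandbox')              # needed to start Chrome inside the container
chrome_options.add_argument('--disable-dev-shm-usage')   # work around the small /dev/shm in pods
# Path as reported in the earlier error message; adjust to your image.
chrome_options.binary_location = '/usr/lib/chromium/chrome'

driver = webdriver.Chrome(options=chrome_options)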
Navigating ChromeDriver crashes in Kubernetes was a journey of test automation resilience. Tasked with developing UI tests for a dynamically generated web app, the initial containerized runs in Docker Desktop hit Chrome crashes; these were resolved with the --no-sandbox flag, while lingering processes on Windows required further community solutions.
Deploying the tests in Kubernetes revealed further failures, prompting the adoption of Firefox as a workaround. Curiosity drove a post-delivery investigation that uncovered iframe loading issues. Resource checks, bug searches, and ChromeOptions exploration led to success with --disable-dev-shm-usage, ensuring the UI tests run smoothly in Kubernetes.
This journey underscores the importance of curiosity, resilience, and resourcefulness in overcoming testing challenges, with lessons shared for navigating similar hurdles.
References:
For More Information:
Read Dan's article and others on many testing topics over at the Ministry of Testing site.
"MoT has enriched my life for many years now! I'm always meeting awesome people, and always learning. #30DaysOfTesting, The Club, all the amazing Pro content, the learning opportunities and the support from the community are endless." - Lisa Crispin
"Ministry of Testing is the one-stop shop for everything testing." - Ashutosh Mishra