A while back I gave a talk (in Bulgarian) about the use cases for debugging with strace that I ran into at my previous job, and I was asked to translate it into English. I rarely see anyone around me use strace, and since these cases come straight from my own practice, I think it is worth writing them down in English as a reference to my lecture, even though I don't use the tool very often in my current working environment. Furthermore, this is the information I was looking for 10 years ago: I knew strace was an incredible tool, but I didn't know what to use it for. One more motivation (the least important one) is to keep a track record of my career path in case I need it at some point.
In short, I use this tool when I can't see any specific system or app errors and want to understand what is happening under the hood while a command executes or while a process is running.
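There is a difference between running (for example) php index.php by hand (especially if there is a chroot) and running strace -f -s 9999 -p PID -o output_file.txt, where PID belongs to nginx, php, fpm or another process. The first executes the script from the command line; the second attaches to the process that actually serves the request and records the system calls it makes. I don't even work with PHP right now, but it features in the examples below. These examples may give you an idea of how strace could be useful to you.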
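
The two ways of starting a trace can be sketched as small shell helpers. This is a minimal sketch rather than the exact commands from the talk: the function names trace_command and trace_pid are made up for illustration, and output_file.txt is simply the file name used above.

```shell
# Sketch only: trace_command and trace_pid are hypothetical helper names.

# Run a command under strace from start to finish.
#   -f        follow child processes created by the traced process
#   -s 9999   print strings up to 9999 characters instead of truncating them
#   -o FILE   write the trace to FILE instead of stderr
trace_command() {
    strace -f -s 9999 -o output_file.txt "$@"
}

# Attach to a process that is already running (nginx, php-fpm, ...).
# $1 is the PID of that process.
trace_pid() {
    strace -f -s 9999 -p "$1" -o output_file.txt
}
```

For example, trace_command php index.php traces the script from the command line, while trace_pid 1234 (1234 being a placeholder PID) attaches to a live process until you stop it with Ctrl-C. Attaching usually requires root or the same user as the traced process.
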
- Mails were not being sent through a newsletter app. No system or app logs were recorded anywhere. Tracing the PHP process spawned by the app revealed the error "Could not execute sendmail. Malformed address". This error showed me that the app had added a quotation mark (") to the mail address, which made the address invalid. That helped me tell the customer where the problem in their app was and that it was not system specific.
- Website pictures were not displayed when the page was opened. An strace of the PHP processes spawned by the app uncovered wrong paths to the pictures folder in the application itself. A good example of this is a migration from Windows to Linux, where the path separator changes from \ to /.
- Troubleshooting slow websites that were migrated from one server to another. In my practice I've seen this caused by a wrong port for the object cache (Redis or Memcached) configured in the application: the app still uses the port from the old server, or more accurately, the cache listens on a different port on the new server because it was started there as a new process. Strace shows which IP and port the app is trying to connect to under the hood. The connection attempt times out because it cannot be established, and that is what makes the site slow. Also, if we don't know which file of the app defines the host and port, we can discover that with strace too. Last but not least, once the host and port are known, we can check the network connectivity in case a firewall is restricting them.
- For WordPress, strace was useful to me for discovering which plugin makes the most system calls, although it is not a 100% reliable source of truth.
- A website was copied 1:1 from the original folder to a subdomain as a staging target, after which the new subdomain returned a 502 error. The nginx log showed that the fpm behind nginx was not available, but the interesting part was that this site was not using fpm at all. Thanks to strace, it turned out that a folder in another path, outside the main folder, was missing and also had to be copied. The conclusion is that what the nginx error said and what the actual culprit was were two totally different things.
- Opening a file that contains phpinfo() showed "no input file specified" instead of the phpinfo() output. Tracing the php-fpm processes showed that the info PHP file was missing, although we were able to "stat" it and it was there. Why was strace useful here? It showed that fpm was unable to see the file, not that the file was absent. The cause turned out to be a wrong DocumentRoot set in the php-fpm config file of the vhost.
- To resolve dubious behaviour of a Python application, tracing the nginx processes showed wrong permissions and owners on the nginx cache folders. Permission issues can also be caught easily with strace.
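
Most of the cases above come down to searching the trace file for a few patterns: calls that failed because a file was missing or not accessible, connect() attempts, and paths that keep recurring. Below is a minimal sketch of such searches. It assumes the trace was written to a file with strace -o, the function names are hypothetical, and counting paths under wp-content/plugins/ is only my assumption about one way to approximate the per-plugin comparison, not a precise measurement.

```shell
# Sketch only: the function names are hypothetical. Each function takes the
# trace file written by strace -o as its first argument.

# Calls that failed because a file is missing (ENOENT) or because of
# permissions (EACCES): wrong paths, missing folders, bad owners.
failed_paths() {
    grep -E '= -1 (ENOENT|EACCES)' "$1"
}

# Every connect() attempt, including the address and port the process
# really used (e.g. a stale Redis or Memcached port).
connections() {
    grep -F 'connect(' "$1"
}

# Rough per-plugin ranking for WordPress: count how often each directory
# under wp-content/plugins/ appears in the trace, most frequent first.
plugin_hits() {
    grep -o 'wp-content/plugins/[^/"]*' "$1" | sort | uniq -c | sort -rn
}
```

For example, failed_paths output_file.txt lists the paths a process looked for and could not open, and connections output_file.txt shows where it tried to connect. The trace can also be narrowed at capture time with strace's own filters, such as -e trace=file or -e trace=network.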