Retro & Lessons learned from the time I accidentally opened up Stack Overflow to XSS attacks
It is (or at least used to be) customary that when new developers join Stack Overflow and are given privileges to commit code and run builds, they are told something along the lines of: "you will break Stack Overflow one day, it is ok, it happens to everyone". And then some of the more experienced developers would chime in with tales of the times that they brought the whole network down due to some avoidable error (though honestly, most times that it happens, it is due to DNS).
As a developer at Stack Overflow, I never brought the network down. Though I did break the build on Team City a couple of times (messed up some AssemblyBindings way back when that was a thing) back when all devs could still commit directly to master without having a green PR (which surprisingly wasn't all too long ago). And I have had my share of poorly thought-out queries that exploded on prod (and very much upset our DBRE's, even if they weren't bad enough to take down the site).
But I did do something once that was potentially worse: I introduced a code logic error that opened up an XSS vulnerability for every user on every page of Stack Overflow and all Stack Exchange network sites (including home page and all individual question pages). Beyond what I wrote on Meta Stack Overflow at the time, no public retrospective was ever shared. Now that I have left the company, I'd like to circle around and share more about exactly what happened, what went wrong, and the lessons that I think can be learned from this.
What are XSS attacks?
(Please skip this section if you know this already)
XSS stands for Cross Site Scripting (yes, I know; I didn't make up the acronym). It is defined by OWASP as occurring when:
...an attacker uses a web application to send malicious code, generally in the form of a browser side script, to a different end user... An attacker can use XSS to send a malicious script to an unsuspecting user. The end user’s browser has no way to know that the script should not be trusted, and will execute the script. Because it thinks the script came from a trusted source, the malicious script can access any cookies, session tokens, or other sensitive information retained by the browser and used with that site.
The scripts that are loaded from a website (especially from the same domain) are normally trusted by a browser and are given lots of privileges and access. In an XSS attack, the attacker finds a vulnerability in the website that allows them to load a script to visitors such that the visitor's browsers will think that the script is coming from the site, and can thereby abuse its elevated privileges to do sneaky things.
Any site that displays user-generated data is potentially open to this type of vulnerability, and enforcing measure to prevent it is the responsibility of the site developers.
The most common way for this vulnerability to be manifested is to always display user-input as-is, without any Html encoding. This means that if user input includes any html tags that call for a script to be loaded, they will be included as-is, and the script will be loaded. Html encoding user-generated content forces important elements of tags (like the opening and closing angle brackets in text like <script>) to be output in a way that allows the brackets to be displayed on-screen without the browser interpreting them as being part of the html markup. (See here for more on ways to prevent XSS attacks - and yes, I know that output encoding won't close all XSS vulnerabilities, but it suffices for the case being discussed here).
Incident Timeline
On Wednesday morning, September 1, 2021, I pushed a build for the entire Stack Exchange network containing a fix for a really old bug report (dating back to 2010) to stop converting regular quotes to "SmartyPants quotes" in post titles. Logic had been put into place sometime around 2009-2010 to automatically convert straight single and double-quotes in question titles into their fancier ("curly") variations, and especially on technical sites this could have a negative impact on users copying code directly from titles into their code editors (where straight vs curly quote styles have completely different meanings. At the time (2021) whenever I was on bug duty, I liked to find really old, highly upvoted, forgotten-but-still-relevant bug reports on Meta SO/SE to fix among the more recent issues that often took up most of our attention.
The Pull Request had been tested locally and on a dev site by myself and another dev who reviewed the PR (both of us were Staff Engineers with 10-15+ years of experience). If I recall correctly, it went live with a site setting set for applying SmartyFormatting to titles set to TRUE. I toggled the site setting to FALSE on one site, tested it, and then turned it off for the entire network at 10:02 GMT. Monitored things for a few minutes, added a [status-completed] tag and answer to the original bug report, and went out for a quick errand.
(On a side note and relevant for later, I live and work from Israel, so 10:02 GMT is an hour when the European Devs are mostly awake, but the US-based devs and staff are not, so it is generally a quiet time on the company Slack.)
At this point in time (10:02 GMT), the XSS vulnerability was live, and in theory, a user could exploit it by asking a question that loaded a third-party script via a script tag in the title of the question.
Eighteen minutes later, at 10:20 GMT, a user posted a bug report on Meta Stack Overflow, the initial report of the issue. As they showed there, a question was asked the previous day (and revised a couple of hours earlier) with the title (at the time) of "Simple multi-colored <textarea> based editor" was causing display issues on a tag listing, displaying a tax input inside the title, and stopping the output of the rest of the questions listing:
I was running my errand out of the house while this happened, and wasn't monitoring MSO any longer, wasn't close to my dev machine.
At 10:38 GMT, I received a ping on a Discord Server devoted to discussion about Stack Exchange from user Nick, alerting me to the initial MSO post. Luckily I had the Discord mobile app installed on my phone, and I saw the ping at 10:43 GMT. I loaded the homepage of Stack Overflow and got a "hello world" javascript alert in my mobile browser. Heart skipped a beat. Then I quickly went to our developers-only site settings section, and toggled the site setting back to TRUE (so smarty-encoding was back in network-wide) at 10:46 GMT, 44 minutes after the vulnerability was live, 26 minutes after it was reported on MSO, and 3 minutes after I saw the ping alerting me to this.
Before fixing the issue for a redeploy (more on this below), I quickly responded to the original bug report with a one-line: "The vulnerability has been closed (was related to turning off smarty-encoding for titles), fix incoming." I then began my investigation.
What went wrong?
While my code change did what it needed to do, it also introduced an XSS vulnerability into the display logic for all post titles across the network, including on the home page questions listing, search and tag results, and individual posts. I don't have access to the actual code anymore, but to the best of my recollection, the code diff was similar to this:
The function in question was called almost anywhere that a title was being displayed on the site. A pretty simple change: a new site setting (AKA feature flag) is added to control usage of the feature, and in the case when fancy encoding is off for the site, do not run it through the SmartyPantsEncoder.
In retrospect, the issue was glaring (to me), but at the time it slipped through: the function had been encoding the string before sending it to the SmartyPantsEncoder. The new logic meant that when the site setting was FALSE, no encoding would be performed on the string. At all.
After fixing the logic, it looked something like this (again, this is not the actual code, which I no longer have access to, this is a reconstruction based on my recollection):
领英推荐
Aside from the actual fix to the logic (to encode the text when Smarty Formatting is off), note some other changes, all of which are there to make the code more maintainable and easier to understand:
Investigation & Disclosure
After the site setting was reverted to its default at 10:46, I commenced investigating the cause (see above) and more importantly, into whether or not anyone had taken advantage of the vulnerability. I searched through PostHistory records that were made during this time, and aside from two proof-of-concept posts (with <script>alert('hello world')</script> in the title), did not find any indication of malicious scripts going into titles. After some more searching (of logs, older posts, and other related internal resources and tools), I and others were sufficiently satisfied that no actual harm had been done, and no users had been affected (aside from those who loaded a deformed question listing during the vulnerability window).
And this also made sense logically speaking - in order to really take advantage of this, a miscreant would have had to have a script ready to go on another server, ready to try to do stuff on Stack Overflow (or off, in case the script would do something like redirect to an exact clone of Stack Overflow in an attempt to hijack login credentials, or something like that). And be ready to with this just in case an unlikely-to-occur issue (no html encoding on post titles) would ever happen. And then if it did happen, to have some way of being alerted to this and be ready to post the question with very little delay. But was good to get confirmation that the unlikely had not occurred.
Though it almost did: I did discover that at 10:49, three minutes after the vulnerability was removed, a new user account created a question with the title: <script>window.location.href="someUrl"</script>. If the vulnerability had been live when this question was asked, all main questions listing traffic from Stack Overflow would have been redirected to the url that they put in the title of the question. So definitely dodged a bullet there.
While this investigation was going on, I kept revising the MSO post as to the progress, posting updates at 11:45, 12:03, 12:16, 13:13, and 13:52 GMT. I made these updates on my own initiative. I of course reported the issue internally through defined protocols immediately after the site setting was reverted to its default, but all of the relevant folks in charge of reviewing things like this were still asleep while the initial investigation was going on (I am in Israel, 7 hours ahead of NY and 10 ahead of West Coast US).
For better or for worse, I decided that it was better to err on the side of providing reassurances and updates for the community (without disclosing anything sensitive), instead of remaining silent. Even if we discovered that no actual users had been affected, the negative blowback in terms of community trust could be much larger if we waited crucial hours before giving any updates.
Reactions
There were two types of reactions within Stack Overflow internally:
Within the Stack Overflow user community (primarily in the comments of the original MSO report and my answer) there was a lot of criticism of us for letting this occur, and folks trying to imagine worse-case scenarios and ways that abuse of the vulnerability could have escaped our investigation.
But the most interesting response was on a Reddit post: XSS vulnerability found in Stack Overflow. For some background, r/programming is normally not a very friendly platform when it comes to Stack Overflow. It is a place where employees tread very carefully (or just altogether avoid), and in general folks are quick to point out and emphasize lots of the negatives about Stack Overflow (bad treatment of new users, users who thought that their questions were closed unfairly, anger at moderators, etc).
However, in this case, responses were gratifyingly, very supportive.
The top comment praised my response:
That was fast, both the fix and the assertion that nobody took advantage of the issue.
Others defended against attacks in the MSO comments, talked about the use of site settings, and showed understanding for how these things can happen.
And my personal favorite:
Holy cow I love this specificity, most companies post stuff like we hAvE bUILt mOrE rEINFoRceMeNts against the aTtaCkErs or something lol. Glad to know SO truly cares.
Takeaways
Overall, this was a potentially horrible issue that had a quick resolution. Here are some things that I learned from the entire sequence of events:
Now go kick that imposter syndrome in the butt. You've got this (talking to myself here)!
Disclaimer: no one from Stack Overflow read or approved of this post. This is based purely on my recollection of events (along with all of the stuff on the public record), and some details are no longer verifiable by me. I am posting this for the historical value, and in the hopes that the lessons learned can be of use to some folks out there.
If you've made it this far, know that I appreciate you. Your comments and feedback are welcome.
VP of UX @ Newfire Global Partners
1 年Great story and great takeaways. Thanks for sharing!
Engineering Director at Capital Technology Group, LLC
1 年I'd like to reinforce your takeaway #3 and would-be fix #5: An early step in producing well-documented code is using appropriate naming for functions and variables. Sure, changing the name means you have to touch a bunch more files, but if the function was called ApplySecurityAndPrettyEncoding or something, you might not even need comments to indicate its importance, and its importance would also be understood everywhere the function is called, without additional comments there.
Hey, what does this button do?
1 年Smart quotes ruin everything, again :D Great write up!
Software Engineer at Stack Overflow
1 年Great insight Yaakov. Particularly the bit on ternary operators. I often find many people, myself included, hyper focused on "clean" code who don't step back and question if the cleanliness is actually beneficial or if we are just conflating visual satisfaction with quality code.