登录查看更多内容

Retro & Lessons learned from the time I accidentally opened up Stack Overflow to XSS attacks

Yaakov Ellis

Staff Engineer at Intuit. Gets Things Done. ex Stack Overflow

发布日期: 2023年12月26日

It is (or at least used to be) customary that when new developers join Stack Overflow and are given privileges to commit code and run builds, they are told something along the lines of: "you will break Stack Overflow one day, it is ok, it happens to everyone". And then some of the more experienced developers would chime in with tales of the times that they brought the whole network down due to some avoidable error (though honestly, most times that it happens, it is due to DNS).

As a developer at Stack Overflow, I never brought the network down. Though I did break the build on Team City a couple of times (messed up some AssemblyBindings way back when that was a thing) back when all devs could still commit directly to master without having a green PR (which surprisingly wasn't all too long ago). And I have had my share of poorly thought-out queries that exploded on prod (and very much upset our DBRE's, even if they weren't bad enough to take down the site).

But I did do something once that was potentially worse: I introduced a code logic error that opened up an XSS vulnerability for every user on every page of Stack Overflow and all Stack Exchange network sites (including home page and all individual question pages). Beyond what I wrote on Meta Stack Overflow at the time, no public retrospective was ever shared. Now that I have left the company, I'd like to circle around and share more about exactly what happened, what went wrong, and the lessons that I think can be learned from this.

What are XSS attacks?

(Please skip this section if you know this already)

XSS stands for Cross Site Scripting (yes, I know; I didn't make up the acronym). It is defined by OWASP as occurring when:

...an attacker uses a web application to send malicious code, generally in the form of a browser side script, to a different end user... An attacker can use XSS to send a malicious script to an unsuspecting user. The end user’s browser has no way to know that the script should not be trusted, and will execute the script. Because it thinks the script came from a trusted source, the malicious script can access any cookies, session tokens, or other sensitive information retained by the browser and used with that site.

The scripts that are loaded from a website (especially from the same domain) are normally trusted by a browser and are given lots of privileges and access. In an XSS attack, the attacker finds a vulnerability in the website that allows them to load a script to visitors such that the visitor's browsers will think that the script is coming from the site, and can thereby abuse its elevated privileges to do sneaky things.

XKCD 327: Exploits of a Mom. Not an XSS injection attack. But it is an injection attack.

Any site that displays user-generated data is potentially open to this type of vulnerability, and enforcing measure to prevent it is the responsibility of the site developers.

The most common way for this vulnerability to be manifested is to always display user-input as-is, without any Html encoding. This means that if user input includes any html tags that call for a script to be loaded, they will be included as-is, and the script will be loaded. Html encoding user-generated content forces important elements of tags (like the opening and closing angle brackets in text like <script>) to be output in a way that allows the brackets to be displayed on-screen without the browser interpreting them as being part of the html markup. (See here for more on ways to prevent XSS attacks - and yes, I know that output encoding won't close all XSS vulnerabilities, but it suffices for the case being discussed here).

Incident Timeline

On Wednesday morning, September 1, 2021, I pushed a build for the entire Stack Exchange network containing a fix for a really old bug report (dating back to 2010) to stop converting regular quotes to "SmartyPants quotes" in post titles. Logic had been put into place sometime around 2009-2010 to automatically convert straight single and double-quotes in question titles into their fancier ("curly") variations, and especially on technical sites this could have a negative impact on users copying code directly from titles into their code editors (where straight vs curly quote styles have completely different meanings. At the time (2021) whenever I was on bug duty, I liked to find really old, highly upvoted, forgotten-but-still-relevant bug reports on Meta SO/SE to fix among the more recent issues that often took up most of our attention.

The Pull Request had been tested locally and on a dev site by myself and another dev who reviewed the PR (both of us were Staff Engineers with 10-15+ years of experience). If I recall correctly, it went live with a site setting set for applying SmartyFormatting to titles set to TRUE. I toggled the site setting to FALSE on one site, tested it, and then turned it off for the entire network at 10:02 GMT. Monitored things for a few minutes, added a [status-completed] tag and answer to the original bug report, and went out for a quick errand.

(On a side note and relevant for later, I live and work from Israel, so 10:02 GMT is an hour when the European Devs are mostly awake, but the US-based devs and staff are not, so it is generally a quiet time on the company Slack.)

At this point in time (10:02 GMT), the XSS vulnerability was live, and in theory, a user could exploit it by asking a question that loaded a third-party script via a script tag in the title of the question.

Eighteen minutes later, at 10:20 GMT, a user posted a bug report on Meta Stack Overflow, the initial report of the issue. As they showed there, a question was asked the previous day (and revised a couple of hours earlier) with the title (at the time) of "Simple multi-colored <textarea> based editor" was causing display issues on a tag listing, displaying a tax input inside the title, and stopping the output of the rest of the questions listing:

Screenshot from the original report, CC-BY-SA 4.0 (link to source)

I was running my errand out of the house while this happened, and wasn't monitoring MSO any longer, wasn't close to my dev machine.

At 10:38 GMT, I received a ping on a Discord Server devoted to discussion about Stack Exchange from user Nick, alerting me to the initial MSO post. Luckily I had the Discord mobile app installed on my phone, and I saw the ping at 10:43 GMT. I loaded the homepage of Stack Overflow and got a "hello world" javascript alert in my mobile browser. Heart skipped a beat. Then I quickly went to our developers-only site settings section, and toggled the site setting back to TRUE (so smarty-encoding was back in network-wide) at 10:46 GMT, 44 minutes after the vulnerability was live, 26 minutes after it was reported on MSO, and 3 minutes after I saw the ping alerting me to this.

Before fixing the issue for a redeploy (more on this below), I quickly responded to the original bug report with a one-line: "The vulnerability has been closed (was related to turning off smarty-encoding for titles), fix incoming." I then began my investigation.

What went wrong?

While my code change did what it needed to do, it also introduced an XSS vulnerability into the display logic for all post titles across the network, including on the home page questions listing, search and tag results, and individual posts. I don't have access to the actual code anymore, but to the best of my recollection, the code diff was similar to this:

The function in question was called almost anywhere that a title was being displayed on the site. A pretty simple change: a new site setting (AKA feature flag) is added to control usage of the feature, and in the case when fancy encoding is off for the site, do not run it through the SmartyPantsEncoder.

In retrospect, the issue was glaring (to me), but at the time it slipped through: the function had been encoding the string before sending it to the SmartyPantsEncoder. The new logic meant that when the site setting was FALSE, no encoding would be performed on the string. At all.

After fixing the logic, it looked something like this (again, this is not the actual code, which I no longer have access to, this is a reconstruction based on my recollection):

领英推荐

How to Prepare for the PWPT Certification

TCM Security 9 个月前

Tecplix ThreatTrack Insights - April II

Tecplix Technologies Private Limited 10 个月前

React JS?-?Security Best Practices

Kristiyan Velkov 1 年前

Aside from the actual fix to the logic (to encode the text when Smarty Formatting is off), note some other changes, all of which are there to make the code more maintainable and easier to understand:

Comment at the top of the function and also above the critical line, explaining what is happening and that it can be dangerous to mess around with the code here (there had not been any comments at all previously). Lack of comments on this didn't do us any favors when it came to understanding how crucial the function was.
Avoiding use of any ternary operators. They are great for reducing your overall lines of code and making things look neater. They can also make it easy to gloss over crucial logic (as I and the PR reviewer did).
Using the site setting in a positive branch (do something when it is on, instead of when it is off). Just makes it easier to grok, harder to mess up the logic.
Not pictured: added unit tests to this function to ensure that it always returns encoded html, and integration tests for some of the main usage paths of the function ensuring that questions with html tags in the title will pass get encoded titles through all of the main title access points on the Post object.
Not done: rename the function to be about EncodingTitles and not be about applying SmartyFormatting. I and the PR reviewer had seen the title and improperly assumed that it was only about that, and not about general encoding (which it was). We didn't do this at the time (didn't want to do a big refactor in my quick-fix follow-up). In retrospect, we probably should have just done it then, as I doubt it has been done since (I know that I didn't subsequently change it).

Investigation & Disclosure

After the site setting was reverted to its default at 10:46, I commenced investigating the cause (see above) and more importantly, into whether or not anyone had taken advantage of the vulnerability. I searched through PostHistory records that were made during this time, and aside from two proof-of-concept posts (with <script>alert('hello world')</script> in the title), did not find any indication of malicious scripts going into titles. After some more searching (of logs, older posts, and other related internal resources and tools), I and others were sufficiently satisfied that no actual harm had been done, and no users had been affected (aside from those who loaded a deformed question listing during the vulnerability window).

And this also made sense logically speaking - in order to really take advantage of this, a miscreant would have had to have a script ready to go on another server, ready to try to do stuff on Stack Overflow (or off, in case the script would do something like redirect to an exact clone of Stack Overflow in an attempt to hijack login credentials, or something like that). And be ready to with this just in case an unlikely-to-occur issue (no html encoding on post titles) would ever happen. And then if it did happen, to have some way of being alerted to this and be ready to post the question with very little delay. But was good to get confirmation that the unlikely had not occurred.

Though it almost did: I did discover that at 10:49, three minutes after the vulnerability was removed, a new user account created a question with the title: <script>window.location.href="someUrl"</script>. If the vulnerability had been live when this question was asked, all main questions listing traffic from Stack Overflow would have been redirected to the url that they put in the title of the question. So definitely dodged a bullet there.

While this investigation was going on, I kept revising the MSO post as to the progress, posting updates at 11:45, 12:03, 12:16, 13:13, and 13:52 GMT. I made these updates on my own initiative. I of course reported the issue internally through defined protocols immediately after the site setting was reverted to its default, but all of the relevant folks in charge of reviewing things like this were still asleep while the initial investigation was going on (I am in Israel, 7 hours ahead of NY and 10 ahead of West Coast US).

For better or for worse, I decided that it was better to err on the side of providing reassurances and updates for the community (without disclosing anything sensitive), instead of remaining silent. Even if we discovered that no actual users had been affected, the negative blowback in terms of community trust could be much larger if we waited crucial hours before giving any updates.

Reactions

There were two types of reactions within Stack Overflow internally:

Empathy and support from developers who knew it could have happened to them, and shared in the learning and fix
After-the-fact support from the relevant parties for my post and updates on MSO, though I definitely got the idea that they would have rather had a more delayed and less verbose text, with fewer disclosures.

Within the Stack Overflow user community (primarily in the comments of the original MSO report and my answer) there was a lot of criticism of us for letting this occur, and folks trying to imagine worse-case scenarios and ways that abuse of the vulnerability could have escaped our investigation.

But the most interesting response was on a Reddit post: XSS vulnerability found in Stack Overflow. For some background, r/programming is normally not a very friendly platform when it comes to Stack Overflow. It is a place where employees tread very carefully (or just altogether avoid), and in general folks are quick to point out and emphasize lots of the negatives about Stack Overflow (bad treatment of new users, users who thought that their questions were closed unfairly, anger at moderators, etc).

However, in this case, responses were gratifyingly, very supportive.

The top comment praised my response:

That was fast, both the fix and the assertion that nobody took advantage of the issue.

Others defended against attacks in the MSO comments, talked about the use of site settings, and showed understanding for how these things can happen.

And my personal favorite:

Holy cow I love this specificity, most companies post stuff like we hAvE bUILt mOrE rEINFoRceMeNts against the aTtaCkErs or something lol. Glad to know SO truly cares.

Takeaways

Overall, this was a potentially horrible issue that had a quick resolution. Here are some things that I learned from the entire sequence of events:

No one is perfect, everyone will make mistakes, no matter the experience level. As I said above, I did this as a Staff Engineer with 15+ years experience. The PR was reviewed by a dev who was on a similar experience level. It happens. Important not to get down about it. More important to get up, fix it, be open about it, and learn from it.
Always use Feature flags for anything where there is even a small chance that it might be needed. The site settings system on Stack saved me big time. Thank God that I had the foresight to use it. I was able to revert this change in 3 minutes, from my phone, standing on the sidewalk, for all of Stack Overflow and the Stack Exchange network. If I had needed to make a code change to revert this, it could would have easily taken 30-60 more minutes (if not more) to get this fixed (and into PR, and approved, and wait for the PR to build, and hope that a random integration test didn't misfire, and get it merged, and build on Dev) and on to prod. (Note: the CI/CD process has changed drastically since 2021.)
In general, documented, understandable code very often trumps concise pretty code. Aside from very niche fields, we no longer have to worry about making code more concise in order to saving extra bytes. Space is cheap, it just isn't an issue anymore like it was back in the K&R days. While it is still a good practice to not make code longer than necessary, and while shorter code is often more readable, sometimes it is worth expanding the ternary. Adding the extra if/else. Adding a comment to the function and to critical lines. Refactor function and variable names to be extra clear about their usage. This pays big dividends down the line.
Foster a positive, trusting relationship with your community of users. As I wrote above, I didn't see the initial MSO post. I only found out about this because I got pinged in a community-run Discord server by a user who felt like they had someone to talk to. If it had waited for someone from the company to see it, ping my teams Slack channel, and me to see the ping, the fix could have been delayed quite significantly. Now, this wont apply to every situation, and one should always strive to have mechanisms in place for receiving high priority alerts submitted through different platforms. But if this is at all relevant, keeping up good community relations can pay itself back in a big way (aside from all of the other benefits).
It is almost never a bad move to err on the side of open disclosure. Be honest about your mistakes. Reassure folks that you are being responsible in the aftermath of an incident. Of course, follow your protocols, and never disclose any sensitive information. But a quick follow-up and general air of honesty and openness can go a really long way towards rebuilding trust. (And a hat tip to my favorite public incident report from my time at Stack, A deeper dive into our May 2019 security incident, by Dean Ward .)
You can never preemptively test every permutation of things that can go wrong in advance, but when things do go wrong, be sure to account for them moving forward. Some comments on MSO questioned why we weren't already testing for this vulnerability ahead of time. I didn't have a great answer. We weren't. Hindsight is 20/20. But you can bet that there are tests around this now. It wont happen again.

Now go kick that imposter syndrome in the butt. You've got this (talking to myself here)!

Disclaimer: no one from Stack Overflow read or approved of this post. This is based purely on my recollection of events (along with all of the stuff on the public record), and some details are no longer verifiable by me. I am posting this for the historical value, and in the hopes that the lessons learned can be of use to some folks out there.

If you've made it this far, know that I appreciate you. Your comments and feedback are welcome.

Jonathan Abbett

VP of UX @ Newfire Global Partners

1 年

Great story and great takeaways. Thanks for sharing!

1 次回应

Isaac Moses

Engineering Director at Capital Technology Group, LLC

1 年

I'd like to reinforce your takeaway #3 and would-be fix #5: An early step in producing well-documented code is using appropriate naming for functions and variables. Sure, changing the name means you have to touch a bunch more files, but if the function was called ApplySecurityAndPrettyEncoding or something, you might not even need comments to indicate its importance, and its importance would also be understood everywhere the function is called, without additional comments there.

3 次回应

Michael Stum

Hey, what does this button do?

1 年

Smart quotes ruin everything, again :D Great write up!

2 次回应

Joshua Wolsborn

Software Engineer at Stack Overflow

1 年

Great insight Yaakov. Particularly the bit on ternary operators. I often find many people, myself included, hyper focused on "clean" code who don't step back and question if the cleanliness is actually beneficial or if we are just conflating visual satisfaction with quality code.

1 次回应

查看更多评论

要查看或添加评论，请登录

Yaakov Ellis的更多文章

The importance of minimizing down time in the hiring process, for both employers and applicants

2024年3月4日

The importance of minimizing down time in the hiring process, for both employers and applicants

When I was looking for a new position a few months ago, I encountered a certain pattern in how communications and…

6 条评论
The Importance of Planning, Part 2: Break it all down

2024年1月29日

The Importance of Planning, Part 2: Break it all down

(Ideally you should read The Importance of Planning, Part 1: Document All The Things & Don't Skip the Scary Stuff…
The Importance of Planning, Part 1: Document All The Things & Don't Skip the Scary Stuff

2024年1月18日

The Importance of Planning, Part 1: Document All The Things & Don't Skip the Scary Stuff

You are given a task "Sounds simple" you think. Time to get started, right? Finish the initial logic, and worry about…

3 条评论
Being the elevator: thoughts on getting unstuck and reenergized

2024年1月8日

Being the elevator: thoughts on getting unstuck and reenergized

A long time ago (in what now feels like a galaxy far far away), I was in my final semester at University of…

2 条评论
Always leave things better off than when you found them

2024年1月2日

Always leave things better off than when you found them

Not too long ago we had a local gardener over to discuss some issues that we were having with certain plants. I noticed…

1 条评论
What I would do if I ran Stack Overflow Product

2023年12月14日

What I would do if I ran Stack Overflow Product

Stack Overflow is a very important world resource. The site has contributed tremendous good to the world, being a…

13 条评论

See all articles

Retro & Lessons learned from the time I accidentally opened up Stack Overflow to XSS attacks

Yaakov Ellis

Staff Engineer at Intuit. Gets Things Done. ex Stack Overflow

What are XSS attacks?

Incident Timeline

What went wrong?

领英推荐

Investigation & Disclosure

Reactions

Takeaways

Yaakov Ellis的更多文章

社区洞察

其他会员也浏览了

ReversingLabs researchers identify novel attack on PyPI

How to develop exploits

Buffer Overflow Exploitation with Immunity Debugger | TryHackMe Buffer Overflow Prep

Essential Reading for Web App Bug Hunters: 60 Must-Read Blog Posts on Exploits, Attack Vectors, and Vulnerabilities

Preventing JavaScript Injection Attacks: Best Practices and Techniques

AWAE/OSWE - A not expected review

Top Subdomain and Vulnerability Tools for 2024

CRITICAL ALERT: CVE-2024-53677 Apache Struts Vulnerability

Anatomy of the infamous EMPIRE powershell framework.

Signing & Validating API Requests

What are XSS attacks?

Incident Timeline

What went wrong?

领英推荐

Investigation & Disclosure

Reactions

Takeaways

Yaakov Ellis的更多文章

The importance of minimizing down time in the hiring process, for both employers and applicants

The Importance of Planning, Part 2: Break it all down

The Importance of Planning, Part 1: Document All The Things & Don't Skip the Scary Stuff

Being the elevator: thoughts on getting unstuck and reenergized

Always leave things better off than when you found them

What I would do if I ran Stack Overflow Product

社区洞察

其他会员也浏览了

ReversingLabs researchers identify novel attack on PyPI

How to develop exploits

Buffer Overflow Exploitation with Immunity Debugger | TryHackMe Buffer Overflow Prep

Essential Reading for Web App Bug Hunters: 60 Must-Read Blog Posts on Exploits, Attack Vectors, and Vulnerabilities

Preventing JavaScript Injection Attacks: Best Practices and Techniques

AWAE/OSWE - A not expected review

Top Subdomain and Vulnerability Tools for 2024

CRITICAL ALERT: CVE-2024-53677 Apache Struts Vulnerability

Anatomy of the infamous EMPIRE powershell framework.

Signing & Validating API Requests