For sites where users are allowed to use HTML, the goal is not to escape the input, but to restrict what HTML features can be used.
The level of restriction depends on the site. A site like MySpace may decide to let users customize the appearance of their pages as much as they want. In contrast, a forum will probably limit users to P, BLOCKQUOTE, lists, simple inline styles, and perhaps images.
The naive approach of "stripping tags" with regular expressions often misses things because a regular expression interprets the input differently than browsers do. For example, the regular expression <.*?> matches nothing in "<script src='http://evil.com/evil.js' </script", leaving your Internet Explorer and Firefox visitors vulnerable (see bug 226495 for details about "half-tag" parsing). Gerv has an example involving unterminated entities. RSnake maintains an extensive list of XSS vectors that naive filtering may miss.
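The half-tag case above can be reproduced in a few lines. This is only an illustrative sketch; the function name strip_tags_naive is made up for the example and is not taken from any library.

```python
import re

# The naive "strip tags" pattern discussed above.
TAG_RE = re.compile(r"<.*?>")

def strip_tags_naive(text):
    """Remove anything that looks like <...>. Hypothetical helper, shown only to illustrate the flaw."""
    return TAG_RE.sub("", text)

# Works on tidy input:
print(strip_tags_naive("<b>bold</b> text"))  # prints "bold text"

# The half-tag contains no ">" at all, so the pattern never matches
# and the payload passes through unchanged.
payload = "<script src='http://evil.com/evil.js' </script"
print(strip_tags_naive(payload) == payload)  # prints "True"
```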
TODO: go through RSnake's list and make sure my advice covers everything.
The best approach is to parse the input HTML on the server, keeping only tags, attributes, and attribute values you want to allow. Upon serialization, the result will be "well-formed" HTML that browsers will parse the same way your server did.
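A minimal sketch of the parse-then-serialize idea, using Python's standard html.parser module, might look like the following. The allowed tag and attribute sets, the Sanitizer class, and the sanitize function are all assumptions made for this example, and Python's parser does not tokenize exactly like a browser; the point is that only allowlisted tags are re-emitted and every piece of text is escaped on the way out.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists for a forum-like site; adjust per site.
ALLOWED_TAGS = {"p", "blockquote", "ul", "ol", "li", "b", "i", "em", "strong"}
ALLOWED_ATTRS = {"title"}  # no URL or style attributes in this sketch
SKIP_CONTENT_TAGS = {"script", "style"}  # drop these tags and their contents

class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the re-serialized output
        self.open_tags = []  # stack of allowed tags that are still open
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # the tag is dropped; its text content is still kept
        kept = ""
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close everything opened after `tag` so the output nests properly.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def sanitize(html_text):
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    # Close any tags the user left open.
    while parser.open_tags:
        parser.out.append("</{}>".format(parser.open_tags.pop()))
    return "".join(parser.out)

print(sanitize('<p onclick="evil()">Hi <b>there</p><script>alert(1)</script>'))
# prints "<p>Hi <b>there</b></p>"
```

Because the output is built only from allowlisted tags and escaped text, comments, event attributes, and unknown elements never reach the page, and unclosed tags are closed during serialization.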
Things you should ensure are never allowed in user-submitted HTML, to protect the accounts of visitors who use Firefox and IE:
javascript:, vbscript:, and data: URLs in links, images, or anywhere else.
<script> tags, with or without src attributes.
Event attributes (on*), which contain scripts.
-moz-binding: or behavior: CSS properties inside <style> elements or style attributes.
HTML that is not "well-formed", because you can't be sure how quirky browsers will parse it. (Example: <b <i>Foo)
The above list might not be complete and it is safer to use whitelists than blacklists. For example, only allow http:, https:, and ftp: links rather than allowing all links other than javascript:, vbscript:, and data:.
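The link-scheme whitelist could be sketched roughly as follows. The function name is_allowed_url is hypothetical, and rejecting relative URLs and any URL containing whitespace or control characters is an extra conservative assumption of this example rather than something stated above.

```python
from urllib.parse import urlsplit

# Whitelist of schemes, as suggested above.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def is_allowed_url(url):
    """Return True only for URLs with an explicitly allowed scheme (hypothetical helper)."""
    url = url.strip()
    # Browsers may ignore tabs, newlines, and other control characters inside
    # a URL, so "java\tscript:" could still run. Refuse such input outright.
    if any(ch.isspace() or ord(ch) < 0x20 for ch in url):
        return False
    try:
        scheme = urlsplit(url).scheme.lower()
    except ValueError:
        return False  # malformed URL
    # An empty scheme (relative URL) is rejected too, which is stricter than necessary.
    return scheme in ALLOWED_SCHEMES

print(is_allowed_url("http://example.com/page"))  # prints "True"
print(is_allowed_url("JaVaScRiPt:alert(1)"))      # prints "False"
print(is_allowed_url("data:text/html,hi"))        # prints "False"
```

With a whitelist, a scheme nobody thought to blacklist is rejected by default instead of slipping through.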
If you must allow unsanitized, untrusted HTML to be part of your site, ensure that those pages are not served from the same hostname where other users log in. Examples include webmail, web hosts, attachments in a bug-tracking system, and the Google cache (see Gerv's proposal).
TODO: test various browsers' interpretation of setting "document.domain" to figure out exactly what "not on the same hostname" needs to mean