In data sanitization, HTML sanitization is the process of examining an HTML document and producing a new HTML document that preserves only the tags and attributes designated "safe" and desired. HTML sanitization can be used to protect against attacks such as cross-site scripting (XSS) by sanitizing any HTML code submitted by a user.
Details
Basic text-formatting tags such as <b>, <i>, <u>, <em>, and <strong> are often allowed, while more advanced tags such as <script>, <object>, <embed>, and <link> are removed by the sanitization process. Potentially dangerous attributes such as onclick are also removed in order to prevent malicious code from being injected.
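The tag and attribute filtering described above can be sketched with Python's standard html.parser module. This is a minimal illustration, not a production sanitizer: the names ALLOWED_TAGS, DROP_WITH_CONTENT, AllowlistSanitizer, and sanitize are hypothetical, and the decisions to drop every attribute and to discard the content of <script> and <style> are assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Assumed policy for illustration: the formatting tags named in the text.
ALLOWED_TAGS = {"b", "i", "u", "em", "strong"}
# Assumed: for these tags the enclosed content is discarded as well.
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowed tags, without attributes, plus escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside a DROP_WITH_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            # Attributes (onclick, style, ...) are never copied to the output.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # Text is re-escaped so it cannot be reinterpreted as markup.
            self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b onclick="evil()">Hi</b><script>alert(1)</script> there'))
```

Disallowed tags that are not in DROP_WITH_CONTENT, such as <a>, lose their markup but keep their text, so sanitize('<a href="x">link</a>') yields just the word "link".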
Sanitization is typically performed using either a whitelist or a blacklist approach. Leaving a safe HTML element off a whitelist is not serious; it simply means that the feature will not be included after sanitization. On the other hand, if an unsafe element is left off a blacklist, the vulnerability will not be sanitized out of the HTML output. An out-of-date blacklist can therefore be dangerous if new, unsafe features have been introduced to the HTML Standard.
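The asymmetry between the two approaches can be shown with a small sketch. The tag sets below are taken from the examples in the text and are deliberately incomplete; the function names are hypothetical, and <mark> and <iframe> are chosen here only as examples of a harmless tag and a risky tag that each list happens to omit.

```python
# Illustrative, deliberately incomplete lists based on the tags named above.
WHITELIST = {"b", "i", "u", "em", "strong"}
BLACKLIST = {"script", "object", "embed", "link"}


def passes_whitelist(tag: str) -> bool:
    """A tag survives only if it is explicitly listed as safe."""
    return tag.lower() in WHITELIST


def passes_blacklist(tag: str) -> bool:
    """A tag survives unless it is explicitly listed as unsafe."""
    return tag.lower() not in BLACKLIST


for tag in ("b", "script", "mark", "iframe"):
    print(tag, passes_whitelist(tag), passes_blacklist(tag))
```

Under the whitelist, the omitted <mark> is merely lost from the output. Under the blacklist, the omitted <iframe> passes through unfiltered, which is the failure mode the paragraph above warns about.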
Further sanitization can be performed based on rules that specify which operation to perform on a given tag. Typical operations include removing the tag itself while preserving its content, preserving only the textual content of a tag, or forcing certain values on attributes.
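A rule table of this kind might look like the following sketch, again using Python's standard html.parser. Everything here is an assumption for illustration: the RULES layout, the action names "keep", "unwrap", and "text_only", the choice to unwrap tags that have no rule, and the simple check that only keeps href values starting with http:// or https://. In this sketch, "unwrap" drops the tag but keeps nested markup, while "text_only" reduces the whole element to its text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical rule table; tags without a rule are treated as "unwrap".
RULES = {
    "b": {"action": "keep"},
    "a": {"action": "keep", "keep_attrs": {"href"},
          "force_attrs": {"rel": "nofollow"}},
    "span": {"action": "unwrap"},
    "h1": {"action": "text_only"},
}
DEFAULT_RULE = {"action": "unwrap"}


class RuleSanitizer(HTMLParser):
    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []
        self._text_only_depth = 0  # > 0 while inside a "text_only" element

    def handle_starttag(self, tag, attrs):
        rule = self.rules.get(tag, DEFAULT_RULE)
        if rule["action"] == "text_only":
            self._text_only_depth += 1
            return
        if self._text_only_depth or rule["action"] != "keep":
            return  # tag dropped; its text still flows through handle_data
        kept = {n: v for n, v in attrs
                if n in rule.get("keep_attrs", ()) and v is not None}
        href = kept.get("href")
        if href is not None and not href.lower().startswith(("http://", "https://")):
            del kept["href"]  # assumed policy: reject javascript: and other schemes
        kept.update(rule.get("force_attrs", {}))  # forced values always win
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"'
                            for k, v in kept.items())
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        rule = self.rules.get(tag, DEFAULT_RULE)
        if rule["action"] == "text_only":
            self._text_only_depth = max(0, self._text_only_depth - 1)
        elif rule["action"] == "keep" and not self._text_only_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def apply_rules(html: str, rules=RULES) -> str:
    parser = RuleSanitizer(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(apply_rules('<h1>Big <b>news</b></h1> <span>see '
                  '<a href="https://example.com" onclick="x()">this</a></span>'))
```

Keeping the policy in a data table rather than in code makes it easier to review and to tighten per application, which is one reason rule-driven sanitizers are commonly structured this way.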
Implementations
In PHP, HTML sanitization can be performed using the strip_tags() function, at the risk of removing all textual content following an unclosed less-than symbol or angle bracket. The HTML Purifier library is another popular option for PHP applications.
In Java (and .NET), sanitization can be achieved by using the OWASP Java HTML Sanitizer Project.
In .NET, a number of sanitizers use the Html Agility Pack, an HTML parser. Another library is HtmlSanitizer.
In JavaScript, there are "JS-only" sanitizers for the back end, as well as browser-based implementations that use the browser's own Document Object Model (DOM) parser to parse the HTML (for better performance).