Tag Soup

In web development, "tag soup" is a pejorative for HTML written for a web page that is syntactically or structurally incorrect. Web browsers have historically treated structural or syntax errors in HTML leniently, so there has been little pressure for web developers to follow published standards. Therefore there is a need for all browser implementations to provide mechanisms to cope with the appearance of "tag soup", accepting and correcting for invalid syntax and structure where possible.

An HTML parser (part of a web browser) that is capable of interpreting HTML-like markup even if it contains invalid syntax or structure may be called a tag soup parser. All major web browsers currently have a tag soup parser for interpreting malformed HTML, with most error-handling elements standardized.

"Tag soup" encompasses many common authoring mistakes, such as malformed HTML tags, improperly nested HTML elements, and unescaped special characters (especially ampersands (&) and less-than signs (<)). The Markup Validation Service is a resource for web page authors to avoid creating tag soup.


Overview

"Tag soup" is a term used to denigrate various practices in web authoring. Some of these (roughly ordered from most severe to least severe) include:

1. Malformed markup where tags are improperly nested or incorrectly closed. For example, the following:

   <p>This is a malformed fragment of <em>HTML</p></em>

2. Invalid structure where elements are improperly nested according to the DTD for the document. Examples of this include nesting a "ul" element directly inside another "ul" element for any of the HTML 4.01 or XHTML DTDs. Dan Connolly cites the use of the "title" element outside the "head" section.

3. Use of proprietary or undefined elements and attributes instead of those defined in W3C recommendations. For example, the use of the blink element or the marquee element, which were non-standard elements originally supported only by the Netscape and Internet Explorer browsers respectively.


Causes and implications


Malformed markup

Malformed markup is arguably the most severe problem in web authoring. However, thanks to better education and information, and perhaps with some help from XHTML, it is becoming less common. Browsers faced with malformed markup must guess the author's intended meaning: they must infer closing tags where they expect them and then infer opening tags to match other closing tags. The interpretation can vary markedly from one browser to the next. While many graphical web editors produce well-formed markup, an author writing code manually in a text editor and then testing in only one browser can easily miss such errors. The presentation can therefore vary drastically from one browser to another, as each tries to "correct" the author's intent in a different way and then applies styling to those "corrections".
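The guessing described above can be illustrated with a minimal sketch in Python, using the standard library's `html.parser`. The class and function names are hypothetical, and the recovery rule is only one of many a lenient parser might choose; it is not the algorithm any particular browser uses. The sketch also treats legitimately omitted end tags as mismatches, which a real HTML parser would not.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = {"area", "br", "col", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Collects end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of open tag names, innermost last
        self.problems = []       # (end_tag, innermost_open_tag_or_None) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        innermost = self.open_elements[-1] if self.open_elements else None
        if innermost == tag:
            self.open_elements.pop()
            return
        self.problems.append((tag, innermost))
        # Guess at a recovery, as a lenient parser must: if the element is
        # open further out, close everything nested inside it; otherwise
        # ignore the stray end tag.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def find_nesting_problems(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


malformed = "<p>This is a malformed fragment of <em>HTML</p></em>"
print(find_nesting_problems(malformed))  # [('p', 'em'), ('em', None)]
```

A different recovery rule (for example, reopening "em" after the paragraph closes) would produce a different document tree from the same input, which is why unstandardized error handling led to browser-specific rendering.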


Invalid document structure

Invalid document structure here means only the use of attributes and elements where they do not belong. For example, placing a "cite" attribute on a "cite" element is invalid since the HTML and XHTML DTDs do not ascribe any meaning to that attribute on that element. Similarly, including a "p" element within the content of an "em" element is also invalid.

With the move toward separating malformed markup from invalid markup, the problems with invalid markup have increasingly been seen as less severe. Some have begun to advocate looser content models that allow greater flexibility in authoring HTML documents (whether in HTML or XHTML). However, use of invalid markup can blur the author's intended meaning, though not as severely as malformed markup.

Many graphical web editors still produce invalid markup. Moreover, many professional web designers and authors pay little attention to issues of validity. Invalid markup is common on sites throughout the World Wide Web.
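To make the distinction from malformed markup concrete, the following Python sketch checks well-formed markup against a tiny, hand-written excerpt of nesting rules. The `FORBIDDEN_CHILDREN` table and all names are assumptions made for illustration; a real validator works from the complete DTD or specification and also checks attributes.

```python
from html.parser import HTMLParser

# Hypothetical excerpt of HTML 4.01 nesting rules: parent -> children that
# may not appear directly inside it. A real DTD covers every element.
FORBIDDEN_CHILDREN = {
    "ul": {"ul", "ol"},               # lists must be wrapped in an "li"
    "em": {"p", "div", "ul", "ol"},   # inline elements cannot hold blocks
}


class StructureChecker(HTMLParser):
    """Records (parent, child) pairs that violate FORBIDDEN_CHILDREN."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of open tag names, innermost last
        self.violations = []

    def handle_starttag(self, tag, attrs):
        parent = self.open_elements[-1] if self.open_elements else None
        if tag in FORBIDDEN_CHILDREN.get(parent, ()):
            self.violations.append((parent, tag))
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # The sketch assumes well-formed input, so this only unwinds the stack.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def find_structure_violations(markup):
    checker = StructureChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations


print(find_structure_violations("<ul><li>a</li><ul><li>b</li></ul></ul>"))
# [('ul', 'ul')]
print(find_structure_violations("<em><p>text</p></em>"))
# [('em', 'p')]
```

Both inputs are well-formed, with every tag properly closed and nested, yet both are structurally invalid. This is the category of error that a syntax-only parser cannot detect.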


Use of proprietary/discontinued elements

In the early years of the Web (much of the 1990s), the official HTML specification came under increasing strain from designers' desire for flexibility in creating visually vibrant designs. In response to this pressure, browser makers unilaterally added new proprietary features to HTML that fell outside the standards at the time. This meant there were proprietary elements in HTML that worked in some browsers, but not in others.

To some extent, this problem was eased by the introduction of new standards by the W3C, such as CSS (first published in 1996, with CSS 2 following in 1998), which helped to provide greater flexibility in the presentation and layout of web pages without the need for large numbers of additional HTML elements and attributes. Moreover, in HTML 4 and XHTML 1, many elements were either superseded by a single semantic construct (such as "object" elements replacing proprietary "applet" and "embed" elements) or deprecated due to being presentational (such as the "s", "strike" and "u" elements).

Nevertheless, browser developers continued to introduce new elements to HTML when they perceived a need. Some browsers allowed the "tabindex" attribute on any element. Developers of Apple's WebKit introduced the "canvas" element, a version of which was subsequently adopted by Mozilla.

In 2004, Apple, Mozilla and Opera founded the WHATWG, with the intent of creating a new version of the HTML specification which all browser behavior would match. This included changing the specification if necessary to match an existing consensus between different browsers. The "canvas" and "embed" elements were subsequently standardised by the WHATWG. Certain elements (including "b", "i" and "small") which were previously considered presentational and deprecated were included, but defined in a media-independent rather than visual manner. Versions of the WHATWG specification were published by the W3C as HTML5.


Evolving specifications to solve tag soup

While some tag soup is due to shortcomings of browsers, and some to a lack of information for web authors, part of its proliferation was due to gaps in the web standards themselves. The W3C has spearheaded several efforts to address the shortcomings of web standards. As more browsers support newer revisions of standards, the pressure on web developers to use non-standard code to solve problems diminishes.


Cascading Style Sheets (CSS)

Cascading Style Sheets (CSS) provide a mechanism to specify the presentation of elements in a document without altering the markup structure of the document. Before CSS was commonplace, web developers may have resorted to structurally invalid markup to achieve certain presentational goals: for example, placing block-level elements within inline elements to obtain a particular effect, or using sometimes large numbers of <font> and other display-specific HTML tags. CSS uses style rules to accomplish these tasks while leaving the markup cleaner and simpler.


XML and XHTML

XHTML is a reformulation of the HTML language based on XML. XHTML was developed to address many of the problems associated with tag soup.

XML allows parsers to separate the process of interpreting the document syntax from that of interpreting its structure. In HTML and SGML, a parser needed to know certain rules about elements during parsing, such as what elements could be contained within other elements and which elements implicitly close the previous element. This is because in HTML and SGML, closing tags and even opening tags were optional on some elements. By requiring all elements to have explicit opening and closing tags, XML parsers can parse the document and produce a document tree without any knowledge of the document type. This allows parsers to be universal and very lightweight, and to be separated from the process of validating or interpreting the document.

The XML specification clearly defines that a conforming user agent (such as a web browser) must not accept a document, and must not continue parsing it, if any syntactical error is encountered. Thus, a browser interpreting a web page as XHTML will refuse to display the page if it encounters a well-formedness error. This can help ensure that when authors test XHTML code against a conforming browser they will immediately be informed of malformation problems: perhaps the most severe problem facing web browsers. When code is malformed, the intent of the author is ambiguous. Without the directives of XML, HTML browsers must use complex algorithms to infer the author's intended meaning in a wide range of cases where invalid syntax is encountered.

XML and XHTML introduce the concept of namespaces. With namespaces, authors or communities of authors can define new elements and attributes with new semantics, and intermix those within their XHTML documents. Namespaces ensure that element names from the various namespaces will not be conflated. For example, a "table" element could be defined in a new namespace with new semantics different from the HTML "table" element and the browser will be able to differentiate between the two. In providing namespaces, XHTML combined with CSS allows authoring communities to easily extend the semantic vocabulary of documents. This accommodates the use of proprietary elements so long as those elements can be presented to the intended audience through complete style sheet definitions (including aural/speech and tactile styles).

XHTML documents may be served on the web using the internet media type application/xhtml+xml or text/html. (Microsoft Internet Explorer versions before 9 do not display XHTML documents served as application/xhtml+xml; IE9 and later versions are compliant. See also the discussion of this issue in the XHTML article.)
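Both points, fatal errors on malformed input and unambiguous names through namespaces, can be sketched with Python's standard `xml.etree.ElementTree` module. The helper name `qualified_names` and the furniture namespace URI are invented for this illustration; only the XHTML namespace URI is real.

```python
import xml.etree.ElementTree as ET


def qualified_names(markup):
    """Return every element name in document order, or None if the
    markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None  # an XML parser stops at the first syntax error
    return [element.tag for element in root.iter()]


# Mis-nested tags and a bare ampersand are both fatal errors in XML.
print(qualified_names("<p>This is <em>malformed</p></em>"))  # None
print(qualified_names("<p>Fish & chips</p>"))                 # None

# Hypothetical document mixing XHTML with an invented furniture vocabulary.
mixed = (
    '<div xmlns="http://www.w3.org/1999/xhtml"'
    ' xmlns:f="http://example.org/furniture">'
    "<table/><f:table/></div>"
)
print(qualified_names(mixed))
# ['{http://www.w3.org/1999/xhtml}div',
#  '{http://www.w3.org/1999/xhtml}table',
#  '{http://example.org/furniture}table']
```

The parser needs no knowledge of either vocabulary to build the tree, and the two "table" elements remain distinct because each name is qualified by its namespace URI.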


HTML5

HTML5 aims to be the most complete solution to the problem of tag soup thus far while remaining as backwards- and forwards-compatible as possible. In contrast to XHTML, which departs from backwards compatibility and takes the approach that parsers should become less tolerant of badly formed markup, HTML5 acknowledges that badly formed HTML code already exists in large quantities and will probably continue to be used, and takes the view that the specification should be expanded to ensure maximum compatibility with such code. Thus, the HTML5 specification has altered its definition of HTML syntax both to accommodate common syntax in use today and to describe explicitly how "badly formed code" should be treated by the parser. The handling of badly formed code now has a place in the specification itself, which is intended to reduce the need for future HTML parsers to implement additional, out-of-specification measures for dealing with code that they do not recognize.


Tools

Many software tools exist which can parse and attempt to correct malformed markup, among other functions.

* HTML Tidy is a software tool available for many platforms which can correct invalid syntax, and most invalid document structure, converting HTML-like code to HTML or XHTML.
* Aggiorno is a Visual Studio add-in that focuses on making websites standards-compliant.
* TagSoup is a Java library that parses HTML, cleans it up, and delivers a stream of SAX events representing well-formed XML (not necessarily valid XHTML). This tool is used for processing JNLP files in the open source implementation of the JNLP protocol available in IcedTea-Web, a sub-project of IcedTea, the build and integration project of the OpenJDK.
* Beautiful Soup is a Python DOM-like parser for HTML/XML which can handle malformed markup.
* tagsoup is a library for the Haskell language.


Valid deviations from XHTML

Unlike the strict XHTML, HTML and its predecessor SGML are designed to be written by humans, and already have a significant degree of flexibility in syntax to reduce boilerplate. These differences do not make the document invalid and are therefore not tag soup. The following apply to both HTML 4 and HTML5 (see HTML 5.1 2nd Edition, § 8.1.2.4, "Optional tags"), and examples date back to the first days of HTML.

* Tags like <html>, <head> and <body> can often be omitted completely.
* The closing tags of some elements can often be omitted because the specification forbids those elements from nesting inside themselves. For example, multiple <li> or <p> elements can be written without closing tags.

Despite their validity, these omissions still require a special parser with a knowledge of HTML (as opposed to the more rigid XML) to parse. In addition, it is common for tools to "fix" these structures too. For example, HTML Tidy allows omitting optional tags, but defaults to not doing so.
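The kind of HTML-specific knowledge such a parser needs can be sketched as follows in Python. The `IMPLICITLY_CLOSES` table is a hypothetical, heavily simplified stand-in for the specification's optional-tag rules, and the class and function names are invented for this example; it writes out end tags the author legitimately omitted.

```python
import html
from html.parser import HTMLParser

# Hypothetical, heavily simplified rule table: a start tag (key) implicitly
# closes the innermost open element if that element is listed in the value.
IMPLICITLY_CLOSES = {
    "li": {"li"},
    "p": {"p"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
}
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class ImpliedEndTagWriter(HTMLParser):
    """Re-serializes markup, writing out end tags the author omitted."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of open tag names, innermost last
        self.output = []

    def handle_starttag(self, tag, attrs):
        closes = IMPLICITLY_CLOSES.get(tag, ())
        if self.open_elements and self.open_elements[-1] in closes:
            self.output.append(f"</{self.open_elements.pop()}>")
        self.output.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_elements:
            return  # stray or void end tag: nothing to close
        # Close every element still open inside the one being closed.
        while True:
            closed = self.open_elements.pop()
            self.output.append(f"</{closed}>")
            if closed == tag:
                break

    def handle_data(self, data):
        # The parser hands over decoded text, so re-escape it on output.
        self.output.append(html.escape(data, quote=False))

    def close(self):
        super().close()  # flushes any buffered text first
        while self.open_elements:
            self.output.append(f"</{self.open_elements.pop()}>")


def add_implied_end_tags(markup):
    writer = ImpliedEndTagWriter()
    writer.feed(markup)
    writer.close()
    return "".join(writer.output)


print(add_implied_end_tags("<ul><li>one<li>two</ul>"))
# <ul><li>one</li><li>two</li></ul>
```

A generic XML parser has no such table and would reject the input outright, which is why valid HTML with omitted tags still needs an HTML-aware parser.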


See also

* Overlapping markup
* Quirks mode



