Definition of XHTML
XHTML (eXtensible HyperText Markup Language) is a family of XML markup languages that mirror or extend versions of the widely used Hypertext Markup Language (HTML), the language in which web pages are written.
While HTML (prior to HTML5) was defined as an application of Standard Generalized Markup Language (SGML), a very flexible markup language framework, XHTML is an application of XML, a more restrictive subset of SGML. Because XHTML documents need to be well-formed, they can be parsed using standard XML parsers, unlike HTML, which requires a lenient HTML-specific parser.
XHTML 1.0 became a World Wide Web Consortium (W3C) Recommendation on January 26, 2000. XHTML 1.1 became a W3C Recommendation on May 31, 2001. XHTML5 is undergoing development as of September 2009, as part of the HTML5 specification.
XHTML 1.0 is "a reformulation of the three HTML 4 document types as applications of XML 1.0". The World Wide Web Consortium (W3C) also continues to maintain the HTML 4.01 Recommendation, and the specifications for HTML5 and XHTML5 are being actively developed. In the current XHTML 1.0 Recommendation document, as published and revised to August 2002, the W3C commented that, "The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility."
However, in 2004, the Web Hypertext Application Technology Working Group (WHATWG) formed, independently of the W3C, to work on advancing ordinary HTML not based on XHTML. The WHATWG eventually began working on a standard that supported both XML and non-XML serializations, HTML5, in parallel to W3C standards such as XHTML 2. In 2007, the W3C's HTML working group voted to officially recognize HTML5 and work on it as the next-generation HTML standard. In 2009, the W3C allowed the XHTML 2 Working Group's charter to expire, acknowledging that HTML5 would be the sole next-generation HTML standard, including both XML and non-XML serializations. Of the two serializations, the W3C suggests that most authors use the HTML syntax rather than the XHTML syntax.
XHTML was developed to make HTML more extensible and increase interoperability with other data formats. HTML 4 was ostensibly an application of Standard Generalized Markup Language (SGML); however, the specification for SGML was complex, and neither web browsers nor the HTML 4 Recommendation were fully conformant to it. The XML standard, approved in 1998, provided a simpler data format, closer in simplicity to HTML 4. By shifting to an XML format, it was hoped that HTML would become compatible with common XML tools, and that servers and proxies would be able to transform content, as necessary, for constrained devices such as mobile phones. By using namespaces, XHTML documents could provide extensibility by including fragments from other XML-based languages such as Scalable Vector Graphics and MathML. Finally, the renewed work would provide an opportunity to divide HTML into reusable components (XHTML Modularization) and clean up untidy parts of the language.
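The namespace-based extensibility described above can be illustrated with a generic XML parser. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the sample document and variable names are made up for illustration, and only the two namespace URIs are the real XHTML and SVG identifiers.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Illustrative document: an XHTML page that embeds an SVG fragment.
DOCUMENT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Mixed namespaces</title></head>
  <body>
    <p>A circle drawn with inline SVG:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
      <circle cx="50" cy="50" r="40" />
    </svg>
  </body>
</html>"""

# A generic XML parser reads the whole document, including the SVG part.
root = ET.fromstring(DOCUMENT)

# Prefixes used in the queries below; they are local to this script.
prefixes = {"h": XHTML_NS, "svg": SVG_NS}

paragraphs = root.findall(".//h:p", prefixes)
circles = root.findall(".//svg:circle", prefixes)

print(root.tag)                  # the tag name carries its namespace URI
print(len(paragraphs), len(circles))
print(circles[0].get("r"))
```

Because each element name is qualified by its namespace, the XHTML and SVG vocabularies can share one document without their element names colliding.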
Relationship to HTML
There are various differences between XHTML and HTML. The Document Object Model (DOM) is a tree structure that represents the page internally in applications, and XHTML and HTML are two different ways of representing that tree in markup (serializations). Both are less expressive than the DOM (for example, "--" may be placed in comments in the DOM, but cannot be represented in a comment in either XHTML or HTML), and XHTML's XML syntax is generally a little more expressive than HTML's (for example, arbitrary namespaces are not allowed in HTML). The first source of differences is therefore immediate: XHTML uses an XML syntax, while HTML uses a pseudo-SGML syntax (officially SGML for HTML 4 and earlier, though never in practice, and standardised away from SGML in HTML5). Second, because the two syntaxes can express slightly different DOM contents, there are some differences in actual behavior between the two models.
Firstly then, syntax differences:
- Broadly, the XML rules require that all elements be closed, either by a separate closing tag or using self-closing syntax (e.g. <br />), while HTML syntax permits some elements to be unclosed because either they are always empty (e.g. <input>) or their end can be determined implicitly ("omissibility").
- XML is case-sensitive for element and attribute names, while HTML is not.
- Some shorthand features in HTML are omitted in XML, such as (1) attribute minimization, where attribute values or their quotes may be omitted (e.g. <option selected> or <option selected=selected>, whereas in XML this must be expressed as <option selected="selected">); (2) element minimization, where elements may be omitted entirely (such as <tbody>, which is inferred in a table if not given); and (3) the rarely used SGML syntax for element minimization ("shorttag"), which most browsers do not implement.
- There are numerous other technical requirements surrounding namespaces and precise parsing of whitespace and certain characters and elements. The exact parsing of HTML in practice has been undefined until recently; see the HTML5 specification ([HTML5]) for full details, or the working summary (HTML vs. XHTML).
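The syntax differences listed above can be made concrete with a small converter. The following is a rough sketch in Python, using the standard library's html.parser to read HTML-style markup and re-emit it under XML rules: lower-case names, quoted and fully written attribute values, and explicitly closed empty elements. The class and function names and the short list of empty elements are illustrative assumptions, and the sketch does not reconstruct omitted end tags or inferred elements such as <tbody>.

```python
from html import escape
from html.parser import HTMLParser

# Elements that are always empty in HTML (a partial list, for illustration).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class XhtmlWriter(HTMLParser):
    """Re-emit leniently tokenized HTML tags using XML syntax rules."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lower-cased tag and attribute names.
        parts = [tag]
        for name, value in attrs:
            if value is None:  # minimized attribute such as "selected"
                value = name
            parts.append(f'{name}="{escape(value, quote=True)}"')
        closer = " />" if tag in VOID_ELEMENTS else ">"
        self.pieces.append("<" + " ".join(parts) + closer)

    def handle_endtag(self, tag):
        # Empty elements were already closed by the " />" form.
        if tag not in VOID_ELEMENTS:
            self.pieces.append(f"</{tag}>")

    def handle_data(self, data):
        self.pieces.append(escape(data, quote=False))


def to_xml_syntax(html_fragment):
    """Return the fragment rewritten with XML-style tags and attributes."""
    writer = XhtmlWriter()
    writer.feed(html_fragment)
    writer.close()
    return "".join(writer.pieces)


print(to_xml_syntax("<OPTION SELECTED>One</OPTION>"))
# prints: <option selected="selected">One</option>
print(to_xml_syntax("<INPUT TYPE=checkbox CHECKED><BR>"))
# prints: <input type="checkbox" checked="checked" /><br />
```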
Secondly, in contrast to these minor syntactical differences, there are some behavioral differences, mostly arising from the underlying differences in serialization. For example:
- Most prominently, behavior on parse errors differs. A fatal parse error in XML (such as an incorrect tag structure) causes document processing to be aborted.
- Most content requiring namespaces will not work in HTML, except the built-in support for SVG and MathML in the HTML5 parser along with certain magic prefixes such as
document.write() method; it is not available for XHTML. The
innerHTML property is available, but will not insert non-well-formed content. On the other hand, it can be used to insert well-formed namespaced content into XHTML.
- CSS is also applied slightly differently. Due to XHTML's case-sensitivity, all CSS selectors become case-sensitive for XHTML documents. Some CSS properties, such as backgrounds, set on the <body> element in HTML are 'inherited upwards' into the <html> element; this appears not to be the case for XHTML.
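The difference in error handling noted in the list above can be observed with two parsers from the Python standard library. This is a minimal sketch: xml.etree.ElementTree stands in for a strict XML processor, and html.parser stands in for a lenient HTML one, although it is only a forgiving tokenizer and not the HTML5 parsing algorithm used by browsers. The helper names and the sample markup are illustrative.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Incorrect tag structure: <b> is never closed before </p>.
BROKEN = "<p>Unclosed <b>bold text</p>"


class TagCollector(HTMLParser):
    """Record every start tag seen, without checking nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_xml(markup):
    """Strict parse: any well-formedness error aborts processing."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as err:
        return f"aborted: {err}"


def parse_as_html(markup):
    """Lenient parse: keep going and report what was found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parse_as_xml(BROKEN))   # reports a parse error and yields no tree
print(parse_as_html(BROKEN))  # still sees both start tags
```

The strict parser yields no document at all for the broken input, while the lenient one still reports the p and b start tags.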
The similarities between HTML 4.01 and XHTML 1.0 led many web sites and content management systems to adopt the initial W3C XHTML 1.0 Recommendation. To aid authors in the transition, the W3C provided guidance on how to publish XHTML 1.0 documents in an HTML-compatible manner, and serve them to browsers that were not designed for XHTML.
Such "HTML-compatible" content is sent using the HTML media type (text/html) rather than the official Internet media type for XHTML (application/xhtml+xml). When comparing the adoption of XHTML with that of regular HTML, therefore, it is important to distinguish whether it is media type usage or actual document contents that is being compared.
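One way a server might choose between the two media types is to inspect the Accept header sent by the browser. The sketch below, in Python, is only an illustration under that assumption: the function name is hypothetical, and the policy (send application/xhtml+xml only when the client lists it explicitly with a non-zero quality value, otherwise fall back to text/html) is a simplification of full HTTP content negotiation.

```python
XHTML_TYPE = "application/xhtml+xml"
HTML_TYPE = "text/html"


def choose_media_type(accept_header):
    """Pick a media type for an HTML-compatible XHTML 1.0 document.

    Hypothetical policy: use the XHTML type only if the client names it
    explicitly and does not give it a quality value of zero.
    """
    for item in accept_header.split(","):
        media_range, _, params = item.strip().partition(";")
        if media_range.strip().lower() != XHTML_TYPE:
            continue
        quality = 1.0  # default when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q value as a refusal
        if quality > 0:
            return XHTML_TYPE
    return HTML_TYPE


# A client that lists the XHTML type explicitly.
print(choose_media_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# A client that only sends a wildcard, so the fallback applies.
print(choose_media_type("image/gif, image/jpeg, */*"))
```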
Most web browsers have mature support for all of the possible XHTML media types. The notable exception is Microsoft's Internet Explorer, versions 8 and earlier: rather than rendering application/xhtml+xml content, it shows a dialog box that invites the user to save the content to disk instead. Both Internet Explorer 7 (released in 2006) and Internet Explorer 8 (released in March 2009) exhibit this behavior. Microsoft developer Chris Wilson explained in 2005 that IE7's priorities were improved browser security and CSS support, and that proper XHTML support would be difficult to graft onto IE's compatibility-oriented HTML parser; however, Microsoft added support for true XHTML in IE9.
As long as support is not widespread, most web developers avoid using XHTML that is not HTML-compatible, so advantages of XML such as namespaces, faster parsing and smaller-footprint browsers do not benefit the user.
In the early 2000s, some web developers began to question why Web authors ever made the leap into authoring in XHTML. Others countered that the problems ascribed to the use of XHTML could mostly be attributed to two main sources: the production of invalid XHTML documents by some Web authors and the lack of support for XHTML built into Internet Explorer 6. They went on to describe the benefits of XML-based Web documents (i.e. XHTML) regarding searching, indexing and parsing as well as future-proofing the Web itself.
In October 2006, HTML inventor and W3C chair Tim Berners-Lee, introducing a major W3C effort to develop a new HTML specification, posted in his blog that, "The attempt to get the world to switch to XML ...all at once didn't work. The large HTML-generating public did not move ... Some large communities did shift and are enjoying the fruits of well-formed systems ... The plan is to charter a completely new HTML group." The current HTML5 working draft says "special attention has been given to defining clear conformance criteria for user agents in an effort to improve interoperability ... while at the same time updating the HTML specifications to address issues raised in the past few years." Ian Hickson, editor of the HTML5 specification, who criticised the improper use of XHTML in 2002, is a member of the group developing this specification and is listed as one of the co-editors of the current working draft.
Simon Pieters researched the XML-compliance of mobile browsers and concluded "the claim that XHTML would be needed for mobile devices is simply a myth".
Semantic content in XHTML
XHTML+RDFa is an extended version of the XHTML markup language for supporting RDF through a collection of attributes and processing rules in the form of well-formed XML documents. This host language is one of the techniques used to develop Semantic Web content by embedding rich semantic markup.
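To give a rough idea of what such embedded attributes look like, the sketch below (in Python) reads the property and content attributes from a small well-formed document with a generic XML parser. It is an illustration only and not an RDFa processor: it ignores the processing rules, does not resolve prefixes or subjects, and the sample document, the example.org address and the function name are all made up.

```python
import xml.etree.ElementTree as ET

# Illustrative XHTML+RDFa-style document using Dublin Core terms.
DOCUMENT = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <head><title>Example</title></head>
  <body>
    <div about="http://example.org/report">
      <h1 property="dc:title">Annual report</h1>
      <p>Written by <span property="dc:creator">A. Author</span>.</p>
    </div>
  </body>
</html>"""


def list_properties(markup):
    """Return (property, value) pairs in document order.

    The value is the content attribute when present, otherwise the
    element's text. Real RDFa processing involves many more rules.
    """
    root = ET.fromstring(markup)
    found = []
    for element in root.iter():
        prop = element.get("property")
        if prop is not None:
            value = element.get("content") or "".join(element.itertext())
            found.append((prop, value.strip()))
    return found


for name, value in list_properties(DOCUMENT):
    print(name, "=", value)
```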
XHTML 1.x documents are mostly backward compatible with HTML 4 user agents when the appropriate guidelines are followed. XHTML 1.1 is essentially compatible, although the elements for ruby annotation are not part of the HTML 4 specification and thus generally ignored by HTML 4 browsers. Later XHTML 1.x modules such as those for the role attribute, RDFa and WAI-ARIA degrade gracefully in a similar manner.
XHTML 2.0 is significantly less compatible, although this can be mitigated to some degree through the use of scripting. (This can be simple one-liners, such as the use of "
Cross-compatibility of XHTML and HTML
HTML5 and XHTML5 serializations are largely inter-compatible if adhering to the stricter XHTML5 syntax, but there are some cases in which XHTML will not work as valid HTML5 (e.g., processing instructions are deprecated in HTML, are treated as comments, and close on the first "?", whereas they are fully allowed in XML, are treated as their own type, and close on "