Create nested XML from flat HTML using XSLT

I have this specific flat input HTML structure:
<!DOCTYPE html>
<html>
<head>
<title>Article <b>bold</b> title</title>
</head>
<body>
<article>
<h1 class="h-title"><span class="span-title">1 </span> Title 1 with some <sup>sup</sup> elements.</h1>
<p>Some <b>bold</b> text for 1.</p>
<p>Some more <b>bold</b> text for 1.</p>
<h1 class="h-title"><span class="span-title">2 </span> Title 2 with some <sup>sup</sup> elements.</h1>
<ul>
<li>The first list item.</li>
<li>The second list item with <i>italic</i> text.</li>
</ul>
<p>Some <b>bold</b> text for 2.</p>
<h2 class="h-title"><span class="span-title">2.1</span> Title 2.1 with some <sup>sup</sup> elements.</h2>
<p>Some <b>bold</b> text for 2.1.</p>
<h2 class="h-title"><span class="span-title">2.2</span> Title 2.2 with some <sup>sup</sup> elements.</h2>
<p>Some <b>bold</b> text for 2.2.</p>
<h3 class="h-title"><span class="span-title">2.2.1</span> Title 2.2.1 with some <sup>sup</sup> elements.</h3>
<p>Some <b>bold</b> text for 2.2.1.</p>
<h3 class="h-title"><span class="span-title">2.2.2</span> Title 2.2.2 with some <sup>sup</sup> elements.</h3>
<p>Some <b>bold</b> text for 2.2.2.</p>
<h2 class="h-title"><span class="span-title">2.3</span> Title 2.3 with some <sup>sup</sup> elements.</h2>
<p>Some <b>bold</b> text for 2.3.</p>
<h1 class="h-title"><span class="span-title">3</span> Title 3 with some <sup>sup</sup> elements.</h1>
<p>Some <b>bold</b> text for 3.</p>
</article>
</body>
</html>
I would need to create a nested output XML structure as below:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<front type="head">
<title>Article <b>bold</b> title</title>
</front>
<body>
<sec id="s1" sec-type="Title 1 with some sup elements.">
<label>1</label>
<title>Title 1 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 1.</p>
<p>Some more <b>bold</b> text for 1.</p>
</sec>
<sec id="s2" sec-type="Title 2 with some sup elements.">
<label>2</label>
<title>Title 2 with some <sup>sup</sup> elements.</title>
<list list-type="bullet">
<list-item>The first list item.</list-item>
<list-item>The second list item with <i>italic</i> text.</list-item>
</list>
<p>Some <b>bold</b> text for 2.</p>
<sec id="s2.1" sec-type="Title 2.1 with some sup elements.">
<label>2.1</label>
<title>Title 2.1 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 2.1.</p>
</sec>
<sec id="s2.2" sec-type="Title 2.2 with some sup elements.">
<label>2.2</label>
<title>Title 2.2 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 2.2.</p>
<sec id="s2.2.1" sec-type="Title 2.2.1 with some sup elements.">
<label>2.2.1</label>
<title>Title 2.2.1 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 2.2.1.</p>
</sec>
<sec id="s2.2.2" sec-type="Title 2.2.2 with some sup elements.">
<label>2.2.2</label>
<title>Title 2.2.2 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 2.2.2.</p>
</sec>
</sec>
<sec id="s2.3" sec-type="Title 2.3 with some sup elements.">
<label>2.3</label>
<title>Title 2.3 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 2.3.</p>
</sec>
</sec>
<sec id="s3" sec-type="Title 3 with some sup elements.">
<label>3</label>
<title>Title 3 with some <sup>sup</sup> elements.</title>
<p>Some <b>bold</b> text for 3.</p>
</sec>
</body>
</xml>
So far, I have produced this XSLT transformation below (h1-h6 section needs to be improved I believe):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="2.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<!-- all -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- html / xml -->
<xsl:template match="html">
<xml>
<xsl:apply-templates select="node()|@*"/>
</xml>
</xsl:template>
<!-- head / front -->
<xsl:template match="head">
<front type="head">
<xsl:apply-templates select="node()|@*"/>
</front>
</xsl:template>
<!-- article / -->
<xsl:template match="article">
<xsl:apply-templates select="node()|@*"/>
</xsl:template>
<!-- h1-h6 / sec -->
<xsl:template match="h1[@class='h-title']">
<xsl:variable name="secId" select="normalize-space(span)"/>
<xsl:variable name="secType" select="substring-after(.,' ')"/>
<sec>
<xsl:attribute name="id" select="normalize-space(concat('s', $secId))"/>
<xsl:attribute name="sec-type" select="$secType"/>
<label>
<xsl:value-of select="$secId"/>
</label>
<title>
<xsl:apply-templates select="node() except span" />
</title>
</sec>
</xsl:template>
<!-- ul / list -->
<xsl:template match="ul">
<list list-type="bullet">
<xsl:apply-templates select="node()|@*"/>
</list>
</xsl:template>
<!-- li / list-item -->
<xsl:template match="li">
<list-item>
<xsl:apply-templates select="node()|@*"/>
</list-item>
</xsl:template>
</xsl:stylesheet>
Short description:
I have this flat HTML structure, which needs to be transformed into a nested XML structure. The original HTML may use h1 to h6 headings, and they should be transformed into nested output XML sections accordingly. Each heading (h1...h6) has its own class (h1-title...h6-title). The HTML is always "well-structured", meaning h1 can be followed only by h2 or h3, etc. An out-of-order sequence (e.g. h1->h3->h2) never occurs.
I have two issues:
I believe the transformation needs to be done with recursion, but I am unable to figure it out in XSLT. I managed to create the right XML output elements and re-tag everything accordingly, but I'm unable to produce the nested structure.
The second (small) issue is that I don't know how to strip leading/trailing spaces from XML output tag and at the same time use "node() except span"? Function normalize-space() in this case returns an error.
I will be eternally grateful (and I mean it!) to someone who can solve this recursive mystery above for me.

Three days ago I touched XSLT code for the first time. Today I'm posting my first "achievement", which is based on Martin Honnen's golden function (html(h)->xml(sec)). I believe the code is ugly, and I don't know whether it follows all the conventions it should, but the end result is correct, so I'll post it as an answer to my question for now. If any anomalies or issues remain, I'll be glad to fix them if someone comments.
It looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:mf="http://example.com/mf"
exclude-result-prefixes="fn xs mf"
version="2.0">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<!-- all -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- html / xml -->
<xsl:template match="html">
<xml>
<xsl:apply-templates select="node()|@*"/>
</xml>
</xsl:template>
<!-- head / front -->
<xsl:template match="head">
<front type="head">
<xsl:apply-templates select="node()|@*"/>
</front>
</xsl:template>
<!-- flat html (h1-h6) to nested xml (sec) transformation -->
<xsl:function name="mf:group" as="node()*">
<xsl:param name="nodes" as="node()*"/>
<xsl:param name="level" as="xs:integer"/>
<xsl:for-each-group select="$nodes" group-starting-with="*[starts-with(local-name(), concat('h', $level))]">
<xsl:choose>
<xsl:when test="self::*[starts-with(local-name(), concat('h', $level))]">
<sec>
<xsl:apply-templates select="."/>
<xsl:sequence select="mf:group(current-group() except ., $level+1)"/>
</sec>
</xsl:when>
<xsl:when test="$level lt 6">
<xsl:sequence select="mf:group(current-group(), $level+1)"/>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:function>
<!-- article / -->
<xsl:template match="article">
<xsl:sequence select="mf:group(node(), 1)"/>
</xsl:template>
<!-- h1-h6 / sec -->
<xsl:template match="(h1|h2|h3|h4|h5|h6)[@class='h-title']">
<xsl:variable name="secId" select="normalize-space(span)"/>
<xsl:variable name="secType" select="fn:substring-after(normalize-space(.), ' ')"/>
<xsl:attribute name="id" select="normalize-space(concat('s', $secId))"/>
<xsl:attribute name="sec-type" select="$secType"/>
<label>
<xsl:value-of select="$secId"/>
</label>
<title>
<xsl:apply-templates select="node() except span" />
</title>
</xsl:template>
<!-- ul / list -->
<xsl:template match="ul">
<list list-type="bullet">
<xsl:apply-templates select="node()|@*"/>
</list>
</xsl:template>
<!-- li / list-item -->
<xsl:template match="li">
<list-item>
<xsl:apply-templates select="node()|@*"/>
</list-item>
</xsl:template>
</xsl:transform>
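On the second issue from the question (trimming the space that follows the label span while still using "node() except span"): normalize-space() fails there because it accepts a single item, not a sequence of nodes. One possible approach, offered only as a sketch, is to leave the select expression alone and add a template for the text node that directly follows the span. The match pattern below is an assumption about where the unwanted whitespace sits (immediately after the span, as in the sample input), and replace() requires XSLT 2.0.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity rule: copy everything that has no more specific template -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <!-- text node whose nearest preceding sibling is the label span:
       strip only its leading whitespace and keep the rest untouched -->
  <xsl:template match="*[@class='h-title']/text()[preceding-sibling::node()[1][self::span]]">
    <xsl:value-of select="replace(., '^\s+', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Only the second template would need to be merged into the stylesheet above. Because its pattern carries predicates, it takes priority over the identity rule for that one text node, and inline elements such as sup are still processed normally.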

Related

XSLT apply-templates in for-each

I'm trying to write a simple XHTML to Simple Docbook translator (the input XHTML is a limited subset so it should be doable).
I have this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" indent="yes" omit-xml-declaration="yes" standalone="no"/>
<!--
<xsl:strip-space elements="*"/>
-->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<!-- skip implicit tags-->
<xsl:template match="/html/body"><xsl:apply-templates/></xsl:template>
<xsl:template match="/html"><xsl:apply-templates/></xsl:template>
<!-- paragraphs to sections converter -->
<xsl:template match="h2">
<xsl:variable name="title" select="generate-id(.)"/>
<section>
<title><xsl:apply-templates select="text()"/></title>
<xsl:for-each select="following-sibling::*[generate-id(preceding-sibling::h2[1]) = $title and not(self::h2)]">
<xsl:apply-templates/>
</xsl:for-each>
</section>
</xsl:template>
<xsl:template match="p">
<para><xsl:apply-templates select="*|text()"/></para>
</xsl:template>
<xsl:template match="p[preceding-sibling::h2]"/>
<xsl:template match="ul">
<itemizedlist><xsl:apply-templates select="li"/></itemizedlist>
</xsl:template>
<xsl:template match="ul[preceding-sibling::h2]"/>
<xsl:template match="ol">
<orderedlist><xsl:apply-templates select="li"/></orderedlist>
</xsl:template>
<xsl:template match="ol[preceding-sibling::h2]"/>
<xsl:template match="li">
<listitem><para><xsl:apply-templates select="*|text()"/></para></listitem>
</xsl:template>
</xsl:stylesheet>
For this input
<html>
<body>
<p>First paragraph</p>
<p>Second paragraph</p>
<h2>First title</h2>
<p>First paragraph</p>
<p>Second paragraph</p>
<p>Third paragraph</p>
<h2>Second title</h2>
<p>First paragraph</p>
<ul>
<li>A list item</li>
<li>Another list item</li>
</ul>
<p>Second paragraph</p>
</body>
</html>
I expect this output
<para>First paragraph</para>
<para>Second paragraph</para>
<section>
<title>First title</title>
<para>First paragraph</para>
<para>Second paragraph</para>
<para>Third paragraph</para>
</section>
<section>
<title>Second title</title>
<para>First paragraph</para>
<itemizedlist>
<listitem>A list item</listitem>
<listitem>Another list item</listitem>
</itemizedlist>
<para>Second paragraph</para>
</section>
But I get
<para>First paragraph</para>
<para>Second paragraph</para>
<section><title>First title</title>First paragraphSecond paragraphThird paragraph</section>
<section><title>Second title</title>First paragraph
<listitem><para>A list item</para></listitem>
<listitem><para>Another list item</para></listitem>
Second paragraph</section>
For some reason, the templates for my paragraphs and lists are not being applied. I'm guessing that's because the templates that match are the empty ones, but I need those to prevent duplicate tags outside the section.
How can I make this work? TIA.
Use
<xsl:for-each select="following-sibling::*[generate-id(preceding-sibling::h2[1]) = $title and not(self::h2)]">
<xsl:apply-templates select="."/>
</xsl:for-each>
or simply
<xsl:apply-templates select="following-sibling::*[generate-id(preceding-sibling::h2[1]) = $title and not(self::h2)]"/>
to process those elements you want to wrap into a section. But there will be a collision with your other templates so perhaps using a mode helps for the processing:
<xsl:template match="p" mode="wrapped">
<para><xsl:apply-templates select="*|text()"/></para>
</xsl:template>
<xsl:template match="p[preceding-sibling::h2]"/>
<xsl:template match="ul" mode="wrapped">
<itemizedlist><xsl:apply-templates select="li"/></itemizedlist>
</xsl:template>
<xsl:template match="ul[preceding-sibling::h2]"/>
<xsl:template match="ol" mode="wrapped">
<orderedlist><xsl:apply-templates select="li"/></orderedlist>
</xsl:template>
<xsl:template match="ol[preceding-sibling::h2]"/>
<xsl:template match="li" mode="wrapped">
<listitem><para><xsl:apply-templates select="*|text()"/></para></listitem>
</xsl:template>
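To show how the mode could be wired up end to end, here is a minimal XSLT 1.0 sketch. It is one possible arrangement, not necessarily what the answer had in mind: the h2 template selects its following siblings in the "wrapped" mode, the default-mode templates for p and ul delegate to that mode, and the list template passes the mode on to its li children (otherwise the li template in the "wrapped" mode would never fire). The ol case is omitted for brevity and would follow the same shape as ul.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- skip the html and body wrappers -->
  <xsl:template match="/html | /html/body">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- each h2 opens a section and pulls in the siblings that belong to it -->
  <xsl:template match="h2">
    <xsl:variable name="title" select="generate-id(.)"/>
    <section>
      <title><xsl:value-of select="."/></title>
      <xsl:apply-templates mode="wrapped"
        select="following-sibling::*[not(self::h2)]
                [generate-id(preceding-sibling::h2[1]) = $title]"/>
    </section>
  </xsl:template>

  <!-- default mode: content before the first h2 is converted in place -->
  <xsl:template match="p | ul">
    <xsl:apply-templates select="." mode="wrapped"/>
  </xsl:template>

  <!-- default mode: content after an h2 is suppressed here, because the
       h2 template has already emitted it inside the section -->
  <xsl:template match="p[preceding-sibling::h2] | ul[preceding-sibling::h2]"/>

  <!-- the actual conversions live in the wrapped mode -->
  <xsl:template match="p" mode="wrapped">
    <para><xsl:apply-templates/></para>
  </xsl:template>

  <xsl:template match="ul" mode="wrapped">
    <itemizedlist>
      <xsl:apply-templates select="li" mode="wrapped"/>
    </itemizedlist>
  </xsl:template>

  <xsl:template match="li" mode="wrapped">
    <listitem><para><xsl:apply-templates/></para></listitem>
  </xsl:template>

</xsl:stylesheet>
```

The empty templates win over the plain "p | ul" rule because a pattern with a predicate has a higher default priority, so no explicit priority attributes are needed.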

How can I replace HTML content using XSLT?

I would like to remove a certain text from an HTML page using XSLT. The text I would like to remove is <h2>OLD TEXT</h2>. I'm trying to remove it by replacing the text with an empty string but I don't get it to work.
When calling my string-replace-all function with "<h2>OLD TEXT</h2>" text as input I get the following error:
The value of the attribute "select" associated with an element type "xsl:with-param" must not contain the '<' character.
When calling my string-replace-all function with just "OLD TEXT" as input the text gets replaced but the output
is no longer HTML, it's just plain text without the HTML tags.
How can I replace <h2>OLD TEXT</h2> and still get the output in HTML format?
My code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" />
<xsl:template name="string-replace-all">
<xsl:param name="text" />
<xsl:param name="replace" />
<xsl:param name="by" />
<xsl:choose>
<xsl:when test="$text = '' or $replace = '' or not($replace)">
<!-- Prevent this routine from hanging -->
<xsl:copy-of select="$text" />
</xsl:when>
<xsl:when test="contains($text, $replace)">
<xsl:copy-of select="substring-before($text,$replace)" />
<xsl:copy-of select="$by" />
<xsl:call-template name="string-replace-all">
<xsl:with-param name="text" select="substring-after($text,$replace)" />
<xsl:with-param name="replace" select="$replace" />
<xsl:with-param name="by" select="$by" />
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:copy-of select="$text" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:variable name="updatedHtml">
<xsl:call-template name="string-replace-all">
<xsl:with-param name="text" select="//div[@id='mainContent']" />
<xsl:with-param name="replace" select="'OLD TEXT'" />
<xsl:with-param name="by" select="''" />
</xsl:call-template>
</xsl:variable>
<xsl:template match="//div[@id='mainContent']">
<xsl:copy-of select="$updatedHtml" />
</xsl:template>
</xsl:stylesheet>
HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test page</title>
</head>
<body>
<div id="container">
<div id="mainContent">
<h1>Header one</h1>
<p>Some text</p>
<h2>OLD TEXT</h2>
<p>Some information about the old text</p>
<br>
More text
<h2>Another header</h2>
Some information <strong>with strong</strong> text.
<h2>Another header again</h2>
Some information <strong>with strong</strong> text.
</div>
</div>
</body>
</html>
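A different way to approach this, sketched here under a few assumptions, is to treat the h2 as a node rather than as a string: copy everything with the identity rule and add an empty template that matches the unwanted element. The sketch assumes the input is first made well-formed XML (the unclosed <br> above would have to become <br/>), and it binds a prefix, here arbitrarily named x, to the XHTML namespace declared on the html element, since unprefixed names in XSLT 1.0 patterns only match elements in no namespace.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- identity rule: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- empty template: the matching h2 produces no output at all -->
  <xsl:template match="x:div[@id='mainContent']/x:h2[normalize-space(.) = 'OLD TEXT']"/>

</xsl:stylesheet>
```

Because nothing is converted to a string, the remaining markup inside the div is preserved as elements, and no '<' character ever has to appear inside a select attribute.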

How to wrap <h2> and <p> tags inside a <section> tag in xslt?

I want to wrap the headers and paragraph inside the section tags. Section tag ends when the next header arises.
Input:
<body>
<h2>text text</h2>
<p> some text </p>
<p> some text </p>
<h2> text text </h2>
<p> some text </p>
<p> some text </p>
<p> some text </p>
</body>
Output:
<body>
<section>
<h2>text text</h2>
<p> some text </p>
<p> some text </p>
</section>
<section>
<h2> text text </h2>
<p> some text </p>
<p> some text </p>
<p> some text </p>
</section>
</body>
As mentioned in the comments, this is a grouping question.
If you're using XSLT 2.0, you can use xsl:for-each-group/@group-starting-with...
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/body">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:for-each-group select="*" group-starting-with="h2">
<section>
<xsl:copy-of select="current-group()"/>
</section>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
If you're stuck with XSLT 1.0, you can use an xsl:key based on a generated id of the h2...
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="sectElems" match="/body/*[not(self::h2)]"
use="generate-id(preceding-sibling::h2[1])"/>
<xsl:template match="/body">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates select="h2"/>
</xsl:copy>
</xsl:template>
<xsl:template match="h2">
<xsl:variable name="id">
<xsl:value-of select="generate-id()"/>
</xsl:variable>
<section>
<xsl:copy-of select=".|key('sectElems',$id)"/>
</section>
</xsl:template>
</xsl:stylesheet>
Both of these stylesheets produce the same output.

Wrapping words from HTML using XSL

I need to wrap each word in a tag (e.g. span) in an HTML document, like:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> Text in a div </div>
<div>
Text in a div
<p>
Text inside a p
</p>
</div>
</body>
</html>
To result something like this:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> <span>Text </span> <span> in </span> <span> a </span> <span> div </span> </div>
<div>
<span>Text </span> <span> in </span> <span> a </span> <span> div </span>
<p>
<span>Text </span> <span> in </span> <span> a </span> <span> p </span>
</p>
</div>
</body>
</html>
It's important to keep the structure of the body...
Any help?
All of the three different solutions below use the XSLT design pattern of overriding the identity rule to generally preserve the structure and contents of the XML document, and only modify specific nodes.
I. XSLT 1.0 solution:
This short and simple transformation (no <xsl:choose> used anywhere):
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()"
name="split">
<xsl:param name="pText" select=
"concat(normalize-space(.), ' ')"/>
<xsl:if test="string-length(normalize-space($pText)) >0">
<span>
<xsl:value-of select=
"substring-before($pText, ' ')"/>
</span>
<xsl:call-template name="split">
<xsl:with-param name="pText"
select="substring-after($pText, ' ')"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
when applied to the provided XML document:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> Text in a div </div>
<div>
Text in a div
<p>
Text inside a p
</p>
</div>
</body>
</html>
produces the wanted, correct result:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
</p>
</div>
</body>
</html>
II. XSLT 2.0 solution:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()">
<xsl:for-each select="tokenize(., '[\s]')[.]">
<span><xsl:sequence select="."/></span>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied to the same XML document (above), again the correct, wanted result is produced:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
</p>
</div>
</body>
</html>
III. Solution using FXSL:
Using the str-split-to-words template/function of FXSL one can easily implement much more complicated tokenization -- in any version of XSLT:
Let's have a more complicated XML document and tokenization rules:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div> Text: in a div </div>
<div>
Text; in; a. div
<p>
Text- inside [a] [p]
</p>
</div>
</body>
</html>
Here there is more than one delimiter that indicates the start or end of a word. In this particular example the delimiters can be: " ", ";", ".", ":", "-", "[", "]".
The following transformation uses FXSL for this more complicated tokenization:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:ext="http://exslt.org/common"
exclude-result-prefixes="ext">
<xsl:import href="strSplit-to-Words.xsl"/>
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*[not(self::title)]/text()">
<xsl:variable name="vwordNodes">
<xsl:call-template name="str-split-to-words">
<xsl:with-param name="pStr" select="normalize-space(.)"/>
<xsl:with-param name="pDelimiters"
select="' ;.:-[]'"/>
</xsl:call-template>
</xsl:variable>
<xsl:apply-templates select="ext:node-set($vwordNodes)/*"/>
</xsl:template>
<xsl:template match="word[string-length(normalize-space(.)) > 0]">
<span>
<xsl:value-of select="."/>
</span>
</xsl:template>
</xsl:stylesheet>
and produces the wanted, correct result:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
</div>
<div>
<span>Text</span>
<span>in</span>
<span>a</span>
<span>div</span>
<p>
<span>Text</span>
<span>inside</span>
<span>a</span>
<span>p</span>
<word/>
</p>
</div>
</body>
</html>
You could achieve this by extending the identity transform with a recursive template that checks for spaces in a piece of text and, if it finds one, puts a span tag around the first word. It then recursively calls itself for the remaining portion of the text.
Here it is in action...
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- Don't split the words in the title -->
<xsl:template match="title">
<xsl:copy-of select="." />
</xsl:template>
<!-- Matches a text element. Given a name so it can be recursively called -->
<xsl:template match="text()" name="wrapper">
<xsl:param name="text" select="." />
<xsl:variable name="new" select="normalize-space($text)" />
<xsl:choose>
<xsl:when test="contains($new, ' ')">
<span><xsl:value-of select="concat(substring-before($new, ' '), ' ')" /></span>
<xsl:call-template name="wrapper">
<xsl:with-param name="text" select="substring-after($new, ' ')" />
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<span><xsl:value-of select="$new" /></span>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
When called on your sample HTML, the output is as follows:
<html>
<head>
<title>It doesnt matter</title>
</head>
<body>
<div>
<span>Text </span>
<span>in </span>
<span>a </span>
<span>div</span>
</div>
<div>
<span>Text </span>
<span>in </span>
<span>a </span>
<span>div</span>
<p>
<span>Text </span>
<span>inside </span>
<span>a </span>
<span>p</span>
</p>
</div>
</body>
</html>
I wasn't 100% sure how important the spaces within the span elements are for you though.

XSLT Insert html content

I'm trying to insert some HTML at a given point. The XML file has a content node, which contains the actual HTML. For example, here is the content section of the XML:
-----------------
<content>
<h2>Header</h2>
<p>some link</p>
<p>some link1</p>
<p>some link2</p>
</content>
-----------------
I need to insert a link after the header but before the first link, inside its own p tag. A little rusty with XSLT, any help is appreciated!
Given this source:
<html>
<head/>
<body>
<content>
<h2>Header</h2>
<p>some link</p>
<p>some link1</p>
<p>some link2</p>
</content>
</body>
</html>
This stylesheet will do what you want to do:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="/html/body/content/h2">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
<p>your new link</p>
</xsl:template>
</xsl:stylesheet>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/content">
<xsl:copy-of select="h2"/>
foo
<xsl:copy-of select="p"/>
</xsl:template>
</xsl:stylesheet>