Applying redactions in the form of string substitutions to HTML documents using XSLT

I have a large number of HTML (and possibly other xml) documents that I need to redact.
The redactions are typically of the form "John Doe" -> "[Person A]". The text to be redacted may be in headers or paragraphs, but will almost always be in paragraphs.
Simple string substitutions really. Not very complicated things.
However, I do want to preserve document structure, and I would prefer to not reinvent any wheels. String substitution in the document text may do the job, but also may break document structure, so it will be a last option.
Right now I have stared at XSLT for an hour and tried to force "str:replace" to do my bidding. I will spare you from viewing my feeble attempts that didn't work, but I will ask this: is there a simple and known way to apply my redactions using XSLT, and could you post it here?
Thank you in advance.
Update: at the request of Martin Honnen I'm adding my input files, as well as the command I used to get the latest error message. From this it will be apparent that I'm a complete n00b when it comes to XSLT :-)
.html file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<title>TodaysDate</title>
<meta name="created" content="2020-11-04T30:45:00"/>
</head>
<body>
<ol start="2">
<li><p> John Doe on 9. fux 2057 together with Henry
Fluebottom formed the company Doe &amp; Fluebottom Widgets
Inc. </p>
</ol>
</body>
</html>
The XSLT transformation file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:template match="p">
<xsl:copy>
<xsl:attribute name="matchesPattern">
<xsl:copy-of select='str:replace("John Doe", ".*", "[Person A]")'/>
</xsl:attribute>
<xsl:copy-of select='str:replace("Henry Fluebottom", ".*", "[Person B]")'/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The command and the output:
$ xsltproc -html transform.xsl example.html
xmlXPathCompOpEval: function replace bound to undefined prefix str
xmlXPathCompiledEval: 2 objects left on the stack.
<?xml version="1.0"?>
TodaysDate
<p matchesPattern=""/>
$

xsltproc is based on libxslt and therefore supports various EXSLT functions such as str:replace. To use it, you need to declare the namespace:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
exclude-result-prefixes="str"
version="1.0">
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="p//text()">
<xsl:value-of select="str:replace(., 'John Doe', '[Person A]')"/>
</xsl:template>
</xsl:stylesheet>
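If you need more than one redaction with this approach, one option is to nest the calls. This is only a sketch: it assumes your libxslt build provides the EXSLT str:replace function and that each name sits on a single line within one text node. The second name and its placeholder are taken from the question.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:str="http://exslt.org/strings"
                exclude-result-prefixes="str"
                version="1.0">

  <!-- identity transform: copy everything unchanged by default -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- the inner call redacts the first name, the outer call the second -->
  <xsl:template match="p//text()">
    <xsl:value-of select="str:replace(str:replace(., 'John Doe', '[Person A]'), 'Henry Fluebottom', '[Person B]')"/>
  </xsl:template>

</xsl:stylesheet>
```

Nesting becomes unwieldy beyond a handful of names, at which point a lookup table of find/replace pairs is easier to maintain.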

There is no simple way in XSLT 1.0 to perform multiple replacements on the same string. You need to use a recursive named template, performing one replacement operation at a time, then moving to the next instance of the current find string or - when no next instance exists - to the next find/replace pair.
Consider the following example:
Input
<html>
<head>
<title>John Doe and Henry Fluebottom</title>
</head>
<body>
<p>John Doe is a person. John Doe on 9. fux 2057 together with Henry Fluebottom formed the company Doe &amp; Fluebottom Widgets Inc. Henry Fluebottom is also a person.</p>
</body>
</html>
XSLT 1.0 (+ EXSLT node-set() function)
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" omit-xml-declaration="yes" version="1.0" encoding="utf-8" indent="yes"/>
<!-- identity transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:variable name="dictionary">
<entry find="John Doe" replace="[Person A]"/>
<entry find="Henry Fluebottom" replace="[Person B]"/>
</xsl:variable>
<xsl:template match="text()">
<xsl:call-template name="multi-replace">
<xsl:with-param name="string" select="normalize-space(.)"/>
<xsl:with-param name="entries" select="exsl:node-set($dictionary)/entry"/>
</xsl:call-template>
</xsl:template>
<xsl:template name="multi-replace">
<xsl:param name="string"/>
<xsl:param name="entries"/>
<xsl:choose>
<xsl:when test="$entries">
<xsl:call-template name="multi-replace">
<xsl:with-param name="string">
<xsl:call-template name="replace">
<xsl:with-param name="string" select="$string"/>
<xsl:with-param name="search-string" select="$entries[1]/@find"/>
<xsl:with-param name="replace-string" select="$entries[1]/@replace"/>
</xsl:call-template>
</xsl:with-param>
<xsl:with-param name="entries" select="$entries[position() > 1]"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$string"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="replace">
<xsl:param name="string"/>
<xsl:param name="search-string"/>
<xsl:param name="replace-string"/>
<xsl:choose>
<xsl:when test="contains($string, $search-string)">
<xsl:value-of select="substring-before($string, $search-string)"/>
<xsl:value-of select="$replace-string"/>
<xsl:call-template name="replace">
<xsl:with-param name="string" select="substring-after($string, $search-string)"/>
<xsl:with-param name="search-string" select="$search-string"/>
<xsl:with-param name="replace-string" select="$replace-string"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$string"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Result
<html>
<head>
<title>[Person A] and [Person B]</title>
</head>
<body>
<p>[Person A] is a person. [Person A] on 9. fux 2057 together with [Person B] formed the company Doe &amp; Fluebottom Widgets Inc. [Person B] is also a person.</p>
</body>
</html>
As you can see, this replaces all instances of the search strings anywhere in the input document (except for attributes), while preserving the document's structure.
Note that the input in your example does not actually contain the "Henry Fluebottom" search string. You might want to get around that by calling the first template with:
<xsl:with-param name="string" select="normalize-space(.)"/>
instead of:
<xsl:with-param name="string" select="."/>

The first problem is to find an XSLT processor that actually supports string replacement. The replace() function is standard in XSLT 2.0+, but does not exist in XSLT 1.0. Some XSLT 1.0 processors support an extension function str:replace() in a different namespace, but at the very least, you need to add the namespace declaration xmlns:str="http://exslt.org/strings" to your stylesheet in order to locate the function. I don't know if that will work (I don't know if there is any way of using this function with xsltproc); my advice would be to use an XSLT 2.0+ processor instead.
The next problem is the way you are invoking the function. Typically, a correct invocation would be
replace(., "John Doe", "[Person A]")
though you will have to jump through a few more hoops to make multiple replacements on the same string.
I've no idea what you are trying to achieve with the <xsl:attribute name="matchesPattern"> instruction.
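As a rough illustration of those extra hoops in XSLT 2.0, the replace() calls can be nested inside a text-node template on top of an identity transform. This is a sketch under assumptions: the input must be well-formed XML (an XSLT 2.0 processor will not parse tag-soup HTML by itself), and because the second argument of replace() is a regular expression, `\s+` is used here so that a name split across a line break still matches.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity transform: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- redact both names in every text node; \s+ also matches a line break -->
  <xsl:template match="text()">
    <xsl:value-of select="replace(replace(., 'John\s+Doe', '[Person A]'),
                                  'Henry\s+Fluebottom', '[Person B]')"/>
  </xsl:template>

</xsl:stylesheet>
```

Names that contain regex metacharacters (such as `.` or `(`) would need escaping in the pattern, and a `$` or `\` in the replacement string would need escaping as well.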

Related

XSLT + XML -> Word <field name="foo">Value</field>

I have an XML file produced with MySQL Query Browser.
I'm trying to apply an XSLT to output the result into Word tables. One table for each record.
Here's a sample of my XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ROOT SYSTEM "Nessus.dtd">
<ROOT>
<row>
<field name="Niveau">Critique</field>
<field name="Name">Apache 2.2 &lt; 2.2.15 Multiple Vulnerabilities</field>
</row>
<row>
<field name="Niveau">Critique</field>
<field name="VulnName">Microsoft Windows 2000 Unsupported Installation Detection</field>
</row>
<row>
<field name="Niveau">Haute</field>
<field name="VulnName">CGI Generic SQL Injection</field>
</row>
</ROOT>
For the XSLT I've already found out that I need to do a for-each select
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="ROOT/row">
Niveau : <xsl:value-of select="????"/>
Name : <xsl:value-of select="????"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When I run this loop, I get as many empty tables as there are <row></row> elements in my file.
But I haven't found the right "value-of select=" expression. I've tried the following without luck.
<xsl:value-of select="@name"/>
<xsl:value-of select="name"/>
<xsl:value-of select="@row/name"/>
<xsl:value-of select="row/@name"/>
<xsl:value-of select="@ROOT/row/name"/>
And a few others that I can't remember. Any idea how I need to craft the expression to get the value into my resulting file?
I've just tried with :
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="ROOT/row">
Niveau : <xsl:value-of select="field/@Niveau"/>
Name : <xsl:value-of select="field/@name"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
And it output this :
NIVEAU :
NAME : name
NIVEAU :
NAME : Niveau
NIVEAU :
NAME : Niveau
I would like this output :
NIVEAU : Critique
NAME : Apache 2.2 < 2.2.15 Multiple Vulnerabilities
NIVEAU : Critique
NAME : Microsoft Windows 2000 Unsupported Installation Detection
NIVEAU : Haute
NAME : CGI Generic SQL Injection
Any help would be appreciated.
Thank you.
UPDATE
Now with this XSLT
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="row">
<xsl:text>NIVEAU : </xsl:text>
<xsl:value-of select="field[@name = 'Niveau']"/>
<xsl:text>
</xsl:text>
<xsl:text>NAME : </xsl:text>
<xsl:value-of select="field[@name = 'Name']"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
I get this output :
<?xml version="1.0"?>
NIVEAU :
NAME : Apache 2.2 < 2.2.15 Multiple Vulnerabilities
NIVEAU : Critique
NAME : Microsoft Windows 2000 Unsupported Installation Detection
NIVEAU : Haute
NAME : CGI Generic SQL Injection
As you can see, the first field is empty. I could honestly live with that and fill it in manually, but if you see why this is happening I'd be very happy :)
UPDATE
Using <xsl:value-of select="field[@name = 'foo']"/> gave me the value I wanted. I kept the for-each as it was easier (for me) to use inside an MS Word template.
for-each is generally a code smell in XSLT. You most likely want a template, not a for-each loop:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<root>
<xsl:apply-templates/>
</root>
</xsl:template>
<xsl:template match="field">
<xsl:value-of select="@name"/> : <xsl:value-of select="."/>
</xsl:template>
</xsl:stylesheet>
This template will produce the following output:
<root>
Niveau : Critique
name : Apache 2.2 &lt; 2.2.15 Multiple Vulnerabilities
Niveau : Critique
name : Microsoft Windows 2000 Unsupported Installation Detection
Niveau : Haute
name : CGI Generic SQL Injection
</root>
XSLT was designed for this pattern of use--many xsl:templates matching a small part of the source and applying other templates recursively. The most important and common template in this pattern is the identity template, which copies output. This is a good tutorial on XSLT coding patterns that you should read.
Update
Below is a complete solution that will produce text output (since that is what you seem to want, not XML output). I'm not sure xsl is the best language for this, but it works....
Notice that the only place we use for-each is for sorting to determine the length of the longest name. We use templates and pattern-matching selects to make all other loops implicit.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="newline"><xsl:text>
</xsl:text></xsl:variable>
<!-- determine the maximum length of the "name" field so we can pad with spaces for all shorter items -->
<xsl:variable name="max_name_len">
<xsl:for-each select="ROOT/row/field/@name">
<xsl:sort select="string-length(.)" data-type="number" order="descending"/>
<xsl:if test="position() = 1">
<xsl:value-of select="string-length(.)"/>
</xsl:if>
</xsl:for-each>
</xsl:variable>
<!-- for each row, apply templates then add a blank line -->
<xsl:template match="row">
<xsl:apply-templates/>
<xsl:value-of select="$newline"/>
</xsl:template>
<!-- for each field, apply template to name and add value, followed by a newline -->
<xsl:template match="field">
<xsl:apply-templates select="@name"/> : <xsl:value-of select="concat(., $newline)"/>
</xsl:template>
<!-- for each name, uppercase and pad with spaces to the right -->
<xsl:template match="field/@name">
<xsl:call-template name="padright">
<xsl:with-param name="text">
<xsl:call-template name="toupper">
<xsl:with-param name="text" select="."/>
</xsl:call-template>
</xsl:with-param>
<xsl:with-param name="len" select="$max_name_len"/>
</xsl:call-template>
</xsl:template>
<!-- Utility function: uppercase a string -->
<xsl:template name="toupper">
<xsl:param name="text"/>
<xsl:value-of select="translate($text, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
</xsl:template>
<!-- Utility function: pad a string to desired len with spaces on the right -->
<!-- uses a recursive solution -->
<xsl:template name="padright">
<xsl:param name="text"/>
<xsl:param name="len"/>
<xsl:choose>
<xsl:when test="string-length($text) &lt; $len">
<xsl:call-template name="padright">
<xsl:with-param name="text" select="concat($text, ' ')"/>
<xsl:with-param name="len" select="$len"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$text"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
This stylesheet produces the following output:
NIVEAU : Critique
NAME : Apache 2.2 < 2.2.15 Multiple Vulnerabilities
NIVEAU : Critique
NAME : Microsoft Windows 2000 Unsupported Installation Detection
NIVEAU : Haute
NAME : CGI Generic SQL Injection
When you do <xsl:for-each select="ROOT/row">, the current context inside the loop is the row element. So in order to access the field name, you need to write, for example, <xsl:value-of select="field/@name"/>.
Since your XML contains several fields, you will still have to extend your XSLT file somewhat to iterate the fields, or (as Francis Avila suggested) write a template. Both methods are ok, though. The technical terms for the two approaches are "pull" and "push", respectively.
And of course, since you ultimately want to generate a word table, you will have to generate w:tr and w:tc elements, etc.
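The "pull" approach described above could look roughly like this. It is a minimal sketch that assumes plain-text output is acceptable for a first test and simply prints every field of every row, using the name attribute as the label; the Word-specific markup is left out.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- outer loop: the context node becomes each row in turn -->
    <xsl:for-each select="ROOT/row">
      <!-- inner loop: the context node becomes each field of that row -->
      <xsl:for-each select="field">
        <xsl:value-of select="@name"/>
        <xsl:text> : </xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Inside the inner loop, @name reads the attribute of the current field and "." reads its text content.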
This stylesheet will give you exactly your desired output
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="row">
<xsl:text>NIVEAU : </xsl:text>
<xsl:value-of select="field[@name eq 'Niveau']"/>
<xsl:text>
</xsl:text>
<xsl:text>NAME : </xsl:text>
<xsl:value-of select="field[@name eq 'VulnName']"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
This is very specific to your input xml and output text, for example any <field> children of <row> other than 'Niveau' or 'VulnName' will be dropped from your report. A more generic solution could look like this:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="text"/>
<xsl:template match="field">
<xsl:value-of select="concat(upper-case(@name),': ',.)"/>
</xsl:template>
</xsl:stylesheet>
This solution though doesn't exactly match the whitespace formatting in your desired output, but it does capture all possible fields.
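If the line layout matters, a variant along the following lines may get closer. It is only a sketch: it assumes an XSLT 2.0 processor (for upper-case()), strips the whitespace-only text nodes between elements, and appends a line feed after each field.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <xsl:output method="text"/>
  <!-- drop whitespace-only text nodes so they do not leak into the output -->
  <xsl:strip-space elements="*"/>

  <!-- one output line per field: upper-cased label, separator, value -->
  <xsl:template match="field">
    <xsl:value-of select="concat(upper-case(@name), ' : ', ., '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

The labels still come straight from the name attributes, so a field named VulnName is printed as VULNNAME rather than NAME.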

How to change the format of XML schema I get from SQL server

I have a query:
SELECT top 0 * FROM sometable FOR XML AUTO, ELEMENTS, XMLSCHEMA ('MyURI')
This query returns a schema:
<xsd:element name="ClientName">
<xsd:simpleType>
<xsd:restriction base="sqltypes:nvarchar" sqltypes:localeId="1033" sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth" sqltypes:sqlSortId="52">
<xsd:maxLength value="50" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
but I want something more like this:
<xsd:element name="ClientName">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:maxLength value="50" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>
How can I achieve this?
You can use an XSL transform to change the SQL Server types back to the desired XSD types.
How you apply the transform depends on your application. If you are creating static schemata for your tables, then you can use something like Visual Studio or msxsl. If this is a routine request from the server, then Applying an XSL Transformation (SQLXML Managed Classes) may be a better fit.
A style sheet you can build on with additional types is:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl"
xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
xmlns="http://www.w3.org/1999/XSL/Transform"
>
<xsl:output method="xml" indent="yes"/>
<xsl:param name="Strip">false</xsl:param>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@base">
<xsl:attribute name="{name()}">
<xsl:choose>
<xsl:when test=".='sqltypes:nvarchar'">xsd:string</xsl:when>
<!-- Add additional tests here -->
<xsl:otherwise>
<xsl:value-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
</xsl:template>
<xsl:template match="@sqltypes:*">
<xsl:if test="$Strip='true'">
<xsl:comment>
<xsl:text>Stripped (</xsl:text>
<xsl:value-of select="concat(name(), '=&quot;', ., '&quot;')"/>
<xsl:text>)</xsl:text>
</xsl:comment>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Change the Strip parameter at the beginning to true if you need to see the attributes that are stripped in the transformation process.

Using xsl to fill in empty rows in a html table

I want to create multiple html table pages using XML as the input and xsl as the transformation language.
Now these tables should always have a fixed height, whether it's just one row or ten.
I can't get it to work with CSS (min-height).
So I was wondering whether it is possible to get XSL to always output ten rows, adding empty rows when there are fewer than ten rows in the XML, or adding rows (and therefore resizing the table) when there are more than ten.
Any ideas how this can be achieved?
You can certainly do that. I can show you how to split your data into tables of ten rows each, padding the last one (or the only one) with dummy rows when there are not enough. It should get you started (without an example XML input and the desired HTML output, this is as much as I can do):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="/">
<xsl:apply-templates select="data/row[position() mod 10 = 1]" mode="newtable"/>
</xsl:template>
<xsl:template match="row" mode="newtable">
<table>
<xsl:apply-templates select="."/>
<xsl:apply-templates select="following-sibling::row[position() &lt; 10]"/>
<xsl:call-template name="dummy-rows">
<xsl:with-param
name="how-many"
select="9 - count(following-sibling::row[position() &lt; 10])"/>
</xsl:call-template>
</table>
</xsl:template>
<xsl:template match="row">
<tr><td><xsl:value-of select="."/></td></tr>
</xsl:template>
<xsl:template name="dummy-rows">
<xsl:param name="how-many" select="0"/>
<xsl:if test="$how-many > 0">
<tr><td>dummy</td></tr>
<xsl:call-template name="dummy-rows">
<xsl:with-param name="how-many" select="$how-many - 1"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
The idea is that you start your table with the "first" node of each set of 10. That's the [position() mod 10 = 1] predicate. When you get a hold of the starting point of your table you create the table boundaries and process that node again in a normal mode. Then you get the next nine data rows that follow it. Finally, you add as many dummy nodes as you need to make sure you got the 10 total in each table. The dummy-rows template is a recursion. So two techniques here: splitting the set by position() mod and using a recursion to implement iteration.
UPDATE If you only need to make sure you have at least ten rows in your table then you don't need the split logic:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="UTF-8"/>
<xsl:template match="/">
<table>
<xsl:apply-templates select="data/row"/>
<xsl:call-template name="dummy-rows">
<xsl:with-param
name="how-many"
select="10 - count(data/row)"/>
</xsl:call-template>
</table>
</xsl:template>
<xsl:template match="row">
<tr><td><xsl:value-of select="."/></td></tr>
</xsl:template>
<xsl:template name="dummy-rows">
<xsl:param name="how-many" select="0"/>
<xsl:if test="$how-many > 0">
<tr><td>dummy</td></tr>
<xsl:call-template name="dummy-rows">
<xsl:with-param name="how-many" select="$how-many - 1"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
You can try this with an input like this:
<data>
<row>1</row>
<row>1</row>
<row>3</row>
</data>
or an input like this:
<data>
<row>1</row>
<row>2</row>
<row>3</row>
<row>4</row>
<row>5</row>
<row>6</row>
<row>7</row>
<row>8</row>
<row>9</row>
<row>10</row>
<row>11</row>
<row>12</row>
</data>
In both cases the result was as expected. Try it. You should be able to take it from here.

How to split text and preserve HTML tags (XSLT 2.0)

I have an xml that has a description node:
<config>
<desc>A <b>first</b> sentence here. The second sentence with some link The link. The <u>third</u> one.</desc>
</config>
I am trying to split the sentences using the dot as a separator while keeping any HTML tags in the HTML output.
What I have so far is a template that splits the description but the HTML tags are lost in the output due to the normalize-space and substring-before functions.
My current template is given below:
<xsl:template name="output-tokens">
<xsl:param name="sourceText" />
<!-- Force a . at the end -->
<xsl:variable name="newlist" select="concat(normalize-space($sourceText), ' ')" />
<!-- Check if we have really a point at the end -->
<xsl:choose>
<xsl:when test ="contains($newlist, '.')">
<!-- Find the first . in the string -->
<xsl:variable name="first" select="substring-before($newlist, '.')" />
<!-- Get the remaining text -->
<xsl:variable name="remaining" select="substring-after($newlist, '.')" />
<!-- Check if our string is not in fact a . or an empty string -->
<xsl:if test="normalize-space($first)!='.' and normalize-space($first)!=''">
<p><xsl:value-of select="normalize-space($first)" />.</p>
</xsl:if>
<!-- Recursively apply the template for the remaining text -->
<xsl:if test="$remaining">
<xsl:call-template name="output-tokens">
<xsl:with-param name="sourceText" select="$remaining" />
</xsl:call-template>
</xsl:if>
</xsl:when>
<!--If no . was found -->
<xsl:otherwise>
<p>
<!-- If the string does not contains a . then display the text but avoid
displaying empty strings
-->
<xsl:if test="normalize-space($sourceText)!=''">
<xsl:value-of select="normalize-space($sourceText)" />.
</xsl:if>
</p>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
and I am using it in the following manner:
<xsl:template match="config">
<xsl:call-template name="output-tokens">
<xsl:with-param name="sourceText" select="desc" />
</xsl:call-template>
</xsl:template>
The expected output is:
<p>A <b>first</b> sentence here.</p>
<p>The second sentence with some link The link.</p>
<p>The <u>third</u> one.</p>
A good question, and not an easy one to solve. Especially, of course, if you're using XSLT 1.0 (you really need to tell us if that's the case).
I've seen two approaches to the problem. Both involve breaking it into smaller problems.
The first approach is to convert the markup into text (for example replace <b>first</b> by [b]first[/b]), then use text manipulation operations (xsl:analyze-string) to split it into sentences, and then reconstitute the markup within the sentences.
The second approach (which I personally prefer) is to convert the text delimiters into markup (convert "." to <stop/>) and then use positional grouping techniques (typically <xsl:for-each-group group-ending-with="stop"/>) to convert the sentences into paragraphs.
Here is one way to implement the second approach suggested by Michael Kay using XSLT 2.
This stylesheet demonstrates a two-pass transformation where the first pass introduces <stop/> markers after each sentence and the second pass encloses all groups ending with a <stop/> in a paragraph.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<!-- two-pass processing -->
<xsl:template match="/">
<xsl:variable name="intermediate">
<xsl:apply-templates mode="phase-1"/>
</xsl:variable>
<xsl:apply-templates select="$intermediate" mode="phase-2"/>
</xsl:template>
<!-- identity transform -->
<xsl:template match="@*|node()" mode="#all" priority="-1">
<xsl:copy>
<xsl:apply-templates select="@*|node()" mode="#current"/>
</xsl:copy>
</xsl:template>
<!-- phase 1 -->
<!-- insert <stop/> "milestone markup" after each sentence -->
<xsl:template match="text()" mode="phase-1">
<xsl:analyze-string select="." regex="\.\s+">
<xsl:matching-substring>
<xsl:value-of select="regex-group(0)"/>
<stop/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
<!-- phase 2 -->
<!-- turn each <stop/>-terminated group into a paragraph -->
<xsl:template match="*[stop]" mode="phase-2">
<xsl:copy>
<xsl:for-each-group select="node()" group-ending-with="stop">
<p>
<xsl:apply-templates select="current-group()" mode="#current"/>
</p>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
<!-- remove the <stop/> markers -->
<xsl:template match="stop" mode="phase-2"/>
</xsl:stylesheet>
This is my humble solution, based on the second suggestion in @Michael Kay's answer.
Unlike @Jukka's answer (which is very elegant indeed), I'm not using xsl:analyze-string, as the XPath 1.0 functions contains and substring-after are enough to accomplish the split. I've also started the match pattern from the config element.
Here's the transform:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<!-- two pass processing -->
<xsl:template match="config">
<xsl:variable name="pass1">
<xsl:apply-templates select="node()"/>
</xsl:variable>
<xsl:apply-templates mode="pass2" select="$pass1/*"/>
</xsl:template>
<!-- 1. Copy everything as is (identity) -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<!-- 1. Replace "text. text" with "text<dot/> text" -->
<xsl:template match="text()[contains(.,'. ')]">
<xsl:value-of select="substring-before(.,'. ')"/>
<dot/>
<xsl:value-of select="substring-after(.,'. ')"/>
</xsl:template>
<!-- 2. Group by examining in population order ending with dot -->
<xsl:template match="desc" mode="pass2">
<xsl:for-each-group select="node()"
group-ending-with="dot">
<p><xsl:apply-templates select="current-group()" mode="pass2"/></p>
</xsl:for-each-group>
</xsl:template>
<!-- 2. Identity -->
<xsl:template match="node()|@*" mode="pass2">
<xsl:copy>
<xsl:apply-templates select="node()|@*" mode="pass2"/>
</xsl:copy>
</xsl:template>
<!-- 2. Replace dot with mark -->
<xsl:template match="dot" mode="pass2">
<xsl:text>.</xsl:text>
</xsl:template>
</xsl:stylesheet>
Applied on the input shown in your question, produces:
<p>A <b>first</b> sentence here.</p>
<p>The second sentence with some link The link.</p>
<p>The <u>third</u> one.</p>
this might do the trick:
http://symphony-cms.com/download/xslt-utilities/view/20816/
/J

Preserve certain html tags during XSLT

I have looked up solutions on Stack Overflow, but none of them seem to work for me. Here is my question. Let's say I have the following text:
Source:
<greatgrandparent>
<grandparent>
<parent>
<sibling>
Hey, im the sibling .
</sibling>
<description>
$300$ <br/> $250 <br/> $200! <br/> <p> Yes, that is right! <br/> You can own a ps3 for only $200 </p>
</description>
</parent>
<parent>
... (SAME FORMAT)
</parent>
... (Several more parents)
</grandparent>
</greatgrandparent>
Output:
<newprice>
$300$ <br/> $250 <br/> $200! <br/> Yes, that is right! <br/> You can own a ps3 for only $200
</newprice>
I can't seem to find a way to do that.
Current XSL:
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="greatgrandparents">
<xsl:apply-templates />
</xsl:template>
<xsl:template match = "grandparent">
<xsl:for-each select = "parent" >
<newprice>
<xsl:apply-templates>
</newprice>
</xsl:for-each>
</xsl:template>
<xsl:template match="description">
<xsl:element name="newprice">
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
Use templates to define behavior on specific elements
<!-- after standard identity template -->
<xsl:template match="description">
<xsl:element name="newprice">
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
The first template says to swap description with newprice. The second one says to ignore the p element.
If you're unfamiliar with the identity template, take a look here for a few examples.
EDIT: Given the new example, we can see that you want to extract only the description element and its contents. Notice that the template action starts with the match="/" template. We can use this to control where our stylesheet starts and thus skip much of the riffraff we want to filter out.
Change the <xsl:template match="/"> to something more like:
<xsl:template match="/">
<xsl:apply-templates select="//description"/>
<!-- use a more specific XPath if you can -->
</xsl:template>
So altogether our solution looks like this:
<xsl:stylesheet
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
exclude-result-prefixes="xs">
<xsl:template match="/">
<xsl:apply-templates select="//description" />
</xsl:template>
<!-- this is the identity template -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="description">
<xsl:element name="newprice">
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<xsl:template match="p">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
Shouldn't the contents of <description> be inside a CDATA section? And then you would probably disable output escaping on xsl:value-of.
You should look into xsl:copy-of.
You would probably wind up with something like:
<xsl:template match="description">
<xsl:copy-of select="."/>
</xsl:template>
Probably the shortest solution is this one:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="description">
<newprice>
<xsl:copy-of select="node()"/>
</newprice>
</xsl:template>
<xsl:template match="text()[not(ancestor::description)]"/>
</xsl:stylesheet>
When this transformation is applied on the provided XML document, the wanted result is produced:
<newprice>
$300$ <br /> $250 <br /> $200! <br /> <p> Yes, that is right! <br /> You can own a ps3 for only $200 </p>
</newprice>
Do note:
The use of <xsl:copy-of select="node()"/> to copy all the subtree rooted in description, without the root itself.
How we override (with a specific, empty template) the XSLT built-in template, preventing any text nodes that are not descendants of a <description> element from being output.