Output of Perl destroys specific letters from non-English languages - html

I have an XML file as follows:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="test.xslt"?>
<results>
<test name="sentence1">
<description href="#ömr">
ömr1, ämr1, ümr1 and pär1
</description>
</test>
<test name="sentence2" href="#pär2">
<description>
ömr2, ämr2, ümr2 and pär2
</description>
</test>
<test name="sentence3" href="#pär3">
<description>
ömr3, ämr3, ümr3 and pär3
</description>
</test>
</results>
Then here is the XSLT
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:b="http://www.froglogic.com/XML2"
xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="html" version="5.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="Summary/test">
<html>
<body>
<xsl:for-each select="//test">
<xsl:variable name="linkMe" select="@name"/>
<xsl:value-of select="description"/>
<a href="#{$linkMe}" >
<xsl:value-of select="$linkMe" />
</a>
<xsl:value-of select="description"/>
</xsl:for-each>
</body>
</html>
</xsl:template>
I want to convert the XML to an HTML file using Perl, but the output is not what I expect, even though I have told Perl that I want UTF-8 output.
The Perl code is:
use strict;
use warnings;
use XML::LibXML;
use XML::Writer;
use XML::LibXSLT;
use XML::Parser;
use Encode qw( is_utf8 encode decode );
my $XML_File = "test2.xml";
my $XSLT_File = "test2.xslt";
my $HTML_File = "test2.html";
sub XML2HTML {
my $xml_parser = XML::LibXML->new('1.0', 'UTF-8');
my $xslt_parser = XML::LibXSLT->new('1.0', 'UTF-8');
my $xml = $xml_parser->parse_file($XML_File);
$xml->setEncoding('UTF-8');
my $xsl = $xml_parser->parse_file($XSLT_File);
my $stylesheet = $xslt_parser->parse_stylesheet($xsl);
my $results = $stylesheet->transform($xml);
my $output = $stylesheet->output_string($results);
$stylesheet->output_file($results, $HTML_File);
}
&XML2HTML($XML_File, $XSLT_File, $HTML_File);
Another question: how can I write the output file as UTF-8 with a BOM (UTF-8-BOM)? I searched the internet and could not find an exact answer; everything I found covers plain UTF-8 rather than UTF-8-BOM.
The HTML output does not look right:
ömr1, ämr1, ümr1 and pär1 ömr2, ämr2, ümr2 and pär2 ömr3, ämr3, ümr3 and pär3
The encoding format in HTML is
Codepage 1252(Western)
and it is strange!

First, you have a subroutine which operates on global variables. That is not a good idea. Instead, pass those values as arguments to the function so that it is not tied to names you use in other places in your program.
Second, you do nothing with $output, yet storing the output in it still increases the memory footprint of your program.
Third, looking at the underlying XS code for write_file, we see:
xsltSaveResultToFilename(filename, doc, self, 0);
And, xsltSaveResultToFilename is documented here. Looking at the source code for xsltSaveResultToFilename, we note that the routine deduces the output encoding from the stylesheet. So, the problem has to lie elsewhere.
It turns out my initial diagnosis was incorrect. After getting my hands on a system with the necessary libraries, I ran your script (which revealed syntax errors in your XSL file -- don't post code we cannot run). After fixing those, I realized the code was producing UTF-8 encoded output, but the HTML did not include a declaration of the document encoding. Therefore, when I viewed it in my browser, the browser tried to use Windows-1252. Your XSL template needs to declare the encoding of the HTML document as well. Of course, if you add the BOM, you probably don't need the declaration in the head of the document.
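To make the diagnosis above concrete, here is a minimal sketch in JavaScript (Node.js), used only because it is easy to run; it is not part of the Perl fix. It encodes the sample text as UTF-8 and then decodes the same bytes with a single-byte Western encoding, which is roughly what a browser does when no charset is declared. One assumption to note: Node's `latin1` stands in for Windows-1252 here, and the two agree for the bytes in this example. The function name is made up for illustration.

```javascript
// Encode text as UTF-8, then (wrongly) decode those bytes one byte per character.
// 'latin1' is an assumption standing in for Windows-1252 in this sketch.
function misreadUtf8AsLatin1(text) {
  const utf8Bytes = Buffer.from(text, 'utf8');
  return utf8Bytes.toString('latin1');
}

// "\u00F6" is o-umlaut and "\u00E4" is a-umlaut; each is two bytes in UTF-8.
const sample = '\u00F6mr1, \u00E4mr1';
console.log(misreadUtf8AsLatin1(sample)); // prints "Ã¶mr1, Ã¤mr1"

// Decoding with the right encoding round-trips cleanly.
console.log(Buffer.from(sample, 'utf8').toString('utf8') === sample); // prints true
```

The bytes on disk are the same in both cases; only the decoder's assumption differs, which is why declaring the charset (or adding a BOM) is enough to fix the display.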
The following script seems to work for me:
use strict;
use warnings;
use autouse Carp => 'croak';
use File::BOM ();
use XML::LibXML;
use XML::LibXSLT;
xml_to_html('test.xml', 'test.xsl', 'test.html');
sub xml_to_html {
my ($xml_file, $xsl_file, $html_file) = @_;
open my $out, '>:unix', $html_file
or croak "Failed to open '$html_file': $!";
print $out $File::BOM::enc2bom{'UTF-8'}
or croak "Failed to write UTF-8 BOM: $!";
my $xslt_parser = XML::LibXSLT->new;
my $xml_parser = XML::LibXML->new;
my $xml = $xml_parser->parse_file( $xml_file );
my $xsl = $xml_parser->parse_file( $xsl_file );
my $style = $xslt_parser->parse_stylesheet( $xsl );
my $results = $style->transform( $xml );
$style->output_fh( $results, $out );
return;
}
with this template:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:b="http://www.froglogic.com/XML2"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<xsl:output method="html" version="5.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>,
</head>
<body>
<xsl:for-each select="//test">
<xsl:variable name="linkMe" select="@name"/>
<xsl:value-of select="description"/>
<a href="#{$linkMe}" >
<xsl:value-of select="$linkMe" />
</a>
<xsl:value-of select="description"/>
</xsl:for-each>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
and produces the following output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns:b="http://www.froglogic.com/XML2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">,
</head>
<body>
ömr1, ämr1, ümr1 and pär1
sentence1
ömr1, ämr1, ümr1 and pär1
ömr2, ämr2, ümr2 and pär2
sentence2
ömr2, ämr2, ümr2 and pär2
ömr3, ämr3, ümr3 and pär3
sentence3
ömr3, ämr3, ümr3 and pär3
</body>
</html>
I have
$ pacman -Ss libxslt
extra/libxslt 1.1.29+42+gac341cbd-1 [installed]
XML stylesheet transformation library
which does not seem to include support for generating HTML5 doctype.
Depending on your specific needs, you may have to tweak the XSLT file further.
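On the UTF-8-BOM part of the question: the BOM is just the three bytes EF BB BF written before the UTF-8 content, which is what the script above does before handing the file handle to the stylesheet. As a language-neutral illustration, here is a small JavaScript (Node.js) sketch; the helper names `withBom` and `hasBom` are made up for this example.

```javascript
// The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = Buffer.from([0xEF, 0xBB, 0xBF]);

// Return the UTF-8 bytes of `text` with a BOM in front (hypothetical helper).
function withBom(text) {
  return Buffer.concat([UTF8_BOM, Buffer.from(text, 'utf8')]);
}

// Check whether a byte buffer starts with the UTF-8 BOM (hypothetical helper).
function hasBom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

const bytes = withBom('<html></html>');
console.log(bytes.subarray(0, 3).toString('hex')); // prints "efbbbf"
console.log(hasBom(bytes)); // prints true
```

Checking the first three bytes of the generated HTML file in the same way is a quick way to confirm that the BOM was actually written.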

Related

How do I use the "replace" function in XSLT 1.0 to remove special characters from my XML to get the correct JSON output?

XML I was given
<root>
<page>0</page>
<totalRecords>74</totalRecords>
<totalPages>1</totalPages>
<offset>0</offset>
<status>success</status>
<entityList>
<ENAME>
&lt;ADAMS&gt;
</ENAME>
<DNAME>&amp;RESEARCH</DNAME>
<JOB>CLERK</JOB>
<EMPNO>7876</EMPNO>
<HIREDATE>1987-05-23 "00: 00 ":00.0</HIREDATE>
<LOC>DALLAS</LOC>
</entityList>
<entityList>
<ENAME>&gt;ALLEN</ENAME>
<DNAME>&amp;SALES</DNAME>
<JOB>SALESMAN</JOB>
<EMPNO>7499</EMPNO>
<HIREDATE>1981-02-20 00:00:00.0</HIREDATE>
<LOC>CHICAGO</LOC>
</entityList>
<entityList>
<ENAME>Abhi&gt;</ENAME>
<DNAME>&amp;SALES</DNAME>
<JOB>PRU</JOB>
<EMPNO>7956</EMPNO>
<HIREDATE></HIREDATE>
<LOC>CHIC"AGO "</LOC>
</entityList>
</root>
MY XSLT block thus far
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="no" omit-xml-declaration="yes" />
<xsl:template match="root">
{
"page":<xsl:value-of select="//page"/>,
"totalRecords":<xsl:value-of select="//totalRecords"/>,
"totalPages":<xsl:value-of select="//totalPages"/>,
"offset":<xsl:value-of select="//offset"/>,
"status":"<xsl:value-of select="//status"/>",
"entityList":[<xsl:for-each select="entityList">
{
"ENAME":"<xsl:value-of select="translate(ENAME,'&lt;,&gt;','')"/>",
"DNAME":"<xsl:value-of select="replace(DNAME,'&amp;','')"/>",
"JOB":"<xsl:value-of select="JOB"/>",
"EMPNO":"<xsl:value-of select="EMPNO"/>",
"HIREDATE":"<xsl:value-of select="replace(HIREDATE,'&quot;','')"/>",
"LOC":"<xsl:value-of select="replace(LOC,'&quot;','')"/>"
}
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
]
}
</xsl:template>
</xsl:stylesheet>
Do I use the replace function or do I use translate? On here I see people use variables but I am new to XSLT and am just learning everything.
I need to remove the '&', the '<>', and the quotes in the hire date along with all other special characters in the XML
Please assist with pointers and tips as you can. Thank you.
There is no replace() function in XSLT 1.0; it was introduced in 2.0.
You can remove special characters using translate(), for example translate(., '&', '').
But what do you really want to do? Why are you trying to remove these characters? It looks as if you're trying to output JSON, in which case the way to handle these characters correctly is simply to use <xsl:output method="text"/>.
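To illustrate why these characters do not need to be removed for JSON, here is a short sketch in JavaScript (Node.js), which has a JSON serializer built in. It is only an illustration of the JSON escaping rules, not an XSLT solution. In a JSON string, `&`, `<` and `>` are ordinary characters; the characters that need escaping are the double quote, the backslash and control characters. So with text output the first three pass through safely, while quotation marks in the data would still need handling (for example with translate(), as the question already attempts).

```javascript
// Sample values taken from the question's data.
const record = {
  ENAME: '<ADAMS>',
  DNAME: '&RESEARCH',
  LOC: 'CHIC"AGO "',
};

// JSON.stringify escapes only what JSON requires: here, the double quotes.
const json = JSON.stringify(record);
console.log(json);
// prints {"ENAME":"<ADAMS>","DNAME":"&RESEARCH","LOC":"CHIC\"AGO \""}

// The result parses back to the original values, so nothing had to be removed.
console.log(JSON.parse(json).LOC === record.LOC); // prints true
```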

How can one match a subnode in xslt in xml?

I've inherited a project that uses XSLT to transform some HTML. Matching works with '/', but I can't get a template to match on a subnode.
I found a code snippet on MDN that applies an XSLT transformation to HTML, and that code works: https://developer.mozilla.org/en-US/docs/Web/XSLT/XSLT_JS_interface_in_Gecko/Advanced_Example
The problem is that I'm not able to match the node "firmenliste" with a template.
What I use is:
var xslRef;
var xslloaded = false;
var xsltProcessor = new XSLTProcessor();
var myDOM;
var xmlRef = document.implementation.createDocument("", "", null);
p = new XMLHttpRequest();
p.open("GET", "xsl/FirmenListe.xsl",false);
p.send(null);
xslRef = p.responseXML;
xsltProcessor.importStylesheet(xslRef);
xmlRef = document.implementation.createDocument("", "", null);
// we want to move a part of the DOM from an HTML document to an XML document.
// importNode is used to clone the nodes we want to process via XSLT - true makes it do a deep clone
var myNode = document.getElementById("example");
var clonedNode = xmlRef.importNode(myNode, true);
// after cloning, we append
xmlRef.appendChild(clonedNode);
var fragment = xsltProcessor.transformToFragment(xmlRef, document);
// clear the contents
document.getElementById("example").innerHTML = "";
myDOM = fragment;
// add the new content from the transformation
document.getElementById("example").appendChild(fragment)
The corresponding html and xslt looks like:
<xml id="Data">
<data id="example" xmlns:dt="urn:schemas-microsoft-com:datatypes">
<firmenliste></firmenliste>
</data>
</xml>
<?xml version ='1.0'?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:template match="/">
b
<xsl:apply-templates select="firmenliste"/>
</xsl:template>
<xsl:template match="firmenliste">
A
</xsl:template>
</xsl:stylesheet>
The output should be
<xml id="Data">
<data id="example" xmlns:dt="urn:schemas-microsoft-com:datatypes">
bA
</data>
</xml>
But what i get is
<xml id="Data">
<data id="example" xmlns:dt="urn:schemas-microsoft-com:datatypes">
b
</data>
</xml>
Edit: The problem is reproducible in https://next.plnkr.co/edit/Yvc59BPQmI1PHlSy?open=lib%2Fscript.js&preview
I think the main problem is that you start with elements in an HTML DOM document, which since HTML5 are by definition in the XHTML namespace http://www.w3.org/1999/xhtml. You then clone them into an XML document, where they keep that namespace. In XSLT/XPath, however, a path or match pattern like firmenliste selects or matches elements of that name in no namespace, not elements in the XHTML namespace.
So using
<xsl:template match="/">
b
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="xhtml:firmenliste" xmlns:xhtml="http://www.w3.org/1999/xhtml">
A
</xsl:template>
instead would fix that problem: https://next.plnkr.co/edit/tsB9qwCafLodg8Rz?open=lib%2Fscript.js&preview
But in my experience the whole approach of using undefined elements like xml or firmenliste in HTML and moving between HTML and XML DOMs is asking for trouble. Consider keeping the XML data you want to transform outside the HTML document, in a separate XML document; use XSLT only on XML documents, and insert the transformation result into an HTML DOM only if you have called transformToFragment with the owning HTML document as the second argument.
In your XSLT, since the first template matches the root node with '/', you need to give the full XPath to <firmenliste> in <xsl:apply-templates>.
Try replacing the line <xsl:apply-templates select="firmenliste"/>
with <xsl:apply-templates select="/xml/data/firmenliste"/>:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
b
<xsl:apply-templates select="/xml/data/firmenliste" />
</xsl:template>
<xsl:template match="firmenliste">
A
</xsl:template>
</xsl:stylesheet>

XSLT 3.0 json-to-xml() not working with valid json

The JSON below is valid, but the XSLT 3.0 function json-to-xml() fails on it and reports a JSON syntax error.
{
"identifier": {
"use": "<div xmlns=\"http://www.w3.org/1999/xhtml\"> </div>"
}
}
What can I do to make it work? I think something related to escaping characters needs to be done here. Any pointer would be a great help.
You can run the code at this location: Fiddler
You are trying to put your JSON, which itself contains XML, into an XML input document. That is what causes the problem: the XML parser fails on the input you have put into the fiddle. If you instead use a string parameter for the stylesheet, as done in https://xsltfiddle.liberty-development.net/gWmuiJf, you get
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="3.0">
<xsl:output indent="yes"/>
<xsl:param name="json-input" as="xs:string"><![CDATA[{
"identifier": {
"use": "<div xmlns=\"http://www.w3.org/1999/xhtml\"> </div>"
}
}]]></xsl:param>
<xsl:template match="/">
<xsl:copy-of select="json-to-xml($json-input)"/>
</xsl:template>
</xsl:stylesheet>
and the output is
<map xmlns="http://www.w3.org/2005/xpath-functions">
<map key="identifier">
<string key="use">&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt; &lt;/div&gt;</string>
</map>
</map>
You could also use the same CDATA escaping in the primary XML input, that is, use
<root><![CDATA[{
"identifier": {
"use": "<div xmlns=\"http://www.w3.org/1999/xhtml\"> </div>"
}
}]]></root>
as the XML input and then
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
version="3.0">
<xsl:output indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="json-to-xml(root)"/>
</xsl:template>
</xsl:stylesheet>
as the XSLT, as done in https://xsltfiddle.liberty-development.net/gWmuiJf/1, and you get the same result as above.
In the "fiddler" you point to, you have an XML file:
<data>{
"identifier": {
"use": "<div xmlns=\"http://www.w3.org/1999/xhtml\"> </div>"
}
}
</data>
The problem is that this is invalid XML. The XML parser sees a start tag <data>, followed by a text node, followed by a start tag <div xmlns=\, and complains because the first character after xmlns= must be " rather than \.
So you have XML nested within JSON nested within XML. When you nest XML within JSON you must escape " as \", which you have done; but when you nest JSON within XML, you must escape < as &lt;, which you have not done. The simplest solution is probably to use a CDATA section:
<data><![CDATA[{
"identifier": {
"use": "<div xmlns=\"http://www.w3.org/1999/xhtml\"> </div>"
}
}
]]></data>
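The two options described above (escaping the markup characters, or wrapping the JSON in a CDATA section) can be sketched as small string helpers. This is an illustrative JavaScript (Node.js) sketch, not something the XSLT processor requires; the function names are made up for this example.

```javascript
// Escape the characters that are markup-significant in XML text content.
// "&" must be replaced first so the inserted entities are not escaped twice.
function escapeForXmlText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap text in a CDATA section. A literal "]]>" inside the text would end the
// section early, so it is split across two adjacent CDATA sections.
function wrapInCdata(text) {
  return '<![CDATA[' + text.split(']]>').join(']]]]><![CDATA[>') + ']]>';
}

const jsonText = '{ "use": "<div xmlns=\\"http://www.w3.org/1999/xhtml\\"> </div>" }';
console.log('<data>' + escapeForXmlText(jsonText) + '</data>');
console.log('<data>' + wrapInCdata(jsonText) + '</data>');
```

Either form gives the XML parser a well-formed document whose text content is the original JSON string, which is what json-to-xml() needs as input.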

Idea about creating a html file from xml+xslt in Perl

My aim is to generate an HTML file from an XML file and an XSLT file. I have searched the net and found no good suggestion for how to do this. As I understand it, when I add the line
<xsl:output method="html" ..... />
I can see the output as HTML. But my goal is to produce a new file that I can save from a Perl script and then open to see the result. I don't know whether that is possible, and I'd be glad to find out.
I would appreciate any ideas that could help.
Here are my XML and XSLT files:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="test.xslt"?>
<Summary>
<test name="test">
<xml_name name="ABC">
<version num="104">
<component name="APPS">
<componenet_ver>104</componenet_ver>
</component>
<component name="Ner">
<componenet_ver>1.0</componenet_ver>
</component>
<component name="HUNE">
<componenet_ver>003</componenet_ver>
</component>
<component name="FADA">
<componenet_ver>107</componenet_ver>
</component>
<component name="VEDA">
<componenet_ver>8.8</componenet_ver>
</component>
</version>
</xml_name>
</test>
</Summary>
and
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="5.0" encoding="UTF-8" indent="yes"/>
<xsl:template match="Summary/test">
<html>
<body>
<table>
<tr bgcolor="Peru">
<th>Components</th>
<th>Versions</th>
</tr>
<xsl:for-each select="//component">
<xsl:variable name="CompomName" select="@name"/>
<xsl:variable name="VerName" select="description"/>
<tr>
<td bgcolor="aqua" name = "{$CompomName}"> </td>
<td bgcolor="aqua" name = "{$VerName}"> </td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
After a lot of experimenting, I found that it can be done with a few lines of code. I am posting the code here in case someone else has the same question:
use XML::LibXSLT;
use XML::LibXML;
my $XML_FILENAME = "test.xml";
my $XSL_FILENAME = "test.xslt";
my $OUTPUT_FILENAME = "test.html";
my $xml_parser = XML::LibXML->new;
my $xslt_parser = XML::LibXSLT->new;
my $xml = $xml_parser->parse_file($XML_FILENAME);
my $xsl = $xml_parser->parse_file($XSL_FILENAME);
my $stylesheet = $xslt_parser->parse_stylesheet($xsl);
my $results = $stylesheet->transform($xml);
my $output = $stylesheet->output_string($results);
# the main command and of course the easiest one
$stylesheet->output_file($results, $OUTPUT_FILENAME);

&#160 shows up as a '?' question mark on HTML

I have a problem with XSLT...
<xsl:text>&#160;</xsl:text>
After generation, for some reason the resulting JSP file produces a '?' instead. What's wrong?
My recent system changes:
I changed Java5 -> Java6
Weblogic -> Weblogic12
Eclipse Ganymede -> Oracle Pack Eclipse
EDIT 1: <xsl:output method="xml"/>, encoding=UTF-8
The original XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:include href="common.xsl"/>
<xsl:output method="xml"/>
...
<xsl:template name="makeLink">
<xsl:variable name="fieldtype" select="name()"/>
<xsl:variable name="currentNode"><xsl:value-of select="generate-id()"/></xsl:variable>
<xsl:variable name="appendSpace">
<xsl:for-each select="ancestor::ButtonList[position() = 1]/descendant::Button">
<xsl:if test="generate-id() = $currentNode and position() > 1">true</xsl:if>
</xsl:for-each>
</xsl:variable>
<a href="{$url}">
<xsl:attribute name="id">btn_<xsl:value-of select="Action"/></xsl:attribute>
<xsl:call-template name="populateAttributes">
<xsl:with-param name="fieldtype">
<xsl:value-of select="$fieldtype"/>
</xsl:with-param>
</xsl:call-template>
<xsl:copy-of select="@class"/>
<xsl:copy-of select="@style"/>
<xsl:text>&lt;span&gt;&lt;span&gt;</xsl:text><xsl:value-of select="$buffer"/><xsl:text>&lt;/span&gt;&lt;/span&gt;</xsl:text>
</a>
<xsl:if test="not(@omitWhiteSpace)">
<xsl:text>&#160;</xsl:text>
</xsl:if>
<xsl:if test="ReadOnly and ReadOnly != 'someReadOnlyMethod'
and ReadOnly != 'someReadyOnlyMethod'
and ReadOnly != ''">
<xsl:text>&lt;/c:if&gt;</xsl:text>
</xsl:if>
</xsl:template>
....
Transformed (after XSLT), and resulting JSP page:
<%@ page contentType = "text/html;charset=GBK"%>
<%@ page isELIgnored = "false"%>
<%@ page language="java"
import=" my.controller.*, my.core.config.*, my.core.datastructure.*, my.core.error.*, my.core.util.*,
my.service.Constants, my.service.modulesvr.ModuleBean, myW.sn.*, java.util.Locale, java.util.Map"%>
<%@ taglib uri="http://www.mycompany.com/my/tags/htmltag-10" prefix="html"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c"%>
<%MySn mySession = (MySn) session.getValue("MySn"); QuickSearchController mb = (mySession == null) ? null : (QuickSearchController)
mySession.getModuleBean(); String sessionToken = mySession.getSessionToken(); String htmlCharSet = mySession.getEncoding();
MyUsr user = mySession.getMyUsr(); String[] result; Object o;%>
.........
<span><span>NEW PROP</span></span> </c:if>
EDIT 2: It seems that if I use <xsl:text>&amp;#160;</xsl:text> instead of <xsl:text>&#160;</xsl:text>, the problem goes away. In the JSP it appears as &#160; and in the browser it is rendered as a no-break space, which is expected.
That often happens if your encoding is wrong. What encoding are you writing your output in? How are you serving up the page? Possibly you are serializing in UTF-8 but trying to display in ISO-8859-1 (or Windows-1252), or vice-versa.
Check to see if the default encoding somewhere has changed.
Just because you say <xsl:output method="xml" encoding="UTF-8"/> doesn't mean that the program will honor it. Is the XSLT embedded in a piece of Java? Does the Java control the streams/readers/writers?
If you can save a portion of the file and dump it in HEX, you should quickly be able to find out. If you see 0xC2 0xA0 then your file is indeed in UTF-8. However, if you just see 0xA0 alone, then you are in ISO-8859-1 or one of its close relations.
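The byte patterns mentioned above are easy to reproduce. Here is a small JavaScript (Node.js) sketch that prints the bytes of a no-break space (U+00A0) in both encodings; the helper name `hexBytes` is made up for this example, and `latin1` is Node's name for ISO-8859-1.

```javascript
// Return the bytes of `text` in the given encoding as a hex listing.
function hexBytes(text, encoding) {
  return Array.from(
    Buffer.from(text, encoding),
    (b) => '0x' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

const nbsp = '\u00A0'; // the character that &#160; refers to

console.log(hexBytes(nbsp, 'utf8'));   // prints "0xC2 0xA0"
console.log(hexBytes(nbsp, 'latin1')); // prints "0xA0"
```

If the generated JSP contains 0xC2 0xA0 but is served with a different charset (the page directive in the question declares GBK), the mismatch between bytes and declared encoding would be a plausible source of the '?'.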
It's also possible that the page is being rendered properly, but the page is being served up with the wrong encoding. Can you look at the headers returned, perhaps by using Firebug in Firefox or in Chrome "Web Developer->Information->View Response Headers" or by using the IE debug tools.