I'm trying to clean up some HTML files. I have converted them to XHTML with tidy:
$ tidy -asxml -i -w 150 -o o.xml index.html
The resulting XHTML ends up containing named entities.
When I run xsltproc on those XHTML files, I keep getting errors:
$ xsltproc --novalid -o out.htm t.xsl o.xml
o.xml:873: parser error : Entity 'mdash' not defined
resources to storing data and using permissions &mdash; as needed.</
^
o.xml:914: parser error : Entity 'uarr' not defined
</div>&uarr; Go to top
^
o.xml:924: parser error : Entity 'nbsp' not defined
Android 3.2 r1 - 27 Jul 2011 12:18
If I add --html to the xsltproc call, it complains about a tag whose name and id attributes have the same value (which is valid):
$ xsltproc --novalid --html -o out.htm t.xsl o.xml
o.xml:845: element a: validity error : ID top already defined
<a name="top" id="top"></a>
^
The xslt is simple:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//*[@id='side-nav']"/>
</xsl:stylesheet>
Why doesn't --html work? Why is it complaining? Or should I forget it and fix the entities?
I went the other way: I made tidy produce numeric entities rather than named ones with the -n option.
$ tidy -asxml -i -n -w 150 -o o.xml index.xml
Now I can drop the --html option and it works.
I could remove that name attribute, but I still wonder why it is reported as an error although it is valid.
I am assuming that the unclearly stated question is this: I know how to avoid "Entity 'XXX' not defined" errors when running xsltproc (add --html). But how do I get rid of "ID YYY already defined"?
Recent builds of Tidy have an anchor-as-name option. You can set it to "no" to remove unwanted name attributes:
This option controls the deletion or addition of the name attribute in elements where it can serve as anchor. If set to "yes", a name attribute, if not already existing, is added along an existing id attribute if the DTD allows it. If set to "no", any existing name attribute is removed if an id attribute exists or has been added.
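To illustrate the effect that setting anchor-as-name to "no" has, here is a minimal Python sketch that mimics it on an already well-formed XHTML fragment: it drops name wherever an id is also present. The function name drop_redundant_names is hypothetical, the sample input is made up, and this is not how Tidy itself is implemented.

```python
import xml.etree.ElementTree as ET

def drop_redundant_names(xhtml):
    """Remove the name attribute from every element that also has an id."""
    root = ET.fromstring(xhtml)
    for elem in root.iter():
        if "id" in elem.attrib and "name" in elem.attrib:
            del elem.attrib["name"]
    return ET.tostring(root, encoding="unicode")

# Made-up sample: the first anchor carries both attributes, the second only name
cleaned = drop_redundant_names(
    '<div><a name="top" id="top"></a><a name="x"></a></div>'
)
print(cleaned)
```

Elements that only have a name attribute are left alone, which matches the description quoted above.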
I am working on importing an XML file from the internet into my MySQL database and I am running into problems because it contains some multi-valued fields. For example, there may be one "category" tag per item, or three. In a relational database such a field should get its own table, but I am not sure how to connect things like that. Below is a shortened example of what I am dealing with.
<Library>
<Book>
<Author> Dave </Author>
<Title> XML Help </Title>
<Category> Computers </Category>
<Category> XML </Category>
</Book>
</Library>
I am aware of the basic syntax as below
LOAD XML LOCAL INFILE 'file.xml' INTO TABLE table ROWS IDENTIFIED BY '<Value>';
This assumes that there is only a single value for each attribute. I cannot edit the xml file because it is hundreds of thousands of lines long and I am looking to automate this process anyway. Thank you for your help.
Consider transforming your XML with XSLT, a declarative and special-purpose language like SQL, used specifically to transform XML documents. And since the mysql CLI can run shell commands using system or \!, you can call an installed XSLT processor at command line or run a prepared (and compiled) general-purpose language (Java, Python, PHP, etc.) script at command line.
Assume you need to transform the original input into the following, where each distinct category is split into its own <Book> node.
<?xml version="1.0" encoding="UTF-8"?>
<Library>
<Book>
<Author>Dave</Author>
<Title>XML Help</Title>
<Category>Computers</Category>
</Book>
<Book>
<Author>Dave</Author>
<Title>XML Help</Title>
<Category>XML</Category>
</Book>
</Library>
Run the XSLT below.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="Book">
<xsl:apply-templates select="Category"/>
</xsl:template>
<xsl:template match="Category">
<Book>
<xsl:apply-templates select="preceding-sibling::Author"/>
<xsl:apply-templates select="preceding-sibling::Title"/>
<xsl:copy>
<xsl:apply-templates select="node()"/>
</xsl:copy>
</Book>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="normalize-space()"/>
</xsl:template>
</xsl:stylesheet>
Then, call shell commands from the mysql CLI:
With Unix's xsltproc
mysql> system xsltproc myScript.xsl Input.xml > Output.xml
With Windows' System.Xml.Xsl (see Powershell script here)
mysql> system Powershell.exe -File "PS_Script.ps1" "Input.xml" "myScript.xsl" "Output.xml"
Call general-purpose languages (see XSLT scripts here):
mysql> system java run_xslt
mysql> system python run_xslt.py
mysql> system php run_xslt.php
mysql> system perl run_xslt.pl
mysql> system Rscript run_xslt.R
Finally, run LOAD XML using transformed document:
LOAD XML LOCAL INFILE 'Output.xml'
INTO TABLE mytable
ROWS IDENTIFIED BY '<Book>';
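If no XSLT processor is available, the same reshaping can be sketched with Python's standard library instead of XSLT. This is a swapped-in technique, not the stylesheet above: the function name split_by_category is hypothetical, the element names come from the question's sample, and the text is only trimmed with strip() rather than fully normalized.

```python
import xml.etree.ElementTree as ET

def split_by_category(xml_text):
    """Return a Library tree with one Book per Category of each input Book."""
    source = ET.fromstring(xml_text)
    target = ET.Element("Library")
    for book in source.findall("Book"):
        for category in book.findall("Category"):
            new_book = ET.SubElement(target, "Book")
            # Repeat the shared fields, trimming the padding seen in the input
            for tag in ("Author", "Title"):
                ET.SubElement(new_book, tag).text = (book.findtext(tag) or "").strip()
            ET.SubElement(new_book, "Category").text = (category.text or "").strip()
    return target

sample = """<Library><Book>
<Author> Dave </Author><Title> XML Help </Title>
<Category> Computers </Category><Category> XML </Category>
</Book></Library>"""
flat = split_by_category(sample)
print(ET.tostring(flat, encoding="unicode"))
```

The resulting tree can be written to a file and passed to LOAD XML in the same way as the XSLT output.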
I am trying to get the data from a MariaDB database into a 3rd party program running on a machine that does not have access to the DB server, so I need to use flat text files.
CSV is not an option, as the program reading the data does not play well with escapes and quotations.
So I am stuck with XML for now. Luckily, MySQL and MariaDB support the --xml parameter in both the mysql and mysqldump command-line tools.
However, all columns have the name 'field' with an attribute name="column_name":
shell> mysql --xml -uroot -e "SHOW VARIABLES LIKE 'version%'"
<?xml version="1.0"?>
<resultset statement="SHOW VARIABLES LIKE 'version%'" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="Variable_name">version</field>
<field name="Value">5.0.40-debug</field>
</row>
<row>
<field name="Variable_name">version_comment</field>
<field name="Value">Source distribution</field>
</row>
For the program reading this data to be able to understand it, I need it to be in the following format:
<row>
<Variable_name>version</Variable_name>
<Value>5.0.40-debug</Value>
</row>
<row>
<Variable_name>version_comment</Variable_name>
<Value>Source distribution</Value>
</row>
I have written a little XSLT stylesheet to convert this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="field[@name]">
<xsl:element name="{@name}">
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
Which works, but is very slow for larger datasets (100k records, 2M lines XML) using Xalan C++ from the command line. It can take up to 15-30 minutes.
Are there better ways to accomplish this? It is really too bad we can't tell MySQL / MariaDB to output XML using normal tag names instead of these generic ones and having to translate it after the export.
I have used the following resource to handle the xml dump from MySQL.
http://www.tutorialspoint.com/java_xml/java_dom_query_document.htm
I can loop thru the field name tags to get the name/value pair.
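The same name/value loop can also be sketched as a streaming conversion in Python, which avoids holding the whole document in memory. This is only a sketch under assumptions: the function name convert_rows is hypothetical, every column name is assumed to be a valid XML element name, and no timing was measured against the XSLT approach.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def convert_rows(source, out):
    """Stream <row><field name="x">v</field></row> into <row><x>v</x></row>."""
    out.write("<resultset>\n")
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            out.write("<row>")
            for field in elem.findall("field"):
                name = field.get("name")
                out.write("<%s>%s</%s>" % (name, escape(field.text or ""), name))
            out.write("</row>\n")
            elem.clear()  # drop the finished row's children to keep memory use low
    out.write("</resultset>\n")

sample = b"""<?xml version="1.0"?>
<resultset statement="SHOW VARIABLES LIKE 'version%'">
<row><field name="Variable_name">version</field><field name="Value">5.0.40-debug</field></row>
<row><field name="Variable_name">version_comment</field><field name="Value">Source distribution</field></row>
</resultset>"""
result = io.StringIO()
convert_rows(io.BytesIO(sample), result)
print(result.getvalue())
```

For a real export, source would be the dump file opened in binary mode and out a text file opened for writing.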
I'm using CTest (part of CMake) for my automated tests.
How do I get CTest results in the Jenkins dashboard ? Or, phrased differently, how do I get CTest to output in JUnit-like XML ?
In Jenkins, after the CMake part (probably made through the CMake plugin), add the following batch script, or adapt for builds on Linux :
del build_32\JUnitTestResults.xml
pushd build_32\Tests
"C:\Program Files\CMake 2.8\bin\ctest.exe" -T Test -C RelWithDebInfo --output-on-failure
popd
verify >nul
C:\Python27\python.exe external/tool/CTest2JUnit.py build_32/Tests external/tool/CTest2JUnit.xsl > build_32/JUnitTestResults.xml
build_32 is the Build Directory in the CMake plugin
Tests is the subdirectory where all my tests live
-T Test makes CTest output in XML (?!)
verify >nul resets errorlevel to 0, because CTest returns >0 if any test fails, which Jenkins interprets as "the whole build failed", which we don't want
The last line converts CTest's XML into a minimal JUnit xml. The Python script and the xslt live in the source directory, you may want to change that.
The python script looks like this (hacked together in 10 min, beware) :
from lxml import etree
import os
import sys

# The first line of Testing/TAG names the directory holding the latest results
with open(os.path.join(sys.argv[1], "Testing", "TAG")) as tag_file:
    dirname = tag_file.readline().strip()

# Parse CTest's Test.xml and the stylesheet straight from disk
xmldoc = etree.parse(os.path.join(sys.argv[1], "Testing", dirname, "Test.xml"))
transform = etree.XSLT(etree.parse(sys.argv[2]))

# Apply the transformation and write the JUnit XML to stdout
print(transform(xmldoc))
It needs lxml.
It takes two arguments: the directory in which the tests live (in the build directory), and an xsl file
It simply reads the latest xml test results, transforms them with the xsl, and writes the output to stdout
The directory holding the latest xml test results is named in the first line of the Testing/TAG file, hence the additional open
The xsl looks like this. It's pretty minimal but gets the job done : [EDIT] see MOnsDaR 's improved version : http://pastebin.com/3mQ2ZQfa
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/Site/Testing">
<testsuite>
<xsl:apply-templates select="Test"/>
</testsuite>
</xsl:template>
<xsl:template match="Test">
<xsl:variable name="testcasename"><xsl:value-of select= "Name"/></xsl:variable>
<xsl:variable name="testcaseclassname"><xsl:value-of select= "FullName"/></xsl:variable>
<testcase name="{$testcasename}" classname="{$testcaseclassname}">
<xsl:if test="@Status = 'passed'">
</xsl:if>
<xsl:if test="@Status = 'failed'">
<error type="error"><xsl:value-of select="Results/Measurement/Value/text()" /></error>
</xsl:if>
<xsl:if test="@Status = 'notrun'">
<skipped><xsl:value-of select="Results/Measurement/Value/text()" /></skipped>
</xsl:if>
</testcase>
</xsl:template>
</xsl:stylesheet>
Finally, check "Publish JUnit tests results" (or similar, my version is in French) and set the xml path to build_32/JUnitTestResults.xml
Well, that was ugly. But still, hope this helps someone. And improvements are welcome ( running ctest from python maybe ? Using the path of the Python plugin instead of C:... ? )
This seems to be integrated in jenkins-ci nowadays:
https://github.com/jenkinsci/xunit-plugin/commits/master/src/main/resources/org/jenkinsci/plugins/xunit/types/ctest-to-junit.xsl
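For setups where lxml is not available, the mapping performed by the small stylesheet above can be sketched with Python's standard library alone. This is an illustration, not the xunit plugin's conversion: the function name ctest_to_junit is hypothetical, and the sample input only contains the elements the stylesheet reads (Status, Name, FullName, Results/Measurement/Value).

```python
import xml.etree.ElementTree as ET

def ctest_to_junit(test_xml):
    """Build a <testsuite> element from the text of a CTest Test.xml file."""
    site = ET.fromstring(test_xml)
    suite = ET.Element("testsuite")
    for test in site.findall("Testing/Test"):
        case = ET.SubElement(suite, "testcase",
                             name=test.findtext("Name", ""),
                             classname=test.findtext("FullName", ""))
        output = test.findtext("Results/Measurement/Value", "")
        status = test.get("Status")
        if status == "failed":
            ET.SubElement(case, "error", type="error").text = output
        elif status == "notrun":
            ET.SubElement(case, "skipped").text = output
        # passed tests get an empty <testcase>, as in the stylesheet
    return suite

sample = """<Site><Testing>
<Test Status="passed"><Name>a</Name><FullName>./a</FullName></Test>
<Test Status="failed"><Name>b</Name><FullName>./b</FullName>
<Results><Measurement><Value>boom</Value></Measurement></Results></Test>
</Testing></Site>"""
suite = ctest_to_junit(sample)
print(ET.tostring(suite, encoding="unicode"))
```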
Let me try to explain my situation:
We are using a CMS which 'bakes' a website, and you publish it to a webserver. The published site contains only static HTML ( or XML ) pages ( generated from the content in the CMS database ).
I imported an XML file with the names and phone numbers from the company phone directory.
Using only XSLT, can I create a way to search that directory?
For example, if my XML file, directory.xml looks like this:
<directory>
<person>
<fname>Ryan</fname>
<lname>Purple</lname>
<phone>887 778 5544</phone>
</person>
<person>
<fname>Tanya</fname>
<lname>Orange</lname>
<phone>887 998 5541</phone>
</person>
<directory>
Can I create a way to search for a person with the last name starting with "Pur" ?
Can I pass a parameter to the XSLT?
Can I search the XML tree to match the string in the parameter?
Using only XSLT, can I create a way to search that directory?
Yes.
Can I create a way to search for a person with the last name starting with "Pur" ?
Yes. In fact, the transformation below allows searching for text starting with any 2, 3, 4 or 5 characters. It can be generalized to allow searching for a starting string of up to any predefined maximum length.
1. Can I pass a parameter to the XSLT?
Yes. The details of how to do this depend on the particular XSLT processor. For example, external parameters can be passed to the .NET XslCompiledTransform.Transform() method.
2. Can I search the XML tree to match the string in the parameter?
Yes. This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:param name="pPattern" select="'Pur'"/>
<xsl:key name="kPersonByLNameStart"
match="person" use="substring(lname,1,2)"/>
<xsl:key name="kPersonByLNameStart"
match="person" use="substring(lname,1,3)"/>
<xsl:key name="kPersonByLNameStart"
match="person" use="substring(lname,1,4)"/>
<xsl:key name="kPersonByLNameStart"
match="person" use="substring(lname,1,5)"/>
<xsl:template match="/">
<results>
<xsl:copy-of select=
"key('kPersonByLNameStart', $pPattern)"/>
</results>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document (the provided XML document -- corrected to be well-formed and extended):
<directory>
<person>
<fname>Ryan</fname>
<lname>Purple</lname>
<phone>887 778 5544</phone>
</person>
<person>
<fname>Tanya</fname>
<lname>Orange</lname>
<phone>887 998 5541</phone>
</person>
<person>
<fname>Martin</fname>
<lname>Purr</lname>
<phone>887 778 5544</phone>
</person>
</directory>
produces the wanted, correct results efficiently, using key lookups:
<results>
<person>
<fname>Ryan</fname>
<lname>Purple</lname>
<phone>887 778 5544</phone>
</person>
<person>
<fname>Martin</fname>
<lname>Purr</lname>
<phone>887 778 5544</phone>
</person>
</results>
Do Note:
This code shows how to search efficiently for text having some prefix of length 2 or 3 or 4 or 5.
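The idea behind the four xsl:key declarations can be sketched outside XSLT as a dictionary that maps every prefix of length 2 to 5 to the matching persons. The function name build_prefix_index and its parameters are hypothetical; the data is the sample document from this answer.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_prefix_index(directory_xml, min_len=2, max_len=5):
    """Map every last-name prefix of min_len..max_len characters to its persons."""
    index = defaultdict(list)
    for person in ET.fromstring(directory_xml).findall("person"):
        lname = person.findtext("lname", "")
        for length in range(min_len, min(max_len, len(lname)) + 1):
            index[lname[:length]].append(person)
    return index

sample = """<directory>
<person><fname>Ryan</fname><lname>Purple</lname><phone>887 778 5544</phone></person>
<person><fname>Tanya</fname><lname>Orange</lname><phone>887 998 5541</phone></person>
<person><fname>Martin</fname><lname>Purr</lname><phone>887 778 5544</phone></person>
</directory>"""
index = build_prefix_index(sample)
print([p.findtext("fname") for p in index.get("Pur", [])])  # ['Ryan', 'Martin']
```

As with the keys, the index is built once and each search is then a single lookup; a one-character pattern such as "P" finds nothing because only prefixes of length 2 to 5 are indexed.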
How about AJAX? That should run without server-side assistance and will read your xml perfectly. W3Schools has a good intro.
Edited: Blah, sorry, that's useless..I'd forgotten that even here, you need to use a server-side script :/
Is there any chance of getting the output from a MySQL query directly as XML?
I'm referring to something like what MSSQL has with the SQL-XML plugin, for example:
SELECT * FROM table WHERE 1 FOR XML AUTO
returns text (or xml data type in MSSQL to be precise) which contains an XML markup structure generated
according to the columns in the table.
With SQL-XML there is also an option of explicitly defining the output XML structure like this:
SELECT
1 AS tag,
NULL AS parent,
emp_id AS [employee!1!emp_id],
cust_id AS [customer!2!cust_id],
region AS [customer!2!region]
FROM table
FOR XML EXPLICIT
which generates an XML code as follows:
<employee emp_id='129'>
<customer cust_id='107' region='Eastern'/>
</employee>
Do you have any clues how to achieve this in MySQL?
Thanks in advance for your answers.
The mysql command can output XML directly, using the --xml option, which is available at least as far back as MySQL 4.1.
However, this doesn't allow you to customize the structure of the XML output. It will output something like this:
<?xml version="1.0"?>
<resultset statement="SELECT * FROM orders" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="emp_id">129</field>
<field name="cust_id">107</field>
<field name="region">Eastern</field>
</row>
</resultset>
And you want:
<?xml version="1.0"?>
<orders>
<employee emp_id="129">
<customer cust_id="107" region="Eastern"/>
</employee>
</orders>
The transformation can be done with XSLT using a script like this:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="resultset">
<orders>
<xsl:apply-templates/>
</orders>
</xsl:template>
<xsl:template match="row">
<employee emp_id="{field[@name='emp_id']}">
<customer
cust_id="{field[@name='cust_id']}"
region="{field[@name='region']}"/>
</employee>
</xsl:template>
</xsl:stylesheet>
This is obviously way more verbose than the concise MSSQL syntax, but on the other hand it is a lot more powerful and can do all sorts of things that wouldn't be possible in MSSQL.
If you use a command-line XSLT processor such as xsltproc or saxon, you can pipe the output of mysql directly into the XSLT program. For example:
mysql -e 'select * from table' -X database | xsltproc script.xsl -
Using XML with MySQL seems to be a good place to start with various different ways to get from MySQL query to XML.
From the article:
use strict;
use DBI;
use XML::Generator::DBI;
use XML::Handler::YAWriter;
my $dbh = DBI->connect ("DBI:mysql:test",
"testuser", "testpass",
{ RaiseError => 1, PrintError => 0});
my $out = XML::Handler::YAWriter->new (AsFile => "-");
my $gen = XML::Generator::DBI->new (
Handler => $out,
dbh => $dbh
);
$gen->execute ("SELECT name, category FROM animal");
$dbh->disconnect ();
Do you have any clue how to achieve this in MySQL?
Yes, do it by hand and build the XML yourself with CONCAT strings. Try
SELECT concat('<orders><employee emp_id="', emp_id, '"><customer cust_id="', cust_id, '" region="', region, '"/></employee></orders>') FROM table
I took this from a 2009 answer How to convert a MySQL DB to XML? and it still seems to work. Not very handy, and if you have large trees per item, they will all be in one concatenated value of the root item, but it works, see this test with dummies:
SELECT concat('<orders><employee emp_id="', 1, '"><customer cust_id="', 2, '" region="', 3, '"/></employee></orders>') FROM DUAL
gives
<orders><employee emp_id="1"><customer cust_id="2" region="3"/></employee></orders>
With "manual coding" you can get to this structure.
<?xml version="1.0"?>
<orders>
<employee emp_id="1">
<customer cust_id="2" region="3" />
</employee>
</orders>
I checked this with a larger tree per root item and it worked, but I had to run additional Python code over the result to remove the surplus opening and closing tags that appear when an XML path has intermediate-level nodes. I did this with backward-looking lists together with entries in a temporary set; an object-oriented approach would be cleaner. My code simply dropped the last x items from the list as soon as a new head item was found, plus some other tricks for nested branches.
I puzzled out a Regex that found each text between tags:
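import re
from xml.sax.saxutils import escape

string = " <some tag><another tag>test string<another tag></some tag>"
pattern = r'(?:^\s*)?(?:(?:<[^\/]*?)>)?(.*?)?(?:(?:<\/[^>]*)>)?'
p = re.compile(pattern)
# Join every captured fragment into one string
val = r''.join(p.findall(string))
val_escaped = escape(val)
if val_escaped != val:
    # str.replace returns a new string, so keep the result
    string = string.replace(val, val_escaped)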
This Regex helps you to access the text between the tags. If you are allowed to use CDATA, it is easiest to use that everywhere. Just make the content "CDATA" (character data) already in MySQL:
<Title><![CDATA[', t.title, ']]></Title>
After that you will have no further issues, except for very unusual characters such as U+001A, which you should already replace in MySQL. You then do not need to escape or replace the remaining special characters at all. This worked for me on an XML file of 1 million lines with heavy use of special characters.
However, you should validate the file against the required XML schema using Python's xmlschema module. It will alert you when you are not allowed to use the CDATA trick.
If you need fully UTF-8 encoded content without CDATA, which is often the task, you can achieve that even in a file of 1 million lines by validating the generated XML step by step against the target XSD. It is fiddly work, but it can be done with some patience.
Replacements are possible with:
MySQL using replace()
Python using string.replace()
Python using Regex replace (though I did not need it in the end, it would look like: re.sub(re.escape(val), 'xyz', i))
string.encode(encoding = 'UTF-8', errors = 'strict')
Mind that encoding as utf-8 is the most powerful step, it could even put aside all three other replacement ways above. Mind also: It makes the text binary, you then need to treat it as binary b'...' and you can thus write it to a file only in binary mode using wb.
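The encode-then-write-binary step can be sketched as follows. The function name write_utf8_xml, the Title element and the sample text are made up for illustration. Note that encoding on its own does not escape characters such as & or <, so this sketch still applies escape() first.

```python
import os
import tempfile
from xml.sax.saxutils import escape

def write_utf8_xml(path, title):
    """Escape the text, wrap it in a Title element, and write it as UTF-8 bytes."""
    xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<Title>%s</Title>\n' % escape(title)
    data = xml_text.encode(encoding="UTF-8", errors="strict")  # str -> bytes
    with open(path, "wb") as handle:  # bytes can only be written in binary mode
        handle.write(data)
    return data

path = os.path.join(tempfile.mkdtemp(), "out.xml")
data = write_utf8_xml(path, "Müller & Söhne <Test>")
print(data)
```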
Finally, you may open the XML output in a regular browser like Firefox for a final check and watch the XML at work, or check it in vscode/codium with an XML extension. These checks are not strictly needed; in my case the xmlschema module showed everything very well. Mind also that vscode/codium can handle XML problems quite leniently and still show a tree when Firefox cannot, so you will need a validator or a browser to see all XML errors.
Quite a large project was done with this way of building XML in MySQL: in the end there was a triply nested XML tree with many repeating tags inside parent nodes, all made from two-dimensional MySQL output.