Nested XML to joined MySQL Tables - mysql

I have some data with nested fields that I would like to import into MySQL. A lot of files, potentially, so any repeatable scripting language is appreciated. It seems like this should be easier than I am making it, but I can't find a good answer.
I believe the cleanest way would be with joined tables, though it would be nice to have one element ALSO present in the parent table, say if it had the kind code "A1" in the sample below.
A similar question was answered in "Parsing nested xml into denormalized table", except that the answer wasn't for MySQL and that data came with a unique identifier. One of the challenges of my data is that it contains no unique identifier from which to create the primary key for joining tables.
Sample data below. Here, the doc-id and assignor tags would have to be separate tables and joined. The data has a DTD that I'm not including for what it's worth. Any input is much appreciated!
<?xml version="1.0" encoding="UTF-8"?>
<assignment>
<assignment-record>
<reel-no>28879</reel-no>
<frame-no>97</frame-no>
<last-update-date><date>20120903</date></last-update-date>
<recorded-date><date>20120830</date></recorded-date>
<page-count>4</page-count>
<correspondent>
<name>LEE, HONG, DEGERMAN, KANG &amp; WAIMEY</name>
<address-1>660 S. FIGUEROA ST., 23RD FL.</address-1>
<address-2>LOS ANGELES, CA 90017</address-2>
</correspondent>
<conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>
<assignors>
<assignor>
<name>WOO, SUNGHO</name>
<execution-date><date>20120806</date></execution-date>
</assignor>
<assignor>
<name>CHOI, JAEYOUNG</name>
<execution-date><date>20120806</date></execution-date>
</assignor>
</assignors>
<docproperties>
<property>
<document-id>
<country>US</country>
<doc-number>13277056</doc-number>
<kind>X0</kind>
<date>20111019</date>
</document-id>
<document-id>
<country>US</country>
<doc-number>20120213136</doc-number>
<kind>A1</kind>
<date>20120823</date>
</document-id>
<title lang="en">SYSTEMS AND METHODS FOR CONTROLLING SENSOR DEVICES IN MOBILE DEVICES</title>
</property>
</docproperties>
</assignment>

Since this three-year-old unanswered question was recently bumped by another user, I went ahead and answered it on behalf of the community, as the OP is no longer active.
For future readers: any time you need to flatten nested XML files for flat-file import, such as into database tables, consider XSLT, the transformation language designed to manipulate XML files. Practically all general-purpose languages have a library for XSLT 1.0 processing, including Python, PHP, Perl, Java, C#, VB, and others.
As OP mentioned Python, the examples below use the third-party lxml module to flatten the XML file. To generate an id that relates the <assignor> and <document-id> data, an XPath expression in each XSLT script uses the ancestor axis to retrieve the reel-no value from <assignment-record>, which sits under the same <assignment> ancestor as both nodes. This is similar to the T-SQL solution in the linked post.
XSLT Scripts (save as .xsl to be referenced in Python)
Assignor Transformation
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/assignment">
<xsl:copy>
<xsl:apply-templates select="descendant::assignor"/>
</xsl:copy>
</xsl:template>
<xsl:template match="assignor">
<xsl:copy>
<assign_id>
<xsl:value-of select="ancestor::assignment/assignment-record/reel-no"/>
</assign_id>
<xsl:copy-of select="name"/>
<xsl:copy-of select="execution-date/date"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Document-id Transformation
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/assignment">
<xsl:copy>
<xsl:apply-templates select="descendant::document-id"/>
</xsl:copy>
</xsl:template>
<xsl:template match="document-id">
<xsl:copy>
<assign_id>
<xsl:value-of select="ancestor::assignment/assignment-record/reel-no"/>
</assign_id>
<xsl:copy-of select="*"/>
</xsl:copy>
</xsl:template>
</xsl:transform>
Python Script (transforms source into two files)
import os
import lxml.etree as ET

# Directory of this script; all files are resolved against it
cd = os.path.dirname(os.path.abspath(__file__))

dom = ET.parse(os.path.join(cd, 'Assignment.xml'))

# Apply each stylesheet to the same source document and save the result
for xsl_name, out_name in [('Assignor_XSLT_Script.xsl', 'Assignor.xml'),
                           ('Document-Id_XSLT_Script.xsl', 'Document-Id.xml')]:
    transform = ET.XSLT(ET.parse(os.path.join(cd, xsl_name)))
    newdom = transform(dom)
    # bytes() serializes the result tree according to <xsl:output>
    with open(os.path.join(cd, out_name), 'wb') as xmlfile:
        xmlfile.write(bytes(newdom))
XML Outputs (can now use MySQL's LOAD XML effectively)
<?xml version="1.0" encoding="UTF-8"?>
<assignment>
<assignor>
<assign_id>28879</assign_id>
<name>WOO, SUNGHO</name>
<date>20120806</date>
</assignor>
<assignor>
<assign_id>28879</assign_id>
<name>CHOI, JAEYOUNG</name>
<date>20120806</date>
</assignor>
</assignment>
<?xml version="1.0" encoding="UTF-8"?>
<assignment>
<document-id>
<assign_id>28879</assign_id>
<country>US</country>
<doc-number>13277056</doc-number>
<kind>X0</kind>
<date>20111019</date>
</document-id>
<document-id>
<assign_id>28879</assign_id>
<country>US</country>
<doc-number>20120213136</doc-number>
<kind>A1</kind>
<date>20120823</date>
</document-id>
</assignment>
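For readers who cannot install lxml, the same flattening can be sketched with Python's standard library alone. This is not the XSLT approach above but a swapped-in alternative built on xml.etree.ElementTree. The helper name flatten_children and the trimmed inline sample are hypothetical, and reel-no is assumed to be usable as the shared key, as in the answer.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample record (assumption: one <assignment> per document).
SAMPLE = """<assignment>
  <assignment-record><reel-no>28879</reel-no><frame-no>97</frame-no></assignment-record>
  <assignors>
    <assignor><name>WOO, SUNGHO</name><execution-date><date>20120806</date></execution-date></assignor>
    <assignor><name>CHOI, JAEYOUNG</name><execution-date><date>20120806</date></execution-date></assignor>
  </assignors>
  <docproperties><property>
    <document-id><country>US</country><doc-number>13277056</doc-number><kind>X0</kind><date>20111019</date></document-id>
    <document-id><country>US</country><doc-number>20120213136</doc-number><kind>A1</kind><date>20120823</date></document-id>
  </property></docproperties>
</assignment>"""


def flatten_children(assignment, tag):
    """Return one dict per <tag> descendant, each carrying the parent's reel-no."""
    assign_id = assignment.findtext("assignment-record/reel-no")
    rows = []
    for node in assignment.iter(tag):
        row = {"assign_id": assign_id}
        # Keep only leaf elements so wrappers such as <execution-date> collapse.
        for leaf in node.iter():
            if len(leaf) == 0 and leaf is not node:
                row[leaf.tag] = (leaf.text or "").strip()
        rows.append(row)
    return rows


root = ET.fromstring(SAMPLE)
assignors = flatten_children(root, "assignor")
documents = flatten_children(root, "document-id")
```

Each dict corresponds to one row of a child table, with assign_id as the joining column. The rows could then be handed to whichever MySQL driver is in use, or written out as flat XML for LOAD XML.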

Related

is it possible to disable-output-escaping twice in XSLT

I have XML that has encoded HTML data. I am trying to render the data but can't seem to figure out how. Best I can tell is I need to disable-output-escaping="yes" twice but not sure how to do that.
For example, this is a snippet of my XML:
<root>
<node value="&amp;lt;b&amp;gt;hi&amp;lt;/b&amp;gt;" />
</root>
My XSLT is outputting HTML. Here is the rendered output (the HTML source) with various options:
<xsl:value-of select="@value" /> outputs &amp;lt;b&amp;gt;hi&amp;lt;/b&amp;gt;
<xsl:value-of select="@value" disable-output-escaping="yes" /> outputs &lt;b&gt;hi&lt;/b&gt;
I would like it to output <b>hi</b> to the HTML source so it's actually rendered as a bolded hi. Does that make sense? Is that possible?
Escaping is the process of turning < into &lt;. If you disable escaping, it will leave < as <. What you want to achieve is to turn &lt; into <, which would normally be called "unescaping".
In the normal course of events, a parser performs unescaping, while a serializer performs escaping. So if you want to unescape characters, you need to put them through a parsing process, which means you need to take the content of the @value attribute and put it through an operation like fn:parse-xml-fragment() in XPath 3.0, or an equivalent extension function in your chosen processor.
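The parse-to-unescape idea can be illustrated outside XSLT with a minimal Python sketch. It assumes the attribute value is double-escaped, which is what the question's title implies, and it uses only standard-library calls. The sample string and variable names are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Assumed double-escaped source document.
source = '<root><node value="&amp;lt;b&amp;gt;hi&amp;lt;/b&amp;gt;" /></root>'

# Pass 1: parsing the document unescapes the attribute value once.
value = ET.fromstring(source).find("node").get("value")

# Pass 2: unescaping the remaining entity references yields markup as text.
markup = html.unescape(value)

# Pass 3: parsing that text turns it into a real element node.
element = ET.fromstring(markup)
```

After the three passes, value still contains entity references, markup is plain text that looks like a tag, and element is an actual b element. ET.fromstring expects a single well-formed root, so a fragment with several top-level nodes would need a wrapper element first.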
Assuming SharePoint, as a Microsoft .NET product, uses XslCompiledTransform, you could try to implement the unescaping and parsing with an extension "script" (C#, VB, or JScript.NET code embedded in XSLT) as follows:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:mf="http://example.com/mf"
exclude-result-prefixes="msxsl mf">
<msxsl:script language="C#" implements-prefix="mf">
<msxsl:using namespace="System.IO"/>
public string Unescape(string input)
{
XmlDocument doc = new XmlDocument();
XmlDocumentFragment frag = doc.CreateDocumentFragment();
frag.InnerXml = input;
return frag.InnerText;
}
public XPathNavigator ParseXml(string xmlInput)
{
using (StringReader sr = new StringReader(xmlInput))
{
return new XPathDocument(sr).CreateNavigator();
}
}
</msxsl:script>
<xsl:output method="html" doctype-public="XSLT-compat" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<xsl:template match="/">
<html>
<head>
<title>Test</title>
</head>
<xsl:apply-templates/>
</html>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="node">
<div>
<xsl:copy-of select="mf:ParseXml(mf:Unescape(@value))" />
</div>
</xsl:template>
</xsl:stylesheet>
If you have access to an XSLT processor (like any version of Saxon 9.7 or Exselt or the latest Altova or XmlPrime) supporting the XPath 3 functions parse-xml and parse-xml-fragment you can write that template without extension functions (in a version="3.0" stylesheet) as
<xsl:template match="node">
<div>
<xsl:copy-of select="parse-xml(string(parse-xml-fragment(@value)))"/>
</div>
</xsl:template>
Output your result with disable-output-escaping, then treat it again in another XSL with disable-output-escaping.

xslt completely remove duplicates from string

I have a variable containing non-numerical values, and I need to completely remove duplicate entries from this string using XSLT:
$string = a,b,c,c,d,d,e,f,g
needs to become: $newstring = a,b,e,f,g
An alternative option would be to compare the two variables and ignore/remove the overlapping entries.
$stringA = a,c
$stringB = a,b,c,d,e,f
needs to become:
$newstring = b,d,e,f
Concatenating the variables is straightforward but I need the opposite of that!
Please help,
XSLT is designed to process XML, not strings. XSLT 1.0 in particular is a poor tool for manipulating text.
IMHO, the best way to proceed here is to convert the problem to XML first. If you're using libxslt (as xsltproc does), this is quite easy to do using an extension function:
XSLT 1.0
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
extension-element-prefixes="str">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:param name="stringA">a,c,g</xsl:param>
<xsl:param name="stringB">a,b,c,d,e,f</xsl:param>
<xsl:variable name="setA" select="str:tokenize($stringA, ',')" />
<xsl:variable name="setB" select="str:tokenize($stringB, ',')" />
<xsl:template match="/">
<test>
<xsl:for-each select="$setA[not(.=$setB)] | $setB[not(.=$setA)]">
<xsl:value-of select="."/>
<xsl:if test="position()!=last()">,</xsl:if>
</xsl:for-each>
</test>
</xsl:template>
</xsl:stylesheet>
Result:
<?xml version="1.0" encoding="UTF-8"?>
<test>g,b,d,e,f</test>
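The stylesheet above covers the second option (entries present in only one of the two lists). As a language-neutral reference for both behaviours the question asks for, here is a small Python sketch. The function names are hypothetical, and it is meant only to pin down the expected results, not to replace the XSLT.

```python
from collections import Counter


def drop_repeated(csv_text):
    """Keep only the entries that occur exactly once, in their original order."""
    items = csv_text.split(",")
    counts = Counter(items)
    return ",".join(item for item in items if counts[item] == 1)


def symmetric_difference(csv_a, csv_b):
    """Entries found in exactly one of the two lists: A-only first, then B-only."""
    a, b = csv_a.split(","), csv_b.split(",")
    set_a, set_b = set(a), set(b)
    only_a = [item for item in a if item not in set_b]
    only_b = [item for item in b if item not in set_a]
    return ",".join(only_a + only_b)


print(drop_repeated("a,b,c,c,d,d,e,f,g"))
print(symmetric_difference("a,c", "a,b,c,d,e,f"))
```

The second function mirrors the XPath expression $setA[not(.=$setB)] | $setB[not(.=$setA)], with the output order fixed explicitly rather than left to the processor.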

Why is there xmlns in my html output

In the html output file from an XSLT process (using saxon9he), there have been 155 occurrences of xmlns:fn="http://www.w3.org/2005/xpath-functions" inserted into a variety of tr elements
The part of xsl that uses xpath-functions is
<xsl:if test="(string(@hideIfHardwareIs)='') or (not(fn:matches(string($input_doc//inf[@id='5']), string(@hideIfHardwareIs), 'i')))">
Unless I am reading it wrong, matches takes 3 arguments: a string, another string, and then a flag, which in this case makes the match case-insensitive.
What I don't understand is that the tr elements showing up with the xmlns aren't close to the portion of the XSL where the matches() function is called.
The XSL file I am working with is 2100 lines and the XML file it parses is 12800 lines. So I don't think I can share it easily. I've inherited this and need to (at this time) maintain it.
What are some things I can look for within the XSL that would insert the xmlns into the HTML output?
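Those functions do not need to be prefixed. A namespace declared on xsl:stylesheet is in scope for every literal result element in the stylesheet, so it is copied to the output wherever such an element is written, regardless of where the prefix is actually used.
Remove the xmlns:fn="http://www.w3.org/2005/xpath-functions" from your xsl:stylesheet and remove the fn: prefix from the XPath functions.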
Examples:
XML Input
<foo>test</foo>
XSLT 2.0 #1
<xsl:stylesheet version="2.0" xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/*">
<xsl:if test="fn:matches(.,'^t')">
<bar><xsl:value-of select="."/></bar>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Output
<bar xmlns:fn="http://www.w3.org/2005/xpath-functions">test</bar>
XSLT 2.0 #2
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/*">
<xsl:if test="matches(.,'^t')">
<bar><xsl:value-of select="."/></bar>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
Output
<bar>test</bar>

Issue with xslt transforming xml to output html

I am trying to do a simple transform of XML using XSLT to generate HTML, but I'm having difficulties and I can't seem to figure out what the problem is. Here is a sample of the XML I am working with:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="C:\Users\cgubata\Documents\Digital Measures\jcamp_fac_ex_xslt.xsl"?>
<Data xmlns="http://www.digitalmeasures.com/schema/data" xmlns:dmd="http://www.digitalmeasures.com/schema/data-metadata" dmd:date="2012-02-27">
<Record userId="310106" username="jcamp" termId="453" dmd:surveyId="1154523">
<dmd:IndexEntry indexKey="COLLEGE" entryKey="School of Business" text="School of Business"/>
<dmd:IndexEntry indexKey="DEPARTMENT" entryKey="Accountancy" text="Accountancy"/>
<dmd:IndexEntry indexKey="DEPARTMENT" entryKey="MBA" text="MBA"/>
<PCI id="11454808064" dmd:lastModified="2012-02-08T13:17:39">
<PREFIX>Dr.</PREFIX>
<FNAME>Julia</FNAME>
<PFNAME/>
<MNAME>M.</MNAME>
<LNAME>Camp</LNAME>
<SUFFIX/>
<ALT_NAME>Julia M. Brennan</ALT_NAME>
<ENDPOS/>
All I want to do is have the values of some of the nodes displayed in HTML. So for example, I might want the PREFIX, FNAME, LNAME nodes to display as "Dr. Julia Camp" (no quotes - I'll do styling later). Here's the XSL that I am using:
<?xml version="1.0" encoding="utf-8"?><!-- DWXMLSource="jcamp_fac_ex.xml" -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:dmd="http://www.digitalmeasures.com/schema/data-metadata">
<xsl:output method="html" encoding="utf-8"/>
<xsl:template match="/">
<xsl:value-of select="/Data/Record/PCI/PREFIX"/>
</xsl:template>
</xsl:stylesheet>
From what I have been researching, that should show the value of that PREFIX field. But instead, it is outputting all of the values from all of the nodes (so if there are 4000 nodes with a text value, I am getting 4000 values returned in HTML). My goal is to pull out the values from certain nodes, and I will probably arrange them in a table.
How do I pull out the values from a specific node? Thanks in advance.
Well, I can't reproduce your symptoms. When I test what you've posted it doesn't produce any output at all. Which looks correct because your xpath is testing the wrong namespace. You need to add in your xslt a namespace-prefix mapping for the http://www.digitalmeasures.com/schema/data namespace, and then use it in the value-of xpath. Like this:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dmd="http://www.digitalmeasures.com/schema/data-metadata"
xmlns:dm="http://www.digitalmeasures.com/schema/data">
<xsl:output method="html" encoding="utf-8"/>
<xsl:template match="/">
<xsl:value-of select="/dm:Data/dm:Record/dm:PCI/dm:PREFIX"/>
</xsl:template>
</xsl:stylesheet>
I'm afraid you've fallen into the number one XSLT trap for beginners: we see this question at least once a day on this forum. Your elements are in a namespace and your stylesheet is trying to match nodes in no namespace.
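The same namespace trap can be reproduced outside XSLT. The sketch below uses Python's standard-library ElementTree on a trimmed copy of the question's data. The prefix dm is arbitrary, as it is in the stylesheet above, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: every element inherits the default namespace from <Data>.
SAMPLE = """<Data xmlns="http://www.digitalmeasures.com/schema/data">
  <Record userId="310106" username="jcamp">
    <PCI id="11454808064">
      <PREFIX>Dr.</PREFIX><FNAME>Julia</FNAME><LNAME>Camp</LNAME>
    </PCI>
  </Record>
</Data>"""

root = ET.fromstring(SAMPLE)  # root is the <Data> element

# Unqualified path: looks for elements in no namespace, so nothing matches.
unqualified = root.find("Record/PCI/PREFIX")

# Qualified path: only the namespace URI matters, not the prefix chosen here.
ns = {"dm": "http://www.digitalmeasures.com/schema/data"}
pci = root.find("dm:Record/dm:PCI", ns)
full_name = " ".join(
    pci.findtext(f"dm:{tag}", namespaces=ns) for tag in ("PREFIX", "FNAME", "LNAME")
)
```

Here unqualified is None, while the qualified lookup finds the elements and full_name joins the three name parts with spaces.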

Match multiple attribute values in html for ant buildscript

I want to match multiple values of an attribute for replacing. For example,
<div class="div h1 full-width"></div>
should produce div, h1 and full-width as separate matches.
I want to do this to prefix the classes. So instead of div h1 full-width it should be pre-div pre-h1 pre-full-width
The regex I have so far is
(?<=class=["'])(\b-?[_a-zA-Z]+[_a-zA-Z0-9-]*\b)+
This matches only the first class. That is of course because it is the only thing this pattern can match :( I tried to make the lookbehind take more than just class=" but I just end up with it taking everything and leaving nothing to replace.
I want to make a pattern that matches each value individually between the quotes of the class attribute.
I want to do this for an Ant build script that processes all files and replaces the class="value1 value2 value3" with a set prefix. I've done this with little trouble for replacing the classes in CSS files, but HTML seems to be a lot trickier.
It is an Ant build script. The Java regexp package is used to process the pattern. The Ant task used is: replaceregexp
The Ant implementation of the above pattern is:
<target name="prefix-class" depends="">
<replaceregexp flags="g">
<regexp pattern="(?&lt;=class=['&quot;])(\b-?[_a-zA-Z]+[_a-zA-Z0-9-]*\b)+"/>
<substitution expression=".${prefix}\1"/>
<fileset dir="${dest}"/>
</replaceregexp>
</target>
I don't think that you can find n (or in your case 3) different class entries and substitute them in one simple regexp. If you need to do this in Ant, I think you have to write your own Ant task. A better way would be XSLT; are you familiar with XSLT?
Gave up on Ant's ReplaceRegExp and solved my problem with XSLT, transforming XHTML to XHTML.
The following code adds a prefix to every value of an element's class attribute. The XHTML source document must be well-formed to be parsed.
<xsl:stylesheet version="2.0"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xhtml xsl xs">
<xsl:output method="xml" version="1.0" encoding="UTF-8"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1.dtd"
indent="yes" omit-xml-declaration="yes"/>
<xsl:param name="prefix" select="'oo-'"/>
<xsl:template match="/">
<xsl:apply-templates select="./@*|./node()" />
</xsl:template>
<!--remove these atts from output, default xhtml values from dtd -->
<xsl:template match="xhtml:a/@shape"/>
<xsl:template match="@rowspan"/>
<xsl:template match="@colspan"/>
<xsl:template match="@class">
<xsl:variable name="replace_regex">
<xsl:value-of select="$prefix"/>
<xsl:text>$1</xsl:text>
</xsl:variable>
<xsl:attribute name="class">
<xsl:value-of select="fn:replace( . , '(\S+)' , $replace_regex )"/>
</xsl:attribute>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
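For comparison, the original regex idea can be made to work when the replacement is computed by a callback rather than a fixed substitution string, which Ant's replaceregexp does not offer. The following is a minimal Python sketch of that technique, not an Ant solution. The names CLASS_ATTR and prefix_classes are hypothetical, and like any regex over markup it assumes reasonably regular, quoted attributes.

```python
import re

# Matches a whole class attribute: group 1 is "class=", group 2 the quote,
# group 3 the attribute value. The lookbehind skips names such as data-class.
CLASS_ATTR = re.compile(
    r"""(?<![\w-])(class\s*=\s*)(["'])(.*?)\2""",
    re.IGNORECASE | re.DOTALL,
)


def prefix_classes(markup, prefix):
    """Prepend prefix to every whitespace-separated token of each class attribute."""
    def rewrite(match):
        tokens = match.group(3).split()
        prefixed = " ".join(prefix + token for token in tokens)
        return f"{match.group(1)}{match.group(2)}{prefixed}{match.group(2)}"
    return CLASS_ATTR.sub(rewrite, markup)


print(prefix_classes('<div class="div h1 full-width"></div>', "pre-"))
```

Splitting the attribute value on whitespace keeps hyphenated names such as full-width intact, which a word-character pattern would break apart.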