Extract data from HTML/XML

I'm using Webharvest to retrieve data from websites. It converts HTML pages to XML documents and then extracts the data I want based on the XPath I provide.
Now I'm working on a page like this: pastebin, where I marked the blocks I'd like to get. Each block should be returned as a single unit.
The XPath of the first element of a block is: //div[@id="layer22"]/b/span[@style="background-color: #FFFF99"]
I tested it and it gives all the "bloc start" elements.
The XPath of the last element of a block is: //div[@id="layer22"]/a[contains(.,"Join")]
I tested it and it gives all the "bloc end" elements.
The xPath should return a set of blocks as:
(xPath)[1] = block 1
(xPath)[2] = block 2
....
Thank you in advance

Use (for the first wanted result):
($first)[1] | ($last)[1]
|
($first)[1]/following::node()
[count(.|($last)[1]/preceding::node()) = count(($last)[1]/preceding::node())]
where you need to substitute $first with:
//div[@id="layer22"]/b/span[@style="background-color: #FFFF99"]
and substitute $last with:
//div[@id="layer22"]/a[contains(.,"Join")]
To get the k-th result, substitute in the final expression ($first)[1] with ($first)[{k}] and ($last)[1] with ($last)[{k}], where {k} should be replaced by the number k.
This technique follows directly from the well-known Kayessian formula for set intersection in XPath 1.0:
$ns1[count(.|$ns2) = count($ns2)]
which selects the intersection of the two node-sets $ns1 and $ns2: a node of $ns1 belongs to $ns2 exactly when adding it to $ns2 does not change the size of $ns2.
Here is an XSLT verification with a simple example:
<nums>
<num>01</num>
<num>02</num>
<num>03</num>
<num>04</num>
<num>05</num>
<num>06</num>
<num>07</num>
<num>03</num>
<num>07</num>
<num>10</num>
</nums>
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="v1" select=
"(//num[. = 3])[1]/following-sibling::*"/>
<xsl:variable name="v2" select=
"(//num[. = 7])[1]/preceding-sibling::*"/>
<xsl:template match="/">
<xsl:copy-of select=
"$v1[count(.|$v2) = count($v2)]"/>
</xsl:template>
</xsl:stylesheet>
applies the XPath expression and the selected nodes are copied to the output:
<num>04</num>
<num>05</num>
<num>06</num>
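For anyone who wants to poke at the count-based intersection outside XSLT, here is a rough Python sketch of the same idea on the sample above. The helper name between is made up for illustration, and the list slices only approximate following-sibling::* and preceding-sibling::*; this is not how an XPath engine works internally.

```python
import xml.etree.ElementTree as ET

XML = """<nums>
  <num>01</num>
  <num>02</num>
  <num>03</num>
  <num>04</num>
  <num>05</num>
  <num>06</num>
  <num>07</num>
  <num>03</num>
  <num>07</num>
  <num>10</num>
</nums>"""


def between(root, start_value, end_value):
    """Return the children strictly between the first start and first end match."""
    nums = list(root)
    # (//num[. = start])[1] and (//num[. = end])[1]
    first = next(i for i, n in enumerate(nums) if int(n.text) == start_value)
    last = next(i for i, n in enumerate(nums) if int(n.text) == end_value)
    following = nums[first + 1:]   # stands in for following-sibling::*
    preceding = nums[:last]        # stands in for preceding-sibling::*
    ns2 = {id(n) for n in preceding}
    # Kayessian test: a node is in ns2 iff adding it does not enlarge ns2.
    return [n for n in following if len(ns2 | {id(n)}) == len(ns2)]


root = ET.fromstring(XML)
result = [n.text for n in between(root, 3, 7)]
print(result)
```

The membership test is deliberately written with the same "count of the union equals count of the set" shape as the XPath expression, even though a plain `in` check would be the usual Python way.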

How to read values from JSON embedded in XML with XSLT?

This is my XML, which I want to convert into DOT:
<Import>
<Row>
<id>1</id>
<parentmenu>siasn-instansi</parentmenu>
<label>Layanan Profile ASN</label>
<role_id>1</role_id>
<role>role:siasn-instansi:profilasn:viewprofil</role>
<items>[{"url": "/tampilanData/pns", "label": "Profile Pegawai", "subMenu": "pns"}, {"url": "/tampilanData/pppk", "label": "Profile Pegawai PPPK", "subMenu": "pppk"}, {"url": "/tampilanData/JPTNonASN", "label": "Profile Pegawai PPT Non-ASN", "subMenu": "ppt"}]</items>
</Row>
</Import>
Below is a picture of my XSL code with the DOT file rules.
XSL code:
The problem is that I want to get the url values from <items>, like below:
/tampilanData/pns
/tampilanData/pppk
/tampilanData/JPTNonASN
and I want to use those values in my XSL, as outlined in red in the image.
What would the XSL look like to take these values from my XML? The <items> value is quite difficult to handle, unlike the <role> values, which I have managed to extract.
In XSLT 3.0 you can do, for example:
<xsl:template match="items">
<xsl:variable name="content" select="parse-json(.)?*" as="map(*)*"/>
<xsl:for-each select="$content">
url="{?url}"
label="{?label}"
menu="{?subMenu}"
</xsl:for-each>
</xsl:template>
The parse-json() function returns an array of maps; the "?*" lookup operator turns this into a sequence of maps; the construct ?url accesses a specific entry in a map. Note that the {...} text value templates only take effect when expand-text="yes" is in scope, for example on the xsl:stylesheet element.
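If it helps to see the same steps outside XSLT, here is a small Python sketch: parse the text of <items> as JSON, which gives a list of dictionaries, then read url, label and subMenu from each entry. The function name read_items and the shortened two-entry input are assumptions made for illustration only.

```python
import json
import xml.etree.ElementTree as ET

# Shortened copy of the question's input; only two JSON entries are kept.
XML = """<Import>
  <Row>
    <id>1</id>
    <items>[{"url": "/tampilanData/pns", "label": "Profile Pegawai", "subMenu": "pns"}, {"url": "/tampilanData/pppk", "label": "Profile Pegawai PPPK", "subMenu": "pppk"}]</items>
  </Row>
</Import>"""


def read_items(xml_text):
    """Collect the JSON objects embedded in every <items> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for items in root.iter("items"):
        # The element text is a JSON array of objects.
        entries.extend(json.loads(items.text))
    return entries


entries = read_items(XML)
urls = [entry["url"] for entry in entries]
for entry in entries:
    print('url="{url}" label="{label}" menu="{subMenu}"'.format(**entry))
```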

how to separate a value from its node and group the node with others of its type

The end result I need is for all the text nodes to have the same indent. The @name field is not a constant size. parentnode has a varying number of children that must be processed in the order received. The possibleothernodes are not explicitly ordered in all cases.
XML:
<parentnode>
<possibleothernodes1...n/>
<node name="SomeBoldText">
<text>Text1</text>
</node>
<node>
<text>Text2</text>
</node>
<node>
<text>Text3</text>
</node>
<node>
<text>Text4</text>
</node>
<possibleothernodes2...n/>
</parentnode>
I need the resulting HTML to look like
possibleothernodes1
SomeBoldText: Text1
Text2
Text3
Text4
possibleothernodes2
My real goal right now is: how do I group Text1, Text2, Text3 and Text4 into one div tag, and the @name into a different div tag? With two divs I can just float them to where they need to be.
How about something like this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" indent="yes"/>
<xsl:template match="parentnode">
<div>
<h1>
<xsl:value-of select="node/@name"/>
</h1>
<div>
<xsl:apply-templates select="node/text" />
</div>
</div>
</xsl:template>
<xsl:template match="node/text">
<div>
<xsl:value-of select="."/>
</div>
</xsl:template>
</xsl:stylesheet>
When run on your sample input, the result is:
<div>
<h1>SomeBoldText</h1>
<div>
<div>Text1</div>
<div>Text2</div>
<div>Text3</div>
<div>Text4</div>
</div>
</div>
OK, this is totally untested, and written before the coffee hits, but hopefully it will be a good start using some jQuery:
$(function(){
$.ajax({
url:"../folder/nameoffile.xml",
dataType:"xml",
success:function(xml){
$(xml).find("node").each(function(){
// only some nodes carry a name attribute, so skip the ones without it
var sideName = $(this).attr("name");
if (sideName) {
$("#idofSideDiv").append(sideName);
}
// each node keeps its text in a <text> child
var myNode = $(this).find("text").text();
// ok, this is assuming that you have an ul list
// to contain the text defined in the node
$("#idofULList").append("<li>"+myNode+"</li>");
});
}
});
});
Like I said - totally untested, and probably some syntax errors - but hopefully it can be a good start for a Google search.

How to get the position of a node based on the value of a child element

In my XML I want specific menuitems/menuitem nodes that can be at arbitrary positions under their parent (I don't want a hardcoded position selector).
Is it possible to get the position of a menuitem node that has the right value in the name element under it, meaning menuitems/menuitem/name? In short: selecting the menuitem that has the right name value under it.
<one>
<menuitems>
<menuitem> <!-- I dont want this one -->
<name>
...
</name>
</menuitem>
<menuitem> <!-- I want this one at position 2 under <one> -->
<name>
... <!-- Based one correct name value here -->
</name>
</menuitem>
</menuitems>
</one>
<two>
<menuitems>
<menuitem> <!-- I want this one at position 1 under <two> -->
<name>
...
</name>
</menuitem>
</menuitems>
</two>
I can easily find out if one menuitem under menuitems has the correct name value. Like so:
<xsl:value-of select="current()/menuitems/menuitem/name = 'OhYes'"></xsl:value-of>
This returns true. But at which position is this menuitem among the other menuitem elements under the same parent, at the same level?
I want to avoid this:
<xsl:if test="current()/menuitems/menuitem[1]/name = 'OhYes'"> .. </xsl:if>
<xsl:if test="current()/menuitems/menuitem[2]/name = 'OhYes'"> .. </xsl:if>
Use this:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/*/*/menuitems/menuitem[name='OhYes']">
position: <xsl:text/>
<xsl:value-of select="count(preceding-sibling::menuitem) +1"/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the XML document from the question, wrapped in a single top element and with the wanted <name> values set to OhYes,
the wanted, correct result is produced:
position: 2
position: 1
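As a cross-check of the count(preceding-sibling::menuitem) + 1 idea, here is a rough Python sketch that reports the 1-based position of each matching menuitem among its menuitem siblings. The top element <menus>, the value Nope and the helper name positions are invented for illustration; the sketch also strips whitespace around the name text, which the XPath comparison name='OhYes' does not do.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <menus> is an assumed wrapper element.
XML = """<menus>
  <one>
    <menuitems>
      <menuitem><name>Nope</name></menuitem>
      <menuitem><name>OhYes</name></menuitem>
    </menuitems>
  </one>
  <two>
    <menuitems>
      <menuitem><name>OhYes</name></menuitem>
    </menuitems>
  </two>
</menus>"""


def positions(root, wanted):
    """Return (container tag, 1-based position) for each matching menuitem."""
    found = []
    for container in root:
        for menuitems in container.findall("menuitems"):
            # enumerate(..., start=1) plays the role of
            # count(preceding-sibling::menuitem) + 1
            for index, item in enumerate(menuitems.findall("menuitem"), start=1):
                if (item.findtext("name") or "").strip() == wanted:
                    found.append((container.tag, index))
    return found


result = positions(ET.fromstring(XML), "OhYes")
print(result)
```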
You can first separate out the one and two containers using getElementsByTagName. Then, for each section (one, two), iterate through all the names by building a collection with getElementsByTagName, and check each element to see whether the name is correct. Since each menuitem position corresponds to the name position under each heading (one, two), you can use the .parentNode accessor on each matching name node to return the menuitem node, and the index of the iterator to return its position.
For example, for the XML Code
<one>
<menuitems>
<menuitem> <!-- I dont want this one -->
<name>
the incorrect name
</name>
</menuitem>
</menuitems>
<menuitems>
<menuitem> <!-- I want this one at position 2 -->
<name>
the correct name <!-- Based one correct name value here -->
</name>
</menuitem>
</menuitems>
</one>
<two>
<menuitems>
<menuitem> <!-- I want this one at position 1 -->
<name>
the correct name
</name>
</menuitem>
</menuitems>
</two>
The following code alerts the index of each name node (and its parent menuitem) under the current container (one, two) that contains the text "the correct name":
// Find the correct position for the menuitem under one
var one = document.getElementsByTagName("one")[0];
var names = one.getElementsByTagName("name");
for (var i=0; i<names.length; i++){
if (names[i].innerHTML.search("the correct name") >= 0) {
// names[i].parentNode is the reference to the menuitem in one that contains the correct name
alert("For one: Correct name found at name node index " + (i+1) + " and its parent menuitem is " + names[i].parentNode);
}
}
// Find the correct position for the menuitem under two
var two = document.getElementsByTagName("two")[0];
names = two.getElementsByTagName("name");
for (i=0; i<names.length; i++){
if (names[i].innerHTML.search("the correct name") >= 0) {
// names[i].parentNode is the reference to the menuitem in two that contains the correct name
alert("For two: Correct name found at name node index " + (i+1) + " and its parent menuitem is " + names[i].parentNode);
}
}
I have also created a fiddle at http://jsfiddle.net/nMN3j/1/ so you can try it out, and see how it works.
This code will alert that under container "one" a menuitem tag was found at position 2 that has a correct name sub-tag, but not at position 1, which has an incorrect name sub-tag.
For container "two", the code will alert that a menuitem was found at position 1 that has a correct name.
Hope this helps!

xsl generate-id() function returns same id twice for different nodes

I have an input XML for a transformation like this:
<?xml version="1.0" encoding="UTF-8" ?>
<AssetcustomerCollection xmlns="http://xmlns.oracle.com/pcbpel/adapter/db/top/somens">
<Assetcustomer xmlns="">
....
</Assetcustomer>
<Assetcustomer xmlns="">
<accountklantid>000000123456789</accountklantid>
<accountrowid>1-W8HQ1J</accountrowid>
<adrestypeaccnt/>
<adrestypecon/>
<assetbankcode>1173</assetbankcode>
<assetnumber>0000001234</assetnumber>
<assetprodcode>1200</assetprodcode>
<assetproduct>Overeenkomst Rekening-courant</assetproduct>
<assetproductlocatie>00</assetproductlocatie>
<assetstatus>Actief</assetstatus>
<assetsubstatus>Lopende rekening</assetsubstatus>
<assettypecode>0010</assettypecode>
<contactklantid/>
<contactrowid/>
<primairaccount>Y</primairaccount>
<primaircontact>N</primaircontact>
<reltypeaccnt>Hoofdcontractant</reltypeaccnt>
<reltypecon/>
<rowidasset>1-X3XBMO</rowidasset>
<rowidassetaccnt>1-X3XBMQ</rowidassetaccnt>
<rowidassetcon/>
<tnsidaccnt/>
<tnsidcon/>
</Assetcustomer>
<Assetcustomer xmlns="">
....
</Assetcustomer>
<Assetcustomer xmlns="">
<accountklantid/>
<accountrowid/>
<adrestypeaccnt/>
<adrestypecon/>
<assetbankcode>1173</assetbankcode>
<assetnumber>0000004321</assetnumber>
<assetprodcode>1201</assetprodcode>
<assetproduct>WereldPas (Zakelijk)</assetproduct>
<assetproductlocatie>00</assetproductlocatie>
<assetstatus>Actief</assetstatus>
<assetsubstatus>Lopende rekening</assetsubstatus>
<assettypecode>0003</assettypecode>
<contactklantid>000000987654321</contactklantid>
<contactrowid>1-X17PLM</contactrowid>
<primairaccount>N</primairaccount>
<primaircontact>Y</primaircontact>
<reltypeaccnt/>
<reltypecon>Pasverantwoordelijke</reltypecon>
<rowidasset>1-X3XBN0</rowidasset>
<rowidassetaccnt/>
<rowidassetcon>1-X3XBNE</rowidassetcon>
<tnsidaccnt/>
<tnsidcon/>
</Assetcustomer>
<Assetcustomer xmlns="">
....
</Assetcustomer>
</AssetcustomerCollection>
When transforming this input XML I got an unexpected output (15 of the 16 input Assetcustomer nodes were transformed). I have now found the cause, but cannot explain why it occurs.
The following transformation returns the same id twice:
<xsl:element name="A">
<xsl:value-of select="generate-id(key('AssetRowIDs',/ns0:AssetcustomerCollection/Assetcustomer[rowidasset = '1-X3XBMO']/*)[1])"/>
</xsl:element>
<xsl:element name="B">
<xsl:value-of select="generate-id(key('AssetRowIDs',/ns0:AssetcustomerCollection/Assetcustomer[rowidasset = '1-X3XBN0']/*)[1])"/>
</xsl:element>
<A>N10211</A>
<B>N10211</B>
The generated id for any other node with a different rowidasset is different.
Any ideas before I start pulling my hair out?
Peter
I do not know exactly why, but changing
<xsl:key name="AssetRowIDs" match="Assetcustomer" use="rowidasset"/>
into
<xsl:key name="AssetRowIDs" match="Assetcustomer" use="concat('-',rowidasset,'-')"/>
and
<xsl:for-each select="/ns0:AssetcustomerCollection/Assetcustomer[generate-id() = generate-id(key('AssetRowIDs',rowidasset)[1])]">
into
<xsl:for-each select="/ns0:AssetcustomerCollection/Assetcustomer[generate-id() = generate-id(key('AssetRowIDs',concat('-',rowidasset,'-'))[1])]">
seems to generate a unique id for each node. It still bugs me that I do not understand the cause of it.
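For readers unfamiliar with the idiom, the for-each above is the pattern usually called Muenchian grouping: a node is kept only if it is the first node the key returns for its own key value. Here is a rough Python sketch of that selection rule. The input is a hypothetical, shortened version of the question's data with namespaces dropped and one duplicate row added, and first_per_key is an invented helper name; it does not reproduce or explain the duplicate-id behaviour described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened input: no namespaces, and the third row is an
# invented duplicate of the first so that grouping has something to do.
XML = """<AssetcustomerCollection>
  <Assetcustomer><rowidasset>1-X3XBMO</rowidasset><assetnumber>0000001234</assetnumber></Assetcustomer>
  <Assetcustomer><rowidasset>1-X3XBN0</rowidasset><assetnumber>0000004321</assetnumber></Assetcustomer>
  <Assetcustomer><rowidasset>1-X3XBMO</rowidasset><assetnumber>0000001234</assetnumber></Assetcustomer>
</AssetcustomerCollection>"""


def first_per_key(root, key_child):
    """Keep each record that is the first one sharing its key value."""
    records = root.findall("Assetcustomer")
    # Plays the role of xsl:key: key value -> records in document order.
    index = {}
    for record in records:
        index.setdefault(record.findtext(key_child, default=""), []).append(record)
    # Mirrors generate-id() = generate-id(key(...)[1]) with an identity check.
    return [r for r in records if index[r.findtext(key_child, default="")][0] is r]


root = ET.fromstring(XML)
firsts = first_per_key(root, "rowidasset")
print([r.findtext("assetnumber") for r in firsts])
```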
Check the namespace? If the ns0 prefix is bound to a wrong namespace URI, your query will in both cases yield an empty result set. Together with the same first argument for key, that, I imagine, will yield the same call to key() and thus the same ID.
Also I don't think the key() function does what you think it does: http://www.w3schools.com/xsl/func_key.asp
In any case you can apply generate-id() directly on the node set for which you wish to calculate the ID.

xsl any symbol code for value comparison using <xsl:when test

I was wondering what the code for "any symbol" would be.
<xsl:when test="path/path1 = '(ANYSYMBOL)1' ">
This code should let us check whether some value equals X1, #1, %1, 91, ...
So what is the any-symbol / any-char code (#xxx)?
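There are no such wildcards in a string comparison.
You have two options:
<xsl:when test="substring(path/path1, 2) = '1'">
and
<xsl:when test="matches(path/path1, '^.1$')">
The latter, which uses a regular expression, requires XSLT 2.0. The pattern is anchored with ^ and $ because matches() otherwise succeeds as soon as any substring matches, so an unanchored '.1' would also accept values such as X12.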
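To compare the two tests side by side, here is a small Python sketch. The function names and sample values are made up for illustration, and Python's re module is only a stand-in for the XPath 2.0 regular expression dialect, so treat it as an approximation of the behaviour rather than a statement about any XSLT processor.

```python
import re


def tail_is_one(value):
    # Mirrors substring(value, 2) = '1': everything after the
    # first character must be exactly "1".
    return value[1:] == "1"


def any_char_then_one(value):
    # Whole-string match of "one arbitrary character, then 1".
    return re.fullmatch(r".1", value) is not None


samples = ["X1", "#1", "%1", "91", "1", "X12", "XY1"]
results = {s: (tail_is_one(s), any_char_then_one(s)) for s in samples}
for sample, outcome in results.items():
    print(sample, outcome)

# An unanchored search accepts longer strings too, which is usually
# not what an equality-style test is meant to do.
unanchored_hit = re.search(r".1", "X12") is not None
print(unanchored_hit)
```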