Getting HTML elements via XPath in bash - html

I was trying to parse a page (Kaggle Competitions) with xpath on MacOS as described in another SO question:
curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'
That's just getting a href of a link in a table.
But instead of returning the value, xpath starts validating .html and returns errors like undefined entity at line 89, column 13, byte 2964.
Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions rely on the xpath shipped with GNU distributions, not the one on macOS.
Is there a correct way of getting HTML elements via XPath in bash?

Getting HTML elements via XPath in bash
from an HTML file (that is not valid XML)
One possibility is to use xsltproc (I hope it is available on macOS). xsltproc has an --html option to accept HTML as input, but you then need an XSLT stylesheet:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="/*">
<xsl:value-of select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
</xsl:template>
</xsl:stylesheet>
Notice that the XPath has changed: there is no tbody in the input file.
Call xsltproc:
xsltproc --html test.xsl competitions.html 2> /dev/null
Here the errors that xsltproc reports about the HTML are discarded (sent to /dev/null).
The output is: /c/R
To use a different XPath expression from the command line, you can use an XSLT template and replace the __xpath__ placeholder.
E.g. XSLT template:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="/*">
<xsl:value-of select="__xpath__" />
</xsl:template>
</xsl:stylesheet>
And use (e.g.) sed for the replacement:
sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
xsltproc --html test.xsl competitions.html 2> /dev/null
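The placeholder substitution can also be sketched in Python, which makes it easy to XML-escape the expression before it lands inside the select attribute (plain sed does not do that). This is only an illustrative sketch: the function name fill_template and the inline TEMPLATE string are assumptions, not part of the answer above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical inline copy of the stylesheet template with a placeholder.
TEMPLATE = """<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>
"""


def fill_template(template: str, xpath: str) -> str:
    """Insert an XPath expression into the stylesheet template."""
    # Escape &, <, > and the double quote so the attribute stays well-formed.
    safe = escape(xpath, {'"': "&quot;"})
    return template.replace("__xpath__", safe)


XPATH = "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href"
stylesheet = fill_template(TEMPLATE, XPATH)

# Sanity check: the generated stylesheet parses and carries the expression.
XSL_NS = "{http://www.w3.org/1999/XSL/Transform}"
select = ET.fromstring(stylesheet).find(f".//{XSL_NS}value-of").get("select")
print(select == XPATH)
```

The resulting string could then be written to test.xsl and passed to xsltproc exactly as in the commands above.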

Related

XSLT - Remove nodes when current node is equal to previous node

I need to write an XSLT for XML in the format below.
<books>
<book>
<a>name</a>
<a>name</a>
<b>name</b>
<b>name</b>
</book>
</books>
I need to eliminate duplicate child nodes under one condition:
a node should be removed only if it equals the previous node.
I.e., if the previous node (element) is <a> and the current node (element) is also <a>, then one of them should be removed.
The output for the above should be:
`<a>name</a>`
`<b>name</b>`
please help me to do this.
In XSLT 2 or 3 you can easily group adjacent sibling elements by their node name with for-each-group select="*" group-adjacent="node-name()" and simply output the first item in each group (which is equal to the context item .):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="3.0">
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="book">
<xsl:copy>
<xsl:for-each-group select="*" group-adjacent="node-name()">
<xsl:copy-of select="."/>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/6qVRKw4/1
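For readers who want to see the grouping logic outside XSLT, here is a rough Python analogue of group-adjacent: itertools.groupby also groups only consecutive items with the same key. The helper name drop_adjacent_duplicates is made up for this sketch, and it compares element names only, like the stylesheet above.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

SOURCE = """<books>
  <book>
    <a>name</a>
    <a>name</a>
    <b>name</b>
    <b>name</b>
  </book>
</books>"""


def drop_adjacent_duplicates(parent):
    """Keep only the first element of each run of same-named siblings."""
    children = list(parent)
    for child in children:
        parent.remove(child)
    # groupby starts a new group whenever the key (the tag) changes,
    # which mirrors group-adjacent="node-name()".
    for _tag, run in groupby(children, key=lambda el: el.tag):
        parent.append(next(run))


root = ET.fromstring(SOURCE)
for book in root.iter("book"):
    drop_adjacent_duplicates(book)

print([child.tag for child in root.find("book")])  # prints ['a', 'b']
```

Because only adjacent runs are collapsed, a sequence such as a, b, b, a keeps both a elements.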
As I understood, you want to omit a leaf element (without children elements)
if it has a previous sibling, which:
is also a leaf element,
has the same name,
has the same text content.
So the most intuitive solution (I think) is to write an empty template,
matching just these nodes:
<xsl:template match="*[not(*)][preceding-sibling::*[1][not(*)]
[name() = current()/name()][text() = current()/text()]]"/>
A brief description of the match attribute:
*[not(*)] - Every element without any child element (leaf element).
[ - Start of the second predicate.
preceding-sibling::*[1] - Take the first preceding sibling.
[not(*)] - It must not have any child element.
[name() = current()/name()] - It must have the same name as the
"starting" element.
[text() = current()/text()] - It must have the same text as the
"starting" element.
] - End of the second predicate.
Of course, the script must also contain an identity template.
For a working example, with a slightly extended source, see http://xsltransform.net/jxN8Nqm
If the requirement for identical text is not needed, delete the respective
predicate fragment.

Output php processing instruction inside attribute value

In my XSLT (2.0 - the output method is html) I have this:
<img>
<xsl:attribute name="href">
<xsl:text disable-output-escaping="yes">&lt;?php echo get_url(); ?&gt;</xsl:text>
</xsl:attribute>
</img>
The output I want is as follows:
<img href="<?php echo get_url(); ?>">
The output I get is as follows:
<img href="<?php echo get_url(); ?&gt;">
I tried a bunch of different things to get ">" in the output instead of &gt; (CDATA marked sections etc.), but nothing seems to work. Strange that the less-than sign works fine but the greater-than doesn't. I'm using Saxon-PE 9.5.1.7.
Use a character map with some characters you don't need elsewhere, here is an example (https://www.w3.org/TR/xslt20/#character-maps) adapted from the XSLT 2.0 spec:
<img href="«?php echo get_url(); ?»"/>
and
<xsl:output method="html" use-character-maps="m1"/>
<xsl:character-map name="m1">
<xsl:output-character character="«" string="&lt;"/>
<xsl:output-character character="»" string="&gt;"/>
</xsl:character-map>
Online example is at http://xsltransform.net/93dEHFP.
As for disable-output-escaping, it does not work in attribute values as far as I know. The result you get is not caused by disable-output-escaping but by xsl:output method="html" (https://www.w3.org/TR/xslt-xquery-serialization/#HTML_ATTRIBS), which mandates: 'The HTML output method MUST NOT escape "<" characters occurring in attribute values.'

XSLT creates an empty space after conversion to HTML

I don't get it.
My xml input:
<?xml version="1.0" encoding="UTF-8"?>
<results>
<error file="mixed.cpp" line="11" id="unreadVariable" severity="style" msg="Variable 'wert' is assigned a value that is never used."/>
<error file="mixed.cpp" line="13" id="unassignedVariable" severity="style" msg="Variable 'b' is not assigned a value."/>
<error file="mixed.cpp" line="11" id="arrayIndexOutOfBounds" severity="error" msg="Array 'wert[2]' accessed at index 3, which is out of bounds."/>
<error file="mixed.cpp" line="15" id="uninitvar" severity="error" msg="Uninitialized variable: b"/>
<error file="mixed.cpp" line="5" id="unusedFunction" severity="style" msg="The function 'func' is never used."/>
<error file="*" line="0" id="unmatchedSuppression" severity="style" msg="Unmatched suppression: missingIncludeSystem"/>
</results>
using this xsl file:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="error">
<tr>
<td><xsl:value-of select="@file"/></td>
<td><xsl:value-of select="@line"/></td>
<td><xsl:value-of select="@test"/></td>
<td><xsl:value-of select="@severity"/></td>
<td><xsl:value-of select="@msg"/></td>
</tr>
</xsl:template>
</xsl:stylesheet>
But the first line I get is empty:
empty line
<tr><td>mixed.cpp</td><td>11</td><td/><td>style</td><td>Variable 'wert' is assigned a value that is never used.</td></tr>
Where is the empty line coming from?
The default (built-in) templates kick in for nodes that your error template does not match, and for text nodes the built-in template just outputs the text. Since you have whitespace text nodes and you are not matching results, the whitespace inside results (before and after each error) becomes part of the output.
There are multiple ways to fix this. A typical method is to write a low-priority template that matches the text you do not want to output. I.e., if you add the following, your whitespace will disappear:
<xsl:template match="text()" />
Another approach would be to positively match your structure. I.e., if you would add the following, the whitespace also disappears, because now you match the root element and subsequently only apply templates on the elements that you are interested in (and not also the text nodes under results).
<xsl:template match="results">
<xsl:apply-templates select="error" />
</xsl:template>
A third approach would be to add a whitespace-stripping declaration, but this may influence the input XML if your actual stylesheet is larger and would depend on whitespace elsewhere. This would only strip the whitespace on the results element:
<xsl:strip-space elements="results"/>
All three solutions work; which one is most suitable depends on your project as a whole.
Remember that in XSLT 1.0 and XSLT 2.0, nodes that no template matches are handled by the built-in templates (which are invisible): elements are traversed, and text nodes have their text value copied to the output. In XSLT 3.0 you have more control over this process:
<!-- XSLT 3.0 only -->
<xsl:mode on-no-match="shallow-skip" />
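To see the whitespace text nodes this answer refers to, one can inspect a small input with Python's ElementTree. This is only an illustration of where the whitespace lives in the tree, not of how an XSLT processor is implemented; the shortened SOURCE document and the variable names are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Shortened, indented version of the question's input.
SOURCE = """<results>
  <error file="mixed.cpp" line="11"/>
  <error file="mixed.cpp" line="13"/>
</results>"""

root = ET.fromstring(SOURCE)

# ElementTree keeps the text before the first child in .text and the
# text following each child in that child's .tail.
whitespace_nodes = [root.text] + [child.tail for child in root]

# Joining all text content approximates what the built-in templates
# copy to the output when no template handles the text nodes.
leaked = "".join(root.itertext())

print(repr(whitespace_nodes))
print(repr(leaked))
```

Every entry consists solely of newlines and spaces, which is exactly the text that ends up before the first tr in the output.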

extracting information from a JSON file using XSLT version 1.0

I'm a noobie to stackoverflow and xslt so I hope I don't sound unintelligent!
So I am working with SDI for a GIS company and I have a task that requires me to convert points from one spatial reference system (SRS) coordinate plane, such as EPSG:4035, to the world SRS, aka EPSG:4326. This really isn't a problem for me since I have access to an online service that will just give me what I want. However, it outputs either JSON or HTML. I have browsed for a while to find a way to extract information from a JSON file, but most of the techniques I have seen use xsl:stylesheet version 2.0, and I have to use version 1.0. One method I thought about was using the document($urlWithJsonFormat) XSLT function; however, this only accepts XML files.
Here is an example of the JSON formatted file that I would retrieve after asking for the conversion:
{
"geometries" :
[{
"xmin" : -4,
"ymin" : -60,
"xmax" : 25,
"ymax" : -41
}
]
}
All I simply want are the xmin, ymin, xmax, and ymax values, that's all! It just seems so simple yet nothing works for me...
You could use an external entity to include the JSON data as part of an XML file that you then transform.
For instance, assuming the example JSON is saved as a file called "geometries.json" you could create an XML file like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE wrapper [
<!ENTITY otherFile SYSTEM "geometries.json">
]>
<wrapper>&otherFile;</wrapper>
And then transform it with the following XSLT 1.0 stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="wrapper">
<geometries>
<xsl:call-template name="parse-json-member-value">
<xsl:with-param name="member" select="'xmin'"/>
</xsl:call-template>
<xsl:call-template name="parse-json-member-value">
<xsl:with-param name="member" select="'ymin'"/>
</xsl:call-template>
<xsl:call-template name="parse-json-member-value">
<xsl:with-param name="member" select="'xmax'"/>
</xsl:call-template>
<xsl:call-template name="parse-json-member-value">
<xsl:with-param name="member" select="'ymax'"/>
</xsl:call-template>
</geometries>
</xsl:template>
<xsl:template name="parse-json-member-value">
<xsl:param name="member"/>
<xsl:element name="{$member}">
<xsl:value-of select="normalize-space(
translate(
substring-before(
substring-after(
substring-after(.,
concat('"',
$member,
'"'))
, ':')
,'&#xA;')
, ',', '')
)"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
To produce the following output:
<geometries>
<xmin>-4</xmin>
<ymin>-60</ymin>
<xmax>25</xmax>
<ymax>-41</ymax>
</geometries>
The two main choices here seem to be:
write (or use) a JSON parser in XSLT 1.0, or
use some other language than XSLT.
Since XSLT 1 engines generally can't process JSON directly I'd recommend using some other language to convert to XML.
https://github.com/WelcomWeb/JXS may help you too, if this is XSLT in a Web browser.
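As one illustration of the "some other language" route, a few lines of Python can turn the JSON into the same XML shape as the XSLT answer above, which an XSLT 1.0 stylesheet could then read with document(). This is a minimal sketch that assumes the response always has a geometries array with at least one entry; the function name geometries_to_xml is invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# The sample response from the question.
JSON_TEXT = """{
  "geometries": [
    {"xmin": -4, "ymin": -60, "xmax": 25, "ymax": -41}
  ]
}"""


def geometries_to_xml(json_text: str) -> ET.Element:
    """Convert the first geometry's extent into a small XML element tree."""
    data = json.loads(json_text)
    extent = data["geometries"][0]  # assumption: at least one geometry
    root = ET.Element("geometries")
    for member in ("xmin", "ymin", "xmax", "ymax"):
        ET.SubElement(root, member).text = str(extent[member])
    return root


xml_root = geometries_to_xml(JSON_TEXT)
print(ET.tostring(xml_root, encoding="unicode"))
```

Unlike the string-slicing stylesheet, a real JSON parser does not depend on line breaks or member order in the response.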

XPath Expression: Select elements between A HREF="expr" tags

I haven't found an explicit way to select all nodes that exist between two anchors (<a></a> tag pair) in an HTML file.
The first anchor has the following format:
Second anchor:
I've verified that both can be selected using starts-with (note that I'm using HTML Agility Pack):
HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");
With this in mind, and with my amateurish XPath skills, I wrote the following expression to get all tags between the two anchors:
html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");
This seems to work, but it selects the entire HTML document!
I need to, for example for the following HTML fragment:
<html>
...
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
...
</html>
remove both anchors, the three P (including of course the inner SPAN).
Any way to do this?
I don't know if XPath 2.0 offers better ways to achieve this.
EDIT (special case!)
I should also handle the case where:
"Select tags between X and X', where X is <p></p>"
So instead of:
<!-- xhtml to be extracted -->
I should handle also:
<p>
</p>
<!-- xhtml to be extracted -->
<p>
</p>
Thank you very much, again.
Use this XPath 1.0 expression:
//a[starts-with(@href,'file://START')]/following-sibling::node()
[count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
=
count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
]
Or, use this XPath 2.0 expression:
//a[starts-with(@href,'file://START')]/following-sibling::node()
intersect
//a[starts-with(@href,'file://END')]/preceding-sibling::node()
The XPath 2.0 expression uses the XPath 2.0 intersect operator.
The XPath 1.0 expression uses the Kayessian (after @Michael Kay) formula for the intersection of two node-sets:
$ns1[count(.|$ns2) = count($ns2)]
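The idea behind the formula is that a node belongs to $ns2 exactly when adding it to $ns2 does not make the set any larger. A small Python sketch of the same test follows; it assumes the anchors are siblings under one parent, uses a made-up sample document, and, because ElementTree exposes only element children, covers elements rather than every node().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with START and END anchors as siblings.
SOURCE = """<body>
  <p>before</p>
  <a href="file://START"/>
  <p>one</p>
  <p>two <span>X</span></p>
  <a href="file://END"/>
  <p>after</p>
</body>"""


def is_anchor(el, prefix):
    """True for <a> elements whose href starts with the given prefix."""
    return el.tag == "a" and el.get("href", "").startswith(prefix)


def siblings_between(parent, start_prefix, end_prefix):
    children = list(parent)
    start = next(i for i, el in enumerate(children) if is_anchor(el, start_prefix))
    end = next(i for i, el in enumerate(children) if is_anchor(el, end_prefix))
    ns1 = children[start + 1:]   # following siblings of the START anchor
    ns2 = set(children[:end])    # preceding siblings of the END anchor
    # Kayessian test: count(. | $ns2) = count($ns2)
    return [el for el in ns1 if len(ns2 | {el}) == len(ns2)]


body = ET.fromstring(SOURCE)
between = siblings_between(body, "file://START", "file://END")
print([el.tag for el in between])  # prints ['p', 'p']
```

In Python one would normally write a plain set intersection; the len() comparison is kept only to mirror the count() trick that XPath 1.0 needs.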
Verification with XSLT:
This XSLT 1.0 transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
" //a[starts-with(@href,'file://START')]/following-sibling::node()
[count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
=
count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
]
"/>
</xsl:template>
</xsl:stylesheet>
when applied on the provided XML document:
<html>...
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
...
</html>
produces the wanted, correct result:
<p>First nodes</p>
<p>First nodes
<span>X</span>
</p>
<p>First nodes</p>
This XSLT 2.0 transformation:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
" //a[starts-with(@href,'file://START')]/following-sibling::node()
intersect
//a[starts-with(@href,'file://END')]/preceding-sibling::node()
"/>
</xsl:template>
</xsl:stylesheet>
when applied on the same XML document (above) again produces exactly the wanted result.
I've added a special case that I should handle
To handle this special case you can work in the same way, that is, use the Kayessian formula (and use XPath Visualizer as well ;-)). The intersecting node-sets change as follows:
Intersecting node-set C
"//p[.//a[starts-with(@href,'file://START')]]
/following-sibling::node()"
All following siblings of the p containing a START.
Intersecting node-set D
"./following-sibling::p[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node()"
All preceding siblings of the p that contains an END and is a following sibling of the current p.
Now you can perform the intersection as:
C ∩ D
That is
"//p[.//a[starts-with(@href,'file://START')]]
/following-sibling::node()[
count(.| ./following-sibling::p
[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node())
=
count(./following-sibling::p
[.//a[starts-with(@href,'file://END')]]
/preceding-sibling::node())
]"
If you need to manage both situations, you can proceed with the union of the intersecting node-sets as
(A ∩ B) ∪ (C ∩ D)
Where:
the XPath union operator | must be used;
the node-sets A and B are already shown in @Dimitre's answer;
the node-sets C and D are those shown in my answer.