SelectNodes and GetElementsByTagName - html

What are the main differences between SelectNodes and GetElementsByTagName?

SelectNodes is a .NET/MSXML-specific method that returns the list of nodes matching an XPath expression. An XPath expression can select elements by tag name, but it can also express many more complicated selection rules.
getElementsByTagName is a DOM Level 1 Core standard method available in many languages (spelled with a capital G in .NET). It selects elements only by tag name; you can't ask it to select elements with a certain attribute, or elements with tag name a inside other elements with tag name b, or anything clever like that. It's older, simpler, and in some environments faster.

SelectNodes takes an XPath expression as a parameter and returns all nodes that match that expression.
GetElementsByTagName takes a tag name as a parameter and returns all elements that have that name.
SelectNodes is therefore more expressive, as you can write any GetElementsByTagName call as a SelectNodes call, but not the other way around. XPath is a very robust way of expressing sets of XML nodes, offering more ways of filtering than just name. XPath, for example, can filter by tag name, attribute names, inner content and various aggregate functions on tag children as well.
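As a rough illustration of that difference in expressiveness, here is a sketch in Python rather than .NET (chosen only to keep the example short and runnable). xml.dom.minidom exposes the DOM getElementsByTagName, while xml.etree.ElementTree accepts a limited XPath subset that stands in here for SelectNodes. The sample document is made up.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Made-up sample document, not taken from the thread.
XML = (
    "<root>"
    "<person id='1'><name>j</name></person>"
    "<person id='2'><name>k</name></person>"
    "<business/>"
    "</root>"
)

# DOM style: selection by tag name only.
dom = minidom.parseString(XML)
by_tag = dom.getElementsByTagName("person")

# XPath style (ElementTree's XPath subset): tag name plus an attribute filter.
root = ET.fromstring(XML)
by_xpath = root.findall("./person[@id='2']")

print(len(by_tag))                       # how many <person> elements exist
print([p.get("id") for p in by_xpath])   # ids of the filtered matches
```

The tag-name call can only return every person element; narrowing it down by attribute has to happen in the XPath expression or in your own loop.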

SelectNodes() is a Microsoft extension to the Document Object Model (DOM) (msdn).
SelectNodes(), as mentioned by Welbog and others, takes an XPath expression. I would like to point out how it differs from GetElementsByTagName() when XML nodes need to be deleted.
Answer and code provided by user chilberto on the MSDN forum:
The next test illustrates the difference by performing the same function (removing the person nodes) but using the GetElementsByTagName() method to select the nodes. Though the same object type is returned, its construction is different. SelectNodes() returns a collection of references back to the XML document. That means we can remove nodes from the document in a foreach without affecting the list of references, as shown by the count of the node list not changing. GetElementsByTagName() returns a collection that directly reflects the nodes in the document. That means that as we remove items from the parent, we actually affect the collection of nodes. This is why the node list cannot be manipulated in a foreach and the loop had to be changed to a while loop.
.NET SelectNodes()
[TestMethod]
public void TestSelectNodesBehavior()
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(@"<root>
        <person>
            <id>1</id>
            <name>j</name>
        </person>
        <person>
            <id>2</id>
            <name>j</name>
        </person>
        <person>
            <id>1</id>
            <name>j</name>
        </person>
        <person>
            <id>3</id>
            <name>j</name>
        </person>
        <business></business>
    </root>");

    XmlNodeList nodeList = doc.SelectNodes("/root/person");
    Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
    Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");

    // Removing nodes from the document does not change the selected list.
    foreach (XmlNode n in nodeList)
        n.ParentNode.RemoveChild(n);

    Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
    Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
}
.NET GetElementsByTagName()
[TestMethod]
public void TestGetElementsByTagNameBehavior()
{
    XmlDocument doc = new XmlDocument();
    doc.LoadXml(@"<root>
        <person>
            <id>1</id>
            <name>j</name>
        </person>
        <person>
            <id>2</id>
            <name>j</name>
        </person>
        <person>
            <id>1</id>
            <name>j</name>
        </person>
        <person>
            <id>3</id>
            <name>j</name>
        </person>
        <business></business>
    </root>");

    XmlNodeList nodeList = doc.GetElementsByTagName("person");
    Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
    Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");

    // The list is live, so it shrinks as nodes are removed from the document.
    while (nodeList.Count > 0)
        nodeList[0].ParentNode.RemoveChild(nodeList[0]);

    Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
    Assert.AreEqual(0, nodeList.Count, "All the nodes have been removed");
}
With SelectNodes() we get a collection/list of references to the nodes of the XML document, and we can work with those references. If we delete a node, the change is visible in the XML document, but the collection/list of references stays the same (although the reference to the deleted node now points to null -> System.NullReferenceException). I do not really know how this is implemented. I suppose that if we use XmlNodeList nodeList = GetElementsByTagName() and delete a node with nodeList[i].ParentNode.RemoveChild(nodeList[i]), it frees/deletes the reference in the nodeList variable.
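The snapshot side of this can be sketched outside .NET as well. The following is only an analogy in Python's ElementTree, under the assumption that a plain list returned by findall() plays the role of the SelectNodes() result; it does not reproduce the live XmlNodeList that GetElementsByTagName() returns in .NET.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><person/><person/><person/><person/><business/></root>"
)

# findall() returns a plain list: a snapshot that later removals do not change.
snapshot = root.findall("person")
for person in snapshot:
    root.remove(person)

# The document itself has changed; the snapshot still holds the removed elements.
remaining = [child.tag for child in root]
print(len(snapshot), remaining)
```

Because the loop walks the snapshot and not the document's own child list, removing nodes inside it is safe, which is the same reason the foreach works in the SelectNodes() test.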

Related

xpath filter - how to filter to the latest node record

I have an issue with filtering an XPath on a specific node so that only the latest record is kept. In the example XML there is a rule that the very first record on each node holds the most current information. I would like to filter out all records that are not relevant (that is, every record other than the first).
The second rule is that I do not want to use date conditions to filter out the job_information records after the very first one.
Normally I use an XPath tester with an expression such as:
/queryCompoundEmployeeResponse/CompoundEmployee[(person/employment_information/job_information[1])]
which gives me only the first record of job information, but here it does not work. Can you show me what is wrong with it?
Can you help me?
XML input with 3 job_information records
<queryCompoundEmployeeResponse>
<CompoundEmployee>
<id>11111</id>
<person>
<employment_information>
<job_information>
<end_date>9999-12-31</end_date>
<start_date>2017-05-17</start_date>
</job_information>
<job_information>
<end_date>2018-12-31</end_date>
<start_date>2017-05-17</start_date>
</job_information>
<job_information>
<end_date>2016-12-31</end_date>
<start_date>2013-05-17</start_date>
</job_information>
</employment_information>
</person>
</CompoundEmployee>
</queryCompoundEmployeeResponse>
xml output I would like to have
<queryCompoundEmployeeResponse>
<CompoundEmployee>
<id>11111</id>
<person>
<employment_information>
<job_information>
<end_date>9999-12-31</end_date>
<start_date>2017-05-17</start_date>
</job_information>
</employment_information>
</person>
</CompoundEmployee>
</queryCompoundEmployeeResponse>
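Assuming you always have the same XML structure, you can try:
library(XML)    # xmlParse, xpathSApply
library(readr)  # read_lines
data=xmlParse("C:/Users/.../pathtoyourxmlfile.xml")
# first line of the second job_information block (relies on the fixed layout of the file)
a=xpathSApply(data,"count(//job_information[1]/ancestor::*)+6")
# last line of the last job_information block (each block spans 4 lines)
b=xpathSApply(data,"count(//job_information)-1")*4+(a-1)
old=read_lines("C:/Users/.../pathtoyourxmlfile.xml")
# drop lines a..b, keeping only the first job_information record
new = old[-(a:b)]
writeLines(new,con = "new.xml")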
Output (new.xml) :
<queryCompoundEmployeeResponse>
<CompoundEmployee>
<id>11111</id>
<person>
<employment_information>
<job_information>
<end_date>9999-12-31</end_date>
<start_date>2017-05-17</start_date>
</job_information>
</employment_information>
</person>
</CompoundEmployee>
</queryCompoundEmployeeResponse>
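A note on why the original expression cannot produce that output: its predicate filters CompoundEmployee elements, so every employee that has a first job_information is returned whole, with all of its records. XPath only selects nodes; dropping the later records means editing the tree. Below is a minimal sketch of that approach in Python's ElementTree (my choice of tool, not something from the question), which works on the parsed tree instead of on line numbers.

```python
import xml.etree.ElementTree as ET

XML = """<queryCompoundEmployeeResponse>
  <CompoundEmployee>
    <id>11111</id>
    <person>
      <employment_information>
        <job_information>
          <end_date>9999-12-31</end_date>
          <start_date>2017-05-17</start_date>
        </job_information>
        <job_information>
          <end_date>2018-12-31</end_date>
          <start_date>2017-05-17</start_date>
        </job_information>
        <job_information>
          <end_date>2016-12-31</end_date>
          <start_date>2013-05-17</start_date>
        </job_information>
      </employment_information>
    </person>
  </CompoundEmployee>
</queryCompoundEmployeeResponse>"""

root = ET.fromstring(XML)

# For every employment_information, keep the first job_information and
# remove the rest; no date comparison is involved.
for info in root.iter("employment_information"):
    for stale in info.findall("job_information")[1:]:
        info.remove(stale)

end_dates = [e.text for e in root.iter("end_date")]
print(ET.tostring(root, encoding="unicode"))
```

Unlike the line-based approach, this does not depend on each record spanning exactly four lines.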

Using Perl LibXML to read textContent that contains html tags

If I have the following XML:
<File id="MyTestApp/app/src/main/res/values/strings.xml">
<Identifier id="page_title" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">My First App</EngTranslation>
<Description index="0">Home page title</Description>
<LangTranslation index="0">My First App</LangTranslation>
</Identifier>
<Identifier id="count" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
<Description index="0">Number of page views</Description>
<LangTranslation index="0">You have <b>%1$d</b> view(s)</LangTranslation>
</Identifier>
</File>
I'm trying to read the 'EngTranslation' text value, and want to return the full value including any HTML tags. For example, I have the following:
use strict;
use warnings;
use XML::LibXML;
use Encode;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("test.xml") or die;
foreach my $file ($dom->findnodes('/File')) {
    print $file->getAttribute("id")."\n";
    foreach my $identifier ($file->findnodes('./Identifier')) {
        print $identifier->getAttribute("id")."\n";
        print encode('UTF-8',$identifier->findnodes('./EngTranslation')->get_node(1)->textContent."\n");
        print encode('UTF-8',$identifier->findnodes('./Description')->get_node(1)->textContent."\n");
        print encode('UTF-8',$identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n");
    }
}
The output I get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have %1$d view(s)
Number of page views
You have %1$d view(s)
What I'm hoping to get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have <b>%1$d</b> view(s)
Number of page views
You have <b>%1$d</b> view(s)
I'm just using this as an example for a more complicated situation, hopefully it makes sense.
Thanks!
Here's a somewhat monkey-patching solution, but it works:
sub XML::LibXML::Node::innerXML{
    my ($self) = shift;
    join '', $self->childNodes();
}
…
say $identifier->findnodes('./Description')->get_node(1)->innerXML;
Oh, and if the encoding becomes a problem, use the toString method; its optional arguments control formatting and encoding. (I did use open, but there were no out of range characters in the xml).
If you don't like the monkey patching, you can change the sub to a normal one and supply the argument, like this:
sub myInnerXML{
    my ($self) = shift;
    join '', map{$_->toString(1)} $self->childNodes();
}
…
say myInnerXML($identifier->findnodes('./Description')->get_node(1));
In your source XML, you either need to encode the tags as entities or wrap that content in a CDATA section.
One problem with embedding HTML in XML is that HTML is not necessarily 'well formed'. For example the <br> tag and the <img> tag are not usually followed by matching closing tags and without the closing tags, it would not be valid in an XML document unless you XML-escape the whole string of HTML, e.g.:
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have &lt;b&gt;%1$d&lt;/b&gt; view(s)</EngTranslation>
Or use a CDATA section:
<EngTranslation eng_indx="0" goesWith="-1" index="0"><![CDATA[You have <b>%1$d</b> view(s)]]></EngTranslation>
However, if you restrict your HTML to always be well-formed, you can achieve what you want with the toString() method.
If you called toString() on the <EngTranslation> element node, the output would include the <EngTranslation>...</EngTranslation> wrapper tags. So instead, you would need to call toString() on each of the child nodes and concatenate the results together:
binmode(STDOUT, ':utf8');
foreach my $file ($dom->findnodes('/File')) {
    print $file->getAttribute("id")."\n";
    foreach my $identifier ($file->findnodes('./Identifier')) {
        print $identifier->getAttribute("id")."\n";
        my $html = join '', map { $_->toString }
            $identifier->findnodes('./EngTranslation')->get_node(1)->childNodes;
        print $html."\n";
        print $identifier->findnodes('./Description')->get_node(1)->textContent."\n";
        print $identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n";
    }
}
Note I took the liberty of using binmode to set UTF8 encoding on the output filehandle so it was not necessary to call encode for every print.
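For comparison only, the same "serialize the children, not the element itself" idea can be sketched in Python's ElementTree. The helper name inner_xml is made up for this example, and ElementTree keeps text differently from LibXML (as text and tail attributes instead of text nodes), so this is an analogue and not a translation of the Perl above.

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Return the element's content, child markup included, without its own tags."""
    # Text that appears before the first child element.
    parts = [elem.text or ""]
    # tostring() on a child also emits the text that follows it (its tail).
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts)

node = ET.fromstring(
    '<EngTranslation eng_indx="0" goesWith="-1" index="0">'
    "You have <b>%1$d</b> view(s)</EngTranslation>"
)
print(inner_xml(node))
```

As with the toString() approach, this only works when the embedded HTML is well-formed XML.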

XML: How to recursively combine the text of identical elements in Xquery

How can I merge all the content of identical, repeated elements throughout a document using XQuery?
sample document:
<webMessage xmlns="http://www.website.gov.uk/CM/envelope">
<EnvelopeVersion>2.0</EnvelopeVersion>
<Header>
<MessageDetails>
<Class>Web-CT600</Class>
<Qualifier/>
<Function/>
</MessageDetails>
<SenderDetails>
<IDAuthentication>
<SenderID/>
<Authentication>
<Method/>
<Role/>
<Value/>
</Authentication>
</IDAuthentication>
</SenderDetails>
</Header>
<webTalkDetails>
<Keys>
<Key Type="UTR">2274792909</Key>
</Keys>
<ChannelRouting>
<Channel>
<URI/>
<Product/>
<Version/>
</Channel>
</ChannelRouting>
</webTalkDetails>
<Body>
<IRenvelope xmlns="http://www.website.gov.uk/taxation/CT/3">
<IRheader>
<Keys>
<Key Type="UTR">2274792909</Key>
</Keys>
<PeriodEnd/>
<DefaultCurrency/>
<IRmark Type="generic">n1uS2MiavBsb6YwL82MK</IRmark>
<Sender/>
</IRheader>
<CompanyReturn ReturnType="new">
<CompanyInformation>
<CompanyName/>
<RegistrationNumber/>
<Reference/>
<PeriodCovered>
<From>2013-01-07</From>
<To>2014-01-07</To>
</PeriodCovered>
</CompanyInformation>
<Turnover>
<Total>45893</Total>
</Turnover>
<CompanyCalculation>
<Income>
<TradingAndProfessional>
<Profits>95517</Profits>
<NetProfits>51276</NetProfits>
</TradingAndProfessional>
</Income>
</CompanyCalculation>
<AttachedFiles>
<Xsubmission>
<Accounts>
<Instance>
<EncodedInlineSubmission> TEXT I WANT TO JOIN</EncodedInlineSubmission>
</Instance>
</Accounts>
<Computations>
<Instance>
<EncodedInlineSubmission> MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>
</Instance>
</Computations>
</Xsubmission>
</AttachedFiles>
</CompanyTaxReturn>
</IRenvelope>
</Body>
So in this XML I want to combine the text of all the instances of EncodedInlineSubmission and put it into one single element, so that it reads:
<EncodedInlineSubmission> TEXT I WANT TO JOIN MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>
Update: added an element constructor around the returned string.
You can use fn:string-join() to join a sequence of strings with a joiner string. You'll need to evaluate an XPath expression that selects all the nodes you want to join, and then retrieve their string values.
Here's an example:
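declare namespace env = "http://www.website.gov.uk/CM/envelope";
(: EncodedInlineSubmission sits under IRenvelope, which switches the default namespace :)
declare namespace ct = "http://www.website.gov.uk/taxation/CT/3";
let $nodes := $doc/env:webMessage/env:Body//ct:EncodedInlineSubmission
return element EncodedInlineSubmission { fn:string-join($nodes/fn:string(), " ") }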
Notes:
Assume $doc is bound to the document node of your sample document.
You may need a different XPath expression.
Your sample is not well-formed.
You can simply construct a new element using the concatenated values of all elements and the previous name of an element:
for $x in //*:Xsubmission
let $encoded := $x//*:EncodedInlineSubmission
return element {$encoded[1]/local-name()} {string-join($encoded)}
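The same join can be sketched outside XQuery. The following Python/ElementTree version is an illustration under a few assumptions: the document is a cut-down, well-formed stand-in for the sample, and the {*} namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed stand-in for the sample document.
XML = """<webMessage xmlns="http://www.website.gov.uk/CM/envelope">
  <Body>
    <IRenvelope xmlns="http://www.website.gov.uk/taxation/CT/3">
      <Accounts><Instance>
        <EncodedInlineSubmission> TEXT I WANT TO JOIN</EncodedInlineSubmission>
      </Instance></Accounts>
      <Computations><Instance>
        <EncodedInlineSubmission> MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>
      </Instance></Computations>
    </IRenvelope>
  </Body>
</webMessage>"""

root = ET.fromstring(XML)

# "{*}" matches the element name in any namespace (Python 3.8+),
# similar to the *: wildcard in the XQuery above.
nodes = root.findall(".//{*}EncodedInlineSubmission")

# Build one new element holding the concatenated text, in document order.
merged = ET.Element("EncodedInlineSubmission")
merged.text = "".join(n.text or "" for n in nodes)
print(ET.tostring(merged, encoding="unicode"))
```

The new element is created without a namespace here; whether it should carry the original one depends on what consumes the result.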

What does @ mean

What does the @ in data.@state mean?
<s:State name="normal" basedOn="{data.@state}"/>
Thank you.
@ is the E4X attribute identifier operator.
var myXML:XML =
    <order>
        <item id='1'>
            <menuName>burger</menuName>
            <price>3.95</price>
        </item>
        <item id='2'>
            <menuName>fries</menuName>
            <price>1.45</price>
        </item>
    </order>;
trace(myXML.item[0].@id); // Output: 1
As others have stated, @ is the E4X attribute operator.
In the context you have provided, I must assume that data is an XMLList. But it may be an XML variable. In the context of Flex it may also be an XMLListCollection, which is just a wrapper around an XMLList used as the dataProvider of a Flex list-based class.
I assume that the data variable must point to something like this:
<someElement state="someStateValue"> </someElement>
And therefore, data.@state should return the value 'someStateValue'
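E4X is specific to ActionScript/JavaScript engines, so for readers coming from elsewhere, here is a rough counterpart in Python's ElementTree, given purely as an analogy. Attribute access is a method call there, not an operator.

```python
import xml.etree.ElementTree as ET

order = ET.fromstring(
    "<order>"
    "<item id='1'><menuName>burger</menuName><price>3.95</price></item>"
    "<item id='2'><menuName>fries</menuName><price>1.45</price></item>"
    "</order>"
)

items = order.findall("item")

# Roughly what myXML.item[0].@id expresses in E4X.
first_id = items[0].get("id")

# Collecting the attribute from every item.
all_ids = [item.get("id") for item in items]
print(first_id, all_ids)
```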

How to iterate through DOM elements that match a css class using xpath?

I'm processing an HTML page with a variable number of p elements with a CSS class "myclass", using Python + Selenium RC.
When I try to select each node with this XPath:
//p[@class='myclass'][n]
(with n a natural number)
I get only the first p element with this CSS class for every n, unlike the situation when I iterate by selecting ALL p elements with:
//p[n]
Is there any way I can iterate through elements by CSS class using XPath?
XPath 1.0 doesn't provide an iterating construct.
Iteration can be performed on the selected node-set in the language that is hosting XPath.
Examples:
In XSLT 1.0:
<xsl:for-each select="someExpressionSelectingNodes">
<!-- Do something with the current node -->
</xsl:for-each>
In C#:
using System;
using System.IO;
using System.Xml;

public class Sample {
    public static void Main() {
        XmlDocument doc = new XmlDocument();
        doc.Load("booksort.xml");

        XmlNodeList nodeList;
        XmlNode root = doc.DocumentElement;
        nodeList = root.SelectNodes("descendant::book[author/last-name='Austen']");

        // Change the price on the books.
        foreach (XmlNode book in nodeList)
        {
            book.LastChild.InnerText = "15.95";
        }

        Console.WriteLine("Display the modified XML document....");
        doc.Save(Console.Out);
    }
}
XPath 2.0 has its own iteration construct:
for $varname1 in someExpression1,
$varname2 in someExpression2,
. . . . . . . . . . .
$varnameN in someExpressionN
return
SomeExpressionUsingTheVarsAbove
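In Python, the language the question uses, the same idea might look like the sketch below. It assumes the page is available as well-formed markup and uses the standard library's ElementTree with its XPath subset, not Selenium; the sample page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
page = ET.fromstring(
    "<html><body>"
    "<div><p class='myclass'>one</p><p>skip</p></div>"
    "<div><p class='myclass'>two</p></div>"
    "</body></html>"
)

# Select all matching elements once, then iterate in the host language.
texts = []
for p in page.findall(".//p[@class='myclass']"):
    texts.append(p.text)

print(texts)
```

The expression is evaluated once and the loop does the iterating, so no positional index is needed in the XPath itself.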
Now that I look again at this question, I think the real problem is not in iterating, but in using //.
This is a FAQ:
//p[@class='myclass'][1]
selects every p element that has a class attribute with value "myclass" and that is the first such child of its parent. Therefore this expression may select many p elements, none of which is really the first such p element in the document.
When we want to get the first p element in the document that satisfies the above predicate, one correct expression is:
(//p)[@class='myclass'][1]
Remember: The [] operator has a higher priority (precedence) than the // abbreviation.
Whenever you need to index the nodes selected by //, always put the expression to be indexed in parentheses.
Here is a demonstration:
<nums>
<a>
<n x="1"/>
<n x="2"/>
<n x="3"/>
<n x="4"/>
</a>
<b>
<n x="5"/>
<n x="6"/>
<n x="7"/>
<n x="8"/>
</b>
</nums>
The XPath expression:
//n[@x mod 2 = 0][1]
selects the following two nodes:
<n x="2" />
<n x="6" />
The XPath expression:
(//n)[@x mod 2 = 0][1]
selects exactly the first n element in the document with the wanted property:
<n x="2" />
Try this first with the following transformation:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:template match="/">
        <xsl:copy-of select="//n[@x mod 2 = 0][1]"/>
    </xsl:template>
</xsl:stylesheet>
and the result is two nodes.
<n x="2" />
<n x="6" />
Now, change the XPath expression as below and try again:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:template match="/">
        <xsl:copy-of select="(//n)[@x mod 2 = 0][1]"/>
    </xsl:template>
</xsl:stylesheet>
and the result is what we really wanted -- the first such n element in the document:
<n x="2" />
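If no XSLT processor is at hand, the two meanings can be emulated in plain Python. The standard library's ElementTree does not support parenthesized expressions or mod, so the sketch below spells out by hand what each XPath expression does; it is an emulation for illustration, not an XPath evaluation.

```python
import xml.etree.ElementTree as ET

nums = ET.fromstring(
    "<nums>"
    "<a><n x='1'/><n x='2'/><n x='3'/><n x='4'/></a>"
    "<b><n x='5'/><n x='6'/><n x='7'/><n x='8'/></b>"
    "</nums>"
)

def is_even(n):
    return int(n.get("x")) % 2 == 0

# Emulates the unparenthesized form: the position is taken per parent,
# so every parent can contribute its own first match.
per_parent = []
for parent in nums.iter():
    evens = [n for n in parent.findall("n") if is_even(n)]
    per_parent.extend(evens[:1])

# Emulates the parenthesized form: all n elements in the document are
# collected first, and the position is taken once over that whole set.
overall = [n for n in nums.iter("n") if is_even(n)][:1]

print([n.get("x") for n in per_parent])
print([n.get("x") for n in overall])
```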
Maybe all your p elements with this class are at the same level, so //p[@class='myclass'] gives you the array of paragraphs with the specified class. You should then iterate through it using indexes, i.e.
//p[@class='myclass'][1], //p[@class='myclass'][2],...,//p[@class='myclass'][last()]
I don't think you're using the "index" for its real purpose. The //p[selection][index] syntax actually says which position the element must have among the matching p elements within its parent. So //p[selection][1] says that your selected p must be the first such p under its parent, and //p[selection][2] says it must be the second. Depending on your HTML, it's likely this isn't what you want.
Given that you're using Selenium and Python, there's a couple ways to do what you want, and you can look at this question to see them (there are two options given there, one in selenium Javascript, the other using the server-side selenium calls).
Here's a C# code snippet that might help you out.
The key here is the Selenium function GetXpathCount(). It should return the number of occurrences of the XPath expression you are looking for.
You can enter //p[@class='myclass'] in XPather or any other XPath analysis tool to verify that multiple results are indeed returned. Then you just iterate through the results in your code.
In my case, it was all the list items in a UL that needed to be iterated, i.e. //li[@class='myclass']/ul/li, so based on your requirements it should be something like:
int numProductsInLeftNav = Convert.ToInt32(selenium.GetXpathCount("//p[@class='myclass']"));
List<string> productsInLeftNav = new List<string>();
// XPath positions are 1-based. Note that [i] counts among siblings under the same parent.
for (int i = 1; i <= numProductsInLeftNav; i++) {
    string productName = selenium.GetText("//p[@class='myclass'][" + i + "]");
    productsInLeftNav.Add(productName);
}