How can I remove duplicate nodes in XQuery?


I have an XML document I generate on the fly, and I need a function to eliminate any duplicate nodes from it.
My function looks like:
declare function local:start2() {
let $data := local:scan_books()
return <books>{$data}</books>
};
Sample output is:
<books>
<book>
<title>XML in 24 hours</title>
<author>Some Guy</author>
</book>
<book>
<title>XML in 24 hours</title>
<author>Some Guy</author>
</book>
</books>
I want just one entry in my books root element. There are other elements in there too, such as pamphlet, that also need to have duplicates removed. Any ideas?
Updated following comments. By unique nodes, I mean remove multiple occurrences of nodes that have the exact same content and structure.

A simpler and more direct one-liner XPath solution:
Just use the following XPath expression:
/*/book
[index-of(/*/book/title,
title
)
[1]
]
When applied, for example, on the following XML document:
<books>
<book>
<title>XML in 24 hours</title>
<author>Some Guy</author>
</book>
<book>
<title>Food in Seattle</title>
<author>Some Guy2</author>
</book>
<book>
<title>XML in 24 hours</title>
<author>Some Guy</author>
</book>
<book>
<title>Food in Seattle</title>
<author>Some Guy2</author>
</book>
<book>
<title>How to solve XPAth Problems</title>
<author>Me</author>
</book>
</books>
the above XPath expression selects correctly the following nodes:
<book>
<title>XML in 24 hours</title>
<author>Some Guy</author>
</book>
<book>
<title>Food in Seattle</title>
<author>Some Guy2</author>
</book>
<book>
<title>How to solve XPAth Problems</title>
<author>Me</author>
</book>
The explanation is simple: for every book, select only one of its occurrences, namely the one whose index among all books is the same as the first index of its title among all titles.
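The same first-occurrence idea can be sketched outside XPath. The Python below is only an illustration under assumptions: books are modelled as (title, author) tuples, and the helper name first_by_key is hypothetical, not part of any library.

```python
def first_by_key(items, key):
    """Keep each item whose position equals the first position of its key."""
    keys = [key(item) for item in items]
    # keys.index(k) plays the role of index-of(...)[1] in the XPath expression:
    # it returns the first position at which the key occurs.
    return [item for pos, item in enumerate(items) if keys.index(keys[pos]) == pos]


books = [
    ("XML in 24 hours", "Some Guy"),
    ("Food in Seattle", "Some Guy2"),
    ("XML in 24 hours", "Some Guy"),
    ("Food in Seattle", "Some Guy2"),
    ("How to solve XPAth Problems", "Me"),
]

# Deduplicate by title only, as the XPath expression does.
unique_books = first_by_key(books, key=lambda book: book[0])
print(unique_books)
```

Like the XPath expression, this compares titles only, so two different books that share a title would be treated as duplicates.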

You can use the built-in distinct-values() function...

A solution inspired by functional programming. This solution is extensible in that you can replace the "=" comparison with your own boolean local:compare($element1, $element2) function. The function has worst-case quadratic complexity in the length of the list. You could get O(n log n) complexity by sorting the list beforehand and comparing each item only with its immediate successor.
To the best of my knowledge, the fn:distinct-values (or fn:distinct-elements) functions do not allow a custom-built comparison function.
declare function local:deduplicate($list) {
if (fn:empty($list)) then ()
else
let $head := $list[1],
$tail := $list[position() > 1]
return
if (fn:exists($tail[ . = $head ])) then local:deduplicate($tail)
else ($head, local:deduplicate($tail))
};
let $list := (1,2,3,4,1,2,1) return local:deduplicate($list)
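For readers more familiar with imperative code, here is a rough Python sketch of the same quadratic approach with a pluggable comparison. The names deduplicate and same are hypothetical; the default comparison stands in for the "=" used above.

```python
def deduplicate(items, same=lambda a, b: a == b):
    """Drop an item whenever an equivalent item appears later in the list."""
    result = []
    for pos, head in enumerate(items):
        tail = items[pos + 1:]
        # Mirrors fn:exists($tail[. = $head]): skip head if the tail still contains it.
        if not any(same(head, other) for other in tail):
            result.append(head)
    return result


numbers = deduplicate([1, 2, 3, 4, 1, 2, 1])
letters = deduplicate(["a", "A", "b"], same=lambda x, y: x.lower() == y.lower())
print(numbers)
print(letters)
```

As in the XQuery version, the last occurrence of each value is the one that survives, so the output order follows the last occurrences rather than the first.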

I solved my problem by implementing a recursive uniqueness search function, based solely on the text content of my document for uniqueness matching.
declare function ssd:unique-elements($list, $rules, $unique) {
let $element := subsequence($rules, 1, 1)
let $return :=
if ($element) then
if (index-of($list, $element) >= 1) then
ssd:unique-elements(insert-before($element, 1, $list), subsequence($rules, 2), $unique)
else <test>
<unique>{$element}</unique>
{ssd:unique-elements(insert-before($element, 1, $list), subsequence($rules, 2), insert-before($element, 1, $unique))/*}
</test>
else ()
return $return
};
Called as follows:
declare function ssd:start2() {
let $data := ()
let $sift-this :=
<test>
<data>123</data>
<data>456</data>
<data>123</data>
<data>456</data>
<more-data>456</more-data>
</test>
return ssd:unique-elements($data, $sift-this/*, ())/*/*
};
ssd:start2()
output:
<?xml version="1.0" encoding="UTF-8"?>
<data>123</data>
<data>456</data>
I guess if you need slightly different equivalence matching, you can alter the matching in the algorithm accordingly. Should get you started at any rate.

What about fn:distinct-values?

You can use this functx function: functx:distinct-deep
No need to reinvent the wheel.

To remove duplicates I usually use a helper function. In your case it will look like this:
declare function local:remove-duplicates($items as item()*)
as item()*
{
for $i in $items
group by $i
return $items[index-of($items, $i)[1]]
};
declare function local:start2() {
let $data := local:scan_books()
return <books>{local:remove-duplicates($data)}</books>
};

Related

XQUERY select an element one after the conditional element that has been found

I'm fairly new to XQuery so forgive me if this is extremely simple.
Essentially I'm searching a corpus of xml data for the word "has", and then I want to be able to return the word that follows immediately after "has" e.g. if the sentence was "has there been a fire?" I would like to return the word "there".
The XML corpus structure looks like this:
<s n="129">
<w c5="NP0" hw="indonesia" pos="SUBST">Indonesia</w>
<w c5="VHZ" hw="have" pos="VERB">has</w>
<w c5="AJ0" hw="large" pos="ADJ">large</w>
<w c5="NN2" hw="industry" pos="SUBST">industries</w>
<c c5="PUN">,</c>
<w c5="AV0" hw="recently" pos="ADV">recently</w>
<w c5="VVN" hw="develop" pos="VERB">developed</w>
</s>
In this sample of data, I'd like the word "large" as it immediately follows "has".
My current XQuery code looks like this:
<hascount>
{
for $v in
doc ("KS0.xml")/bncDoc/stext/div/u/s/w
where
$v = "has"
return ($v)
}
</hascount>
It simply returns all the instances of has at the moment. How would I change this code to be able to perform what my intended task is above?
Thank you in advance.

Try this code:
let $markup:=doc ("KS0.xml")
return $markup//w[matches(.,'^has$')]/following-sibling::w[1]

So I've found the answer to my own question.
This can be done by using XPath axis "following-sibling".
The implementation of this code in xquery would be:
<hascount>
{
for $v in
doc ("KS0.xml")/bncDoc/stext/div/u/s/w
where
$v = "has"
return ($v/following-sibling::*[1])
}
</hascount>
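The "first following sibling" step can also be illustrated in Python. This is a sketch under assumptions: xml.etree.ElementTree has no following-sibling axis, so the sibling is found by position in the parent's child list, and the helper name words_after is hypothetical.

```python
import xml.etree.ElementTree as ET

SENTENCE = """<s n="129">
<w c5="NP0" hw="indonesia" pos="SUBST">Indonesia</w>
<w c5="VHZ" hw="have" pos="VERB">has</w>
<w c5="AJ0" hw="large" pos="ADJ">large</w>
<w c5="NN2" hw="industry" pos="SUBST">industries</w>
<c c5="PUN">,</c>
<w c5="AV0" hw="recently" pos="ADV">recently</w>
<w c5="VVN" hw="develop" pos="VERB">developed</w>
</s>"""


def words_after(sentence, target):
    """Return the element that directly follows each <w> whose text equals target."""
    children = list(sentence)
    found = []
    for pos, child in enumerate(children[:-1]):
        if child.tag == "w" and child.text == target:
            # Equivalent of following-sibling::*[1]: the next child of the same parent.
            found.append(children[pos + 1])
    return found


following = [el.text for el in words_after(ET.fromstring(SENTENCE), "has")]
print(following)
```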

Customize JSON created by CL_SXML_STRING_WRITER

I create JSON like this to extract any table (name "randomly" decided at runtime, its name is in variable iv_table_name):
FIELD-SYMBOLS <itab> TYPE STANDARD TABLE.
DATA ref_itab TYPE REF TO data.
DATA(iv_table_name) = 'SCARR'.
CREATE DATA ref_itab TYPE STANDARD TABLE OF (iv_table_name).
ASSIGN ref_itab->* TO <itab>.
SELECT *
INTO TABLE <itab>
FROM (iv_table_name).
DATA results_json TYPE TABLE OF string.
DATA sub_json TYPE string.
DATA(lo_json_writer) = cl_sxml_string_writer=>create( type = if_sxml=>co_xt_json ).
CALL TRANSFORMATION id
SOURCE result = <itab>
RESULT XML lo_json_writer.
cl_abap_conv_in_ce=>create( )->convert(
EXPORTING
input = lo_json_writer->get_output( )
IMPORTING
data = sub_json ).
The result variable sub_json looks like this:
{"RESULT":
[
{"MANDT":"220","AUFNR":"0000012", ...},
{"MANDT":"220","AUFNR":"0000013", ...},
...
]
}
Is there a way to avoid the surrounding dictionary and get the result like this?
[
{"MANDT":"220","AUFNR":"0000012", ...},
{"MANDT":"220","AUFNR":"0000013", ...},
...
]
Background:
I used this:
sub_json = /ui2/cl_json=>serialize( data = <lt_result> pretty_name = /ui2/cl_json=>pretty_mode-low_case ).
But the performance of /ui2/cl_json=>serialize( ) is not good.

If you really want to use it just as a tool for extracting table records, then you could write your own ID transformation in STRANS. It could look like this; let us name it Z_JSON_TABLE_CONTENTS (create it with type XSLT):
<xsl:transform version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:sap="http://www.sap.com/sapxsl"
>
<xsl:output method="text" encoding="UTF-8" />
<xsl:strip-space elements="*"/>
<xsl:template match="RESULT">
[
<xsl:for-each select="*">
{
<xsl:for-each select="*">
"<xsl:value-of select="local-name()" />": "<xsl:value-of select="text()" />"<xsl:if test="position() != last()">,</xsl:if>
</xsl:for-each>
}<xsl:if test="position() != last()">,</xsl:if>
</xsl:for-each>
]
</xsl:template>
</xsl:transform>
Then you could use it like this:
REPORT ZZZ.
FIELD-SYMBOLS <itab> TYPE STANDARD TABLE.
DATA ref_itab TYPE REF TO data.
DATA(iv_table_name) = 'SCARR'.
CREATE DATA ref_itab TYPE STANDARD TABLE OF (iv_table_name).
ASSIGN ref_itab->* TO <itab>.
SELECT *
INTO TABLE <itab>
FROM (iv_table_name).
DATA results_json TYPE TABLE OF string.
DATA sub_json TYPE string.
DATA g_string TYPE string.
DATA(g_document) = cl_ixml=>create( )->create_document( ).
DATA(g_ref_stream_factory) = cl_ixml=>create( )->create_stream_factory( ).
DATA(g_ostream) = g_ref_stream_factory->create_ostream_cstring( g_string ).
CALL TRANSFORMATION Z_JSON_TABLE_CONTENTS
SOURCE result = <itab>
RESULT XML g_ostream.
DATA(g_json_parser) = new /ui5/cl_json_parser( ).
g_json_parser->parse( g_string ).

I don't know for certain whether it's possible to omit the initial "RESULT" tag in full sXML, but my opinion is no.
Here is a solution that follows the KISS principle:
REPLACE ALL OCCURRENCES OF REGEX '^\{"RESULT":|\}$' IN sub_json WITH ``.
There is also this alternative way of writing it (slightly slower):
sub_json = replace( val = sub_json regex = '^\{"RESULT":|\}$' with = `` occ = 0 ).
ADDENDUM about performance:
I measured that for a string of 880K characters, the following code with the exact number of positions to remove (10 leading characters and 1 trailing character) is 6 times faster than regex (could vary based on version of ABAP kernel), but maybe it won't be noticeable compared to the rest of the program:
SHIFT sub_json LEFT BY 10 PLACES CIRCULAR.
REPLACE SECTION OFFSET strlen( sub_json ) - 11 OF sub_json WITH ``.
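To make the fixed-offset idea concrete, here is a small Python sketch of the same string surgery. It is only an illustration, not ABAP: the sample string is a shortened stand-in for sub_json, and the helper name strip_result_wrapper is hypothetical.

```python
import json

# The wrapper produced by the ID transformation; it is 10 characters long,
# which is where the "10 leading characters" above comes from.
WRAPPER_PREFIX = '{"RESULT":'


def strip_result_wrapper(text):
    """Remove the leading {"RESULT": and the trailing } by slicing, without regex."""
    if not (text.startswith(WRAPPER_PREFIX) and text.endswith("}")):
        raise ValueError("unexpected JSON shape")
    return text[len(WRAPPER_PREFIX):-1]


sample = '{"RESULT":[{"MANDT":"220","AUFNR":"0000012"},{"MANDT":"220","AUFNR":"0000013"}]}'
inner = strip_result_wrapper(sample)
rows = json.loads(inner)  # the remainder is still valid JSON: a bare array
print(inner)
```

The guard at the top is a design choice for the sketch: slicing by fixed offsets silently corrupts the text if the wrapper ever has a different name or extra whitespace.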

Just a bit of manual work and voila!
DATA(writer) = CAST if_sxml_writer( cl_sxml_string_writer=>create( type = if_sxml=>co_xt_json ) ).
DATA(components) =
CAST cl_abap_structdescr( cl_abap_typedescr=>describe_by_name( iv_table_name ) )->components.
writer->open_element( name = 'object' ).
LOOP AT <itab> ASSIGNING FIELD-SYMBOL(<line>).
LOOP AT components ASSIGNING FIELD-SYMBOL(<fs_comp>).
ASSIGN COMPONENT <fs_comp>-name OF STRUCTURE <line> TO FIELD-SYMBOL(<fs_val>).
writer->open_element( name = 'str' ).
writer->write_attribute( name = 'name' value = CONV string( <fs_comp>-name ) ).
writer->write_value( CONV string( <fs_val> ) ).
writer->close_element( ).
ENDLOOP.
ENDLOOP.
writer->close_element( ).
DATA(xml_json) = CAST cl_sxml_string_writer( writer )->get_output( ).
sub_json = cl_abap_codepage=>convert_from( source = xml_json codepage = `UTF-8` ).
No surrounding list and no dictionary. If you want each line in a separate dictionary, it is easily adjustable.

If you use the ID call transformation, whatever node name you give in the transformation is added by default. We cannot skip this, but you can remove it in the following ways.
Replace: use a regex, or a literal string with a REPLACE FIRST OCCURRENCE statement, and then remove the last closing brace }, the way you did.
Find: you can simply use the statement below:
FIND REGEX '(\[.*\])' in sub_json SUBMATCHES sub_json.

Byte length of a string in xslt

Is there any XSLT function to retrieve the byte length of a string?
For example: i ♥ u
Character length obtained by string-length = 5
Byte length which I need = 7 bytes.

Assuming there is support for the EXPath binary module then you can use bin:length(bin:encode-string('i ♥ u')), as in
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:bin="http://expath.org/ns/binary">
<xsl:template name="main" match="/">
<xsl:value-of select="for $enc in ('UTF-8', 'UTF-16') return bin:length(bin:encode-string('i ♥ u', $enc))"/>
</xsl:template>
</xsl:transform>

You could also play some tricks with iri-to-uri().
Try this:
Apply iri-to-uri() to the string
Convert any %xx sequences in the result to a single ASCII character using the replace() function
The length of the resulting string is the number of bytes in the UTF-8 representation of the original string.
For example string-length(replace(iri-to-uri('§'), '%..', '%')) => 2
Also tested on your example.
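The percent-escaping trick can be checked outside XSLT. The Python sketch below is an analogue rather than the same function: it uses urllib.parse.quote in place of iri-to-uri(), and the helper name utf8_length_via_escaping is hypothetical.

```python
import re
from urllib.parse import quote


def utf8_length_via_escaping(text):
    """Count UTF-8 bytes by percent-encoding and collapsing each %xx to one character."""
    # safe="" makes quote() escape everything except unreserved ASCII characters,
    # so every escaped byte appears as exactly one %xx triple.
    escaped = quote(text, safe="", encoding="utf-8")
    collapsed = re.sub(r"%[0-9A-Fa-f]{2}", "%", escaped)
    return len(collapsed)


print(utf8_length_via_escaping("i ♥ u"))
print(utf8_length_via_escaping("§"))
```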

And here's another approach (again assuming UTF-8 encoding):
sum(for $c in string-to-codepoints($in)
return (1 + number($c>127) + number($c>2047) + number($c>65535)))
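The thresholds in that expression are the UTF-8 length boundaries: code points up to 127 take one byte, up to 2047 two, up to 65535 three, and anything above four. A minimal Python sketch of the same sum, with the hypothetical name utf8_length, makes it easy to compare against a real encoder:

```python
def utf8_length(text):
    """Sum the UTF-8 byte count of each code point using the same thresholds."""
    # Each comparison is a bool, which Python adds as 0 or 1,
    # just like number($c > 127) in the XPath expression.
    return sum(1 + (cp > 127) + (cp > 2047) + (cp > 65535) for cp in map(ord, text))


print(utf8_length("i ♥ u"))
print(len("i ♥ u".encode("utf-8")))
```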

How to iterate through DOM elements that match a css class using xpath?

I'm processing an HTML page with a variable number of p elements with a css class "myclass", using Python + Selenium RC.
When I try to select each node with this xpath:
//p[@class='myclass'][n]
(with n a natural number)
I get only the first p element with this css class for every n, unlike the situation if I iterate through selecting ALL p elements with:
//p[n]
Is there any way I can iterate through elements by css class using xpath?

XPath 1.0 doesn't provide an iterating construct.
Iteration can be performed on the selected node-set in the language that is hosting XPath.
Examples:
In XSLT 1.0:
<xsl:for-each select="someExpressionSelectingNodes">
<!-- Do something with the current node -->
</xsl:for-each>
In C#:
using System;
using System.IO;
using System.Xml;
public class Sample {
public static void Main() {
XmlDocument doc = new XmlDocument();
doc.Load("booksort.xml");
XmlNodeList nodeList;
XmlNode root = doc.DocumentElement;
nodeList=root.SelectNodes("descendant::book[author/last-name='Austen']");
//Change the price on the books.
foreach (XmlNode book in nodeList)
{
book.LastChild.InnerText="15.95";
}
Console.WriteLine("Display the modified XML document....");
doc.Save(Console.Out);
}
}
XPath 2.0 has its own iteration construct:
for $varname1 in someExpression1,
$varname2 in someExpression2,
. . . . . . . . . . .
$varnameN in someExpressionN
return
SomeExpressionUsingTheVarsAbove

Now that I look again at this question, I think the real problem is not in iterating, but in using //.
This is a FAQ:
//p[@class='myclass'][1]
selects every p element that has a class attribute with value "myclass" and that is the first such child of its parent. Therefore this expression may select many p elements, rather than only the first such p element in the document.
When we want to get the first p element in the document that satisfies the above predicate, one correct expression is:
(//p)[@class='myclass'][1]
Remember: The [] operator has a higher priority (precedence) than the // abbreviation.
Whenever you need to index the nodes selected by //, always put the expression to be indexed in parentheses.
Here is a demonstration:
<nums>
<a>
<n x="1"/>
<n x="2"/>
<n x="3"/>
<n x="4"/>
</a>
<b>
<n x="5"/>
<n x="6"/>
<n x="7"/>
<n x="8"/>
</b>
</nums>
The XPath expression:
//n[@x mod 2 = 0][1]
selects the following two nodes:
<n x="2" />
<n x="6" />
The XPath expression:
(//n)[@x mod 2 = 0][1]
selects exactly the first n element in the document with the wanted property:
<n x="2" />
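The difference between the two expressions can also be reproduced by hand. The Python sketch below does not evaluate XPath (xml.etree.ElementTree does not support the mod operator in predicates); it imitates the two evaluation orders on the same document, and the helper name is_even is hypothetical.

```python
import xml.etree.ElementTree as ET

NUMS = """<nums>
<a><n x="1"/><n x="2"/><n x="3"/><n x="4"/></a>
<b><n x="5"/><n x="6"/><n x="7"/><n x="8"/></b>
</nums>"""


def is_even(n):
    """Stand-in for the predicate [@x mod 2 = 0]."""
    return int(n.get("x")) % 2 == 0


root = ET.fromstring(NUMS)

# Like //n[@x mod 2 = 0][1]: the [1] is applied separately under each parent.
per_parent = []
for parent in root.iter():
    evens = [n for n in parent.findall("n") if is_even(n)]
    per_parent.extend(evens[:1])

# Like (//n)[@x mod 2 = 0][1]: collect all n in document order first, then index.
document_first = [n for n in root.iter("n") if is_even(n)][:1]

print([n.get("x") for n in per_parent])
print([n.get("x") for n in document_first])
```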
Try this first with the following transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="//n[@x mod 2 = 0][1]"/>
</xsl:template>
</xsl:stylesheet>
and the result is two nodes.
<n x="2" />
<n x="6" />
Now, change the XPath expression as below and try again:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select="(//n)[@x mod 2 = 0][1]"/>
</xsl:template>
</xsl:stylesheet>
and the result is what we really wanted -- the first such n element in the document:
<n x="2" />

Maybe all your p elements with this class are at the same level, so //p[@class='myclass'] gives you the list of paragraphs with the specified class. You can then iterate through it using indexes, i.e.
//p[@class='myclass'][1], //p[@class='myclass'][2],...,//p[@class='myclass'][last()]

I don't think you're using the "index" for its real purpose. The //p[selection][index] syntax is actually telling you which position the element must have among the matching p children of its parent. So //p[selection][1] is saying that your selected p must be the first matching p child of its parent, and //p[selection][2] is saying it must be the second. Depending on your HTML, it's likely this isn't what you want.
Given that you're using Selenium and Python, there's a couple ways to do what you want, and you can look at this question to see them (there are two options given there, one in selenium Javascript, the other using the server-side selenium calls).

Here's a C# code snippet that might help you out.
The key here is the Selenium function GetXpathCount(). It should return the number of occurrences of the Xpath expression you are looking for.
You can enter //p[@class='myclass'] in XPather or any other XPath analysis tool to verify that multiple results are returned. Then you just iterate through the results in your code.
In my case, it was all the list items in a UL that needed to be iterated (i.e. //li[@class='myclass']/ul/li), so based on your requirements it should be something like:
int numProductsInLeftNav = Convert.ToInt32(selenium.GetXpathCount("//p[@class='myclass']"));
List<string> productsInLeftNav = new List<string>();
for (int i = 1; i <= numProductsInLeftNav; i++) {
string productName = selenium.GetText("//p[@class='myclass'][" + i + "]");
productsInLeftNav.Add(productName);
}
}

SelectNodes and GetElementsByTagName

What are the main differences between SelectNodes and GetElementsByTagName?

SelectNodes is a .NET/MSXML-specific method that gets a list of matching nodes for an XPath expression. XPaths can select elements by tag name but can also do lots of other, more complicated selection rules.
getElementsByTagName is a DOM Level 1 Core standard method available in many languages (but spelled with a capital G in .NET). It selects elements only by tag name; you can't ask it to select elements with a certain attribute, or elements with tag name a inside other elements with tag name b or anything clever like that. It's older, simpler, and in some environments faster.

SelectNodes takes an XPath expression as a parameter and returns all nodes that match that expression.
GetElementsByTagName takes a tag name as a parameter and returns all tags that have that name.
SelectNodes is therefore more expressive, as you can write any GetElementsByTagName call as a SelectNodes call, but not the other way around. XPath is a very robust way of expressing sets of XML nodes, offering more ways of filtering than just name. XPath, for example, can filter by tag name, attribute names, inner content and various aggregate functions on tag children as well.

SelectNodes() is a Microsoft extension to the Document Object Model (DOM) (msdn).
SelectNodes, as mentioned by Welbog and others, takes an XPath expression. I would like to mention how it differs from GetElementsByTagName() when XML nodes need to be deleted.
The answer and code were provided by user chilberto on the MSDN forum:
The next test illustrates the difference by performing the same function (removing the person nodes) but by using the GetElementsByTagName() method to select the nodes. Though the same object type is returned, its construction is different. The result of SelectNodes() is a collection of references back to the XML document. That means we can remove from the document in a foreach without affecting the list of references. This is shown by the count of the nodelist not being affected. The result of GetElementsByTagName() is a collection that directly reflects the nodes in the document. That means as we remove the items from the parent, we actually affect the collection of nodes. This is why the nodelist cannot be manipulated in a foreach but had to be changed to a while loop.
.NET SelectNodes()
[TestMethod]
public void TestSelectNodesBehavior()
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<root>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>2</id>
<name>j</name>
</person>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>3</id>
<name>j</name>
</person>
<business></business>
</root>");
XmlNodeList nodeList = doc.SelectNodes("/root/person");
Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
foreach (XmlNode n in nodeList)
n.ParentNode.RemoveChild(n);
Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
}
.NET GetElementsByTagName()
[TestMethod]
public void TestGetElementsByTagNameBehavior()
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<root>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>2</id>
<name>j</name>
</person>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>3</id>
<name>j</name>
</person>
<business></business>
</root>");
XmlNodeList nodeList = doc.GetElementsByTagName("person");
Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
while (nodeList.Count > 0)
nodeList[0].ParentNode.RemoveChild(nodeList[0]);
Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
Assert.AreEqual(0, nodeList.Count, "All the nodes have been removed");
}
With SelectNodes() we get a collection of references to nodes of the XML document, and we can work with those references. If we delete a node, the change is visible in the XML document, but the collection of references stays the same (although a reference to a deleted node may no longer be usable and can lead to a System.NullReferenceException). I do not really know how this is implemented. I suppose that if we use XmlNodeList nodeList = GetElementsByTagName() and delete a node with nodeList[i].ParentNode.RemoveChild(nodeList[i]), the reference is also removed from the nodeList variable.
