XML: How to recursively combine the text of identical elements in XQuery

How can I merge all the content of identical, repeated elements throughout a document using XQuery?
sample document:
<webMessage xmlns="http://www.website.gov.uk/CM/envelope">
<EnvelopeVersion>2.0</EnvelopeVersion>
<Header>
<MessageDetails>
<Class>Web-CT600</Class>
<Qualifier/>
<Function/>
</MessageDetails>
<SenderDetails>
<IDAuthentication>
<SenderID/>
<Authentication>
<Method/>
<Role/>
<Value/>
</Authentication>
</IDAuthentication>
</SenderDetails>
</Header>
<webTalkDetails>
<Keys>
<Key Type="UTR">2274792909</Key>
</Keys>
<ChannelRouting>
<Channel>
<URI/>
<Product/>
<Version/>
</Channel>
</ChannelRouting>
</webTalkDetails>
<Body>
<IRenvelope xmlns="http://www.website.gov.uk/taxation/CT/3">
<IRheader>
<Keys>
<Key Type="UTR">2274792909</Key>
</Keys>
<PeriodEnd/>
<DefaultCurrency/>
<IRmark Type="generic">n1uS2MiavBsb6YwL82MK</IRmark>
<Sender/>
</IRheader>
<CompanyReturn ReturnType="new">
<CompanyInformation>
<CompanyName/>
<RegistrationNumber/>
<Reference/>
<PeriodCovered>
<From>2013-01-07</From>
<To>2014-01-07</To>
</PeriodCovered>
</CompanyInformation>
<Turnover>
<Total>45893</Total>
</Turnover>
<CompanyCalculation>
<Income>
<TradingAndProfessional>
<Profits>95517</Profits>
<NetProfits>51276</NetProfits>
</TradingAndProfessional>
</Income>
</CompanyCalculation>
<AttachedFiles>
<Xsubmission>
<Accounts>
<Instance>
<EncodedInlineSubmission> TEXT I WANT TO JOIN</EncodedInlineSubmission>
</Instance>
</Accounts>
<Computations>
<Instance>
<EncodedInlineSubmission> MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>
</Instance>
</Computations>
</Xsubmission>
</AttachedFiles>
</CompanyTaxReturn>
</IRenvelope>
</Body>
So in this XML I want to combine the text of all the instances of EncodedInlineSubmission and put it into one single element, so it will read:
<EncodedInlineSubmission> TEXT I WANT TO JOIN MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>

Update: added an element constructor around the returned string.
You can use fn:string-join() to join a sequence of strings with a joiner string. You'll need to evaluate an XPath expression that selects all the nodes you want to join, and then retrieve their string values.
Here's an example:
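declare namespace env = "http://www.website.gov.uk/CM/envelope";
declare namespace ct = "http://www.website.gov.uk/taxation/CT/3";
(: EncodedInlineSubmission inherits the default namespace declared on IRenvelope :)
let $nodes := $doc/env:webMessage/env:Body//ct:EncodedInlineSubmission
return element EncodedInlineSubmission { fn:string-join($nodes/fn:string(), " ") }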
Notes:
Assume $doc is bound to the document-node of your sample document
you may need a different XPath expression
your sample is not well-formed
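For comparison, the same join can be sketched outside XQuery. This is a minimal Python sketch using only the standard library; the trimmed, well-formed SAMPLE document and the join_text helper are hypothetical stand-ins, not part of the original question. It assumes the target elements live in the taxation/CT/3 namespace, as they do in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the sample document (hypothetical).
SAMPLE = """<webMessage xmlns="http://www.website.gov.uk/CM/envelope">
  <Body>
    <IRenvelope xmlns="http://www.website.gov.uk/taxation/CT/3">
      <Accounts><Instance>
        <EncodedInlineSubmission>TEXT I WANT TO JOIN</EncodedInlineSubmission>
      </Instance></Accounts>
      <Computations><Instance>
        <EncodedInlineSubmission>MORE TEXT I WANT TO JOIN</EncodedInlineSubmission>
      </Instance></Computations>
    </IRenvelope>
  </Body>
</webMessage>"""

# Namespace that the EncodedInlineSubmission elements inherit from IRenvelope.
CT = "http://www.website.gov.uk/taxation/CT/3"


def join_text(xml_text, local_name, namespace, sep=" "):
    """Collect the text of every matching element and wrap it in one new element."""
    root = ET.fromstring(xml_text)
    tag = "{%s}%s" % (namespace, local_name)  # ElementTree's {uri}name notation
    parts = [(el.text or "").strip() for el in root.iter(tag)]
    merged = ET.Element(local_name)
    merged.text = sep.join(parts)
    return merged


merged = join_text(SAMPLE, "EncodedInlineSubmission", CT)
print(ET.tostring(merged, encoding="unicode"))
```

Like the XQuery version, this selects the nodes first and only then joins their string values, so the separator is a free choice.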

You can simply construct a new element from the concatenated values of all the elements, reusing the name of the first one:
for $x in //*:Xsubmission
let $encoded := $x//*:EncodedInlineSubmission
return element {$encoded[1]/local-name()} {string-join($encoded)}


How to create nodes with variable labels in cypher?

I am using JSON APOC plugin to create nodes from a JSON with lists in it, and I am trying to create nodes whose label is listed as an element in the list:
{
"pdf":[
{
"docID": "docid1",
"docLink": "/examplelink.pdf",
"docType": "PDF"
}
],
"jpeg":[
{
"docID": "docid20",
"docLink": "/examplelink20.pdf",
"docType": "JPEG"
}
],
...,}
And I want to both iterate through the doctypes (pdf, jpeg) and set the label as the docType property in the list. Right now I have to do separate blocks for each doctype list (jpeg: [], pdf:[]):
WITH "file:////input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.pdf as doc
MERGE (d:PDF {docID: doc.docID})
I'd like to loop through the doctype lists, creating the node for each doctype with the label as either the list name (pdf) or the node's docType name (PDF). Something like:
WITH "file:////input.json" AS url
CALL apoc.load.json(url) YIELD value
for each doctypelist in value
for each doc in doctype list
MERGE(d:doc.docType {docID: doc.docID})
Or
WITH "file:////input.json" AS url
CALL apoc.load.json(url) YIELD value
for each doctypelist in value
for each doc in doctype list
MERGE(d {docID: doc.docID})
ON CREATE SET d :doc.docType
Cypher currently does not support this. To set a label, you must hardcode it into the Cypher. You could do filters, or multiple matches to do this in a tedious way, but if you aren't allowed to install any plug-ins to your Neo4j db, I would recommend either just putting an index on the type, or use a node+relation instead of the label. (There are a lot of valid doc types, so if you have to support them all, pure Cypher will make it very painful.)
With APOC, however, there is a procedure specifically for this: apoc.create.addLabels
CREATE (:Movie {title: 'A Few Good Men', genre: 'Drama'});
MATCH (n:Movie)
CALL apoc.create.addLabels( id(n), [ n.genre ] ) YIELD node
REMOVE node.genre
RETURN node;

How to send MarkDown to API

I'm trying to send some Markdown text to a REST API. I just figured out that line breaks are not accepted in JSON.
Example. How to send this to my api:
An h1 header
============
Paragraphs are separated by a blank line.
2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists
look like:
* this one
* that one
* the other one
Note that --- not considering the asterisk --- the actual text
content starts at 4-columns in.
> Block quotes are
> written like so.
>
> They can span multiple paragraphs,
> if you like.
Use 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it's all
in chapters 12--14"). Three dots ... will be converted to an ellipsis.
Unicode is supported. ☺
as
{
"body" : " (the markdown) ",
}
Since you're sending it to a REST API endpoint, I'll assume you're looking for a way to do it in JavaScript (you didn't specify what tech you were using).
Rule of thumb: unless your goal is to write your own JSON encoder, use one that already exists.
And JavaScript ships with its own JSON tools (see the documentation for the built-in JSON object).
As the documentation shows, you can use the JSON.stringify function to convert a value, such as an object with a string property, to a JSON-compliant encoded string that can later be decoded on the server side.
This example illustrates how to do so:
var arr = {
text: "This is some text"
};
var json_string = JSON.stringify(arr);
// Result is:
// '{"text":"This is some text"}'
// Now the json_string contains a json-compliant encoded string.
You can also decode JSON client-side in JavaScript with the companion JSON.parse() method (see its documentation):
var json_string = '{"text":"This is some text"}';
var arr = JSON.parse(json_string);
// Now arr holds an object whose key "text" maps to the value
// "This is some text"
If that doesn't answer your question, please edit it to make it more precise, especially on what tech you're using. I'll edit this answer accordingly
You need to replace the line endings with \n and then pass the result in your body key.
Also, make sure you escape double quotes (") as \", or else your body string will end there.
# An h1 header\n============\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. *Italic*, **bold**, and `monospace`. Itemized lists\nlook like:\n\n * this one\n * that one\n * the other one\n\nNote that --- not considering the asterisk --- the actual text\ncontent starts at 4-columns in.\n\n> Block quotes are\n> written like so.\n>\n> They can span multiple paragraphs,\n> if you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., \"it's all\nin chapters 12--14\"). Three dots ... will be converted to an ellipsis.\nUnicode is supported.
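Both answers come down to letting a JSON encoder do the escaping rather than doing it by hand. Here is a minimal sketch in Python (an assumption, since the question names no language), using only the standard json module. The markdown string is a shortened version of the sample, and the "body" key is taken from the question.

```python
import json

# Shortened version of the Markdown sample; it contains real newlines
# and double quotes, the two things that break hand-written JSON.
markdown = """An h1 header
============

Paragraphs are separated by a blank line.

Use 2 dashes for ranges (ex., "it's all in chapters 12--14")."""

# json.dumps escapes newlines as \n and double quotes as \" automatically.
payload = json.dumps({"body": markdown})
print(payload)

# Decoding on the receiving side restores the original text exactly.
restored = json.loads(payload)["body"]
print(restored == markdown)
```

The encoded payload is a single line, so it can be sent as the request body as is.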

How to return single string value of XPath expression?

This is my HTML:
<?xml version="1.0" encoding="UTF-8"?>
<div class="single-main">
<h3 class="description-area">Description</h3>
<p>bla bla bla
<br/> some text
<br/> some text here ,
<br/> other text here
</p>
</div>
I want to get the whole text but in one XPath expression.
This is my code:
response.xpath(".//h3[@class='description-area']/following-sibling::p//text()[count(preceding-sibling::br) >= 0]").extract()[0]
but it returns just the text before the first br. I know why: I am using .extract()[0], and with .extract()[1], [2], and so on I would get the rest. But I must use .extract()[0] because the platform I'm on does just that. Is there an XPath expression that returns the whole text as one string rather than as multiple strings?
string(/) will return the string value of the whole document.
Update: The XPath below returns four separate text nodes:
.//h3[@class='description-area']/following-sibling::p//text()[count(preceding-sibling::br) >= 0]
Wrapping it in string() does yield a single string, but note that in XPath 1.0 string() applied to a node-set takes only the first node in document order, so this returns just the text before the first br:
string(.//h3[@class='description-area']/following-sibling::p//text()[count(preceding-sibling::br) >= 0])
Update 2: But the br and text() maneuvers aren't necessary. You can simply get the string value of the p:
string(.//h3[@class='description-area']/following-sibling::p)
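To see what "the string value of the p" amounts to, here is a rough Python sketch using the standard library. ElementTree has no string() function, so itertext() is used as a stand-in: it yields every text node under the element in document order, and concatenating them matches how XPath defines an element's string value. With Scrapy itself you would pass the string(...) expression to response.xpath instead; the HTML constant below is the sample from the question.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="single-main">
<h3 class="description-area">Description</h3>
<p>bla bla bla
<br/> some text
<br/> some text here ,
<br/> other text here
</p>
</div>"""

root = ET.fromstring(HTML)
p = root.find("p")

# The four text nodes that the original //text() expression selects one by one.
pieces = list(p.itertext())

# Concatenating all descendant text nodes gives the element's string value.
text = "".join(pieces)
print(len(pieces))
print(repr(text))
```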

Using Perl LibXML to read textContent that contains html tags

If I have the following XML:
<File id="MyTestApp/app/src/main/res/values/strings.xml">
<Identifier id="page_title" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">My First App</EngTranslation>
<Description index="0">Home page title</Description>
<LangTranslation index="0">My First App</LangTranslation>
</Identifier>
<Identifier id="count" isArray="0" isPlural="0">
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have <b>%1$d</b> view(s)</EngTranslation>
<Description index="0">Number of page views</Description>
<LangTranslation index="0">You have <b>%1$d</b> view(s)</LangTranslation>
</Identifier>
</File>
I'm trying to read the 'EngTranslation' text value, and want to return the full value including any HTML tags. For example, I have the following:
use XML::LibXML;
use Encode qw(encode);
my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("test.xml") or die;
foreach my $file ($dom->findnodes('/File')) {
print $file->getAttribute("id")."\n";
foreach my $identifier ($file->findnodes('./Identifier')) {
print $identifier->getAttribute("id")."\n";
print encode('UTF-8',$identifier->findnodes('./EngTranslation')->get_node(1)->textContent."\n");
print encode('UTF-8',$identifier->findnodes('./Description')->get_node(1)->textContent."\n");
print encode('UTF-8',$identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n");
}
}
The output I get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have %1$d view(s)
Number of page views
You have %1$d view(s)
What I'm hoping to get is:
MyTestApp/app/src/main/res/values/strings.xml
page_title
My First App
Home page title
My First App
count
You have <b>%1$d</b> view(s)
Number of page views
You have <b>%1$d</b> view(s)
I'm just using this as an example for a more complicated situation, hopefully it makes sense.
Thanks!
Here's a solution that relies on monkey patching, but it works:
sub XML::LibXML::Node::innerXML{
my ($self) = shift;
join '', $self->childNodes();
}
…
say $identifier->findnodes('./Description')->get_node(1)->innerXML;
Oh, and if the encoding becomes a problem, use the toString method; its optional arguments control formatting and encoding. (I did use open, but there were no out-of-range characters in the XML.)
If you don't like the monkey patching, you can change the sub to a normal one and supply the argument, like this:
sub myInnerXML{
my ($self) = shift;
join '', map{$_->toString(1)} $self->childNodes();
}
…
say myInnerXML($identifier->findnodes('./Description')->get_node(1));
In your source XML, you either need to encode the tags as entities or wrap that content in a CDATA section.
One problem with embedding HTML in XML is that HTML is not necessarily 'well formed'. For example the <br> tag and the <img> tag are not usually followed by matching closing tags and without the closing tags, it would not be valid in an XML document unless you XML-escape the whole string of HTML, e.g.:
<EngTranslation eng_indx="0" goesWith="-1" index="0">You have &lt;b&gt;%1$d&lt;/b&gt; view(s)</EngTranslation>
Or use a CDATA section:
<EngTranslation eng_indx="0" goesWith="-1" index="0"><![CDATA[You have <b>%1$d</b> view(s)]]></EngTranslation>
However, if you restrict your HTML to always be well-formed, you can achieve what you want with the toString() method.
If you called toString() on the <EngTranslation> element node, the output would include the <EngTranslation>...</EngTranslation> wrapper tags. So instead, you would need to call toString() on each of the child nodes and concatenate the results together:
binmode(STDOUT, ':utf8');
foreach my $file ($dom->findnodes('/File')) {
print $file->getAttribute("id")."\n";
foreach my $identifier ($file->findnodes('./Identifier')) {
print $identifier->getAttribute("id")."\n";
my $html = join '', map { $_->toString }
$identifier->findnodes('./EngTranslation')->get_node(1)->childNodes;
print $html."\n";
print $identifier->findnodes('./Description')->get_node(1)->textContent."\n";
print $identifier->findnodes('./LangTranslation')->get_node(1)->textContent."\n";
}
}
Note I took the liberty of using binmode to set UTF8 encoding on the output filehandle so it was not necessary to call encode for every print.
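The "serialize each child node and concatenate" idea is not specific to Perl. As a rough illustration, here is the same technique sketched in Python with the standard library; the inner_xml helper is hypothetical, and the attribute list on the sample element has been shortened.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the element's content with child markup, without the wrapper tags."""
    # Leading text first, then every child serialized with its tags.
    # ET.tostring() includes each child's trailing text, so nothing is lost.
    return (element.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in element
    )


identifier = ET.fromstring(
    '<Identifier id="count">'
    '<EngTranslation index="0">You have <b>%1$d</b> view(s)</EngTranslation>'
    "</Identifier>"
)
eng = identifier.find("EngTranslation")

with_tags = inner_xml(eng)          # markup preserved
text_only = "".join(eng.itertext())  # roughly what textContent returns
print(with_tags)
print(text_only)
```

As in the Perl answers, this only works because the embedded HTML is well-formed XML.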

SelectNodes and GetElementsByTagName

What are the main differences between SelectNodes and GetElementsByTagName?
SelectNodes is a .NET/MSXML-specific method that gets a list of matching nodes for an XPath expression. XPaths can select elements by tag name but can also do lots of other, more complicated selection rules.
getElementsByTagName is a DOM Level 1 Core standard method available in many languages (but spelled with a capital G in .NET). It selects elements only by tag name; you can't ask it to select elements with a certain attribute, or elements with tag name a inside other elements with tag name b, or anything clever like that. It's older, simpler, and in some environments faster.
SelectNodes takes an XPath expression as a parameter and returns all nodes that match that expression.
GetElementsByTagName takes a tag name as a parameter and returns all tags that have that name.
SelectNodes is therefore more expressive, as you can write any GetElementsByTagName call as a SelectNodes call, but not the other way around. XPath is a very robust way of expressing sets of XML nodes, offering more ways of filtering than just name. XPath, for example, can filter by tag name, attribute names, inner content and various aggregate functions on tag children as well.
SelectNodes() is a Microsoft extension to the Document Object Model (DOM) (msdn).
SelectNodes, as mentioned by Welbog and others, takes an XPath expression. I would like to point out how it differs from GetElementsByTagName() when XML nodes need to be deleted.
Answer and code provided by user chilberto on the MSDN forum:
The next test illustrates the difference by performing the same function (removing the person nodes) but using the GetElementsByTagName() method to select the nodes. Though the same object type is returned, its construction is different. The result of SelectNodes() is a collection of references back to the XML document. That means we can remove nodes from the document in a foreach without affecting the list of references, as shown by the count of the node list staying the same. The result of GetElementsByTagName() is a collection that directly reflects the nodes in the document. That means that as we remove items from the parent, we actually affect the collection of nodes. This is why the node list cannot be manipulated in a foreach and the loop had to be changed to a while loop.
.NET SelectNodes()
[TestMethod]
public void TestSelectNodesBehavior()
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<root>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>2</id>
<name>j</name>
</person>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>3</id>
<name>j</name>
</person>
<business></business>
</root>");
XmlNodeList nodeList = doc.SelectNodes("/root/person");
Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
foreach (XmlNode n in nodeList)
n.ParentNode.RemoveChild(n);
Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
}
.NET GetElementsByTagName()
[TestMethod]
public void TestGetElementsByTagNameBehavior()
{
XmlDocument doc = new XmlDocument();
doc.LoadXml(@"<root>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>2</id>
<name>j</name>
</person>
<person>
<id>1</id>
<name>j</name>
</person>
<person>
<id>3</id>
<name>j</name>
</person>
<business></business>
</root>");
XmlNodeList nodeList = doc.GetElementsByTagName("person");
Assert.AreEqual(5, doc.FirstChild.ChildNodes.Count, "There should have been a total of 5 nodes: 4 person nodes and 1 business node");
Assert.AreEqual(4, nodeList.Count, "There should have been a total of 4 nodes");
while (nodeList.Count > 0)
nodeList[0].ParentNode.RemoveChild(nodeList[0]);
Assert.AreEqual(1, doc.FirstChild.ChildNodes.Count, "There should have been only 1 business node left in the document");
Assert.AreEqual(0, nodeList.Count, "All the nodes have been removed");
}
With SelectNodes() we get a collection of references to nodes in the XML document, and we can work with those references. If we delete a node, the change is visible in the document, but the collection of references stays the same (although the reference to a deleted node may now lead to null and a System.NullReferenceException). I do not really know how this is implemented. With XmlNodeList nodeList = GetElementsByTagName(), I suppose that deleting a node with nodeList[i].ParentNode.RemoveChild(nodeList[i]) also frees the corresponding reference in the nodeList variable.