I am dealing with some JSON-LD data in MarkLogic and have trouble using XPath on property names that contain the "@" symbol. For example:
{
"@type": "News",
"title": "some title",
"description": "some description"
}
My goal is to retrieve the title if the type is "News". I understand that "@" is reserved for attributes in XPath, so something like the following does not work:
doc.xpath('.[@type="News"]/title')
With the xdmp.encodeForNCName function, I see that the "@" symbol is represented as _40_ in the JSON representation. But this still doesn't work:
doc.xpath('.[_40_type="News"]/title')
While using fn:name() would work too, as suggested by the other answers, you can address nodes with funny spelling in MarkLogic XPath directly too. Probably a deviation from the official XPath standard itself, but MarkLogic allows writing expressions like:
doc.xpath('node("@type")[. eq "News"]/title')
This is also very useful for JSON properties containing spaces and the like.
HTH!
You could test the name() in a predicate:
doc.xpath('.[*[contains(name(), "@type")] = "News"]/title')
Here is the dirty solution.
.[@*[name() = '@type']][@*='News']/title
I know you are working with json, but I just checked the xpath in html with similar attribute and value combination. You can see the xpath considered both attribute name and value (as it's not selecting other node with the same name but different value).
I have a website on which I need to isolate XPath identifiers. The elements have an id like this: //*[@id="panel-detail-6163748c7952a-partnerCode"]
The issue is that the website changes the value 6163748c7952a on every page load.
Is there an XPath expression that can match on the first and last parts of that string? Some sort of wildcard, like //*[@id="panel-detail-*-partnerCode"]
This XPath 2.0 expression,
//*[matches(@id, "^panel-detail-.*-partnerCode$")]
or this XPath 1.0 expression,
//*[starts-with(@id, 'panel-detail-') and
substring(@id, string-length(@id) - string-length('-partnerCode') + 1)
= '-partnerCode']
will match all elements whose id attribute value starts and ends with the noted substrings.
See also
XPath testing that string ends with substring?
XPath has functions such as starts-with and ends-with. People often replace them with contains, which should be discouraged because it matches the substring anywhere in the value.
Please note that ends-with is only available from XPath 2.0 onwards.
XPath 1.0:
//*[starts-with(@id,'panel-detail-') and contains(@id, '-partnerCode')]
XPath 2.0:
//*[starts-with(@id,'panel-detail-') and ends-with(@id, '-partnerCode')]
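To see why the contains() fallback is looser than a real ends-with test, here is a minimal Python sketch of the string logic only (no XPath engine involved). The helper names `xpath_substring`, `ends_with_xpath1`, `matches_strict` and `matches_loose` are made up for illustration; the arithmetic mirrors the XPath 1.0 `substring(..., string-length(...) - string-length(...) + 1)` idiom.

```python
def xpath_substring(s: str, start: int) -> str:
    """Mimic XPath 1.0 substring(s, start) for an integer, 1-based start."""
    # XPath positions start at 1; a start below 1 returns the whole string.
    return s[max(start - 1, 0):]


def ends_with_xpath1(s: str, suffix: str) -> bool:
    """Same arithmetic as the XPath 1.0 ends-with emulation."""
    return xpath_substring(s, len(s) - len(suffix) + 1) == suffix


def matches_strict(value: str) -> bool:
    # starts-with(...) and <ends-with emulation>
    return value.startswith("panel-detail-") and ends_with_xpath1(value, "-partnerCode")


def matches_loose(value: str) -> bool:
    # starts-with(...) and contains(...)
    return value.startswith("panel-detail-") and "-partnerCode" in value


ids = [
    "panel-detail-6163748c7952a-partnerCode",
    "panel-detail-6163748c7952a-partnerCode-label",
]
for value in ids:
    print(value, matches_strict(value), matches_loose(value))
```

The second id passes the loose check but not the strict one, which is the case where contains() would select an unintended element.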
I am looking for a way to get text which is not inside an HTML element:
<div class="col-sm-4">
<strong>Handelnde Personen:</strong><br><br>
<strong>Geschäftsführer</strong><br>
Mr John Doe<br>
Privatperson<br>
.....<br>
<br>
I want to get "Mr John Doe".
The only way I see is looking for a strong element which contains "Geschäftsführer" and then look for the following text.
My idea so far:
//strong[contains(text(), 'Gesch')]/br/../text()
... I simply can't make it work.
Also, is there a "wildcard" for strings? That I could use
*esch*ftsf*hr*
for "Geschäftsführer"?
I highly appreciate your help, thanks!
Try
//strong[starts-with(., 'Gesch')]/following-sibling::text()[1]
As for wildcard matching, with XPath 2.0 you use regular expressions:
//strong[matches(., '.*esch.*ftsf.*hr.*')]
With XPath 3.0 you could also use the Unicode collation algorithm
//strong[compare(., 'Geschäftsführer',
'http://www.w3.org/2013/collation/UCA?strength=primary') = 0]
(strength=primary ignores case and accents)
But to get anything more advanced than XPath 1.0 in the browser, you would need to deploy Saxon-JS.
Another option with 1.0 is to use translate() to remove case and umlauts:
//strong[translate(., 'ABCD..XYZÄÖÜäöüß', 'abcd..xyzaouaous') = 'geschaftsfuhrer']
Note, in all these examples I have used "." rather than "text()" to get the string value of an element - this is recommended practice.
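If you want to check the wildcard regex or the translate() mapping outside an XPath engine, the same ideas can be sketched in plain Python. This is only an illustration of the string handling, assuming the text has already been extracted; `UMLAUT_MAP` and `normalize` are hypothetical names, and the explicit umlaut list stands in for the abbreviated 'ABCD..XYZ' alphabet in the XPath version.

```python
import re

# Character-for-character mapping, in the spirit of XPath translate().
UMLAUT_MAP = str.maketrans("ÄÖÜäöüß", "aouaous")


def normalize(text: str) -> str:
    """Replace umlauts with base letters, then lower-case the result."""
    return text.translate(UMLAUT_MAP).lower()


label = "Geschäftsführer"

# Equivalent of translate(., ...) = 'geschaftsfuhrer'
print(normalize(label) == "geschaftsfuhrer")

# Equivalent of matches(., '.*esch.*ftsf.*hr.*'); re.search needs no
# leading or trailing '.*' because it matches anywhere in the string.
print(re.search(r"esch.*ftsf.*hr", label) is not None)
```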
I'm new to Xpath and I'm trying to figure out how to extract the attribute of the extension with the value of D000001602 as shown below.
<ClinDoc>
<ComponentOf>
<encompassingEncounter>
<id root="2.16.840.1.113883.3.52.3" extension="D000001602"/>
<effectiveTime>
<low value="20140620135800"/>
<high value="20140701140756"/>
</effectiveTime>
</encompassingEncounter>
</ComponentOf>
</ClinDoc>
I am using an online extractor with the following code but I can't seem to get it to work:
/clindoc/componentof/encompassingEncounter/id[@root=2.16.840.1.113883.3.529.3]/@extension
//id[@root=2.16.840.1.113883.3.529.3]/@extension
Thanks much!
You're trying to match root with the string value 2.16.840.1.113883.3.52.3, which means you need to represent it as a string literal in the XPath.
Also, node names are case sensitive:
/ClinDoc/ComponentOf/encompassingEncounter/id[@root="2.16.840.1.113883.3.52.3"]/@extension
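Both points (quoting the value and matching the element names case-sensitively) can be checked with a small Python sketch using the standard library's ElementTree. It assumes the element is spelled ComponentOf, as in the opening tag of the sample, and uses a trimmed copy of the sample document. ElementTree only supports a subset of XPath and cannot select an attribute with /@extension, so the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (assumes consistent ComponentOf casing).
XML = """<ClinDoc>
  <ComponentOf>
    <encompassingEncounter>
      <id root="2.16.840.1.113883.3.52.3" extension="D000001602"/>
    </encompassingEncounter>
  </ComponentOf>
</ClinDoc>"""

root = ET.fromstring(XML)  # root is the <ClinDoc> element

# The attribute value is a quoted string literal inside the predicate.
node = root.find(
    "./ComponentOf/encompassingEncounter/id[@root='2.16.840.1.113883.3.52.3']"
)
extension = node.get("extension") if node is not None else None
print(extension)

# Element names are case sensitive: the lower-cased path finds nothing.
wrong_case = root.find("./componentof/encompassingEncounter/id")
print(wrong_case)
```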
I have bits like the following in an XML file that is a data source for an HTML page that uses CSS and javascript only. The special XML codes are my own, and I want to process them with javascript.
<listitem>regular text could be in here</listitem>
<listitem>possibly with <b>HTML markup</b></listitem>
<listitem>or <special>special xml</special></listitem>
What I dream of is a way to get from .getElementsByTagName("listitem") to the following array.
["regular text could be in here", "possibly with <b>HTML markup</b>", "or <special>special xml</special>"]
That way, I could process each listitem as part of the array. However, the XML parser breaks apart all the XML for each listitem. Other than using CDATA, which gets messy, is there another way?
I think the answer is:
Array.prototype.slice.call(document.getElementsByTagName("listitem")).map(function(x) {return x.innerHTML})
It will return:
["regular text could be in here", "possibly with <b>HTML markup</b>", "or <special>special xml</special>"]
C#: What is a good Regex to parse hyperlinks and their description?
Please consider case insensitivity, white-space and use of single quotes (instead of double quotes) around the HREF tag.
Please also consider obtaining hyperlinks which have other tags within the <a> tags such as <b> and <i>.
As long as there are no nested tags (and no line breaks), the following variant works well:
<a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a>
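As a quick way to try this pattern, here is a minimal sketch in Python rather than C# (the pattern itself is unchanged; only the surrounding code differs). `LINK_RE` and `find_links` are hypothetical names, and the case-insensitive flag covers the requirement about upper-case tags.

```python
import re

# Same pattern as above; group 1 is a double-quoted URL, group 2 a
# single-quoted URL, group 3 the link text.
LINK_RE = re.compile(
    r"""<a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a>""",
    re.IGNORECASE,
)


def find_links(html: str) -> list[tuple[str, str]]:
    """Return (url, text) pairs for every non-nested link on one line."""
    return [(m.group(1) or m.group(2), m.group(3)) for m in LINK_RE.finditer(html)]


sample = (
    '<a href="pages/index.php" title="the title">Text</a> '
    "<A HREF='other.html'>More</A>"
)
print(find_links(sample))
```

Note that the pattern requires href to be the first attribute after `<a`, so links with a leading class or id attribute are not matched.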
As soon as nested tags come into play, regular expressions are unfit for parsing. However, you can still use them by applying more advanced features of modern interpreters (depending on your regex machine). E.g. .NET regular expressions use a stack; I found this:
(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)
Source: http://weblogs.asp.net/scottcate/archive/2004/12/13/281955.aspx
See this example from StackOverflow: Regular expression for parsing links from a webpage?
Using The HTML Agility Pack you can parse the html, and extract details using the semantics of the HTML, instead of a broken regex.
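To illustrate the parser-based approach without depending on a specific .NET library, here is a rough sketch using Python's standard `html.parser` module. It is not the HTML Agility Pack API, just the same idea: let a parser handle case, quoting and nested tags. The class name `LinkCollector` is made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs, keeping text from nested tags like <b>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed("<p><A HREF='x.html' class=\"nav\"><b>Bold</b> <i>text</i></A></p>")
print(collector.links)
```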
I found this but apparently these guys had some problems with it.
Edit: (It works!)
I have now done my own testing and found that it works. I don't know C#, so I can't give you a C# answer, but I do know PHP, and here's the matches array I got back from running it on this:
Text
array(3) { [0]=> string(52) "Text" [1]=> string(15) "pages/index.php" [2]=> string(4) "Text" }
I have a regex that handles most cases, though I believe it does match HTML within a multiline comment.
It's written using the .NET syntax, but should be easily translatable.
Just going to throw this snippet out there now that I have it working. This is a less greedy version of one suggested earlier. The original wouldn't work if the input had multiple hyperlinks. The code below will allow you to loop through all the hyperlinks:
static Regex rHref = new Regex(@"<a.*?href=[""'](?<url>[^""^']+[.]*?)[""'].*?>(?<keywords>[^<]+[.]*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
public void ParseHyperlinks(string html)
{
MatchCollection mcHref = rHref.Matches(html);
foreach (Match m in mcHref)
AddKeywordLink(m.Groups["keywords"].Value, m.Groups["url"].Value);
}
Here is a regular expression that will match the balanced tags.
(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)