XQuery with XPath and HTML

I have to produce an XQuery file that uses text normalization and the lower-case XPath function. (This page includes the schema that my query must process: http://www.natcorp.ox.ac.uk/docs/URG/bnctags.html) The result must go in a table, and I do not know how to produce an HTML table with XQuery, nor how to combine XQuery with XPath.
(This is the specification of my exercise.)
Produce a .xquery file containing an XQuery FLWOR expression which returns all the occurrences of the word 'has' in the collection of files, together with the word which comes next in the sentence in each case. The resulting list should be formatted as an HTML table, with each row containing the two words in their own cells, e.g.:
Target Successor
has there
has n't
has n't
... ...

Try something like the following; the main thing is to figure out the proper pattern match. Note that the pattern needs two capture groups, one for the target and one for its successor, because the replacements refer to both $1 and $2.
let $seek := "has",
    $pattern := concat("^.*?(", $seek, ")\s+(\S+).*$")
return
  <table>
    <tr><th>Target</th><th>Successor</th></tr>
    {
      for $item in //text()[contains(., $seek)]
      let $match := string($item)
      return
        <tr>
          <td>{replace($match, $pattern, "$1", "s")}</td>
          <td>{replace($match, $pattern, "$2", "s")}</td>
        </tr>
    }
  </table>

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected in all the input that is possible.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
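For anyone who wants to check the idea outside Notepad++, here is a minimal Python sketch of the same alternation. Python's re module has no conditional replacement syntax, so a callback function stands in for (?1:$0\n); the names keep_foo_blocks and sample are made up for this illustration.

```python
import re

# Same alternation as above: either a {...foo...} block,
# or any single character captured in group 1.
PATTERN = re.compile(r"\{[^}]*foo[^}]*\}|((?s:.))")

def keep_foo_blocks(text: str) -> str:
    """Drop everything except {...foo...} blocks, one block per line."""
    def repl(m: re.Match) -> str:
        # Group 1 matched: this is a character to discard.
        # Otherwise keep the whole match and append a newline.
        return "" if m.group(1) is not None else m.group(0) + "\n"
    return PATTERN.sub(repl, text)

# Hypothetical input, copied from the simplified example in the question.
sample = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>"""

result = keep_foo_blocks(sample)
print(result)
```

The block with "bar" is consumed one character at a time by the second alternative, which is why it disappears from the result.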
See the demo and search and replace dialog settings:
Updates
The comment section was full, so I am suggesting the code here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

How do I get rid of the tags in XPath

I have a bunch of html files with tons of data in it and I want to extract the important parts of it.
The files are all very similar; I have to search for a <tr> which contains a certain keyword. The third column of this table row always contains the name of the "block" I'm searching for (it's a few table rows).
//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]
with this XPath query I get the names (maybe one, maybe more)
The problem is, how do I get rid of the tags around the data?
Right now my output is something like this:
<span class="log_entry_text">Name1</span><span class="log_entry_text">Name2</span><span class="log_entry_text">Name3</span>
I want to have something like that: Name1 Name2 Name3
So I can use it for extracting these blocks more easily.
With string() I can only extract the first element (the result would be: Name1).
Thanks for helping me!
Just wrap your XPath in the data() function, like data(//body/table/tbody/tr[td = "Deployed to"]/td[3]/div//span[text()]), to retrieve the text.
Your XPath expression asks to retrieve span elements and that's what it has returned. If you're seeing tags with angle brackets in the output, that's because of the way the XPath result is being processed and rendered by the receiving application.
If you're in XPath 2.0+ or XQuery 1.0+ you can combine the several span elements into a single string using
string-join(//path/span, ' ')
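If your tool only offers XPath 1.0, the same joining can be done in the host language. A rough Python sketch using the standard library's ElementTree (which supports only a subset of XPath) is below; the sample markup is a guess at the structure around the spans, not your real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the output shown in the question.
html = (
    "<td><div>"
    '<span class="log_entry_text">Name1</span>'
    '<span class="log_entry_text">Name2</span>'
    '<span class="log_entry_text">Name3</span>'
    "</div></td>"
)

cell = ET.fromstring(html)

# Select the span elements, then join their text content with spaces,
# which is what string-join(//path/span, ' ') does in XPath 2.0.
spans = cell.findall(".//span")
names = " ".join("".join(span.itertext()) for span in spans)
print(names)
```

Real-world HTML is often not well-formed XML, so an HTML-aware parser may be needed before this works on the actual files.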

HtmlAgilityPack Wildcard Search in Powershell

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to shorten the above SelectSingleNode twice to something like this
$contactsBlock.SelectSingleNode(".//table/*/table") and skip the in-between.
Is there a way to wild-card like this?
An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. It also doesn't match the structure suggested by the XPath expression in your question.
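The difference between the two expressions can be checked on a small document. The sketch below uses Python's ElementTree as a stand-in for HtmlAgilityPack (an assumption made only so the example is easy to run; ElementTree implements a subset of XPath), and the id attributes are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the inner table sits several levels below the outer one,
# roughly like the path in the question (table/tr/td/div/div/table).
doc = ET.fromstring("""
<div>
  <table id="outer">
    <tr><td><div><div>
      <table id="inner"/>
    </div></div></td></tr>
  </table>
</div>
""")

# '//' between the two steps allows any number of intermediate levels.
nested = [t.get("id") for t in doc.findall(".//table//table")]

# '/*/' allows exactly one intermediate element, so nothing matches here.
grandchildren = [t.get("id") for t in doc.findall(".//table/*/table")]

print(nested, grandchildren)
```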

Finding a string that is split by multiple html tags

I am using XPath to find a list of strings in an HTML document. The strings appear when you type into a text box, to suggest possible results; in other words, it's auto-complete. The problem is that I'm trying to retrieve the whole list of auto-complete suggestions, but the results are all split up by <strong> tags.
To give a couple examples: I start typing "str" and the HTML will look like this:
<strong>str</strong>ing
But it gets better! If I don't type anything at all, every single character in the auto-complete results will be interrupted with opening and closing strong tags. Like so:
s
<strong></strong>
t
<strong></strong>
r
<strong></strong>
i
<strong></strong>
n
<strong></strong>
g
So, my question is, how do I construct an xpath that retrieves this string, but omits the strong tags?
For reference, the hierarchy of the HTML looks like this:
-div
--ul
---li
----(string I'm looking for)
---li
----(another string I'm looking for)
So my XPath at this point is: //div[@class='class']/ul/li/text(), which will get me the individual parts of the strings.
This XPath expression:
string(PathToYourDiv/ul/li[$n])
evaluates to the string value of the $n-th li child of the ul that is a child of YourDiv. This is the concatenation of all the text-node descendants of this li element, which effectively gives you the complete string you want.
You just have to substitute YourDiv and $n with specific expressions.
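To see why the string value gives the complete word, here is a small Python sketch. ElementTree has no string() function, so "".join(li.itertext()) is used as an equivalent that concatenates all descendant text nodes; the fragment is a made-up reconstruction of the markup described in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one suggestion with a highlighted prefix,
# one where every character is separated by an empty <strong> element.
fragment = ET.fromstring(
    '<div class="class"><ul>'
    "<li><strong>str</strong>ing</li>"
    "<li>s<strong></strong>t<strong></strong>r<strong></strong>"
    "i<strong></strong>n<strong></strong>g</li>"
    "</ul></div>"
)

# itertext() walks every text node under the li in document order,
# so the <strong> boundaries disappear from the result.
suggestions = ["".join(li.itertext()) for li in fragment.findall("./ul/li")]
print(suggestions)
```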
Do not use the // abbreviation, because:
Its evaluation can be very slow.
Indexing such an expression with [] is not intuitive and produces surprising results; this comes up often enough to be a FAQ.
That is much less code on the question than people would like to see around here.
But why don't you try a variant like this:
//div[@class='class']/ul/li/strong/text()

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
    for each y_match in x_match.group(1).value.match_all(y_re)
        print y_match.group(1).value
    next y_match
next x_match
Pseudo-output:
a
b
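In case a concrete version helps, the pseudo-code above could look roughly like this in Python. The re.DOTALL flag is my addition so that "..." may span line breaks, and the same caveat applies: this assumes non-nested, well-formed input.

```python
import re

text = "<x>...<y>a</y>...<y>b</y>...</x>"

# Outer pattern: the contents of each <x> element (group 1).
x_re = re.compile(r"<x[^>]*>(.*?)</x>", re.DOTALL)
# Inner pattern: the contents of each <y> element (group 1).
y_re = re.compile(r"<y[^>]*>(.*?)</y>", re.DOTALL)

# Loop over every <x>, then over every <y> inside that <x>.
values = [
    y_match.group(1)
    for x_match in x_re.finditer(text)
    for y_match in y_re.finditer(x_match.group(1))
]
print(values)
```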
Further clarification in the comments revealed that there can be an arbitrary number of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
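import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample input: only the <y> elements inside <x> should be reported.
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find()) {
    System.out.println(m.group(1));
}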
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.