Get 2 separate xpath values from one span with a line break - html

I've got my HTML which looks like this:
<span>
Word 1
Sentence 1
</span>
I can extract it with:
//span/text()
which gives me
Word 1
Sentence 1
Is it possible in XPath to extract Word 1 and Sentence 1 separately?
(XPath extractor in Python for Scrapy)
I've tried:
//span/text()[1]
//span/text()[2]
substring-before(//span/text(),'\n')
but these were wild guesses and none of them worked.

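You can get the first item "Word 1" with
normalize-space(substring-before(substring-after(translate(span/text(),'&#13;',''),'&#10;'),'&#10;'))
(Here &#13; is the carriage return and &#10; is the line feed, written as XML character references. When the expression is passed as a string from a host language such as Python, put the actual characters in the string instead.)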
and get the second item "Sentence 1" with
normalize-space(substring-after(substring-after(translate(span/text(),'&#13;',''),'&#10;'),'&#10;'))
You can remove the normalize-space(...) if you don't need it.
The context node should be the parent of span; otherwise, prefix the expression with //. Your main problem was that there is a line feed (\n) before the first item.
EDIT:
I added a solution for handling the CR char for Windows' CRLF. It simply removes the CR char and acts on the LF char.

See a previous question to understand how to properly access the inner content of the element.
Then, process the output string to fit your needs.
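As a minimal sketch of that post-processing step in Python (standard library only): the markup string below is a stand-in for the question's HTML, and in Scrapy the raw text would presumably come from something like response.xpath('//span/text()').get() rather than from ElementTree.

```python
import xml.etree.ElementTree as ET

# Stand-in for the question's markup; Windows line endings included on purpose.
markup = "<span>\r\nWord 1\r\nSentence 1\r\n</span>"

# The text content of the span, comparable to what //span/text() returns.
raw_text = ET.fromstring(markup).text

# splitlines() copes with LF and CRLF alike; empty lines are dropped.
word, sentence = [line.strip() for line in raw_text.splitlines() if line.strip()]

print(word)
print(sentence)
```

Splitting in the host language avoids having to account for the leading line feed inside the XPath expression itself.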

Related

RegEx replace only occurrences outside of <h> html tags

I would like to regex replace Plus in the below text, but only when it's not wrapped in a header tag:
<h4 class="Somethingsomething" id="something">Plus plan</h4>The <b>Plus</b> plan starts at $14 per person per month and comes with everything from Basic.
In the above I would like to replace the second "Plus" but not the first.
My regex attempt so far is:
(?!<h\d*>)\bPlus\b(?!<\\h>)
Meaning:
Do not capture the following if it is inside a <h + 1 digit and 0 or more characters and ends with a closing <\h>
Capture only if the group "Plus" is surrounded by spaces or white space
However - this captures both occurrences. Can someone point out my mistake and correct this?
I want to use this in VBA but should be a general regex question, as far as I understand.
Somewhat related but not addressing my problem in regex
Not relevant, as not RegEx
You can use
\bPlus\b(?![^>]*<\/h\d+>)
See the regex demo. To use the match inside the replacement pattern, use the $& backreference in your VBA code.
Details:
\bPlus\b - a whole word Plus
(?![^>]*<\/h\d+>) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[^>]* - zero or more chars other than >
<\/h - </h string
\d+ - one or more digits
> - a > char.
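The pattern is not specific to VBA. As a quick check, here is a small Python sketch that applies it to the sample text from the question; the replacement word "Premium" is only a placeholder chosen for illustration.

```python
import re

text = ('<h4 class="Somethingsomething" id="something">Plus plan</h4>'
        'The <b>Plus</b> plan starts at $14 per person per month '
        'and comes with everything from Basic.')

# Whole word "Plus" that is not followed by "...</hN>" without a ">" in between.
outside_heading = re.compile(r'\bPlus\b(?![^>]*</h\d+>)')

# "Premium" is a hypothetical replacement, not taken from the question.
result = outside_heading.sub('Premium', text)
print(result)
```

The occurrence inside the h4 element is left alone because only text sits between it and the closing </h4>; the occurrence inside <b> is replaced because the next > it meets closes a </b> tag.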

Split a paragraph to separate one or two digit numbers with PowerShell

I'm attempting to parse and format some text from an HTML file into Word. I'm doing this by capturing each paragraph into an array and then writing it into the word document one paragraph at a time. However, there are superscripted references sprinkled throughout the text. I'm looking for a way to superscript these references in the new Word file and thought I would use regex and split to make this work. Here is an example paragraph:
$p = "This is an example sentence.1 The number is a reference note that should be superscripted and can be one or two digits long."
Here is the code I tried to split and select the digit(s):
[regex]::Split($p,"(\d{1,2})")
This works for single and double digits. However, if there are more than two digits, it still splits it, but moves the extra numbers to the next line. Like so:
This is an example sentence.
10
0
The number is a reference note that should be superscripted and can be one or two digits long.
This is important because there are sometimes larger numbers (3-10 digits) in the text that I don't want to split on. My goal is to take a block of text with reference note numbers and separate out the notes so I can perform formatting functions on them when I write it out to the Word file. Something like this (untested):
$paragraphs | % {
    $a = @([regex]::Split($_,"(\d{1,2})"))
    $a | % {
        $text = $_
        if ($text -match "(\d{1,2})")
        {
            $objSelection.Font.SuperScript = 1
            $objSelection.TypeText("$text")
            $objSelection.Font.SuperScript = 0
        }
        Else
        {
            $objSelection.Style="Normal"
            $objSelection.TypeText("$text")
        }
    }
    $text = "`v"
    $objSelection.TypeText("$text")
    $objSelection.TypeParagraph()
}
EDIT:
The following regex expression works when I test it with the above loop in its own script:
"(?<![\d\s])(\d{1,2})(?!\d)"
However, when I run it in the parent script, I get the following error:
Cannot find an overload for "Split" and the argument count: "2"
$a = [regex]::Split($_,"(?<![\d\s])(\d{1,2})(?!\d)")
How would I go about troubleshooting this error?
You may use
[regex]::Split($p,"(?<![\d\s])(\d{1,2})(?!\d)\s*")
It only matches and captures one or two digits that are neither followed nor preceded with another digit, and not preceded with any whitespace char. Any trailing whitespace is matched with \s* and is thus removed from the items that are added into the resulting array.
See this regex demo:
Details
(?<![\d\s]) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is a digit or a whitespace
(\d{1,2}) - Group 1: one or two digits
(?!\d) - that cannot be followed with another digit (it is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location)
\s* - 0+ whitespaces.
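The same pattern can be tried outside PowerShell. The following Python sketch uses re.split, which also keeps captured groups in the result; the second and third sample strings are invented here to show the cases the lookarounds are meant to exclude.

```python
import re

# One or two digits, not preceded by a digit or whitespace, not followed by a digit.
NOTE_PATTERN = re.compile(r'(?<![\d\s])(\d{1,2})(?!\d)\s*')

p = ("This is an example sentence.1 The number is a reference note that "
     "should be superscripted and can be one or two digits long.")
parts = NOTE_PATTERN.split(p)

# Hypothetical extra inputs: a three-digit number and a number after a space.
long_number = NOTE_PATTERN.split("This is an example sentence.100 Not a note.")
spaced_number = NOTE_PATTERN.split("It costs 25 dollars.")

print(parts)
print(long_number)
print(spaced_number)
```

Only the first string is split; the other two come back as single-element lists because no position in them satisfies both lookarounds.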

How to use XPath contains() for specific text?

Say we have an HTML table which basically looks like this:
2|1|28|9|
3|8|5|10|
18|9|8|0|
I want to select the cells which contain only 8 and nothing else, that is, only 2nd cell of row2 and 3rd cell of row3.
This is what I tried: //table//td[contains(.,'8')]. It gives me all cells which contain 8. So, I get unwanted values 28 and 18 as well.
How do I fix this?
EDIT: Here is a sample table if you want to try your xpath. Use the calendar on the left side: https://sfbay.craigslist.org/sfc/
Be careful of the contains() function.
It is a common mistake to use it to test if an element contains a value. What it really does is test if a string contains a substring. So, td[contains(.,'8')] takes the string value of td (.) and tests if it contains any '8' substrings. This might be what you want, but often it is not.
This XPath,
//td[.='8']
will select all td elements whose string-value equals 8.
Alternatively, this XPath,
//td[normalize-space()='8']
will select all td elements whose normalize-space() string-value equals 8. (The normalize-space() XPath function strips leading and trailing whitespace and replaces sequences of whitespace characters with a single space.)
Notes:
Both will work even if the 8 is inside of another element such as a, b, span, div, etc.
Neither will match <td>gr8t</td>, <td>123456789</td>, etc.
Using normalize-space() will ignore leading or trailing whitespace surrounding the 8.
See also:
Why is contains(text(), "string" ) not working in XPath?
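To make the difference concrete, here is a rough Python sketch that mimics the three tests on the string-value of each cell. It does not evaluate XPath itself; the table is a hand-made variant of the one in the question (one 8 wrapped in an a element, one padded with spaces), and string_value is a helper name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup, loosely based on the table in the question.
table = ET.fromstring(
    "<table>"
    "<tr><td>2</td><td>1</td><td>28</td><td>9</td></tr>"
    "<tr><td>3</td><td><a>8</a></td><td>5</td><td>10</td></tr>"
    "<tr><td>18</td><td>9</td><td> 8 </td><td>0</td></tr>"
    "</table>"
)

def string_value(element):
    """Concatenate all descendant text, like the XPath string-value of a node."""
    return "".join(element.itertext())

cells = [string_value(td) for td in table.iter("td")]

substring_hits = [c for c in cells if "8" in c]                     # like contains(., '8')
exact_hits = [c for c in cells if c == "8"]                         # like . = '8'
normalized_hits = [c for c in cells if " ".join(c.split()) == "8"]  # like normalize-space() = '8'

print(substring_hits, exact_hits, normalized_hits)
```

The substring test picks up 28 and 18 as well, the exact comparison misses the padded cell, and the normalized comparison keeps both cells that hold only 8.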
Try the following xpath, which will select the whole text contents rather than partial matches:
//table//td[text()='8']
Edit: Your example HTML has <a> tags inside the td elements, so the following will work:
//table//td/a[text()="8"]
See example in php here: https://3v4l.org/56SBn

Regex match a string that doesn't contains a string

I want to replace all the <span...> tags (including <span id="... and <span class="...) in an HTML document with <span>, except if the span starts with <span id="textmarker (for example, I don't want to keep this span: <span attr="blah" id="textmarker">)
I've tried the regex proposed here and here, I finally came up with this regex that never returns a <span id="textmarker but somehow it sometimes misses the other spans:
<span(?!.*? id="textmarker).*?">
You can see my (simplified) html here : https://regex101.com/r/yT9jG2/2
Strangely, if I run the regex in notepad++ it returns 3 matches (the three spans in the second paragraph) but regex101 only returns 1 match. Notepad++ and regex101 both miss the span in the first paragraph.
This regex also doesn't return every span it should (cf. the spans with gray highlights here)
<span(?![^>]*? id="textmarker)[^>]*?>
Updated: To exclude id="textmarker while including id="anythingelse and all other spans:
(<span(?! *id="textmarker)[^>]*>)
On your posted example at: https://regex101.com/r/yT9jG2/2 , and at the top, choosing version 2, set the fields so:
field 1: (<span(?! *id="textmarker)[^>]*>)
field 2, (the smaller field that lets you set modifier): g
With your example and version 2 selected, this matches 9 spans and lists them on the right, including empty spans as well as spans whose id is not textmarker, such as <span id="YellowType">
Explanation
Field 1:
optional: ( and ). An extra outer parenthesis was added to the expression for educational purposes, just for making use of regex101's matched group listing feature to list results on the right pane in addition to the default inline highlighting of matches. When using Notepad++ you can of course omit these outer ( ) parentheses.
<span: matches <span
(?! starts a negative lookahead assertion for the following,
' *' (a space followed by *) meaning a space zero or more times, in case you have extra spaces
followed by id="textmarker
) to end the negative lookahead assertion
so if the text at that position matches what is inside the negative lookahead, the candidate is automatically discarded as a match
[^ starts an exclusion set, meaning any character that is not one of the following, the following being the >
] to stop defining the exclusion
* to match the preceding 0 or more times. The preceding being [^>]
> to match to end of the open-a-span tag
Field 2
g tells regex101 you want this to be a global match
so the result does not stop at the first match, but will have all matches
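For a quick check outside regex101, the pattern can be run with Python's re module. This is only a sketch on a made-up HTML fragment (not the one from the linked demo). Note that the lookahead only inspects what comes directly after <span, so it protects tags in which id="textmarker is the first attribute.

```python
import re

# Hypothetical fragment for illustration; the real input is the page's HTML.
html = ('<p><span id="textmarker1">a</span><span class="x">b</span>'
        '<span id="YellowType">c</span><span>d</span></p>')

# Opening span tags, except those that start with id="textmarker.
span_tag = re.compile(r'<span(?! *id="textmarker)[^>]*>')

matches = span_tag.findall(html)
cleaned = span_tag.sub('<span>', html)

print(matches)
print(cleaned)
```

re.sub and re.findall process every match by default, which corresponds to setting the g modifier on regex101.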

using Xpath to scrape inconsistent DOM

I want to scrape the post name. In the first pattern it is located within a span,
but a forum thread can also look like this (line 7)
because the thread is a poll.
So in my case I can't always target the span (line 8 in the first picture). I tried descendant-or-self but couldn't get it right. What's wrong here?
$postTitle = $xpath->query("//tr/td[@class='row1'][3]/div/div[1]//descendant-or-self::text()");
With this expression you will select the first <a> in the <div> where the text you wish to extract is located:
//tr/td[@class='row1'][3]/div/div[1]/a[1]
I'm assuming you intend to select one element (and not a node-set). For that you can get the string-value of this expression (which will return all the text in the descendant nodes) using string() or normalize-space() (which trims and removes extra spaces):
normalize-space(//tr/td[@class='row1'][3]/div/div[1]/a[1])
This will extract Salary vs age or /ktards are you... depending on the node found.
If there is more than one match it will return a collection, which you should iterate over and get the string value of each one individually. Using those functions on a node-set will give you the text in the first element, discarding the others.
If you only have to deal with two cases: 1) text inside a/span, 2) text inside a, you can select the text nodes directly using a union (|) operator:
//tr/td[@class='row1'][3]/div/div[1]/a[1]/text() | //tr/td[@class='row1'][3]/div/div[1]/a[1]/span/text()
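The idea of taking the string-value of the first a element, whether or not the text is wrapped in a span, can be sketched in Python with the standard library. The two fragments and the helper name first_link_text are assumptions made for this illustration; the real page structure is the one addressed by the XPath expressions above.

```python
import xml.etree.ElementTree as ET

def first_link_text(fragment):
    """Return the whitespace-normalized text of the first <a> child."""
    container = ET.fromstring(fragment)
    first_link = container.find("a")
    # itertext() walks the element and its descendants, like an XPath string-value.
    return " ".join("".join(first_link.itertext()).split())

# Case 1: text inside a/span. Case 2: text directly inside a (assumed poll layout).
plain_thread = '<div><a href="t1"><span>Salary vs age</span></a></div>'
poll_thread = '<div><a href="t2">  Some poll title </a><a href="p">2</a></div>'

titles = [first_link_text(plain_thread), first_link_text(poll_thread)]
print(titles)
```

Because the text is collected from the a element and everything below it, both layouts are handled by the same code path, which is what normalize-space() on the a element achieves in XPath.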