Regex match a string that doesn't contain a string - html

I want to replace all the <span...> tags (including <span id="... and <span class="...) in an HTML document with <span>, except when the span starts with <span id="textmarker (for example, I don't want to keep this span: <span attr="blah" id="textmarker">).
I've tried the regex proposed here and here, and finally came up with this regex, which never returns a <span id="textmarker but somehow sometimes misses the other spans:
<span(?!.*? id="textmarker).*?">
You can see my (simplified) html here : https://regex101.com/r/yT9jG2/2
Strangely, if I run the regex in notepad++ it returns 3 matches (the three spans in the second paragraph) but regex101 only returns 1 match. Notepad++ and regex101 both miss the span in the first paragraph.
This regex also doesn't return every span it should (cf. the spans with gray highlights here):
<span(?![^>]*? id="textmarker)[^>]*?>

Updated: To exclude id="textmarker while including id="anythingelse and all other spans:
(<span(?! *id="textmarker)[^>]*>)
On your posted example at: https://regex101.com/r/yT9jG2/2 , and at the top, choosing version 2, set the fields so:
field 1: (<span(?! *id="textmarker)[^>]*>)
field 2, (the smaller field that lets you set modifier): g
With your example and version 2 selected, this finds 9 matches and lists them on the right, including empty spans as well as spans whose id is not "textmarker", such as <span id="YellowType">
Explanation
Field 1:
optional: ( and ). An extra outer parenthesis was added to the expression for educational purposes, just for making use of regex101's matched group listing feature to list results on the right pane in addition to the default inline highlighting of matches. When using Notepad++ you can of course omit these outer ( ) parentheses.
<span: matches <span
(?! starts a negative lookahead assertion for the following,
space followed by *: matches a space zero or more times, in case you have extra spaces
followed by id="textmarker
) to end the negative lookahead assertion
so if the text right after <span matches what is inside the lookahead, that position is rejected as a match
[^ starts an exclusion set, so it matches any character that is not one of the following, the following being the >
] to stop defining the exclusion
* to match the preceding 0 or more times. The preceding being [^>]
> to match to end of the open-a-span tag
Field 2
g tells regex101 you want a global match (it has nothing to do with greediness)
so the result does not stop at the first match, but will have all matches
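The same pattern can be tried outside regex101 or Notepad++. Below is a minimal Python sketch, assuming the standard re module and a made-up HTML string (the sample markup and the names SPAN_PATTERN, html and cleaned are illustrative, not taken from the linked demo). re.sub replaces every match by default, so no g flag is needed there.

```python
import re

# Pattern from the answer, without the optional outer parentheses.
SPAN_PATTERN = re.compile(r'<span(?! *id="textmarker)[^>]*>')

# Hypothetical sample markup, for illustration only.
html = (
    '<p><span id="textmarker1">keep</span>'
    '<span class="x">a</span>'
    '<span id="YellowType">b</span>'
    '<span>c</span></p>'
)

# Every opening span tag is rewritten to a bare <span>,
# except the one that starts with id="textmarker.
cleaned = SPAN_PATTERN.sub('<span>', html)
print(cleaned)
```

Closing </span> tags are untouched because they do not start with <span.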

Related

What is the Correct XPath to Identify Element with Text Occurring Minimum Number of Times?

I'm trying to identify an element that has certain text but I only want to identify the element if the desired text occurs a specific number of times.
For example, imagine we have the following two HTML snippets on the same page:
Snippet 1:
<span id="price">
$36.46
<span>
($0.38 / Count)
</span>
</span>
Snippet 2:
<span id="price">$38.38</span>
I could identify both elements using the XPath: .//span[contains(text(),'$')]. However, I only want to identify the element if it (or any descendant of the span element) contains at least two instances of the character $.
In above example, it would only identify the first snippet because the second snippet only contains one instance of $, not two.
What is the correct XPath syntax to use?
You can use the XPath //span[count(.//text()[contains(., "$")]) >= 2]
This is a moderately complicated XPath, so here is an explanation that builds it from the inside out:
.//text()[contains(., "$")]
Select all text nodes descending from the current node that contain "$".
count(.//text()[contains(., "$")])
Count the number of text nodes descending from the current node that contain "$".
//span[count(.//text()[contains(., "$")]) >= 2]
Select all span elements with two or more descendant text nodes that contain "$".
As a caveat, this only works if the dollar signs are in two different text nodes. If you want to include the span in this example:
<span>
$$
<span>
foo
</span>
</span>
...then you'll need a different approach:
//span[string-length(.) - string-length(translate(., "$", "")) >= 2]
This predicate compares the string length of the span to the string length of the same span with all "$" characters removed.
One usable XPath-1.0 expression is
string-length(/span[@id='price'])-string-length(translate(/span[@id='price'],'$',''))
In a predicate this could look like
//span[string-length(.)-string-length(translate(.,'$',''))>=2]
This expression selects only the elements with a count of $ >= 2
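The length-difference trick is easy to check by hand. The following is a rough Python sketch that mirrors the arithmetic rather than running the XPath itself; it assumes the standard xml.etree.ElementTree module, and the wrapper div, the id "price2" and the helper dollar_count are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup based on the two snippets in the question
# (the second id is changed so the ids stay unique).
doc = ET.fromstring(
    '<div>'
    '<span id="price">$36.46<span>($0.38 / Count)</span></span>'
    '<span id="price2">$38.38</span>'
    '</div>'
)

def dollar_count(element):
    # String value of the element: its own text plus all descendant text.
    value = "".join(element.itertext())
    # Same arithmetic as string-length(.) - string-length(translate(., "$", "")).
    return len(value) - len(value.replace("$", ""))

# Keep only the spans whose string value holds at least two "$" characters.
selected = [s.get("id") for s in doc.iter("span") if dollar_count(s) >= 2]
print(selected)
```

Only the outer span from the first snippet qualifies; the nested span and the second snippet each contain a single $.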

Regex to match style=' '

I'm using a series of regex patterns to remove HTML elements from my code. I need to also remove the style="{stuff}" attributes that are also present in the file.
At the moment I have style.*?, which matches only the word style, however I thought that by adding .*? to the regex it would also match with zero to unlimited characters after the style declaration?
I also have style={0,1}"{0,1}.*?"{0,1} which matches:
style=""
style="
style
But does not match style="something", again in this regex I would expect the .*? to match everything between the first " and the second ", but this is not the case. What do I need to do to change this regex so that it will match with all of the following:
style="font-family:"Open Sans", Arial, sans-serif;background-color:rgb(255, 255, 255);display:inline !important;"
style=""
style="something"
style
The pattern style.*? matches only the word style because nothing follows the non-greedy .*?, so it matches as little as possible, which is zero characters.
You could use an optional group and a negated character class:
\bstyle(?:="[^"]*")?
In parts
\bstyle Word boundary, match style
(?: Non capturing group
=" Match = and opening "
[^"]* Match any char except " 0+ times, using a negated character class
" Match closing "
)? Close group and make it optional
Regex demo
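As a quick check of the pattern above, here is a small Python sketch, assuming the standard re module; the sample strings and the names STYLE, samples and tag are illustrative. It also shows one limitation worth knowing: a value with unescaped inner double quotes, like the font-family example in the question, is only matched up to the first inner quote.

```python
import re

STYLE = re.compile(r'\bstyle(?:="[^"]*")?')

# The three simple forms from the question are matched in full.
samples = ['style=""', 'style="something"', 'style']
matches = [STYLE.fullmatch(s) is not None for s in samples]

# Removing the attribute from a tag (hypothetical sample markup).
tag = '<p style="color:red" class="a">text</p>'
stripped = STYLE.sub('', tag)

# Unescaped inner quotes end the match early.
partial = STYLE.match('style="font-family:"Open Sans", Arial;"').group()

print(matches, stripped, partial)
```

Note that the substitution leaves the surrounding whitespace in place, so two spaces remain where the attribute was.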
If you want to match single or double quotes with the accompanying closing single or double quote to not match for example style="', you could use a capturing group (["']) with a backreference \1 to what was captured in group 1:
\bstyle(?:=(["'])[^"]*\1)?
Regex demo
Here's what I cooked up. It uses positive lookbehind (?<=...) and lookahead (?=...) to ensure that the found match is inside an HTML tag:
(?<=<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])(?=[^<>]*>)
Test it out.
It will match any whitespace before the "style", so that a removal of all matches goes from <a stuff="..." style="width:18px;" href="someurl"> to <a stuff="..." href="someurl"> without leaving a double space behind where it was removed.
Note that some regex parsers (like the Python one) don't like lookbehind with non-fixed size. This can be solved simply by changing the first and last parts, the lookbehind and lookahead groups, into capture groups instead, thereby capturing the whole html tag. Then you simply need to replace the match by $1$2 instead of an empty string, replacing the found match by the same thing but without the style="..." part inside it.
The resulting regex for that would be:
(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)
Test it out.
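Since Python is mentioned as one of the engines that rejects variable-length lookbehind, here is a minimal sketch of the capture-group variant there, assuming the standard re module and a made-up input string. Python writes the replacement as \1\2 rather than $1$2.

```python
import re

# Capture-group variant: group 1 is the tag up to the attribute,
# group 2 is the rest of the tag after it.
TAG_STYLE = re.compile(
    r'(<[a-zA-Z][^<>]*?)\sstyle(?:="[^"]*")?(?=[\s>])([^<>]*>)'
)

# Hypothetical input: one style inside a tag, one in plain text.
html = '<a stuff="..." style="width:18px;" href="someurl">link</a> style="x" in text'

# Re-emit both groups, dropping only the whitespace + style attribute.
result = TAG_STYLE.sub(r'\1\2', html)
print(result)
```

The style="x" that appears in plain text outside any tag is left alone, because the match has to start at a < followed by a letter.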

Split a paragraph to separate one or two digit numbers with PowerShell

I'm attempting to parse and format some text from an HTML file into Word. I'm doing this by capturing each paragraph into an array and then writing it into the word document one paragraph at a time. However, there are superscripted references sprinkled throughout the text. I'm looking for a way to superscript these references in the new Word file and thought I would use regex and split to make this work. Here is an example paragraph:
$p = "This is an example sentence.1 The number is a reference note that should be superscripted and can be one or two digits long."
Here is the code I tried to split and select the digit(s):
[regex]::Split($p,"(\d{1,2})")
This works for single and double digits. However, if there are more than two digits, it still splits it, but moves the extra numbers to the next line. Like so:
This is an example sentence.
10
0
The number is a reference note that should be superscripted and can be one or two digits long.
This is important because there are sometimes larger numbers (3-10 digits) in the text that I don't want to split on. My goal is to take a block of text with reference note numbers and separate out the notes so I can perform formatting functions on them when I write it out to the Word file. Something like this (untested):
$paragraphs | % {
    $a = @([regex]::Split($_,"(\d{1,2})"))
    $a | % {
        $text = $_
        if ($text -match "(\d{1,2})")
        {
            $objSelection.Font.SuperScript = 1
            $objSelection.TypeText("$text")
            $objSelection.Font.SuperScript = 0
        }
        Else
        {
            $objSelection.Style="Normal"
            $objSelection.TypeText("$text")
        }
    }
    $text = "`v"
    $objSelection.TypeText("$text")
    $objSelection.TypeParagraph()
}
EDIT:
The following regex works when I test it with the above loop in its own script:
"(?<![\d\s])(\d{1,2})(?!\d)"
However, when I run it in the parent script, I get the following error:
Cannot find an overload for "Split" and the argument count: "2"
$a = [regex]::Split($_,"(?<![\d\s])(\d{1,2})(?!\d)")
How would I go about troubleshooting this error?
You may use
[regex]::Split($p,"(?<![\d\s])(\d{1,2})(?!\d)\s*")
It only matches and captures one or two digits that are neither followed nor preceded by another digit, and not preceded by any whitespace char. Any trailing whitespace is matched with \s* and is thus removed from the items that are added to the resulting array.
See this regex demo:
Details
(?<![\d\s]) - a negative lookbehind that fails the match if, immediately to the left of the current position, there is a digit or a whitespace
(\d{1,2}) - Group 1: one or two digits
(?!\d) - that cannot be followed with another digit (it is a negative lookahead that fails the match if its pattern matches immediately to the right of the current location)
\s* - 0+ whitespaces.
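The pattern's behaviour can be checked outside PowerShell as well. Here is a small Python sketch, assuming the standard re module (its re.split also keeps captured groups in the result, like [regex]::Split); the sample sentences and variable names are made up for illustration.

```python
import re

NOTE = re.compile(r'(?<![\d\s])(\d{1,2})(?!\d)\s*')

# A one-digit reference note right after the full stop is split out.
p = "This is an example sentence.1 The number is a reference note."
parts = NOTE.split(p)

# A three-digit number is not split at all.
long_number = "This is an example sentence.100 A three-digit number is left alone."
untouched = NOTE.split(long_number)

print(parts)
print(untouched)
```

In the first case the note comes back as its own list item with the trailing space removed; in the second the string comes back as a single item.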

How to use XPath contains() for specific text?

Say we have an HTML table which basically looks like this:
2|1|28|9|
3|8|5|10|
18|9|8|0|
I want to select the cells which contain only 8 and nothing else, that is, only 2nd cell of row2 and 3rd cell of row3.
This is what I tried: //table//td[contains(.,'8')]. It gives me all cells which contain 8. So, I get unwanted values 28 and 18 as well.
How do I fix this?
EDIT: Here is a sample table if you want to try your xpath. Use the calendar on the left side-https://sfbay.craigslist.org/sfc/
Be careful of the contains() function.
It is a common mistake to use it to test if an element contains a value. What it really does is test if a string contains a substring. So, td[contains(.,'8')] takes the string value of td (.) and tests if it contains any '8' substrings. This might be what you want, but often it is not.
This XPath,
//td[.='8']
will select all td elements whose string-value equals 8.
Alternatively, this XPath,
//td[normalize-space()='8']
will select all td elements whose normalize-space() string-value equals 8. (The normalize-space() XPath function strips leading and trailing whitespace and replaces sequences of whitespace characters with a single space.)
Notes:
Both will work even if the 8 is inside another element such as an a, b, span, div, etc.
Both will not match <td>gr8t</td>, <td>123456789</td>, etc.
Using normalize-space() will ignore leading or trailing whitespace surrounding the 8.
See also:
Why is contains(text(), "string" ) not working in XPath?
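To see the difference between the substring test and the equality test, here is a minimal Python sketch using the standard xml.etree.ElementTree module, which supports only a subset of XPath: the [.='8'] predicate is available (Python 3.7+), while contains() and normalize-space() are not, so the substring test is emulated in plain Python. The table markup is reconstructed from the numbers in the question and is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical table built from the values in the question.
table = ET.fromstring(
    '<table>'
    '<tr><td>2</td><td>1</td><td>28</td><td>9</td></tr>'
    '<tr><td>3</td><td>8</td><td>5</td><td>10</td></tr>'
    '<tr><td>18</td><td>9</td><td>8</td><td>0</td></tr>'
    '</table>'
)

# Equality on the string value: only the cells that are exactly "8".
exact = table.findall(".//td[.='8']")

# Substring test, emulating td[contains(., '8')]: also picks up 28 and 18.
substring = [td for td in table.iter('td') if '8' in ''.join(td.itertext())]

print(len(exact), [td.text for td in substring])
```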
Try the following xpath, which will select the whole text contents rather than partial matches:
//table//td[text()='8']
Edit: Your example HTML has a tags inside the td elements, so the following will work:
//table//td/a[text()="8"]
See example in php here: https://3v4l.org/56SBn

using Xpath to scrape inconsistent DOM

I want to scrape the post name, which in the first pattern is located within a span,
but the forum thread can also look like this (line 7)
because the thread is a poll.
So in my case I can't target the span (line 8 in the first picture). I used descendant-or-self but can hardly get it right. What's wrong here?
$postTitle = $xpath->query("//tr/td[@class='row1'][3]/div/div[1]//descendant-or-self::text()");
With this expression you will select the first <a> in the <div> where the text you wish to extract is located:
//tr/td[@class='row1'][3]/div/div[1]/a[1]
I'm assuming you intend to select one element (and not a node-set). For that you can get the string-value of this expression (which will return all the text in the descendant nodes) using string() or normalize-space() (which trims and removes extra spaces):
normalize-space(//tr/td[@class='row1'][3]/div/div[1]/a[1])
This will extract Salary vs age or /ktards are you... depending on the node found.
If there is more than one match it will return a collection, which you should iterate over and get the string value of each one individually. Using those functions on a node-set will give you the text in the first element, discarding the others.
If you only have to deal with two cases: 1) text inside a/span, 2) text inside a, you can select the text nodes directly using a union (|) operator:
//tr/td[@class='row1'][3]/div/div[1]/a[1]/text() | //tr/td[@class='row1'][3]/div/div[1]/a[1]/span/text()
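The idea of taking the string value of the a element, whether or not the text sits in a nested span, can be sketched in Python as follows. This is only an illustration under assumptions: it uses the standard xml.etree.ElementTree module, a simplified made-up table (one cell per row, so the [3] and [1] positions are dropped), and it emulates normalize-space() in plain Python because that function is not available in ElementTree's XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one normal thread (text inside a/span)
# and one poll thread (text directly inside a).
forum = ET.fromstring(
    '<table>'
    '<tr><td class="row1"><div><div>'
    '<a href="#"><span>  Salary vs   age </span></a>'
    '</div></div></td></tr>'
    '<tr><td class="row1"><div><div>'
    '<a href="#">Poll: are you...</a>'
    '</div></div></td></tr>'
    '</table>'
)

titles = []
for row in forum.iter('tr'):
    link = row.find("td[@class='row1']/div/div/a")
    # String value of the link: all descendant text, span or not.
    value = ''.join(link.itertext())
    # Emulate normalize-space(): trim and collapse runs of whitespace.
    titles.append(' '.join(value.split()))

print(titles)
```

Each row is handled individually, which matches the advice above to iterate over the matches and take the string value of each one.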