Use XPath to find links containing two things - html

I'm using XPath to parse an HTML document to find a specific link. The specific link has a domain name in it and the character '#'.
//a[@*[contains(., 'domain')]] | //a[@*[contains(., '#')]]
This returns links with 'domain' OR '#' in them, and I need 'domain' AND '#'.
I've been trying to use:
//a[@*[contains(., 'domain')]] & //a[@*[contains(., '#')]]
But that doesn't work.

You can read about XPath operators here. The & operator does not exist.
Also, there is no need to select the element twice.
You could use either
//a[@*[contains(., 'domain')]][@*[contains(., '#')]]
or
//a[@*[contains(., 'domain')] and @*[contains(., '#')]]
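To make the AND semantics concrete, here is a minimal plain-Python sketch of what the two stacked predicates test, using only the standard library's html.parser rather than an XPath engine. The names LinkFilter and find_links are made up for this illustration. Note that, like the XPath, it accepts a link when 'domain' and '#' appear in different attributes of the same element.

```python
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Collect <a> tags whose attributes satisfy every substring test."""

    def __init__(self, needles):
        super().__init__()
        self.needles = needles
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        values = [value or "" for _, value in attrs]
        # Mirrors //a[@*[contains(., n1)]][@*[contains(., n2)]]:
        # each needle must occur in at least one attribute value.
        if all(any(n in v for v in values) for n in self.needles):
            self.matches.append(dict(attrs))


def find_links(html, needles=("domain", "#")):
    """Return the attribute dicts of all matching <a> elements."""
    parser = LinkFilter(needles)
    parser.feed(html)
    parser.close()
    return parser.matches
```

With an XPath-capable library you would pass the expression itself instead; this sketch only shows the logic the expression encodes.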

Should be as easy as:
//a[@*[contains(., 'domain')]][@*[contains(., '#')]]

Related

XPath for id attribute with changing substring in middle?

I have a website on which I need to locate elements by XPath. Their ids look like this: //*[@id="panel-detail-6163748c7952a-partnerCode"]
The issue is that the website changes the value 6163748c7952a on every page load.
Is there an XPath expression that can match on the first and last parts of that string? Some sort of wildcard like //*[@id="panel-detail-*-partnerCode"]
This XPath 2.0 expression,
//*[matches(@id, "^panel-detail-.*-partnerCode$")]
or this XPath 1.0 expression,
//*[starts-with(@id, 'panel-detail-') and
substring(@id, string-length(@id) - string-length('-partnerCode') + 1)
= '-partnerCode']
will match all elements whose id attribute value starts and ends with the noted substrings.
See also
XPath testing that string ends with substring?
XPath has a few string functions for this, such as starts-with and ends-with. People often replace them with contains, which should be discouraged.
Please note that ends-with is only available from XPath 2.0 onwards.
XPath 1.0 (no ends-with, so contains is used here as a looser approximation):
//*[starts-with(@id,'panel-detail-') and contains(@id, '-partnerCode')]
XPath 2.0:
//*[starts-with(@id,'panel-detail-') and ends-with(@id, '-partnerCode')]
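The difference between these variants is easy to check outside XPath. The following is a small Python sketch, not an XPath evaluation: each function mirrors the string logic of one expression above, and the function names are invented for this example. The second test id, with an extra trailing part, is also hypothetical.

```python
import re

PREFIX = "panel-detail-"
SUFFIX = "-partnerCode"


def matches_regex(id_value):
    # Same pattern as matches(@id, "^panel-detail-.*-partnerCode$").
    return re.search(r"^panel-detail-.*-partnerCode$", id_value) is not None


def matches_substring(id_value):
    # XPath 1.0 emulation of ends-with: take the last len(SUFFIX) characters.
    # XPath's substring() is 1-based, hence the "+ 1" there; Python is 0-based.
    start = max(len(id_value) - len(SUFFIX), 0)
    return id_value.startswith(PREFIX) and id_value[start:] == SUFFIX


def matches_contains(id_value):
    # The looser starts-with + contains variant.
    return id_value.startswith(PREFIX) and SUFFIX in id_value


for candidate in (
    "panel-detail-6163748c7952a-partnerCode",
    "panel-detail-6163748c7952a-partnerCode-extra",
):
    print(candidate, matches_regex(candidate),
          matches_substring(candidate), matches_contains(candidate))
```

Only the contains variant accepts the second id, which is why it is the weaker test.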

How to edit this html lexer rule?

I want to edit this HTML lexer rule and I need help with the regular expression.
TAG_NAME refers to any HTML attribute name, for example required, class, id, etc.
I want to edit it so that it does not accept this exact name: 'az-'.
I think this needs a change to the regular expression. I looked it up, but I couldn't integrate what I found online with the way these rules are written.
As a first try I removed the '-' from TAG_NameChar, but then attributes like 'data-target' were no longer recognized.
This snippet is for the rule:
and this one shows how the attributes are recognized.
ANTLR does not support lookahead syntax like some regex engines do, so there's no easy way to exclude certain matches from within the regex. It's always possible to rewrite a regular expression to exclude a given string (regular expressions are closed under negation and intersection), but it usually ends up quite painful. In your case, you'd end up with something following the logic of "a tag name can either have less than 3 characters, more than 3 characters, or it could have three characters where the first isn't an 'a', the second isn't a 'z' or the last isn't a '-'".
The less painful, but also less cross-language, solution is to use a predicate that returns false if the text of the tag name equals az-. So something like {!getText().equals("az-")}? depending on the language.
If you're okay with introducing an additional lexer rule, you may also introduce a rule INVALID_TAG_NAME (or whatever you want to call it) that matches exactly az- and that's defined before TAG_NAME. That way any tag that's named exactly az- will produce an INVALID_TAG_NAME token instead of a TAG_NAME token.
Depending on your requirements, you could also leave the grammar unchanged altogether and simply produce an error when you see a tag named az- when you traverse the tree in a listener or visitor.
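To show what the lookahead-free rewrite described above looks like in practice, here is a sketch using Python's re module instead of ANTLR. The original TAG_NAME rule is not shown in the question, so the character class below (lowercase letters, digits and '-') is purely an assumption for illustration, as are the names NAME_CHAR, NOT_AZ_DASH and is_allowed.

```python
import re

# Hypothetical name character class; the real TAG_NAME rule may differ.
NAME_CHAR = "[a-z0-9-]"

# "Any name except az-", written without lookahead:
NOT_AZ_DASH = re.compile(
    rf"{NAME_CHAR}{{1,2}}"                    # fewer than 3 characters
    rf"|{NAME_CHAR}{{4,}}"                    # more than 3 characters
    rf"|[b-z0-9-]{NAME_CHAR}{NAME_CHAR}"      # 3 chars, first is not 'a'
    rf"|{NAME_CHAR}[a-y0-9-]{NAME_CHAR}"      # 3 chars, second is not 'z'
    rf"|{NAME_CHAR}{NAME_CHAR}[a-z0-9]"       # 3 chars, third is not '-'
)


def is_allowed(name):
    """True for every name over NAME_CHAR except exactly 'az-'."""
    return NOT_AZ_DASH.fullmatch(name) is not None


for name in ("az-", "az", "az-x", "data-target", "bz-"):
    print(name, is_allowed(name))
```

Even for a three-character exclusion the pattern is noticeably harder to read than a predicate or a separate INVALID_TAG_NAME rule, which is the trade-off the answer points out.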

RegEx matching for HTML and non-HTML URLs

I'm trying to get all URLs from this text, both absolute and relative, but I can't find the right regular expression. My expression matches more than I would like: it also picks up HTML tags and other information that I do not want.
Attempt
(\w*.)(\\\/){1,}(.*)(?![^"])
Input
<div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a> <\/div>\n
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" width=\"307\" height=\"224\" \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n <\/div>\n <\/div>\n <div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>
As has been commented, solving this problem with RegEx may not be the best idea. However, if you wish to practice or you really have to, you can do an exact match between the "" where your URLs are present. You can bound them on the left using src, href, or any other fixed components that you may have. You can simply use an | and list them in the first group ().
RegEx 1 for HTML URLs
This RegEx may not be the right solution, but it might give you an idea of how to approach this problem using RegEx:
(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
It creates four groups to make it simpler to update, and the $3 group should contain your desired URLs. You can add any other characters that your URLs might contain to the third group.
RegEx 2 for both HTML and non-HTML URLs
For capturing other non-HTML URLs, you can update it similar to this RegEx:
(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(")
where \x22 stands for ", which you can simply replace with a literal ". I have only used \x22 so that you can see those " characters that sit just before your target URLs.
The second RegEx also has four groups, where the target group is $3. You can also simplify or DRY it, if you wish.
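As a rough sketch of how RegEx 1 could be applied, the following Python snippet runs it over two fragments taken from the input above and cleans up group 3. The names URL_ATTR and extract_urls are invented here, and the cleanup step (turning \/ into / and dropping the trailing backslash that group 3 captures before the closing quote) is an assumption about what the desired output looks like.

```python
import re

# RegEx 1 from the answer, unchanged; group 3 holds the escaped URL.
URL_ATTR = re.compile(r'(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")')


def extract_urls(text):
    """Return the URLs found after src= or href=, with escaping removed."""
    urls = []
    for match in URL_ATTR.finditer(text):
        raw = match.group(3)
        # Group 3 still contains "\/" pairs and ends with the backslash
        # that escapes the closing quote, so strip both.
        urls.append(raw.replace("\\/", "/").rstrip("\\"))
    return urls


sample = (
    r'<img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/>'
    r'<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>'
)
print(extract_urls(sample))
```

The title attribute is skipped because it is not listed in the first group; extending that group, as in RegEx 2, widens what is captured.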

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract a multiline content of a tag, which has specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
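For example, a minimal Python sketch of this pattern with the id substituted in might look as follows. The helper name div_content is made up, and re.escape is added so that an id containing regex metacharacters is matched literally.

```python
import re


def div_content(html, value):
    """Return the content of the first <div> whose id equals value, or None."""
    pattern = rf'<div(?=\s)[^>]*?\sid="{re.escape(value)}"[^>]*?>([\S\s]*?)</div>'
    match = re.search(pattern, html)
    # Group 1 is the text between the opening tag and the first </div>.
    return match.group(1) if match else None


html = '<div id="1">test</div><div id="2">line one\nline two</div>'
print(repr(div_content(html, "2")))
```

Because [^>] cannot cross the end of the opening tag, the match can no longer start at the first div and run into the second one. Nested divs inside the target would still end the match early.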
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
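As an illustration of the parser-based route, here is a sketch using Python's standard html.parser. It is an event-based parser rather than a full DOM, and the class and function names are invented for this example. It collects only the text inside the matching div (not the inner markup) and counts nested divs so the right closing tag ends the capture.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text inside the first <div> with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while inside the target div
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1   # nested div inside the target
        elif (not self.done and tag == "div"
              and dict(attrs).get("id") == self.target_id):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def div_text(html, target_id):
    parser = DivText(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(div_text('<div id="1">test</div><div id="2">a <div>b</div> c</div>', "2"))
```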

Picking the first element using xpath in capybara

I have the following line of code
link = find(:xpath, "//div[@id='tree']//a[contains(.,'#{peril}')]")
The step above matches two elements. How do I pick the first one?
I am getting "Ambiguous match, found 2 elements matching xpath". Here is the HTML
"
ShipCase_US_MortalityRatingGroup_Life Portfolio result Earthquake Infectious Disease"
You need to surround the entire XPath in parentheses and add the [1] after it.
(//div[@id='tree']//a[contains(.,'#{peril}')])[1]
find(".active", match: :first).click
This solution uses Capybara's (quite important) waiting capabilities.