Can you search HTML attributes using wildcards with Ruby Nokogiri?

I know you can search text in HTML using wildcards. Can you search attribute values in HTML using wildcards with Nokogiri?
E.g., suppose I want to search for classes with the value *session*.

You can use the XPath contains() function to search the document. Something like:
doc.xpath("//*[@*[contains(., 'session')]]").each do |ele|
  # something
end
This search returns all the elements with any attribute whose value contains the string 'session'.

I had a similar problem a few days ago. Note the spaces around the class value, which make the test match a whole class name rather than a substring of one.
find(:xpath, "//*[contains(concat(' ', normalize-space(@class), ' '), ' icon-edit ')]")
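To see why the padding spaces matter, here is a minimal plain-Ruby sketch that mirrors the three XPath steps on an ordinary string. The helper name class_token? is hypothetical and only illustrates the logic; it is not part of Nokogiri or Capybara, and Ruby's split only roughly matches normalize-space().

```ruby
# Hypothetical helper (not a Nokogiri/Capybara API): true when `token` is one
# of the whitespace-separated class names in `class_attr`.
def class_token?(class_attr, token)
  normalized = class_attr.split.join(' ') # roughly normalize-space(@class)
  padded = " #{normalized} "              # concat(' ', ..., ' ')
  padded.include?(" #{token} ")           # contains(..., ' icon-edit ')
end

puts class_token?("btn  icon-edit\n", 'icon-edit')    # whole token present
puts class_token?('btn icon-edit-large', 'icon-edit') # only a substring
```

A bare contains(@class, 'icon-edit') would also accept the second value, because icon-edit is a substring of icon-edit-large.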

Related

XPath for id attribute with changing substring in middle?

I have a website on which I need to isolate elements by XPath. They have an id like this: //*[@id="panel-detail-6163748c7952a-partnerCode"]
The issue is that the website changes the value 6163748c7952a on every page load.
Is there an XPath expression that can match on the first and last parts of that string? Some sort of wildcard like //*[@id="panel-detail-*-partnerCode"]
This XPath 2.0 expression,
//*[matches(@id, "^panel-detail-.*-partnerCode$")]
or this XPath 1.0 expression,
//*[starts-with(@id, 'panel-detail-') and
    substring(@id, string-length(@id) - string-length('-partnerCode') + 1)
    = '-partnerCode']
will match all elements whose id attribute value starts and ends with the noted substrings.
See also
XPath testing that string ends with substring?
There are a few functions in XPath such as starts-with or ends-with. People often replace them with contains, which should be discouraged.
Please note that ends-with is only available from XPath 2.0.
XPath 1.0 (contains is a looser stand-in here, since it matches the substring anywhere in the value):
//*[starts-with(@id,'panel-detail-') and contains(@id, '-partnerCode')]
XPath 2.0:
//*[starts-with(@id,'panel-detail-') and ends-with(@id, '-partnerCode')]
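The difference between the contains-based and ends-with-based predicates can be sketched in plain Ruby on the id strings themselves. This is only an illustration of the string logic, not an XPath engine; the helper names and the second sample id are made up for the example.

```ruby
PREFIX = 'panel-detail-'
SUFFIX = '-partnerCode'

# Mirrors the XPath 1.0 idiom
#   substring(s, string-length(s) - string-length(suffix) + 1) = suffix
# XPath positions are 1-based and Ruby indexes are 0-based, hence no "+ 1".
def xpath1_ends_with?(str, suffix)
  start = str.length - suffix.length
  return false if start.negative?
  str[start, suffix.length] == suffix
end

# starts-with(...) and contains(...)
def loose_match?(id)
  id.start_with?(PREFIX) && id.include?(SUFFIX)
end

# starts-with(...) and the ends-with emulation
def strict_match?(id)
  id.start_with?(PREFIX) && xpath1_ends_with?(id, SUFFIX)
end

ids = ['panel-detail-6163748c7952a-partnerCode',
       'panel-detail-6163748c7952a-partnerCode-extra'] # hypothetical second id
ids.each { |id| puts "#{id}: loose=#{loose_match?(id)} strict=#{strict_match?(id)}" }
```

Both ids pass the loose check, but only the first passes the strict one, which is why the substring arithmetic is the closer XPath 1.0 equivalent of ends-with.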

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
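As a rough illustration of the [^>] advice, here is a minimal Ruby sketch. The helper name div_content and the sample markup are assumptions made for the example; Ruby's /m flag plays the role of the s modifier mentioned above. It still cannot handle nested div elements, which is one reason a DOM parser is the more robust choice.

```ruby
# Hypothetical helper: returns the inner text of the first <div> whose id
# attribute equals `id`, or nil when there is none.
def div_content(html, id)
  # [^>]* keeps the attribute scan inside one opening tag, so the pattern
  # cannot start at an earlier <div> and run across into a later one.
  pattern = /<div\b[^>]*\sid="#{Regexp.escape(id)}"[^>]*>(.*?)<\/div>/m
  match = pattern.match(html)
  match && match[1]
end

html = %(<div id="1">test</div><div id="2">line one\nline two</div>)
puts div_content(html, '2')
```

Regexp.escape guards against ids that contain regex metacharacters.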

How can I use a wildcard on Xpath

cover: $main//*[has-class("aligncenter wp-image-121146 size-large")]//img
cover: $main//img[has-class("aligncenter wp-image-121146 size-large")]
The string has a static part, aligncenter wp-image-, and a dynamic part "". What I want to do here is match all the possible values of "".
In bash it would be something like this:
"aligncenter wp-image-"*
How can I make that in Xpath?
I think this will work:
'//E[@class="aligncenter" and contains(concat(" ", @class, "wp-image-"), " C ")]'
This might be more robust (I haven't used "and" much), and it looks like you're expecting the classes to be in that order:
'//E[contains(concat(" ", @class, "aligncenter wp-image-"), " C ")]'
I haven't tested it though, either way, this should help you:
https://gist.github.com/glenpierce/400d5b569094b902f06789d80757454e
There is not enough information to give an exact XPath solution to this question as you didn't provide sample XML/HTML, and both of the attempted XPath expressions (if they actually are) are invalid.
To answer the title: you can't use a wildcard like that in XPath 1.0, which is the most widely implemented version of XPath. But you can use the contains() or starts-with() function. The latter seems more suitable in this particular case. For example, the following XPath returns all img elements, anywhere in the document, whose class attribute value starts with the substring 'aligncenter wp-image-':
//img[starts-with(@class, 'aligncenter wp-image-')]

Alternative to lookbehind with variable width

I have some html which contains a number of hyperlinks to html files, but they don't have any file extensions.
For example in the string <a href='variablelengthfilename'> I'm trying to match the trailing ' , so I can replace it with .html' (using a RegEx search in Notepad++) using something like this:
`(?<=href='[A-Za-z]*)'`
but that won't work because Notepad++ doesn't allow variable-length lookbehind assertions.
How else can I achieve this?
Thanks
Since you are working in Notepad++, here is a way to achieve what you are after:
Find what: \bhref='[^']*
Replace with: $&.html
The \bhref='[^']* regex matches a href as a whole word, then =' are matched literally, and [^']* matches 0 or more characters other than '. Note you will need to replace ' with " if the href value is inside double quotes.
Assuming all your links look like that, why not just do a simple replace
'>
with
.html'>
?
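The same idea, matching the whole href='...' prefix and re-emitting it instead of looking behind, can be sketched outside Notepad++ as well. This is a minimal Ruby example; the helper name add_html_extension and the sample links are made up for illustration, and the block form of gsub stands in for the $& replacement.

```ruby
# Hypothetical helper: appends ".html" to every single-quoted href value.
def add_html_extension(html)
  # Match "href='" plus the value up to (not including) the closing quote,
  # then write the match back followed by ".html".
  html.gsub(/\bhref='[^']*/) { |matched| "#{matched}.html" }
end

links = "<a href='about'>About</a> <a href='contact-us'>Contact</a>"
puts add_html_extension(links)
```

Like the Notepad++ version, this touches every single-quoted href, including ones that already have an extension or point to external URLs, so it assumes all links look like the sample.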

Use XPath to find links containing two things

I'm using XPath to parse an HTML document to find a specific link. The specific link has a domain name in it and the character '#'.
//a[@*[contains(., 'domain')]] | //a[@*[contains(., '#')]]
This returns links with 'domain' OR '#' in them, and I need 'domain' AND '#'.
I've been trying to use:
//a[@*[contains(., 'domain')]] & //a[@*[contains(., '#')]]
But that's no good.
The & operator does not exist in XPath; the boolean operator is and.
Also, there is no need to select the element twice.
You could use either
//a[@*[contains(., 'domain')]][@*[contains(., '#')]]
or
//a[@*[contains(., 'domain')] and @*[contains(., '#')]]
Should be as easy as:
//a[@*[contains(., 'domain')]][@*[contains(., '#')]]