I am parsing the HTML code of a site with Jsoup. I need to find some HTML elements that have a specific id, but their parent tree is complicating the task for me. I would like to know whether it is possible to search for a specific HTML element without having to search through all of its parents first.
For instance, I am currently doing the following:
Elements el=elements.select(".scroller.context-inplay").select(".zone.grid-1-1").select(".grid-1").select(".module-placeholder");
I would like to know if there is a simpler way to get the same element that I get with this code, by searching for its id.
The id of an HTML element should be unique within the page. Unfortunately, some HTML that you find in the wild breaks this requirement. However, if your HTML source follows the standard, you can simply use the # CSS operator to select the element in question:
Element el = doc.select("#someID").first();
Alternatively, you can use Jsoup's getElementById method directly:
Element el = doc.getElementById("someID");
Also, if you decide to go by class names, as you suggest in your question, it is easy to combine all the selects into one selector:
Elements els = elements.select(".scroller.context-inplay .zone.grid-1-1 .grid-1 .module-placeholder");
The spaces in the CSS selector are descendant combinators: each subselector to the right of a space matches elements nested anywhere inside the elements matched by the part on its left, not just direct children.
Related
I am using Screaming Frog and I want to do this using XPath.
Extract all links and anchors containing a certain class from the main body content
but I want to exclude all links within div.list
Right now I am trying this, but it's not working too well; plus, I want it to spit the result out in text form if possible.
//div[@class="page-content"]/*[not(class="list")]//a[@data-wpel-link="internal"]
Anyone got an idea?
This XPath,
//a[@data-wpel-link="internal"][not(ancestor::div[@class="list"])]
will select all a elements with the given attribute value that do not have an ancestor div of the given class.
You can, of course, prefix any ancestor path to restrict the selection, e.g.:
//div[@class="page-content"]//a[@data-wpel-link="internal"]
[not(ancestor::div[@class="list"])]
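To sanity-check the exclusion outside Screaming Frog, here is a minimal sketch using the JDK's built-in XPath 1.0 engine on a tiny well-formed snippet modeled on the question (the href values are made up):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ExcludeAncestorDemo {

    // Count <a data-wpel-link="internal"> elements under div.page-content
    // that are NOT nested inside any div.list, using the XPath from above.
    public static int countInternalLinks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String expr = "//div[@class='page-content']//a[@data-wpel-link='internal']"
                        + "[not(ancestor::div[@class='list'])]";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class='page-content'>"
                   + "<a data-wpel-link='internal' href='/keep'>keep</a>"
                   + "<div class='list'><a data-wpel-link='internal' href='/skip'>skip</a></div>"
                   + "</div>";
        System.out.println(countInternalLinks(xml)); // 1: only the link outside div.list
    }
}
```

Note that this parses the snippet as XML, so it only works on well-formed markup; it is meant to verify the expression, not to scrape real pages.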
I want to build an external GUI that operates on a generic HTML piece that comes with associated CSS. In order to enable some functionalities of the GUI, I would need to create some "meta" HTML elements to contain parts of content and associate them with data.
Example:
<div id="root">
<foo:meta data-source="document:1111" data-xref="...">
sometext
<p class="quote">...</p>
</foo:meta>
<p class="other">...</p>
</div>
This HTML is auto-generated starting from already existing HTML that has associated CSS:
<div id="root">
sometext
<p class="quote">...</p>
<p class="other">...</p>
</div>
#root>p {
color:green;
}
#root>p+p {
color:red;
}
The problem is that adding the <foo:meta> element breaks CSS child and sibling selectors. I am looking for a way for the CSS selectors to keep working when content is encapsulated like this. We have tried a foo\:meta{display:contents} style, but although it works in terms of hiding the meta element from the box renderer, it doesn't hide it from the selector matcher. We do not produce the HTML/CSS to be processed, so writing them in a certain way beforehand is not an option; they come as they are, generic HTML documents with associated CSS.
Is there a way to achieve what we are looking for using HTML/CSS?
To restate: we are looking for a way to dynamically encapsulate parts of the content in non-visual elements without breaking child and sibling CSS selectors. The elements should only be visible to DOM traversal, such as document.getElementsByTagName('foo:meta').
If I understood your problem correctly, I would suggest using a space between the grandparent and the child instead of '>'. Also, your selector is an id, not a class.
The selector you have put in only selects direct children. Adding the space in between enables you to select grandchildren too!
So all you have to do is this:
#root .quote {
color:green;
}
Let me know if this helped.
A working CSS example is here.
So, after much fiddling and research, we came to the conclusion that this can't be done, even with Shadow DOM, as even that would require massive CSS rewrites that might not preserve semantics.
However, for anyone stumbling upon this question, we reached the same end by employing the following (I'll be short, pointers only):
using two comments to mark where the tag would start/end, instead of an XML tag (eg. <!--<foo:bar data-source="1111">-->...content...<!--</foo:bar>-->)
these markers work more or less like the markup equivalent of a DOM Range, and they can be used together with one.
this approach has the interesting advantage (as opposed to a single node) that it can start and end in different nodes, so it can span subtrees.
But this also breaks the XML structure when you try to recompose it. It is also quite easy, through manipulation, to end up with the range end moving before the range start, with multiple ranges overlapping, and so on.
In order to recompose the document (to send it to a downstream XML processor or a noSQL XML database for cross-referencing), we need to make sure we avoid the XML-breaking manipulations described above. Then one only needs to convert the encapsulated tags back into regular tags, using string manipulation on the document's (X)HTML (innerHTML, outerHTML, XMLSerializer), to get clean XML that can be mined and cross-referenced for content.
We used the TreeWalker API to scan the document for comments; you might need it too, although scanning the document this way can be slow (it works for us, though). If you are bolder, you can try XPath, i.e. document.evaluate('//comment()', document); it seems to work, but we don't trust that all browsers comply.
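As a rough illustration of the recomposition step, assuming hypothetical <foo:bar> comment markers like those above, a single regex pass can turn the comment-encapsulated markers back into real tags before the string is handed to an XML parser. This is only a sketch of the string-manipulation idea, not the authors' actual code:

```java
import java.util.regex.Pattern;

public class CommentMarkers {

    // Turns comment-encapsulated markers back into real tags, e.g.
    // <!--<foo:bar data-source="1111">-->...<!--</foo:bar>-->
    // becomes <foo:bar data-source="1111">...</foo:bar>.
    // The tag name foo:bar is a made-up example; adapt the pattern
    // to whatever marker vocabulary you actually use.
    private static final Pattern MARKER =
        Pattern.compile("<!--(</?foo:bar[^>]*>)-->");

    public static String unwrap(String html) {
        // $1 is the captured tag inside the comment delimiters
        return MARKER.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String in = "<div><!--<foo:bar data-source=\"1111\">-->text<!--</foo:bar>--></div>";
        System.out.println(unwrap(in));
        // <div><foo:bar data-source="1111">text</foo:bar></div>
    }
}
```

This only yields well-formed XML if the start/end markers are properly paired and nested, which is exactly the invariant described above that the manipulation code has to preserve.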
I wish to extract the entire CSS affecting a highlighted div. With complex designs, the CSS is made up of many classes, each adding some CSS. I'm looking for a technique that can strip all of these out and perhaps concatenate them together, like Inspect Element but cleaner.
For example, on this Adobe Experience page (http://www.adobe.com/uk/products/experience-design.html), I wish to select the article div "A new experience in user experience." and then pull out all the CSS attached to a class that affects anything inside it.
There is an ExtractCSS tool that does something similar, but I'm looking for something a bit more intuitive that also ignores all the strikethroughs.
The simplest way is:
Select your element in the developer tools
Run window.getComputedStyle($0).cssText on the Js console
where $0 represents the currently selected DOM element.
Alternatively, if you want to target a specific element with a given class, then do
window.getComputedStyle( document.getElementsByClassName('hero2-align-2 hero2-basis-0')[0] ).cssText
Querying elements by class name might return more than one element; the [0] is there to guarantee only one is processed.
Or by id
window.getComputedStyle( document.getElementById('yourID') ).cssText
<div class="thumbnail-popular" style="background: url('http://images.gogoanime.tv/images/upload/Go!.Princess.Precure.full.1812487.jpg');"></div>
I am trying to get the URL component of this div's style attribute, but I seem to be unable to fetch that specific piece of data from the div.
I have looked into making use of attributes but my attempts have been unsuccessful so far.
Usage of this CSS selector is through Kimonolabs.
div.thumbnail-popular should get you the element you're looking for, unless there is more than one such element, in which case you will need to narrow down your selector.
For example, you will need to find out whether this particular element belongs to a specific parent, is the first, second, ... nth child, or any other information about the surrounding elements on the page you're working with.
The background URL is in the style attribute of this element, so you will need to extract that attribute as described here. However, you will still need to parse the declarations inside the style value in order to get the URL; I am not sure whether it is possible to do this through Kimono, as I am not familiar with it (I'm not sure what its advanced mode really does, and it's difficult to tell from the lone screenshot provided in that help article).
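If your scraper can at least hand you the raw style attribute value, extracting the URL from it is straightforward. Here is a minimal, non-Kimono-specific sketch; the sample declaration is taken from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackgroundUrl {

    // Matches url('...'), url("...") or bare url(...) inside a style declaration
    private static final Pattern URL =
        Pattern.compile("url\\(['\"]?([^'\")]+)['\"]?\\)");

    // Returns the first url(...) value found, or null if there is none
    public static String extract(String style) {
        Matcher m = URL.matcher(style);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String style = "background: url('http://images.gogoanime.tv/images/upload/"
                     + "Go!.Princess.Precure.full.1812487.jpg');";
        System.out.println(extract(style));
    }
}
```

A full CSS parser would be more robust (e.g. for escaped quotes inside the URL), but for typical scraped style attributes a regex like this is usually enough.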
So, on a given website there is, for example, a div element. I want to properly specify the XPath for a given subset of the main content of the page, found in:
<div id="content">, otherwise known as div[3]
Specifically, I want the xpath for the content between the second horizontal-rule (hr) tag and the third horizontal-rule (hr) tag. Which I believe should be, respectively:
'//div[@id="content"]/hr[2]' **AND** '//div[@id="content"]/hr'
I have been reading the XPath tutorial and trying to figure out whether the two hr tags are siblings, which I believe they are. However, Python does not seem to recognize them as such. I have tried every derivation of:
"following-sibling" and "preceding:: and not(preceding::)"
to the point that I no longer know which is which, and what is what. I do know that I am confused, and I believe the script is being confounded by the fact that the second hr of interest is not being numbered/identified as the third hr within the content div (it does not follow logically in the numbering), as it 'should' be according to what Firebug has been telling me.
The bottom line is: how do I properly specify this XPath? Again, these horizontal-rule tags appear to be siblings to me, so I would think specifying the content between them would follow a structure using following-sibling and preceding-sibling.
If you have access to XPath 2.0 you can use intersect to select all nodes between the two:
//hr[2]/following-sibling::node()
intersect
//hr[3]/preceding-sibling::node()
If you only have access to XPath 1.0 functions, you can use this wonderful workaround to achieve the same result:
//hr[2]/following-sibling::node()[
count(.| //hr[3]/preceding-sibling::node())
=
count(//hr[3]/preceding-sibling::node())
]
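The XPath 1.0 workaround above can be checked with the JDK's built-in XPath engine. A minimal sketch on a small well-formed stand-in for the content div (element names and text are made up):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class BetweenHrDemo {

    // The XPath 1.0 intersection workaround: nodes that are both
    // following siblings of the 2nd <hr> and preceding siblings of the 3rd.
    private static final String EXPR =
        "//hr[2]/following-sibling::node()["
      + "count(.|//hr[3]/preceding-sibling::node())"
      + "=count(//hr[3]/preceding-sibling::node())]";

    public static int countBetween(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPR, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Only <p>between</p> sits between the 2nd and 3rd <hr>.
        String xml = "<div id='content'><hr/><p>first</p><hr/>"
                   + "<p>between</p><hr/><p>after</p></div>";
        System.out.println(countBetween(xml)); // 1
    }
}
```

Note that node() also matches text and comment nodes between the two hr elements; restrict the axis steps to * if you only want elements.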