Selecting both links and anchors using XPath - html

I am using screaming frog and I want to do this using XPath.
Extract all links and anchors containing a certain class from the main body content
but I want to exclude all links within div.list
Right now I am trying this, but it's not working too well; I'd also like the output in text form if possible.
//div[@class="page-content"]/*[not(class="list")]//a[@data-wpel-link="internal"]
Anyone got an idea?

This XPath,
//a[@data-wpel-link="internal"][not(ancestor::div[@class="list"])]
will select all a elements with the given attribute value that do not have an ancestor div of the given class.
You can, of course, prefix the expression with any ancestor path to restrict the selection, e.g.:
//div[@class="page-content"]//a[@data-wpel-link="internal"]
[not(ancestor::div[@class="list"])]
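To see what that expression selects, and to get the result out as plain text, here is a minimal Python sketch. It is an illustration only: the sample markup is hypothetical, and because the standard library's ElementTree does not support the ancestor axis, the exclusion is emulated with a recursive walk that never descends into div.list. It also assumes well-formed markup, which real pages often are not; with an XPath 1.0 capable library the expression above could be used directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; only the class and attribute names come from the question.
SAMPLE = """
<div class="page-content">
  <p><a data-wpel-link="internal" href="/a">First</a></p>
  <div class="list">
    <a data-wpel-link="internal" href="/b">Skipped</a>
  </div>
  <p><a data-wpel-link="external" href="http://example.com/">External</a></p>
  <section><a data-wpel-link="internal" href="/c">Third</a></section>
</div>
"""

def internal_links(node, out=None):
    """Collect (href, anchor text) pairs, never descending into div.list."""
    if out is None:
        out = []
    for child in node:
        if child.tag == "div" and child.get("class") == "list":
            continue  # same effect as not(ancestor::div[@class="list"])
        if child.tag == "a" and child.get("data-wpel-link") == "internal":
            out.append((child.get("href"), "".join(child.itertext()).strip()))
        internal_links(child, out)
    return out

root = ET.fromstring(SAMPLE.strip())
links = internal_links(root)
for href, text in links:
    print(href, text)
```

Skipping the whole div.list subtree is equivalent to the ancestor test because any link inside it necessarily has that div as an ancestor.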

Related

Extracting entire CSS affecting a DIV on a web page

I wish to extract the entire CSS affecting a div that is highlighted. With complex designs the CSS is made up of many classes which add in some CSS. Looking for a technique that can strip out and perhaps concatenate all these together, like Inspect Element but cleaner.
For example, on this Adobe Experience page (http://www.adobe.com/uk/products/experience-design.html), I wish to select the article div "A new experience in user experience." and then pull out all the CSS affecting everything inside it that is attached to a class.
There is an ExtractCSS tool that does something similar, but I'm looking for something a bit more intuitive that also ignores all the strikethroughs.
The simplest way is:
Select your element in the developer tools
Run window.getComputedStyle($0).cssText in the JS console
where $0 represents the currently selected DOM element.
Alternatively, if you want to target a specific element with a given class, use
window.getComputedStyle( document.getElementsByClassName('hero2-align-2 hero2-basis-0')[0] ).cssText
Querying elements by class name can return more than one element; the [0] ensures that only the first one is processed.
Or by id:
window.getComputedStyle( document.getElementById('yourID') ).cssText

How to find a child html element by id with jsoup?

I am parsing the HTML of a site with Jsoup. I need to find some HTML elements that have a specific id, but their parent tree is complicating the task. So I would like to know whether it is possible to search for a specific HTML element without first having to search through all of its parents.
For instance, I am currently doing this:
Elements el=elements.select(".scroller.context-inplay").select(".zone.grid-1-1").select(".grid-1").select(".module-placeholder");
I would like to know if there is a simple way to get the same element I get with this code by searching for its id.
The id of an HTML element should be unique within the page. Unfortunately, some HTML found in the wild breaks this requirement. However, if your HTML source follows the standard, you can simply use the # CSS selector to select the element in question:
Element el = doc.select("#someID").first();
Alternatively, you can use Jsoup's getElementById method directly:
Element el = doc.getElementById("someID");
Also, if you decide to go by class names as you suggest in your question, it is easy to combine all selects into one selector:
Elements els = elements.select(".scroller.context-inplay .zone.grid-1-1 .grid-1 .module-placeholder");
The spaces in the CSS selector are descendant combinators: whatever matches the sub-selector to the right of a space must be a descendant (not necessarily a direct child) of whatever matches the left side.

CSS selector select by div class attributes

<div class="thumbnail-popular" style="background: url('http://images.gogoanime.tv/images/upload/Go!.Princess.Precure.full.1812487.jpg');"></div>
I am trying to get the url component of this div class but I seem to be unable to fetch that specific data in the div class.
I have looked into making use of attributes but my attempts have been unsuccessful so far.
Usage of this CSS selector is through Kimonolabs.
div.thumbnail-popular should get you the element you're looking for — unless there is more than one such element, in which case you will need to narrow down your selector.
For example you will need to find out if this particular element belongs to a specific parent, or is the first, second, ... nth child, or any other information about the surrounding elements in the page that you're working with.
The background URL is in a style attribute on this element, so you will need to extract that attribute as described here. However you will still need to parse the declarations inside the style value in order to get the URL; I am not sure if it is possible to do this through kimono as I am not familiar with it (I'm not sure what its advanced mode really does, and it's difficult to tell from the lone screenshot that is provided in that help article).
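Once the style attribute value has been extracted by whatever tool is in use, pulling the URL out of it is a small string-parsing step. The following is a rough Python sketch of that step only; the regular expression is an assumption that covers the common url(...) forms (single-quoted, double-quoted, or unquoted) and is not a full CSS parser.

```python
import re

# The style value from the div in the question.
style = "background: url('http://images.gogoanime.tv/images/upload/Go!.Princess.Precure.full.1812487.jpg');"

# Match url(...), with optional single or double quotes around the address.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")

match = URL_RE.search(style)
background_url = match.group(2) if match else None
print(background_url)
```

If the style value could contain several url(...) declarations, URL_RE.finditer would return all of them instead of just the first.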

What HTML element(s) to use to organize the page?

I have a search page that contains 4 groups of elements:
The field where user types the keyword (event name)
The filters such as dates (from/to), city, place
The category filters (check boxes): concert, theater, musical, show and so on
Top 20 events
Plus "Search" button.
So, I'm trying to figure out the right way to organize the page. Is it better to use "div" or "section" or something else and why?
I found this nice text in the w3-documentation:
The section element is not a generic container element. When an
element is needed only for styling purposes or as a convenience for
scripting, authors are encouraged to use the div element instead. A
general rule is that the section element is appropriate only if the
element's contents would be listed explicitly in the document's
outline.
I hope this helps.
It's up to you. If you want to stick with HTML5 standards, use section and similar elements; but you're not wrong to use divs when creating layouts.

Proper xpath specification between two children within the same parent?

So, on a given website, for example, there is a div element. I want to properly specify the xpath for a given sub-set of the main content of the page, found in:
<div[@id="content"> otherwise known as <div[3]>
Specifically, I want the xpath for the content between the second horizontal-rule (hr) tag and the third horizontal-rule (hr) tag. Which I believe should be, respectively:
'//div[@id="content"]/hr[2]' **AND** '//div[@id="content"]/hr'
I have been reading the XPath Tutorial and trying to figure out if the two hr tags are siblings or not, which I believe they are. However, Python does not seem to be recognizing them as such. I have tried every derivation of:
"following-sibling" and "preceding:: and not(preceding::)"
to the point that I no longer know which is which, and what is what. I do know that I am confused, and I believe the script is being confounded by the fact that the second hr of interest is not being numbered/identified as the third hr within the content/div (does not follow logically in numbering) as it 'should' be... according to what Firebug has been telling me.
The bottom line is: How do I properly specify this xpath? Again, these horizontal-rule tags appear to be siblings to me so I would think it would follow a structure such as following-sibling & preceding-sibling to specify the content between these two tags.
If you have access to XPath 2.0, you can use the intersect operator to select all nodes between the two:
//hr[2]/following-sibling::node()
intersect
//hr[3]/preceding-sibling::node()
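If you only have access to XPath 1.0, you can use this wonderful workaround (known as the Kayessian method for node-set intersection) to achieve the same result. A node belongs to the second set exactly when adding it to that set does not change the set's size: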
//hr[2]/following-sibling::node()[
count(.| //hr[3]/preceding-sibling::node())
=
count(//hr[3]/preceding-sibling::node())
]
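The counting trick in the XPath 1.0 version can be illustrated outside XPath. The Python sketch below is an assumption-laden model rather than a way to run the expression: the markup is hypothetical, and the standard library's ElementTree exposes only elements (not text nodes), so it corresponds to following-sibling::* rather than following-sibling::node(). It builds the two sibling sets by hand and then applies the same "union does not grow the set" membership test.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; only the three hr separators matter.
SAMPLE = """
<div id="content">
  <h1>Title</h1><hr/>
  <p>intro</p><hr/>
  <p>wanted 1</p><ul><li>wanted 2</li></ul><hr/>
  <p>footer</p>
</div>
"""

content = ET.fromstring(SAMPLE.strip())
children = list(content)
hr_positions = [i for i, el in enumerate(children) if el.tag == "hr"]

# A: following siblings of the second hr; B: preceding siblings of the third hr.
after_second = set(children[hr_positions[1] + 1:])
before_third = set(children[:hr_positions[2]])

# A node of A is also in B exactly when adding it to B does not grow B,
# which is what count(. | B) = count(B) expresses in XPath 1.0.
between = [el for el in children
           if el in after_second
           and len({el} | before_third) == len(before_third)]

print([el.tag for el in between])
```

Iterating over children rather than over the set keeps the result in document order, as an XPath node-set would be.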