CSS: Select elements with only one parent matching attribute selector - html

I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.
However, I ran into an issue when testing (on this page in particular, where the name was being set to "Electronics" from the Breadcrumbs schema), as there were ancestor elements with different itemtypes/schemas.
I first have this variable declared to check if the page has an element using the Product schema microdata.
var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');
I then wanted to select for all elements with the itemprop attribute. e.g.
productMicrodata.querySelectorAll('[itemprop]');
The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.
I figured I would then just be able to do something like this:
productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');
However this is still returning matches for the child elements having ancestor elements with different itemscope attributes (e.g. breadcrumbs).
I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.
EDIT: To clarify where the elements I'm trying to avoid matching are, here's what the DOM looks like on the example page linked. I'm trying to ignore the elements that have any ancestors with other itemtype attributes.
EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|
EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the javascript Element.closest() method. e.g.
let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};
for (let i = 0; i < productMicrodata.length; i++) {
  // closest() returns null when neither the element nor any ancestor has an itemtype
  let scope = productMicrodata[i].closest('[itemtype]');
  let itemType = scope ? scope.getAttribute('itemtype') : null;
  if (itemType === "http://schema.org/Product" || itemType === "https://schema.org/Product") {
    itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
  }
}
console.log(itemProp);
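A variation on the same idea, as a rough sketch rather than a definitive solution: instead of testing every [itemprop] element after the fact, walk down from the Product element and stop descending whenever a nested itemscope is reached. The name collectOwnItemprops is made up for illustration; the function assumes only children, hasAttribute, getAttribute and textContent, which DOM elements provide.

```javascript
// Hypothetical helper (not from the original post): collects the itemprop
// values that belong directly to `scopeRoot`, skipping nested items.
function collectOwnItemprops(scopeRoot) {
  const props = {};
  // Reverse so that pop() visits elements in document order.
  const stack = Array.from(scopeRoot.children).reverse();
  while (stack.length > 0) {
    const el = stack.pop();
    if (el.hasAttribute('itemprop')) {
      props[el.getAttribute('itemprop')] = el.textContent;
    }
    // A nested itemscope starts a new item (e.g. a breadcrumb ListItem),
    // so its descendants are not properties of the outer item.
    if (!el.hasAttribute('itemscope')) {
      stack.push(...Array.from(el.children).reverse());
    }
  }
  return props;
}
```

In a browser this could be called as collectOwnItemprops(productMicrodata), where productMicrodata is the single element returned by the document.querySelector call near the top of the question. An element that carries both itemprop and itemscope is recorded with its textContent, but its children are not visited.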

:not([itemscope]) [itemprop] means:
An element with an itemprop attribute that has at least one ancestor without an itemscope attribute.
So:
<div>
<div itemscope>
<div itemprop> <!-- this one -->
</div>
</div>
</div>
… would match because while the parent element has the attribute, the grandparent does not.
You can use the child combinator to exclude elements whose parent has the attribute (this only checks the parent, not ancestors further up):
:not([itemscope]) > [itemprop]

[...] help on how I can achieve only selecting elements that have only
the itemtype="http://schema.org/Product" attribute would be much
appreciated.
Attribute selectors can take explicit values:
[myAttribute="myValue"]
So the syntax for this would be:
document.querySelectorAll('[itemtype="http://schema.org/Product"]');

Related

Terminology - The types of elements in HTML

A while ago there was a term that I remembered that described two categories of elements. I forgot the term and I want to know what that term was. The information I can remember is that the first category of elements get their values from within HTML like <p> or <a> or <ul> but there is another category of elements which get their values from "outside" of HTML like <img> or <input type="textbox">. I want to know the terminology for these types.
Edit - I've gone through Zomry, Difster and BoltClock's answers and didn't find what I was looking for. So I remembered an extra piece of information and decided to add it. The two categories are lazy opposites of each other. For example, if one is called xyz, then the other is called non-xyz.
Probably you mean replaced elements (and non-replaced, respectively)?
However, the distinction between them is not so unambiguous. For example, form controls were traditionally considered replaced elements, but the HTML spec currently explicitly lists them as non-replaced (introducing the "widget" term instead).
The HTML specification mentions for tags like <img> and <input> the following: Tag omission in text/html: No end tag.
Tags with an end tag are defined as: Tag omission in text/html: Neither tag is omissible.
So as far as I can find, the HTML spec does not define a technical name for this, apart from void versus normal elements, so what Watilin pointed out in the comments should be fine: standalone vs containers.
As an added side note: HTML has many more content categories. You can find a complete overview in the HTML spec here: https://html.spec.whatwg.org/multipage/indices.html#element-content-categories
Also interesting to read to visualize that a bit better: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
Elements whose contents are defined by text and/or other elements between their start and end tags don't have a special category. Even the HTML spec just calls them normal elements for the most part in section 8.1.2.
Elements whose primary values are defined by attributes and that cannot have content between their tags are called void elements. img and input are indeed two examples of void elements. Note that void elements are not to be confused with empty elements; see the following questions for more details on that:
Are void elements and empty elements the same?
HTML "void elements" so called because without content?
<input type="text" id="someField" name="someField">
With an input selector, you can get a value from it like so (with jQuery):
$("#someField").val();
Whereas with a paragraph or a div, you don't get a value, you get the text or html.
<div id="someDiv">Blah, blah, blah</div>
You can get that with jQuery as follows:
$("#someDiv").html();
Do you see the difference?

HTML: element breakdown

Suppose I have the following HTML element:
<foo spam="eggs">bar</foo>
I know that foo is the 'tag', but what are the technical names for spam, eggs, and bar?
spam="eggs" is attribute (as a whole) and bar is the child node.
You can also break the attribute down to attribute name and attribute value.
spam is an attribute name
"eggs" is the attribute spam's value
and bar is a child node, in this case, a child of type textNode. Children can also be "elements" (aka tags).
Read more on elements (aka tags) here: http://www.w3schools.com/html/html_elements.asp
And on attributes here: http://www.w3schools.com/html/html_attributes.asp
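To make the breakdown concrete, here is a small sketch that splits the example element into the parts named above. It uses a regular expression purely for illustration; that is an assumption of this sketch, not a way to parse real HTML, and it only handles this one-attribute, text-only case. The variable names are made up.

```javascript
// Illustration only: a regular expression is not a real HTML parser.
// It handles just this one-attribute, text-only example.
const html = '<foo spam="eggs">bar</foo>';
const pattern = /^<([a-z][\w-]*)\s+([\w-]+)="([^"]*)">([^<]*)<\/\1>$/i;

// Capture groups, in order: tag name, attribute name, attribute value, text content.
const [, tagName, attributeName, attributeValue, textContent] = html.match(pattern);
const parts = { tagName, attributeName, attributeValue, textContent };
console.log(parts);
```

Here tagName is the tag, attributeName and attributeValue together make up the attribute, and textContent is the text of the single child node.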
I recommend this one and the next chapter: http://www.w3schools.com/html/html_elements.asp
In any case, these are called attributes, values and content.
EDIT: whoa, ninjas abound.

Parsing awful HTML: How do I recognize boundaries with xpath?

This is almost going to sound like a joke, but I promise you this is real life. There is a site on the internet, one which you have all used, that does not believe in css classes. Everything is defined directly in the style tag on an element. It's horrifying.
My problem though is that it also makes the html extraordinarily difficult to parse. The structure that I've got to go on looks something like this:
<td>
<a name="<random_string>"></a>
<div style="generic-style, used by other elements">
<div style="similarly generic style">{some_stuff}</div>
</div>
<a name="<random_string>"></a>
...
</td>
Basically, I've got these a tags that are forming the boundaries of the reviews, whose only defining information is the random string that is their name. I don't actually care about the anchor tags, but I would like to grab the reviews between them using xpath.
I've looked into sibling queries, but they don't seem to be well suited for alternating boundaries. I also looked into the Kayessian method of xpath queries, which (aside from having an awesome name) only seems well suited to grab a particular div, rather than all divs between the anchor tags.
Any thoughts on how I could grab the divs here?
If //td/div[../a[@name]] works for you, then the following should also work:
//td[a/@name]/div
This way you don't need to go back and forth (or rather down and up). For a more specific selector, you may want to try the following:
//td/div[preceding-sibling::*[1][self::a/@name]][following-sibling::*[1][self::a/@name]]
The XPath selects div elements having all the following properties:
td/div : is a child of a <td> element
[preceding-sibling::*[1][self::a/@name]] : preceded directly by an <a> element having a name attribute
[following-sibling::*[1][self::a/@name]] : followed directly by an <a> element having a name attribute
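The same three conditions can be mirrored in plain JavaScript, which may help when checking what the predicates actually test. This is only a sketch: the function names isNamedAnchor and divsBetweenAnchors are invented here, and the code assumes DOM-like nodes exposing children, tagName, hasAttribute, previousElementSibling and nextElementSibling.

```javascript
// Hypothetical helper: true for an <a> element that has a name attribute,
// like the [self::a/@name] test.
function isNamedAnchor(el) {
  return el !== null && el.tagName.toLowerCase() === 'a' && el.hasAttribute('name');
}

// Keep a <div> child of `td` only when its nearest element siblings on
// both sides are named anchors, like the two sibling predicates.
function divsBetweenAnchors(td) {
  return Array.from(td.children).filter((el) =>
    el.tagName.toLowerCase() === 'div' &&
    isNamedAnchor(el.previousElementSibling) &&
    isNamedAnchor(el.nextElementSibling));
}
```

A div whose neighbour on either side is something other than a named anchor is dropped, which is what makes this stricter than only checking that the td contains an anchor somewhere.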
I figured it out! It turns out that xpath will allow for relative attribute assertions. I am not sure if this behavior is desired, but it happens to work in this case! Here's the xpath:
//td/div[../a[@name]]
Nice and clean, the ../a[@name] basically just says:
Go up a level, and make sure on that level of the hierarchy there's an a element with a name attribute

XPath to select all href's in element

I have some trouble with XPath. I have an HTML page with a complicated structure and I want to select ALL hrefs in a particular div, regardless of the depth of nesting.
Why doesn't the following code work, and what can I do to fix it?
//*[@id='some_id']//*//a
Matching @href attributes
Select all @href attributes, not all anchor tags.
//*[@id='some_id']//@href
If you only want to match the @href attributes of anchor tags, go for this query, which selects all anchor tags inside that "some_id" element, and then their @href attributes.
//*[@id='some_id']//a/@href
// and the descendant-or-self-axis
I'm not sure what you wanted to achieve with the .//*//a construct. This is an abbreviation for
./descendant-or-self::node()/child::*/descendant-or-self::node()/child::a
so there must be some element in-between. If the anchor tag is directly contained within the @id='some_id' element, it will not be found, for example for this input:
<div id='some_id'><a>bar</a></div>
//*[@id='some_id']//a would have matched this element.
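The difference between the two forms can be imitated with a couple of small tree walks. This is a sketch under assumptions, not how an XPath engine is implemented: the helper names are invented, and the nodes only need children and tagName as DOM elements have.

```javascript
// Hypothetical helpers that imitate the two queries on DOM-like nodes.
function descendantElements(node) {
  const out = [];
  for (const child of node.children) {
    out.push(child, ...descendantElements(child));
  }
  return out;
}

const isAnchor = (el) => el.tagName.toLowerCase() === 'a';

// Like //a relative to `root`: anchors at any depth, direct children included.
function anchorsAnyDepth(root) {
  return descendantElements(root).filter(isAnchor);
}

// Like //*//a relative to `root`: some element must sit between root and the anchor.
function anchorsBelowSomeElement(root) {
  const found = new Set();
  for (const between of descendantElements(root)) {
    for (const a of anchorsAnyDepth(between)) found.add(a);
  }
  return Array.from(found);
}
```

For an anchor that is a direct child of root, anchorsAnyDepth returns it while anchorsBelowSomeElement does not, which matches the behaviour described above.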
// addresses the entire descendant axis, so this is sufficient:
//*[@id='some_id']//a
Otherwise, you wouldn't get a elements that are immediate descendants of the element addressed with //*[@id='some_id']. (If your environment recognizes id attributes as being IDs, you can also address this element with id('some_id').)
But your problem is likely to be something different. //a usually addresses all a elements in the null namespace. Possibly your a elements aren't in the null namespace but in the XHTML namespace. You could match them like
//*[@id='some_id']//*[local-name()='a' and namespace-uri()='http://www.w3.org/1999/xhtml']
or, if you only have to expect HTML elements anyway
//*[@id='some_id']//*[local-name()='a']
or in XPath 2.0 even simpler
//*[@id='some_id']//*:a
Depending on your environment, you can also register a namespace prefix so that you can do something like
//*[@id='some_id']//html:a
in both XPath 1.0 and 2.0.

Jsoup: <div> within an <a>

According to this answer:
HTML 4.01 specifies that <a> elements
may only contain inline elements. A
<div> is a block element, so it may
not appear inside an <a>.
But...
HTML5 allows <a> elements to contain
blocks.
Well, I just tried selecting a <div class="m"> within an <a> block, using:
Elements elems = a.select("m");
and elems comes back empty, despite the div being there.
So I am thinking: Either I am not using the correct syntax for selecting a div within an a or... Jsoup doesn't support this HTML5-only feature?
What is the right Jsoup syntax for selecting a div within an a?
Update: I just tried
Elements elems = a.getElementsByClass("m");
And Jsoup had no problems with it (i.e. it returns the correct number of such divs within a).
So my question now is: Why?
Why does a.getElementsByClass("m") work whereas a.select("m") doesn't?
Update: I just tried, per @Delan Azabani's suggestion:
Elements elems = a.select(".m");
and it worked. So basically the a.select() works but I was missing the . in front of the class name.
The select function takes a selector. If you pass 'm' as the argument, it'll try to find m elements that are descendants of the a element. You need to pass '.m' as the argument, which will find elements with the m class under the a element.
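The tag-versus-class distinction can be shown with a toy matcher. This is only an illustration of the two selector forms discussed here and says nothing about how jsoup is implemented; matchesSimpleSelector and the classNames field are invented for the example.

```javascript
// Illustration only: a toy matcher for two selector forms, "name" (tag)
// and ".name" (class). Real selector engines support far more.
function matchesSimpleSelector(el, selector) {
  if (selector.startsWith('.')) {
    return el.classNames.includes(selector.slice(1));
  }
  return el.tagName === selector;
}

// Stand-in for <div class="m">.
const div = { tagName: 'div', classNames: ['m'] };
console.log(matchesSimpleSelector(div, 'm'));  // false: the element is a div, not an <m>
console.log(matchesSimpleSelector(div, '.m')); // true: the div has class "m"
```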
The current version of jsoup (1.5.2) does support div tags nested within a tags.
In situations like this I suggest printing out the parse tree, to ensure that jsoup has parsed the HTML like you expect or, if it hasn't, to work out the correct selector to use.
E.g.:
Document doc = Jsoup.parse("<a href='./'><div class=m>Check</div></a>");
System.out.println("Parse tree:\n" + doc);
Elements divs = doc.select("a .m");
System.out.println("\nDiv in A:\n" + divs);
Gives:
Parse tree:
<html>
<head></head>
<body>
<a href="./">
<div class="m">
Check
</div></a>
</body>
</html>
Div in A:
<div class="m">
Check
</div>