According to this answer:
HTML 4.01 specifies that <a> elements
may only contain inline elements. A
<div> is a block element, so it may
not appear inside an <a>.
But...
HTML5 allows <a> elements to contain
blocks.
Well, I just tried selecting a <div class="m"> within an <a> block, using:
Elements elems = a.select("m");
and elems comes back empty, despite the div being there.
So I am thinking: Either I am not using the correct syntax for selecting a div within an a or... Jsoup doesn't support this HTML5-only feature?
What is the right Jsoup syntax for selecting a div within an a?
Update: I just tried
Elements elems = a.getElementsByClass("m");
And Jsoup had no problems with it (i.e. it returns the correct number of such divs within a).
So my question now is: Why?
Why does a.getElementsByClass("m") work whereas a.select("m") doesn't?
Update: I just tried, per @Delan Azabani's suggestion:
Elements elems = a.select(".m");
and it worked. So basically the a.select() works but I was missing the . in front of the class name.
The select function takes a CSS selector. If you pass 'm' as the argument, it will look for <m> elements that are descendants of the a element. You need to pass '.m' instead, which finds elements with the class m under the a element.
The current version of jsoup (1.5.2) does support div tags nested within a tags.
In situations like this I suggest printing out the parse tree, to check that jsoup has parsed the HTML the way you expect and, if it hasn't, to work out the correct selector to use.
E.g.:
Document doc = Jsoup.parse("<a href='./'><div class=m>Check</div></a>");
System.out.println("Parse tree:\n" + doc);
Elements divs = doc.select("a .m");
System.out.println("\nDiv in A:\n" + divs);
Gives:
Parse tree:
<html>
<head></head>
<body>
<a href="./">
<div class="m">
Check
</div></a>
</body>
</html>
Div in A:
<div class="m">
Check
</div>
A document has several <div class="ok"> tags. I am able to select all of them with
"//*[@class="ok"]" (I don't have to specify div, because only div tags have this class). I get a list of 6 nodes matching this.
Now, I need
either to test each node in order to see if it includes the tag <a href="soft://an.id/">. This inclusion is not direct: the <div> includes a <table> with many <tr>, <td> and <span> elements, and the <a> (only one, or none) sits somewhere before </div>;
or to directly select only the (div) nodes of class="ok" that include this <a> tag.
I have tried many things, and they all fail, including escaping the "/" in the href test (is that required?).
I am quite familiar with regular expressions, but I must confess that I find XPath syntax even harder to understand. And the W3C reference documents are so hard to follow without examples.
Any hints are welcome.
To select only the <div class="ok"> elements that contain an <a href="soft://an.id/"> descendant, you can use the following XPath locator:
"//div[@class='ok' and .//a[@href='soft://an.id/']]"
If I understand you correctly, you have an <a> nested somewhere under the div with class "ok", right?
In XPath, a single / selects only direct children of the current node. If you are looking for the <a> anywhere under the found div, you need to use // instead:
//div[@class="ok"]//a[@href="soft://an.id/"]
Then you need to check if it exists or not by using some kind of an assertion.
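A minimal sketch of how both expressions above could be checked, assuming Java's built-in javax.xml.xpath and well-formed markup (real-world HTML would first need an HTML parser that produces a DOM). The class name OkDivFilter and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OkDivFilter {
    // Hypothetical sample: two "ok" divs, only the first has the link nested deep inside.
    static final String SAMPLE =
        "<root>"
      + "<div class='ok'><table><tr><td><span>"
      + "<a href='soft://an.id/'>x</a></span></td></tr></table></div>"
      + "<div class='ok'><table><tr><td>no link</td></tr></table></div>"
      + "</root>";

    // Parses the XML and returns how many nodes the XPath expression selects.
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // All divs of class "ok".
        System.out.println(count(SAMPLE, "//*[@class='ok']")); // prints 2
        // Only the "ok" divs that contain the link at any depth.
        System.out.println(count(SAMPLE,
            "//div[@class='ok' and .//a[@href='soft://an.id/']]")); // prints 1
        // The link itself; a count of 0 would mean it does not exist.
        System.out.println(count(SAMPLE,
            "//div[@class='ok']//a[@href='soft://an.id/']")); // prints 1
    }
}
```

The first form returns the div nodes themselves, while the second returns the <a> nodes, so which one to use depends on what you want to do with the result. Note that the "/" characters inside the quoted href value need no escaping.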
I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.
However, I came into an issue when testing (on this page in particular where the name was being set as "Electronics" from the Breadcrumbs schema) as there were ancestor elements with different itemtypes/schema.
I first have this variable declared to check if the page has an element using the Product schema microdata.
var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');
I then wanted to select for all elements with the itemprop attribute. e.g.
productMicrodata.querySelectorAll('[itemprop]');
The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.
I figured I would then just be able to do something like this:
productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');
However this is still returning matches for the child elements having ancestor elements with different itemscope attributes (e.g. breadcrumbs).
I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.
EDIT: For clarification of where the element(s) are that I'm trying to avoid matching with are, here's what the DOM looks like on the example page linked. I'm trying to ignore the elements that have any ancestors with itemtype attributes.
EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|
EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the javascript Element.closest() method. e.g.
let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};
for (let i = 0; i < productMicrodata.length; i++) {
  // Nearest element (self or ancestor) that declares an itemtype; null if there is none.
  let scope = productMicrodata[i].closest('[itemtype]');
  let itemType = scope ? scope.getAttribute('itemtype') : null;
  if (itemType === "http://schema.org/Product" || itemType === "https://schema.org/Product") {
    itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
  }
}
console.log(itemProp);
:not([itemscope]) [itemprop] means:
An element with an itemprop attribute that has any ancestor without an itemscope attribute.
So:
<div>
  <div itemscope>
    <div itemprop> <!-- this one -->
    </div>
  </div>
</div>
… would match because while the parent element has the attribute, the grandparent does not.
You need to use the child combinator to eliminate elements with matching parent elements:
:not([itemscope]) > [itemprop]
[...] help on how I can achieve only selecting elements that have only
the itemtype="http://schema.org/Product" attribute would be much
appreciated.
Attribute selectors can take explicit values:
[myAttribute="myValue"]
So the syntax for this would be:
var productMicrodata = document.querySelectorAll('[itemtype="http://schema.org/Product"]');
A while ago there was a term that I remembered that described two categories of elements. I forgot the term and I want to know what that term was. The information I can remember is that the first category of elements get their values from within HTML like <p> or <a> or <ul> but there is another category of elements which get their values from "outside" of HTML like <img> or <input type="textbox">. I want to know the terminology for these types.
Edit - I've gone through Zomry's, Difster's and BoltClock's answers and didn't find it. Then I remembered an extra piece of information and decided to add it: the two categories are lazy opposites of each other. For example, if one is called xyz, then the other is called non-xyz.
Probably you mean replaced elements (and non-replaced, respectively)?
However, the distinction between them is not entirely unambiguous. For example, form controls were traditionally considered replaced elements, but the HTML spec currently explicitly lists them as non-replaced (introducing the "widget" term instead).
The HTML specification mentions for tags like <img> and <input> the following: Tag omission in text/html: No end tag.
Tags with an end tag are defined as: Tag omission in text/html: Neither tag is omissible.
So as far as I can find, the HTML spec does not define a technical name for this, apart from void versus normal elements, so what Watilin pointed out in the comments should be fine: standalone vs containers.
As an added side-note: HTML has a lot more HTML content categories. You can find a complete overview at the HTML spec here: https://html.spec.whatwg.org/multipage/indices.html#element-content-categories
Also interesting to read to visualize that a bit better: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
Elements whose contents are defined by text and/or other elements between their start and end tags don't have a special category. Even the HTML spec just calls them normal elements for the most part in section 8.1.2.
Elements whose primary values are defined by attributes and that cannot have content between their tags are called void elements. img and input are indeed two examples of void elements. Note that void elements are not to be confused with empty elements; see the following questions for more details on that:
Are void elements and empty elements the same?
HTML "void elements" so called because without content?
<input type="text" id="someField" name="someField">
With an input selector, you can get a value from it like so (with jQuery):
$("#someField").val();
Whereas with a paragraph or a div, you don't get a value, you get the text or html.
<div id="someDiv">Blah, blah, blah</div>
You can get that with jQuery as follows:
$("#someDiv").html();
Do you see the difference?
I have an html file which is like:
<div id='author'>
<div>
<div>
...
<a> John Doe </a>
I do not know how many div's would be under the author div. It may have different depth for different pages.
So what would be the XPath expression for this kind of xml?
By the way, I tried:
//div[@id = "author"]/*/a/text()
but this only seems to work for grandchildren of the author div.
Use double slash to find an a element anywhere inside the div element with id="author":
//div[@id = "author"]//a/text()
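To see the difference between /*/ and //, here is a small sketch using Java's built-in javax.xml.xpath. It assumes the markup is well-formed XML; the class name AuthorLookup and the sample string are hypothetical, with the <a> placed three levels below the author div.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AuthorLookup {
    // Hypothetical sample: the <a> is a great-grandchild of the author div.
    static final String SAMPLE =
        "<div id='author'><div><div><a> John Doe </a></div></div></div>";

    // Returns the trimmed string value of the first match, or "" if nothing matches.
    static String firstText(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc)
                .trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // /*/a only reaches <a> elements exactly two levels down, so this is empty here.
        System.out.println("[" + firstText(SAMPLE, "//div[@id='author']/*/a/text()") + "]");
        // //a reaches <a> elements at any depth.
        System.out.println("[" + firstText(SAMPLE, "//div[@id='author']//a/text()") + "]");
    }
}
```

With this sample the first line prints [] and the second prints [John Doe].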
I am having trouble with XPath. I have an HTML page with a complicated structure, and I want to select ALL hrefs in a particular div, regardless of nesting depth.
Why doesn't the following code work, and what can I do to fix it?
//*[@id='some_id']//*//a
Matching @href attributes
Select all @href attributes, not all anchor tags.
//*[@id='some_id']//@href
If you only want to match the @href attributes of anchor tags, use this query, which selects all anchor tags inside the "some_id" element and then their @href attributes.
//*[@id='some_id']//a/@href
// and the descendant-or-self-axis
I'm not sure what you wanted to achieve with the .//*//a construct. This is an abbreviation for
./descendant-or-self::node()/child::*/descendant-or-self::node()/child::a
so there must be some element in-between. If the anchor tag is directly contained within the @id='some_id' element, it will not be found, for example for this input:
<div id='some_id'><a>bar</a></div>
//*[@id='some_id']//a would have matched this element.
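The effect of the extra //* step can be checked with a short sketch like the one below. It uses Java's built-in javax.xml.xpath and assumes well-formed markup; the class name AnchorDepth, the sample input and its href values are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AnchorDepth {
    // Hypothetical sample: one <a> directly inside the div, one nested inside a <p>.
    static final String SAMPLE =
        "<div id='some_id'>"
      + "<a href='foo'>bar</a>"
      + "<p><a href='baz'>qux</a></p>"
      + "</div>";

    // Parses the XML and returns how many nodes the XPath expression selects.
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Requires an element between the div and the <a>: misses the direct child.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//*//a")); // prints 1
        // Matches <a> at any depth below the div.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//a")); // prints 2
        // Selects the href attribute nodes of those anchors.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//a/@href")); // prints 2
    }
}
```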
// addresses the entire descendant axis, so this is sufficient:
//*[@id='some_id']//a
Otherwise, you wouldn't get a elements that are immediate descendants of the element addressed with //*[@id='some_id']. (If your environment recognizes id attributes as being IDs, you can also address this element with id('some_id').)
But your problem is likely to be something different. //a usually addresses all a elements in the null namespace. Possibly your a elements aren't in the null namespace but in the XHTML namespace. You could match them like
//*[@id='some_id']//*[local-name()='a' and namespace-uri()='http://www.w3.org/1999/xhtml']
or, if you only have to expect HTML elements anyway
//*[@id='some_id']//*[local-name()='a']
or in XPath 2.0 even simpler
//*[@id='some_id']//*:a
Depending on your environment, you can also register a namespace prefix so that you can do something like
//*[@id='some_id']//html:a
in both XPath 1.0 and 2.0.
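As a rough illustration of the namespace issue, the sketch below parses an XHTML-namespaced document with a namespace-aware parser and compares the expressions. It assumes Java's built-in javax.xml.xpath, which implements XPath 1.0 only, so the *:a form is not shown. The class name NamespacedAnchors, the sample markup and the choice of "html" as prefix are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespacedAnchors {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample: every element lives in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns='" + XHTML_NS + "'><body>"
      + "<div id='some_id'><p><a href='foo'>bar</a></p></div>"
      + "</body></html>";

    // Maps the (arbitrarily chosen) prefix "html" to the XHTML namespace.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("html").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    // Parses the XML namespace-aware and counts the nodes the expression selects.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed name test: only matches elements in no namespace.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//a")); // prints 0
        // Namespace-agnostic match on the local name.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//*[local-name()='a']")); // prints 1
        // Match through the registered prefix.
        System.out.println(count(SAMPLE, "//*[@id='some_id']//html:a")); // prints 1
    }
}
```

The @id test needs no prefix because unprefixed attributes are not in any namespace.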