How to select <div class="ok">.....<a href="soft://an.id/">...</div> nodes? - html

A document has several <div class="ok"> tags. I am able to select all of them with
"//*[#class="ok"]" (i don't have to specify div, because only div tags have this class). I get a list of 6 nodes matching this.
Now, i need
either to test each node in order to see if it includes the tag <a href="soft://an.id/">. This inclusion is not direct. I mean, the <div> includes a <table> with many <tr> and <td> and <span>, and the <a..> (only one, or none) somewhere before </div>.
or to directly select only (div) nodes of class="ok" that include this <a> tag.
I have tried many things, that all fail. Including protecting the "/" in the href detection (is it required?).
I am quite familiar with regular expressions, but i must confess that i find XPath syntax even harder to understand.. And the W3C reference documents are so hard, without examples..
Any hints are welcome.

In order to select only <div class="ok"> element containing <a href="soft://an.id/"> child element you can use the following XPath locator:
"//div[#class='ok' and .//a[#href='soft://an.id/']]"

If I understand you correctly, you have a nested somewhere under the div with class "ok", right?
So in xpath, the a / is meant for a direct locator under/above the current tag. If you are looking for the somewhere under the found div, you need to use:
//div[#class="ok"]//a[#href="soft://an.id/"]
Then you need to check if it exists or not by using some kind of an assertion.

Related

Correct Microdata syntax for breadcrumbs NOT in a list?

Trying to determine the correct syntax for using Microdata inside my breadcumbs implementation. Everything I have read seems to lean towards the fact that the breadcrumbs are structured inside an ordered or unorderd list. Mine is not.
<body itemscope="" itemtype="http://schema.org/WebPage">
...
<div class="breadcrumbs" itemprop="breadcrumb">
Home
<span class="delimiter"> > </span>
Parent Item
<span class="delimiter"> > </span>
<span>Child</span>
</div>
...
</body>
If I run it inside Google's tool it seems correct, but compared to their example it is missing a lot of elements and doesn't have the structure of their example BreadcrumbList.
I'm also a little confused about the correct properties for the links. Should they all have title and url properties?
I was looking at the examples at the bottom of the page here: http://schema.org/WebPage
The breadcrumb property expects one of two values:
Text
BreadcrumbList
If you provide a Text value (like you do in the example), you can’t provide data about each link. If you are fine with that, the Microdata in your example is correct (but it also contains RDFa, which doesn’t seem to make sense, at least not without further context; so if you didn’t add them intentionally, you might want to remove the property attributes).
If you want to provide data about each link, you have to provide a BreadcrumbList value.
For the Microdata, it doesn’t matter whether or not you use a list. If the example uses ol→li→a→span, you could as well use something like div→span→a→span. You just have to make sure to use the correct element type.
If you can’t add parent elements to the a elements, it’s still possible to use BreadcrumbList. But then you would have to duplicate the URL with a link element inside the a element.

Terminology - The types of elements in HTML

A while ago there was a term that I remembered that described two categories of elements. I forgot the term and I want to know what that term was. The information I can remember is that the first category of elements get their values from within HTML like <p> or <a> or <ul> but there is another category of elements which get their values from "outside" of HTML like <img> or <input type="textbox">. I want to know the terminology for these types.
Edit - I've went through Zomry, Difster and BoltClock's answers and didn't get anything. So I remembered some extra piece of information and decided to add it. The two categories are Lazy Opposites of each other. For example if one is called xyz, then the other is called non-xyz.
Probably you mean replaced elements (and non-replaced, respectively)?
However, the distinction between them is not so unambigous. For example, form controls were traditionally considered replaced elements, but the HTML spec currently explicitly lists them as non-replaced (introducing the "widget" term instead).
The HTML specification mentions for tags like <img> and <input> the following: Tag omission in text/html: No end tag.
Tags with an end tag are defined as: Tag omission in text/html: Neither tag is omissible.
So as far as I can find, the HTML spec does define a technical name for this, apart from void versus normal elements, so what Watilin pointed out in the comments should be fine: standalone vs containers.
As an added side-note: HTML has a lot more HTML content categories. You can find a complete overview at the HTML spec here: https://html.spec.whatwg.org/multipage/indices.html#element-content-categories
Also interesting to read to visualize that a bit better: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
Elements whose contents are defined by text and/or other elements between their start and end tags don't have a special category. Even the HTML spec just calls them normal elements for the most part in section 8.1.2.
Elements whose primary values are defined by attributes and that cannot have content between their tags are called void elements. img and input are indeed two examples of void elements. Note that void elements are not to be confused with empty elements; see the following questions for more details on that:
Are void elements and empty elements the same?
HTML "void elements" so called because without content?
<input type="text" id="someField" name="someField">
With an input selector, you can get a value from it like so (with jQuery):
$("#someField).val();
Where as with a paragraph or a div, you don't get a value, you get the text or html.
<div id="someDiv">Blah, blah, blah</div> You can get that with jQuery as follows:
$("#someDiv").html();
Do you see the difference?

Parsing awful HTML: How do I recognize boundaries with xpath?

This is almost going to sound like a joke, but I promise you this is real life. There is a site on the internet, one which you have all used, that does not believe in css classes. Everything is defined directly in the style tag on an element. It's horrifying.
My problem though is that it also makes the html extraordinarily difficult to parse. The structure that I've got to go on looks something like this:
<td>
<a name="<random_string>"></a>
<div style="generic-style, used by other elements">
<div style="similarly generic style">{some_stuff}</div>
</div>
<a name="<random_string>"></a>
...
</td>
Basically, I've got these a tags that are forming the boundaries of the reviews, whos only defining information is the random string that is their name. I don't actually care about the anchor tags, but I would like to grab the reviews between them using xpath.
I've looked into sibling queries, but they don't seem to be well suited for alternating boundaries. I also looked into the Kayessian method of xpath queries, which (aside from having an awesome name) only seems well suited to grab a particular div, rather than all divs between the anchor tags.
Any thoughts on how I could grab the divs here?
If //td/div[../a[#name]] works for you, then the following should also work :
//td[a/#name]/div
This way you don't need to go back and forth -or rather down and up-. For a more specific selector, you may want to try the following :
//td/div[preceding-sibling::*[1][self::a/#name]][following-sibling::*[1][self::a/#name]]
The XPath selects div element having all the following properties :
td/div : is child of <td> element
[preceding-sibling::*[1][self::a/#name]] : preceded directly by <a> element having attribute name
[following-sibling::*[1][self::a/#name]] : followed directly by <a> element having attribute name
I figured it out! It turns out that xpath will allow for relative attribute assertions. I am not sure if this behavior is desired, but it happens to work in this case! Here's the xpath:
//td/div[../a[#name]]
Nice and clean, the ../a[#name] basically just says:
Go up a level, and make sure on that level of the hierarchy there's an a element with a name attribute

Simple Xpath puzzle

I'm trying to automate the Google Translate web interface with Selenium (but it's not necessary to understand Selenium to understand this question, just know that it finds elements and clicks them). I'm stuck on selecting the language to translate from.
I can't get to the point where the drop-down menu opens, as seen in the screenshot below.
Now, I want to select 'Japanese'.
This xpath expression works: $b.find_element(:xpath,"//*[#id=':13']/div").click But I would rather have one where I can just input the name of the language.
This xpath expression also works: $b.find_element(:xpath,"//*[contains(text(),'Japanese')]").click But only as long as there is no other 'Japanese' text on the page.
So I'm trying to narrow down the scope of my xpath, but when I try to specify the path to take to find the 'Japanese' text, the expression no longer works, I can't find the element: $b.find_element(:xpath,"//*div[#id='gt-sl-gms']/*[contains(text(),'Japanese')]").click
It also no longer works for the original xpath either: $b.find_element(:xpath,"//*div[#id='gt-sl-gms']/*[#id=':13']/div").click
Which is weird, because to bring down the drop-down menu, I use this xpath $b.find_element(:xpath,"//*[#id='gt-sl-gms']/*[contains(text(),'From:')]").click.
So it's not that I have two wildcards in my expression and it's not that my expression is too specific. There's something else that I'm missing and I'm sure it's really simple.
Any suggestions are appreciated.
Edit Other things I have tried unsuccessfully:
$b.find_element(:xpath,"//*/div[#id='gt-sl-gms']/*[#id=':13']/div").click
$b.find_element(:xpath,"//*[#id='gt-sl-gms']/*[#id=':13']/div").click
$b.find_element(:xpath,"//*[#id='gt-sl-gms']//*[#id=':13']/div").click
If the div with "#id=':13'" is an descendant of the div with "#id='gt-sl-gms" your xpaht "//*[#id='gt-sl-gms']//*[#id=':13']/div" would work.
The above xpaht expect that the html looks somehow like:
<div id="gt-sl-gms">
<div>
<div id=":13">
<div></div>
</div>
</div>
</div>
If <div id="gt-sl-gms"> in not an ancestor (as I expect) you have to look for an "real" ancestor, or you may use following (for nodes later in the document) or following-sibling (for nodes later in the document at the same level as the previous.
*div is incorrect, it should be just div. Also, depending on he structure of the HTML, you may need // instead of /.
Try selecting descendants (//) instead of (/*) which is really grandchildren or deeper.

Jsoup: <div> within an <a>

According to this answer:
HTML 4.01 specifies that <a> elements
may only contain inline elements. A
<div> is a block element, so it may
not appear inside an <a>.
But...
HTML5 allows <a> elements to contain
blocks.
Well, I just tried selecting a <div class="m"> within an <a> block, using:
Elements elems = a.select("m");
and elmes returns empty, despite the div being there.
So I am thinking: Either I am not using the correct syntax for selecting a div within an a or... Jsoup doesn't support this HTML5-only feature?
What is the right Jsoup syntax for selecting a div within an a?
Update: I just tried
Elements elems = a.getElementsByClass("m");
And Jsoup had no problems with it (i.e. it returns the correct number of such divs within a).
So my question now is: Why?
Why does a.getElementsByClass("m") work whereas a.select("m") doesn't?
Update: I just tried, per #Delan Azabani's suggestion:
Elements elems = a.select(".m");
and it worked. So basically the a.select() works but I was missing the . in front of the class name.
The select function takes a selector. If you pass 'm' as the argument, it'll try to find m elements that are children of the a element. You need to pass '.m' as the argument, which will find elements with the m class under the a element.
The current version of jsoup (1.5.2) does support div tags nested within a tags.
In situations like this I suggest printing out the parse tree, to ensure that jsoup has parsed the HTML like you expect, or if it hasn't to know what the correct selector to use.
E.g.:
Document doc = Jsoup.parse("<a href='./'><div class=m>Check</div></a>");
System.out.println("Parse tree:\n" + doc);
Elements divs = doc.select("a .m");
System.out.println("\nDiv in A:\n" + divs);
Gives:
Parse tree:
<html>
<head></head>
<body>
<a href="./">
<div class="m">
Check
</div></a>
</body>
</html>
Div in A:
<div class="m">
Check
</div>