Parsing awful HTML: How do I recognize boundaries with xpath? - html

This is almost going to sound like a joke, but I promise you this is real life. There is a site on the internet, one which you have all used, that does not believe in CSS classes. Everything is defined directly in the style attribute on each element. It's horrifying.
My problem though is that it also makes the html extraordinarily difficult to parse. The structure that I've got to go on looks something like this:
<td>
<a name="<random_string>"></a>
<div style="generic-style, used by other elements">
<div style="similarly generic style">{some_stuff}</div>
</div>
<a name="<random_string>"></a>
...
</td>
Basically, I've got these a tags forming the boundaries of the reviews, whose only defining information is the random string that is their name. I don't actually care about the anchor tags, but I would like to grab the reviews between them using xpath.
I've looked into sibling queries, but they don't seem to be well suited for alternating boundaries. I also looked into the Kayessian method of xpath queries, which (aside from having an awesome name) only seems well suited to grab a particular div, rather than all divs between the anchor tags.
Any thoughts on how I could grab the divs here?

If //td/div[../a[@name]] works for you, then the following should also work:
//td[a/@name]/div
This way you don't need to go back and forth (or rather, down and up). For a more specific selector, you may want to try the following:
//td/div[preceding-sibling::*[1][self::a/@name]][following-sibling::*[1][self::a/@name]]
This XPath selects div elements that have all of the following properties:
td/div : is a child of a <td> element
[preceding-sibling::*[1][self::a/@name]] : is directly preceded by an <a> element that has a name attribute
[following-sibling::*[1][self::a/@name]] : is directly followed by an <a> element that has a name attribute
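
For anyone who wants to see the sibling test spelled out, here is a minimal Python sketch using only the standard library. ElementTree's XPath subset has no preceding-sibling or following-sibling axes, so the same check is done by walking the children of the td by hand. The sample markup, the anchor names, and the helper names (is_named_anchor, reviews_between_anchors) are all made up for illustration, and the sketch assumes the markup is well-formed enough to parse as XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the structure in the question.
SAMPLE = """
<td>
  <a name="x1f3"></a>
  <div style="generic"><div style="generic">first review</div></div>
  <a name="9qz"></a>
  <div style="generic"><div style="generic">second review</div></div>
  <a name="k20"></a>
  <div style="generic">not followed by an anchor</div>
</td>
""".strip()


def is_named_anchor(elem):
    """True if elem is an <a> element carrying a name attribute."""
    return elem is not None and elem.tag == "a" and "name" in elem.attrib


def reviews_between_anchors(td):
    """Return child divs whose immediate neighbours are both named anchors."""
    children = list(td)
    found = []
    for i, child in enumerate(children):
        if child.tag != "div":
            continue
        prev_elem = children[i - 1] if i > 0 else None
        next_elem = children[i + 1] if i + 1 < len(children) else None
        if is_named_anchor(prev_elem) and is_named_anchor(next_elem):
            found.append(child)
    return found


td = ET.fromstring(SAMPLE)
reviews = reviews_between_anchors(td)
texts = ["".join(div.itertext()).strip() for div in reviews]
print(texts)
```

The last div in the sample is skipped because nothing follows it, which mirrors what the following-sibling predicate would do.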

I figured it out! It turns out that xpath will allow for relative attribute assertions. I am not sure if this behavior is desired, but it happens to work in this case! Here's the xpath:
//td/div[../a[@name]]
Nice and clean, the ../a[@name] basically just says:
Go up a level, and make sure on that level of the hierarchy there's an a element with a name attribute

Related

How to select <div class="ok">.....<a href="soft://an.id/">...</div> nodes?

A document has several <div class="ok"> tags. I am able to select all of them with
//*[@class="ok"] (I don't have to specify div, because only div tags have this class). I get a list of 6 nodes matching this.
Now, I need
either to test each node in order to see if it includes the tag <a href="soft://an.id/">. This inclusion is not direct. I mean, the <div> includes a <table> with many <tr> and <td> and <span>, and the <a..> (only one, or none) is somewhere before </div>.
or to directly select only the (div) nodes of class="ok" that include this <a> tag.
I have tried many things, and they all fail, including escaping the "/" in the href test (is that required?).
I am quite familiar with regular expressions, but I must confess that I find XPath syntax even harder to understand. And the W3C reference documents are so hard to follow without examples.
Any hints are welcome.
In order to select only the <div class="ok"> elements that contain an <a href="soft://an.id/"> descendant, you can use the following XPath locator:
//div[@class='ok' and .//a[@href='soft://an.id/']]
If I understand you correctly, you have an <a> nested somewhere under the div with class "ok", right?
In xpath, a single / selects only direct children of the current element. If you are looking for the <a> anywhere under the found div, you need to use //:
//div[@class="ok"]//a[@href="soft://an.id/"]
Then you need to check if it exists or not by using some kind of an assertion.
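
If it helps to see the "div of class ok that contains the link somewhere below it" test run, here is a small Python sketch using the standard library. The sample document and the function name ok_divs_with_link are invented for the example; the filtering is done in plain Python rather than in a single XPath expression, since ElementTree only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the first div should match.
SAMPLE = """
<body>
  <div class="ok"><table><tr><td><span><a href="soft://an.id/">match</a></span></td></tr></table></div>
  <div class="ok"><table><tr><td>no link here</td></tr></table></div>
  <div class="other"><a href="soft://an.id/">wrong class</a></div>
</body>
""".strip()


def ok_divs_with_link(root, href):
    """Return div elements of class 'ok' that have an <a> with this href at any depth."""
    matches = []
    for div in root.iter("div"):
        if div.get("class") != "ok":
            continue
        # div.iter("a") walks all descendants, like .//a in XPath.
        if any(a.get("href") == href for a in div.iter("a")):
            matches.append(div)
    return matches


root = ET.fromstring(SAMPLE)
matches = ok_divs_with_link(root, "soft://an.id/")
print(len(matches))
```

Note that the slashes in the href need no escaping here; they are just characters inside a string.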

How to access <a> tag which is present inside <th> in VBA that doesn't have id or classname?

I'm trying to automate the web form where I have <a> tag which is inside <th> tag. When I tried getElementByTagName("a").innerText I'm not getting the desired element/text. But when I wrote getElementByTagName("th").innerText it is showing me the exact text that I'm pointing at. But the issue is I wanted to click on the link which this text i.e <a> tag has. getElementByTagName("th").Click is not working. Can someone please help?
There's no such method as getElementByTagName().
There are: document.getElementsByTagName() and Element.getElementsByTagName(). Both return a live HTMLCollection.
In the latter Element refers to a DOM element. It allows you to search for specific tags in children of that element.
Please refer to the following MDN documents:
Element.getElementsByTagName()
Document.getElementsByTagName()
Also, it's worth mentioning that, without document.* or anything else, the browser would assume you're trying to call window.getElementByTagName().
NOTE: I'm aware the question is tagged vba instead of javascript, but in this case it doesn't seem to matter.
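
To illustrate the "collection, then index, then search inside that element" pattern, here is a minimal sketch in Python, whose xml.dom.minidom module uses the same DOM method name. This is only an analogy under stated assumptions: the sample table is invented, minidom returns a plain (non-live) node list, and clicking the link is browser-specific and not shown.

```python
from xml.dom.minidom import parseString

# Hypothetical markup: a link nested inside a table header cell.
SAMPLE = '<table><tr><th><a href="/details">Open details</a></th></tr></table>'

doc = parseString(SAMPLE)

# getElementsByTagName (plural) returns a collection, so pick an item by index.
headers = doc.getElementsByTagName("th")
first_header = headers[0]

# Calling it on an element searches only that element's descendants.
links = first_header.getElementsByTagName("a")
link = links[0]

link_text = link.firstChild.data
link_href = link.getAttribute("href")
print(link_text, link_href)
```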

Correct Microdata syntax for breadcrumbs NOT in a list?

Trying to determine the correct syntax for using Microdata inside my breadcrumbs implementation. Everything I have read seems to assume that the breadcrumbs are structured inside an ordered or unordered list. Mine is not.
<body itemscope="" itemtype="http://schema.org/WebPage">
...
<div class="breadcrumbs" itemprop="breadcrumb">
Home
<span class="delimiter"> > </span>
Parent Item
<span class="delimiter"> > </span>
<span>Child</span>
</div>
...
</body>
If I run it inside Google's tool it seems correct, but compared to their example it is missing a lot of elements and doesn't have the structure of their example BreadcrumbList.
I'm also a little confused about the correct properties for the links. Should they all have title and url properties?
I was looking at the examples at the bottom of the page here: http://schema.org/WebPage
The breadcrumb property expects one of two values:
Text
BreadcrumbList
If you provide a Text value (like you do in the example), you can’t provide data about each link. If you are fine with that, the Microdata in your example is correct (but it also contains RDFa, which doesn’t seem to make sense, at least not without further context; so if you didn’t add them intentionally, you might want to remove the property attributes).
If you want to provide data about each link, you have to provide a BreadcrumbList value.
For the Microdata, it doesn’t matter whether or not you use a list. If the example uses ol→li→a→span, you could as well use something like div→span→a→span. You just have to make sure to use the correct element type.
If you can’t add parent elements to the a elements, it’s still possible to use BreadcrumbList. But then you would have to duplicate the URL with a link element inside the a element.

Terminology - The types of elements in HTML

A while ago there was a term that I remembered that described two categories of elements. I forgot the term and I want to know what that term was. The information I can remember is that the first category of elements get their values from within HTML like <p> or <a> or <ul> but there is another category of elements which get their values from "outside" of HTML like <img> or <input type="textbox">. I want to know the terminology for these types.
Edit - I've gone through Zomry, Difster and BoltClock's answers and didn't get anything. So I remembered some extra piece of information and decided to add it. The two categories are Lazy Opposites of each other. For example if one is called xyz, then the other is called non-xyz.
Probably you mean replaced elements (and non-replaced, respectively)?
However, the distinction between them is not entirely unambiguous. For example, form controls were traditionally considered replaced elements, but the HTML spec currently explicitly lists them as non-replaced (introducing the "widget" term instead).
The HTML specification mentions for tags like <img> and <input> the following: Tag omission in text/html: No end tag.
Tags with an end tag are defined as: Tag omission in text/html: Neither tag is omissible.
So as far as I can find, the HTML spec does not define a technical name for this, apart from void versus normal elements, so what Watilin pointed out in the comments should be fine: standalone vs containers.
As an added side-note: HTML has a lot more HTML content categories. You can find a complete overview at the HTML spec here: https://html.spec.whatwg.org/multipage/indices.html#element-content-categories
Also interesting to read to visualize that a bit better: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
Elements whose contents are defined by text and/or other elements between their start and end tags don't have a special category. Even the HTML spec just calls them normal elements for the most part in section 8.1.2.
Elements whose primary values are defined by attributes and that cannot have content between their tags are called void elements. img and input are indeed two examples of void elements. Note that void elements are not to be confused with empty elements; see the following questions for more details on that:
Are void elements and empty elements the same?
HTML "void elements" so called because without content?
<input type="text" id="someField" name="someField">
With an input selector, you can get a value from it like so (with jQuery):
$("#someField").val();
Whereas with a paragraph or a div, you don't get a value, you get the text or html.
<div id="someDiv">Blah, blah, blah</div> You can get that with jQuery as follows:
$("#someDiv").html();
Do you see the difference?

Simple Xpath puzzle

I'm trying to automate the Google Translate web interface with Selenium (but it's not necessary to understand Selenium to understand this question, just know that it finds elements and clicks them). I'm stuck on selecting the language to translate from.
I can get to the point where the drop-down menu opens.
Now, I want to select 'Japanese'.
This xpath expression works: $b.find_element(:xpath,"//*[@id=':13']/div").click But I would rather have one where I can just input the name of the language.
This xpath expression also works: $b.find_element(:xpath,"//*[contains(text(),'Japanese')]").click But only as long as there is no other 'Japanese' text on the page.
So I'm trying to narrow down the scope of my xpath, but when I try to specify the path to take to find the 'Japanese' text, the expression no longer works and I can't find the element: $b.find_element(:xpath,"//*div[@id='gt-sl-gms']/*[contains(text(),'Japanese')]").click
It also no longer works for the original xpath: $b.find_element(:xpath,"//*div[@id='gt-sl-gms']/*[@id=':13']/div").click
Which is weird, because to bring down the drop-down menu, I use this xpath: $b.find_element(:xpath,"//*[@id='gt-sl-gms']/*[contains(text(),'From:')]").click
So it's not that I have two wildcards in my expression and it's not that my expression is too specific. There's something else that I'm missing and I'm sure it's really simple.
Any suggestions are appreciated.
Edit Other things I have tried unsuccessfully:
$b.find_element(:xpath,"//*/div[@id='gt-sl-gms']/*[@id=':13']/div").click
$b.find_element(:xpath,"//*[@id='gt-sl-gms']/*[@id=':13']/div").click
$b.find_element(:xpath,"//*[@id='gt-sl-gms']//*[@id=':13']/div").click
If the div with @id=':13' is a descendant of the div with @id='gt-sl-gms', your xpath "//*[@id='gt-sl-gms']//*[@id=':13']/div" would work.
The above xpath expects the html to look something like:
<div id="gt-sl-gms">
<div>
<div id=":13">
<div></div>
</div>
</div>
</div>
If <div id="gt-sl-gms"> is not an ancestor (as I expect), you have to look for a "real" ancestor, or you may use the following axis (for nodes later in the document) or the following-sibling axis (for nodes later in the document at the same level as the context node).
*div is incorrect, it should be just div. Also, depending on the structure of the HTML, you may need // instead of /.
Try selecting descendants (//) instead of children (/*). /* only matches direct children, so it fails when the target is nested any deeper.
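
The difference between / and // can be checked quickly outside the browser. Below is a minimal Python sketch using the standard library's ElementTree, which supports enough of XPath for this comparison. The sample markup is an assumption modelled on the nesting sketched in the answer above, not the real Google Translate page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the ':13' div is a grandchild of 'gt-sl-gms', not a child.
SAMPLE = """
<body>
  <div id="gt-sl-gms">
    <div>
      <div id=":13">
        <div>Japanese</div>
      </div>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# Single slash: ':13' must be a direct child of 'gt-sl-gms', so nothing matches.
child_only = root.findall(".//*[@id='gt-sl-gms']/*[@id=':13']/div")

# Double slash: ':13' may sit at any depth below 'gt-sl-gms'.
descendant = root.findall(".//*[@id='gt-sl-gms']//*[@id=':13']/div")

print(len(child_only), [d.text for d in descendant])
```

If the double-slash version also returns nothing on the real page, that suggests the ':13' element is not inside 'gt-sl-gms' at all.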