XPath to select all href's in element - html

I am having trouble with XPath. I have an HTML page with a complicated structure, and I want to select ALL hrefs inside a particular div, regardless of nesting depth.
Why doesn't the following code work, and what can I do to fix it?
//*[@id='some_id']//*//a

Matching @href attributes
Select the @href attributes themselves rather than the anchor elements:
//*[@id='some_id']//@href
If you only want the @href attributes of anchor tags, use this query, which selects all anchor tags inside the "some_id" element and then their @href attributes:
//*[@id='some_id']//a/@href
// and the descendant-or-self-axis
I'm not sure what you wanted to achieve with the .//*//a construct. This is an abbreviation for
./descendant-or-self::node()/child::*/descendant-or-self::node()/child::a
so there must be some element in between. If the anchor tag is directly contained within the @id='some_id' element, it will not be found, for example for this input:
<div id='some_id'><a href="...">bar</a></div>
//*[@id='some_id']//a would have matched this element.
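The difference between the two queries can be checked with a minimal sketch in Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath (it cannot select attribute nodes, so the hrefs are read with .get()), and the sample markup and href values below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one anchor directly inside the div, one nested deeper,
# and one outside the div that must not be selected.
DOC = """
<html><body>
  <div id="some_id">
    <a href="/direct">direct child</a>
    <p><a href="/nested">nested</a></p>
  </div>
  <a href="/outside">outside</a>
</body></html>
"""

root = ET.fromstring(DOC)

# //*[@id='some_id']//a : every anchor at any depth below the div
all_hrefs = [a.get("href") for a in root.findall(".//*[@id='some_id']//a")]

# //*[@id='some_id']//*//a : requires an element between the div and the anchor
skipping_hrefs = [a.get("href") for a in root.findall(".//*[@id='some_id']//*//a")]

print(all_hrefs)       # ['/direct', '/nested']
print(skipping_hrefs)  # ['/nested']
```

The extra //* step is what drops the anchor that is a direct child of the div.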

// addresses the entire descendant axis, so this is sufficient:
//*[@id='some_id']//a
Otherwise, you wouldn't get a elements that are immediate descendants of the element addressed with //*[@id='some_id']. (If your environment recognizes id attributes as being IDs, you can also address this element with id('some_id').)
But your problem is likely to be something different. //a usually addresses all a elements in the null namespace. Possibly your a elements aren't in the null namespace but in the XHTML namespace. You could match them like
//*[@id='some_id']//*[local-name()='a' and namespace-uri()='http://www.w3.org/1999/xhtml']
or, if you only have to expect HTML elements anyway
//*[@id='some_id']//*[local-name()='a']
or in XPath 2.0 even simpler
//*[@id='some_id']//*:a
Depending on your environment, you can also register a namespace prefix so that you can do something like
//*[@id='some_id']//html:a
in both XPath 1.0 and 2.0.
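As a rough illustration of the namespace issue, the sketch below uses Python's xml.etree.ElementTree, where a prefix is registered by passing a dictionary to findall(). The markup is invented, and the prefix name "html" is an arbitrary choice; other XPath engines register prefixes differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment: the default namespace puts every element,
# including <a>, into the XHTML namespace.
DOC = """
<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <div id="some_id"><p><a href="/nested">nested</a></p></div>
</body></html>
"""

root = ET.fromstring(DOC)

# An unprefixed name only matches elements in no namespace, so this finds nothing.
unprefixed = root.findall(".//*[@id='some_id']//a")

# Map a prefix to the XHTML namespace URI and use it in the path.
NS = {"html": "http://www.w3.org/1999/xhtml"}
prefixed = root.findall(".//*[@id='some_id']//html:a", NS)

print(len(unprefixed))                      # 0
print([a.get("href") for a in prefixed])    # ['/nested']
```

The id attribute needs no prefix because unprefixed attributes are never placed in the default namespace.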

Related

How to select <div class="ok">.....<a href="soft://an.id/">...</div> nodes?

A document has several <div class="ok"> tags. I am able to select all of them with
//*[@class="ok"] (I don't have to specify div, because only div tags have this class). I get a list of 6 nodes matching this.
Now, I need
either to test each node to see whether it includes the tag <a href="soft://an.id/">. This inclusion is not direct: the <div> includes a <table> with many <tr>, <td> and <span> elements, and the <a> (only one, or none) sits somewhere before the </div>;
or to directly select only the (div) nodes of class="ok" that include this <a> tag.
I have tried many things, and they all fail, including escaping the "/" in the href test (is that required?).
I am quite familiar with regular expressions, but I must confess that I find XPath syntax even harder to understand. And the W3C reference documents are so hard, without examples.
Any hints are welcome.
To select only the <div class="ok"> elements that contain an <a href="soft://an.id/"> descendant, you can use the following XPath locator:
//div[@class='ok' and .//a[@href='soft://an.id/']]
If I understand you correctly, you have an <a> nested somewhere under the div with class "ok", right?
In XPath, a single / only steps to direct children of the current element. If you are looking for the <a> anywhere under the found div, you need to use //:
//div[@class="ok"]//a[@href="soft://an.id/"]
Then you need to check whether it exists by using some kind of assertion.
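The "test each node" variant from the question could look roughly like the following sketch in Python's xml.etree.ElementTree. That module cannot nest a path inside a predicate, so the inner query runs once per candidate div. The markup and the id values "first" and "second" are hypothetical and exist only to make the result visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: two div.ok elements, only the first contains the target link.
DOC = """
<body>
  <div class="ok" id="first">
    <table><tr><td><span><a href="soft://an.id/">link</a></span></td></tr></table>
  </div>
  <div class="ok" id="second">
    <table><tr><td><a href="http://example.com/">other</a></td></tr></table>
  </div>
</body>
"""

root = ET.fromstring(DOC)

# The slashes in the href sit inside a string literal, so they need no escaping.
# find() returns None when no descendant anchor matches.
matching = [
    div for div in root.findall(".//div[@class='ok']")
    if div.find(".//a[@href='soft://an.id/']") is not None
]

print([div.get("id") for div in matching])  # ['first']
```

An engine with full XPath 1.0 support can do the same in one expression by nesting the anchor test inside the div's predicate.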

CSS: Select elements with only one parent matching attribute selector

I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.
However, I came into an issue when testing (on this page in particular where the name was being set as "Electronics" from the Breadcrumbs schema) as there were ancestor elements with different itemtypes/schema.
I first have this variable declared to check if the page has an element using the Product schema microdata.
var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');
I then wanted to select for all elements with the itemprop attribute. e.g.
productMicrodata.querySelectorAll('[itemprop]');
The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.
I figured I would then just be able to do something like this:
productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');
However this is still returning matches for the child elements having ancestor elements with different itemscope attributes (e.g. breadcrumbs).
I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.
EDIT: For clarification of where the element(s) are that I'm trying to avoid matching with are, here's what the DOM looks like on the example page linked. I'm trying to ignore the elements that have any ancestors with itemtype attributes.
EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|
EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the javascript Element.closest() method. e.g.
let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};
for (let i = 0; i < productMicrodata.length; i++) {
  // closest() returns null when neither the element nor any ancestor has an itemtype
  let scope = productMicrodata[i].closest('[itemtype]');
  let itemType = scope ? scope.getAttribute('itemtype') : null;
  if (itemType === "http://schema.org/Product" || itemType === "https://schema.org/Product") {
    itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
  }
}
console.log(itemProp);
:not([itemscope]) [itemprop] means:
An element with an itemprop attribute that has any ancestor without an itemscope attribute.
So:
<div>
  <div itemscope>
    <div itemprop> <!-- this one -->
    </div>
  </div>
</div>
… would match because while the parent element has the attribute, the grandparent does not.
You need to use the child combinator to eliminate elements with matching parent elements:
:not([itemscope]) > [itemprop]
[...] help on how I can achieve only selecting elements that have only the itemtype="http://schema.org/Product" attribute would be much appreciated.
Attribute selectors can take explicit values:
[myAttribute="myValue"]
So the syntax for this would be:
var productMicrodata = document.querySelectorAll('[itemtype="http://schema.org/Product"]');
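The nearest-scope check that the question's Element.closest() solution relies on can also be sketched outside the browser. The version below is an assumption-laden illustration in Python's xml.etree.ElementTree: the markup, the helper name closest_itemtype and the property values are all hypothetical, and elem.text only approximates the DOM's textContent.

```python
import xml.etree.ElementTree as ET

PRODUCT_TYPES = {"http://schema.org/Product", "https://schema.org/Product"}

# Hypothetical markup: a breadcrumb scope nested inside a Product scope.
DOC = """
<div itemscope="" itemtype="http://schema.org/Product">
  <ol itemscope="" itemtype="http://schema.org/BreadcrumbList">
    <li><span itemprop="name">Electronics</span></li>
  </ol>
  <h1 itemprop="name">Example Phone</h1>
  <span itemprop="sku">12345</span>
</div>
"""

root = ET.fromstring(DOC)

# ElementTree elements have no parent pointer, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}


def closest_itemtype(elem):
    """Return the itemtype of the nearest ancestor-or-self that has one, else None."""
    while elem is not None:
        if "itemtype" in elem.attrib:
            return elem.get("itemtype")
        elem = parents.get(elem)
    return None


# Keep only itemprop elements whose nearest itemtype scope is a Product.
item_props = {
    elem.get("itemprop"): (elem.text or "").strip()
    for elem in root.iter()
    if "itemprop" in elem.attrib and closest_itemtype(elem) in PRODUCT_TYPES
}

print(item_props)  # {'name': 'Example Phone', 'sku': '12345'}
```

The breadcrumb's name is skipped because its nearest itemtype is BreadcrumbList, even though a Product scope sits further up the tree.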

Terminology - The types of elements in HTML

A while ago there was a term that I remembered that described two categories of elements. I forgot the term and I want to know what that term was. The information I can remember is that the first category of elements get their values from within HTML like <p> or <a> or <ul> but there is another category of elements which get their values from "outside" of HTML like <img> or <input type="textbox">. I want to know the terminology for these types.
Edit - I've gone through Zomry, Difster and BoltClock's answers and didn't find what I was looking for. So I remembered an extra piece of information and decided to add it. The two categories are lazy opposites of each other: for example, if one is called xyz, then the other is called non-xyz.
Probably you mean replaced elements (and non-replaced, respectively)?
However, the distinction between them is not so unambiguous. For example, form controls were traditionally considered replaced elements, but the HTML spec currently explicitly lists them as non-replaced (introducing the "widget" term instead).
The HTML specification mentions for tags like <img> and <input> the following: Tag omission in text/html: No end tag.
Tags with an end tag are defined as: Tag omission in text/html: Neither tag is omissible.
So as far as I can find, the HTML spec does not define a technical name for this, apart from void versus normal elements, so what Watilin pointed out in the comments should be fine: standalone vs containers.
As an added side-note: HTML has a lot more HTML content categories. You can find a complete overview at the HTML spec here: https://html.spec.whatwg.org/multipage/indices.html#element-content-categories
Also interesting to read to visualize that a bit better: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Content_categories
Elements whose contents are defined by text and/or other elements between their start and end tags don't have a special category. Even the HTML spec just calls them normal elements for the most part in section 8.1.2.
Elements whose primary values are defined by attributes and that cannot have content between their tags are called void elements. img and input are indeed two examples of void elements. Note that void elements are not to be confused with empty elements; see the following questions for more details on that:
Are void elements and empty elements the same?
HTML "void elements" so called because without content?
<input type="text" id="someField" name="someField">
With an input selector, you can get a value from it like so (with jQuery):
$("#someField").val();
Whereas with a paragraph or a div, you don't get a value; you get the text or HTML.
<div id="someDiv">Blah, blah, blah</div>
You can get that with jQuery as follows:
$("#someDiv").html();
Do you see the difference?

Xpath targeting website text

I am trying to set my XPath to target only a page's text content; however, the 'about the author' section below the article keeps getting included. I would like an XPath that targets only the article's text plus the title.
My XPath so far:
//*[@class="content"]//p[not(contains(@id, "author-bio"))] |
//*[@id="content_wrapper"]//h1
This works but does not remove the 'about the author' section as expected. I am working off the article below.
http://www.intomobile.com/2013/11/05/samsung-galaxy-s3-android-43-update-rolling-out-international-users/
I am using the FirePath extension for Firefox/Firebug, which lets me view the elements I am targeting.
That particular document is XHTML, and it has a root element of
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"
xmlns:og="http://opengraphprotocol.org/schema/"
xmlns:fb="http://www.facebook.com/2008/fbml">
The xmlns="..." means that the html element (and all its un-prefixed descendants) are in the http://www.w3.org/1999/xhtml namespace. Now un-prefixed names in XPath expressions refer to nodes that are not in a namespace, so
//p[not(contains(@id, "author-bio"))]
is looking for an element named p in no namespace, and won't match an element named p in the http://www.w3.org/1999/xhtml namespace.
The correct approach would be to map a prefix to that namespace URI and use the prefix in the XPath expressions, e.g.
//xhtml:p[not(contains(@id, "author-bio"))]
but exactly how you define the prefix mappings depends on the XPath engine you're using. If your tool doesn't provide a way to do prefix mappings then you'll have to use predicates on the local-name(), e.g.
//*[local-name() = 'p'][not(contains(@id, "author-bio"))]
The same applies to the h1, you need to either bind and use a prefix or use the *[local-name() = 'h1'] trick.
id('home_right_column')//p[not(ancestor::*[@id='author-bio'])] | //*[@id="content_wrapper"]//h1
Got it myself :)
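The ancestor test in that expression suggests the author-bio id sits on a wrapper element rather than on the paragraphs themselves, which would explain why a predicate on the paragraph's own @id excluded nothing. The sketch below reproduces the idea in Python's xml.etree.ElementTree, which has no ancestor axis, so the paragraphs under the bio are collected first and then subtracted. The markup is a hypothetical stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical article layout: the id is on a wrapper div, not on the <p> itself.
DOC = """
<div id="home_right_column">
  <h1>Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <div id="author-bio"><p>About the author.</p></div>
</div>
"""

root = ET.fromstring(DOC)

# Collect every paragraph that lives below an element with id="author-bio".
in_bio = {
    p
    for bio in root.findall(".//*[@id='author-bio']")
    for p in bio.iter("p")
}

# Keep the remaining paragraphs in document order.
article = [p.text for p in root.iter("p") if p not in in_bio]

print(article)  # ['First paragraph.', 'Second paragraph.']
```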

Why should one add ID to their HTML tags?

A simple question: why should we add an id to our HTML tags if they work perfectly well without one? I know that one of their uses is being able to navigate through the page via hashtags (#), but is there any other use for them?
Uses of id attributes in HTML
As a target for a fragment identifier on a URL.
As a target on form controls for the for attribute on <label> and <output> elements.
As a target on <form> elements for the form attribute on form associated elements.
As a target for element references via the microdata itemref attribute.
As a target for element references via some ARIA attributes including aria-describedby, aria-labelledby and 4 others.
As a target on <th> elements for the headers attribute on <td> and <th> elements.
As a target on <menu> elements for the contextmenu attribute.
As a target on <datalist> elements for the list attribute on <input> elements.
As part of a hash-name reference to <map> elements for the usemap attribute on the <img> and <object> elements.
As an identifier of an element in a CSS selector
As an identifier of an element for JavaScript processing
They're most often used to uniquely identify elements for styling (CSS) and scripting (JavaScript et al) purposes.
But if you're asking about HTML and only HTML, then one example where declarative IDs are useful is associating a <label> with its <input>, <button> or <textarea> control via its for attribute:
<label for="ex">Example field:</label>
<input type="text" name="ex" id="ex">
Without assigning this attribute, activating the label does nothing, but when you pair both elements together using for and id, activating the label causes its control to gain focus.
The other way to associate a form label with its control is to contain it within the label:
<label>
Example field:
<input type="text" name="ex">
</label>
But this doesn't always suit the structure of a form or a page, so an ID reference is offered as an alternative.
Other circumstances where an id attribute serves a function are covered extensively in Alohci's answer.
You can use IDs to access your divs from JavaScript, CSS and jQuery. If you don't use IDs it will be very difficult for you to interact with your HTML page from JS.
AFAIK, they are used to uniquely refer to a tag, which makes it easier for you to refer to that tag.
IDs are used for accessing your elements in CSS and JavaScript. Strictly speaking IDs should uniquely identify an element. You can also use class attributes to identify groups of elements.
The id attribute provides a unique identifier for an element within the document. It may be used by an a element to create a hyperlink to this particular element.
This identifier may also be used in CSS code as a hook that can be used for styling purposes, or by JavaScript code (via the Document Object Model, or DOM) to make changes or add behavior to the element by referencing its unique id.
see http://reference.sitepoint.com/html/core-attributes/id
for more info on class see here: http://reference.sitepoint.com/html/core-attributes/class
It is there to help you identify your element in JavaScript code. The getElementById function in JavaScript gives you a handle to the element with a specific ID, like this:
var someElement = document.getElementById("someID");
// do whatever with someElement;
I myself also prefer classes for styling through CSS, but sometimes you need an element to be unique. For accessibility reasons you add an id to input elements to "connect" a label to them via the for attribute. And in JavaScript it's much simpler to select an element if it has an id attribute.
The main reason I use ids for my HTML elements is that selecting them is faster, in JavaScript with getElementById and in CSS as well, using the #id selector.
Of course, I'm not saying this is always a good idea, especially in CSS, where having classes based on ids can cause a lot of redundancy; it's just one of the reasons.
First, only add IDs when you need to use them. In most cases an id is used for things like:
A reference for scripts, selecting elements to apply scripts to
A style sheet selector, selecting elements for styling
Named anchors for linking to, which is what you called page navigation
So, simply because in most cases you will want to do something to or with the content of a tag, it's good to give it an identifier, that is, the id attribute.