Xpath targeting website text

I am trying to write an XPath that targets only a page's text content, but the "about the author" section below the article keeps getting included. I would like an XPath that targets only the article's text plus the title.
My XPath so far:
//*[@class="content"]//p[not(contains(@id, "author-bio"))] |
//*[@id="content_wrapper"]//h1
This works but does not exclude the "about the author" section as expected. I am working from the article below.
http://www.intomobile.com/2013/11/05/samsung-galaxy-s3-android-43-update-rolling-out-international-users/
I am using the FirePath extension for Firefox/Firebug, which lets me view the elements I am targeting.

That particular document is XHTML, and it has a root element of
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"
xmlns:og="http://opengraphprotocol.org/schema/"
xmlns:fb="http://www.facebook.com/2008/fbml">
The xmlns="..." means that the html element (and all its un-prefixed descendants) are in the http://www.w3.org/1999/xhtml namespace. Now un-prefixed names in XPath expressions refer to nodes that are not in a namespace, so
//p[not(contains(@id, "author-bio"))]
is looking for an element named p in no namespace, and won't match an element named p in the http://www.w3.org/1999/xhtml namespace.
The correct approach would be to map a prefix to that namespace URI and use the prefix in the XPath expressions, e.g.
//xhtml:p[not(contains(@id, "author-bio"))]
but exactly how you define the prefix mappings depends on the XPath engine you're using. If your tool doesn't provide a way to do prefix mappings then you'll have to use predicates on the local-name(), e.g.
//*[local-name() = 'p'][not(contains(@id, "author-bio"))]
The same applies to the h1, you need to either bind and use a prefix or use the *[local-name() = 'h1'] trick.
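The namespace behaviour described above can be reproduced with a small sketch. It assumes Python's standard xml.etree.ElementTree rather than FirePath, and the markup is a hypothetical, trimmed-down stand-in for the real article; only the namespace URI comes from the document above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down XHTML document (not the real article markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="content">
      <h1>Title</h1>
      <p>Article text</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# Un-prefixed name: looks for p in no namespace, so nothing matches.
no_namespace = root.findall(".//p")

# Prefix bound to the XHTML namespace URI; the prefix name itself is arbitrary.
ns = {"xhtml": "http://www.w3.org/1999/xhtml"}
with_prefix = root.findall(".//xhtml:p", ns)

# Rough equivalent of the local-name() trick: match p in any namespace
# (the {*} wildcard needs Python 3.8 or later).
any_namespace = root.findall(".//{*}p")

print(len(no_namespace), len(with_prefix), len(any_namespace))
```

How the prefix mapping is supplied differs between XPath engines; the dictionary argument here is specific to ElementTree.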

id('home_right_column')//p[not(ancestor::*[@id='author-bio'])] | //*[@id="content_wrapper"]//h1
Got it myself :)
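The difference from the original attempt is that contains(@id, "author-bio") tests each p element's own id, whereas the ancestor test looks at the elements that contain it. A rough sketch of that filter follows, assuming Python's xml.etree.ElementTree. ElementTree has no ancestor axis, so the sketch emulates it with a parent map. The markup is hypothetical; only the two ids come from the expression above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the ids come from the question.
HTML = """<div id="home_right_column">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <div id="author-bio"><p>About the author</p></div>
</div>"""

root = ET.fromstring(HTML)

# ElementTree has no ancestor axis, so record each element's parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def has_ancestor_with_id(elem, wanted_id):
    """Walk up the tree; True if any ancestor carries the given id."""
    node = parent_of.get(elem)
    while node is not None:
        if node.get("id") == wanted_id:
            return True
        node = parent_of.get(node)
    return False


# Keep only paragraphs that are not inside the author-bio container.
article_text = [p.text for p in root.iter("p")
                if not has_ancestor_with_id(p, "author-bio")]
print(article_text)
```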

Related

What is the purpose of the <html> element?

Doesn't the file type already tell the browser that the document is an HTML document? MDN mentions that it is the root element, so is using it just a formality?

It is a family trait of HTML, XML, and SGML that they all need to be nested inside a root element. It's just part of the data standard and lets the interpreter know where to start and stop, and verifies that the document is complete and well-formed.
<!DOCTYPE html> specifies the type of document. Here it means HTML5, as opposed to, for example, XML or XHTML 1.0 Transitional. Keep in mind that if you are downloading these as byte streams, you may not always know the file type.

Yes. The <html> tag is the root and can even be omitted in some cases (from MDN):
The start tag may be omitted if the first thing inside the <html> element is not a comment. The end tag may be omitted if the <html> element is not immediately followed by a comment, and it contains a <body> element either that is not empty or whose start tag is present.
But that is not its only purpose:
It can be styled with CSS (though styling the <body> will usually be enough).
It can have global attributes, especially lang, which is the W3C-recommended way of declaring an HTML document's language.
There is probably more to say but that’s what I see as arguments for the <html> element, apart from its main role of being the root element for an HTML document.

Are there two ways to jump to a fragment identifier in HTML?

I always thought the standard way to specify a fragment identifier is by <a name="foo"></a>.
<a href="#foo">go to foo</a>
<a name="foo"></a> <!-- obsolete method, it seems -->
<p>some content under that anchor with name</p>
But it seems like this is the old way, and the new way is using an id, like this:
<a href="#bar">go to bar</a>
<p id="bar">some content under that p with id</p>
In fact, the W3C validator says that name is obsolete for the <a> element. So are there 2 ways to jump to the fragment identifier but 1 of them is obsolete? (And when did that happen?)
(there are other questions about the difference between id and name, but this one is about fragment identifier)

So are there 2 ways to jump to the fragment identifier but 1 of them is obsolete?
There are two ways to identify a fragment.
(There are also two ways to jump to one, since you can do it with a URL or write a pile of JavaScript to scroll the page).
And when did that happen?
id was introduced in 1997 when HTML 4 came out. It effectively obsoleted the name attribute for anchors.
name was made officially obsolete in HTML 5 in 2014 (or in Living HTML on some date that I'm not going to try to figure out).

Yes, there are two ways to jump to a fragment identifier, and neither is obsolete in general; the exception is the name attribute on the a element.
These rules apply to all HTML5 elements other than a (because a has no name attribute in HTML5).
In short, using the name attribute of an a element as a fragment identifier is obsolete, as that attribute has been deprecated since HTML4.
The steps for locating a fragment, from the HTML5 specification:
If there is an element in the DOM that has
an ID exactly equal to fragid, then the first such element in tree
order is the indicated part of the document; stop the algorithm here.
If there is an a element in the DOM that has a name attribute whose
value is exactly equal to fragid, then the first such element in tree
order is the indicated part of the document; stop the algorithm here.
Otherwise, there is no indicated part of the document.
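The quoted steps can be sketched in a few lines. This is an illustration only, assuming Python's xml.etree.ElementTree and a well-formed, hypothetical document. The function name indicated_part is made up for this example, and a real browser's algorithm includes steps not quoted above.

```python
import xml.etree.ElementTree as ET


def indicated_part(root, fragid):
    """Return the element a fragment identifier points at, or None."""
    # Step 1: first element in tree order whose id equals fragid.
    for elem in root.iter():
        if elem.get("id") == fragid:
            return elem
    # Step 2: first a element whose name attribute equals fragid.
    for elem in root.iter("a"):
        if elem.get("name") == fragid:
            return elem
    # Otherwise there is no indicated part of the document.
    return None


# Hypothetical document: an a with name="bar" precedes the element with id="bar".
doc = ET.fromstring("""<body>
  <a name="foo"></a>
  <a name="bar"></a>
  <p id="bar">some content under that p with id</p>
</body>""")

print(indicated_part(doc, "foo").tag)
print(indicated_part(doc, "bar").tag)
print(indicated_part(doc, "missing"))
```

Because the id check runs first, the p with id="bar" is chosen even though an a with name="bar" appears earlier in the document.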

Both ways of doing fragment identifiers work.
Using id="fragment" is the newer, recommended way of jumping to fragments in HTML. It was introduced with HTML4, and works basically everywhere (I just verified this with IE5).
<a name="fragment">, the older way, still works, but has been obsolete since HTML5.

Answer to your question: yes, there are two ways to identify a fragment, and one is obsolete.
What are fragment identifiers?
Fragment identifiers for text/plain.
Some URIs refer to a location within the same resource. This kind of URI starts with "#" followed by an anchor identifier (called the fragment identifier).
You can also jump to a fragment identifier using JavaScript, as below.
location.replace('#middle');

More information on the name attribute.
Basically, the name attribute has been deprecated (obsolete in HTML5-speak) for just about everything except for form elements. Forms retain them as the method of identifying data, and it is the name plus the value property which is sent back to the server. (The id in form elements is used for attaching label elements, and has nothing to do with the actual data).
There is a fundamental weakness in the name attribute, which the id attribute addresses: the name attribute is not required to be unique. This is OK for forms where you can have multiple elements with the same name, but unsuitable for the rest of the document where you are trying to uniquely identify an element.
The id attribute was specifically required to be unique, which makes it better for identifying a link target, among other things. CSS is pretty relaxed about applying styles to multiple elements with the same id, but JavaScript is more strict about this requirement. And, of course, you can’t have a practical link target if you can’t guarantee uniqueness.

P attribute inside <li> tag

I have seen this code from the tutorial that I'm studying. I searched for the purpose of the p attribute inside the li tag but found no answer. What is the purpose of that p attribute inside the li tag?
$msgs .= "<li p=\"$no_of_paginations\" class=\"inactive\">Last</li>";

The purpose cannot be inferred from the code snippet. As such, the attribute, being not defined in any HTML specification or draft or browser-specific extension, has no effect beyond being stored as data in the li element node in the document tree.
Such an attribute, though invalid by the specs, can be used like any other attribute in styling (e.g. the attribute selector [p] in CSS) or in scripting. In this case, it is probable, but by no means certain, that the attribute is meant to be used in scripting to carry a number as its value, with that number inserted by some server-side code, so that client-side scripting can access the value as relating to a specific element.
The recommended way is to use data-* attributes instead, such as data-p, to avoid any risk of clashing with attribute names that might be introduced in some future HTML version.

The default HTML namespace (whichever version) doesn't define a purpose for "p" inside a li tag. If another namespace is declared, then that's where it's from. Other than that, it's not valid by W3C standards.

It is probably a custom attribute that JavaScript code reads to get a value.

That is just a custom attribute used in some JavaScript functions.

XPath to select all href's in element

I am having trouble with XPath. I have an HTML page with a complicated structure, and I want to select ALL href attributes inside a particular div, regardless of the depth of nesting.
Why doesn't the following code work, and what can I do to fix it?
//*[@id='some_id']//*//a

Matching @href attributes
Select all @href attributes, not all anchor tags.
//*[@id='some_id']//@href
If you only want to match the @href attributes of anchor tags, go for this query, which selects all anchor tags inside that "some_id" element, and then their @href attributes.
//*[@id='some_id']//a/@href
// and the descendant-or-self-axis
I'm not sure what you wanted to achieve with the .//*//a construct. This is an abbreviation for
./descendant-or-self::node()/child::*/descendant-or-self::node()/child::a
so there must be some element in between. If the anchor tag is directly contained within the @id='some_id' element, it will not be found, for example for this input:
<div id='some_id'><a href="...">bar</a></div>
//*[@id='some_id']//a would have matched this element.
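The effect of the extra //* step can be checked with a short sketch. It assumes Python's xml.etree.ElementTree, which supports only a subset of XPath and cannot return attribute nodes, so the sketch selects the a elements and reads href from each. The markup and file names are hypothetical; only the id some_id comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the id 'some_id' comes from the question.
root = ET.fromstring("""<body>
  <div id="some_id">
    <a href="direct.html">direct child</a>
    <p><span><a href="nested.html">nested</a></span></p>
  </div>
  <a href="outside.html">outside</a>
</body>""")

# All a elements below the div, at any depth, then their href values.
all_hrefs = [a.get("href") for a in root.findall(".//*[@id='some_id']//a")]

# The extra //* step requires an element between the div and the a,
# so the directly contained anchor is missed. A set is used because the
# same element can be reached through more than one intermediate element.
with_extra_step = {a.get("href")
                   for a in root.findall(".//*[@id='some_id']//*//a")}

print(all_hrefs)
print(with_extra_step)
```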

// addresses the entire descendant axis, so this is sufficient:
//*[@id='some_id']//a
Otherwise, you wouldn't get a elements that are immediate descendants of the element addressed with //*[@id='some_id']. (If your environment recognizes id attributes as being IDs, you can also address this element with id('some_id').)
But your problem is likely to be something different. //a usually addresses all a elements in the null namespace. Possibly your a elements aren't in the null namespace but in the XHTML namespace. You could match them like
//*[@id='some_id']//*[local-name()='a' and namespace-uri()='http://www.w3.org/1999/xhtml']
or, if you only have to expect HTML elements anyway
//*[@id='some_id']//*[local-name()='a']
or in XPath 2.0 even simpler
//*[@id='some_id']//*:a
Depending on your environment, you can also register a namespace prefix so that you can do something like
//*[@id='some_id']//html:a
in both XPath 1.0 and 2.0.

Trying to extract attribute values using Nokogiri with custom pseudoclass CSS selectors

Having loaded a (X)HTML page, I'm trying to get the value of a meta tag's "content" attribute. For example, given:
<meta name="author" content="John Smith" />
I'd like to extract the value "John Smith".
I know how to do that using XPath and understand that CSS was meant primarily for element selection but Nokogiri supports defining custom CSS pseudoclasses which I thought could be used as follows:
class CSSext
  def attr(nodeset, tag)
    nodeset.first.attribute_nodes.find_all {|node| node.name == tag}
  end
end
doc = Nokogiri::HTML(open(someurl))
doc.css("meta[name='name']:attr('content')", CSSext.new)
However, this returns the same result as
doc.css("meta[name='name']")
What gives? Nokogiri uses the same engine underneath for both CSS and XPath searches so anything that's possible in XPath should be doable in CSS. How should I go about extracting the attribute value?

Why not just?
doc.at("meta[name='author']")['content']
As far as I understand, pseudoclasses can only be used to filter the nodeset, not to replace the nodeset with some other value, such as the value of one of a node's attributes.
