What XPath should I use to extract only the URL from the element below?
<a class="url" rel="noreferrer" onclick="redirect('https://www.fotbollskanalen.se/allsvenskan/kujovic-siktar-pa-startplats-var-naturligt-att-agera-inhoppare---nu-vill-jag-s');">Läs mer på Fotbollskanalen</a>
I have tried the XPath below, but it only returns "Läs mer på Fotbollskanalen" and not the URL itself.
a[1]/child::node()
I also tried different versions that specify the class, but was unable to get it right.
Try this:
substring-before(substring-after(//a/@onclick, "'"),"'")
How it works:
substring-before(substring-after(foo, "'"),"'"): gets everything enclosed by the first pair of ' in foo.
//a: selects the a element.
/@onclick: selects its onclick attribute.
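If you want to check the logic outside an XPath engine, here is a minimal Python sketch of the same two steps using only the standard library. It assumes the element can be parsed as a well-formed XML fragment; the variable names are illustrative, and str.partition stands in for substring-after and substring-before.

```python
import xml.etree.ElementTree as ET

# The element from the question, treated as a well-formed XML fragment.
snippet = """<a class="url" rel="noreferrer" onclick="redirect('https://www.fotbollskanalen.se/allsvenskan/kujovic-siktar-pa-startplats-var-naturligt-att-agera-inhoppare---nu-vill-jag-s');">Läs mer på Fotbollskanalen</a>"""

link = ET.fromstring(snippet)
onclick = link.get("onclick")  # counterpart of //a/@onclick

# substring-after(onclick, "'"): everything after the first quote.
after_first_quote = onclick.partition("'")[2]
# substring-before(..., "'"): everything before the next quote.
url = after_first_quote.partition("'")[0]

print(url)
```

Like the XPath functions, partition yields an empty string when no quote is found, so a missing URL shows up as "" rather than an exception.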
I just started out with Python and learning about xpath expressions.
I'm trying to select a div, then an a element by its class, look at the href of that a element, and extract part of the href before continuing with something else.
div class: dropdown-menu and a class: dropdown-item
My url: https://www.something.com/library/category/stuff
My XPath expression: response.xpath("//div[@class='dropdown-menu']//a[@class='dropdown-item']//a[contains(@href, 'category')]")
It just returns an empty string and I can't figure out why. Please advise.
Since an <a> can't really be nested inside an <a>, I suppose you meant to write two conditions for the same <a> here:
response.xpath("//div[@class='dropdown-menu']//a[@class='dropdown-item']//a[contains(@href, 'category')]")
That would be written like this:
response.xpath("//div[@class='dropdown-menu']//a[@class='dropdown-item' and contains(@href, 'category')]")
or like this (predicates, i.e. the filter conditions in the square brackets, can be chained and are evaluated one after another):
response.xpath("//div[@class='dropdown-menu']//a[@class='dropdown-item'][contains(@href, 'category')]")
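To see the "two conditions on the same a" idea without Scrapy, here is a rough standard-library sketch. The markup is made up for illustration, and because ElementTree's XPath subset has no contains(), the href condition is applied as a plain Python filter instead of a second predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the class names in the question.
page = """<html><body>
<div class="dropdown-menu">
  <a class="dropdown-item" href="/library/category/stuff">Stuff</a>
  <a class="dropdown-item" href="/library/about">About</a>
  <a class="other" href="/library/category/other">Other</a>
</div>
</body></html>"""

root = ET.fromstring(page)

# First condition as an XPath predicate on the <a> itself (not a nested <a>).
items = root.findall(".//div[@class='dropdown-menu']//a[@class='dropdown-item']")

# Second condition on the same element; stands in for contains(@href, 'category').
hrefs = [a.get("href") for a in items if "category" in a.get("href", "")]

print(hrefs)
```

Only the link that satisfies both conditions survives, which is what the `and` form and the chained-predicate form select in a full XPath engine.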
Which one is the correct way of using the a tag in this context?
<a name="test">Test</a>
or
<a name="test"></a>Test
The first one is correct:
<a name="test">Test</a>
The first example is correct
<a name="test">Test</a>
The correct syntax places the link text between the opening and closing tags: <a>link text</a>
"name" is not a valid attribute for the <a> tag in HTML5, where it is obsolete. The allowed attributes are:
download, href, hreflang, media, ping, referrerpolicy, rel, target and type. The <a> tag also accepts the global attributes and event attributes in addition to these.
If you want to use custom attributes, consider the data- format, such as data-name, e.g.
<a data-name="test">link text</a>
I have this piece of HTML code. I've already tried several xpath selectors but don't seem to be able to get the "Ask us" text from within the span with class "someClass".
<span class="someClass">Ask us</span>
Thanks in advance.
You can reach the text content of the link with "/text()".
This XPath snippet works for me on your example:
/span[@class="someClass"]/a/text()
string(//span[@class="someClass"])
If you want the string() function to concatenate all child text, you must pass it a single node instead of a node-set.
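The difference between the two approaches can be sketched in Python with the standard library. The markup below is an assumption: it puts a link inside the span (as the /a/text() answer implies) and adds trailing text so the two results differ. itertext() plays the role of string() on a single node.

```python
import xml.etree.ElementTree as ET

# Assumed markup: a link inside the span, followed by extra text.
snippet = '<span class="someClass"><a href="#">Ask us</a> anything</span>'

span = ET.fromstring(snippet)

# Rough counterpart of span/a/text(): only the link's own text.
link_text = span.find("a").text

# Rough counterpart of string(span): all descendant text, concatenated.
all_text = "".join(span.itertext())

print(link_text)
print(all_text)
```

Use the first form when you want exactly one text node, and the second when the text may be split across nested elements.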
I'm using scrapy to write a scraper that finds links with images inside them and grabs the link's href. The page I'm scraping is populated with image thumbnails, and when you click on the thumbnail it links to a full size version of the image. I'd like to grab the full size images.
The html looks somewhat like this:
<a href="example.com/full_size_image.jpg">
<img src="example.com/image_thumbnail.jpg">
</a>
And I want to grab "example.com/full_size_image.jpg".
My current method of doing so is
img_urls = scrapy.Selector(response).xpath('//a/img/..').xpath("@href").extract()
But I'd like to reduce that to a single xpath expression, as I plan to allow the user to enter their own xpath expression string.
You can check whether an element has a particular child element this way:
response.xpath('//a[img]/@href').extract()
Note that I'm using the response.xpath() shortcut and providing a single XPath expression.
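The a[img] predicate can be tried outside Scrapy too. Below is a small sketch with the standard library's ElementTree, which supports the [tag] child predicate but not attribute steps such as /@href, so the attribute is read in Python. The sample page is invented, with the img self-closed so it parses as XML.

```python
import xml.etree.ElementTree as ET

# Invented sample: one link wrapping a thumbnail, one plain text link.
page = """<div>
  <a href="example.com/full_size_image.jpg">
    <img src="example.com/image_thumbnail.jpg"/>
  </a>
  <a href="example.com/about.html">About</a>
</div>"""

root = ET.fromstring(page)

# a[img] keeps only the <a> elements that have an <img> child.
img_urls = [a.get("href") for a in root.findall(".//a[img]")]

print(img_urls)
```

The text-only link is skipped because it has no img child, which is the same filtering the single Scrapy expression performs.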
I used the DOM to extract all hrefs from a given HTML source. But there's a problem: if I have a link like this one:
<LINK rel="alternate" TYPE="application/rss+xml" TITLE="ES: Glavni RSS feed" HREF="/rss.xml">
then the "href" attribute is returned as /rss.xml, although "/rss.xml" is just anchor text. When I click that link in Chrome's page source view, the real link is opened.
I would like to get the real link from that href, not the anchor text. How can I do that with the DOM?
Get a hold of the link element and get its href property. Suppose you were using an id,
<link id="myLink" rel="alternate" href="/rss.xml" />
var link = document.getElementById("myLink");
link.href; // http://www.example.com/rss.xml
the "href" attribute is returned as /rss.xml
Yes, that is the value of the attribute.
although "/rss.xml" is just anchor text.
No. <link> elements don't have anchor text. In the following example, 'bar' is the anchor text.
<a href="...">bar</a>
When I click that link in Chrome's page source view, the real link is opened.
Browsers know how to resolve relative URIs.
I would like to get the real link from that href, not the anchor text. How can I do that with the DOM?
You can't use DOM to resolve a URI. You use DOM to get the value of the attribute and then use something else to resolve it as a relative URI.
The article Using and interpreting relative URLs explains how they work, and there are tools that can help resolve them.
You need to know the base URI that the relative URI is relative to (normally the URI of the document containing the link, but things like the base element can change that).
In Perl you might:
#!/usr/bin/perl
use strict;
use warnings;
use URI;
my $str = '/rss.xml';
my $base_uri = 'http://example.com/page/with/link/to/rss.xml';
print URI->new_abs( $str, $base_uri );
Which gives:
http://example.com/rss.xml
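For comparison, a roughly equivalent sketch in Python uses urllib.parse.urljoin from the standard library. The base URI is the same made-up one as in the Perl example, and feed.xml is an extra hypothetical link added to show a path-relative case.

```python
from urllib.parse import urljoin

# Same illustrative base URI as in the Perl example above.
base_uri = "http://example.com/page/with/link/to/rss.xml"

# A root-relative href replaces the whole path of the base URI.
absolute = urljoin(base_uri, "/rss.xml")

# A path-relative href (hypothetical) replaces only the last path segment.
sibling = urljoin(base_uri, "feed.xml")

print(absolute)
print(sibling)
```

This is also why simply appending the href to the current URL is fragile: the result depends on whether the href is root-relative, path-relative, or already absolute.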
You can try using document.location.href to get the current URL and append the result you are getting from your example. That should give you an absolute path for the link.