Extract (random) image with no useful src= from web page - html

First I'd like to know how this can be achieved in general, and then maybe someone knows how to accomplish this using Capybara.
Example: <img src="http://example.com/getrandomimage">
The thing is, src points to a script that returns a random image, not to the image itself.
The page is loaded, the script runs, and an image is displayed. I can easily get the src value, but if I follow that link to download the image, the script runs again and returns a totally different picture. And I need the one that's already on the page.

I think the process would be very similar using JS or Capybara. I'd break it down into two steps:
Write a selector that will find the <img> tag. In JS that might look like:
myImg = document.getElementsByTagName("img")[0]
Call .src on the returned node:
result = myImg.src
I believe Capybara is limited to XPath and CSS selectors. Therefore, depending on the page you are trying to scrape, you'll have to identify some sort of pattern in the HTML tags or the CSS attributes to find the <img> tag.
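For example, a minimal sketch you could run in the browser console, assuming the random image can be recognised by a pattern in its src (the selector below is only illustrative, based on the example URL in the question):
// Find the <img> whose src points at the random-image script and read the
// src value that is currently on the page.
var img = document.querySelector('img[src*="getrandomimage"]');
var result = img ? img.src : null;
console.log(result);
In Capybara the same CSS selector could be passed to find, reading the attribute with something like find('img[src*="getrandomimage"]')['src'].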

Related

How to add an alternate link in an <a> tag in HTML

I'm implementing an <a> tag on my website. I load data from two different sources,
because if one source (DB) is not available (not connected), the data comes from the second source (DB).
In this scenario, my <a> tag has two different links to reach each page.
My question is whether there is any way to add an alternate source to the <a> tag, the way an img has an alt attribute.
Code:
url // if source 1
url // if source 2
Is it possible to represent the above two links as a single link, like the one below?
url
No - generally, the alt attribute is for showing alternative text, and according to the specification alt is not allowed on anchor tags.
You should use JavaScript to replace the link's target at runtime if your condition is met:
if (!url1)
    document.getElementById('my-anchor').href = url2;
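A slightly fuller sketch of the same idea (the anchor id my-anchor, the variable names, and the way url1 is obtained are illustrative assumptions, not part of the original question):
// Assumes the page exposes the first source's URL somewhere (here a global)
// and that it is empty/undefined when that source (DB) is unavailable.
var url1 = window.primarySourceUrl;                // hypothetical global
var url2 = 'https://example.com/fallback-page';    // link for source 2
if (!url1) {
  // First source not reachable: point the link at the second source instead.
  document.getElementById('my-anchor').href = url2;
}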
If you are using a Twig template, you can use Twig's ternary operator on the href value to solve this, for example href="{{ url1 ?: url2 }}", which falls back to url2 when url1 is empty.

Scraping pseudo-elements from a website with XPath

I want to extract data from a website, but it seems that the elements I want to extract are not "accessible". I also discovered that they seem to be pseudo-elements: I can see that their tags are marked with a # in my web inspector.
Moreover, using XPath I can't extract the text I want to access. There is a point in the CSS "cascade tree" where I can no longer extract the content of a tag, as you can see below.
Here I can extract information down to the 'content fond' tag. But when I ask for the "fos_comment_thread" tag, which is the one just below it, the result is empty. It is precisely this tag that is a pseudo-element, along with the ones beneath it. However, the text I want to access is even deeper in this part of the CSS tree...
Input
reponse.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']").extract()
Output
['<div id="foc_comment_thread"<div>']
Input
reponse.xpath("//div[@class='row']/div[@class='span9 forum']/div[@class='content fond']/div[@id='fos_comment_thread']").extract()
Output
[]
I don't understand why I can't extract it; I think it is because the rest of my tags are pseudo-elements, but I haven't found a way around the problem...
The first thing you need to do is stop relying on your web inspector and look at the raw HTML of the website.
Web inspectors take the transformations made by JavaScript into account and may show you updated HTML after JavaScript execution, which Scrapy obviously can't see.
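If you want to verify this, here is a minimal sketch (Node-style JavaScript, placeholder URL) that fetches the page the way a scraper would, without running any JavaScript:
// Download the raw markup and check whether the content is actually in it.
fetch('https://example.com/forum-page')
  .then(function (res) { return res.text(); })
  .then(function (html) {
    console.log(html.includes('fos_comment_thread')); // the container is there...
    console.log(html.includes('some expected text'));  // ...but the text may not be
  });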

Retrieve all hashes in a page for URL use

I am trying to copy a link from this site (Stack Overflow), but I'd like the link to include a hash so that when someone clicks on it they go directly to the answer I want them to see. How can I find the hashes in a page?
Example:
http://www.blahblah.com/index.php#label
How can I know there is a #label, and how do I find it?
The value of the hash is simply the ID attribute of any element in the page.
You can see them in the source or the DOM inspector.
Are you looking for something like this?
var hash = window.location.hash;
There might not be a simple answer for you here. In a pure HTML context (i.e. excluding JavaScript functionality), the hash would reference an anchor on the page like this:
<a name="label"></a>
So you could just look for named anchors.
Now, if you are talking about JavaScript functionality it gets much more complex. Via JavaScript you can take a hash like that and make it do any number of things (show a hidden element with id="label", download some content asynchronously based on the hash, etc.), so there might not be an easy way to determine the allowable values.
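Putting both answers together, a minimal sketch you could run in the browser console to list every value that can follow the # for a given page:
// Collect element ids and named anchors - the two things a URL hash can target.
var targets = [];
document.querySelectorAll('[id]').forEach(function (el) {
  targets.push(el.id);
});
document.querySelectorAll('a[name]').forEach(function (a) {
  targets.push(a.getAttribute('name'));
});
console.log(targets);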

HTML page title: localization and empty title

Two questions about a better way of solving the problem:
1) Is there a way to make the HTML page title look different for different locales of the client-side code, other than with JavaScript?
I.e. write the HTML page title shown in the browser's tab in the corresponding language.
I know I can use JavaScript for this, but maybe there is another way?
2) I set my HTML page title with JavaScript (it is a different case), but there is a delay before the script runs. Is there a way to set the HTML page title to an empty line before the JavaScript evaluates?
If I remove the <title> tag I get the page URL.
If I use an empty <title> tag, same thing.
I have to use &nbsp; as its content, which looks a bit ugly.
Any other options?
I don't see any other means but JavaScript on the client side for this, sorry.
For the delay: try using an inline script to change the page title right at the top of the page, before any other scripts are loaded or executed, but after the title has been set. This should keep the delay to an absolute minimum.
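A minimal sketch of that idea (the title map and locale handling are illustrative assumptions):
<head>
  <title>&nbsp;</title>  <!-- placeholder so the tab doesn't show the URL -->
  <script>
    // Inline script near the top of <head>: it runs before any other scripts,
    // so the localized title appears with minimal delay.
    var titles = { en: 'My page', de: 'Meine Seite' };   // illustrative
    var lang = (navigator.language || 'en').slice(0, 2);
    document.title = titles[lang] || titles.en;
  </script>
</head>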
To the first:
Apart from JavaScript, the only way I know of would be doing it server-side (e.g. with PHP), but using JavaScript is a lot better and easier.
To the second:
arkascha's post is the answer.

Pulling out some text from a giant HTML file using Nokogiri/xpath

I am scraping a website and am trying to pull out certain elements from the HTML. On the sites I am scraping there are script tags with a bunch of info in them; however, there is one part inside these tags that I am interested in. The line basically looks like:
'image':'http://ut5.example.com/t/231/3_b_643435.jpg',
with some stuff above and below it. Now, this is different for each page source except for, obviously, the domain and some of the subfolders that store the images.
How would I go about looking through the source for this specific line and cutting out just the URL? I feel I would need to use regular expressions, as the URLs are dynamic.
The gsub method does something similar to what I want, with its ability to use /regex/, but I don't want to replace anything; I just want to find that URL in the source code using a /regex/ and copy it.
According to your comments, this is what you're looking for, I guess:
var regex = /http.+/;
Example http://jsfiddle.net/Km9ZB/
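If you only want the URL itself and nothing after it, a slightly tighter pattern with a capture group might look like this (the source string is taken from the question; the exact quoting around the URL is an assumption):
// Pull only the image URL out of the page source, stopping at the closing quote.
var source = "'image':'http://ut5.example.com/t/231/3_b_643435.jpg',";
var match = source.match(/'image':'(http[^']+)'/);
if (match) {
  console.log(match[1]); // http://ut5.example.com/t/231/3_b_643435.jpg
}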