XPath for HtmlAgilityPack on a site with frames

I'm trying to extract all the station names, which are contained in the left frame of http://www.raws.dri.edu/wraws/orF.html, using HtmlAgilityPack.
My XPath string is currently //frame[@name='list']. At this point it returns the node, but I can't seem to access any of its child nodes. Ultimately I'm trying to return all the attributes of the anchors under frameset[1]/html/body, each of which looks something like this:
<a onmouseover="popup('<font color=Black><strong> IDARNG1 RG2 Idaho (RAWS) </strong> </font> ',615,307);update('IDARNG1 RG2 Idaho (RAWS)',615,307,'idIAN1','raw');return true;" onmouseout="removeBox();removedot();" href="/cgi-bin/rawMAIN.pl?idIAN1">

Here is what the browser is currently doing:
It opens http://www.raws.dri.edu/wraws/orF.html
It parses the source code and performs another request for every <frame> that appears in it.
That means you need to manually open the URL the <frame> is pointing to, which can be found in its src attribute. Below is an example:
string src = doc.DocumentNode.SelectSingleNode("//frame[@name='list']").GetAttributeValue("src", "");
string url = "http://www.raws.dri.edu/wraws/" + src;
The URL you're looking for is:
http://www.raws.dri.edu/wraws/orlst.html
Go and open it manually and you will see that only the left sidebar is loaded.
Next time, make sure you use an HTTP debugger like Firebug or Fiddler to see what is happening behind the scenes.
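Putting the pieces together, here is a minimal sketch of the same two-step approach, shown with Jsoup (which the other answers below use); the HtmlAgilityPack version is analogous. The selectors mirror the ones above, and the attribute handling is only illustrative:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FrameScrape {
    public static void main(String[] args) throws Exception {
        // Step 1: load the frameset page and read the frame's src attribute.
        Document frameset = Jsoup.connect("http://www.raws.dri.edu/wraws/orF.html").get();
        String src = frameset.select("frame[name=list]").attr("src"); // "orlst.html"

        // Step 2: load the frame's own document and select the station links.
        Document list = Jsoup.connect("http://www.raws.dri.edu/wraws/" + src).get();
        for (Element a : list.select("a[href]")) {
            // The station name is embedded in the onmouseover popup text.
            System.out.println(a.attr("href") + " | " + a.attr("onmouseover"));
        }
    }
}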

Related

Jsoup - hidden div class?

I'm trying to scrape a div class, but everything I have tried has failed so far.
I'm trying to scrape the element(s):
<a href="http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope"><div class="s_buttons_button s_buttons_buttonAlt s_buttons_buttonSlashBack">More info</div></a>
from the website: http://www.bellator.com/events
I tried accessing the list of elements by doing
Elements elements = document.select("div[class=s_container] > li");
but that didn't return anything.
Then I tried accessing just the parent with
Elements elements = document.select("div[class=s_container]");
and that returned two divs with the class name "s_container", neither of which was the one I needed.
Then I tried accessing that one's parent with
Elements elements = document.select("div[class=ent_m152_bellator module ent_m152_bellator_V1_1_0 ent_m152]");
and that didn't return anything.
I also tried
Elements elements = document.select("div[class=ent_m152_bellator]");
because I wasn't sure about the whitespace, but that didn't return anything either.
Then I tried accessing its parent by
Elements elements = document.select("div#t3_lc");
and that worked, but it returned an element containing
<div id="t3_lc">
<div class="triforce-module" id="t3_lc_promo1"></div>
</div>
which is kind of weird, because I can't see that child when I inspect the website in Chrome.
Does anyone know what's going on? I feel kind of lost.
What you see in your web browser is not what Jsoup sees. Disable JavaScript and refresh the page to get what Jsoup gets, OR press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML document before JavaScript modifications. When you use your browser's debugger it shows the final document after those modifications, so it's not suitable for your needs.
It seems like the whole "UPCOMING EVENTS" section is dynamically loaded by JavaScript.
Even more, this section is asynchronously loaded with AJAX. You can use your browser's debugger (Network tab) to see every request and response.
I found it, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse it.
That's not the end of the bad news; this case is more complicated. You could make a direct request for the data:
http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b
but the URL seems random, and a few of these URLs (one per upcoming event?) are embedded inside JavaScript code in the HTML.
My approach would be to get the URLs of these feeds with something like:
// imports needed: java.util.List, java.util.ArrayList,
// java.util.regex.Pattern, java.util.regex.Matcher
List<String> feedUrls = new ArrayList<>();
// select all the scripts and pull every feed URL out of their source
Elements scripts = document.select("script");
Pattern feedPattern = Pattern.compile("http://www\\.bellator\\.com/feeds/[^\"'\\s]+");
for (Element script : scripts) {
    Matcher m = feedPattern.matcher(script.data()); // script bodies live in data(), not text()
    while (m.find()) {
        feedUrls.add(m.group());
    }
}
for (String feedUrl : feedUrls) {
    // download each feed; ignoreContentType(true) lets Jsoup fetch the raw non-HTML body
    String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
    // here use a JSON parsing library (see the sketch below) to get the data you need
}
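For the JSON step, any parsing library will do. Here is a hedged sketch with org.json; the key names are hypothetical, since the feed's actual schema isn't shown here:
import org.json.JSONObject;

public class FeedParse {
    // json is the string fetched from one of the feed URLs above.
    static void printEvent(String json) {
        JSONObject feed = new JSONObject(json);
        // "title" and "url" are placeholder keys; inspect a real feed
        // response in the Network tab to find the actual field names.
        System.out.println(feed.optString("title") + " -> " + feed.optString("url"));
    }
}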
An alternative approach would be to stop using Jsoup, because of these limitations, and use Selenium WebDriver instead, as it supports dynamic page modifications by JavaScript. You'd get the HTML of the final result: exactly what you see in the web browser and its inspector.
If anyone finds this in the future: I managed to solve it with Selenium. I don't know if it's a good/correct solution, but it seems to be working.
// imports: org.openqa.selenium.WebDriver, org.openqa.selenium.chrome.ChromeDriver
System.setProperty("webdriver.chrome.driver", "C:\\Users\\PC\\Desktop\\Chromedriver\\chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("http://www.bellator.com/events");
String html = driver.getPageSource();
driver.quit(); // close the browser once the rendered HTML has been captured
Document doc = Jsoup.parse(html);
Elements elements = doc.select("ul.s_layouts_lineListAlt > li > a");
for (Element element : elements) {
    System.out.println(element.attr("href"));
}
Output:
http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope
http://www.bellator.com/events/ylcu8d/bellator-215-mitrione-vs-kharitonov
http://www.bellator.com/events/yk2djw/bellator-216-mvp-vs-daley
http://www.bellator.com/events/e8rdqs/bellator-217-gallagher-vs-graham
http://www.bellator.com/events/281wxq/bellator-218-sanchez-vs-grimshaw
http://www.bellator.com/events/8lcbdi/bellator-219-koreshkov-vs-larkin
http://www.bellator.com/events/9rqguc/bellator-macdonald-vs-fitch

Href without http(s) prefix

I have just created a primitive HTML page. Here is its markup:
<a href="www.google.com">www.google.com</a>
<br/>
<a href="http://www.google.com">http://www.google.com</a>
As you can see, it contains two links. The first one's href doesn't have the 'http' prefix, and when I click this link the browser redirects me to the non-existing page https://fiddle.jshell.net/_display/www.google.com. The second one's href has this prefix, and the browser produces the correct URL http://www.google.com/. Is it possible to use hrefs such as www.something.com, without the http(s) prefix?
It's possible, and indeed you're doing it right now. It just doesn't do what you think it does.
Consider what the browser does when you link to this:
href="index.html"
What then would it do when you link to this?:
href="index.com"
Or this?:
href="www.html"
Or?:
href="www.index.com.html"
The browser doesn't know what you meant; it only knows what you told it. Without a scheme, it resolves the href relative to the current page's address. The prefix is what tells it that it needs to start at a new root address entirely.
Note that you don't need the http: part, you can do this:
href="//www.google.com"
The browser will use whatever the current protocol is (http, https, etc.) but the // tells it that this is a new root address.
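If you want to see these resolution rules outside the browser, java.net.URI follows the same RFC 3986 logic; a small demo (the base URL is just an example):
import java.net.URI;

public class ResolveDemo {
    public static void main(String[] args) {
        URI base = URI.create("http://example.com/docs/page.html");
        // No scheme and no leading //: treated as a relative path.
        System.out.println(base.resolve("www.google.com"));
        // -> http://example.com/docs/www.google.com

        // Leading //: a new root address; the current scheme is kept.
        System.out.println(base.resolve("//www.google.com"));
        // -> http://www.google.com

        // Full URL: used as-is.
        System.out.println(base.resolve("http://www.google.com"));
        // -> http://www.google.com
    }
}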
You can omit the protocol by using // in front of the path. Here is an example:
<a href="//www.google.com">Google</a>
By using //, you can tell the browser that this is actually a new (full) link, and not a relative one (relative to your current link).
I've created a little function in a React project that could help you:
const getClickableLink = (link) => {
  return link.startsWith("http://") || link.startsWith("https://")
    ? link
    : `http://${link}`;
};
And you can implement it like this:
const link = "google.com";
<a href={getClickableLink(link)}>{link}</a>
Omitting the protocol by just using // in front of the path is a very bad idea in terms of SEO.
OK, most modern browsers will work fine. On the other hand, many robots will get in trouble scanning your site: Majestic will not count the link flow from those links, and audit tools like SEMrush will not be able to perform their jobs.

Retrieving URLs from functions within HTML (Python)

I need to scrape some URLs from some retailer product pages, but the specific URLs I need aren't in the HTML of the page. For each of the items one would click to reach the page whose URL I need, the HTML looks like this:
<div id="name" class="hand bold" onclick="AVON.productcontrol.Go(45714);">ADVANCE TECHNIQUES Color Protection Conditioner Bonus Size</div>
I wrote the following to get URLs from the page, but since the actual URLs I need don’t seem to be stored in the page, it doesn’t get what I need:
import urllib  # Python 2; on Python 3 use urllib.request
import lxml.html
from lxml.cssselect import CSSSelector

def getUrls(URL):
    """input: product page url
    output: list of urls to products
    """
    connection = urllib.urlopen(URL)
    dom = lxml.html.fromstring(connection.read())
    selAnchor = CSSSelector('a')   # select every anchor element
    foundElements = selAnchor(dom)
    urlList = [e.get('href') for e in foundElements]
    return urlList
Is there a way to get the link that the function after 'onclick' (I guess AVON.productcontrol.Go(#);) takes you to? I don't fully understand HTML, and while I've read a bit about onclick, I can't figure out how the function after 'onclick' works.
In order to find the URL that you are taken to on click, you need to find the JavaScript source code of the 'Go' function and read and understand it. It's buried somewhere within a <script> tag or some .js file that is referenced directly or indirectly by the HTML page. Happy digging!
Or: you automate the interaction with the web page with a tool like Selenium (http://docs.seleniumhq.org/) and just check where it takes you when you click.
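Extracting the numeric IDs out of those onclick attributes is the mechanical part; what each ID maps to is not, since that depends on what AVON.productcontrol.Go() actually does. A sketch in Java with Jsoup (to match the other answers in this collection; the page URL is a placeholder):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OnclickIds {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/product-page").get(); // placeholder URL
        Pattern go = Pattern.compile("AVON\\.productcontrol\\.Go\\((\\d+)\\)");
        for (Element div : doc.select("div[onclick]")) {
            Matcher m = go.matcher(div.attr("onclick"));
            if (m.find()) {
                // m.group(1) is the product id; turning it into a URL requires
                // reading the site's Go() implementation.
                System.out.println(m.group(1) + " -> " + div.text());
            }
        }
    }
}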

Get innerHTML via Jsoup

I'm trying to scrape data from this website: http://www.bundesliga.de/de/liga/tabelle/
In the source code I can see the tables, but there's no content, just things like:
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
....
With Firebug (F12 in Firefox) I don't see any content either, but I can select the table and then copy its innerHTML via a Firebug option. That way I get all the information about the teams, but I don't know how to get the table with its content in Jsoup.
To get the value of an attribute, use the Node.attr(String key) method
For the text on an element (and its combined children), use Element.text()
For HTML, use Element.html(), or Node.outerHtml() as appropriate
For example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text();     // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text();       // "example"

String linkOuterH = link.outerHtml();
// "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html();     // "<b>example</b>"
reference:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
The table is not rendered on the server directly, but built by the page's client-side JavaScript and populated with data that reaches the client via AJAX. So what you get with the naive Jsoup approach is expected.
I see two possible solutions:
1. You analyze the network traffic and identify the AJAX calls that the site is making. Then you reconstruct the request format and fire the same requests the JavaScript would. From the responses you can reconstruct the table. (A sketch follows after this list.)
2. You don't use Jsoup but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You could use Selenium WebDriver for that. There is a headless browser called PhantomJS with a relatively small footprint that you can use in combination with Selenium WebDriver.
Both options have their (dis)advantages:
1. This takes more time, since you need to understand the network traffic pretty well. The reward is a very fast and memory-efficient scraper.
2. Programming Selenium is very easy and you should not have any difficulty achieving your goal. You don't need to understand the inner workings of the site you want to scrape. However, the price is a further dependency in your project: memory consumption is high, another process runs, and the scraping will be slow.
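For option 1, the Jsoup side is small once you have found the endpoint in the Network tab. A hedged sketch; the endpoint URL below is hypothetical:
import org.jsoup.Jsoup;

public class AjaxFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: find the real one in your browser's Network tab.
        String json = Jsoup.connect("http://www.bundesliga.de/data/table.json")
                .ignoreContentType(true) // the response is data, not HTML
                .execute()
                .body();
        System.out.println(json);
    }
}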
Maybe you can find another source for the league table that holds the info you want? That might be the easiest option. For example: http://www.fussballdaten.de/bundesliga/

Chrome extension, replace HTML in response code before browser displays it

I wonder if there is some way to do something like this:
When I'm on a specific site, I want some of its JavaScript files to be loaded directly from my computer (e.g. file:///c:/test.js), not from the server.
For that, I was thinking of making an extension which could change the HTML code in a response the browser gets, right before displaying it. The whole process should look like this:
a request is made
the browser gets the response from the server
#the response is changed# - this is the part where the extension comes in
the browser parses the changed response and displays the page with the new response.
It doesn't even have to be a Chrome extension. It should just do the job described above. It could block the original file and serve another one (DNS/proxy?), or filter all the HTTP traffic on my computer and replace specific code in a matched response.
You can use the webRequest API to achieve that. For example, you can add an onBeforeRequest listener and redirect some requests:
chrome.webRequest.onBeforeRequest.addListener(function(details) {
    var responseData = "<div>Some text</div>";
    return {redirectUrl: "data:text/html," + encodeURIComponent(responseData)};
}, {urls: ["https://www.google.com/"]}, ["blocking"]);
This will display a <div> element with the text "Some text" instead of the Google homepage. Note that you can only redirect to URLs that the browser allows extensions to redirect to. This means that redirecting to file:/// URLs is not possible, and you can only redirect to files inside your extension if these are web accessible. data: and http: URLs work fine, however.
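For completeness, here is a sketch of the manifest (v2) wiring such an extension would need; the names are illustrative. To redirect to a file bundled with the extension, list it under web_accessible_resources and use chrome.runtime.getURL("test.js") as the redirectUrl:
{
  "manifest_version": 2,
  "name": "Local JS override",
  "version": "1.0",
  "permissions": [
    "webRequest",
    "webRequestBlocking",
    "https://www.google.com/*"
  ],
  "background": { "scripts": ["background.js"] },
  "web_accessible_resources": ["test.js"]
}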
On Windows you can use the Proxomitron (proxomitron.info), a local proxy that can intercept any page or file being loaded into your browser and change it however you want, using regular expressions (no DOM parsing), before it is rendered by the browser.