Fetching HTML or links from a website via Jsoup/HtmlUnit

I have been trying to extract the HTML values from a page, e.g. https://www.qwant.com/?q=cat&t=web, but when I use Jsoup or HtmlUnit I always get a basic page that doesn't compare to what is generated when I search via my normal browser.
My code generally works on other websites, but could someone explain why I don't get the same results when I visit the above page with code? I am trying to fetch all the URL values on the page. Is it to do with JavaScript?
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

WebClient wb = new WebClient(BrowserVersion.FIREFOX_52);
wb.getPage(url); // url holds the search page address, e.g. https://www.qwant.com/?q=cat&t=web
wb.waitForBackgroundJavaScript(25000); // wait up to 25 s for background JavaScript
System.out.println(wb.getCurrentWindow().getEnclosedPage().getWebResponse().getContentAsString());

Some websites just won't let you parse them headlessly (for obvious reasons). When I tried to curl the Qwant cat results page, the result was a blank page.
But you may want to try switching from Firefox to Chrome as your browser: it is not possible to detect and block headless Chrome.
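If you do try that, a minimal HtmlUnit sketch using its Chrome profile might look like this (a sketch only; whether Qwant serves real content to it is untested):
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient wb = new WebClient(BrowserVersion.CHROME);
wb.getOptions().setThrowExceptionOnScriptError(false); // don't abort on the site's own JS errors
HtmlPage page = wb.getPage("https://www.qwant.com/?q=cat&t=web");
wb.waitForBackgroundJavaScript(25000); // give the background JavaScript time to run
for (HtmlAnchor a : page.getAnchors()) { // print the URL of every link on the rendered page
    System.out.println(a.getHrefAttribute());
}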

Related

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4. Also, I have no knowledge of HTML. To practice, I am trying to use them on the Carrefour website to extract the price and price per kilogram of the product whose EAN code I search for.
My code:
import requests
from bs4 import BeautifulSoup

barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class': 'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class': 'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any results at all.
What am I doing wrong? I tried it with another website and it seems to work.
The problem here is that you're scraping a page with JavaScript-generated content. Basically, the page that you're grabbing with requests doesn't actually contain the thing you're looking for - it contains a bunch of JavaScript. When your browser loads the page, it runs that JavaScript, which generates the content - so the rendered version you see in your browser is not the same thing returned by the page itself. The page contains instructions for your browser to build the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle javascript generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request to go not to the page where you view the info, but to the location that page gets its data from, as sketched below.
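For example, something along these lines (purely illustrative: the endpoint URL and the JSON field names below are made-up placeholders; the real request is whatever shows up in your browser's network tab, F12):
import requests

# Hypothetical endpoint and field names - replace them with what the
# browser's network tab shows for the real data request.
api_url = 'https://www.carrefour.es/search-api/query?q=5449000000996'
data = requests.get(api_url).json()
for item in data.get('results', []):
    print(item.get('price'), item.get('pricePerKg'))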

Using Angular to get html of a website URL

I am new to Angular.
What I am trying to do is get the HTML of a page and reproduce it inside an iframe (it is an exercise).
I am using the following piece of code:
var prova = this._http.get(myUrl, {responseType: "text"}).subscribe((x) =>{
console.log(x);
});
I tried it on a website (if needed, I can also give the names of the pages) and it returns the HTML of only some pages.
In the other cases the string x is empty.
Could it depend on the connection?
Is there some way to wait for the end of the GET request?
Or is my approach simply wrong, and should I make a different type of request?
You're most likely going to need to use a library like Puppeteer if you want to render a page properly. Puppeteer is a Node library and uses headless Chrome, so I am not sure how well you could really integrate it with Angular.
https://github.com/GoogleChrome/puppeteer
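For reference, a minimal Puppeteer sketch (run as a standalone Node script, separate from the Angular app; the URL is a placeholder):
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // 'networkidle0' waits until the page's JavaScript and AJAX requests have settled
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  const html = await page.content(); // the fully rendered HTML
  console.log(html);
  await browser.close();
})();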

Get innerHTML via Jsoup

I'm trying to scrape data from this website: http://www.bundesliga.de/de/liga/tabelle/
In the source code I can see the tables, but there's no content, just things like:
<td></td>
<td></td>
<td></td>
<td></td>
....
With Firebug (F12 in Firefox) I don't see any content either, but I can select the table and then copy its innerHTML via the Firebug option. In that case I get all the information about the teams, but I don't know how to get the table with its content in Jsoup.
To get the value of an attribute, use the Node.attr(String key) method
For the text on an element (and its combined children), use Element.text()
For HTML, use Element.html(), or Node.outerHtml() as appropriate
For example:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link"
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example""
String linkOuterH = link.outerHtml();
// "<b>example</b>"
String linkInnerH = link.html(); // "<b>example</b>"
reference:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
The table is not rendered on the server directly, but built by the page's client-side JavaScript and filled with data that reaches the client via AJAX. So what you get with the naive Jsoup approach is expected.
I see two possible solutions:
You analyze the network traffic and identify the AJAX calls that the site is making. Then you try to reconstruct the format and fire the same requests the JavaScript would. From the responses you can reconstruct the table.
You don't use Jsoup but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You could use Selenium WebDriver for that. There is a headless browser called PhantomJS with a relatively small footprint that you can use in combination with Selenium WebDriver (see the sketch after the trade-offs below).
Both options have their (dis)advantages:
The first takes more time, since you need to understand the network traffic pretty well. The reward will be a very fast and memory-efficient scraper.
Programming Selenium is very easy, and you should not have any difficulties achieving your goal. You don't need to understand the inner workings of the site you want to scrape. However, the price is a further dependency in your project. Memory consumption is high. Another process runs. The scraping will be slow.
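For option 2, a minimal sketch with Selenium and the PhantomJS driver might look like this (assuming the GhostDriver binding is on the classpath and the phantomjs binary on the PATH):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

WebDriver driver = new PhantomJSDriver();
driver.get("http://www.bundesliga.de/de/liga/tabelle/");
String rendered = driver.getPageSource(); // the HTML after JavaScript and AJAX have run
driver.quit();

Document doc = Jsoup.parse(rendered); // now Jsoup sees the filled table
System.out.println(doc.select("table").first());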
Maybe you can find another source that holds the soccer table info you want? That might be the easiest option. For example http://www.fussballdaten.de/bundesliga/

How to handle HTML content in Windows 8 Metro App

I'm designing a Windows 8 reader app, and I have to use a control to show HTML content fetched from website feeds. Because that HTML content may contain images or other formatted text, I'm currently using a RichTextBlock to show it, but parsing the HTML costs a lot of time.
So I'm wondering if there is any control other than the WebView that can handle HTML content.
Thanks.
Updated:
The reason I can't use WebView is that I need to implement pagination.
As JP Alioto mentioned, you should use the WebView control.
You can use the NavigateToString method to load the HTML, or use Navigate to request a URI.
There are issues, however, with using the WebView control: specifically, it is rendered differently and is not a standard control, which means things like your app bar or settings pane will not render on top of the WebView. There is a workaround: use the WebViewBrush to "paint" the WebView onto a standard control, such as a Rectangle, when needed.
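For illustration, the workaround might look roughly like this (a sketch; MyWebView and MyRectangle are placeholder element names from your XAML):
var brush = new WebViewBrush();
brush.SetSource(MyWebView);   // capture the WebView's current rendering
brush.Redraw();
MyRectangle.Fill = brush;     // paint the snapshot onto a plain Rectangle
MyWebView.Visibility = Visibility.Collapsed; // hide the real WebView so overlays can draw on top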
Alternatively, you can take a screenshot of the webpage you want to display. Taking a screenshot of a webpage is not easy either, but I suggest doing it via special sites that are built to take screenshots of other websites. You can then download the image such a site returns and display it in your Windows 8 app. Here is an example of how I did that:
// folderName, fname and url are defined elsewhere in the original code
StorageFolder screens = await Windows.ApplicationModel.Package.Current.InstalledLocation.CreateFolderAsync(@"Screens\" + folderName, CreationCollisionOption.GenerateUniqueName);
var downloader = new BackgroundDownloader();
IStorageFile file = await screens.CreateFileAsync(fname, CreationCollisionOption.GenerateUniqueName);
string my_uri = "http://api.snapito.com/web/e3c351d5994134eb1aea855ce78e296c3292d48a/lc/" + url + "?type=jpeg";
DownloadOperation download = downloader.CreateDownload(new System.Uri(my_uri), file);
await download.StartAsync();
I think there are only two options, but neither of them is really good:
Use WebView and transform your HTML with CSS and other techniques to look native. Use the ScriptNotify, NavigationStarting, and other events to navigate to another page. In Windows 8.1 the WebView is much better (e.g. treated as a regular control, not floating over all other controls, ...).
Parse your HTML and generate native elements. I started such an implementation and created a XAML control to display HTML with native controls (see https://mytoolkit.codeplex.com/wikipage?title=HtmlTextBlock). However, if you have complex HTML (e.g. iframes), this may not work and you have no choice other than to use the WebView control.

Chrome extension, replace HTML in response code before browser displays it

I wonder if there is some way to do something like this:
If I'm on a specific site, I want some of the JavaScript files to be loaded directly from my computer (e.g. file:///c:/test.js), not from the server.
For that I was thinking of making an extension which could change the HTML code in a response the browser gets, right before it is displayed. The whole process should look like this:
request is made
browser gets response from server
#response is changed# - this is the part where the extension comes in
browser parses the changed response and displays the page built from that new response
It doesn't even have to be a Chrome extension. It should just do the job described above. It could block the original file and serve another one (DNS/proxy?) or filter all HTTP traffic on my computer and replace specific code in matched responses.
You can use the webRequest API to achieve that. For example, you can add an onBeforeRequest listener and redirect some requests:
chrome.webRequest.onBeforeRequest.addListener(function(details) {
  var responseData = "<div>Some text</div>";
  return {redirectUrl: "data:text/html," + encodeURIComponent(responseData)};
}, {urls: ["https://www.google.com/"]}, ["blocking"]);
This will display a <div> element with the text "Some text" instead of the Google homepage. Note that you can only redirect to URLs that the web server itself would be allowed to redirect to. This means that redirecting to file:/// URLs is not possible, and you can only redirect to files inside your extension if they are web accessible. data: and http: URLs work fine, however.
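Applied to the original goal of loading your own JavaScript file, a sketch might redirect the script to a copy bundled with the extension (the URL pattern and file name are placeholders; test.js must be listed under web_accessible_resources in the manifest):
chrome.webRequest.onBeforeRequest.addListener(function(details) {
  // Serve the extension's bundled copy instead of the server's script.
  return {redirectUrl: chrome.runtime.getURL("test.js")};
}, {urls: ["*://example.com/js/test.js"]}, ["blocking"]);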
On Windows you can use the Proxomitron (proxomitron.info), a local proxy that can intercept any page or file being loaded into your browser and change it however you want, using regular expressions (no DOM parsing), before it is rendered by the browser.