I am trying to crawl about 50 pages with puppeteer, and right now I am doing one after another in a single browser, single page. To make this faster, should I use more pages or more browsers?
I don't think you want to use either:
Using more pages isn't great because your browser will build up a history. This may not be what you intend.
Using more browsers isn't great because creating a new browser is a fairly expensive operation.
I think you want to use Browser Contexts. Think of a browser context as an incognito window - you can open it quickly, but it doesn't share history, cookies, etc. with other contexts.
Related
I’m building a simple browser, and I’d like to code most of it using HTML/JS/CSS. I cannot use iframes to display pages, due to frame-busting. What are my options?
The browser is not meant to be production-quality, but as a proof-of-concept for my thesis, similar to this interactive mockup. The main features it will need to support are:
Loading any page without frame-busting (even google.com),
Detecting when a link is clicked and opening it in a new frame, with the original one remaining intact.
I intend to write this using Web technologies, but it’s OK if it needs to be wrapped up in a small amount of something else, e.g., to turn it into an Android app. However, if it’s possible, it would be best if I could load the app as a web page. Finally, it is also preferable to be able to run the app on an Android device, but it’s OK if it only works on a desktop.
In researching this question, I came across a few options:
<iframe>. Google.com doesn't load at all in an iframe. I tried using the sandbox attribute, but it still didn’t load. Is there a way around this (for any page)?
Mozilla’s Browser API. This API was supposed to allow you to use the mozbrowser attribute in an iframe when building FirefoxOS apps. I suspect there’s no longer any way to access it. I couldn’t get the sample browser app loaded, and it seems that mozbrowser is not supported in WebExtensions. Did I miss something? Is there a way to make this work?
<webview> in a Chrome app. This is the only option that worked so far. I was able to download and install the sample browser app in Chrome. The one downside is that it seems to be Chrome-only (and I would prefer cross-platform or Firefox, all else being equal). Are there any issues with this option? Any way to make it run without Chrome?
Electron app with <webview>. While the setup here is more complicated than the previous option, it seems like the code would be very similar (there’s even a similar sample browser app). Are there any advantages/disadvantages to this option over the previous?
So, are there ways to make options 1 or 2 work? Are there perhaps other options?
HTML/CSS is a "language" translated by the browser into pages. You cannot code a browser in HTML. The easiest solution is to code it in C#.
I'm not sure if this is a good solution, but you can try Electron (nodeJS). You will only need to use JS/CSS/HTML.
I've read quite some post about the angularjs html5Mode including this one and I'm still puzzled:
What is it good for? I can see that I can use http://example.com/home instead of http://example.com/#/home, but the two chars saved are not worth mentioning, are they?
How is this related to HTML5?
The answers links to a page showing how to configure a server. It seems like the purpose of this rewriting to make the server return always the same page, no matter what the URL looks like. But doesn't it read to needlessly increased traffic?
Update after Peter Lyons's answer
I started to react in a comment, but it grew too long. His long and valuable answer raises some more questions of mine.
option of rendering the actual "/home"
Yes, but that means a lot of work.
crazy escaped fragment hacks
Yes, but this hack is easy to implement (I did it just a few hours ago). I actually don't know what I should do for in case of the html5mode (as I haven't finished reading this seo article yet.
Here's a demo
It works neither in my Chromium 25 nor in my Firefox 20. Sure, they're both ancient, but everything else I needed works in both of them.
Nope, it's the opposite. The server ONLY gets a full page request when the user first clicks
But the same holds for the hashbang, too. Moreover, a user following an external link to http://example.com/#!/home and then another link to http://example.com/#!/foreign will be always served the same page via the same URL, while in the html5mode they'll be served the same page (unless the burdensome optimization you mentioned gets done) via a different URL (which means it has to be loaded again).
but the two chars saved are not worth mentioning, are they?
Many people consider the URL without the hash considerably more "pretty" or "user friendly". Also, a very big difference is when you browse to a URL with a hash (a "fragment"), the browser does NOT include the fragment in it's request to the server, which means the server has a lot less information available to deliver the right content immediately. As compared to a regular URL without any fragment, where the full path "/home" IS including in the HTTP GET request to the server, thus the server has the option of rendering the actual "/home" content directly instead of sending the generic "index.html" content and waiting for javascript on the browser to update it once it loads and sees the fragment is "#home".
HTML5 mode is also better suited for search engine optimization without any crazy escaped fragment hacks. My guess is this is probably the largest contributing factor to the push for HTML5 mode.
How is this related to HTML5?
HTML5 introduced the necessary javascript APIs to change the browser's location bar URL without reloading the page and without just using the fragment portion of the URL. Here's a demo
It seems like the purpose of this rewriting to make the server return always the same page, no matter what the URL looks like. But doesn't it read to needlessly increased traffic?
Nope, it's the opposite. The server ONLY gets a full page request when the user first clicks a link onto the site OR does a manual browser reload. Otherwise, the user can navigate around the app clicking like mad and in some cases the server will see ZERO traffic from that. More commonly, each click will make at least one AJAX request to an API to get JSON data, but overall the approach serves to reduce browser<->server traffic. If you see an app responding instantly to clicks and the URL is changing, you have HTML5 to thank for that, as compared to a traditional app, where every click includes a certain minimum latency, a flicker as the full page reloads, input forms losing focus, etc.
It works neither in my Chromium 25 nor in my Firefox 20. Sure, they're both ancient, but everything else I needed works in both of them.
A good implementation will use HTML5 when available and fall back to fragments otherwise, but work fine in any browser. But in any case, the web is a moving target. At one point, everything was full page loads. Then there was AJAX and single page apps with fragments. Now HTML5 can do single page apps with fragmentless URLs. These are not overwhelmingly different approaches.
My feeling from this back and forth is like you want someone to declare for you one of these is canonically more appropriate than the other, and it's just not like that. It depends on the app, the users, their devices, etc. Twitter was all about fragments for a good long while and then they realized their mobile users were seeing too much latency and "time to first tweet" was too long, so they went back to server-side rendering of HTML with real data in it.
To your other point about rendering on the server being "a lot of work", it's true but some consider it the "holy grail" of web app development. Look at what airbnb has done with their
rendr framework. See also Derby JS. My point being, if you decide you want rendering in both the browser and the server, you pick a framework that offers that. Not that you have a lot of options to choose from at the moment, granted, but I wouldn't advise hacking together your own.
I'm learning about Chrome and Native Client.
Basically i want to increase number of pages that are prerendered
by chrome (now its just one page).
I was thinking about creating a extension that would
allow to prerender more pages.
Is this a way to go or am i left with hard coding it into Chrome and build from scratch?
EDIT
I started a bounty for this question. I would really appreciate some input.
No, there is no way to go, you would need to hardcode it in Chrome and rebuild as you noted.
As you probably already know, Chrome explicitly states that they currently limit the number of pages that can be prerendered:
Chrome will only prerender at max one URL per instance of Chrome (not one per tab). This limit may change in the future in some cases.
There is nothing in their supported API's or their experimental API's that will give you access to modify this. There is not a toggle in chrome://settings/ or chrome://flags/ or anywhere else in Chrome currently that allows you to change this. You can however use the Page Visibility API to determine if a page is being prerendered.
In addition to the one resource on the page that you specify using the <link rel="prerender" href="http://example.org/index.html"> you could also consider prefetching resources.
The issue with prefetching is it will only load the top level resource of at the specified URL. So, if you tried to prefetch the other pages, like so:
<link rel="prefetch" href="http://example.org/index.html">
...then only the index.html resource would be prefetched, not all of the CSS and JavaScript links containeed in the document. One approach could be to write out link tags for all of the contained resources in the page but that could get messy and difficult to maintain.
Another approach would be to wait for the current page to finish loading, and then use JavaScript to create an iframe, that is hidden off the page, targeted at the URLs you want to prefetch all the assets for. The browser would then load all of the content of the URL and it would be in the user's cache when they go to visit the page.
There is also a Chrome extension that combines these two approaches by searching for link tags that define a prefetch and then creates the hidden iframe causing the assets to be downloaded and cached.
If the goal is to optimize around client performance for navigating your site there might be other alternatives like creating a web application that uses a Single Page Application (SPA) style of development to reduce the number of times JS and CSS are loaded and executed. If you are a fan of Google then you might check out their framework for building SPAs called AngularJs.
If it was a good idea to pre-render more pages, Chrome would probably already do it. Pre-rendering too many pages will drain website bandwidth and possibly end up slowing down the whole web, which is the opposite of what you're trying to achieve. So it's most likely intentional that you can only pre-render a single page and you shouldn't try to break that.
Not possible. Chrome manages pre-rendering based on many factors. If this was possible, it could also be easily abused by many sites. You could, depending on your page, keep all content on one page.
If I display abc.jpg 20 times on a web page, does loading of the web page cause 20 http requests to the abc.jpg? Or it depends if I am using relative or absolute paths?
Thanks
It's down to the browser. A poorly written browser may request the same file multiple times, but any of the widely-used browsers will get this right. It shouldn't matter whether they are using relative or absolute paths (though mixing between relative and absolute paths on the same page might trip up some browsers, so you should probably avoid it).
It depends on the web browser, but any modern browser should only request it once.
It is up to the browser. A modern browser will try hard to cache the image. Use consistent URL format in your requests when possible - consistent capitalization, don't use "www." one time and no "www." another time, etc.
Download Firebug and use the 'Net' tab to inspect all requests.
For this case, I agree with the other answers, any modern browser with proper settings should cache it.
It does depend on the browser settings but it also depends on what the web server tells the client to do with the image.
See this, it's quite complicated
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
While I agree with the above statements, I suggest looking at your web server access log for the target image and comparing the referring page and browser fingerprint.
You will possibly see lots of hits to HEAD rather than GET in order to make sure the file cache is up to date.
I would like to know if it is possible to modify Chrome or Firefox display settings, so that it would only show rectangles of HTML DOM objects? What I want to do is to decrease rendering engine job amount as much as possible, so it would only build layout of the page.
People usually refer to this mode of operation as "headless" (i.e. without UI).
Usually there's an additional requirement - to be able to run it server-side without the usual for client software installed. If you're running it client-side, I wouldn't bother about optimization, it shouldn't give you a big win anyway.
Otherwise, try searching using that term. I've seen it asked for several times, but haven't seen a working out-of-box solution.
[edit] just saw http://hg.mozilla.org/incubator/offscreen, which seems to be a headless version of Mozilla.
I wouldn't go as low-level as modifying the renderer. Instead, I suggest you use Firefox's Greasemonkey to replace the elements from the page with whatever it is you need. You'll need to know a bit of JavaScript, but it's not that hard.
However, this will only work on client side. If you want to do this on server-side ( so that it will work on any page a user requests through your own ), my guess is you'll need to grab the page's content in a string, and then modify it using a HTML parser.