Session ID not preserved between page navigations using Puppeteer's .goto method? - puppeteer

When navigating to a sub-page using Puppeteer's goto method, I have noticed that cookie information is not being correctly preserved between navigations.
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.goto('http://www.example.com/Summary.aspx?sid=100-013-030');
  await page.screenshot({path: 'example1.png'});
  await page.goto('http://www.example.com/DetailInfo.aspx?did=af902cb3');
  await page.screenshot({path: 'example2.png'});
  await browser.close();
});
In the code above, the second goto call produces an example2.png that is a screenshot of the Summary landing page, indicating a silent failure. Conversely, when navigating manually in Chrome, copying and pasting the DetailInfo link into a new tab opens the intended page with no issue.
Upon further investigation, I noted that the website keeps a cookie with a session ID in the browser. What is the difference between the manual approach and Puppeteer that is creating this discrepancy?
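One way to narrow this down (a minimal debugging sketch, not a confirmed fix; the example.com URLs are the placeholders from the question) is to dump the cookies after each navigation with page.cookies() and check whether the session cookie actually survives the second goto:
const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.goto('http://www.example.com/Summary.aspx?sid=100-013-030');
  // Log the cookies set by the landing page; the session ID cookie
  // should appear here.
  console.log(await page.cookies());
  await page.goto('http://www.example.com/DetailInfo.aspx?did=af902cb3');
  // If the session cookie is missing or has changed here, the server is
  // likely bouncing the request back to the Summary page.
  console.log(await page.cookies());
  await browser.close();
});
If the cookie is intact both times, the difference may be something else the server checks, such as the Referer header or the headless user agent string.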

Related

Add timeout to CDN link loading on Head tag

I am using a CDN to load some styles. It works just fine on machines that have no proxy configured. But some people connect to the web page through proxies where the CDN cannot be resolved, causing the page to load very slowly or not at all unless the user forces a refresh.
Is there a way to specify an attribute or something in the HTML to avoid trying to load the resources if they cannot be resolved?
AFAIK there's nothing native like this to stop loading a resource; normally the request should time itself out if it can't be fetched.
This is kinda hacky and there's probably a better way to solve your specific issue, but you could try removing the link tag from your head and inserting a script instead that fetches the contents of the CDN and dynamically appends a style tag with those contents to your head. Something like this:
<script>
(function() {
  const controller = new AbortController();
  const HOW_LONG_TO_WAIT_IN_MS = 5000;
  let loaded = false;
  fetch(URL_FOR_CDN, { signal: controller.signal })
    .then(response => response.text())
    .then(text => {
      // Appending a raw string would insert plain text, so build a real
      // style element and set its contents instead.
      const style = document.createElement('style');
      style.textContent = text;
      document.head.append(style);
      loaded = true;
    });
  // Give up on the CDN if it hasn't answered within the time limit.
  setTimeout(() => {
    if (!loaded) controller.abort();
  }, HOW_LONG_TO_WAIT_IN_MS);
})();
</script>
You'll want the script to come early in your head, and this will definitely slow down loading of the page a little for all users.

puppeteer doesn't render pages with image URLs that lack a protocol scheme

I'm trying to use puppeteer to render HTML email messages which contain images from URLs that do not always include a protocol scheme. For example: <img src="example.com/someimage.jpg" />; the src really should have been https://example.com/someimage.jpg or http://....
I'm well aware that the URL should contain a protocol scheme, but I don't have control over the HTML received in the message body of the emails. Many mail clients such as Gmail will render such emails just fine. I would like to mimic this behavior in Puppeteer.
Is there some way in Puppeteer to trap the error and then:
try https:// prepended to the src, and failing that
try http:// prepended to the src, and failing that
then display a broken image?
This is what I do to render the html:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setJavaScriptEnabled(false);
await page.setContent(htmlEmailBody);
const content = await page.$("body");
const imageBuffer = await page.screenshot({type: "jpeg", omitBackground: true, fullPage: true});
This works fine when all the URLs have a scheme. What's the proper way to get this to work when some of the URLs don't contain the scheme?
This question is related to puppeteer doesn't open a url without protocol but unfortunately it doesn't answer my question.
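One possible workaround (a sketch under my own assumptions: the fixupSchemes helper and its regex are mine, and only the https:// case is handled, not the https-then-http retry or the broken-image fallback) is to preprocess the HTML before handing it to setContent:
const puppeteer = require('puppeteer');

// Hypothetical helper: prepend https:// to src attributes that have no
// scheme. The regex roughly skips http(s):, data:, and root-relative
// URLs; a real implementation should parse the HTML instead.
function fixupSchemes(html) {
  return html.replace(/src="(?!https?:|data:|\/)([^"]+)"/g, 'src="https://$1"');
}

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setJavaScriptEnabled(false);
await page.setContent(fixupSchemes(htmlEmailBody));
const imageBuffer = await page.screenshot({type: "jpeg", omitBackground: true, fullPage: true});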

How to intercept request in Puppeteer before current page is left?

Use case:
We need to capture all outbound routes from a page. Some of them may not be implemented as link elements <a href="..."> but via some JavaScript code or as GET/POST forms.
PhantomJS:
In PhantomJS we did this using the onNavigationRequested callback. We simply clicked all the elements matched by some selector, used onNavigationRequested to capture the target URL (and possibly the method or POST data in the case of a form), and then cancelled that navigation event.
Puppeteer:
I tried request interception, but at the moment the request gets intercepted the current page is already lost, so I would have to go back.
Is there a way to capture the navigation event while the browser is still at the page that triggered it, and to stop it?
Thank you.
You can do the following.
await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType() === 'image')
    request.abort();
  else
    request.continue();
});
Example here:
https://github.com/GoogleChrome/puppeteer/blob/master/examples/block-images.js
Available resource types are listed here:
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#requestresourcetype
So I finally discovered a solution that doesn't require a browser extension and therefore works in headless mode, thanks to this comment: https://github.com/GoogleChrome/puppeteer/issues/823#issuecomment-467408640
page.on('request', req => {
  if (req.isNavigationRequest() && req.frame() === page.mainFrame() && req.url() !== url) {
    // no redirect chain means the navigation is caused by setting `location.href`
    req.respond(req.redirectChain().length
      ? { body: '' }    // prevent 301/302 redirect
      : { status: 204 } // prevent navigation by js
    );
  } else {
    req.continue();
  }
});
EDIT: We have added a helper function to the Apify SDK that implements this - https://sdk.apify.com/docs/api/puppeteer#puppeteer.enqueueLinksByClickingElements
Here is the whole source code:
https://github.com/apifytech/apify-js/blob/master/src/enqueue_links/click_elements.js
It's slightly more complicated, as it not only needs to intercept requests but also catch newly opened windows, etc.
I met the same problem. Puppeteer doesn't support this feature at the moment; actually, it's the Chrome DevTools protocol that doesn't support it. But I found another way to solve it, using a Chrome extension. Related issue: https://github.com/GoogleChrome/puppeteer/issues/823
The author of the issue shared a solution here: https://gist.github.com/GuilloOme/2bd651e5154407d2d2165278d5cd7cdb
As the docs say, we can use chrome.webRequest.onBeforeRequest.addListener to intercept all requests from the page and block them if we want to.
Don't forget to add the following flags to the puppeteer launch options:
--load-extension=./your_ext/ --disable-extensions-except=./your_ext/
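For reference, here is a minimal sketch of what such an extension's background script could look like (Manifest V2 with the webRequest and webRequestBlocking permissions declared; the logging logic is illustrative, not taken from the gist):
// background.js (Manifest V2)
chrome.webRequest.onBeforeRequest.addListener(
  details => {
    // Record the would-be navigation target, then cancel the request so
    // the page never actually leaves.
    if (details.type === 'main_frame') {
      console.log('navigation to', details.url, 'method:', details.method);
      return { cancel: true };
    }
    return {};
  },
  { urls: ['<all_urls>'] },
  ['blocking', 'requestBody']
);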
page.setRequestInterception(true); — the documentation has a really thorough example here: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#pagesetrequestinterceptionvalue.
Make sure to add some logic like in the example (and below), which aborts image requests; for your case you would capture each request and then abort it.
page.on('request', interceptedRequest => {
  // url() is a method in current Puppeteer versions, not a property
  if (interceptedRequest.url().endsWith('.png') ||
      interceptedRequest.url().endsWith('.jpg'))
    interceptedRequest.abort();
  else
    interceptedRequest.continue();
});

How to convert HTML to image in Node.js

I need to convert an HTML template into an image, on a Node server.
The server will receive the HTML as a string. I tried PhantomJS (using a library called Webshot), but it doesn't work well with flexbox and modern CSS. I tried headless Chrome, but it doesn't seem to have an API for parsing HTML, only URLs.
What is the currently best way to convert a piece of HTML into image?
Is there a way to use headless Chrome in a template mode instead of URL mode? I mean, instead of doing something like
chrome.goTo('http://test.com')
I need something like:
chrome.evaluate('<div>hello world</div>');
Another option, suggested in the comments to this post, is to save the template in a file on the server, then serve it locally and do something like:
chrome.goTo('http://localhost/saved_template');
But this option sounds a bit awkward. Is there any other, more straightforward solution?
You can use a library called Puppeteer.
Sample code snippet :
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({
  width: 960,
  height: 760,
  deviceScaleFactor: 1,
});
await page.setContent(imgHTML);
await page.screenshot({path: 'example.png'});
await browser.close();
This will save a screenshot of the HTML in the root directory.
You can easily do it on the frontend using html2canvas. On the backend you can write the HTML to a file and access it using a file URI (e.g. file:///home/user/path/to/your/file.html); that should work fine with headless Chrome and Nightmare (its screenshot feature). Another option is to set up a simple HTTP server and access the URL.
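For completeness, a minimal sketch of that "serve it locally" route using Node's built-in http module (the port number and the htmlTemplate variable are placeholders of my own):
const http = require('http');
const puppeteer = require('puppeteer');

const htmlTemplate = '<div>hello world</div>'; // placeholder template

// Serve the template string on localhost so headless Chrome can load it
// like any other URL.
const server = http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(htmlTemplate);
}).listen(8080);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:8080/');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
  server.close();
})();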

What does "blob" mean in the `href` property in "<link>"? [duplicate]

My page generates a URL like this: "blob:http%3A//localhost%3A8383/568233a1-8b13-48b3-84d5-cca045ae384f" How can I convert it to a normal address?
I'm using it as an <img>'s src attribute.
A URL that was created from a JavaScript Blob cannot be converted to a "normal" URL.
A blob: URL does not refer to data that exists on the server; it refers to data that your browser currently has in memory, for the current page. It will not be available on other pages, it will not be available in other browsers, and it will not be available from other computers.
Therefore it does not make sense, in general, to convert a Blob URL to a "normal" URL. If you wanted an ordinary URL, you would have to send the data from the browser to a server and have the server make it available like an ordinary file.
It is possible to convert a blob: URL into a data: URL, at least in Chrome. You can use an AJAX request to "fetch" the data from the blob: URL (even though it's really just pulling it out of your browser's memory, not making an HTTP request).
Here's an example:
var blob = new Blob(["Hello, world!"], { type: 'text/plain' });
var blobUrl = URL.createObjectURL(blob);

var xhr = new XMLHttpRequest;
xhr.responseType = 'blob';
xhr.onload = function() {
  var recoveredBlob = xhr.response;
  var reader = new FileReader;
  reader.onload = function() {
    var blobAsDataUrl = reader.result;
    window.location = blobAsDataUrl;
  };
  reader.readAsDataURL(recoveredBlob);
};
xhr.open('GET', blobUrl);
xhr.send();
data: URLs are probably not what you mean by "normal" and can be problematically large. However they do work like normal URLs in that they can be shared; they're not specific to the current browser or session.
Another way to create a data URL from a blob URL may be to use a canvas:
var canvas = document.createElement("canvas");
canvas.width = img.width;   // i assume that img.src is your blob url
canvas.height = img.height;
var context = canvas.getContext("2d");
context.drawImage(img, 0, 0);
var dataurl = canvas.toDataURL("image/png"); // or e.g. ("image/jpeg", 0.9) for your preferred type and quality
As I saw on MDN, canvas.toDataURL is supported well by browsers (except IE < 9, always IE < 9).
For those who came here looking for a way to download a blob url video / audio, this answer worked for me. In short, you would need to find an *.m3u8 file on the desired web page through Chrome -> Network tab and paste it into a VLC player.
Another guide shows you how to save a stream with the VLC Player.
UPDATE:
An alternative way of downloading the videos from a blob url is by using the mass downloader and joining the files together.
Download Videos Part
Open network tab in chrome dev tools
Reload the webpage
Filter .m3u8 files
Look through all the filtered files and find the playlist of the '.ts' files. It should look something like this:
You need to extract those links somehow, either by downloading and editing the file manually or by any other method you like. As you can see below, those links are very similar; the only thing that differs is the serial number of the video: 's-0-v1-a1.ts', 's-1-v1-a1.ts', etc.
https://some-website.net/del/8cf.m3u8/s-0-v1-a1.ts
https://some-website.net/del/8cf.m3u8/s-1-v1-a1.ts
https://some-website.net/del/8cf.m3u8/s-2-v1-a1.ts
and so on up to the last link in the .m3u8 playlist file. These .ts files are actually your video. You need to download all of them.
For bulk downloading I prefer using the Simple Mass Downloader extension for Chrome (https://chrome.google.com/webstore/detail/simple-mass-downloader/abdkkegmcbiomijcbdaodaflgehfffed)
If you opt in for the Simple Mass Downloader, you need to:
a. Select a Pattern URL
b. Enter your link in the address field with only one modification: the part of the link that changes for each video needs to be replaced with a pattern in square brackets, [0:400], where 0 is the first file number and 400 is the last one. So your link should look something like this: https://some-website.net/del/8cf.m3u8/s-[0:400]-v1-a1.ts.
Afterwards hit the Import button to add these links into the Download List of Mass Downloader.
c. The next action may ask you for the destination folder for EACH video you download, so it is highly recommended to specify the default download folder in Chrome Settings and disable the Select Destination option there as well. This will save you a lot of time! Additionally, you may want to specify the folder where these files will go:
c1. Click on Select All checkbox to select all files from the Download List.
c2. Click on the Download button in the bottom right corner of the SMD extension window. It will take you to the next tab to start downloading.
c3. Hit Start selected. This will download all vids automatically into the download folder.
That is it! Simply wait till all the files are downloaded and you can watch them via the VLC Player or any other player that supports the .ts format. However, if you want one video instead of the many you have downloaded, you need to join all these mini-videos together.
Joining Videos Part
Since I am working on Mac, I am not aware of how you would do this on Windows. If you are the Windows user and you want to merge the videos, feel free to google for the windows solution. The next steps are applicable for Mac only.
Open Terminal in the folder you want the new video to be saved in
Type: cat and hit space
Open the folder where you downloaded your .ts video. Select all .ts videos that you want to join (use your mouse or cmd+A)
Drag and drop them into the terminal
Hit space
Hit >
Hit Space
Type the name of the new video, e.g. my_new_video.ts. Please note that the format has to be the same as in the original videos, otherwise it will take a long time to convert and may even fail!
Hit Enter. Wait for the terminal to finish the joining process and enjoy watching your video!
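The resulting command should look something like this (the file names are examples; make sure they appear in the correct playback order):
cat s-0-v1-a1.ts s-1-v1-a1.ts s-2-v1-a1.ts > my_new_video.ts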
Found this answer here and wanted to reference it, as it appears much cleaner than the accepted answer:
function blobToDataURL(blob, callback) {
  var fileReader = new FileReader();
  fileReader.onload = function(e) { callback(e.target.result); };
  fileReader.readAsDataURL(blob);
}
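Usage could look like this (assuming blob already holds the Blob you recovered, e.g. via the XHR or fetch approaches shown in the other answers):
blobToDataURL(blob, function(dataUrl) {
  console.log(dataUrl); // a "data:..." string usable as an img src
});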
I'm very late to the party.
If you want to download the content, you can simply use fetch now:
fetch(blobURL)
  .then(res => res.blob())
  .then(blob => { /* do what you want with the blob here */ })
Here is the solution:
let blob = new Blob(chunks, { type: 'video/mp4' });
let videoURL = window.URL.createObjectURL(blob);
const blobF = await fetch(videoURL).then(res => res.blob());
As the previous answers have said, there is no way to decode it back to a URL; even when you try to inspect it in the Chrome DevTools panel, the URL may still be shown as a blob.
However, it's possible to get the data. Another way to obtain it is to put the blob URL into an anchor and download it directly:
<a href="blob:http://example.com/xxxx-xxxx-xxxx-xxxx" download>download</a>
Insert this into the page containing the blob URL and click the link, and you get the content.
Another way is to intercept the AJAX call via a proxy server; then you could view the true image URL.