puppeteer-cluster: how to prevent the page from being closed? - puppeteer

I am glad I found puppeteer-cluster. This library has made crawling and automation tasks much easier. Thanks to Thomas Dondorf.
According to the author of puppeteer-cluster, the page is closed immediately when a task finishes. That is fine in general, but what about cases where you need the page to stay open?
My use case, briefly:
There is some activity on the page where a socket in the background sends data to the front end. This data changes the DOM, and I need to capture that.
This is my code:
const os = require('os');
const { Cluster } = require('puppeteer-cluster');
const { crawler } = require('./crawler'); // path assumed; this is the module shown below

const numCPUs = os.cpus().length; // assumed: one worker per CPU core

async function runCrawler() {
  const links = [
    "foo.com/barSome324",
    "foo.com/barSome22",
    "foo.com/barSome1",
    "foo.com/barSome765",
  ];
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    workerCreationDelay: 5000,
    puppeteerOptions: { args: ['--no-sandbox', '--disable-setuid-sandbox'], headless: false },
    maxConcurrency: numCPUs,
  });
  await cluster.task(async ({ page, data: url }) => {
    await crawler(page, url);
  });
  for (const link of links) {
    await cluster.queue(link);
  }
  await cluster.idle();
  await cluster.close();
}
And this is the crawler logic for each page:
module.exports.crawler = async (page, link) => {
  await page.goto(link, { waitUntil: 'networkidle2' });
  await page.waitForTimeout(10000);
  await page.waitForSelector('#dbp');
  try {
    // method to be executed;
    // note: this function returns right after scheduling the interval,
    // so the cluster considers the task finished and closes the page
    setInterval(async () => {
      const tables = await page.evaluate(async () => {
        // data I need to capture every 30 seconds
      });
    }, 30000);
  } catch (error) {
    console.log(error);
  }
};
I searched and found out that in JS we can capture DOM changes with a MutationObserver, and I tried that solution, but it did not work either. The page gets closed with this error:
UnhandledPromiseRejectionWarning: Error: Protocol error
(Runtime.callFunctionOn): Session closed. Most likely the page has
been closed.
So I have two options here:
1. MutationObserver
2. setInterval to evaluate the page itself every 30 seconds
Neither of them suits my needs, so any idea how to overcome this problem?
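One way to keep the page open is to make the task's promise resolve only after data collection has finished, since puppeteer-cluster only closes the page once the task function returns. Below is a minimal sketch of option 2 along those lines; the '#dbp' selector and the 30-second interval come from the code above, while the fixed number of collection rounds is an assumption added for illustration:

module.exports.crawler = async (page, link) => {
  await page.goto(link, { waitUntil: 'networkidle2' });
  await page.waitForSelector('#dbp');

  const rounds = 10; // assumption: stop after 10 samples (~5 minutes)
  const results = [];

  // wrapping the interval in a promise keeps the task pending,
  // so the cluster does not close the page until we resolve
  await new Promise((resolve, reject) => {
    const timer = setInterval(async () => {
      try {
        const tables = await page.evaluate(() => {
          // read whatever the socket pushed into the DOM
          return document.querySelector('#dbp').innerText;
        });
        results.push(tables);
        if (results.length >= rounds) {
          clearInterval(timer);
          resolve();
        }
      } catch (err) {
        clearInterval(timer);
        reject(err);
      }
    }, 30000);
  });

  return results;
};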

Related

How to abort a request in Puppeteer when using "Network.requestIntercepted" (same as request.abort() of setRequestInterception)?

I can't intercept with the setRequestInterception function because some important requests don't go through it.
Instead I used the CDP Network.setRequestInterception command, which works well, except that I am not able to abort the unwanted requests.
If I simply ignore them in the else branch, a lot of requests never leave the browser and hang, causing many errors.
I need an equivalent of request.abort() when I work with Network.requestIntercepted. Here is my code:
const client = await page.target().createCDPSession();
await client.send('Network.enable');
await client.send('Network.setRequestInterception', {
  patterns: [
    { urlPattern: '*' }
  ]
});
client.on('Network.requestIntercepted', async ({ interceptionId, request }) => {
  if (request.url.indexOf('Loading') == -1) {
    await client.send('Network.continueInterceptedRequest', {
      interceptionId
    });
  } else {
    // I want request.abort()
  }
});
Thanks!!
It seems the same 'Network.continueInterceptedRequest' can be used to abort:
} else {
  await client.send('Network.continueInterceptedRequest', {
    interceptionId,
    errorReason: 'Aborted',
  });
}
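For reference, this is roughly what the question's handler looks like with that abort branch folded in (the 'Loading' URL filter is taken from the question and is only an example of which requests to block):

client.on('Network.requestIntercepted', async ({ interceptionId, request }) => {
  if (request.url.indexOf('Loading') === -1) {
    // let the request through unchanged
    await client.send('Network.continueInterceptedRequest', { interceptionId });
  } else {
    // abort it instead of leaving it hanging
    await client.send('Network.continueInterceptedRequest', {
      interceptionId,
      errorReason: 'Aborted',
    });
  }
});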

GDrive API v3 files.get download progress?

How can I show progress of a download of a large file from GDrive using the gapi client-side v3 API?
I am using the v3 API, and I've tried to use a Range request in the header, which works, but the download is very slow (below). My ultimate goal is to play back 4K video. GDrive limits playback to 1920x1280. My plan was to download chunks to IndexedDB via the v3 API and play from the locally cached data. I have this working using the code below via Range requests, but it is unusably slow. A normal download of the full 438 MB test file directly (e.g. via the GDrive web page) takes about 30-35s on my connection, and, coincidentally, each 1 MB Range request takes almost exactly the same 30-35s. It feels like the GDrive back-end is reading and sending the full file for each subrange?
I've also tried using XHR and fetch to download the file, which fails. I've been using the webContent link (which typically ends in &export=download) but I cannot get the access headers correct. I get either CORS or other odd permission issues. The webContent links work fine in <image> and <video> src tags. I expect this is due to special permission handling or some header information I'm missing that the browser handles specifically for these media tags. My solution must be able to read private (non-public, non-sharable) links, hence the use of the v3 API.
For video files that are smaller than the GDrive limit, I can set up a MediaRecorder and use a <video> element to get the data with progress. Unfortunately, the 1920x1080 limit kills this approach for larger files, where progress feedback is even more important.
This is the client-side gapi Range code, which works, but is unusably slow for large (400 MB - 2 GB) files:
const getRange = (start, end, size, fileId, onProgress) => (
  new Promise((resolve, reject) => gapi.client.drive.files.get(
    { fileId, alt: 'media', Range: `bytes=${start}-${end}` },
    // { responseType: 'stream' }, Perhaps this fails in the browser?
  ).then(res => {
    if (onProgress) {
      const cancel = onProgress({ loaded: end, size, fileId })
      if (cancel) {
        reject(new Error(`Progress canceled download at range ${start} to ${end} in ${fileId}`))
      }
    }
    return resolve(res.body)
  }, err => reject(err)))
)
export const downloadFileId = async (fileId, size, onProgress) => {
  const batch = 1024 * 1024
  try {
    const chunks = []
    for (let start = 0; start < size; start += batch) {
      const end = Math.min(size, start + batch - 1)
      const data = await getRange(start, end, size, fileId, onProgress)
      if (!data) throw new Error(`Unable to get range ${start} to ${end} in ${fileId}`)
      chunks.push(data)
    }
    return chunks.join('')
  } catch (err) {
    return console.error(`Error downloading file: ${err.message}`)
  }
}
Authentication works fine, and I use other GDrive commands without problems. I'm currently using the drive.photos.readonly scope, but I have the same issues even if I use a full write-permission scope.
Tangentially, I'm unable to get a stream when running client-side using gapi (works fine in node on the server-side). This is just weird. If I could get a stream, I think I could use that to get progress. Whenever I add the commented-out line for the responseType: 'stream', I get the following error: The server encountered a temporary error and could not complete your request. Please try again in 30 seconds. That’s all we know. Of course waiting does NOT help, and I can get a successful response if I do not request the stream.
I switched to using XMLHttpRequest directly, rather than the gapi wrapper. Google provides these instructions for using CORS that show how to convert any request from using gapi to an XHR. Then you can attach to the onprogress event (and onload, onerror and others) to get progress.
Here's the drop-in replacement code for the downloadFileId method in the question, with a bunch of debugging scaffolding:
const xhrDownloadFileId = (fileId, onProgress) => new Promise((resolve, reject) => {
  const user = gapi.auth2.getAuthInstance().currentUser.get()
  const oauthToken = user.getAuthResponse().access_token
  const xhr = new XMLHttpRequest()
  xhr.open('GET', `https://www.googleapis.com/drive/v3/files/${fileId}?alt=media`)
  xhr.setRequestHeader('Authorization', `Bearer ${oauthToken}`)
  xhr.responseType = 'blob'
  xhr.onloadstart = event => {
    console.log(`xhr ${fileId}: on load start`)
    const { loaded, total } = event
    onProgress({ loaded, size: total })
  }
  xhr.onprogress = event => {
    console.log(`xhr ${fileId}: loaded ${event.loaded} of ${event.total} ${event.lengthComputable ? '' : 'non-'}computable`)
    const { loaded, total } = event
    onProgress({ loaded, size: total })
  }
  xhr.onabort = event => {
    console.warn(`xhr ${fileId}: download aborted at ${event.loaded} of ${event.total}`)
    reject(new Error('Download aborted'))
  }
  xhr.onerror = event => {
    console.error(`xhr ${fileId}: download error at ${event.loaded} of ${event.total}`)
    reject(new Error('Error downloading file'))
  }
  xhr.onload = event => {
    console.log(`xhr ${fileId}: download of ${event.total} succeeded`)
    const { loaded, total } = event
    onProgress({ loaded, size: total })
    resolve(xhr.response)
  }
  xhr.onloadend = event => console.log(`xhr ${fileId}: download of ${event.total} completed`)
  xhr.ontimeout = event => {
    console.warn(`xhr ${fileId}: download timeout after ${event.loaded} of ${event.total}`)
    reject(new Error('Timeout downloading file'))
  }
  xhr.send()
})
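A hypothetical call site, with an onProgress callback that logs a percentage (fileId here stands for whatever Drive file id you are downloading):

// inside an async function, assuming a valid fileId
const blob = await xhrDownloadFileId(fileId, ({ loaded, size }) => {
  if (size) console.log(`downloaded ${Math.round((loaded / size) * 100)}%`)
})
// the blob response can then be cached (e.g. in IndexedDB) or turned into
// an object URL for a <video> element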

get post title after Infinite scroll finished

I managed to show all the posts on a site that has a load_more button to go to the next page, but something is missing.
I got this error:
e Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint (/Users/minghann/Documents/productnation_scraper/node_modules/puppeteer/lib/ExecutionContext.js:331:13)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
This doesn't happen if I don't load all the posts. It's hard to debug because I don't know which post is missing what. Full code below:
const browser = await puppeteer.launch({
  devtools: true
});
const page = await browser.newPage();
await page.goto("https://example.net");
await page.waitForSelector(".load_more_btn");
const load_more_exist = !!(await page.$(".load_more_btn"));
while (load_more_exist > 0) {
  await page.click(".load_more_btn");
}
const posts = await page.$$(".post");
let result = [];
for (const post of posts) {
  result = [
    ...result,
    {
      title: await post.$eval(".post_title a", e => e.innerText)
    }
  ];
}
console.log(result);
browser.close();
There are multiple ways to do this, and the best approach is to combine the following two.
Look for Ajax
Wait for the request instead. Whenever you click on Load More, it makes a simple ajax request to ?ajax-request=jnews. We can use page.waitForRequest or page.waitForResponse for this use case. Here is a working example:
await Promise.all([
  page.waitForResponse(response => response.url().includes('?ajax-request=jnews') && response.status() === 200),
  page.click(".load_more_btn")
])
Clean the DOM and wait for new elements
Refer to these answers here and here.
Basically, you can remove the DOM elements you have already collected, so the next time you collect data there won't be any duplicates.
So, once you remove all the current elements matched by document.querySelectorAll('.jeg_post'), you can simply do another page.waitFor('.jeg_post') later if you need to.
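Putting the two ideas together, a rough sketch might look like the following; the '?ajax-request=jnews' URL and the '.load_more_btn' / '.post_title a' selectors are taken from the question and answer above, and the loop assumes the button disappears once there is nothing left to load:

// click "load more" until the button is gone, waiting for each ajax round trip
while (await page.$(".load_more_btn") !== null) {
  await Promise.all([
    page.waitForResponse(res =>
      res.url().includes('?ajax-request=jnews') && res.status() === 200),
    page.click(".load_more_btn"),
  ]);
}
// collect all titles in one evaluate call
const titles = await page.$$eval(".post_title a", links => links.map(a => a.innerText));
console.log(titles);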

How to handle multiple redirection in puppeteer?

I am trying to open a page after a form post inside evaluate. There are two redirections after the form post (the number can vary) before I reach the final page.
I tried to handle it by putting the following (two times, for two redirections) after the evaluate call in which the form post happens.
await page.waitForNavigation({'waitUntil':'domcontentloaded'});
await page.waitForNavigation({'waitUntil':'domcontentloaded'});
The above worked properly, but I have to handle situations where any number of redirections can happen.
I won't have any specific selector on the DOM, as the page can be different each time.
Puppeteer version: 1.4.0
Platform / OS version: Linux
URLs (if applicable): NA
Node.js version: 8.10.0
Below is the part of the code I am using:
const formPost = await page.evaluate(a => {
  var form = formBuilder("payment_post", "post", acsUrl);
  for (var i in a) {
    form.add(i, i, 'hidden', a[i]);
  }
  form.generate("pareqFormContainer");
  form.submit();
  return document.querySelector('#pareqFormContainer').innerHTML;
}, jsonData)
  .then(function () {
    logger.info("form submitted with pareq and MD for txnId : " + jsonData.txnId)
  });
await page.waitForNavigation({ 'waitUntil': 'domcontentloaded', 'timeout': waitTimeOut });
await page.waitForNavigation({ 'waitUntil': 'domcontentloaded', 'timeout': waitTimeOut });
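One way to avoid hard-coding the number of waitForNavigation calls is to keep waiting until no further navigation starts within a short window. This is only a sketch under that assumption; the 5-second per-hop timeout is arbitrary, and waitForAllRedirects is a hypothetical helper, not part of Puppeteer:

// keep following navigations until none starts within hopTimeout ms
async function waitForAllRedirects(page, hopTimeout = 5000) {
  for (;;) {
    try {
      await page.waitForNavigation({ waitUntil: 'domcontentloaded', timeout: hopTimeout });
    } catch (err) {
      break; // timed out: no further navigation, assume we reached the final page
    }
  }
}

// usage after the form post:
// await waitForAllRedirects(page);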

How to enable parallel tests with puppeteer?

I am using the Chrome Puppeteer library directly to run browser integration tests. I have a few tests written now in individual files. Is there a way to run them in parallel? What is the best way to achieve this?
To run puppeteer instances in parallel you can check out this library I wrote: puppeteer-cluster
It helps to run different puppeteer tasks in parallel in multiple browsers, contexts or pages and takes care of errors and browser crashes. Here is a minimal example:
const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // use one browser context per worker
    maxConcurrency: 4, // cluster with four workers
  });

  // Define a task to be executed for your data
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // ...
  });

  // Queue URLs
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // ...

  // Wait for the cluster to idle and close it
  await cluster.idle();
  await cluster.close();
})();
You can also queue your functions directly like this:
const cluster = await Cluster.launch(...);

cluster.queue(async ({ page }) => {
  await page.goto('http://www.wikipedia.org');
  await page.screenshot({ path: 'wikipedia.png' });
});

cluster.queue(async ({ page }) => {
  await page.goto('https://www.google.com/');
  const pageTitle = await page.evaluate(() => document.title);
  // ...
});

cluster.queue(async ({ page }) => {
  await page.goto('https://www.example.com/');
  // ...
});
// My tests contain about 30 pages I want to test in parallel
const aBunchOfUrls = [
  {
    desc: 'Name of test #1',
    url: SOME_URL,
  },
  {
    desc: 'Name of test #2',
    url: ANOTHER_URL,
  },
  // ... snip ...
];

const browserPromise = puppeteer.launch();

// These tests pass! And rather quickly. The slowest link is the backend server.
// They're running concurrently, generating a new page within the same browser instance
describe('Generate about 20 parallel page tests', () => {
  aBunchOfUrls.forEach((testObj, idx) => {
    it.concurrent(testObj.desc, async () => {
      const browser = await browserPromise;
      const page = await browser.newPage();
      await page.goto(testObj.url, { waitUntil: 'networkidle' });
      await page.waitForSelector('#content');
      // assert things..
    });
  });
});
from https://github.com/GoogleChrome/puppeteer/issues/474
written by https://github.com/quicksnap
How I achieved this was to create your suite(s) of tests in individual files, as you have done already. Then create a testSuiteRunner.js file (or whatever you wish to call it) and set it up as follows:
require('path/to/test/suite/1');
require('path/to/test/suite/2');
require('path/to/test/suite/3');
...
Import all of your suites using require statements, like above (no need to give them const variable names or anything like that), and then you can use node ./path/to/testSuiteRunner.js to execute all of your suites in parallel. Simplest solution I could come up with for this one!
I think the best idea would be to use a test runner like Jest that can manage that for you. At least that is how I do it. Keep in mind that the machine might blow up if you run too many Chrome instances at the same time, so it's safer to limit the number of parallel tests to 2 or so.
Unfortunately, the official documentation doesn't clearly describe how Jest parallelizes tests. You might find https://github.com/facebook/jest/issues/6957 useful.
Libraries such as puppeteer-cluster are great, but remember that first of all you want to parallelize your tests, not Puppeteer's tasks.
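If you go the Jest route, one way to cap how many Chrome instances start at once is Jest's maxWorkers option. A minimal config sketch (the value 2 follows the suggestion above, and the generous test timeout is an assumption for slow browser tests):

// jest.config.js (sketch)
module.exports = {
  maxWorkers: 2,      // run at most two test files in parallel
  testTimeout: 30000, // assumption: browser tests can be slow
};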