How to enable parallel tests with puppeteer? - puppeteer

I am using the Chrome Puppeteer library directly to run browser integration tests. I have a few tests written now in individual files. Is there a way to run them in parallel? What is the best way to achieve this?

To run puppeteer instances in parallel you can check out this library I wrote: puppeteer-cluster
It helps to run different puppeteer tasks in parallel in multiple browsers, contexts or pages and takes care of errors and browser crashes. Here is a minimal example:
const { Cluster } = require('puppeteer-cluster');
(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // one incognito browser context per worker
    maxConcurrency: 4, // cluster with four workers
  });
  // Define a task to be executed for your data
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // ...
  });
  // Queue URLs
  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // ...
  // Wait for cluster to idle and close it
  await cluster.idle();
  await cluster.close();
})();
You can also queue your functions directly like this:
const cluster = await Cluster.launch(...);
cluster.queue(async ({ page }) => {
  await page.goto('http://www.wikipedia.org');
  await page.screenshot({path: 'wikipedia.png'});
});
cluster.queue(async ({ page }) => {
  await page.goto('https://www.google.com/');
  const pageTitle = await page.evaluate(() => document.title);
  // ...
});
cluster.queue(async ({ page }) => {
  await page.goto('https://www.example.com/');
  // ...
});
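For integration tests it helps to know which queued task failed. Here is a minimal sketch, assuming each test is a plain async function that receives `{ page }`. `cluster.execute()` works like `cluster.queue()` but returns a promise that rejects when the task throws, so outcomes can be collected with `Promise.allSettled`. The `runAll` helper and the `{ desc, fn }` test shape are illustrative names, not part of puppeteer-cluster.

```javascript
// Hypothetical helper: run every test through the cluster and collect outcomes.
// `cluster` is a puppeteer-cluster instance; `tests` is an array of { desc, fn }.
async function runAll(cluster, tests) {
  const settled = await Promise.allSettled(
    // execute(data, taskFunction) resolves with the task's return value
    // and rejects with the error the task threw
    tests.map((test) => cluster.execute(test.desc, test.fn))
  );
  return settled.map((outcome, i) => ({
    desc: tests[i].desc,
    passed: outcome.status === 'fulfilled',
    error: outcome.status === 'rejected' ? outcome.reason : null,
  }));
}

// Example wiring (requires puppeteer-cluster; not invoked automatically).
async function main() {
  const { Cluster } = require('puppeteer-cluster');
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 4,
  });
  const results = await runAll(cluster, [
    {
      desc: 'example.com has a title',
      fn: async ({ page }) => {
        await page.goto('https://www.example.com/');
        if (!(await page.title())) throw new Error('empty title');
      },
    },
  ]);
  console.log(results);
  await cluster.idle();
  await cluster.close();
}
```

Call `main()` to try it against a real cluster. A failing test then shows up as `passed: false` with its error instead of only being logged by the cluster.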

const puppeteer = require('puppeteer');
// My tests contain about 30 pages I want to test in parallel
const aBunchOfUrls = [
  {
    desc: 'Name of test #1',
    url: SOME_URL,
  },
  {
    desc: 'Name of test #2',
    url: ANOTHER_URL,
  },
  // ... snip ...
];
const browserPromise = puppeteer.launch();
// These tests pass! And rather quickly. Slowest link is the backend server.
// They're running concurrently, generating a new page within the same browser instance
describe('Generate about 20 parallel page tests', () => {
  aBunchOfUrls.forEach((testObj, idx) => {
    it.concurrent(testObj.desc, async () => {
      const browser = await browserPromise;
      const page = await browser.newPage();
      // 'networkidle' in pre-1.0 Puppeteer; later versions use 'networkidle0' or 'networkidle2'
      await page.goto(testObj.url, { waitUntil: 'networkidle2' });
      await page.waitForSelector('#content');
      // assert things..
    });
  });
});
from https://github.com/GoogleChrome/puppeteer/issues/474
written by https://github.com/quicksnap
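The pattern above never closes the pages it opens. Here is a small sketch of a wrapper that closes each page even when an assertion throws. `withPage` is a hypothetical helper name; only `browser.newPage()` and `page.close()` are Puppeteer calls.

```javascript
// Hypothetical helper: open a page, run the test body, and always close the page.
async function withPage(browser, body) {
  const page = await browser.newPage();
  try {
    return await body(page);
  } finally {
    // Runs whether the body resolved or threw
    await page.close();
  }
}

// Usage inside a Jest test (sketch, reusing the names from the snippet above):
// it.concurrent(testObj.desc, async () => {
//   const browser = await browserPromise;
//   await withPage(browser, async (page) => {
//     await page.goto(testObj.url);
//     await page.waitForSelector('#content');
//   });
// });
```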

How I achieved this was to create your suite(s) of tests in individual files, as you have done already. Then create a testSuiteRunner.js file (or whatever you wish to call it) and set it up as follows:
require('path/to/test/suite/1');
require('path/to/test/suite/2');
require('path/to/test/suite/3');
...
Import all of your suites using require statements as above (there is no need to assign them to const variables), then run node ./path/to/testSuiteRunner.js to execute all of your suites in parallel. Simplest solution I could come up with for this one!
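Loading suites with require keeps them all in one Node.js process. As an alternative sketch (not the approach described above), each suite file could be started in its own process with the standard `child_process` module, so that a crash in one suite does not take down the others. `runSuite`, `runSuites`, and the file paths are placeholders.

```javascript
const { spawn } = require('child_process');

// Start one Node.js process for a suite file and resolve with its exit code.
function runSuite(file) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, [file], { stdio: 'inherit' });
    child.on('error', reject);
    child.on('close', (code) => resolve({ file, code }));
  });
}

// All suites are started immediately, so they run in parallel.
function runSuites(files) {
  return Promise.all(files.map(runSuite));
}

// Example (placeholder paths):
// runSuites(['path/to/test/suite/1.js', 'path/to/test/suite/2.js'])
//   .then((results) => {
//     process.exitCode = results.some((r) => r.code !== 0) ? 1 : 0;
//   });
```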

I think the best idea is to use a test runner like Jest that can manage this for you; that is how I do it. Keep in mind that the machine may run out of resources if you run too many Chrome instances at the same time, so it's safer to limit the number of parallel tests to 2 or so.
Unfortunately, the official documentation does not clearly describe how Jest parallelizes tests. You might find https://github.com/facebook/jest/issues/6957 useful.
Libraries such as puppeteer-cluster are great, but remember that what you want to parallelize is your tests, not Puppeteer's tasks.

Related

puppeteer-cluster: how to prevent the page from closing?

I am glad to have found puppeteer-cluster. This library makes crawling and automation tasks easy; thanks to Thomas Dondorf.
According to the author of puppeteer-cluster, the page is closed immediately when a task finishes. That is usually what you want, but what about cases where you need the page to stay open?
My use case, briefly:
There is some activity on the page in which a socket in the background sends data to the front end. This data changes the DOM, and I need to capture that.
This is my code:
const { Cluster } = require('puppeteer-cluster');
const numCPUs = require('os').cpus().length; // assumption: numCPUs is the CPU core count
async function runCrawler() {
  const links = [
    "foo.com/barSome324",
    "foo.com/barSome22",
    "foo.com/barSome1",
    "foo.com/barSome765",
  ];
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    workerCreationDelay: 5000,
    puppeteerOptions: { args: ['--no-sandbox', '--disable-setuid-sandbox'], headless: false },
    maxConcurrency: numCPUs,
  });
  await cluster.task(async ({ page, data: url }) => {
    await crawler(page, url);
  });
  // declare the loop variable instead of leaking an implicit global
  for (const link of links) {
    await cluster.queue(link);
  }
  await cluster.idle();
  await cluster.close();
}
And this is the crawler logic that runs on each page:
module.exports.crawler = async (page, link) => {
  await page.goto(link, { waitUntil: 'networkidle2' })
  await page.waitForTimeout(10000)
  await page.waitForSelector('#dbp')
  try {
    // method to be executed;
    setInterval(async () => {
      const tables = await page.evaluate(async () => {
        /// data I need to catch every 30 seconds
      });
    }, 30000)
  } catch (error) {
    console.log(error)
  }
}
I searched and found that in JavaScript, DOM changes can be captured with MutationObserver, so I tried that as well, but it did not work either. The page gets closed with this error:
UnhandledPromiseRejectionWarning: Error: Protocol error
(Runtime.callFunctionOn): Session closed. Most likely the page has
been closed.
So I have two options here:
1. MutationObserver
2. setInterval, evaluating the page every 30 seconds
Neither has worked for me. Any idea how to overcome this problem?
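One way to approach this, sketched under the assumption stated in the question that puppeteer-cluster closes the page as soon as the task function's promise settles: `setInterval` returns immediately, so the task finishes and the page is closed before the first tick. Awaiting a polling loop inside the task keeps the promise pending, and therefore the page open, for as long as the loop runs. `pollPage`, `rounds`, and `onData` are hypothetical names, and the fixed number of rounds is only there so that the task eventually ends.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: evaluate `extract` in the page `rounds` times,
// waiting `intervalMs` between rounds, and hand each result to `onData`.
async function pollPage(page, extract, { intervalMs, rounds, onData }) {
  for (let i = 0; i < rounds; i++) {
    const data = await page.evaluate(extract);
    onData(data);
    if (i < rounds - 1) await sleep(intervalMs);
  }
}

// Sketch of the crawler using it; the task only resolves after the last round.
async function crawler(page, link) {
  await page.goto(link, { waitUntil: 'networkidle2' });
  await page.waitForSelector('#dbp');
  await pollPage(
    page,
    () => document.querySelector('#dbp').innerText, // placeholder extraction
    { intervalMs: 30000, rounds: 10, onData: (text) => console.log(text) }
  );
}
```

A task that runs this long will probably also need a larger `timeout` in the `Cluster.launch()` options, since puppeteer-cluster applies a per-task timeout (30 seconds by default).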

how to execute a script in every window that gets loaded in puppeteer?

I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't look to be possible to use through puppeteer.
Any idea? Thanks.
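browser.waitForTarget() searches for a target in all browser contexts.
An example of finding the target for a page opened
via window.open() or popups, and then running a script in it:
await page.evaluate(() => window.open('https://www.example.com/'))
// The predicate must return true for the target you are waiting for
const newWindowTarget = await browser.waitForTarget(target => target.url() === 'https://www.example.com/')
// Get the Page behind the target and evaluate the script there
const newWindowPage = await newWindowTarget.page()
await newWindowPage.evaluate(() => {
  runTheScriptYouLike()
  console.log('Hello StackOverflow!')
})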
via browser.pages() or tabs
This example evaluates a script in the second tab:
const pageTab2 = (await browser.pages())[1]
const runScriptOnTab2 = await pageTab2.evaluate(() => {
  runTheScriptYouLike()
  console.log('Hello StackOverflow!')
})
via page.frames() or iframes
An example of evaluating a script inside an iframe:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
  return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you
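Puppeteer does expose the DevTools method mentioned in the question, as `page.evaluateOnNewDocument()`. It runs a function in every new document of that page, including child frames, before the page's own scripts. The sketch below registers it on existing pages and on pages created later through the browser's `targetcreated` event. `instrumentAllPages` is a hypothetical helper name, and a popup's very first document may load before the script is registered, so treat this as best effort rather than a guarantee.

```javascript
// Hypothetical helper: register `script` on all current and future pages.
async function instrumentAllPages(browser, script) {
  const register = (page) => page.evaluateOnNewDocument(script);

  // Pages (tabs) that already exist
  for (const page of await browser.pages()) {
    await register(page);
  }

  // Tabs and popups opened later
  browser.on('targetcreated', async (target) => {
    if (target.type() !== 'page') return;
    try {
      const page = await target.page();
      if (page) await register(page);
    } catch (error) {
      console.error('Could not instrument new page:', error);
    }
  });
}

// Usage sketch (requires puppeteer; not invoked automatically).
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  await instrumentAllPages(browser, () => {
    window.__instrumented = true; // placeholder for runTheScriptYouLike()
  });
  const page = await browser.newPage();
  await page.goto('https://www.example.com/');
  console.log(await page.evaluate(() => window.__instrumented));
  await browser.close();
}
```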

get post title after Infinite scroll finished

I managed to show all the posts on a site that has a load_more button for fetching the next page, but something is missing.
I got this error:
e Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint (/Users/minghann/Documents/productnation_scraper/node_modules/puppeteer/lib/ExecutionContext.js:331:13)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
This doesn't happen if I don't load all the posts. It's hard to debug because I don't know which post is missing what. Full code below:
const browser = await puppeteer.launch({
  devtools: true
});
const page = await browser.newPage();
await page.goto("https://example.net");
await page.waitForSelector(".load_more_btn");
const load_more_exist = !!(await page.$(".load_more_btn"));
while (load_more_exist > 0) {
  await page.click(".load_more_btn");
}
const posts = await page.$$(".post");
let result = [];
for (const post of posts) {
  result = [
    ...result,
    {
      title: await post.$eval(".post_title a", e => e.innerText)
    }
  ];
}
console.log(result);
browser.close();
There are several ways to do this; the best is to combine the following two.
Look for Ajax
Wait for the request instead. Whenever you click Load More, the page makes a simple Ajax request to ?ajax-request=jnews. We can use .waitForRequest or .waitForResponse for this use case. The status code is only available on the response, so this example uses .waitForResponse:
await Promise.all([
  page.waitForResponse(response => response.url().includes('?ajax-request=jnews') && response.status() === 200),
  page.click(".load_more_btn")
])
Clean DOM and wait for new Element
Refer to these answers here and here.
Basically, you can remove the DOM elements you have already collected, so the next time you collect data there won't be any duplicates.
So, once you have removed all current elements, such as those matched by document.querySelectorAll('.jeg_post'), you can simply call page.waitForSelector('.jeg_post') again later if you need to.
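The first approach can be put into a loop. The sketch below clicks the button until it is gone or hidden, waiting for the Ajax response after each click. It assumes the `?ajax-request=jnews` URL fragment and the `.load_more_btn` selector from above. `loadAllPosts` and `maxClicks` are hypothetical names, and the click cap guards against looping forever.

```javascript
// Hypothetical helper: keep clicking "load more" while the button is visible.
async function loadAllPosts(page, { selector = '.load_more_btn', maxClicks = 50 } = {}) {
  let clicks = 0;
  while (clicks < maxClicks) {
    const button = await page.$(selector);
    // page.$ resolves to null when nothing matches;
    // boundingBox() resolves to null when the element is not visible
    if (!button || !(await button.boundingBox())) break;
    await Promise.all([
      page.waitForResponse(
        (response) =>
          response.url().includes('?ajax-request=jnews') && response.status() === 200
      ),
      button.click(),
    ]);
    clicks++;
  }
  return clicks;
}

// Usage (sketch): await loadAllPosts(page); const posts = await page.$$('.post');
```

Checking visibility before each click is meant to avoid the "Node is either not visible or not an HTMLElement" error from the question, which is thrown when the click target is hidden or detached.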

Is it possible to perform an action with `context` on the init of the app?

I'm simply looking for something like this
app.on('init', async context => {
...
})
Basically, I just need to make calls to the GitHub API, but I'm not sure there is a way to do it without using the API client inside the Context object.
I ended up using probot-scheduler
const createScheduler = require('probot-scheduler')
module.exports = app => {
  createScheduler(app, {
    delay: false
  })
  // use the `app` parameter; `robot` was never defined in this scope
  app.on('schedule.repository', context => {
    // this is called on startup and can access context
  })
}
I tried probot-scheduler but it didn't exist - perhaps removed in an update?
In any case, after a lot of digging I managed to do it using the actual app object: its .auth() method returns a promise that resolves to the GitHubAPI interface:
https://probot.github.io/api/latest/classes/application.html#auth
module.exports = app => {
  const router = app.route(); // assumption: the router comes from Probot's app.route()
  router.get('/hello-world', async (req, res) => {
    const github = await app.auth();
    const result = await github.repos.listForOrg({ org: 'org name' });
    console.log(result);
  })
}
.auth() takes the ID of the installation if you wish to access private data. If called without arguments, the client can only retrieve public data.
You can get the installation ID by calling .auth() without parameters, and then listInstallations():
const github = await app.auth();
const result = await github.apps.listInstallations();
console.log(result);
You get an array that includes the IDs, which you can pass to .auth().
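The two steps can be combined into one sketch: list the installations with the client returned by `app.auth()` called without arguments, then request a client for each installation ID. It uses only `app.auth()` and `github.apps.listInstallations()` as shown above, and assumes the installations array is on the response's `data` property, as with other Octokit responses. `forEachInstallation` is a hypothetical helper, and pagination of the installation list is ignored.

```javascript
// Hypothetical helper: call `callback` once per installation with an
// installation-scoped API client.
async function forEachInstallation(app, callback) {
  const appClient = await app.auth();
  // Assumption: the array of installations is on `data`
  const { data: installations } = await appClient.apps.listInstallations();
  for (const installation of installations) {
    const github = await app.auth(installation.id);
    await callback(github, installation);
  }
  return installations.length;
}

// Usage (sketch):
// module.exports = app => {
//   forEachInstallation(app, async (github, installation) => {
//     console.log('Authenticated for installation', installation.id);
//   });
// };
```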

Using Artoo.js with Google Puppeteer for Web Scraping

I can't seem to be able to use Artoo.js with Puppeteer.
I tried using it through npm install artoo-js, but it did not work.
I also tried injecting the build path distribution using the Puppeteer command page.injectFile(filePath), but I had no luck.
Was anyone able to implement these two libraries successfully?
If so, I would love a code snippet of how Artoo.js was injected.
I just tried Puppeteer for another answer, I figured I could try Artoo too, so here you go :)
(Step 0 : Install Yarn if you don't have it)
yarn init
yarn add puppeteer
# Download the latest artoo script; it is not a yarn dependency here because it won't be run by the Node.js runtime
wget https://medialab.github.io/artoo/public/dist/artoo-latest.min.js
Save this in index.js:
const puppeteer = require('puppeteer');
(async() => {
  const url = 'https://news.ycombinator.com/';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Go to URL and wait for page to load
  // ('networkidle' in pre-1.0 Puppeteer; later versions use 'networkidle0' or 'networkidle2')
  await page.goto(url, {waitUntil: 'networkidle2'});
  // Inject Artoo into page's JS context
  // (page.injectFile in early Puppeteer releases; later replaced by page.addScriptTag)
  await page.addScriptTag({path: 'artoo-latest.min.js'});
  // Sleeping 2s to let Artoo initialize (I don't have a more elegant solution right now)
  await new Promise(res => setTimeout(res, 2000))
  // Use Artoo from page's JS context
  const result = await page.evaluate(() => {
    return artoo.scrape('td.title:nth-child(3)', {
      title: {sel: 'a'},
      url: {sel: 'a', attr: 'href'}
    });
  });
  console.log(`Result has ${result.length} items, first one is:`, result[0]);
  await browser.close();
})();
Result:
$ node index.js
Result has 30 items, first one is: { title: 'Headless mode in Firefoxdeveloper.mozilla.org',
url: 'https://developer.mozilla.org/en-US/Firefox/Headless_mode' }
This is too funny to miss: right now the top article of HackerNews is about Firefox Headless...
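The fixed two-second sleep in the script above could be replaced by waiting until the `artoo` global actually exists. Here is a sketch using `page.waitForFunction()`, which polls a function in the page until it returns a truthy value. `waitForGlobal` is a hypothetical helper, and whether Artoo is fully initialized as soon as the global appears is an assumption.

```javascript
// Hypothetical helper: resolve once `window[name]` is defined in the page,
// or reject after `timeout` milliseconds.
async function waitForGlobal(page, name, timeout = 5000) {
  await page.waitForFunction(
    // Runs inside the page; `globalName` is passed in as the extra argument
    (globalName) => typeof window[globalName] !== 'undefined',
    { timeout },
    name
  );
}

// Usage (sketch), in place of the setTimeout-based sleep:
// await waitForGlobal(page, 'artoo');
```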