Using Artoo.js with Google Puppeteer for Web Scraping


I can't seem to be able to use Artoo.js with Puppeteer.
I tried using it through npm install artoo-js, but it did not work.
I also tried injecting the build path distribution using the Puppeteer command page.injectFile(filePath), but I had no luck.
Was anyone able to implement these two libraries successfully?
If so, I would love a code snippet of how Artoo.js was injected.

I just tried Puppeteer for another answer and figured I could try Artoo too, so here you go :)
(Step 0 : Install Yarn if you don't have it)
yarn init
yarn add puppeteer
# Download the latest artoo script; it is not added as a yarn dependency because it won't be used by the Node.js runtime
wget https://medialab.github.io/artoo/public/dist/artoo-latest.min.js
Save this in index.js:
const puppeteer = require('puppeteer');
(async() => {
const url = 'https://news.ycombinator.com/';
const browser = await puppeteer.launch();
const page = await browser.newPage();
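// Go to URL and wait for page to load
// (newer Puppeteer releases replaced 'networkidle' with 'networkidle0' / 'networkidle2')
await page.goto(url, {waitUntil: 'networkidle'});
// Inject Artoo into page's JS context
// (newer Puppeteer releases removed page.injectFile; page.addScriptTag({path: ...}) replaces it)
await page.injectFile('artoo-latest.min.js');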
// Sleeping 2s to let Artoo initialize (I don't have a more elegant solution right now)
await new Promise(res => setTimeout(res, 2000))
// Use Artoo from page's JS context
const result = await page.evaluate(() => {
return artoo.scrape('td.title:nth-child(3)', {
title: {sel: 'a'},
url: {sel: 'a', attr: 'href'}
});
});
console.log(`Result has ${result.length} items, first one is:`, result[0]);
await browser.close();
})();
Result:
$ node index.js
Result has 30 items, first one is: { title: 'Headless mode in Firefoxdeveloper.mozilla.org',
url: 'https://developer.mozilla.org/en-US/Firefox/Headless_mode' }
This is too funny to miss: right now the top article of HackerNews is about Firefox Headless...
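The fixed two-second sleep could arguably be replaced by polling until Artoo looks ready. The following is only a sketch, not the answerer's method: injectArtoo is a hypothetical helper name, it uses page.addScriptTag and page.waitForFunction from later Puppeteer releases, and it assumes that artoo.$ being a function is a good enough signal that Artoo has finished loading its jQuery.

```javascript
// Hypothetical helper: inject a local Artoo build into a Puppeteer page
// and wait until it looks initialized instead of sleeping for a fixed time.
async function injectArtoo(page, artooPath, timeoutMs = 10000) {
  // addScriptTag adds a <script> element whose content is read from a local file.
  await page.addScriptTag({ path: artooPath });
  // Assumption: once artoo.$ is a function, Artoo's jQuery is available.
  await page.waitForFunction(
    () => typeof artoo !== 'undefined' && typeof artoo.$ === 'function',
    { timeout: timeoutMs }
  );
}
```

In the script above, `await injectArtoo(page, 'artoo-latest.min.js');` would stand in for both the injectFile call and the sleep.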

Related

how to print response json from website using puppeteer?

I am trying to get the Google Translate website to do some work for me. The endpoint returns a blank web page with a JSON file. Using a web browser, I can save the JSON file and open it in a text editor.
I am trying to use puppeteer to get this done automatically. Here is my code:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless:false, args: ["--no-sandbox"]});
const page = await browser.newPage();
// Approach 1:
const response = await page.goto('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report');
let text = await response.text();
console.log(text);
let json = await response.json();
console.log(json);
await browser.close();
})();
When I run this code, the browser is launched, but the returned JSON file still gets saved to disk automatically instead of being printed to the console. Which Puppeteer class should I use for this task?

Since it is an API call and the expected result is JSON, you can use simple Node.js or jQuery code to return the response, as below.
$.get('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) =>
{
console.log(data);
});
But if you are particular about using Puppeteer and want to return the response, you would do the following:
Add a jQuery dependency to your project by running
npm install jquery
Import jQuery into the project.
Invoke the code below, without launching the browser.
$.get('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) =>
{
console.log(data);
});
Here is the link to JSfiddle code https://jsfiddle.net/faizmagic/0h6cm1o4/latest/
I hope this helps.
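Since the endpoint returns plain JSON, another option is to skip both the browser and jQuery and request it directly from Node.js. A minimal sketch, assuming Node.js 18 or later for the built-in fetch; buildTranslateUrl and fetchJson are hypothetical helper names, and the query parameters are copied from the URL in the question.

```javascript
// Build the request URL from the query parameters used in the question.
function buildTranslateUrl(text, from = 'en', to = 'zh') {
  const params = new URLSearchParams({ client: 'gtx', sl: from, tl: to, dt: 't', q: text });
  return `https://translate.googleapis.com/translate_a/single?${params}`;
}

// Fetch a URL and parse the body as JSON. `fetchImpl` is injectable for testing.
async function fetchJson(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (performs a real network request):
// fetchJson(buildTranslateUrl('Edit Report')).then((json) => console.log(json));
```

Keeping the URL construction separate from the request makes it easy to check the query string without touching the network.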

how to execute a script in every window that gets loaded in puppeteer?

I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't look to be possible to use through puppeteer.
Any idea? Thanks.

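browser.waitForTarget() searches for a target in all browser contexts.
An example of finding a target for a page opened
via window.open() or popups:
await page.evaluate(() => window.open('https://www.example.com/'))
// The predicate must return true for the target we are waiting for
const newWindowTarget = await browser.waitForTarget(target => target.url() === 'https://www.example.com/')
// Get the Page object for the new window and run the script in it
const newWindowPage = await newWindowTarget.page()
await newWindowPage.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})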
via browser.pages() or tabs
This snippet evaluates a script in the second tab:
const pageTab2 = (await browser.pages())[1]
const runScriptOnTab2 = await pageTab2.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})
via page.frames() or iframes
An example of evaluating a script inside an iframe element:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you
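For the original requirement (running a script in every new document without evaluating it by hand), Puppeteer exposes page.evaluateOnNewDocument, its counterpart to the protocol's Page.addScriptToEvaluateOnNewDocument. A rough sketch follows; injectIntoEveryPage is a hypothetical helper, and it assumes that registering the script from a targetcreated handler happens early enough, which may not hold for the very first document of a popup.

```javascript
// Hypothetical helper: register `script` on all current and future pages.
// `script` is a function or string that Puppeteer runs before any page script.
async function injectIntoEveryPage(browser, script) {
  // Pages (tabs) that already exist.
  for (const page of await browser.pages()) {
    await page.evaluateOnNewDocument(script);
  }
  // Tabs and popups created later (e.g. via window.open or target="_blank").
  browser.on('targetcreated', async (target) => {
    const page = await target.page(); // null for non-page targets
    if (page) {
      await page.evaluateOnNewDocument(script);
    }
  });
}
```

According to the Puppeteer documentation, a script registered this way is also invoked when a child frame is attached or navigated, which covers the iframe case.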

how to integrate lighthouse with testcafe?

I need to pass the connection argument while calling lighthouse
https://github.com/GoogleChrome/lighthouse/blob/master/lighthouse-core/index.js#L41
async function lighthouse(url, flags = {}, configJSON, connection) {
// verify the url is valid and that protocol is allowed
if (url && (!URL.isValid(url) || !URL.isProtocolAllowed(url))) {
throw new LHError(LHError.errors.INVALID_URL);
}
// set logging preferences, assume quiet
flags.logLevel = flags.logLevel || 'error';
log.setLevel(flags.logLevel);
const config = generateConfig(configJSON, flags);
connection = connection || new ChromeProtocol(flags.port, flags.hostname);
// kick off a lighthouse run
return Runner.run(connection, {url, config});
}
And my TestCafe tests look like this:
test('Run lighthouse', async t => {
lighthouse('https://www.youtube.com', {}, {}, ????)
})
I am unable to retrieve the connection to the Chrome instance that TestCafe opened, so that I can reuse it instead of spawning a new chromeRunner.

There is an npm library called testcafe-lighthouse which helps to audit web pages using TestCafe. It can also produce a detailed HTML report.
Install the plugin by:
$ yarn add -D testcafe-lighthouse
# or
$ npm install --save-dev testcafe-lighthouse
Audit with default threshold
import { testcafeLighthouseAudit } from 'testcafe-lighthouse';
fixture(`Audit Test`).page('http://localhost:3000/login');
test('user performs lighthouse audit', async (t) => {
const currentURL = await t.eval(() => document.documentURI);
await testcafeLighthouseAudit({
url: currentURL,
cdpPort: 9222,
});
});
Audit with custom threshold:
import { testcafeLighthouseAudit } from 'testcafe-lighthouse';
fixture(`Audit Test`).page('http://localhost:3000/login');
test('user page performance with specific thresholds', async (t) => {
const currentURL = await t.eval(() => document.documentURI);
await testcafeLighthouseAudit({
url: currentURL,
thresholds: {
performance: 50,
accessibility: 50,
'best-practices': 50,
seo: 50,
pwa: 50,
},
cdpPort: 9222,
});
});
You need to start the test as shown below:
# headless mode, preferable for CI
npx testcafe chrome:headless:cdpPort=9222 test.js
# non headless mode
npx testcafe chrome:emulation:cdpPort=9222 test.js
I hope it will help your automation journey.

I did something similar: I run Lighthouse against the Google Chrome instance that TestCafe launches on a specific port from the CLI
npm run testcafe -- chrome:headless:cdpPort=1234
Then I make the Lighthouse wrapper function take the port as an argument
import lighthouse from 'lighthouse';
export default async function lighthouseAudit(url, browser_port){
let result = await lighthouse(url, {
port: browser_port, // Google Chrome port Number
output: 'json',
logLevel: 'info',
});
return result;
};
Then you can simply run the audit like this:
test(`Generate Light House Result `, async t => {
const auditResult = await lighthouseAudit('https://www.youtube.com',1234);
});
Hopefully it helps.
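Once the audit result is available, its category scores can be compared against thresholds similar to the ones testcafe-lighthouse accepts. A small sketch, assuming the result's lhr.categories[id].score is a number between 0 and 1 (or null when a category could not be scored); failingCategories is a hypothetical helper, not part of either library.

```javascript
// Hypothetical helper: list the categories whose score is below its threshold.
// `lhr` is the Lighthouse result object; thresholds are on a 0-100 scale.
function failingCategories(lhr, thresholds) {
  const failures = [];
  for (const [id, minimum] of Object.entries(thresholds)) {
    const category = lhr.categories[id];
    // Treat a missing or unscored category as a score of 0.
    const score = category && typeof category.score === 'number'
      ? Math.round(category.score * 100)
      : 0;
    if (score < minimum) {
      failures.push({ id, score, minimum });
    }
  }
  return failures;
}
```

In the test above, this could be called as `failingCategories(auditResult.lhr, { performance: 50 })`, followed by an assertion that the returned array is empty.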

How to enable parallel tests with puppeteer?

I am using the Chrome Puppeteer library directly to run browser integration tests. I have a few tests written now in individual files. Is there a way to run them in parallel? What is the best way to achieve this?

To run puppeteer instances in parallel you can check out this library I wrote: puppeteer-cluster
It helps to run different puppeteer tasks in parallel in multiple browsers, contexts or pages and takes care of errors and browser crashes. Here is a minimal example:
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT, // use one incognito browser context per worker
maxConcurrency: 4, // cluster with four workers
});
// Define a task to be executed for your data
cluster.task(async ({ page, data: url }) => {
await page.goto(url);
const screen = await page.screenshot();
// ...
});
// Queue URLs
cluster.queue('http://www.google.com/');
cluster.queue('http://www.wikipedia.org/');
// ...
// Wait for cluster to idle and close it
await cluster.idle();
await cluster.close();
})();
You can also queue your functions directly like this:
const cluster = await Cluster.launch(...);
cluster.queue(async ({ page }) => {
await page.goto('http://www.wikipedia.org');
await page.screenshot({path: 'wikipedia.png'});
});
cluster.queue(async ({ page }) => {
await page.goto('https://www.google.com/');
const pageTitle = await page.evaluate(() => document.title);
// ...
});
cluster.queue(async ({ page }) => {
await page.goto('https://www.example.com/');
// ...
});

// My tests contain about 30 pages I want to test in parallel
const aBunchOfUrls = [
{
desc: 'Name of test #1',
url: SOME_URL,
},
{
desc: 'Name of test #2',
url: ANOTHER_URL,
},
// ... snip ...
];
const browserPromise = puppeteer.launch();
// These test pass! And rather quickly. Slowest link is the backend server.
// They're running concurrently, generating a new page within the same browser instance
describe('Generate about 20 parallel page tests', () => {
aBunchOfUrls.forEach((testObj, idx) => {
it.concurrent(testObj.desc, async () => {
const browser = await browserPromise;
const page = await browser.newPage();
await page.goto(testObj.url, { waitUntil: 'networkidle' });
await page.waitForSelector('#content');
// assert things..
});
});
});
from https://github.com/GoogleChrome/puppeteer/issues/474
written by https://github.com/quicksnap

How I achieved this was to create your suite(s) of tests in individual files, as you have done already. Then create a testSuiteRunner.js file (or whatever you wish to call it) and set it up as follows:
require('path/to/test/suite/1');
require('path/to/test/suite/2');
require('path/to/test/suite/3');
...
Import all of your suite(s) using require statements, like above, (no need to give them const variable names or anything like that) and then you can use node ./path/to/testSuiteRunner.js to execute all of your suites in parallel. Simplest solution I could come up with for this one!

I think the best idea would be to use a test runner like Jest that can manage that for you. At least that is how I do it. Please keep in mind that the machine might grind to a halt if you run too many Chrome instances at the same time, so it's safer to limit the number of parallel tests to 2 or so.
Unfortunately, the official documentation does not clearly describe how Jest parallelizes tests. You might find https://github.com/facebook/jest/issues/6957 useful.
Libraries such as puppeteer-cluster are great, but remember that what you want to parallelize is your tests, not Puppeteer's tasks.
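If the tests are driven from a plain Node.js script rather than a test runner, the number of simultaneously running tasks can also be capped by hand. A minimal sketch of such a limiter; runWithLimit is a hypothetical helper, and each task is assumed to be a function returning a promise (for example one that opens a page, runs its checks, and closes the page again).

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight.
// Resolves to the results in the same order as `tasks`.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  // Each worker repeatedly claims the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }
  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}
```

With Jest, the --maxWorkers option serves a similar purpose across test files.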

Generating HTML report in gulp using lighthouse

I am using gulp for a project and I added lighthouse to the gulp tasks like this:
gulp.task("lighthouse", function(){
return launchChromeAndRunLighthouse('http://localhost:3800', flags, perfConfig).then(results => {
console.log(results);
});
});
And this is my launchChromeAndRunLighthouse() function
function launchChromeAndRunLighthouse(url, flags = {}, config = null) {
return chromeLauncher.launch().then(chrome => {
flags.port = chrome.port;
return lighthouse(url, flags, config).then(results =>
chrome.kill().then(() => results));
});
}
It gives me the JSON output on the command line. I can post my JSON here and get the report.
Is there any way I can generate the HTML report using gulp?

The answer from @EMC is fine, but it requires multiple steps to generate the HTML from that point. However, you can use it like this (written in TypeScript, should be very similar in JavaScript):
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
Then call it:
await write(results, 'html', 'report.html');
UPDATE
There have been some changes to the lighthouse repo. I now enable programmatic HTML reports as follows:
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
const reportGenerator = await import(root('./node_modules/lighthouse/lighthouse-core/report/report-generator'));
// ...lighthouse setup
const raw = await lighthouse(url, flags, config);
await write(reportGenerator.generateReportHtml(raw.lhr), 'html', root('report.html'));
I know it's hacky, but it solves the problem :).

I've run into this issue too. I found somewhere in the GitHub issues that you can't use the html option programmatically, but Lighthouse does expose the report generator, so you can write simple file-write and open functions around it to get the same effect.
const ReportGenerator = require('../node_modules/lighthouse/lighthouse-core/report/v2/report-generator.js');

I do
let opts = {
chromeFlags: ['--show-paint-rects'],
output: 'html'
};
// ...
const lighthouseResults = await lighthouse(urlToTest, opts, null);
and later use
JSON.stringify(lighthouseResults.lhr)
to get the JSON, and
lighthouseResults.report.toString('UTF-8')
to get the HTML.
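Building on the output: 'html' approach, the returned report string can be written to disk inside the gulp task. A minimal sketch, assuming the result's report property holds the HTML as a string (or an array of strings when several output formats are requested); saveReport is a hypothetical helper.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: write the HTML report from a Lighthouse result to a file.
function saveReport(lighthouseResults, filePath) {
  const { report } = lighthouseResults;
  // Assumption: with several output formats, the HTML report is the first entry.
  const html = Array.isArray(report) ? report[0] : report;
  // Create the target directory if it does not exist yet.
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  fs.writeFileSync(filePath, html, 'utf8');
  return filePath;
}
```

In the gulp task from the question, this could be called in the .then(results => ...) callback, provided flags.output is set to 'html'.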

You can define the preconfig in the gulpfile as
const preconfig = {logLevel: 'info', output: 'html', onlyCategories: ['performance','accessibility','best-practices','seo'],port: (new URL(browser.wsEndpoint())).port};
The output option can be set to html, json, or csv. The preconfig is simply the set of Lighthouse flags that controls how the audit runs and what kind of report it returns. Note that browser here is assumed to be a Puppeteer browser instance, since the port is read from browser.wsEndpoint().
