Using Puppeteer how can I get DOMContentLoaded, Load time - puppeteer

Using Puppeteer how can I get DOMContentLoaded, Load time. It would be great if some once can explain how to access dev tools object, Network from Puppeteer.

Probably you are asking about window.performance.timing, here is a simple example how to get this data in Puppeteer:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://en.wikipedia.org');
const performanceTiming = JSON.parse(
await page.evaluate(() => JSON.stringify(window.performance.timing))
);
console.log(performanceTiming);
await browser.close();
})();
But results are quite raw and not meaningful. You should calculate the difference between each value and navigationStart, here is a full example of how to do it (code comes from this article):
const puppeteer = require('puppeteer');
const extractDataFromPerformanceTiming = (timing, ...dataNames) => {
const navigationStart = timing.navigationStart;
const extractedData = {};
dataNames.forEach(name => {
extractedData[name] = timing[name] - navigationStart;
});
return extractedData;
};
async function testPage(page) {
await page.goto('https://en.wikipedia.org');
const performanceTiming = JSON.parse(
await page.evaluate(() => JSON.stringify(window.performance.timing))
);
return extractDataFromPerformanceTiming(
performanceTiming,
'domContentLoadedEventEnd',
'loadEventEnd'
);
}
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
console.log(await testPage(page));
await browser.close();
})();

Related

pupeteer function not returning array

Hi Guys can you please point my mistake on this code?
console.log(urls) is printing undefined.
Thanks in advance.
const puppeteer = require('puppeteer');
async function GetUrls() {
const browser = await puppeteer.launch( { headless: false,
executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe' })
const page = await browser.newPage();
await page.goto("https://some page");
await page.waitForSelector('a.review.exclick');
let urls = await page.evaluate(() => {
let results = [];
let items = document.querySelectorAll('a.review.exclick');
items.forEach((item) => {
results.push({
url: item.getAttribute('href'),
});
});
return results;
browser.close();
});
}
(async () => {
let URLS = await GetUrls();
console.log(URLS);
process.exit(1);
})();
Here is a list:
you don't have a return statement in your GetUrls() function
you close the browser after a return statement AND inside the page.evaluate() method
Keep in mind that anything that is executed within the page.evaluate() will relate to the browser context. To quickly test this, add a console.log("test") before let results = []; and you will notice that nothing appears in your Node.js console, it will appear in your browser console instead.
Therefore, the browser variable is visible within the GetUrls() function but NOT visible within the page.evaluate() method.
Here is the corrected code sample:
const puppeteer = require('puppeteer');
async function GetUrls() {
const browser = await puppeteer.launch({
headless: false,
executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'
})
const page = await browser.newPage();
await page.goto("https://some page");
await page.waitForSelector('a.review.exclick');
let urls = await page.evaluate(() => {
let results = [];
let items = document.querySelectorAll('a.review.exclick');
items.forEach((item) => {
results.push({
url: item.getAttribute('href'),
});
});
return results;
});
await browser.close();
return urls;
}
(async () => {
let URLS = await GetUrls();
console.log(URLS);
process.exit(1);
})();

How to iterate through a supermarket website and getting the product name and prices?

Im trying to obtain all the product name and prices from all the categories from a supermarket website, all the tutorials that i have found do it just for one const url, i need to iterate through all of them. So far i have got this
const puppeteer = require('puppeteer');
async function scrapeProduct(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const [el2] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/h1/div');
const text2 = await el2.getProperty('textContent');
const name = await text2.jsonValue();
const [el] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/div[2]/div[1]/div[2]/p[1]/em[2]/strong/text()');
const text = await el.getProperty('textContent');
const price = await text.jsonValue();
console.log({name,price});
await browser.close();
}
scrapeProduct('https://www.jumbo.com.ar/gaseosa-sprite-sin-azucar-lima-limon-1-25-lt/p');
which works just for one. Im using nodejs and puppeteer. How can i achieve this?
You can try for...of loop, using a single browser instance and a single page so that the scraper might not overload the server:
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
const urls = [
'https://www.jumbo.com.ar/gaseosa-sprite-sin-azucar-lima-limon-1-25-lt/p',
// ...
];
for (const url of urls) {
await page.goto(url);
const [el2] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/h1/div');
const text2 = await el2.getProperty('textContent');
const name = await text2.jsonValue();
const [el] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/div[2]/div[1]/div[2]/p[1]/em[2]/strong/text()');
const text = await el.getProperty('textContent');
const price = await text.jsonValue();
console.log({name,price});
}
await browser.close();
} catch (err) {
console.error(err);
}
})();
You can use an array of urls and forEach:
const puppeteer = require('puppeteer');
const urls = [ 'https://www.jumbo.com.ar/gaseosa-sprite-sin-azucar-lima-limon-1-25-lt/p' ];
urls.forEach(scrapeProduct);
async function scrapeProduct(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const [el2] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/h1/div');
const text2 = await el2.getProperty('textContent');
const name = await text2.jsonValue();
const [el] = await page.$x('//*[#id="product-nonfood-page"]/main/div/div/div[1]/div[1]/div/div[2]/div[2]/div[1]/div[2]/p[1]/em[2]/strong/text()');
const text = await el.getProperty('textContent');
const price = await text.jsonValue();
console.log({name,price});
await browser.close();
}

How to get text from xPath in Puppeteer node js

I need to get a text from the span tag and to verify whether the text equals to "check".
How can I achieve this in puppeteer?
Below is the example of the code I've written, if anyone could put me help me figure this out, please.
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
headless: false,
// "slowMo": 50,
args: ["--start-fullscreen"],
defaultViewport: null,
});
//Page
const page2 = await browser.newPage();
await page2.goto("https://www.flipkart.com");
await page2.waitFor(2000);
await page2.$x("//input[#class='_2zrpKA _1dBPDZ']").then(async (ele) => {
await ele[0].type(username);
});
await page2.waitFor(2000);
await page2.$x("//input[#type='password']").then(async (ele) => {
await ele[0].type(password);
});
await page2.waitFor(2000);
await page2
.$x("//button[#class='_2AkmmA _1LctnI _7UHT_c']")
.then(async (ele) => {
await ele[0].click();
});
await page2.waitFor(2000);
await page2.$x("//input[#class='LM6RPg']").then(async (ele) => {
await ele[0].type("iPhone 11");
});
await page2.waitFor(2000);
await page2.$x("//button[#class='vh79eN']").then(async (ele) => {
await ele[0].click();
});
await page2.waitFor(2000);
await page2.$x("//div[#class='col col-7-12']/div").then(async (ele) => {
await ele[0].click();
});
await page2.waitFor(2000);
let [element] = await page2.$x('//span[#class="_2aK_gu"]');
let text = await page2.evaluate((element) => element.textContent, element);
if (text.includes("Check")) {
console.log("Check Present");
}
if (text.includes("Change")) {
console.log("Change Present");
}
})();
//get the xpath of the webelement
const [getXpath] = await page.$x('//div[]');
//get the text using innerText from that webelement
const getMsg = await page.evaluate(name => name.innerText, getXpath);
//Log the message on screen
console.log(getMsg)
Here is the complete code for getting div or any html element data using xpath....
const puppeteer = require("puppeteer");
async function scrape () {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto("https://twitter.com/elonmusk", {waitUntil: "networkidle2"})
await page.waitForXPath('/html/body/div[1]/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[1]/div/div/div[1]/div[1]/div/div[1]/a/div/div[1]/span/span');
let [el] = await page.$x('/html/body/div[1]/div/div/div[2]/main/div/div/div/div/div/div[2]/div/div/section/div/div/div[1]/div/div/article/div/div/div/div[2]/div[2]/div[1]/div/div/div[1]/div[1]/div/div[1]/a/div/div[1]/span/span');
// console.log()
const names = await page.evaluate(name => name.innerText, el);
console.log(names);
await browser.close();
};
scrape();
You can get the text form the selected element like this:
await page.goto(url, {waitUntil: "networkidle2"});
await page.waitForXPath('//span[#class="_2aK_gu"]');
//assuming it's the first element
let [element] = await page.$x('//span[#class="_2aK_gu"]');
let text = await page.evaluate(element => element.textContent, element);
Note that page.$x returns an array of ElementHandles, so the code here assumes it's the first element. I'd suggest you chose a more specific XPath than a class as many elements may have it.
For the condition:
if (text.includes("Check"))
//do this
else if (text.includes("Change"))
//do that

Puppeteer wait until Cloudfare redirect is done

I would like to login on a site, which is using Cloudfare DDOS protection like this:
The code is simple:
const puppeteer = require('puppeteer');
const C = require('./constants');
const USERNAME_SELECTOR = 'input[name="username"]';
const PASSWORD_SELECTOR = 'input[name="password"]';
const CTA_SELECTOR = '.button';
var cloudscraper = require('cloudscraper');
async function startBrowser() {
const browser = await puppeteer.launch({
headless: true,
slowMo: 10000,
});
const page = await browser.newPage();
return {browser, page};
}
async function closeBrowser(browser) {
return browser.close();
}
async function playTest(url) {
const {browser, page} = await startBrowser();
page.setViewport({width: 1366, height: 768});
await page.goto(url, {waituntil: 'domcontentloaded'});
await page.screenshot({path: 'debug.png'});
await page.click(USERNAME_SELECTOR);
await page.keyboard.type(C.username);
await page.click(PASSWORD_SELECTOR);
await page.keyboard.type(C.password);
await page.click(CTA_SELECTOR);
await page.waitForNavigation();
await page.screenshot({path: 'ipt.png'});
}
(async () => {
await playTest("https://xy.com/login.php");
process.exit(1);
})();
When I check debug.png, I see Cloudfare DDOS protection page only. I don't really understand why, I added slowMo 10sec to wait with the execution.
You can add a simple waitForSelector to wait until the username selector appears,
await page.waitForSelector(USERNAME_SELECTOR);
await page.click(USERNAME_SELECTOR);

Load dynamic content with puppeteer?

I'm using this code to get the page data. It works but I get the data just one time.
The problem is that this data is updated lets say every second. I want to get it without reloading the page.
This is a simple example of what I want - http://novinite.win/clock.php
Is there a way to refresh the result without reloading the web page?
const puppeteer = require('puppeteer');
(async () => {
const url = process.argv[2];
const browser = await puppeteer.launch({
args: ['--no-sandbox']
})
const page = await browser.newPage();
page.on('request', (request) => {
console.log(`Intercepting: ${request.method} ${request.url}`);
request.continue();
});
await page.goto(url, {waitUntil: 'load'});
const html = await page.content();
console.log(html);
browser.close();
})();
You can await a Promise that logs the current textContent every 1000 ms (1 second) with setInterval and resolves after a set number of intervals (for example, 10 intervals):
'use strict';
const puppeteer = require( 'puppeteer' );
( async () =>
{
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto( 'http://novinite.win/clock.php' );
await new Promise( ( resolve, reject ) =>
{
let i = 0;
const interval = setInterval( async () =>
{
console.log( await page.evaluate( () => document.getElementById( 'txt' ).textContent ) );
if ( ++i === 10 )
{
clearInterval( interval );
resolve( await browser.close() );
}
}, 1000 );
});
})();
Example Result:
19:24:15
19:24:16
19:24:17
19:24:18
19:24:19
19:24:20
19:24:21
19:24:22
19:24:23
19:24:24