I have recently been learning Puppeteer from its docs and am trying to scrape some information.
First approach
First, I collect a list of URLs from the main page. Second, I open a new tab, visit each URL in turn, and collect some data. As soon as I enter the loop, the new tab does not behave as I expect: it freezes without returning any data, and eventually I get the error TimeoutError: Navigation timeout of 30000 ms exceeded. Is there a better approach?
(async () => {
const browser = await puppeteer.launch({ headless: true });
const mainpage = await browser.newPage();
console.log('goto main page'.green);
await mainpage.goto(mainURL);
console.log('collecting some url'.green);
const URLS = await mainpage.evaluate(() =>
Array.from(
document.querySelectorAll('.result-actions a'),
(element) => element.href
)
);
if (typeof URLS[0] === 'string') console.log('OK'.green);
console.log('collecting finished'.green);
const newTab= await browser.newPage();
console.log('create new tab'.green);
var data = [];
for (let i = 0, n = URLS.length; i < n; i++) {
//console.log(URLS[i]);
// use this new tab to collect some data then close this tab
// continue this process
await newTab.waitForNavigation();
await newTab.goto(URLS[i]);
await newTab.waitForSelector('.profile-phone-column span a');
console.log('Go each url using new tab'.green);
// collecting data
data.push(collected_data);
// close this tab
await collectNamePage.close();
console.log(data);
}
await mainpage.close();
await browser.close();
console.log('closing browser'.green);
})();
Second approach
This time I wanted to skip the step of opening each URL in a new tab. So I collected the links with page.$$(), iterated over them with for...of, and tried to extract the data with elementHandle.$(selector), but this approach also failed.
I am getting frustrated. Am I doing this the wrong way, or have I misunderstood the documentation?
In your script, you do not need newTab.waitForNavigation() at all. It is usually used when the navigation is triggered by some event, such as a click. When you just call .goto(), Puppeteer waits for the page to load automatically.
Even when you do need waitForNavigation(), you usually should not await it before the navigation has been triggered; otherwise you just get the timeout. Instead, await it together with the action that triggers the navigation:
await Promise.all([element.click(), page.waitForNavigation()]);
So try simply deleting await newTab.waitForNavigation();.
Also, do not close the new tab inside the loop; close it after the loop.
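For the cases where a click really does trigger a navigation, the Promise.all pattern above can be wrapped in a small helper. This is only a sketch: clickAndWaitForNavigation is a hypothetical name, not part of Puppeteer, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper (not part of Puppeteer): clicks a selector and waits
// for the navigation that the click triggers.
async function clickAndWaitForNavigation(page, selector) {
  // Both calls are started together, so the wait is already registered
  // by the time the click causes the navigation.
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  // Whatever waitForNavigation() resolved with (the navigation response, or null).
  return response;
}
```

Usage would look like await clickAndWaitForNavigation(page, 'a.next'), where 'a.next' stands in for whatever link triggers the navigation.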
Edited script:
const puppeteer = require('puppeteer');
const mainURL = 'https://www.psychologytoday.com/us/therapists/illinois/';
(async () => {
const browser = await puppeteer.launch({ headless: false });
const mainpage = await browser.newPage();
console.log('goto main page');
await mainpage.goto(mainURL);
console.log('collecting urls');
const URLS = await mainpage.evaluate(() =>
Array.from(
document.querySelectorAll('.result-actions a'),
(element) => element.href
)
);
if (typeof URLS[0] === 'string') console.log('OK');
console.log('collection finished');
const collectNamePage = await browser.newPage();
console.log('create new tab');
var data = [];
for (let i = 0, totalUrls = URLS.length; i < totalUrls; i++) {
console.log(URLS[i]);
await collectNamePage.goto(URLS[i]);
await collectNamePage.waitForSelector('.profile-phone-column span a');
console.log('create new tab and go there');
// collecting data
const [name, phone] = await collectNamePage.evaluate(
() => [
document.querySelector('.profile-middle .name-title-column h1').innerText,
document.querySelector('.profile-phone-column span a').innerText
]
);
data.push({ name, phone });
}
console.log(data);
await collectNamePage.close();
await mainpage.close();
await browser.close();
console.log('closing browser');
})();
Related
I have code that logs in to a page, navigates to a list of messages, gets the first message, and deletes it. I need to get the whole list of messages and delete each one in turn, but when I try to do that I run into problems.
The site is rendered as plain html until the delete button is clicked. At this point, an iframe opens with the delete confirmation inside of it. If the confirmation is clicked, it returns me to the list of messages.
This is working until the iframe pops up. The existing code doesn't find the selector in the iframe. The code does work when it is not in a loop, though. So how can I interact with the iframe in the loop?
TimeoutError: Waiting for selector .navigation-footer button failed: Waiting failed: 30000ms exceeded
const messageList = await page.$$(".message-list tr");
for (message of messageList) {
//get first message
await page.click(".message-list tr");
//wait for the message to load
await page.waitForSelector(".circle-cross");
//get time and message text
const msgTime = await page.$eval("time", el => el.getAttribute("dateTime"));
const paragraphs = await page.evaluate(() => {
let paraElements = document.querySelectorAll(".bubble p");
//array literal
const paraList = [...paraElements];
//gets the innerText of each element
return paraList.map((el, index) => el.innerText);
});
//get author name
await page.waitForSelector(".user p a")
let authorLink = await page.$(".user p a")
let authorName = await authorLink.evaluate(el => el.innerText.trim());
//append message to messages.txt
const stream = fs.createWriteStream("messages.txt", { flags: 'a' });
stream.write(authorName + "\n");
stream.write(msgTime + "\n");
paragraphs.forEach((item, index) => {
stream.write(item + "\n");
});
stream.end();
//delete the message
await page.click(".circle-cross");
//handle the iframe verification
const elementHandle = await page.waitForSelector("iframe.fancybox-iframe");
const frame = await elementHandle.contentFrame();
const button = await frame.waitForSelector(".navigation-footer button");
await frame.click(".navigation-footer button");
}
UPDATE: I did get this working, eventually, by substituting in:
await button.evaluate(el => el.click());
instead of
await frame.click(".navigation-footer button")
Simply put the code you have inside a forEach loop; you will, however, have to add async on this line:
messageList.forEach(async (el) => {
Your end result should then be:
//get a list of messages
const messageList = await page.$$(".message-list tr");
messageList.forEach(async (el) => {
//get the first message
await page.click(".message-list tr");
//wait for the message to load
await page.waitForSelector(".circle-cross");
//delete the message
await page.click(".circle-cross");
//handle the iframe verification
const elementHandle = await page.waitForSelector("iframe");
const frame = await elementHandle.contentFrame();
await frame.waitForSelector(".navigation-footer button");
await frame.click(".navigation-footer button");
});
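One thing to keep in mind with the forEach version: Array.prototype.forEach does not wait for the promises returned by an async callback, so all iterations start at once and the code after the loop continues before they finish. If the messages have to be handled strictly one after another, a plain for...of loop with await is one option. A minimal sketch, where processSequentially, handler, and sleep are hypothetical names used only for illustration:

```javascript
// Small delay helper used by the demo below.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs an async handler over the items one at a time, in order.
async function processSequentially(items, handler) {
  const results = [];
  for (const item of items) {
    // Each iteration finishes before the next one starts.
    results.push(await handler(item));
  }
  return results;
}

// Demo with a fake 10 ms task per item, standing in for click/wait/delete.
processSequentially(['a', 'b'], async (item) => {
  await sleep(10);
  return item.toUpperCase();
}).then((results) => console.log(results));
```

In the message-deleting case, the handler would contain the click, waitForSelector, and iframe steps from the loop body.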
Is it possible to calculate the TTFB with Puppeteer?
I couldn't find anything in their docs.
I currently have this code:
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
const response = await page.goto(url);
const { status } = response;
This might do what you want:
let start = new Date()
page.once('response', () => console.log(new Date() - start))
await page.goto(url)
This is how I solved it:
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
await page.goto(url);
const navigationTimingJson = await page.evaluate(() =>
JSON.stringify(performance.getEntriesByType("navigation"))
);
const [navigationTiming] = JSON.parse(navigationTimingJson)
const TTFB = navigationTiming.responseStart - navigationTiming.requestStart;
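If you need this in more than one place, the same calculation can be pulled into a helper. This is a sketch under the same definition as above (responseStart minus requestStart); ttfbFromNavigationEntry and measureTtfb are hypothetical names, and page is assumed to be a Puppeteer Page that has already navigated.

```javascript
// Hypothetical helper: computes TTFB in milliseconds from one navigation
// timing entry, as responseStart - requestStart.
function ttfbFromNavigationEntry(entry) {
  if (
    !entry ||
    typeof entry.responseStart !== 'number' ||
    typeof entry.requestStart !== 'number'
  ) {
    return null; // no usable entry
  }
  return entry.responseStart - entry.requestStart;
}

// Reads the first navigation entry from the page and returns its TTFB.
async function measureTtfb(page) {
  const json = await page.evaluate(() =>
    JSON.stringify(performance.getEntriesByType('navigation'))
  );
  const [entry] = JSON.parse(json);
  return ttfbFromNavigationEntry(entry);
}
```

Call it after navigation, for example: await page.goto(url); const ttfb = await measureTtfb(page);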
You can try using page.metrics()
More info here: https://pptr.dev/#?product=Puppeteer&version=v13.1.3&show=api-pagemetrics
I am trying to scrape a website, but I cannot get some of the elements because they are created dynamically.
I am using cheerio in Node.js; my code is below.
var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
request(url, function (err, res, html) {
var $ = cheerio.load(html);
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
});
This code returns an empty result because, when the page is first loaded, the <ul id="store_list" class="listMain"> is empty.
The content has not been appended yet.
How can I get these elements using node.js? How can I scrape pages with dynamic content?
Here you go;
var phantom = require('phantom');
phantom.create(function (ph) {
ph.createPage(function (page) {
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
page.open(url, function() {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
page.evaluate(function() {
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
}, function(){
ph.exit()
});
});
});
});
});
Check out GoogleChrome/puppeteer
Headless Chrome Node API
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.npmjs.com/');
const textContent = await page.evaluate(() => {
return document.querySelector('.npm-expansions').textContent
});
console.log(textContent); /* No Problem Mate */
await browser.close();
})();
page.evaluate() lets you inspect the dynamically created element because the callback runs inside the page itself, where the rendered DOM is available.
Use the new npm module x-ray with its pluggable web driver, x-ray-phantom.
There are examples on the pages above, but here's how to do dynamic scraping:
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');
var x = Xray()
.driver(phantom());
x('http://google.com', 'title')(function(err, str) {
if (err) return console.error(err);
console.log(str); // the page title
})
Answering this as a canonical: an alternative to Puppeteer for scraping dynamic sites, also well supported as of 2023, is Playwright. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1
let browser;
(async () => {
browser = await playwright.chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
const text = await page.locator('h1:text("Example")').textContent();
console.log(text); // => Example Domain
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.
Just change the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.
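As an alternative to editing files inside the installed package, Puppeteer's Page also exposes setDefaultNavigationTimeout() and setDefaultTimeout(), which change the timeouts per page from your own script. A minimal sketch; applyTimeouts is a hypothetical helper name, and the value passed in is up to you:

```javascript
// Hypothetical helper: raises the timeouts of one page through Puppeteer's
// public API, so nothing inside node_modules has to be edited.
function applyTimeouts(page, timeoutMs) {
  // Default for navigation methods such as goto() and waitForNavigation().
  page.setDefaultNavigationTimeout(timeoutMs);
  // Default for other waiting methods such as waitForSelector().
  page.setDefaultTimeout(timeoutMs);
  return page;
}
```

For example, applyTimeouts(await browser.newPage(), 120000) would give that page two minutes before a TimeoutError is thrown. A single call can also be overridden with its own option, as in page.goto(url, { timeout: 120000 }).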
I am trying to write a program that scans multiple URLs at the same time (parallelization). I have extracted the sitemap and stored it as an array in a variable, as shown below, but I am unable to open the URLs with Puppeteer. I am getting the error below:
originalMessage: 'Cannot navigate to invalid URL'
My code is below. Can someone please help me out?
const sitemapper = require('@mastixmc/sitemapper');
const SitemapXMLParser = require('sitemap-xml-parser');
const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml';
/*If sitemapindex (link of xml or gz file) is written in sitemap, the URL will be accessed.
You can optionally specify the number of concurrent accesses and the number of milliseconds after processing and access to resume processing after a delay.
*/
const options = {
delay: 3000,
limit: 50000
};
const sitemapXMLParser = new SitemapXMLParser(url, options);
sitemapXMLParser.fetch().then(result => {
var locs = result.map(value => value.loc)
var locsFiltered = locs.toString().replace("[",'<br>');
const urls = locsFiltered
console.log(locsFiltered)
const puppeteer = require("puppeteer");
async function scrapeProduct(url) {
const urls = locsFiltered
const browser = await puppeteer.launch({
headless: false
});
for (i = 0; i < urls.length; i++) {
const page = await browser.newPage();
const url = urls[i];
const promise = page.waitForNavigation({
waitUntil: "networkidle2"
});
await page.goto(`${url}`);
}};
scrapeProduct();
});
You see the invalid URL error because you converted the array into a URL string the wrong way.
This line is better:
// var locsFiltered = locs.toString().replace("[",'<br>') // This is wrong
// const urls = locsFiltered // So value is invalid
// console.log(locsFiltered)
const urls = locs.map(value => value[0]) // This is better
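If you want to be defensive about what ends up in the list, the URLs can also be validated before they are handed to page.goto(). A sketch only: toUrlList is a hypothetical helper, and it assumes each loc value is either a string or an array whose first item is the URL, which matches the value[0] access above.

```javascript
// Hypothetical helper: turns sitemap `loc` values into a flat list of valid
// absolute URL strings, skipping anything that cannot be parsed.
function toUrlList(locs) {
  const urls = [];
  for (const loc of locs) {
    // Accept both ['https://...'] and 'https://...'.
    const candidate = Array.isArray(loc) ? loc[0] : loc;
    if (typeof candidate !== 'string') continue;
    try {
      // new URL() throws on strings that are not absolute URLs.
      urls.push(new URL(candidate.trim()).href);
    } catch {
      // Skip invalid entries instead of passing them to page.goto().
    }
  }
  return urls;
}
```

With this, const urls = toUrlList(locs) would drop entries that would otherwise trigger "Cannot navigate to invalid URL".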
To scrape the CNN pages, I've also added puppeteer-cluster for speed:
const { Cluster } = require('puppeteer-cluster')
const sitemapper = require('@mastixmc/sitemapper')
const SitemapXMLParser = require('sitemap-xml-parser')
const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml'
async function scrapeProduct(locs) {
const urls = locs.map(value => value[0])
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 2, // You can set this to any number you like
puppeteerOptions: {
headless: false,
devtools: false,
args: [],
}
})
await cluster.task(async ({ page, data: url }) => {
await page.goto(url, {timeout: 0, waitUntil: 'networkidle2'})
const screen = await page.screenshot()
// Store screenshot, do something else
})
for (i = 0; i < urls.length; i++) {
console.log(urls[i])
await cluster.queue(urls[i])
}
await cluster.idle()
await cluster.close()
}
/******
If sitemapindex (link of xml or gz file) is written in sitemap, the URL will be accessed.
You can optionally specify the number of concurrent accesses and the number of milliseconds after processing and access to resume processing after a delay.
*******/
const options = {
delay: 3000,
limit: 50000
}
const sitemapXMLParser = new SitemapXMLParser(url, options)
sitemapXMLParser.fetch().then(async result => {
var locs = result.map(value => value.loc)
await scrapeProduct(locs)
})
In the following example, how do I wait for the pop-up window to finish loading?
After clicking the Google icon, you get a pop-up window for logging in to Gmail. When I try to interact with the second page, it is undefined (as I don't know how to wait for it to fully load).
Any advice?
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: false});
page = await browser.newPage();
await page.goto("https://www.example.com/signin");
await page.waitForSelector(".Icon-google");
await page.click(".Icon-google");
const pages = await browser.pages();
console.log(pages[2].url());
})();
You can wait for a new target to be created.
const browser = await puppeteer.launch({headless: false});
page = await browser.newPage();
await page.goto("https://app.testim.io/#/signin");
await page.waitForSelector(".Icon-google");
const nav = new Promise(res => browser.on('targetcreated', res))
await page.click(".Icon-google");
await nav
const pages = await browser.pages();
console.log(pages.length);//number of pages increases !
console.log(pages.map(page => page.url()));
P.S. first I tried page.waitForNavigation() but it didn't work, probably because it's a popup.
const [newPage] = await Promise.all([
new Promise((resolve) => page.once('popup', resolve)),
page.click('something.that-will-open-the-popup')
]);
await newPage.waitForSelector('.page-is-loaded')
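The popup pattern above can be wrapped in a reusable helper. This is a sketch, not an official API: clickAndGetPopup is a hypothetical name, and page is assumed to be a Puppeteer Page, which emits a 'popup' event when it opens a new window.

```javascript
// Hypothetical helper: clicks a selector and resolves with the popup Page
// that the click opens.
async function clickAndGetPopup(page, selector) {
  const [popup] = await Promise.all([
    // The listener is registered before the click runs, so the event
    // cannot be missed.
    new Promise((resolve) => page.once('popup', resolve)),
    page.click(selector),
  ]);
  return popup;
}
```

After const popup = await clickAndGetPopup(page, '.Icon-google'), you can wait on the popup itself, for example with popup.waitForSelector(...), before interacting with it.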