Puppeteer web scraper

I'm having trouble replacing hyperlinks with regular text on a website that I scraped with Puppeteer.
I'm able to get the HTML file, but when it renders, the stylesheet doesn't work and the hyperlinks lead to errors.
Can anybody help me with this, please?
const puppeteer = require('puppeteer');
const fs = require('fs');

const scrapeWikipedia = async () => {
  const browser = await puppeteer.launch({
    headless: false
    // defaultViewport: null
  });
  const page = await browser.newPage();

  const wikiUrl = 'https://en.wikipedia.org/wiki/Groundhog_Day_(film)';
  await page.goto(wikiUrl, { waitUntil: 'networkidle2' });

  // replace javascript: hyperlinks
  await page.evaluate(_ => {
    document.querySelectorAll('a[href^="javascript"]')
      .forEach(a => {
        a.href = '#';
      });
  });

  // get the HTML from the page and write it to a file
  const html = await page.content();
  fs.writeFileSync('./outputHTML/index.html', html);

  await browser.close();
};

scrapeWikipedia();
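A minimal sketch of one way to address both symptoms, assuming the page object from the script above: replace every anchor with its plain text so the links can't lead anywhere, and inject a <base> tag so the saved page's relative stylesheet and asset URLs still resolve against the live site. This is an illustrative sketch, not a confirmed fix:
// Hedged sketch: swap each <a> for a plain text node, then add a <base>
// element so relative URLs (stylesheets, images) keep resolving.
await page.evaluate(() => {
  document.querySelectorAll('a').forEach(a => {
    a.replaceWith(document.createTextNode(a.textContent));
  });
  const base = document.createElement('base');
  base.href = location.origin + '/';
  document.head.prepend(base);
});
Run this in place of the href rewrite above, before page.content() is called.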

Related

How to measure TTFB with Puppeteer?

Is it possible to calculate the TTFB with Puppeteer?
I couldn't find anything in their docs.
I currently have this code:
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
const response = await page.goto(url);
const { status } = response;
This might do what you want:
let start = new Date()
page.once('response', () => console.log(new Date() - start))
await page.goto(url)
This is how I solved it:
const browser = await puppeteer.launch(launchOptions);
const page = await browser.newPage();
await page.goto(url);
const navigationTimingJson = await page.evaluate(() =>
  JSON.stringify(performance.getEntriesByType("navigation"))
);
const [navigationTiming] = JSON.parse(navigationTimingJson);
const TTFB = navigationTiming.responseStart - navigationTiming.requestStart;
You can try using page.metrics()
More info here: https://pptr.dev/#?product=Puppeteer&version=v13.1.3&show=api-pagemetrics
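For completeness, a minimal sketch of page.metrics(); note that it reports DevTools-level counters and durations (layout, script time, heap size), not TTFB itself:
// Returns an object such as { Timestamp, Documents, Frames, Nodes,
// LayoutDuration, ScriptDuration, TaskDuration, JSHeapUsedSize, ... }
const metrics = await page.metrics();
console.log(metrics.TaskDuration, metrics.ScriptDuration);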

How to scrape a web page that has an app-root element? [duplicate]

I am trying to scrape a website, but I don't get some of the elements, because these elements are dynamically created.
I use cheerio in Node.js and my code is below.
var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
  var $ = cheerio.load(html);
  $('.listMain > li').each(function () {
    console.log($(this).find('a').attr('href'));
  });
});
This code returns an empty result, because when the page is first loaded, the <ul id="store_list" class="listMain"> is empty;
the content has not been appended yet.
How can I get these elements using node.js? How can I scrape pages with dynamic content?
Here you go:
var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          $('.listMain > li').each(function () {
            console.log($(this).find('a').attr('href'));
          });
        }, function() {
          ph.exit();
        });
      });
    });
  });
});
Check out GoogleChrome/puppeteer
Headless Chrome Node API
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('.npm-expansions').textContent;
  });

  console.log(textContent); /* No Problem Mate */
  browser.close();
})();
evaluate allows inspection of dynamic elements, as it runs scripts in the page's own context.
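One hedged addition to the example above: if the element is created after the initial load, wait for it before reading it. page.waitForSelector (shown here with the same selector as above) covers that:
// Wait until the dynamically created element exists in the DOM, then read it;
// waitForSelector throws after a timeout if the element never appears.
await page.waitForSelector('.npm-expansions');
const headline = await page.$eval('.npm-expansions', el => el.textContent);
console.log(headline);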
Use the new npm module x-ray, with its pluggable web driver x-ray-phantom.
Examples are in the pages linked above, but here's how to do dynamic scraping:
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray()
  .driver(phantom());

// Note: assert and done here come from a Mocha-style test context.
x('http://google.com', 'title')(function(err, str) {
  if (err) return done(err);
  assert.equal('Google', str);
  done();
});
Answering this as a canonical, an alternative to Puppeteer for scraping dynamic sites which is also well-supported as of 2023 is Playwright. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1

let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  const text = await page.locator('h1:text("Example")').textContent();
  console.log(text); // => Example Domain
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
The easiest and most reliable solution is to use Puppeteer, as covered in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.
The only change needed is the timeout in Browser.js, TimeoutSettings.js, and Launcher.js: from 300000 to 3000000.
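Editing Puppeteer's source files is fragile; a hedged alternative is to raise the same timeouts through the public API:
// Raise timeouts via the API instead of patching library internals.
const browser = await puppeteer.launch({ timeout: 3000000 }); // launch timeout
const page = await browser.newPage();
page.setDefaultNavigationTimeout(3000000); // goto / waitForNavigation
page.setDefaultTimeout(3000000);           // waitForSelector etc.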

Creating a new tab in Puppeteer inside a loop causes a Navigation timeout

Recently I have been learning Puppeteer from their docs and trying to scrape some information.
First approach
First I collect a list of URLs from the main page. Second, I create a new tab and visit those URLs iteratively to collect some data. I suspect that when I enter the loop, the new tab doesn't work as I expect and freezes without returning any data. Eventually I got an error: TimeoutError: Navigation timeout of 30000 ms exceeded. Is there any better approach?
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const mainpage = await browser.newPage();
  console.log('goto main page'.green);
  await mainpage.goto(mainURL);

  console.log('collecting some urls'.green);
  const URLS = await mainpage.evaluate(() =>
    Array.from(
      document.querySelectorAll('.result-actions a'),
      (element) => element.href
    )
  );
  if (typeof URLS[0] === 'string') console.log('OK'.green);
  console.log('collecting finished'.green);

  const newTab = await browser.newPage();
  console.log('create new tab'.green);
  var data = [];
  for (let i = 0, n = URLS.length; i < n; i++) {
    // console.log(URLS[i]);
    // use this new tab to collect some data, then close this tab
    // and continue the process
    await newTab.waitForNavigation();
    await newTab.goto(URLS[i]);
    await newTab.waitForSelector('.profile-phone-column span a');
    console.log('Go to each url using the new tab'.green);
    // collecting data
    data.push(collected_data);
    // close this tab
    await newTab.close();
    console.log(data);
  }

  await mainpage.close();
  await browser.close();
  console.log('closing browser'.green);
})();
Second approach
This time I want to skip the part where I collect the data using a new tab. Instead I collect my URLs using page.$$(), try iterating over them with for...of, and collect my data using elementHandle.$(selector), but this approach also failed.
I am getting frustrated. Am I doing it the wrong way, or did I misunderstand their documentation?
In your script, you do not need newTab.waitForNavigation(); at all. Usually this is used when the navigation is caused by some event. When you just use .goto(), the page load is awaited automatically.
Even if you do need waitForNavigation(), you usually do not await it before the navigation is triggered, otherwise you just get the timeout. You await it together with the navigation trigger:
await Promise.all([element.click(), page.waitForNavigation()]);
So just delete await newTab.waitForNavigation();.
Also, do not close the new tab inside the loop; close it after the loop.
Edited script:
const puppeteer = require('puppeteer');

const mainURL = 'https://www.psychologytoday.com/us/therapists/illinois/';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const mainpage = await browser.newPage();
  console.log('goto main page');
  await mainpage.goto(mainURL);

  console.log('collecting urls');
  const URLS = await mainpage.evaluate(() =>
    Array.from(
      document.querySelectorAll('.result-actions a'),
      (element) => element.href
    )
  );
  if (typeof URLS[0] === 'string') console.log('OK');
  console.log('collection finished');

  const collectNamePage = await browser.newPage();
  console.log('create new tab');

  var data = [];
  for (let i = 0, totalUrls = URLS.length; i < totalUrls; i++) {
    console.log(URLS[i]);
    await collectNamePage.goto(URLS[i]);
    await collectNamePage.waitForSelector('.profile-phone-column span a');
    console.log('create new tab and go there');
    // collecting data
    const [name, phone] = await collectNamePage.evaluate(
      () => [
        document.querySelector('.profile-middle .name-title-column h1').innerText,
        document.querySelector('.profile-phone-column span a').innerText,
      ]
    );
    data.push({ name, phone });
  }
  console.log(data);

  await collectNamePage.close();
  await mainpage.close();
  await browser.close();
  console.log('closing browser');
})();

Can we fetch the HTTP return code of the previous call?

I searched the docs and the web, but I can't find how to get the HTTP status code of a request.
Does anyone know?
Example :
const puppeteer = require('puppeteer');
const fs = require('fs');

const debug = true;
var base_url = 'https://stackoverflow.com/';

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com');
  // how to get the HTTP code of the last call?
  await browser.close();
})();
There's response.status(), but I don't know how to fetch just the last request rather than all of them with:
page.on('response', response => {
  console.log("response code: ", response.status());
});
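As a hedged sketch of one way to narrow that stream: keep only main-frame navigation responses, which is effectively the "last call" for the document itself:
// Register before page.goto(). Ignores subresources (CSS, images, XHR)
// and logs only top-level document responses.
page.on('response', (response) => {
  const request = response.request();
  if (request.isNavigationRequest() && request.frame() === page.mainFrame()) {
    console.log('document status:', response.status());
  }
});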
OK, got it, thanks @Take_Care:
response.status()
const puppeteer = require('puppeteer');
const fs = require('fs');

const debug = true;
var base_url = 'https://stackoverflow.com/';

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();
  const ret = await page.goto('https://stackoverflow.com');
  console.log(ret.status());
  await browser.close();
})();

In puppeteer how to wait for pop up page to finish loading?

In the following example, how do I wait for the pop-up window to finish loading?
After clicking the Google icon you get a pop-up window to log in to Gmail; when I try to interact
with the second page, it is undefined (as I don't know how to wait for it to fully load).
Any advice?
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("https://www.example.com/signin");
  await page.waitForSelector(".Icon-google");
  await page.click(".Icon-google");
  const pages = await browser.pages();
  console.log(pages[2].url());
})();
You can wait for a new target to be created.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto("https://app.testim.io/#/signin");
await page.waitForSelector(".Icon-google");

const nav = new Promise(res => browser.on('targetcreated', res));
await page.click(".Icon-google");
await nav;

const pages = await browser.pages();
console.log(pages.length); // the number of pages increases!
console.log(pages.map(page => page.url()));
P.S. First I tried page.waitForNavigation(), but it didn't work, probably because it's a popup.
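Another option is the page's popup event, which resolves with the new Page object directly: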
const [newPage] = await Promise.all([
  new Promise((resolve) => page.once('popup', resolve)),
  page.click('something.that-will-open-the-popup'),
]);
await newPage.waitForSelector('.page-is-loaded');