exposeFunction very slow with large object argument - puppeteer

Steps to reproduce
Call exposeFunction with large object
Tell us about your environment:
Puppeteer version: puppeteer@13.5.1
Platform / OS version: Windows 11
URLs (if applicable):
Node.js version: v16.15.0
Please include code that reproduces the issue.
pup.js
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.exposeFunction('__addImageData', data => { });
await page.goto('file:///path/to/index.html');
await page.evaluate(() => {
for(var i = 0; i < 600; i++) {
console.log(i);
window.__draw();
}
})
await browser.close();
})();
index.html
<!doctype html>
<html>
<head></head>
<body>
<script>
window.__draw = function() {
var data = new Uint8Array(1280*720*4);
window.__addImageData(data);
}
</script>
</body>
</html>
What is the expected result?
Relatively fast running code
What happens instead?
Each call to window.__addImageData takes ~1 second. Calling the function without
the data is near instant.
Is there another way to pass large objects to exposed functions?
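One possible explanation, offered as an assumption rather than something confirmed in this report: arguments to an exposed function are serialized as JSON on their way from the page to Node, and JSON turns a typed array into an object with one key per byte. If that is the cause, sending the bytes as a base64 string should be much cheaper. The sketch below shows the JSON shape and a base64 round trip; `bytesToBase64` and `base64ToBytes` are hypothetical helper names, not Puppeteer APIs.

```javascript
// Sketch: why a typed array is expensive to send as JSON, and a base64 round trip.
const bytes = new Uint8Array([1, 2, 3, 255]);

// JSON turns a typed array into an object with one key per byte.
const asJson = JSON.stringify(bytes);
console.log(asJson); // {"0":1,"1":2,"2":3,"3":255}

// Page side (browser): encode in chunks so String.fromCharCode is not
// called with too many arguments at once.
function bytesToBase64(u8) {
  let binary = '';
  const chunk = 0x8000;
  for (let i = 0; i < u8.length; i += chunk) {
    binary += String.fromCharCode.apply(null, u8.subarray(i, i + chunk));
  }
  return btoa(binary);
}

// Node side: decode the string back into bytes.
function base64ToBytes(b64) {
  return new Uint8Array(Buffer.from(b64, 'base64'));
}
```

With these helpers, the page would call window.__addImageData(bytesToBase64(data)), and the function passed to page.exposeFunction would start with base64ToBytes(b64). Whether this removes the full ~1 second per call has not been measured here.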

Related

Puppeteer: how to access/intercept a FileSystemDirectoryHandle?

I'm wondering if it's possible within puppeteer to access a FileSystemDirectoryHandle (from the File System Access API). I would like to pass in a directory path via puppeteer as though the user had selected a directory via window.showDirectoryPicker(). On my client page I use the File System Access API to write a series of png files taken from a canvas element, like:
const directoryHandle = await window.showDirectoryPicker();
for (let frame = 0; frame < totalFrames; frame++){
const fileHandle = await directoryHandle.getFileHandle(`${frame}.png`, { create: true });
const writable = await fileHandle.createWritable();
updateCanvas(); // <--- update the contents of my canvas element
const blob = await new Promise((resolve) => canvas.toBlob(resolve, 'image/png'));
await writable.write(blob);
await writable.close();
}
On the puppeteer side, I want to mimic that behavior with something like:
const page = await browser.newPage();
await page.goto("http://localhost:3333/canvasRenderer.html");
// --- this part doesn't seem to exist ---
const [dirChooser] = await Promise.all([
page.waitForDirectoryChooser(),
page.click('#choose-directory'),
]);
await dirChooser.accept(['save/frames/here']);
//--------------------------------------
but waitForDirectoryChooser() doesn't exist.
I'd really appreciate any ideas or insights on how I might accomplish this!
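One workaround that might be worth trying, as a sketch under assumptions and not a Puppeteer feature: replace window.showDirectoryPicker in the page with a function that returns a fake handle, and have that handle forward each finished file to a callback. `makeFakeDirectoryHandle` and `saveFile` are hypothetical names, and the fake implements only the four calls used in the loop above.

```javascript
// Hypothetical stand-in for a FileSystemDirectoryHandle. It only covers the
// calls used in the question: getFileHandle, createWritable, write, close.
function makeFakeDirectoryHandle(saveFile) {
  return {
    kind: 'directory',
    async getFileHandle(name) {
      return {
        kind: 'file',
        name,
        async createWritable() {
          const chunks = [];
          return {
            async write(chunk) { chunks.push(chunk); },
            // Hand everything written so far to the caller-supplied sink.
            async close() { await saveFile(name, chunks); },
          };
        },
      };
    },
  };
}
```

On the Puppeteer side, the override could be installed with page.evaluateOnNewDocument, with saveFile calling a function registered through page.exposeFunction that writes into save/frames/here. That wiring is not shown, and Blob chunks would first need converting to something serializable, such as a base64 string.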

How to do web scraping into a web that has the app-root element? [duplicate]

I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.
I use cheerio in Node.js, and my code is below.
var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
request(url, function (err, res, html) {
var $ = cheerio.load(html);
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
});
This code returns an empty result because, when the page is loaded, the <ul id="store_list" class="listMain"> is empty.
The content has not been appended yet.
How can I get these elements using node.js? How can I scrape pages with dynamic content?
Here you go;
var phantom = require('phantom');
phantom.create(function (ph) {
ph.createPage(function (page) {
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
page.open(url, function() {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
page.evaluate(function() {
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
}, function(){
ph.exit()
});
});
});
});
});
Check out GoogleChrome/puppeteer
Headless Chrome Node API
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.npmjs.com/');
const textContent = await page.evaluate(() => {
return document.querySelector('.npm-expansions').textContent
});
console.log(textContent); /* No Problem Mate */
await browser.close();
})();
Because evaluate runs inside the page after its scripts have executed, it can inspect dynamically created elements.
Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.
Examples in the pages above, but here's how to do dynamic scraping:
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');
var x = Xray()
.driver(phantom());
x('http://google.com', 'title')(function(err, str) {
if (err) return done(err);
assert.equal('Google', str);
done();
})
Since this is a canonical question: Playwright is an alternative to Puppeteer for scraping dynamic sites, and it is also well supported as of 2023. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1
let browser;
(async () => {
browser = await playwright.chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
const text = await page.locator('h1:text("Example")').textContent();
console.log(text); // => Example Domain
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.
The only change needed is to raise the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.

Can a har file be programmatically generated from headless chrome using Puppeteer?

I would like to control a headless chrome instance using puppeteer, taking snapshots and clicking on various page elements, while capturing a har file. Is this possible? I have looked at the API but haven't found anything useful.
Puppeteer has no HAR generator helper, but you can use chrome-har to generate a HAR file.
const fs = require('fs');
const { promisify } = require('util');
const puppeteer = require('puppeteer');
const { harFromMessages } = require('chrome-har');
// list of events for converting to HAR
const events = [];
// event types to observe
const observe = [
'Page.loadEventFired',
'Page.domContentEventFired',
'Page.frameStartedLoading',
'Page.frameAttached',
'Network.requestWillBeSent',
'Network.requestServedFromCache',
'Network.dataReceived',
'Network.responseReceived',
'Network.resourceChangedPriority',
'Network.loadingFinished',
'Network.loadingFailed',
];
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// register events listeners
const client = await page.target().createCDPSession();
await client.send('Page.enable');
await client.send('Network.enable');
observe.forEach(method => {
client.on(method, params => {
events.push({ method, params });
});
});
// perform tests
await page.goto('https://en.wikipedia.org');
// start waiting for the navigation before clicking, so it cannot be missed
await Promise.all([
page.waitForNavigation({ waitUntil: 'networkidle2' }),
page.click('#n-help > a'),
]);
await browser.close();
// convert events to HAR file
const har = harFromMessages(events);
await promisify(fs.writeFile)('en.wikipedia.org.har', JSON.stringify(har));
})();
Here you can find an article about this solution.
The solution proposed by @Everettss is the only option so far, but it is not as good as a HAR saved from the browser. I generated a HAR for the google.com page both ways. The one generated by puppeteer-har (which uses chrome-har) has too few requests, no metrics for the main document, and strangely different timings.
Puppeteer is not a perfect option for generating HAR files, so I suggest using https://github.com/cyrus-and/chrome-har-capturer instead.
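To compare two captures more concretely than by eye, the request count, status codes, and total time can be pulled out of each HAR. This is a minimal sketch; `summarizeHar` is a hypothetical helper, and it assumes the usual HAR 1.2 layout (log.entries, each with response.status and time).

```javascript
// Summarize a HAR object so two captures can be compared side by side.
// Field names follow the HAR 1.2 layout: log.entries[].response.status, .time
function summarizeHar(har) {
  const entries = (har.log && har.log.entries) || [];
  const byStatus = {};
  let totalTime = 0;
  for (const entry of entries) {
    const status = entry.response ? entry.response.status : 0;
    byStatus[status] = (byStatus[status] || 0) + 1;
    totalTime += entry.time || 0;
  }
  return { requests: entries.length, byStatus, totalTime };
}

// Tiny hand-written sample, not real capture data.
const sample = { log: { entries: [
  { request: { url: 'https://example.com/' }, response: { status: 200 }, time: 120 },
  { request: { url: 'https://example.com/a.css' }, response: { status: 200 }, time: 30 },
  { request: { url: 'https://example.com/missing' }, response: { status: 404 }, time: 15 },
] } };
console.log(summarizeHar(sample));
```

For a file on disk, the same function can be applied to JSON.parse(fs.readFileSync('en.wikipedia.org.har', 'utf8')).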

export json object from .json file to vue through express and assign it to the variable

I would like to display on my page some data which I have in dsa.json file. I am using express with vue.
Here's my code from the server.js:
var data;
fs.readFile('./dsa.json', 'utf8', (err, data) => {
if (err) throw err;
exports.data = data;
});
Here's code from between <script> tags in index.html
var server = require(['../server']);
var data = server.data;
var scoreboards = new Vue({
el: '#scoreboard',
data: {
students: data
}
});
I am using requirejs (CDN) to require server between <script> tags in index.html.
index.html is in the public directory, whereas dsa.json and server.js are in the project root.
Here are the errors I get in the client:
require.min.js:1 GET http://localhost:3000/server.js
require.min.js:1 Uncaught Error: Script error for "../server"
I think it has something to do with context and scope but I don't know what exactly.
I am using Chrome.
Your approach is completely wrong. You can't include the server script on your page. Also, I'm not a NodeJS ninja, yet I don't think that exporting the data inside the function will work -> exports.data = data.
The workaround:
Server side:
const fs = require('fs');
const express = require('express');
const app = express();
const data = fs.readFileSync('./dsa.json', 'utf8'); // sync is ok in this case, because it runs once when the server starts, however you should try to use async version in other cases when possible
app.get('/json', function(req, res){
res.send(data);
});
Client side:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/json', true);
xhr.addEventListener('load', function() {
var scoreboards = new Vue({
el: '#scoreboard',
data: {
students: JSON.parse(xhr.response)
}
});
});
xhr.addEventListener('error', function() {
// handle error
});
xhr.send();

PhantomJS failing to load Google Maps

My end goal is to open a local html file with javascript embedded, creating a map with polygons, and take a screenshot of it using PhantomJS. I have written a simple JS file to do this:
var page = require('webpage').create();
page.open('https://www.google.com/maps', function(status) {
console.log('State: ' + status);
if(status === 'success') {
page.render('example.pdf', {format: 'pdf', quality: '100'});
}
phantom.exit();
});
This returns the error:
ReferenceError: Can't find variable: google
I've tried this on a local html file and on other websites using google maps and I keep getting the same error. I have been successful in taking a screenshot of other websites without google maps. Searching the internet it doesn't seem like people have had issues like this, and have been successful in taking screenshots of pages with google maps...so I'm wondering what could be wrong.
Another note: I installed PhantomJS as a gem in my rails project and am running the javascript file through the rails console using this gem. I have tried it using the standard installation of PhantomJS (v 2.0.0) and it still didn't work.
You'll have to wait for an element in the DOM.
For example, on maps.google.com you can wait for the watermark, which is loaded after all tiles are loaded.
var page = require('webpage').create();
page.open('https://www.google.com/maps', function (status) {
console.log('State: ' + status);
if (status === 'success') {
waitFor(function () {
return page.evaluate(function () {
var document_contains_watermark =
document.body.contains(document.getElementById('watermark'));
return document_contains_watermark;
});
}, function () {
page.render('maps-google-com.pdf', {format: 'pdf', quality: '100'});
phantom.exit();
});
}
});
function waitFor(testFn, onReady) {
var loaded = false;
var interval = setInterval(function () {
loaded = testFn();
if (loaded) {
onReady();
clearInterval(interval);
}
}, 1000);
}
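Note that waitFor as written polls forever if the element never appears, so the script would never exit. A variant with a deadline could look like the sketch below; `onTimeout`, `timeoutMs`, and `pollMs` are additions for illustration and are not part of the original helper.

```javascript
// Variant of waitFor with a deadline. onTimeout, timeoutMs and pollMs are
// illustrative additions, not part of the original helper.
function waitForWithTimeout(testFn, onReady, onTimeout, timeoutMs, pollMs) {
  var start = Date.now();
  var interval = setInterval(function () {
    if (testFn()) {
      clearInterval(interval); // stop polling before running the callback
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(interval);
      onTimeout();
    }
  }, pollMs);
}
```

In the PhantomJS script, onTimeout would typically log a message and call phantom.exit(1) so that a missing watermark does not hang the process.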
If you want to take a screenshot of a page that you developed, use the same logic but append an element yourself in the Google Maps idle event.
google.maps.event.addListenerOnce(map, 'idle', function () {
var loadedElem = document.createElement('div');
loadedElem.setAttribute("id", "idLoadedElem");
document.body.appendChild(loadedElem);
});
You should give Puppeteer a go; it makes that easy:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
// screenshot() writes image files; page.pdf() is the call for PDF output
await page.screenshot({path: 'example.png'});
await browser.close();
})();