Puppeteer: Screenshot lazy images not working [duplicate] - puppeteer

This question already has answers here:
Puppeteer wait for all images to load then take screenshot
(5 answers)
Closed 2 years ago.
I don't seem to be able to successfully capture a screenshot of https://today.line.me/HK/pc.
In my Puppeteer script, I also initiate a scroll to the bottom of the page and back up again to ensure images are loaded. But for some reason it doesn't seem to work on the LINE URL above.
const puppeteer = require('puppeteer');

function wait (ms) {
  return new Promise(resolve => setTimeout(() => resolve(), ms));
}

async function run() {
  let browser = await puppeteer.launch({headless: false});
  let page = await browser.newPage();
  await page.goto('https://today.line.me/HK/pc', {waitUntil: 'load'});

  // Get the height of the rendered page
  const bodyHandle = await page.$('body');
  const { height } = await bodyHandle.boundingBox();
  await bodyHandle.dispose();

  // Scroll one viewport at a time, pausing to let content load
  const viewportHeight = page.viewport().height + 200;
  let viewportIncr = 0;
  while (viewportIncr + viewportHeight < height) {
    await page.evaluate(_viewportHeight => {
      window.scrollBy(0, _viewportHeight);
    }, viewportHeight);
    await wait(4000);
    viewportIncr = viewportIncr + viewportHeight;
  }

  // Scroll back to top
  await page.evaluate(_ => {
    window.scrollTo(0, 0);
  });

  // Some extra delay to let images load
  await wait(2000);
  await page.setViewport({ width: 1366, height: 768 });
  await page.screenshot({ path: './image.png', fullPage: true });
}
run();

For anyone wondering, there are many strategies to render lazy-loaded images or assets in Puppeteer, but not all of them work equally well. Small implementation details in the website you're attempting to screenshot can change the final result, so if you want an implementation that works well across many scenarios, you will need to isolate each generic case and address it individually.
I know this because I run a small screenshot API service and I had to address many cases separately. This is a big part of the project, since there always seems to be something new that needs to be addressed as new libraries and UI techniques appear every day.
That being said, I think there are some rendering strategies with good coverage. Probably the best one is a combination of waiting and scrolling through the page, as OP did, while also taking the order of the operations into account. Here is a slightly modified version of OP's original code.
// Scroll and Wait Strategy
function waitFor (ms) {
  return new Promise(resolve => setTimeout(() => resolve(), ms));
}

async function capturePage(browser, url) {
  // Load the page that you're trying to screenshot.
  const page = await browser.newPage();
  await page.goto(url, {waitUntil: 'load'}); // Waiting until networkidle2 could work better.

  // Set the viewport before scrolling
  await page.setViewport({ width: 1366, height: 768 });

  // Get the height of the page after navigating to it.
  // This strategy to calculate height doesn't always work, though.
  const bodyHandle = await page.$('body');
  const { height } = await bodyHandle.boundingBox();
  await bodyHandle.dispose();

  // Scroll viewport by viewport, allowing the content to load
  const calculatedVh = page.viewport().height;
  let vhIncrease = 0;
  while (vhIncrease + calculatedVh < height) {
    // Here we pass the calculated viewport height to the context
    // of the page and we scroll by that amount
    await page.evaluate(_calculatedVh => {
      window.scrollBy(0, _calculatedVh);
    }, calculatedVh);
    await waitFor(300);
    vhIncrease = vhIncrease + calculatedVh;
  }

  // Setting the viewport to the full height of the page might reveal extra elements
  await page.setViewport({ width: 1366, height: Math.ceil(height) });

  // Wait for a little bit more
  await waitFor(1000);

  // Go back to the original viewport height, since the viewport
  // also affects the intended rendering
  await page.setViewport({ width: 1366, height: calculatedVh });

  // Scroll back to the top of the page by using evaluate again.
  await page.evaluate(_ => {
    window.scrollTo(0, 0);
  });

  return await page.screenshot({type: 'png'});
}
Some key differences here are:
You want to set the viewport from the beginning and operate with that fixed viewport.
You can change the wait times and introduce arbitrary waits to experiment. Sometimes this causes elements that are waiting on network events to appear.
Changing the viewport to the full height of the page can also reveal elements, as if you were scrolling. You can test this in a real browser by using a vertical monitor. However, make sure to go back to the original viewport height, because the viewport also affects the intended rendering.
One thing to understand here is that waiting alone is not necessarily going to trigger the loading of lazy assets. Scrolling through the height of the document allows the viewport to reveal those elements that need to be within the viewport to get loaded.
Another caveat is that sometimes you need to wait a relatively long time for an asset to load, so in the example above you might need to experiment with the amount of time you wait after each scroll. Also, as I mentioned, arbitrary waits in the general execution sometimes affect whether an asset loads or not.
In general, when using Puppeteer for screenshots, you want to make sure that your logic resembles real user behavior. Your goal is to reproduce rendering scenarios as if someone were launching Chrome on their computer and navigating to that website.
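The scroll loop in the strategy above boils down to computing a series of scrollBy offsets and visiting each one with a pause. A small sketch of just that stepping logic in plain Node (no Puppeteer needed; scrollSteps is a hypothetical helper name, not an API):

```javascript
// Compute the scrollBy steps the loop above performs, given the total
// document height and the viewport height. Pure helper for illustration.
function scrollSteps(pageHeight, viewportHeight) {
  const steps = [];
  let scrolled = 0;
  while (scrolled + viewportHeight < pageHeight) {
    steps.push(viewportHeight); // each step is one window.scrollBy(0, step)
    scrolled += viewportHeight;
  }
  return steps;
}

// A 2000px page seen through a 768px viewport needs two scroll steps
// (0 -> 768 -> 1536); a third step would run past the end of the document.
console.log(scrollSteps(2000, 768)); // → [ 768, 768 ]
```

Testing this piece in isolation makes it easier to reason about why a given page is only partially triggering its lazy loaders.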

I have resolved this issue by changing the logic of how I scroll the page and how long I wait between scrolls.

A solution that worked for me:
Adjust the timeout limit for my test runner (mocha).
// package.json
"scripts": {
"start": "react-scripts start",
"build": "react-scripts build",
"eject": "react-scripts eject",
"test": "mocha --timeout=5000" <--- set timeout to something higher than 2 seconds
},
Wait for x seconds, where x is roughly half of what you set above, then take the screenshot.
var path = require("path"); // built in with NodeJS
await new Promise((resolve) => setTimeout(() => resolve(), 2000));
var file_path = path.join(__dirname, "__screenshots__/initial.png");
await page.screenshot({ path: file_path });

Related

How to dynamically assign Puppeteer viewport size from current screen resolution?

I'm using Puppeteer to automate some page actions in an already open, fully-visible browser (non-headless). Currently, I manually set the viewport like this:
const page = await browser.newPage();
await page.setViewport({width: W, height: H});
I have to manually set W and H based on both the actual screen resolution, and on the system-wide scaling factor. This makes the script very brittle and non-portable.
I would like to have the new page always open with the largest possible visible viewport, without having to manually specify what that is. I tried some of the other solutions suggested on SO and elsewhere, such as setting the viewport to null, but I have not yet stumbled upon a working solution for my specific use case. Any help would be appreciated. Thanks!
If you want to set W and H persistently across a launched browser, you need to set defaultViewport: null together with the --window-size=${W},${H} launch arg. This sets the window size and viewport at the browser level, not at the page level (which changes with each new tab).
This way, all newly opened tabs will share the same window size and viewport.
const browser = await puppeteer.launch({
defaultViewport: null,
args: [`--window-size=${W},${H}`]
})
If you can retrieve the screen resolution from the system specifications, you would be able to set the viewport size correctly from it.
You will probably not be able to get this information directly from JavaScript, though.
If you can get it from a PowerShell script (see edit), you could try the following to execute that script from JavaScript and retrieve the information in your program, in order to set your viewport dimensions.
const {spawn} = require("child_process");

async function getSomeDataFromAPowerShellScript() {
  // Spawn a PowerShell terminal as a child process of the main program
  // and run the provided script in it
  const child = spawn("powershell.exe", ["./PATH/MyPowerShellScript.ps1"]);
  return await new Promise(resolve => {
    child.stdout.on("data", (data) => { // triggered when data is written to the child's stdout
      console.log(data);
      resolve(data.toString());
    });
  });
}
A call to getSomeDataFromAPowerShellScript() will return the first chunk of output from the PowerShell terminal as a string.
If you want to retrieve more than just the first output from the PowerShell terminal, you can use this instead:
async function getSomeDataFromAPowerShellScript() {
  const child = spawn("powershell.exe", ["./PATH/MyPowerShellScript.ps1"]);
  let result = [];
  return await new Promise(resolve => {
    child.stdout.on("data", (data) => { // triggered when data is written to the child's stdout
      console.log(data);
      result.push(data.toString());
    });
    child.on("exit", () => { // triggered when the child process exits after execution
      resolve(result);
    });
  });
}
Edit:
You could use this powershell script from Ben N answer here How to get the current screen resolution on windows via command line? to get the current resolution of your primary screen:
PowerShell-script.ps1
Add-Type @"
using System;
using System.Runtime.InteropServices;
public class PInvoke {
    [DllImport("user32.dll")] public static extern IntPtr GetDC(IntPtr hwnd);
    [DllImport("gdi32.dll")] public static extern int GetDeviceCaps(IntPtr hdc, int nIndex);
}
"@
$hdc = [PInvoke]::GetDC([IntPtr]::Zero)
[PInvoke]::GetDeviceCaps($hdc, 118) # width
[PInvoke]::GetDeviceCaps($hdc, 117) # height
From the original explanation:
It outputs two lines: first the horizontal resolution, then the vertical resolution.
To run it, save it to a file (e.g. screenres.ps1) and launch it with
PowerShell:
powershell -ExecutionPolicy Bypass .\screenres.ps1
Using this answer in combination with theDavidBarton's answer should achieve what you're asking for.
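Back on the Node side, the script's two lines of stdout (width, then height) just need to be split and converted to numbers before they can be fed to page.setViewport or the --window-size launch arg. A minimal parsing sketch (parseScreenRes is a hypothetical helper name):

```javascript
// Parse the two-line PowerShell output ("width\nheight") into numbers.
// Accepts either a single string or the array of chunks collected above.
function parseScreenRes(output) {
  const text = Array.isArray(output) ? output.join('') : String(output);
  const [width, height] = text.trim().split(/\r?\n/).map(Number);
  return { width, height };
}

console.log(parseScreenRes('1920\r\n1080\r\n')); // → { width: 1920, height: 1080 }
```

Note that PowerShell on Windows emits \r\n line endings, which is why the split handles both.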

Puppeteer test runs are not consistent

I've made some tests using Jest and Puppeteer, but some of them pass/fail inconsistently. From what I've found, all of the tests pass consistently with {headless: false}, and I can see Puppeteer interacting with Chromium. But once I set {headless: true}, a few of them pass/fail unpredictably, and it's always the same few tests that are inconsistent. For the tests that inconsistently pass, the reason is always that an element was not found. This is one of the test cases that inconsistently passes, with the reason being #gallery not found.
describe('Gallery', () => {
  it('opens gallery modal', async () => {
    const mediaSelector = 'div#media';
    const gallerySelector = '#gallery';
    await page.$eval(mediaSelector, (e) => {
      e.scrollIntoView({ behavior: 'smooth', block: 'center' });
    });
    await page.click(mediaSelector);
    await page.waitForTimeout(1000);
    await expect(page).toMatchElement(gallerySelector);
  });
});
I've made sure to precede every expect statement with await page.waitForTimeout(1000); to give the element enough time to appear in the HTML. I had heard that Puppeteer behaves differently when it's headless vs. headful, but I didn't think it would affect this type of test that much, since it's a fairly straightforward test. Any suggestions on how I can overcome the different behavior between headless and headful Puppeteer?
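One common way to reduce this kind of flakiness is to wait for the condition itself rather than a fixed amount of time, e.g. await page.waitForSelector(gallerySelector) before the expect, since it polls until the element exists instead of hoping 1000 ms is enough. The generic pattern behind such waits is a poll-until-true loop; a sketch in plain Node (pollFor is a hypothetical helper, not a Jest or Puppeteer API):

```javascript
// Poll a predicate every `interval` ms until it returns true or `timeout`
// ms have passed. Resolves true on success, false on timeout.
async function pollFor(predicate, timeout = 1000, interval = 50) {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    if (await predicate()) return true;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  return false;
}
```

Puppeteer's page.waitForSelector and page.waitForFunction implement this same idea against the DOM, which makes tests much less sensitive to headless/headful timing differences than fixed sleeps.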

Unable to locate an element with puppeteer

I'm trying to do a basic search on the FB marketplace with Puppeteer (and it was working for me before), but it has started failing recently.
The whole thing fails when it gets to the "location" link on the marketplace page. To change the location I need to click on it, but Puppeteer errors out saying:
Error: Node is either not visible or not an HTMLElement
If I try to get the boundingBox of the element it returns null.
const browser = await puppeteer.launch();
const page = await browser.newPage();
const resp = await page.goto('https://www.facebook.com/marketplace', { waitUntil: 'networkidle2' })
const withinLink = await page.waitForXPath('//span[contains(.,"Within")]', { timeout: 4000 })
console.log(await withinLink.boundingBox()) //returns null
await withinLink.click() //errors out
If I take a screenshot of the page right before I locate the element, it is clearly there, and I am able to locate it in the Chrome console manually using the same XPath.
It just doesn't seem to work in Puppeteer.
Something clearly changed on FB. Maybe they started to use some AI technology to detect scraping?
I don't think Facebook changed its headless browser detection lately, but it seems you haven't taken into account that const withinLink = await page.waitForXPath('//span[contains(.,"Within")]', { timeout: 4000 }) returns an array, even if there is only one element matching contains(.,"Within").
That should work if you add [0] index to the elementHandles:
const withinLink = await page.waitForXPath('//span[contains(.,"Within")]')
console.log(await withinLink[0].boundingBox())
await withinLink[0].click()
Note: Timeout is not mandatory in waitForXPath, but I'd suggest using domcontentloaded instead of networkidle2 in page.goto if you don't need all the analytics/tracking events to achieve the desired results; networkidle2 just slows down your script execution.
Note 2: Honestly, I don't have such element on my fb platform, maybe it is market dependent. But it works with any other XPath selectors with specific content.

How to add scraped data with Puppeteer into HTML

ADVISORY: I'm trying my hand at this for the first time.
I created an HTML page that displays bus timings. To get the bus timings I had to scrape the local bus service website with Puppeteer. I do scrape the time for the next bus correctly, but I can't seem to add it to my HTML page.
I've tried adding the script tags with src pointing to my JS file. I tried adding them to the head, in the div that should display the time, and right before the closing body tag, but I couldn't display the time. I even tried adding the JS in a script tag to the HTML, and that didn't work.
//Here's code for scraping in busTimeScraper.js :
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.goto('bustimes.com'); // Dummy website for this example
  await page.setViewport({width: 1500, height: 1500});
  await page.waitFor(5000);
  const result = await page.evaluate(() => {
    let time = document.querySelector('#RouteTimetable').innerText;
    return {
      time
    };
  });
  await browser.close();
  return result;
};

scrape().then((value) => {
  var timing = value.time;
  document.querySelector('#Time').innerText = timing;
});
//The html is :
<div id="Time">
<!--<script type="text/javascript" src="busTimeScraper.js">
</script>-->
</div>
I can see the time being scraped when I run the JS file and do a console.log on the timing variable. I expected the div to be populated with the same time value, but it just stays blank.
You simply cannot add your server-side JS to your client-side HTML using a script tag and expect it to work, no matter where you add it (in the head, inside an element, or before the closing body tag).
The simplest solution would be to expose the result (the timing variable) via a Node.js API and consume that API from your client-side JS to get the value and do the rest of the client-side work.

Puppeteer Element Handle loses context when navigating

What I'm trying to do:
I'm trying to get a screenshot of every element example in my storybooks project. The way I'm trying to do this is by clicking on the element and then taking the screenshot, clicking on the next one, screenshot etc.
Here is the attached code:
test('no visual regression for button', async () => {
  const selector = 'a[href*="?selectedKind=Buttons&selectedStory="]';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://localhost:8080');
  let examples = await page.$$(selector);
  await examples.map( async(example) => {
    await example.click();
    const screen = await page.screenshot();
    expect(screen).toMatchImageSnapshot();
  });
  await browser.close();
});
But when I run this code I get the following error:
Protocol error (Runtime.callFunctionOn): Target closed.
at Session._onClosed (../../node_modules/puppeteer/lib/Connection.js:209:23)
at Connection._onClose (../../node_modules/puppeteer/lib/Connection.js:116:15)
at Connection.dispose (../../node_modules/puppeteer/lib/Connection.js:121:10)
at Browser.close (../../node_modules/puppeteer/lib/Browser.js:60:22)
at Object.<anonymous>.test (__tests__/visual.spec.js:21:17)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:169:7)
I believe it is because the element loses its context or something similar, and I don't know what methods to use to get around this. Could you provide a deeper explanation or a possible solution? I don't find the API docs helpful at all.
ElementHandle.dispose() is called once page navigation occurs, as part of garbage collection, as stated in the docs. So when you call element.click() the page navigates, and the rest of the element handles no longer point to anything.
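A second problem in the posted test compounds this: examples.map(async …) produces an array of promises, and awaiting an array is a no-op, so browser.close() can run while the clicks are still in flight (hence the Target closed error). Processing the handles one at a time avoids that; the pattern in plain Node (processSequentially is a hypothetical name for illustration):

```javascript
// Run an async function over items one at a time, in order, instead of
// firing them all at once with .map(async …) and never really awaiting them.
async function processSequentially(items, fn) {
  const results = [];
  for (const item of items) {
    results.push(await fn(item)); // each step finishes before the next starts
  }
  return results;
}
```

In the test, that means replacing the .map with for (const example of examples) { await example.click(); … } before calling browser.close(). If clicking actually navigates, re-query the handles inside the loop instead of reusing the stale ones.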