pageFunction in Puppeteer returns empty object

I'm using page.$eval in Puppeteer and I don't know why a pageFunction would return an empty object when the object isn't empty. Here's a code sample:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 1000
  });
  const page = await browser.newPage();
  await page.goto('https://www.google.com/search?q=news');
  const result1 = await page.$eval('#resultStats', elem => elem.textContent);
  console.log('result1', result1); // returns 'About 2,890,000,000 results (0.45 seconds)'. This is expected behavior straight from the Puppeteer docs
  const result2 = await page.$eval('#resultStats', elem => elem);
  console.log('result2', result2); // returns an empty object. Why? I would have expected to see a DOM Node object here
  await browser.close();
})();
How do I get the whole element back in result2?

I didn't understand that the pageFunction runs inside Chromium itself, so in the second example, where the function is elem => elem, it is actually returning a live DOM element to Puppeteer.
But returning a live DOM element from Chromium back to Puppeteer isn't possible, because the data Puppeteer passes to and from Chromium has to be serializable via JSON.stringify / JSON.parse. When Puppeteer runs JSON.stringify on a DOM element, I believe it returns an empty object, since the element's data is not stored in its own enumerable properties.

As you said above, you can access the DOM node inside the evaluate function. But when you return the DOM node from that function, Puppeteer serializes whatever you returned, so you don't get the actual node back.
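To see why the second call prints an empty object, here is a minimal sketch in plain Node.js with no browser involved. `FakeElement` is a hypothetical stand-in for a DOM element (assumption: like real DOM nodes, it exposes its data through prototype getters rather than own enumerable properties), and `describeElement` is an illustrative helper, not part of Puppeteer.

```javascript
// Hypothetical stand-in for a DOM element: its data lives behind
// prototype getters, not in own enumerable properties.
class FakeElement {
  #tag;
  #text;
  constructor(tag, text) {
    this.#tag = tag;
    this.#text = text;
  }
  get tagName() { return this.#tag; }
  get textContent() { return this.#text; }
}

// Illustrative helper: copy the fields you need into a plain object,
// which survives a JSON round trip.
function describeElement(elem) {
  return { tagName: elem.tagName, textContent: elem.textContent };
}

const elem = new FakeElement('DIV', 'About 2,890,000,000 results');

// Serializing the object itself finds nothing to copy.
const serializedElement = JSON.stringify(elem);
console.log(serializedElement); // prints {}

// Serializing a plain summary keeps the data.
const serializedSummary = JSON.stringify(describeElement(elem));
console.log(serializedSummary);
```

In Puppeteer the same idea would be page.$eval('#resultStats', elem => ({ tagName: elem.tagName, text: elem.textContent })), or elem => elem.outerHTML if the markup is what you need. To keep a reference to the element itself, page.$('#resultStats') returns an ElementHandle rather than a serialized copy.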

Related

How to Check if Text Exists on a Particular Element (not Page) with Puppeteer

I want to run $x on a specific element, not the whole page, and tried this:
let selector = await page.$('.myClass');
let [el] = await selector.$x(`//p[contains(text(), 'myTextString')]`);
pass = el ? true : false
I'm expecting pass to evaluate to false, because the element with myClass does not contain myTextString, but el is not falsy because myTextString exists elsewhere on the page.
Is there a way to check for a string of text on a particular element?
Code to recreate the issue with $x:
const puppeteer = require('puppeteer');
async function test() {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
    args: ['--start-maximized']
  });
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com/');
  // select a header div with no text content
  let selector = await page.$('#notify-container');
  // an element is still found, even though I'm running $x on the selector, not the page
  let [el] = await selector.$x(`//h2[contains(text(), 'Find the best answer to your technical question, help others answer theirs')]`);
  console.log(el);
  await browser.close();
}
test();
According to the docs, "The method evaluates the XPath expression relative to the elementHandle as its context node". So you just need to use the context node symbol in the beginning of the XPath: .//h2 instead of //h2.
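A small sketch of that fix follows. `toRelativeXPath` and `elementContainsText` are hypothetical helper names, and `handle` is assumed to be an ElementHandle that still offers `$x` (newer Puppeteer releases may have dropped `$x` in favor of `xpath/`-prefixed selectors, so check your version). The text is assumed to contain no single quotes.

```javascript
// Hypothetical helper: anchor a document-wide XPath ("//p") to the
// context node by prefixing ".", so it only searches below that node.
function toRelativeXPath(xpath) {
  return xpath.startsWith('//') ? `.${xpath}` : xpath;
}

// Hypothetical helper: true if a <p> under `handle` contains `text`.
// Assumes `handle.$x(expression)` resolves to an array of matches.
async function elementContainsText(handle, text) {
  const expression = toRelativeXPath(`//p[contains(text(), '${text}')]`);
  const matches = await handle.$x(expression);
  return matches.length > 0;
}
```

With the question's variables this would read: const pass = await elementContainsText(selector, 'myTextString');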

how to print response json from website using puppeteer?

I am trying to get the Google Translate website to do some work for me. The website returns a blank web page with a JSON file. Using a web browser, I can save the JSON file and open it in a text editor.
I am trying to use puppeteer to get this done automatically. Here is my code:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({headless:false, args: ["--no-sandbox"]});
  const page = await browser.newPage();
  // Approach 1:
  const response = await page.goto('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report');
  let text = await response.text();
  console.log(text);
  let json = await response.json();
  console.log(json);
  await browser.close();
})();
When I run this code, the browser is launched, but the returned JSON file still gets saved to disk automatically instead of being printed to the console. Which Puppeteer class should I use for this task?
Since it is an API call and the expected result is JSON, you can use simple Node.js or jQuery code to return the response, as below.
$.get('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) => {
  console.log(data);
});
But if you are particular about using Puppeteer and want to return the response, you would do the following.
Add a jQuery dependency to your project by running
npm install jquery
Import jQuery into the project.
Invoke the code below, without launching the browser.
$.get('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) => {
  console.log(data);
});
Here is the link to the JSFiddle code: https://jsfiddle.net/faizmagic/0h6cm1o4/latest/
I hope this helps.
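As a sketch of the plain Node.js route mentioned above, the request can also be made without jQuery or a browser. This assumes Node.js 18 or later, where fetch is available globally. `fetchJson` is a hypothetical helper name, and its `fetchImpl` parameter exists only so the function can be tried offline with a stand-in.

```javascript
// Sketch: fetch a URL and parse the body as JSON.
// `fetchImpl` defaults to the global fetch (assumed: Node.js 18+).
async function fetchJson(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    // Surface HTTP errors instead of trying to parse an error page.
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Inside an async function, console.log(await fetchJson(url)) with the translate URL from the question would print the parsed data, assuming the endpoint still answers with JSON.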

how to execute a script in every window that gets loaded in puppeteer?

I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't look to be possible to use through puppeteer.
Any idea? Thanks.
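browser.waitForTarget searches for a target in all browser contexts.
An example of finding a target for a page opened
via window.open() or popups:
await page.evaluate(() => window.open('https://www.example.com/'))
// the predicate must return true for the target you are waiting for
const newWindowTarget = await browser.waitForTarget(target => target.url() === 'https://www.example.com/')
const newWindowPage = await newWindowTarget.page()
await newWindowPage.evaluate(() => {
  runTheScriptYouLike()
  console.log('Hello StackOverflow!')
})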
via browser.pages() or tabs
This evaluates a script in the second tab:
const pageTab2 = (await browser.pages())[1]
const runScriptOnTab2 = await pageTab2.evaluate(() => {
  runTheScriptYouLike()
  console.log('Hello StackOverflow!')
})
via page.frames() or iframes
An example of evaluating a script inside an iframe:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
  return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you
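Puppeteer does expose the DevTools call mentioned in the question: page.evaluateOnNewDocument wraps Page.addScriptToEvaluateOnNewDocument. A rough sketch of applying it to every tab and popup follows. `injectIntoEveryPage` is a hypothetical helper name and `browser` is assumed to be a Puppeteer Browser. One caveat: a popup's first document may already have started loading by the time `targetcreated` fires, so the script is not guaranteed to run in that very first document.

```javascript
// Sketch: register `initScript` on every page the browser knows about,
// now and in the future.
async function injectIntoEveryPage(browser, initScript) {
  const register = async (page) => {
    // Runs `initScript` in each new document of this page, including
    // child frames, before the page's own scripts run.
    await page.evaluateOnNewDocument(initScript);
  };

  // Pages (tabs) that are already open.
  for (const page of await browser.pages()) {
    await register(page);
  }

  // Tabs and popups created later. target.page() resolves to null for
  // targets that are not pages (for example service workers).
  browser.on('targetcreated', async (target) => {
    const page = await target.page();
    if (page) {
      await register(page);
    }
  });
}
```

It would be called once, right after puppeteer.launch(), for example: await injectIntoEveryPage(browser, () => { window.__injected = true; }); where the property name is purely illustrative.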

Finding/Returning Text

I need to open very simple websites and scan for a json object i.e.
myJSONObject:["el1","el2"]. There is only one HTML <pre> tag on the site that contains 100s of lines of text. Nothing else.
I was planning on scanning the page for myJSONObject: and then return ["el1", "el2"].
I used the following, which returns true, as it finds "myJSONObject:", but I have no way to return any text.
const found = await page.evaluate(() => window.find("myJSONObject:"));
Is there a way to use a regexp or something to find the needed text and return it? Is this at all possible?
I am new to puppeteer, so I am unsure of its capabilities. I appreciate any feedback.
You already found the right function (page.evaluate) for the job. With it you can return strings, objects, numbers or booleans (in fact any serializable value) from the browser/page context to the Node context.
In case it isn't clear yet: the browser/page context and the Node context are different. The only way to transfer data between them is to serialize the data and pass it across.
That said, to solve your problem you have to come up with a regex and return the matched string. Full working example:
Suppose the <pre> text is something like this: <pre>[...] myJSONObject:["el1","el2"] [...]</pre>
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // setup test page
  await page.evaluate(() => {
    const pre = document.createElement('pre');
    pre.innerText = '<pre>[...] myJSONObject:["el1","el2"] [...]</pre>';
    document.body.append(pre);
  });
  // important part (this is the answer to your question)
  const myJson = await page.evaluate(() => {
    var re = /myJSONObject:(\[.*?])/; // regex to match "json text"
    const pre = document.querySelector('pre').innerText;
    const matchedJsonText = pre.match(re)[1];
    const json = JSON.parse(matchedJsonText);
    return json;
  });
  // show results
  console.log('myJSONObject:', myJson);
  await browser.close();
})();
Please note that this regex only works with the JSON you provided as an example. You'll have to come up with a better regex to match the JSON that you need.
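As a slightly more defensive variant of the extraction step, here is a sketch that runs in plain Node.js. `extractJsonArray` is a hypothetical helper name. It still assumes a flat array (the lazy match stops at the first closing bracket), but it returns null instead of throwing when the key is missing or the matched text is not valid JSON.

```javascript
// Hypothetical helper: find `key:[ ... ]` in `text` and parse the array.
// Returns null when the key is missing or the match is not valid JSON.
function extractJsonArray(text, key) {
  // Escape regex metacharacters in the key so it is matched literally.
  const escapedKey = key.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Lazy match up to the first "]": fine for flat arrays, not nested ones.
  const pattern = new RegExp(`${escapedKey}:\\s*(\\[.*?\\])`, 's');
  const match = text.match(pattern);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

const sample = '[...] myJSONObject:["el1","el2"] [...]';
console.log(extractJsonArray(sample, 'myJSONObject')); // prints [ 'el1', 'el2' ]
```

Because a string is serializable, one option is to pull the text out with page.$eval('pre', el => el.innerText) and run the helper on the Node side rather than inside page.evaluate.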

get post title after Infinite scroll finished

I managed to show all the posts on a site that has a load_more button to go to the next page, but something is missing.
I got this error:
e Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint (/Users/minghann/Documents/productnation_scraper/node_modules/puppeteer/lib/ExecutionContext.js:331:13)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
This doesn't happen if I don't load all the posts. It's hard to debug because I don't know which post is missing what. Full code below:
const browser = await puppeteer.launch({
  devtools: true
});
const page = await browser.newPage();
await page.goto("https://example.net");
await page.waitForSelector(".load_more_btn");
const load_more_exist = !!(await page.$(".load_more_btn"));
while (load_more_exist > 0) {
  await page.click(".load_more_btn");
}
const posts = await page.$$(".post");
let result = [];
for (const post of posts) {
  result = [
    ...result,
    {
      title: await post.$eval(".post_title a", e => e.innerText)
    }
  ];
}
console.log(result);
browser.close();
There are multiple ways, and the best approach is to combine the following two.
Look for Ajax
Wait for the request instead. Whenever you click on Load More, the page makes a simple Ajax request to ?ajax-request=jnews. We can use .waitForRequest or .waitForResponse for this use case. Here is a working example:
await Promise.all([
  // waitForResponse rather than waitForRequest, because only a response has a status()
  page.waitForResponse(response => response.url().includes('?ajax-request=jnews') && response.status() === 200),
  page.click(".load_more_btn")
])
Clean DOM and wait for new Element
Refer to these answers here and here.
Basically, you can remove the DOM elements that you have already collected, so the next time you collect data there won't be any duplicates.
So, once you remove all current elements, such as those matched by document.querySelectorAll('.jeg_post'), you can simply call page.waitForSelector('.jeg_post') again later if you need to.
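The loop in the question checks for the button only once, so it keeps clicking after the button has been hidden or removed, which is one plausible cause of the "not visible" error. Below is a hedged sketch of a loop that re-checks the button on every pass and waits for the Ajax response after each click. `loadAllPosts`, `isLoadMoreResponse` and `maxClicks` are hypothetical names, and `page` is assumed to be a Puppeteer Page.

```javascript
// Sketch: click a "load more" button until it disappears or is hidden.
// Returns the number of clicks performed.
async function loadAllPosts(page, buttonSelector, isLoadMoreResponse, maxClicks = 50) {
  let clicks = 0;
  while (clicks < maxClicks) {
    // Look the button up again on every pass; it may be gone by now.
    const button = await page.$(buttonSelector);
    // boundingBox() resolves to null when the element is not visible.
    if (!button || !(await button.boundingBox())) break;
    await Promise.all([
      // Start waiting before the click so the response is not missed.
      page.waitForResponse(isLoadMoreResponse),
      button.click(),
    ]);
    clicks += 1;
  }
  return clicks;
}
```

With the selector and URL fragment from this thread, a call might look like: await loadAllPosts(page, '.load_more_btn', res => res.url().includes('?ajax-request=jnews') && res.status() === 200); and the posts can be collected with page.$$ afterwards. The maxClicks cap is only a safety net against an endless feed.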