I managed to show all the posts on a site that has a load_more button to go to the next page, but something is missing. I get this error:
Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint (/Users/minghann/Documents/productnation_scraper/node_modules/puppeteer/lib/ExecutionContext.js:331:13)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
This doesn't happen if I don't load all the posts. It's hard to debug because I don't know which post is missing what. Full code below:
const browser = await puppeteer.launch({
  devtools: true
});
const page = await browser.newPage();
await page.goto("https://example.net");
await page.waitForSelector(".load_more_btn");

const load_more_exist = !!(await page.$(".load_more_btn"));
while (load_more_exist > 0) {
  await page.click(".load_more_btn");
}

const posts = await page.$$(".post");
let result = [];
for (const post of posts) {
  result = [
    ...result,
    {
      title: await post.$eval(".post_title a", e => e.innerText)
    }
  ];
}

console.log(result);
browser.close();
There are multiple ways, and the best approach is to combine the two described below.
Look for Ajax
Wait for the request instead. Whenever you click Load More, the page makes a simple ajax request to ?ajax-request=jnews. We can use .waitForRequest or .waitForResponse for this use case. Here is a working example (using .waitForResponse, since we also check the status code):
await Promise.all([
  page.waitForResponse(response => response.url().includes('?ajax-request=jnews') && response.status() === 200),
  page.click(".load_more_btn")
]);
Clean DOM and wait for new Element
Refer to these answers here and here.
Basically you can remove the DOM elements that you have already collected, so the next time you collect more data there won't be any duplicates.
So, once you remove all current elements such as document.querySelectorAll('.jeg_post'), you can simply do another page.waitFor('.jeg_post') (page.waitForSelector in newer Puppeteer versions) later when you need more.
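Putting the two together, a rough sketch: the .load_more_btn, .jeg_post and .post_title selectors and the ?ajax-request=jnews URL are the ones mentioned above, and how the loop ends depends on whether the site actually removes the button on the last page.

const collected = [];
let hasMore = true;

while (hasMore) {
  // Wait for at least one post, collect the current batch, then remove those
  // nodes so the next pass only sees freshly loaded posts (no duplicates).
  await page.waitForSelector('.jeg_post');
  const batch = await page.$$eval('.jeg_post', posts =>
    posts.map(post => {
      const titleEl = post.querySelector('.post_title a');
      const data = { title: titleEl ? titleEl.innerText : null };
      post.remove();
      return data;
    })
  );
  collected.push(...batch);

  // Click Load More (if it is still there) and wait for its ajax response.
  hasMore = (await page.$('.load_more_btn')) !== null;
  if (hasMore) {
    await Promise.all([
      page.waitForResponse(res =>
        res.url().includes('?ajax-request=jnews') && res.status() === 200
      ),
      page.click('.load_more_btn'),
    ]);
  }
}

console.log(collected);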
I want to scrape specific JSON from a specific request URL, "http://website.com/request/api", on a page.
I have to scroll to the bottom of the page to get all the articles (already coded). At each scroll, I would like to get the JSON corresponding to the articles just displayed.
So there are two problems:
1. The same URL "http://website.com/request/api" also returns other JSON that is not useful to me (other elements of the page).
2. There are several JSONs to collect and assemble.
For problem 1, I thought of adding a condition to my code to get only the JSON beginning with a precise text, "Data : object".
For problem 2, I should be able to write the selected JSONs to a file or a buffer, assembling them.
Do you know how I could do it?
page.on('response', async (response) => {
  const request = response.request();
  if (request.url().includes('/api/graphql/')) {
    const text = await response.text();
    fs.writeFileSync('./tmp/response.json', text);
    console.log(text);
  }
});
I have resolved the problem.
const listener = page.on('response', async response => {
  const isXhr = ['xhr', 'fetch', 'json'].includes(response.request().resourceType());
  try {
    if (isXhr) {
      if (response.url().includes('/api/graphql/')) {
        const resp = await response.buffer();
        if (resp.includes('Data : object')) {
          fs.writeFileSync('./tmp/response.json', resp, { flag: 'a+' });
        }
      }
    }
  } catch (e) {}
});
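For problem 2 (collecting and assembling several JSONs), one option is to accumulate the matching payloads in memory and write them out once the scrolling loop finishes. This is a sketch only: the URL filter, the "Data : object" marker and the output path are reused from the snippets above; everything else is illustrative.

const fs = require('fs');

const collectedPayloads = [];

const onResponse = async response => {
  const isXhr = ['xhr', 'fetch'].includes(response.request().resourceType());
  if (!isXhr || !response.url().includes('/api/graphql/')) return;
  try {
    const text = await response.text();
    if (text.includes('Data : object')) {
      collectedPayloads.push(text); // keep the raw payloads, assemble later
    }
  } catch (e) {
    // ignore bodies that could not be read
  }
};

page.on('response', onResponse);

// ... run the scrolling logic here ...

page.off('response', onResponse); // page.removeListener on older Puppeteer versions
fs.writeFileSync('./tmp/response.json', JSON.stringify(collectedPayloads, null, 2));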
I am glad to have found puppeteer-cluster. This library makes crawling and automation tasks easy. Thanks to Thomas Dondorf.
According to the author of puppeteer-cluster, the page is closed immediately when a task finishes. That is fine in general, but what about cases where you need the page to stay open?
My use case, briefly:
There is some activity on the page where, in the background, a socket is involved in sending data to the front end. This data changes the DOM, and I need to capture that.
This is my code:
async function runCrawler() {
  const links = [
    "foo.com/barSome324",
    "foo.com/barSome22",
    "foo.com/barSome1",
    "foo.com/barSome765",
  ];

  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    workerCreationDelay: 5000,
    puppeteerOptions: { args: ['--no-sandbox', '--disable-setuid-sandbox'], headless: false },
    maxConcurrency: numCPUs, // numCPUs defined elsewhere, e.g. require('os').cpus().length
  });

  await cluster.task(async ({ page, data: url }) => {
    await crawler(page, url);
  });

  for (const link of links) {
    await cluster.queue(link);
  }

  await cluster.idle();
  await cluster.close();
}
And this is the crawler logic for each page:
module.exports.crawler = async (page, link) => {
  await page.goto(link, { waitUntil: 'networkidle2' });
  await page.waitForTimeout(10000);
  await page.waitForSelector('#dbp');

  try {
    // method to be executed;
    setInterval(async () => {
      const tables = await page.evaluate(async () => {
        /// data I need to catch in every 30 seconds
      });
    }, 30000);
  } catch (error) {
    console.log(error);
  }
};
I searched and found out that in JS we can capture DOM changes with a MutationObserver, and I tried that solution, but it did not work either. The page gets closed with this error:
UnhandledPromiseRejectionWarning: Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
So I have two options here:
1. MutationObserver
2. setInterval that evaluates the page every 30 seconds
But neither suits my needs, so any idea how to overcome this problem?
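One possible direction (a sketch, not a verified fix): puppeteer-cluster closes the page as soon as the task callback resolves, so keeping that promise pending for as long as you need the data keeps the page open. Here the polling happens inside the task instead of a detached setInterval; the #dbp selector comes from the code above, while the extraction logic, the 30-second interval and the number of runs are placeholders.

await cluster.task(async ({ page, data: url }) => {
  await page.goto(url, { waitUntil: 'networkidle2' });
  await page.waitForSelector('#dbp');

  const results = [];
  const totalRuns = 10; // e.g. observe the page for 10 * 30 s = 5 minutes

  for (let i = 0; i < totalRuns; i++) {
    // The task (and therefore the page) stays alive between samples because
    // we await here instead of returning early.
    const sample = await page.evaluate(() => {
      // ... collect whatever the socket pushed into the DOM ...
      return document.querySelector('#dbp').innerText; // placeholder
    });
    results.push(sample);
    await new Promise(resolve => setTimeout(resolve, 30000));
  }

  return results;
});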
I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't seem to be possible to use it through puppeteer.
Any idea? Thanks.
This searches for a target in all browser contexts.
via window.open() or popups
An example of finding the target of a page opened with window.open():
await page.evaluate(() => window.open('https://www.example.com/'));

const newWindowTarget = await browser.waitForTarget(
  target => target.url() === 'https://www.example.com/'
);
const newPage = await newWindowTarget.page();

await newPage.evaluate(() => {
  runTheScriptYouLike();
  console.log('Hello StackOverflow!');
});
via browser.pages() or tabs
This runs a script in the second tab:
const pageTab2 = (await browser.pages())[1];
const runScriptOnTab2 = await pageTab2.evaluate(() => {
  runTheScriptYouLike();
  console.log('Hello StackOverflow!');
});
via page.frames() or iframes
An example of evaluating a script inside an iframe element:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you
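On the original Page.addScriptToEvaluateOnNewDocument question: as far as I know Puppeteer does expose it as page.evaluateOnNewDocument(), which runs the given function in every frame of that page (navigations and iframes) before the frame's own scripts execute. For popups and new tabs you would still grab the new page first (e.g. with browser.waitForTarget as above) and register the script there as well. A minimal sketch; myInitScript is just a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Registered before navigation, so it runs in every new document of this
  // page (including its iframes) ahead of the page's own scripts.
  await page.evaluateOnNewDocument(() => {
    window.myInitScript = () => console.log('injected before page scripts');
    window.myInitScript();
  });

  await page.goto('https://www.example.com/');
  await browser.close();
})();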
I need to open a very simple website and scan it for a JSON object, i.e.
myJSONObject:["el1","el2"]. There is only one HTML <pre> tag on the site, containing hundreds of lines of text. Nothing else.
I was planning on scanning the page for myJSONObject: and then return ["el1", "el2"].
I used the following, which returns true, as it finds "myJSONObject:", but I have no way to return any text.
const found = await page.evaluate(() => window.find("myJSONObject:"));
Is there a way to use a regexp or something to find the needed text and return it? Is this at all possible?
I am new to puppeteer, so I am unsure of its capabilities. I appreciate any feedback.
You have already found the right function (page.evaluate) for the job. With it you can return strings, objects, numbers or booleans (in fact any serializable/stringifyable value) from the browser/page context to the Node context.
In case this isn't clear yet: the browser/page context and the Node context are separate. The only way to transfer data between them is by serializing it.
That said, to solve your problem you have to come up with a regex and return the matched string. Full working example:
Suppose the <pre> text is something like this: <pre>[...] myJSONObject:["el1","el2"] [...]</pre>
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // setup test page
  await page.evaluate(() => {
    const pre = document.createElement('pre');
    pre.innerText = '[...] myJSONObject:["el1","el2"] [...]';
    document.body.append(pre);
  });

  // important part (this is the answer to your question)
  const myJson = await page.evaluate(() => {
    const re = /myJSONObject:(\[.*?])/; // regex to match "json text"
    const pre = document.querySelector('pre').innerText;
    const matchedJsonText = pre.match(re)[1];
    const json = JSON.parse(matchedJsonText);
    return json;
  });

  // show results
  console.log('myJSONObject:', myJson);

  await browser.close();
})();
Please note that this regex only works with the JSON you provided as an example. You'll have to come up with a better regex to match the JSONs that you need.
I am having an issue with puppeteer.
I want to delete an item that was added to a form. For example, I have a form and added some fake data ("example"). I want to delete this "example", no matter where it is located. I only want to delete this "example".
So, puppeteer adds it and should delete it in the next step.
I have tried:
// fake data
const metadatatest = {
  text: 'example',
}

describe('Should be navigate through details', () => {
  it('can navigate through detail', async () => {
    // this adds fake data successfully
    await page.waitForSelector('[data-testid="appCard"]')
    await page.click('[data-testid="appCardDetails"]')
    await page.waitForSelector('[data-testid="overviewSectionMetadataForm"]')
    await page.click('[data-testid="overviewSectionMetadataEditButton"]')
    //await page.$eval('[data-testid="metadataInput"]', el => el.value = 'example')
    await page.type('[data-testid="metadataInput"]', metadatatest.text)
    await page.waitForSelector('[data-testid="metadataInput"]')
    await Promise.all([
      page.click('[data-testid="overviewSectionMetadataEditButton"]'),
    ]);
    // I want to delete this
  })
})
I have also tried using
await page.keyboard.press('Backspace')
await page.keyboard.press('Clear')
await page.keyboard.press('Delete')
but no luck.
Any help, please!
So what you're asking about is clearing text from an input field, am I reading that correctly? Puppeteer doesn't have a built-in method for that, but I have found a workaround which will do it for you.
First, you need to click 3 times on the input field you wish to clear. This acts as a select all action for all text entered in that element:
await page.click(selector, { clickCount: 3 });
Now you can use your previous attempt to clear the text:
await page.keyboard.press('Backspace');
Update 1:
Your final code for clearing and then entering the text you want into the input field should look something like this:
await page.click('[data-testid="metadataInput"]', { clickCount: 3 });
await page.keyboard.press('Backspace');
await page.type('[data-testid="metadataInput"]', metadatatest.text);
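If simulating keystrokes ever turns out to be flaky, an alternative (just a sketch) is to clear the value directly from page context. The selector is the one from your question; note that frameworks which track the field via input events may need the dispatched event below, and whether that is enough depends on the app under test.

await page.$eval('[data-testid="metadataInput"]', el => {
  el.value = '';
  el.dispatchEvent(new Event('input', { bubbles: true }));
});
await page.type('[data-testid="metadataInput"]', metadatatest.text);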