Scrape paragraphs in html with anchor links in Puppeteer


I am trying to scrape text from a website with Puppeteer. I have reached the point where I can read the p tags between two h2 tags, but these paragraphs also contain words with internal links. With the current code I get the plain paragraph texts in an array as output, but I actually need the text with the tags in it. Is this possible with Puppeteer?
My current code for paragraph scrape is:
const puppeteer = require('puppeteer');
const plaatsengids = async () => {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://www.plaatsengids.nl/urmond');
let paragraphs = await page.evaluate(() => {
const status = document.querySelector('h2[name="status"]');
const naam = document.querySelector('h2[name="naam"]');
return [...document.querySelectorAll('p')]
.filter(p => p.compareDocumentPosition(status) & Node.DOCUMENT_POSITION_PRECEDING &&
p.compareDocumentPosition(naam) & Node.DOCUMENT_POSITION_FOLLOWING)
.map(p => p.textContent.replace("- ",""));
});
console.log(paragraphs);
await browser.close();
return paragraphs;
};
module.exports = plaatsengids;
The page I am trying to scrape is:
https://www.plaatsengids.nl/urmond
For example, the words that contain an a-tag in the text are circled in red here:

If I understand correctly, you can try p.innerHTML.replace("- ","") instead of p.textContent.replace("- ","").
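One caveat: with a string pattern, replace("- ","") removes only the first "- " it finds, and that match may sit in the middle of the markup rather than at the start. A minimal sketch of a stricter cleanup, assuming the dash is a leading marker. cleanParagraphHtml is a hypothetical helper name, and the sample string is hand-written for illustration:

```javascript
// Hypothetical helper: strips a leading "- " marker from a paragraph's
// innerHTML while leaving the <a> tags (and any later "- ") untouched.
function cleanParagraphHtml(html) {
  // ^\s*-\s+ matches optional whitespace, one dash, then whitespace,
  // but only at the very start of the string.
  return html.replace(/^\s*-\s+/, '').trim();
}

// Hand-written sample input, not copied from the scraped page.
const sample = '- See the <a href="/stein">Stein</a> page - for details.';
console.log(cleanParagraphHtml(sample));
// prints: See the <a href="/stein">Stein</a> page - for details.
```

Because the helper lives in Node rather than in the page, one way to use it is to return p.innerHTML from page.evaluate and then call paragraphs.map(cleanParagraphHtml) on the result.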

Related

How to use Puppeteer functions repetitively, with an iframe

I have code that will login to a page, navigate to a list of messages, get the first message, and delete it. I need to be able to get the list of messages and delete each in turn. When I try to do that, I run into problems.
The site is rendered as plain html until the delete button is clicked. At this point, an iframe opens with the delete confirmation inside of it. If the confirmation is clicked, it returns me to the list of messages.
This is working until the iframe pops up. The existing code doesn't find the selector in the iframe. The code does work when it is not in a loop, though. So how can I interact with the iframe in the loop?
TimeoutError: Waiting for selector .navigation-footer button failed: Waiting failed: 30000ms exceeded
const messageList = await page.$$(".message-list tr");
for (const message of messageList) {
//get first message
await page.click(".message-list tr");
//wait for the message to load
await page.waitForSelector(".circle-cross");
//get time and message text
const msgTime = await page.$eval("time", el => el.getAttribute("dateTime"));
const paragraphs = await page.evaluate(() => {
let paraElements = document.querySelectorAll(".bubble p");
//array literal
const paraList = [...paraElements];
//gets the innerText of each element
return paraList.map((el, index) => el.innerText);
});
//get author name
await page.waitForSelector(".user p a")
let authorLink = await page.$(".user p a")
let authorName = await authorLink.evaluate(el => el.innerText.trim());
//append message to messages.txt
const stream = fs.createWriteStream("messages.txt", { flags: 'a' });
stream.write(authorName + "\n");
stream.write(msgTime + "\n");
paragraphs.forEach((item, index) => {
stream.write(item + "\n");
});
stream.end();
//delete the message
await page.click(".circle-cross");
//handle the iframe verification
const elementHandle = await page.waitForSelector("iframe.fancybox-iframe");
const frame = await elementHandle.contentFrame();
const button = await frame.waitForSelector(".navigation-footer button");
await frame.click(".navigation-footer button");
}

UPDATE: I did get this working, eventually, by substituting in:
await button.evaluate(el => el.click());
instead of
await frame.click(".navigation-footer button")

Simply put the code you have inside a forEach callback; you will have to add async on this line:
messageList.forEach(async (el) => {
The end result should then be:
//get a list of messages
const messageList = await page.$$(".message-list tr");
messageList.forEach(async (el) => {
//get the first message
await page.click(".message-list tr");
//wait for the message to load
await page.waitForSelector(".circle-cross");
//delete the message
await page.click(".circle-cross");
//handle the iframe verification
const elementHandle = await page.waitForSelector("iframe");
const frame = await elementHandle.contentFrame();
await frame.waitForSelector(".navigation-footer button");
await frame.click(".navigation-footer button");
});
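Something to keep in mind with this approach: Array.prototype.forEach does not wait for promises returned by an async callback, so all iterations start at once, whereas for...of inside an async function runs them one after another. A minimal sketch of the difference, using a hypothetical processMessage stand-in built on timers instead of a real page:

```javascript
// Stand-in for the per-message Puppeteer steps; resolves after a short delay.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processMessage(id, log) {
  log.push(`start ${id}`);
  await sleep(10);
  log.push(`end ${id}`);
}

// forEach ignores the promises returned by the callback, so every
// iteration starts before any of them has finished.
async function withForEach(ids) {
  const log = [];
  ids.forEach(async (id) => {
    await processMessage(id, log);
  });
  await sleep(50); // only here so the demo can observe the final log
  return log;
}

// for...of inside an async function awaits each iteration in turn.
async function withForOf(ids) {
  const log = [];
  for (const id of ids) {
    await processMessage(id, log);
  }
  return log;
}

withForEach([1, 2]).then((log) => console.log('forEach:', log.join(', ')));
withForOf([1, 2]).then((log) => console.log('for...of:', log.join(', ')));
```

With forEach the log reads start 1, start 2, end 1, end 2; with for...of it reads start 1, end 1, start 2, end 2. For steps that click through a single page, the sequential order is usually the one that is wanted.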

Puppeteer: how to access/intercept a FileSystemDirectoryHandle?

I'm wondering if it's possible within puppeteer to access a FileSystemDirectoryHandle (from the File System Access API). I would like to pass in a directory path via puppeteer as though the user had selected a directory via window.showDirectoryPicker(). On my client page I use the File System Access API to write a series of png files taken from a canvas element, like:
const directoryHandle = await window.showDirectoryPicker();
for (let frame = 0; frame < totalFrames; frame++){
const fileHandle = await directoryHandle.getFileHandle(`${frame}.png`, { create: true });
const writable = await fileHandle.createWritable();
updateCanvas(); // <--- update the contents of my canvas element
const blob = await new Promise((resolve) => canvas.toBlob(resolve, 'image/png'));
await writable.write(blob);
await writable.close();
}
On the puppeteer side, I want to mimic that behavior with something like:
const page = await browser.newPage();
await page.goto("localhost:3333/canvasRenderer.html");
// --- this part doesn't seem to exist ---
const [dirChooser] = await Promise.all([
page.waitForDirectoryChooser(),
page.click('#choose-directory'),
]);
await dirChooser.accept(['save/frames/here']);
//--------------------------------------
but waitForDirectoryChooser() doesn't exist.
I'd really appreciate any ideas or insights on how I might accomplish this!

How to do web scraping into a web that has the app-root element? [duplicate]

I am trying to scrape a website but I don't get some of the elements, because these elements are dynamically created.
I use cheerio in Node.js; my code is below.
var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
request(url, function (err, res, html) {
var $ = cheerio.load(html);
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
});
This code returns an empty response because, when the page is loaded, the <ul id="store_list" class="listMain"> is empty.
The content has not been appended yet.
How can I get these elements using node.js? How can I scrape pages with dynamic content?

Here you go;
var phantom = require('phantom');
phantom.create(function (ph) {
ph.createPage(function (page) {
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
page.open(url, function() {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
page.evaluate(function() {
$('.listMain > li').each(function () {
console.log($(this).find('a').attr('href'));
});
}, function(){
ph.exit()
});
});
});
});
});

Check out GoogleChrome/puppeteer
Headless Chrome Node API
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains)
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.npmjs.com/');
const textContent = await page.evaluate(() => {
return document.querySelector('.npm-expansions').textContent
});
console.log(textContent); /* No Problem Mate */
await browser.close();
})();
page.evaluate runs the function inside the loaded page, where the page's own scripts have executed, so dynamically created elements can be inspected.
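When the list is filled in by client-side scripts, it can also help to wait for the elements explicitly before reading them. A minimal sketch, assuming page is a Puppeteer Page and reusing the .listMain > li selector from the question; scrapeStoreLinks is a hypothetical name:

```javascript
// Collects the href of every link in the list once it has been rendered.
// `page` is expected to be a Puppeteer Page object.
async function scrapeStoreLinks(page, url) {
  await page.goto(url);
  // Wait until client-side scripts have appended at least one list item.
  await page.waitForSelector('.listMain > li');
  // $$eval runs the callback in the page against all matching elements.
  return page.$$eval('.listMain > li a', (anchors) =>
    anchors.map((a) => a.getAttribute('href'))
  );
}

// Usage (needs the puppeteer package and network access):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
// console.log(await scrapeStoreLinks(page, 'http://www.bdtong.co.kr/index.php?c_category=C02'));
// await browser.close();
```

Taking page as a parameter keeps the browser setup separate from the scraping logic, which also makes the function easy to exercise with a stub.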

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.
Examples in the pages above, but here's how to do dynamic scraping:
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');
var x = Xray()
.driver(phantom());
x('http://google.com', 'title')(function(err, str) {
if (err) return done(err);
assert.equal('Google', str);
done();
})

Since this is a canonical thread: an alternative to Puppeteer for scraping dynamic sites, also well supported as of 2023, is Playwright. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1
let browser;
(async () => {
browser = await playwright.chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
const text = await page.locator('h1:text("Example")').textContent();
console.log(text); // => Example Domain
})()
.catch(err => console.error(err))
.finally(() => browser?.close());

The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.
The only change needed is to raise the timeout in Browser.js, TimeoutSettings.js, and Launcher.js from 300000 to 3000000.

puppeteer splitwise login not working

I have tried multiple examples to get the Splitwise login working, without success.
I'm quite new to Puppeteer, but a login seemed like a simple use case for learning it.
const puppeteer = require('puppeteer')
const screenshot = 'login.png';
(async () => {
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
await page.goto("https://www.splitwise.com/login", {
waitUntil: 'networkidle2'
});
await page.type('#user_session_email', 'atest')
await page.type('#user_session_password', 'test')
await page.click('[name="commit"]')
await page.waitForNavigation()
browser.close()
console.log('See screenshot: ' + screenshot)
})()

Unfortunately, the page has two forms with identical ids (but different classes) and these forms have inputs with identical ids as well. You just need more specific selectors:
await page.type('form.form-stacked #user_session_email', 'atest')
await page.type('form.form-stacked #user_session_password', 'test')
await page.click('form.form-stacked [name="commit"]')
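Separately from the selectors, calling waitForNavigation only after the click has resolved can miss a navigation that finishes quickly; a common pattern is to start both together with Promise.all. A minimal sketch, assuming page is a Puppeteer Page; login is a hypothetical helper and the selectors are the ones above:

```javascript
// Fills in the scoped login form and submits it.
// `page` is expected to be a Puppeteer Page object.
async function login(page, email, password) {
  const form = 'form.form-stacked';
  await page.type(`${form} #user_session_email`, email);
  await page.type(`${form} #user_session_password`, password);
  // Start waiting for the navigation before the click is issued,
  // so a fast redirect cannot slip through between the two calls.
  await Promise.all([
    page.waitForNavigation(),
    page.click(`${form} [name="commit"]`),
  ]);
}

// Usage, inside an async function with an open page:
// await login(page, 'atest', 'test');
```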

This does not seem to be a puppeteer issue.
It seems that JavaScript code in the page is somehow actively blocking the triggered events.
Are you able to set these values using regular javascript in the console?

Puppeteer - How to fill form that is inside an iframe?

I have to fill out a form that is inside an iframe; here is the sample page. I cannot access it by simply using page.focus() and page.type(). I tried to get the form iframe by using const formFrame = page.mainFrame().childFrames()[0], which works, but I cannot really interact with the form iframe.

I figured it out myself. Here's the code.
console.log('waiting for iframe with form to be ready.');
await page.waitForSelector('iframe');
console.log('iframe is ready. Loading iframe content');
const elementHandle = await page.$(
'iframe[src="https://example.com"]',
);
const frame = await elementHandle.contentFrame();
console.log('filling form in iframe');
await frame.type('#Name', 'Bob', { delay: 100 });
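The lookup can be wrapped in a small helper so that a missing frame fails loudly instead of surfacing later as an error on null. A minimal sketch, assuming page is a Puppeteer Page; getIframe is a hypothetical name:

```javascript
// Waits for an <iframe> element and returns its Frame.
// `page` is expected to be a Puppeteer Page object.
async function getIframe(page, selector) {
  // waitForSelector resolves with an ElementHandle once the element exists.
  const handle = await page.waitForSelector(selector);
  // contentFrame() resolves to null when the handle is not a frame element.
  const frame = await handle.contentFrame();
  if (!frame) {
    throw new Error(`No frame found behind selector: ${selector}`);
  }
  return frame;
}

// Usage, inside an async function with an open page:
// const frame = await getIframe(page, 'iframe[src="https://example.com"]');
// await frame.type('#Name', 'Bob', { delay: 100 });
```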

Instead of figuring out how to get inside the iframe and type, I would simplify the problem by navigating to the iframe URL directly:
https://warranty.goodmanmfg.com/registration/NewRegistration/NewRegistration.aspx?Sender=Goodman
Have your script go directly to the above URL and automate it there; it should work.
Edit-1: Using frames
Since the simple approach didn't work for you, we can do it with the frames themselves.
Below is a simple script that should help you get started:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('http://www.goodmanmfg.com/product-registration', { timeout: 80000 });
var frames = await page.frames();
var myframe = frames.find(
f =>
f.url().indexOf("NewRegistration") > -1);
const serialNumber = await myframe.$("#MainContent_SerNumText");
await serialNumber.type("12345");
await page.screenshot({ path: 'example.png' });
await browser.close();
})();

If you can't select or find the iframe, read this:
I had an issue with finding Stripe elements.
The reason for that is the following:
You can't access an <iframe> with a different origin using JavaScript; it would be a huge security flaw if you could. Because of the same-origin policy, browsers block scripts trying to access a frame with a different origin. See more detailed answer here
Therefore, when I tried to use Puppeteer's methods Page.frames(), Page.mainFrame(), and ElementHandle.contentFrame(), they did not return any iframe to me. The problem is that this happened silently, and I couldn't figure out why nothing was found.
Adding these arguments to the launch options solved the issue:
'--disable-web-security',
'--disable-features=IsolateOrigins,site-per-process'
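For context, these flags go into the args array passed to puppeteer.launch. A minimal sketch; the launchOptions variable name is only illustrative, and since the flags switch off browser security protections they are best kept to automation against pages you trust:

```javascript
// Launch options carrying the two Chromium flags from the answer.
const launchOptions = {
  args: [
    '--disable-web-security',
    '--disable-features=IsolateOrigins,site-per-process',
  ],
};

// Usage (needs the puppeteer package):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(launchOptions);
console.log(launchOptions.args.join(' '));
```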

Although you have already figured it out, I think I have a better solution. Hope it helps.
async doFillForm() {
return await this.page.evaluate(() => {
let iframe = document.getElementById('frame_id_where_form_is_present');
let doc = iframe.contentDocument;
doc.querySelector('#username').value='Bob';
doc.querySelector('#password').value='pass123';
});
}
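Two things to watch with this approach: iframe.contentDocument is only available for same-origin frames, and assigning .value directly does not fire the input or change events that some forms listen for. A minimal sketch of a setter that also dispatches those events; setFieldValue is a hypothetical helper and would need to be defined inside the page.evaluate callback to be usable in the page:

```javascript
// Hypothetical helper: sets a field's value and notifies listeners,
// roughly mimicking what the page would see after user input.
function setFieldValue(field, value) {
  field.value = value;
  // Bubbling events so listeners attached higher up in the DOM also fire.
  field.dispatchEvent(new Event('input', { bubbles: true }));
  field.dispatchEvent(new Event('change', { bubbles: true }));
}

// Usage inside the page, e.g. within page.evaluate:
// setFieldValue(doc.querySelector('#username'), 'Bob');
```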
