How to add scraped data with puppetteer into html - html

ADVISORY : I'm trying my hands at this for the first time.
I created an html page that displays bus timings. To get the bus timings I had to scrape the local bus service website with puppeteer. I do scrape the time for the next bus correctly, but I can't seem to add it to my html page.
I've tried adding the script tags with src pointing to my js file. I tried adding them to the head, in the div that should display the time and right before the closing body tag, but I couldn't display the time. I event tried to add the js in a script tag to the html and that didn't work.
//Here's code for scraping in busTimeScraper.js :
let scrape = async() => {
const browser = await puppeteer.launch({
headless: true
});
const page = await browser.newPage();
await page.goto('bustimes.com'); //Dummy website for this eg
await page.setViewport({width: 1500, height: 1500})
await page.waitFor(5000);
const result = await page.evaluate(() => {
let time = document.querySelector('#RouteTimetable').innerText;
return {
time
}
});
browser.close();
return result;
};
scrape().then((value) => {
var timing = value.time;
document.querySelector('#Time').innerText=timing;
});
//The html is :
<div id="Time">
<!--<script type="text/javascript" src="busTimeScraper.js">
</script>-->
</div>
I can see the time being scraped when I run the js file and do a console.log on the timing variable. I expected the div to be populated with the same time value, but it just stays blank

You simply cannot add your server-side JS in your client-side html using a script tag and expect it to work, no matter where you add (in head, inside element or before closing body);
Simplest solution would be to expose the result (timing variable) VIA NodeJsAPI and consume the API VIA your client-side JS to get the value and do rest of the client-side things.

Related

Unable to locate an element with puppeteer

I'm trying to do a basic search on FB marketplace with puppeteer(and it was working for me before) but fails recently.
The whole thing fails when it gets to "location" link on marketplace page. to change the location i need to click on it, but puppeteer Errors out saying:
Error: Node is either not visible or not an HTMLElement
If i try to get the boundingBox of the element it returns null
const browser = await puppeteer.launch();
const page = await browser.newPage();
const resp = await page.goto('https://www.facebook.com/marketplace', { waitUntil: 'networkidle2' })
const withinLink = await page.waitForXPath('//span[contains(.,"Within")]', { timeout: 4000 })
console.log(await withinLink.boundingBox()) //returns null
await withinLink.click() //errors out
If i take a screenshot of the page right before i locate an element it is clearly there and i am able to locate in in chrome console using the same xPath manually.
It just doesn't seem to work in puppeteer
Something clearly changed on FB. Maybe they started to use some AI technology to detect scraping?
I don't think facebook changed in headless browser detection lately, but it seems you haven't taken into account that const withinLink = await page.waitForXPath('//span[contains(.,"Within")]', { timeout: 4000 }) returns an array, even if there is only one matching elment to contains(.,"Within").
That should work if you add [0] index to the elementHandles:
const withinLink = await page.waitForXPath('//span[contains(.,"Within")]')
console.log(await withinLink[0].boundingBox())
await withinLink[0].click()
Note: Timeout is not mandatory in waitForXPath, but I'd suggest to rather use domcontentloaded instead of networkidle2 in page.goto if you don't need all analytics/tracking events to achive the desired results, it just slows down your script execution.
Note 2: Honestly, I don't have such element on my fb platform, maybe it is market dependent. But it works with any other XPath selectors with specific content.

How do I parse a html page using nodejs to find a qr code?

I want to parse a web page, searching for QRcodes in the page. When I find them, I am going to read them using the QRcode npm module.
The hard part is, I don't know how to parse the html page in a way I can detect the only the image tags that contains a QRcode inside it.
I tried finding some kind of pattern in the images that contain a Qr code, but it usually starts with "?qr" but I think the ending is different everytimwe.
I'm using the module require-promise to get the raw html, and then I parse through it
const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
.then(function(html){
//success!
console.log(html);
})
.catch(function(err){
//handle error
});
I want to be able to download the image of the QRcode.
You need to pass the html returned into something like https://www.npmjs.com/package/node-html-parser
const rp = require('request-promise');
const parser = require('node-html-parser');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
.then(function(html){
const data = parser.parse(html);
console.log(JSON.stringify(data));
})
.catch(function(err){
//handle error
});
Then you can access things off the data object to find the QR code

How do you paste text using Puppeteer?

I am trying to write a test (using jest-puppeteer) for an input in my React application that handles autocomplete or copy/pasted strings in a unique way.
I was hoping by using Puppeteer, I could paste text into the input and then validate that the page is updated correctly. Unfortunately, I can't find any working example of how to do this.
I've tried using page.keyboard to simulate CMD+C & CMD+V but it does not appear that these sorts of commands work in Puppeteer.
I've also tried using a library such as clipboardy to write and read to the OS clipboard. While clipboardy does work for write (copy), it seems read (paste) does not affect the page run by Puppeteer.
I have successfully copied the text using a variety of methods but have no way to paste into the input. I've validated this assumption by adding event listeners for "copy" and "paste" to the document. The "copy" events fire, but no method has resulted in the "paste" event firing.
Here are a few approaches I have tried:
await clipboardy.write('1234'); // writes "1234" to clipboard
await page.focus("input");
await clipboardy.read(); // Supposedly pastes from clipboard
// assert input has updated
await clipboardy.write('1234');
await page.focus("input");
await page.keyboard.down('Meta');
await page.keyboard.press('KeyV');
await page.keyboard.up('Meta');
// assert input has updated
await page.evaluate(() => {
const input = document.createElement('input');
document.body.appendChild(input);
input.value = '1234';
input.focus();
input.select();
document.execCommand('copy');
document.body.removeChild(input);
});
wait page.focus("input");
await page.keyboard.down('Meta');
await page.keyboard.press('KeyV');
await page.keyboard.up('Meta');
I think the only missing piece here is pasting the text; but how do you paste text using Puppeteer?
This works for me with clipboardy, but not when I launch it in headless :
await clipboardy.write('foo')
const input= await puppeteerPage.$(inputSelector)
await input.focus()
await puppeteerPage.keyboard.down('Control')
await puppeteerPage.keyboard.press('V')
await puppeteerPage.keyboard.up('Control')
If you make it works in headless tell me.
I tried it the clipBoard API too but I couldn t make it compile:
const browser = await getBrowser()
const context = browser.defaultBrowserContext();
// set clipBoard API permissions
context.clearPermissionOverrides()
context.overridePermissions(config.APPLICATION_URL, ['clipboard-write'])
puppeteerPage = await browser.newPage()
await puppeteerPage.evaluate((textToCopy) =>{
navigator.clipboard.writeText(textToCopy)
}, 'bar')
const input= await puppeteerPage.$(inputSelector)
await input.focus()
await puppeteerPage.evaluate(() =>{
navigator.clipboard.readText()
})
I came up with a funny workaround how to paste a long text into React component in a way that the change would be registered by the component and it would not take insanely long time to type as it normally does with type command:
For text copying I use approach from Puppeteer docs (assume I want to select text from first 2 paragraphs on a page for example). I assume you already know how to set the permissions for clipboard reading and writing (for example one of the answers above shows how to do it).
const fromJSHandle = await page.evaluateHandle(() =>Array.from(document.querySelectorAll('p'))[0])
const toJSHandle = await page.evaluateHandle(() =>Array.from(document.querySelectorAll('p'))[1])
// from puppeteer docs
await page.evaluate((from, to) => {
const selection = from.getRootNode().getSelection();
const range = document.createRange();
range.setStartBefore(from);
range.setEndAfter(to);
selection.removeAllRanges();
selection.addRange(range);
}, fromJSHandle, toJSHandle);
await page.bringToFront();
await page.evaluate(() => {
document.execCommand('copy') // Copy the selected content to the clipboard
return navigator.clipboard.readText() // Obtain the content of the clipboard as a string
})
This approach does not work for pasting (on Mac at least): document.execCommand('paste')
So for pasting I use this:
await page.$eval('#myInput', (el, value) =>{ el.value = value }, myLongText)
await page.type(`#myInput`,' ') // this assumes your app trims the input value in the end so the whitespace doesn't bother you
Without the last typing step (the white space) React does not register change/input event. So after submitting the form (of which the input is part of for example) the input value would still be "".
This is where typing the whitespace comes in - it triggers the change event and we can submit the form.
It seems that one needs to develop quite a bit of ingenuity with Puppeteer to figure out how to work around all the limitations and maintaining some level of developer comfort at the same time.

google chrome headless puppeteer get DOM of page

Hello I want to get content of page.
I use page.content() from doc. Still that gets me not DOM after render and process by javascript but source code,
I want to be able grab iframe and generated by javascript content like it was from devtools chrome.
I also try:
let bodyHTML = await page.evaluate(() => document.body.innerHTML);
That also gives me source code.
Is that possible ?
I am not sure about the iframe, but using this code u can get the inner Text. This worked for me.
const body = await page.evaluate(() => {
return {
'body': document.body.innerText
};
});
console.log('body:', body);

Puppeteer Element Handle loses context when navigating

What I'm trying to do:
I'm trying to get a screenshot of every element example in my storybooks project. The way I'm trying to do this is by clicking on the element and then taking the screenshot, clicking on the next one, screenshot etc.
Here is the attached code:
test('no visual regression for button', async () => {
const selector = 'a[href*="?selectedKind=Buttons&selectedStory="]';
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://localhost:8080');
let examples = await page.$$(selector);
await examples.map( async(example) => {
await example.click();
const screen = await page.screenshot();
expect(screen).toMatchImageSnapshot();
});
await browser.close();
});
But when I run this code I get the following error:
Protocol error (Runtime.callFunctionOn): Target closed.
at Session._onClosed (../../node_modules/puppeteer/lib/Connection.js:209:23)
at Connection._onClose (../../node_modules/puppeteer/lib/Connection.js:116:15)
at Connection.dispose (../../node_modules/puppeteer/lib/Connection.js:121:10)
at Browser.close (../../node_modules/puppeteer/lib/Browser.js:60:22)
at Object.<anonymous>.test (__tests__/visual.spec.js:21:17)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:169:7)
I believe it is because the element loses its context or something similar and I don't know what methods to use to get around this. Could you provide a deeper explanation or a possible solution? I don't find the API docs helpful at all.
ElementHandle.dispose() is called once page navigation occurs as garbage collection as stated here in the docs. So when you call element.click() it navigates and the rest of the elements no longer point to anything.