How do I set a unique selector in Puppeteer when several elements match the same query selector? - puppeteer

My HTML has two button tags with the same id, "hoge".
If you copy the selector from Chrome DevTools, it is "#hoge" for both.
<html>
<body>
<button id="hoge">Hoge</button>
<div class="shadow">
#shadow-root (open)
<button id="hoge">Hoge</button>
</div>
</body>
</html>
I want to get the button element inside the shadow DOM with Puppeteer.
But my JavaScript code gets the first button instead.
const element = page.waitForSelector("pierce/#hoge");
This is not what I want.
I'm guessing it's because I didn't specify a unique selector, but I don't know what a unique selector looks like in Puppeteer.
If you know how to solve this problem, please let me know.

Long story short
I work with Puppeteer a lot and wanted this knowledge in my bag. One way to select a shadow element is to access the shadowRoot property of the DOM node that hosts the shadow root. The answer is based on this article.
Accessing Shadow Root property
For your html example this does the trick:
const button = document.querySelector('.shadow').shadowRoot.querySelector('#hoge')
Waiting
Waiting is a little more complicated, but it can be achieved using page.waitForFunction().
Working Sandbox
I wrote this full working sandbox example on how to wait for a certain shadowRoot element.
index.html (located in same directory as app.js)
<html>
<head>
<script>
// attach shadowRoot after 6 seconds for emulating waiting..
setTimeout(() => {
const btn = document.getElementById('hoge')
const container = document.getElementsByClassName('shadow')[0]
const shadowRoot = container.attachShadow({
mode: 'open'
})
shadowRoot.innerHTML = `<button id="hoge" onClick="doStuff()">hoge2</button>`
console.log('attached!.')
}, 6000)
function doStuff() {
alert('shadow button clicked!')
}
</script>
</head>
<body>
<button id="hoge">Hoge</button>
<div class="shadow">
</div>
</body>
</html>
app.js (located in same directory as index.html)
var express = require('express')
var { join } = require('path')
var puppeteer = require('puppeteer')
//utility..
const wait = (seconds) => {
console.log('waiting', seconds, 'seconds')
return new Promise((res, rej) => {
setTimeout(res, seconds * 1000)
})
}
const runPuppeteer = async() => {
const browser = await puppeteer.launch({
defaultViewport: null,
headless: false
})
const page = await browser.newPage()
await page.goto('http://127.0.0.1:5000')
await wait(3)
console.log('page opened..')
// only execute this function within a page context!.
// for example in page.evaluate() OR page.waitForFunction etc.
// don't forget to pass the selector args to the page context function!
const selectShadowElement = (containerSelector, elementSelector) => {
try {
// get the container
const container = document.querySelector(containerSelector)
// Here's the important part: go through the node that hosts the shadow root
// and search within its shadowRoot, which behaves like a separate DOM.
return container.shadowRoot.querySelector(elementSelector)
} catch (err) {
return null
}
}
console.log('waiting for shadow element now.')
const containerSelector = '.shadow'
const elementSelector = '#hoge'
const result = await page.waitForFunction(selectShadowElement, { timeout: 15 * 1000 }, containerSelector, elementSelector)
if (!result) {
console.error('Shadow element not found..')
return
}
// since waiting succeeded we can get the element now.
const element = await page.evaluateHandle(selectShadowElement, containerSelector, elementSelector)
try {
// click the element.
await element.click()
console.log('clicked')
} catch (err) {
console.log('failed to click..')
}
await wait(10)
}
var app = express()
app.get('/', (req, res) => {
res.sendFile(join(__dirname, 'index.html'))
})
app.listen(5000, '127.0.0.1', () => {
console.log('listening!')
runPuppeteer()
})
Start example
$ npm i express puppeteer
$ node app.js
Make sure to use headless:false option to see what's happening.
The application does this:
start a small express server only serving index.html on /
open puppeteer after server has started and wait for the shadow root element to appear.
Once it appeared, it gets clicked and an alert() is shown. => success!
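A possible simplification of the sandbox above, as a sketch only: page.waitForFunction() resolves with a JSHandle for the truthy value the page function returned, so that handle can be clicked directly instead of calling page.evaluateHandle() a second time. It also rejects on timeout rather than resolving with a falsy value, so the failure case is handled with try/catch. The helper name clickShadowElement and its parameters are hypothetical, not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): waits for an element inside an
// open shadow root and clicks it. `page` is assumed to be a Puppeteer Page.
async function clickShadowElement(page, containerSelector, elementSelector, timeoutMs = 15000) {
  let handle
  try {
    // The arrow function runs in the page context, so `document` is the page's document.
    handle = await page.waitForFunction(
      (hostSel, innerSel) => {
        const host = document.querySelector(hostSel)
        // Stays falsy until the host, its shadow root and the inner element all exist.
        return host && host.shadowRoot && host.shadowRoot.querySelector(innerSel)
      },
      { timeout: timeoutMs },
      containerSelector,
      elementSelector
    )
  } catch (err) {
    // waitForFunction rejects (for example on timeout) instead of resolving with null.
    return false
  }
  const element = handle.asElement() // null if the handle is not a DOM element
  if (!element) return false
  await element.click()
  return true
}
```

In the sandbox this could be called as await clickShadowElement(page, '.shadow', '#hoge').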
Browser Support
Tested with chrome.
Cheers ' ^^

Related

How to create childElement from server response in a react app

I created a React app with many nested routes.
One of my nested routes uses a backend API that returns complete HTML content.
I need to display that exact content, with the same HTML and styling, in my UI.
I was able to achieve this by manipulating the DOM according to the axios response, using createElement and appendChild inside a useEffect hook.
But the whole philosophy behind React is to NOT modify the DOM directly and to let React do the work by simply updating state or props.
My question is:
Is there a cleaner way to use api returned HTML in a react app?
Here is sample relevant code:
Item.js
...
...
useEffect(() => {
const fetchAndUpdateItemContent = async () => {
try {
const response = await axios.get(url);
var contentDiv = document.createElement("div");
contentDiv.innerHTML = response.data.markup;
document.getElementById('itemContent')
.appendChild(contentDiv);
} catch (err) {
.....
console.error(err);
......
}
};
fetchAndUpdateItemContent();
}, [itemId])
return (
<div id='itemContent'/>
);
}
What did NOT work
Ideally, I should be able to keep itemContent as state in Item.js and update it from the server response, as shown below. But when I do that, the raw HTML markup is displayed as text instead of the rendered content.
const [itemContent, setItemContent] = useState('Loading ...');
...
useEffect(() => {
const fetchAndUpdateItemContent = async () => {
try {
const response = await axios.get(url);
setItemContent(response.data.markup)
} catch (err) {
.....
console.error(err);
......
}
};
fetchAndUpdateItemContent();
}, [itemId])
return (
<div id='itemContent'>
{itemContent}
</div>
);
You're actually trying to render an HTML string as JSX. React escapes strings placed inside JSX, which is why you see the raw markup. You can pass the string to the React prop called dangerouslySetInnerHTML instead.
Eg:
const Item = () => {
const yourHtmlStringResponse = '<h1>Heading 1</h1><h2>Heading 2</h2>'
return <div dangerouslySetInnerHTML={{__html: yourHtmlStringResponse}}></div>
}
You can try it here dangerouslySetInnerHTML-Codesandbox
I believe you can use dangerouslySetInnerHTML

how to execute a script in every window that gets loaded in puppeteer?

I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't look to be possible to use through puppeteer.
Any idea? Thanks.
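You can use browser.waitForTarget(), which searches for a target in all browser contexts, and then evaluate your script in the page that belongs to that target.
via window.open() or popups
An example of finding the target for a page opened via window.open() and running a script in it:
await page.evaluate(() => window.open('https://www.example.com/'))
// The predicate must return true for the target you are waiting for.
const newWindowTarget = await browser.waitForTarget(target => target.url() === 'https://www.example.com/')
const newWindowPage = await newWindowTarget.page()
await newWindowPage.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})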
via browser.pages() or tabs
This evaluates a script in the second tab:
const pageTab2 = (await browser.pages())[1]
const runScriptOnTab2 = await pageTab2.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})
via page.frames() or iframes
An example of evaluating a script inside an iframe:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you
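For what it's worth, Puppeteer also exposes page.evaluateOnNewDocument(), which to my knowledge is its wrapper around the Page.addScriptToEvaluateOnNewDocument command mentioned in the question. A minimal sketch, assuming a Puppeteer Browser instance, registers a function on the pages that are already open and on every page created later through the browser's targetcreated event. The helper name installOnEveryPage is hypothetical. One caveat: a popup may start loading before the listener has registered the script, so its very first document could be missed.

```javascript
// Hypothetical helper: registers `pageFunction` so that it runs in every new
// document of every page. `browser` is assumed to be a Puppeteer Browser.
async function installOnEveryPage(browser, pageFunction) {
  const install = async (page) => {
    // target.page() can resolve to null for targets that are not pages.
    if (page) await page.evaluateOnNewDocument(pageFunction)
  }
  // Pages (tabs, popups) created from now on.
  browser.on('targetcreated', async (target) => {
    if (target.type() !== 'page') return
    try {
      await install(await target.page())
    } catch (err) {
      console.error('could not install script:', err.message)
    }
  })
  // Pages that are already open; the script applies from their next navigation.
  const pages = await browser.pages()
  await Promise.all(pages.map(install))
  return pages.length
}
```

It could be called once, right after puppeteer.launch(), for example as await installOnEveryPage(browser, () => { window.__injected = true }), where __injected is only an illustrative name.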

Is it possible to populate the input bar in webchat with an onclick method

I'm attempting to display a list of popular questions to the user, when they click them I want them to populate the input bar and/or send the message to the bot via the directline connection.
I've attempted using the ReactDOM.getRootNode() and tracking down the input node and setting the .value attribute, but this does not populate the field. I assume there is some sort of form validation that prevents this.
Also, if I console log the input node then save it as a global variable in the console screen I can change the value that way, but then the message will not actually be able to be sent, hitting enter or the send arrow does nothing. While it may seem that the suggestedActions option would work well for this particular application, I CANNOT use it for this use case.
const [chosenOption, setChosenOption] = useState(null);
const getRootNode = (componentRoot) =>{
let root = ReactDom.findDOMNode(componentRoot)
let inputBar = root.lastChild.lastChild.firstChild.firstChild
console.log('Initial Console log ',inputBar)
setInputBar(inputBar)
}
//in render method
{(inputBar && chosenOption) && (inputBar.value = chosenOption)}
This is the function I tried to use to find the node. The chosen option works as intended, but I cannot change the value in a usable way.
I would like the user to click on a <p> element, which changes the chosenOption value, and for that choice to populate the input bar and/or send that message to the bot over the Direct Line connection.
You can use Web Chat's store to dispatch events to set the send box (WEB_CHAT/SET_SEND_BOX) or send a message (WEB_CHAT/SEND_MESSAGE) when an item gets clicked. Take a look at the code snippet below.
Simple HTML
<body>
<div class="container">
<div class="details">
<p>Hello World!</p>
<p>My name is TJ</p>
<p>I am from Denver</p>
</div>
<div class="wrapper">
<div id="webchat" class="webchat" role="main"></div>
</div>
</div>
<script src="https://cdn.botframework.com/botframework-webchat/latest/webchat.js"></script>
<script>
// Initialize Web Chat store
const store = window.WebChat.createStore();
// Get all paragraph elements and add on click listener
const paragraphs = document.getElementsByTagName("p");
for (const paragraph of paragraphs) {
paragraph.addEventListener('click', ({ target: { textContent: text }}) => {
// Dispatch set send box event
store.dispatch({
type: 'WEB_CHAT/SET_SEND_BOX',
payload: {
text
}
});
});
}
(async function () {
const res = await fetch('/directline/token', { method: 'POST' });
const { token } = await res.json();
window.WebChat.renderWebChat({
directLine: window.WebChat.createDirectLine({ token }),
store,
}, document.getElementById('webchat'));
document.querySelector('#webchat > *').focus();
})().catch(err => console.error(err));
</script>
</body>
React Version
import React, { useState, useEffect } from 'react';
import ReactWebChat, { createDirectLine, createStore } from 'botframework-webchat';
const WebChat = props => {
const [directLine, setDirectLine] = useState();
useEffect(() => {
const initializeDirectLine = async () => {
const res = await fetch('http://localhost:3978/directline/token', { method: 'POST' });
const { token } = await res.json();
setDirectLine(createDirectLine({ token }));
};
initializeDirectLine();
}, []);
return directLine
? <ReactWebChat directLine={directLine} {...props} />
: "Connecting..."
}
export default () => {
const [store] = useState(createStore());
const items = ["Hello World!", "My name is TJ.", "I am from Denver."]
const click = ({target: { textContent: text }}) => {
store.dispatch({
type: 'WEB_CHAT/SET_SEND_BOX',
payload: {
text
}
});
}
return (
<div>
<div>
{ items.map((item, index) => <p key={index} onClick={click}>{ item }</p>) }
</div>
<WebChat store={store} />
</div>
)
};
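Both snippets above only fill the send box. To send the clicked text to the bot right away, the same click handler can dispatch WEB_CHAT/SEND_MESSAGE instead. The following is a minimal sketch, assuming both actions accept a payload of the form { text }; the helper names setSendBox, sendMessage and handleSuggestionClick are hypothetical and not part of Web Chat.

```javascript
// Hypothetical helpers that build the two Web Chat actions named above.
const setSendBox = (text) => ({ type: 'WEB_CHAT/SET_SEND_BOX', payload: { text } })
const sendMessage = (text) => ({ type: 'WEB_CHAT/SEND_MESSAGE', payload: { text } })

// `store` is assumed to be the object returned by createStore();
// only its dispatch() method is used here.
function handleSuggestionClick(store, text, { sendImmediately = false } = {}) {
  // Either send the text straight away or just place it in the send box.
  const action = sendImmediately ? sendMessage(text) : setSendBox(text)
  store.dispatch(action)
  return action
}
```

In the React version, the click handler could then become handleSuggestionClick(store, text, { sendImmediately: true }).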
For more details, take a look at the Programmatic Post as Activity Web Chat sample.
Hope this helps!

get post title after Infinite scroll finished

I managed to show all the posts on a site that has a load_more button for going to the next page, but something is missing.
I get this error:
e Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint (/Users/minghann/Documents/productnation_scraper/node_modules/puppeteer/lib/ExecutionContext.js:331:13)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:188:7)
This doesn't happen if I don't load all the posts. It's hard to debug because I don't know which post is missing what. Full code below:
const browser = await puppeteer.launch({
devtools: true
});
const page = await browser.newPage();
await page.goto("https://example.net");
await page.waitForSelector(".load_more_btn");
const load_more_exist = !!(await page.$(".load_more_btn"));
while (load_more_exist > 0) {
await page.click(".load_more_btn");
}
const posts = await page.$$(".post");
let result = [];
for (const post of posts) {
result = [
...result,
{
title: await post.$eval(".post_title a", e => e.innerText)
}
];
}
console.log(result);
browser.close();
There are multiple ways to do this, and the best is to combine the following two.
Look for Ajax
Wait for the request instead. Whenever you click Load More, the page makes a simple Ajax request to ?ajax-request=jnews. We can use .waitForRequest or .waitForResponse for this use case. The example below checks the status code, which only a response has, so it uses .waitForResponse. Here is a working example:
await Promise.all([
page.waitForResponse(response => response.url().includes('?ajax-request=jnews') && response.status() === 200),
page.click(".load_more_btn")
])
Clean DOM and wait for new Element
Refer to these answers here and here.
Basically, you can remove the DOM elements you have already collected, so the next time you collect data there won't be any duplicates.
So once you remove all current elements, such as those matched by document.querySelectorAll('.jeg_post'), you can simply call page.waitForSelector('.jeg_post') again later if you need to.
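To tie this back to the loop in the question: load_more_exist is computed once and never updated, so the while loop keeps clicking even after the button is hidden, which is the likely source of the "not visible" error. A minimal sketch of a loop that re-checks the button on every pass and waits for the Ajax response after each click is below. The helper name clickLoadMoreUntilGone, the isLoadMoreResponse predicate and the maxClicks cap are hypothetical.

```javascript
// Hypothetical helper: clicks a "load more" button until it is removed or hidden.
// `page` is assumed to be a Puppeteer Page.
async function clickLoadMoreUntilGone(page, buttonSelector, isLoadMoreResponse, maxClicks = 50) {
  let clicks = 0
  while (clicks < maxClicks) {
    // Re-query on every pass; the original code checked only once, before the loop.
    const button = await page.$(buttonSelector)
    // boundingBox() resolves to null when the element is not visible.
    const box = button ? await button.boundingBox() : null
    if (!box) break
    await Promise.all([
      page.waitForResponse(isLoadMoreResponse), // start waiting before the click
      button.click()
    ])
    clicks += 1
  }
  return clicks
}
```

For the site in this question it might be called as await clickLoadMoreUntilGone(page, '.load_more_btn', res => res.url().includes('?ajax-request=jnews') && res.status() === 200) before collecting the posts.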

How to use cheerio to get the URL of an image on a given page for ALL cases

Right now I have a function that looks like this:
static getPageImg(url) {
return new Promise((resolve, reject) => {
//get our html
axios.get(url)
.then(resp => {
//html
const html = resp.data;
//load into a $
const $ = cheerio.load(html);
//find ourself a img
const src = url + "/" + $("body").find("img")[0].attribs.src;
//make sure there are no extra slashes
resolve(src.replace(/([^:]\/)\/+/g, "$1"));
})
.catch(err => {
reject(err);
});
});
}
This handles the average case, where the page uses a relative path to link to an image and the host name is the same as in the URL provided.
However,
most of the time the URL scheme will be more complex. For example, the URL might be stackoverflow.com/something/asdasd, and what I need is the stackoverflow.com/someimage link. A more interesting case is when a CDN is used and the images come from a separate server. For example, if I want to link to something from imgur, I'll give a link like http://imgur.com/gallery/epqDj, but the actual image is at http://i.imgur.com/pK0thAm.jpg, on a subdomain of the website. Even more interesting, if I read the src attribute I get "//i.imgur.com/pK0thAm.jpg".
Now, I imagine there must be a simple way to get this image, since the browser can very quickly and easily do an "open window in new tab", so I am wondering whether anyone knows an easy way to do this other than writing a big function that handles all these cases.
Thank you!
This is the function that ended up working for all my test cases, using Node's built-in url module. I just had to use its resolve function.
static getPageImg(url) {
return new Promise((resolve, reject) => {
//get our html
axios.get(url)
.then(resp => {
//html
const html = resp.data;
//load into a $
const $ = cheerio.load(html);
//find ourself a img
const retURL = nodeURL.resolve(url,$("body").find("img")[0].attribs.src);
resolve(retURL);
})
.catch(err => {
reject(err);
});
});
}
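As a side note, url.resolve() belongs to what the Node documentation calls the legacy URL API. The same resolution can be sketched with the WHATWG URL class, which is a global in current Node versions. The helper name resolveImageUrl and the /someimage.png path are hypothetical, and the page URL is assumed to include its scheme (http:// or https://), since the URL constructor throws on a base without one.

```javascript
// Hypothetical helper: resolves an <img> src against the URL of the page it came from.
function resolveImageUrl(pageUrl, src) {
  // The second argument is the base URL; an absolute `src` simply ignores it.
  return new URL(src, pageUrl).href
}

// Protocol-relative src, as in the imgur case from the question.
console.log(resolveImageUrl('http://imgur.com/gallery/epqDj', '//i.imgur.com/pK0thAm.jpg'))
// → http://i.imgur.com/pK0thAm.jpg

// Root-relative src on a page with a deeper path.
console.log(resolveImageUrl('https://stackoverflow.com/something/asdasd', '/someimage.png'))
// → https://stackoverflow.com/someimage.png
```

Inside getPageImg this would replace the nodeURL.resolve(...) call, with the same two arguments in swapped order.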