Get an attribute of a page element in Puppeteer/Apify - puppeteer

I can fetch the textContent of an HTML element in Puppeteer:
var website_element = await page.$('a[itemprop="url"]');
var website = await (await website_element.getProperty('textContent')).jsonValue();
Yet sometimes the textContent is not enough. See the following HTML:
<a itemprop="url" href="https://www.4-b.ch/de/4b-fenster-fassaden/home/">
https://www.4-b.ch/de/4b-fenster-fassad...</a>
The result is truncated: "https://www.4-b.ch/de/4b-fenster-fassad..." with ... at the end.
So I would rather get the href attribute.
But when I try:
var website_element = await page.$('a[itemprop="url"]');
var website = await (await website_element.getAttribute('href')).jsonValue();
The result is: TypeError: website_element.getAttribute is not a function
Any suggestions?

There's an easy and fast way to do this using the page.$eval function:
var website = await page.$eval('a[itemprop="url"]', el => el.href);
page.$eval first finds an element in the DOM using the provided selector (first argument) and then invokes the callback (second argument) in the page context, with the found element as its only argument. The return value of the callback becomes the return value of page.$eval() itself.
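To make the difference between the href property and the href attribute concrete, here is a minimal sketch. The helper name readLink is hypothetical, and page is assumed to be a Puppeteer Page object that the caller has already opened.

```javascript
// Hypothetical helper: reads both forms of a link's target in one round trip.
// `page` is assumed to be a Puppeteer Page created elsewhere.
async function readLink(page, selector) {
  // The callback runs inside the browser, so `el` is a real DOM element there.
  return page.$eval(selector, (el) => ({
    resolved: el.href,               // DOM property: absolute URL resolved by the browser
    raw: el.getAttribute('href'),    // attribute: the string exactly as written in the HTML
  }));
}

// Possible usage inside an async function:
// const { resolved, raw } = await readLink(page, 'a[itemprop="url"]');
```

The callback returns a plain object of strings because only serializable values can be passed back from the page to Node.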

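This works. ElementHandle has no getAttribute method, but href is also a DOM property, so getProperty can read it: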
var website_element = await page.$('a[itemprop="url"]');
var website = await (await website_element.getProperty('href')).jsonValue();

Related

How to split a URL and use the second element as a new URL

I am trying to split a URL at '?' and use the second element in my HTML.
Example:
https://url/page?google.com
The output I want is: google.com
I then want to redirect the page to that output. I'm using Webflow, so a full script would be amazing.
I tried:
window.location.replace(id="new_url");
let url = window.location;
const array = url.split("?");
document.getElementById("new_url").innerHTML = array[1];
but it doesn't work :(
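window.location.replace(id="new_url"); does not do what you intend: it assigns the string "new_url" to id and then navigates to that relative URL.
window.location.replace(new_url);, where new_url holds a valid URL, would change the page immediately, so the rest of the script would have no effect. Also, window.location is a Location object, not a string, so url.split fails; use window.location.href instead.
I assume you can use the URL API.
Note:
your parameter is non-standard (normally it would be parameter=value)
you need to add a protocol (https://) to navigate to the URL
Here is a more complicated version that uses a standard tool: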
const urlString = "https://url/page?google.com"
const url = new URL(urlString)
console.log(url.toString())
const firstSearchKey = [...url.searchParams.entries()][0][0]; // normally parameter=value
console.log(firstSearchKey)
location.replace(`https://${firstSearchKey}`)
Here is a simpler version
const urlString = "https://url/page?google.com"
const [origin,passedUrl] = urlString.split("?");
location.replace(`https://${passedUrl}`)
Try this
const url = window.location.search.split("?")[1]
window.location.href = `https://${url}` // without a protocol the value is treated as a relative path
let url = "https://url/page?google.com"
const regex = /\?(.*)/;
let res = regex.exec(url)
console.log(res[1])
Is this what you want?
const inputUrl = window.location.href // ex. https://url/page?google.com
const splitUrl = inputUrl.split("?") // = ["https://url/page", "google.com"]
const targetUrl = splitUrl[1] // = "google.com"
window.location.href = `https://${targetUrl}` // navigates to https://google.com (a bare "google.com" would be a relative path)
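The points made in the answers above (take what follows the first '?', and make sure the result has a protocol) can be combined in one small function. This is only a sketch: the name targetFromQuery is hypothetical, and defaulting to https:// is an assumption.

```javascript
// Hypothetical helper: returns the absolute URL encoded after the first "?",
// or null when there is nothing after it.
function targetFromQuery(href) {
  const queryStart = href.indexOf('?');
  if (queryStart === -1) return null;

  const raw = href.slice(queryStart + 1);
  if (raw === '') return null;

  // Assumption: default to https:// when the passed value has no protocol.
  const withProtocol = /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;

  // new URL() throws on malformed input and normalizes the result.
  return new URL(withProtocol).href;
}

// Possible usage in the browser:
// const target = targetFromQuery(window.location.href);
// if (target) window.location.replace(target);
```

Returning null instead of redirecting keeps the page usable when it is opened without a query string.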

Puppeteer: How to write XPath for button in puppeteer

var eleLoginButton = await page.waitForXPath("//input[@class='._ant-btn._ant-btn-primary']")
await eleLoginButton.click()
My code does not execute when I write it like this.
You do not need dots for class names in XPath; the class attribute is compared as a plain string. Try this:
var eleLoginButton = await page.waitForXPath("//input[@class='_ant-btn _ant-btn-primary']");
await eleLoginButton.click();
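An exact @class='...' comparison only matches when the attribute string is identical, including class order and spacing. A common XPath idiom that matches individual class tokens instead is sketched below; the helper name xpathForClasses is hypothetical.

```javascript
// Hypothetical helper: builds an XPath that matches elements carrying every
// listed class, regardless of order or of extra classes on the element.
function xpathForClasses(tag, classNames) {
  const conditions = classNames.map(
    // Pad the normalized class attribute with spaces so that "btn" does not
    // accidentally match "btn-primary".
    (name) => `contains(concat(' ', normalize-space(@class), ' '), ' ${name} ')`
  );
  return `//${tag}[${conditions.join(' and ')}]`;
}

// Possible usage with the code from the answer above:
// const xpath = xpathForClasses('input', ['_ant-btn', '_ant-btn-primary']);
// const eleLoginButton = await page.waitForXPath(xpath);
```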

Puppeteer - How to fill form that is inside an iframe?

I have to fill out a form that is inside an iframe (here is the sample page). I cannot access it by simply using page.focus() and page.type(). I tried to get the form iframe with const formFrame = page.mainFrame().childFrames()[0], which works, but I cannot really interact with the form inside the iframe.
I figured it out myself. Here's the code.
console.log('waiting for iframe with form to be ready.');
await page.waitForSelector('iframe');
console.log('iframe is ready. Loading iframe content');
const elementHandle = await page.$(
'iframe[src="https://example.com"]',
);
const frame = await elementHandle.contentFrame();
console.log('filling form in iframe');
await frame.type('#Name', 'Bob', { delay: 100 });
Instead of figuring out how to get inside the iframe and type, I would simplify the problem by navigating to the iframe URL directly:
https://warranty.goodmanmfg.com/registration/NewRegistration/NewRegistration.aspx?Sender=Goodman
Make your script go directly to the above URL and try automating it; it should work.
Edit-1: Using frames
Since the simple approach didn't work for you, we can work with the frames themselves.
Below is a simple script which should help you get started:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('http://www.goodmanmfg.com/product-registration', { timeout: 80000 });

  // page.frames() returns every frame attached to the page, including nested iframes
  const frames = page.frames();
  // pick the frame whose URL contains "NewRegistration"
  const myframe = frames.find(f => f.url().indexOf("NewRegistration") > -1);

  const serialNumber = await myframe.$("#MainContent_SerNumText");
  await serialNumber.type("12345");

  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
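When no frame matches, find returns undefined and the next call fails with an unhelpful TypeError. The lookup used in the script above could be wrapped as follows; this is a sketch, the helper name findFrameByUrl is hypothetical, and page is assumed to be a Puppeteer Page.

```javascript
// Hypothetical helper: returns the first frame whose URL contains urlFragment,
// or throws an error listing the frame URLs that were actually present.
function findFrameByUrl(page, urlFragment) {
  const frames = page.frames();
  const frame = frames.find((f) => f.url().includes(urlFragment));
  if (!frame) {
    const seen = frames.map((f) => f.url()).join(', ');
    throw new Error(`No frame URL contains "${urlFragment}". Frames seen: ${seen}`);
  }
  return frame;
}

// Possible usage:
// const myframe = findFrameByUrl(page, 'NewRegistration');
// await myframe.type('#MainContent_SerNumText', '12345');
```

Listing the frame URLs in the error message helps distinguish a frame that has not loaded yet from one whose URL differs from what was expected.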
If you can't select or find the iframe, read this:
I had an issue with finding Stripe elements.
The reason is the following:
You can't access an <iframe> with a different origin using JavaScript; it would be a huge security flaw if you could. Because of the same-origin policy, browsers block scripts that try to access a frame with a different origin.
Therefore, when I tried Puppeteer's methods Page.frames(), Page.mainFrame(), and ElementHandle.contentFrame(), they did not return any iframe to me. The problem is that this happened silently, and I couldn't figure out why nothing was found.
Adding these arguments to the launch options solved the issue:
'--disable-web-security',
'--disable-features=IsolateOrigins,site-per-process'
Although you have already figured it out, I think I have a better solution. Hope it helps.
// Method of a class that keeps the Puppeteer page in this.page
async doFillForm() {
  return await this.page.evaluate(() => {
    // Runs in the browser; contentDocument is only available for same-origin iframes
    let iframe = document.getElementById('frame_id_where_form_is_present');
    let doc = iframe.contentDocument;
    doc.querySelector('#username').value = 'Bob';
    doc.querySelector('#password').value = 'pass123';
  });
}

How to get Chrome and Safari to accept query strings on blobs? [duplicate]

Say I've got a reference to a html file as a Blob b and I create a URL for it, url = URL.createObjectURL(b);.
This gives me something that looks like blob:http%3A//example.com/a0440b61-4850-4568-b6d1-329bae4a3276
I then tried opening this in an <iframe> with a GET parameter ?foo=bar, but it didn't work. How can I pass the parameter?
var html ='<html><head><title>Foo</title></head><body><script>document.body.textContent = window.location.search<\/script></body></html>',
b = new Blob([html], {type: 'text/html'}),
url = URL.createObjectURL(b),
ifrm = document.createElement('iframe');
ifrm.src = url + '?foo=bar';
document.body.appendChild(ifrm);
// expect to see ?foo=bar in <iframe>
I don't think adding a query string to the URL will work, as it essentially changes it to a different URL.
However, if you simply want to pass parameters, you can use the hash to add a fragment to the URL:
ifrm.src = url + '#foo=bar';
http://jsfiddle.net/thpf584n/1/
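Inside the iframe, the fragment arrives as location.hash and still has to be parsed. A minimal sketch using the standard URLSearchParams class follows; the helper name parseHashParams is hypothetical.

```javascript
// Hypothetical helper: turns a fragment such as "#foo=bar" into { foo: "bar" }.
function parseHashParams(hash) {
  // Drop the leading "#"; URLSearchParams itself ignores a leading "?".
  const query = hash.startsWith('#') ? hash.slice(1) : hash;
  return Object.fromEntries(new URLSearchParams(query));
}

// Possible usage inside the document loaded from the blob URL:
// const params = parseHashParams(window.location.hash);
// document.body.textContent = params.foo;
```

Because URLSearchParams skips a leading question mark, a fragment written as "#?foo=bar" parses to the same object.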
For completeness' sake, if you want to reference a blob URL that has a question mark ("query string" indicator) in it, you can do so in Firefox any way you choose, such as: blob:lalalal?thisworksinfirefox
For Chrome, the above will not work, but this will: blob:lalalla#?thisworksinchromeandfirefox
For Safari and Microsoft browsers, nothing really works, so run a pre-test like the following and plan accordingly:
function initScriptMode() {
  var file = new Blob(["test"], {type: "text/javascript"});
  var url = URL.createObjectURL(file) + "#test?test";
  var request = new XMLHttpRequest();
  request.responseType = "text";
  request.open('GET', url);
  request.onload = function() {
    alert("you can use query strings")
  };
  try {
    request.send();
  }
  catch(e) {
    alert("you can not use query strings")
  }
}
If you are doing this with a JavaScript Blob for, say, a Web Worker, then you can just add the parameters to the Blob constructor as a global variable:
const parameters = 'parameters = ' + JSON.stringify({foo:'bar'});
const body = response.body; // From some previous HTTP request
const blob = new Blob([parameters, body], { type: 'application/javascript' });
new Worker(URL.createObjectURL(blob));
Or, for a more general case, just store the original URL on the location object:
// Not named "location": a top-level const with that name clashes with window.location
const locationPrelude = 'location.originalHref = ' + JSON.stringify(url) + ';';
const body = response.body; // From some previous HTTP request
const blob = new Blob([locationPrelude, body], { type: 'application/javascript' });
new Worker(URL.createObjectURL(blob));
You could also do this with HTML: add the parameters to the root <html> tag as attributes, use the <base> element for the URL, or insert them as a script tag. However, this requires you to modify the response HTML rather than just prepend some extra data.
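When prepending parameters to a worker script as described above, it helps to end the prepended statement with a semicolon and a newline. Otherwise a body that begins with "(" or "[" would be parsed as a continuation of the assignment. A sketch of that idea, where the helper name buildWorkerParts is hypothetical:

```javascript
// Hypothetical helper: returns the parts to hand to the Blob constructor.
// The prelude declares `parameters` for the script text that follows it.
function buildWorkerParts(parameters, body) {
  // JSON.stringify yields a valid JavaScript literal for JSON-compatible data.
  // The trailing ";\n" keeps the prelude a separate statement from the body.
  const prelude = `const parameters = ${JSON.stringify(parameters)};\n`;
  return [prelude, body];
}

// Possible usage in the browser:
// const parts = buildWorkerParts({ foo: 'bar' }, workerSource);
// const blob = new Blob(parts, { type: 'application/javascript' });
// new Worker(URL.createObjectURL(blob));
```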

Node.js return html as a variable

I'm setting up a web scraper using Node.js and want to grab some HTML from a URL and save it as a variable. A stripped-down version follows.
var request = require('request');
var get_html = function(){
  var url = "http://www.google.com";
  var html = '';
  request.get(url, function(error, response, body){
    html += body;
  });
  return html;
};
console.log(get_html());
It seems that the function returns before request can concatenate the HTML to the variable html. As far as I can see, request only allows me to manipulate the HTML within the callback function or pipe it to a file. Is there any way to just return it as a variable?
request.get is asynchronous and delivers its result to the callback function.
You need to adapt your code a little, like this:
var request = require('request');
// get_html receives a callback that processes the result
var get_html = function(callback) {
  var url = "http://www.google.com";
  request.get(url, function(error, response, body){
    return callback(body); // call the callback and pass the result to it
  });
};
// call the get_html function
// and log the HTML result here
get_html(function (body) { console.log(body); });
Code with a lot of nested callbacks is not pretty.
I prefer promises to callbacks.
If you wish to use promises, try the 'request-promise' library.
It appears that request.get is async, so you have to work with html inside the callback. Otherwise get_html returns instantly, before request.get has finished running.
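As an illustration of the promise approach mentioned above, a callback-style getter can be wrapped by hand. This sketch does not use request-promise; the name getHtml is hypothetical, and the get parameter is assumed to have the same (error, response, body) callback signature as request.get in the question.

```javascript
// Hypothetical wrapper: turns a callback-style getter into a promise.
// `get` is assumed to call back with (error, response, body).
function getHtml(url, get) {
  return new Promise((resolve, reject) => {
    get(url, (error, response, body) => {
      if (error) {
        reject(error);   // surfaces network errors to the caller
      } else {
        resolve(body);   // the HTML becomes the resolved value
      }
    });
  });
}

// Possible usage with the request library from the question:
// const request = require('request');
// getHtml('http://www.google.com', (u, cb) => request.get(u, cb))
//   .then((html) => console.log(html))
//   .catch((err) => console.error(err));
```

The value still cannot be returned synchronously, but inside an async function it can be read with await as if it were an ordinary variable.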