I am using Puppeteer to scrape a page that has a simple search form.
I enter search text in one input and leave the others empty. However, when I click the search button, the other inputs are not empty. What is wrong?
await page.goto('http://www.dollmedia-btp.com/annuaire/', {waitUntil: 'domcontentloaded'});
await page.type("input#input-recherche-activite", "Maçonnerie");
await page.type("input#input-recherche-raison-sociale", "");
await page.type("input#input-recherche-ville","");
await page.type("input#input-recherche-tel", "");
I don't know what to do. Thanks for your help.
You have to clear the fields explicitly: calling the type method with an empty string does not remove the input's existing value.
Custom function:
async function clear(page, selector) {
  await page.evaluate(selector => {
    document.querySelector(selector).value = "";
  }, selector);
}
Usage:
await clear(page,"input#input-recherche-raison-sociale");
Or without creating a function:
await page.$eval('#input-recherche-raison-sociale', el => el.value = '');
Related
await page.on("response", async (response) => {
const request = await response.request();
if (
request.url().includes("https://www.jobs.abbott/us/en/search-results")
) {
const text = await response.text();
const root = await parse(text);
root.querySelectorAll("script").map(async function (n) {
if (n.rawText.includes("eagerLoadRefineSearch")) {
const text = await n.rawText.match(
/"eagerLoadRefineSearch":(\{.*\})\,/,
);
const refinedtext = await text[0].match(/\[{.*}\]/);
//console.log(refinedtext);
console.log(JSON.parse(refinedtext[0]));
}
});
}
});
In the snippet above, the response arrives as text. I want to extract eagerLoadRefineSearch: { ... } (including its content) as text with a regex, then run JSON.parse on the extracted text so that I finally get the "eagerLoadRefineSearch" object.
I am using Puppeteer to intercept the response. I just need a correct regex that captures the whole text of the "eagerLoadRefineSearch": {} object, with its content.
The response text from the server is shared at https://codeshare.io/bvjzJA .
Context
Silly mistakes
The text you are parsing has no " flanking eagerLoadRefineSearch, so a pattern that expects the quotes never matches. The object to match also spans several lines, and . does not match a newline, so the alternative is to use [\s\S]. The m flag alone does not help here, since it only changes how ^ and $ behave. Refer to how-to-use-javascript-regex-over-multiple-lines.
Also, don't use await on the string method match: it returns its result synchronously, not a promise.
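To make the point about line breaks concrete, here is a small check that can be run in Node. The sample string is made up for illustration and is not the real response text.

```javascript
// A three-line sample standing in for the real response text (hypothetical data).
const sample = 'start {\n  a: 1\n} end';

// `.` stops at line breaks, so this finds nothing, with or without the m flag.
const dotMatch = sample.match(/\{.*\}/m);

// [\s\S] matches any character, including line breaks.
const anyCharMatch = sample.match(/\{[\s\S]*\}/);

// The s (dotAll) flag, added in ES2018, makes `.` match line breaks too.
const dotAllMatch = sample.match(/\{.*\}/s);

console.log(dotMatch === null);                    // no match for the plain dot
console.log(JSON.stringify(anyCharMatch[0]));      // the braces and everything between
console.log(anyCharMatch[0] === dotAllMatch[0]);   // both approaches agree
```

Either [\s\S] or the s flag works; [\s\S] is the safer choice if older runtimes have to be supported.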
Matching the closing brace
A quick search on this topic led me to this link and, as I suspected, this is complicated. To simplify the problem, I assumed that the text is correctly indented. We can then match on the indentation level to find the closing brace with this pattern.
/(?<indent>[\s]+)\{[\s\S]+\k<indent>\}/gm
This works if both the opening and the closing braces are at the same level of indentation. They are not in our case, since eagerLoadRefineSearch: sits between the indent and the opening brace, but we can account for this.
const reMatchObject = /(?<indent>[\s]+)eagerLoadRefineSearch: \{[\s\S]+?\k<indent>\}/gm
Valid JSON
As mentioned earlier, the keys lack flanking double quotes, so let's replace all keys with "key"s.
const reMatchKeys = /(\w+):/gm
const impure = 'hello: { name: "nammu", age: 18, subjects: { first: "english", second: "mythology"}}'
const pure = impure.replace(reMatchKeys, '"$1":')
console.log(pure)
Then we get rid of the trailing commas. Here's the regex that worked for this example.
const reMatchTrailingCommas = /,(?=\s+[\]\}])/gm
Once we chain these replace calls, the data is ready for JSON.parse.
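The three steps can be tried without a browser. The sketch below is only an illustration: the sample text and the function name extractEagerLoad are hypothetical, and the key regex is the variant that requires whitespace after the colon, so that something like https:// inside a value is left alone. It still assumes correctly indented input and no string values containing a word followed by a colon and a space.

```javascript
// Hypothetical sample: unquoted keys, trailing commas, consistent indentation.
const sample = [
  'var config = {',
  '  eagerLoadRefineSearch: {',
  '    status: 200,',
  '    hits: 2,',
  '    data: {',
  '      jobs: [',
  '        { title: "Engineer", },',
  '        { title: "Analyst", },',
  '      ],',
  '    },',
  '  },',
  '  other: 1',
  '};',
].join('\n');

const reMatchObject = /(?<indent>[\s]+)eagerLoadRefineSearch: \{[\s\S]+?\k<indent>\}/g;
const reMatchTrailingCommas = /,(?=\s+[\]\}])/g;
const reMatchKeys = /(\w+):\s/g;

// Returns one parsed object per match; an empty array when nothing matches.
function extractEagerLoad(text) {
  const matches = text.match(reMatchObject) || [];
  return matches.map((raw) => {
    const noTrailingCommas = raw.replace(reMatchTrailingCommas, '');
    const quotedKeys = noTrailingCommas.replace(reMatchKeys, '"$1":');
    // Wrap the "key": { ... } fragment in braces to form a JSON document.
    return JSON.parse('{' + quotedKeys + '}');
  });
}

const results = extractEagerLoad(sample);
console.log(JSON.stringify(results[0], null, 2));
```

JSON.parse throws a SyntaxError if the cleaned text is still not valid JSON, so real input may need a try/catch around the call.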
Code
// `parse` and `rawText` are assumed to come from a synchronous HTML parser such as node-html-parser.
page.on('response', async (response) => {
  const request = response.request();
  if (!request.url().includes('https://www.jobs.abbott/us/en/search-results')) {
    return;
  }
  const text = await response.text();
  const root = parse(text);

  const reMatchObject = /(?<indent>[\s]+)eagerLoadRefineSearch: \{[\s\S]+?\k<indent>\}/gm;
  const reMatchKeys = /(\w+):\s/g;
  const reMatchTrailingCommas = /,(?=\s+[\]\}])/gm;

  for (const n of root.querySelectorAll('script')) {
    const data = n.rawText;
    if (!data.includes('eagerLoadRefineSearch')) continue;
    // match returns null when nothing matches, so fall back to an empty array
    const parsedStringArray = data.match(reMatchObject) || [];
    for (const parsed of parsedStringArray) {
      const noTrailingCommas = parsed.replace(reMatchTrailingCommas, '');
      // quote the keys and wrap the fragment in braces to form a JSON document
      const validJSONString = '{' + noTrailingCommas.replace(reMatchKeys, '"$1":') + '}';
      console.log(JSON.parse(validJSONString));
    }
  }
});
I want to run $x on a specific element, not the whole page, and tried this:
let selector = await page.$('.myClass');
let [el] = await selector.$x(`//p[contains(text(), 'myTextString')]`);
pass = el ? true : false
I'm expecting pass to evaluate to false, because the element with myClass does not contain myTextString, but el is not falsy because myTextString exists elsewhere on the page.
Is there a way to check for a string of text on a particular element?
Code to recreate the issue with $x:
const puppeteer = require('puppeteer');
async function test() {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
    args: ['--start-maximized']
  });
  const page = await browser.newPage();
  await page.goto('https://stackoverflow.com/');
  // select a header div with no text content
  let selector = await page.$('#notify-container');
  // an element is still found, even though I'm running $x on the selector, not the page
  let [el] = await selector.$x(`//h2[contains(text(), 'Find the best answer to your technical question, help others answer theirs')]`);
  console.log(el);
  await browser.close();
}
test();
According to the docs, "The method evaluates the XPath expression relative to the elementHandle as its context node". So you just need to use the context node symbol in the beginning of the XPath: .//h2 instead of //h2.
I need to open very simple websites and scan for a json object i.e.
myJSONObject:["el1","el2"]. There is only one HTML <pre> tag on the site that contains 100s of lines of text. Nothing else.
I was planning on scanning the page for myJSONObject: and then return ["el1", "el2"].
I used the following, which returns true, as it finds "myJSONObject:", but I have no way to return any text.
const found = await page.evaluate(() => window.find("myJSONObject:"));
Is there a way to use a regexp or something to find the needed text and return it? Is this at all possible?
I am new to puppeteer, so I am unsure of its capabilities. I appreciate any feedback.
You already found the right function (page.evaluate) to do the job. With it you can return strings, objects, numbers or booleans (in fact any serializable value) from the browser/page context to the Node context.
In case it is not clear yet: the browser/page context and the Node context are separate. The only way to transfer data between them is to serialize it on one side and deserialize it on the other.
That said, to solve your problem you have to come up with a regex and return the matched string. Full working example:
Suppose the <pre> text is something like this: <pre>[...] myJSONObject:["el1","el2"] [...]</pre>
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // setup test page
  await page.evaluate(() => {
    const pre = document.createElement('pre');
    pre.innerText = '<pre>[...] myJSONObject:["el1","el2"] [...]</pre>';
    document.body.append(pre);
  });
  // important part (this is the answer to your question)
  const myJson = await page.evaluate(() => {
    var re = /myJSONObject:(\[.*?])/; // regex to match "json text"
    const pre = document.querySelector('pre').innerText;
    const matchedJsonText = pre.match(re)[1];
    const json = JSON.parse(matchedJsonText);
    return json;
  });
  // show results
  console.log('myJSONObject:', myJson);
  await browser.close();
})();
Please note that this regex only works with the JSON you provided as an example: the lazy match stops at the first ], so nested arrays are cut short. You'll have to come up with a better regex to match the JSON that you need.
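If the real data contains nested arrays or objects, one alternative to a more elaborate regex is to scan for the matching closing bracket. The following is a minimal sketch under that assumption, not part of the original answer; the function name extractJsonAfter is hypothetical, and it expects the value to start immediately after the marker and to be valid JSON.

```javascript
// Finds `marker` in `text` and parses the JSON array or object that follows it.
// Returns null if the marker or a balanced value is not found.
function extractJsonAfter(text, marker) {
  const markerIndex = text.indexOf(marker);
  if (markerIndex === -1) return null;
  const start = markerIndex + marker.length;
  if (text[start] !== '[' && text[start] !== '{') return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Brackets inside string literals must not affect the depth count.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '[' || ch === '{') depth++;
    else if (ch === ']' || ch === '}') {
      depth--;
      if (depth === 0) return JSON.parse(text.slice(start, i + 1));
    }
  }
  return null; // ran out of text before the brackets balanced
}

const simple = extractJsonAfter('[...] myJSONObject:["el1","el2"] [...]', 'myJSONObject:');
const nested = extractJsonAfter('x myJSONObject:[["a]"],{"b":[1,2]}] tail', 'myJSONObject:');
console.log(simple, JSON.stringify(nested));
```

The function works on a plain string, so it can run either inside page.evaluate or in Node on the text returned from the page.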
I am having an issue with Puppeteer.
I want to delete an item that was added to the form. For example, I have a form and added some fake data ("example"). I want to delete this "example", no matter what position it is in.
In other words, Puppeteer adds it and should delete it in the next step.
I have tried:
// fake data
const metadatatest = {
  text: 'example',
}
describe('Should be navigate through details', () => {
  it('can navigate through detail', async () => {
    // this adds fake data successfully
    await page.waitForSelector('[data-testid="appCard"]')
    await page.click('[data-testid="appCardDetails"]')
    await page.waitForSelector('[data-testid="overviewSectionMetadataForm"]')
    await page.click('[data-testid="overviewSectionMetadataEditButton"]')
    //await page.$eval('[data-testid="metadataInput"]', el => el.value = 'example')
    await page.type('[data-testid="metadataInput"]', metadatatest.text)
    await page.waitForSelector('[data-testid="metadataInput"]')
    await Promise.all([
      page.click('[data-testid="overviewSectionMetadataEditButton"]'),
    ]);
    // I want to delete this
  })
})
I have also tried using
await page.keyboard.press('Backspace')
await page.keyboard.press('Clear')
await page.keyboard.press('Delete')
but no luck.
Any help, please!
So you're asking how to clear text from an input field, am I reading that correctly? Puppeteer doesn't have a built-in method for that, but I have found a workaround that does it.
First, click three times on the input field you wish to clear. This acts as a select-all action for the text entered in that element:
await page.click(selector, { clickCount: 3 });
Now you can use your previous attempt to clear the text:
await page.keyboard.press('Backspace');
Update 1:
Your final code for clearing and then entering the text you want into the input field should look something like this:
await page.click('[data-testid="metadataInput"]', { clickCount: 3 });
await page.keyboard.press('Backspace');
await page.type('[data-testid="metadataInput"]', metadatatest.text);
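The same three calls can be wrapped in a small helper. This is a sketch rather than a Puppeteer feature: the name clearAndType is hypothetical, and it assumes a single-line input, where a triple click selects all of the text.

```javascript
// Hypothetical helper: clears an input by selecting its text, then types the new value.
// `page` is expected to be a Puppeteer Page (or any object with the same methods).
async function clearAndType(page, selector, text) {
  // A triple click selects the existing text in the field.
  await page.click(selector, { clickCount: 3 });
  // Backspace deletes the current selection.
  await page.keyboard.press('Backspace');
  // Only type when there is something to type; an empty string just clears the field.
  if (text) {
    await page.type(selector, text);
  }
}
```

With a real page it would be called as await clearAndType(page, '[data-testid="metadataInput"]', metadatatest.text), or with an empty string to only delete the existing value.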
I have a line of code that looks like this:
await page.$$eval("a", as => as.find(a => a.innerText.includes("shop")).click());
It clicks the "shop" link and everything works. But if shop is written like this - "Shop" - Puppeteer is not able to find it. Is it possible to ignore the difference, so that Puppeteer only sees "shop"?
You can decode the innerText using DOMParser. Example copied from this answer.
window.getDecodedHTML = function getDecodedHTML(encodedStr) {
  const parser = new DOMParser();
  const dom = parser.parseFromString(
    `<!doctype html><body>${encodedStr}`,
    "text/html"
  );
  return dom.body.textContent;
}
Save the above snippet to some file like script.js and inject it for easier usage.
await page.evaluate(fs.readFileSync('script.js', 'utf8'));
Now you can use it to decode the innerText.
await page.$$eval("a", as => as.find(a => getDecodedHTML(a.innerText).includes("shop")).click());
The solution might not be optimal, but it should work.
Here is another snippet for you which doesn't require DOMParser.
window.getDecodedHTML = function(str) {
  return str.replace(/&#(\d+);/g, function(match, dec) {
    return String.fromCharCode(dec);
  });
};
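The regex-based snippet only handles decimal entities, and String.fromCharCode truncates code points above 0xFFFF. A possible variation, offered as a sketch with a hypothetical function name (decodeNumericEntities), also accepts hexadecimal entities and uses String.fromCodePoint. Named entities such as &amp;amp; are not handled.

```javascript
// Decodes decimal (&#104;) and hexadecimal (&#x68;) numeric character references
// in a single pass, so decoded output is never decoded a second time.
function decodeNumericEntities(str) {
  return str.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // String.fromCodePoint handles code points beyond the Basic Multilingual Plane.
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNumericEntities('s&#104;op'));
console.log(decodeNumericEntities('s&#x68;op'));
```

String.fromCodePoint throws a RangeError for values above 0x10FFFF, so untrusted input may need a bounds check first.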