Is it possible to ignore &#65279 in innerhtml - puppeteer

I have a line of code that looks like this:
await page.$$eval("a", as => as.find(a => a.innerText.includes("shop")).click());
This clicks the link whose text contains "shop", and that works fine. But if "shop" is written like this, "S&#65279h&#65279op", Puppeteer can't find it. Is it possible to ignore &#65279 so that Puppeteer only sees "shop"?

You can decode the innerText using DOMParser. Example copied from this answer.
window.getDecodedHTML = function getDecodedHTML(encodedStr) {
const parser = new DOMParser();
const dom = parser.parseFromString(
`<!doctype html><body>${encodedStr}`,
"text/html"
);
return dom.body.textContent;
}
Save the above snippet to some file like script.js and inject it for easier usage.
await page.evaluate(fs.readFileSync('script.js', 'utf8'));
Now you can use it to decode the innerText.
await page.$$eval("a", as => as.find(a => getDecodedHTML(a.innerText).includes("shop")).click());
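This solution might not be optimal, but it should work.
Here is another snippet that doesn't require DOMParser. The semicolon is optional in the pattern because the entities in the question are written without one.
window.getDecodedHTML = function(str) {
// replace numeric entities such as &#65279; or &#65279 with the character they encode
return str.replace(/&#(\d+);?/g, function(match, dec) {
return String.fromCharCode(dec);
});
};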
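A further option is sketched below, on the assumption that the browser has already decoded the entity by the time innerText is read, so the text holds the actual character U+FEFF (a zero-width no-break space) rather than the literal string "&#65279". In that case the character can be stripped directly. The helper names normalizeLabel and findByLabel are hypothetical, and lowercasing is added because the example text starts with a capital "S" while includes is case-sensitive.

```javascript
// Hypothetical helper: removes zero-width no-break spaces (U+FEFF, i.e. &#65279;)
// and lowercases the text so "S\uFEFFh\uFEFFop" compares equal to "shop".
function normalizeLabel(text) {
  return text.replace(/\uFEFF/g, "").toLowerCase();
}

// Returns the first item whose normalized text contains the needle, or undefined.
function findByLabel(items, needle, getText = (item) => item.innerText) {
  return items.find((item) => normalizeLabel(getText(item)).includes(needle));
}

// Plain objects stand in for <a> elements here.
const links = [{ innerText: "Home" }, { innerText: "S\uFEFFh\uFEFFop" }];
const match = findByLabel(links, "shop");
console.log(match === links[1]); // true
```

Inside Puppeteer the same logic would have to live in the $$eval callback itself, since functions defined in Node are not visible in the page, for example: await page.$$eval("a", as => as.find(a => a.innerText.replace(/\uFEFF/g, "").toLowerCase().includes("shop")).click());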

Save filereader result to variable for later use

I can't find a simple answer, even though my code is simple.
I tried something like the following, but whenever I console.log my testResult, I always receive null. How do I save the data from the file correctly?
public getFile(
sourceFile: File
): string {
let testResult;
const file = sourceFile[0]
const fileReader = new FileReader();
fileReader.readAsText(file, "UTF-8")
fileReader.onloadend = (e) => {
testResult = fileReader.result.toString()
}
console.log(testResult)
return testResult
}
This problem is related to my other questions. The main issue is that I can't manage to load a JSON file, translate it, and send it back to the user. If I can get the file contents outside onloadend, I hope I can handle the rest (my other attempts failed, and this one is blocking me right at the start).
Your issue is a classic one and is related to asynchronous operations. The function you assign to onloadend is called only when the loadend event fires, but the rest of the code does not wait for that to happen and continues executing. So console.log is executed immediately, and then return returns testResult while it is still empty.
Firstly, in order to understand what I just said, put the console.log(testResult) line inside of your onloadend handler:
fileReader.onloadend = (e) => {
testResult = fileReader.result.toString();
console.log(testResult);
}
At this point testResult is not empty and you may continue handling it inside this function. However, if you want your getFile method to be really reusable and want it to return the testResult and process it somewhere else, you need to wrap this method into a Promise, like this:
public getFile(
sourceFile: File
): Promise<string> {
return new Promise((resolve) => {
const file = sourceFile[0]
const fileReader = new FileReader();
fileReader.onloadend = (e) => {
const testResult = fileReader.result.toString();
resolve(testResult);
}
fileReader.readAsText(file, "UTF-8");
});
}
Now wherever you need the file contents, you can use the yourInstance.getFile method as follows (sourceFile is the same argument as in your original method):
yourInstance.getFile(sourceFile).then(testResult => {
// do whatever you need here
console.log(testResult);
});
Or in the async/await way:
async function processResult(sourceFile) {
const testResult = await yourInstance.getFile(sourceFile);
// do whatever you need
console.log(testResult);
}
If you are not familiar with promises and/or async/await, please read more about them here and here.
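As a possible alternative, File inherits from Blob, and Blob has a text() method that already returns a Promise, so no FileReader wrapper is needed. The sketch below assumes an environment where Blob.prototype.text is available (modern browsers, and Node.js 18 or later for the Blob stand-in used here). The name readJsonFile is hypothetical.

```javascript
// Hypothetical helper: reads a File/Blob as UTF-8 text and parses it as JSON.
// Assumes Blob.prototype.text() exists in the target environment.
async function readJsonFile(file) {
  const text = await file.text();
  return JSON.parse(text);
}

// A Blob stands in for a File picked by the user.
const sample = new Blob(['{"hello":"world"}'], { type: "application/json" });
readJsonFile(sample).then((data) => {
  console.log(data.hello); // "world"
});
```

As with the Promise wrapper above, the result is only available inside then or after await, never synchronously.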

ReactJS DraftJS Initialize from Serialized Data

So I am using the DraftJS package with React along with the mentions plugin. When a post is created, I store the raw JS in my PostgreSQL JSONField:
convertToRaw(postEditorState.getCurrentContent())
When I edit the post, I set the editor state as follows:
let newEditorState = EditorState.createWithContent(convertFromRaw(post.richtext_content));
setEditorState(newEditorState);
The text gets set correctly, but none of the mentions are highlighted AND I can't add new mentions. Does anyone know how to fix this?
I am using the mention plugin: https://www.draft-js-plugins.com/plugin/mention
To save the data:
function saveContent() {
const content = editorState.getCurrentContent();
const rawObject = convertToRaw(content);
const draftRaw = JSON.stringify(rawObject); //<- save this to database
}
And to retrieve it:
setEditorState(() => EditorState.push(
editorState,
convertFromRaw(JSON.parse(draftRaw)),
"remove-range"
));
This should preserve your data as saved.
The example provided (which works OK) is for inserting a new block with a mention, saving the entityMap as well.
mentionData is just a simple object {id:.., name:.., link:... , avatar:...}
One more thing: initialize the editor state only once. In other words, do not recreate the state.
const [editorState, setEditorState] = useState(() => EditorState.createEmpty() );
and then populate it with something like:
useEffect(() => {
try {
if (theDraftRaw) {
let mtyState = EditorState.push(
editorState,
convertFromRaw(JSON.parse(theDraftRaw)),
"remove-range"
);
setEditorState(mtyState);
} else editorClear();
} catch (e) {
console.log(e);
// or some fallback to other field like text
}
}, [theDraftRaw]);
const editorClear = () => {
if (!editorState.getCurrentContent().hasText()) return;
let _editorState = EditorState.push(
editorState,
ContentState.createFromText("")
);
setEditorState(_editorState);
};
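The question passes post.richtext_content straight to convertFromRaw, while the answer runs JSON.parse first, so the stored value may arrive either as an object or as a JSON string depending on how it was saved. A minimal sketch of a guard that accepts both is shown below. The name toRawContent is hypothetical, and the check relies only on the blocks and entityMap keys that convertToRaw produces.

```javascript
// Hypothetical helper: accepts either a JSON string or an already-parsed object
// and returns an object shaped like Draft.js raw content, or null if it is not.
function toRawContent(stored) {
  let raw = stored;
  if (typeof stored === "string") {
    try {
      raw = JSON.parse(stored);
    } catch (e) {
      return null; // not valid JSON
    }
  }
  const looksRaw =
    raw !== null &&
    typeof raw === "object" &&
    Array.isArray(raw.blocks) &&
    raw.entityMap !== null &&
    typeof raw.entityMap === "object";
  return looksRaw ? raw : null;
}

const fromString = toRawContent('{"blocks":[],"entityMap":{}}');
const fromObject = toRawContent({ blocks: [], entityMap: {} });
console.log(fromString !== null, fromObject !== null); // true true
```

The returned object could then be handed to convertFromRaw, with a null result treated like the fallback branch in the catch block above.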

Finding/Returning Text

I need to open very simple websites and scan for a json object i.e.
myJSONObject:["el1","el2"]. There is only one HTML <pre> tag on the site that contains 100s of lines of text. Nothing else.
I was planning on scanning the page for myJSONObject: and then return ["el1", "el2"].
I used the following, which returns true, as it finds "myJSONObject:", but I have no way to return any text.
const found = await page.evaluate(() => window.find("myJSONObject:"));
Is there a way to use a regexp or something to find the needed text and return it? Is this at all possible?
I am new to puppeteer, so I am unsure of its capabilities. I appreciate any feedback.
You already found the right function (page.evaluate) to do the job. With it you can return strings, objects, numbers, or booleans (in fact, any serializable value) from the browser/page context to the Node context.
In case this is not clear yet: the browser/page context and the Node context are different. The only way to transfer data between them is to serialize the data and pass it across.
That said, to solve your problem you have to come up with a regex and return the matched string. Full working example:
Suppose the <pre> text is something like this: <pre>[...] myJSONObject:["el1","el2"] [...]</pre>
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
// setup test page
await page.evaluate(() => {
const pre = document.createElement('pre');
pre.innerText = '<pre>[...] myJSONObject:["el1","el2"] [...]</pre>';
document.body.append(pre);
});
// important part (this is the answer to your question)
const myJson = await page.evaluate(() => {
var re = /myJSONObject:(\[.*?])/; // regex to match "json text"
const pre = document.querySelector('pre').innerText;
const matchedJsonText = pre.match(re)[1];
const json = JSON.parse(matchedJsonText);
return json;
});
// show results
console.log('myJSONObject:', myJson);
await browser.close();
})();
Please note that this regex only works with the JSON you provided as an example. You'll have to come up with a better regex to match the JSON that you need.
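For values a short regex cannot handle, such as nested arrays or brackets inside strings, one alternative is to locate the key and then count brackets until the value closes. The sketch below is one way to do that and is not part of the original answer. The name extractJsonAfter is hypothetical, and it assumes the value after the key is valid JSON starting with "[" or "{".

```javascript
// Hypothetical helper: finds `key:` in the text and parses the JSON array or
// object that follows by counting brackets, so nested values are handled.
function extractJsonAfter(text, key) {
  const marker = text.indexOf(key + ":");
  if (marker === -1) return null;
  let i = marker + key.length + 1;
  while (i < text.length && /\s/.test(text[i])) i++; // skip whitespace after the colon
  if (text[i] !== "[" && text[i] !== "{") return null;
  let depth = 0;
  let inString = false;
  for (let j = i; j < text.length; j++) {
    const ch = text[j];
    if (inString) {
      if (ch === "\\") j++; // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "[" || ch === "{") {
      depth++;
    } else if (ch === "]" || ch === "}") {
      depth--;
      // JSON.parse throws if the slice is not valid JSON
      if (depth === 0) return JSON.parse(text.slice(i, j + 1));
    }
  }
  return null; // brackets never balanced
}

const preText = '[...] myJSONObject:["el1","el2"] [...]';
console.log(extractJsonAfter(preText, "myJSONObject")); // [ 'el1', 'el2' ]
```

With Puppeteer, the text could be pulled into Node first, for example with const text = await page.$eval('pre', el => el.innerText), and then passed to the helper there.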

Generating HTML report in gulp using lighthouse

I am using gulp for a project and I added lighthouse to the gulp tasks like this:
gulp.task("lighthouse", function(){
return launchChromeAndRunLighthouse('http://localhost:3800', flags, perfConfig).then(results => {
console.log(results);
});
});
And this is my launchChromeAndRunLighthouse() function
function launchChromeAndRunLighthouse(url, flags = {}, config = null) {
return chromeLauncher.launch().then(chrome => {
flags.port = chrome.port;
return lighthouse(url, flags, config).then(results =>
chrome.kill().then(() => results));
});
}
It gives me the json output in command line. I can post my json here and get the report.
Is there any way I can generate the HTML report using gulp ?
The answer from @EMC is fine, but it requires multiple steps to generate the HTML from that point. However, you can use the printer module like this (written in TypeScript; it should be very similar in JavaScript):
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
Then call it:
await write(results, 'html', 'report.html');
UPDATE
There have been some changes to the lighthouse repo. I now enable programmatic HTML reports as follows:
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
const reportGenerator = await import(root('./node_modules/lighthouse/lighthouse-core/report/report-generator'));
// ...lighthouse setup
const raw = await lighthouse(url, flags, config);
await write(reportGenerator.generateReportHtml(raw.lhr), 'html', root('report.html'));
I know it's hacky, but it solves the problem :).
I've run into this issue too. I read somewhere in the GitHub issues that you can't use the html option programmatically, but Lighthouse does expose the report generator, so you can wrap it in simple functions that write and open the file to get the same effect.
const ReportGenerator = require('../node_modules/lighthouse/lighthouse-core/report/v2/report-generator.js');
I set output: 'html' in the options:
let opts = {
chromeFlags: ['--show-paint-rects'],
output: 'html'
};
// ...
const lighthouseResults = await lighthouse(urlToTest, opts, null);
and later use
JSON.stringify(lighthouseResults.lhr)
to get the JSON, and
lighthouseResults.report.toString('UTF-8')
to get the HTML.
You can define the preconfig in the gulpfile as:
const preconfig = {logLevel: 'info', output: 'html', onlyCategories: ['performance','accessibility','best-practices','seo'],port: (new URL(browser.wsEndpoint())).port};
The output option can be set to html, json, or csv. This preconfig is simply the Lighthouse configuration: it controls how the audit runs and which report format it returns.
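Once the result carries an HTML report, writing it to disk takes a single file write. The sketch below assumes, as the answers above suggest, that the object returned by lighthouse() has a report property holding the HTML. It also assumes that report may be an array of strings when several output formats are requested. The name saveHtmlReport is hypothetical.

```javascript
const fs = require("node:fs");

// Hypothetical helper: writes the HTML report from a Lighthouse result to disk.
// Assumes `result.report` is a string, or an array of strings when several
// output formats were requested (in which case `index` selects one).
function saveHtmlReport(result, filePath, index = 0) {
  const report = Array.isArray(result.report) ? result.report[index] : result.report;
  if (typeof report !== "string") {
    throw new Error("No report found; was the output: 'html' flag set?");
  }
  fs.writeFileSync(filePath, report, "utf8");
  return filePath;
}
```

In the gulp task from the question, this could be called inside the then callback, for example saveHtmlReport(results, 'report.html'), provided the flags include output: 'html'.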

How to use cheerio to get the URL of an image on a given page for ALL cases

Right now I have a function that looks like this:
static getPageImg(url) {
return new Promise((resolve, reject) => {
//get our html
axios.get(url)
.then(resp => {
//html
const html = resp.data;
//load into a $
const $ = cheerio.load(html);
//find ourself a img
const src = url + "/" + $("body").find("img")[0].attribs.src;
//make sure there are no extra slashes
resolve(src.replace(/([^:]\/)\/+/g, "$1"));
})
.catch(err => {
reject(err);
});
});
}
This handles the average case, where the page uses a relative path to link to an image and the host name is the same as the URL provided.
However, most of the time the URL scheme is more complex. For example, the URL might be stackoverflow.com/something/asdasd, and what I need is the stackoverflow.com/someimage link. A more interesting case is when a CDN is used and the images come from a separate server. For example, if I want to link to something from imgur, I'll give a link like http://imgur.com/gallery/epqDj, but the actual location of the image is http://i.imgur.com/pK0thAm.jpg, on a subdomain of the website. Even more interesting, if I read the src attribute I get "//i.imgur.com/pK0thAm.jpg".
I imagine there must be a simple way to get this image, since the browser can very quickly and easily do an "open window in new tab", so I am wondering if anyone knows an easy way to do this other than writing a big function that handles all these cases.
Thank you!
This is the function that ended up working for all my test cases, using Node's built-in url module (imported as nodeURL, e.g. const nodeURL = require('url')). I just had to use its resolve function.
static getPageImg(url) {
return new Promise((resolve, reject) => {
//get our html
axios.get(url)
.then(resp => {
//html
const html = resp.data;
//load into a $
const $ = cheerio.load(html);
//find ourself a img
const retURL = nodeURL.resolve(url,$("body").find("img")[0].attribs.src);
resolve(retURL);
})
.catch(err => {
reject(err);
});
});
}
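Node's documentation marks url.resolve as a legacy API and points to the WHATWG URL class instead, which takes a base URL as its second argument. A minimal sketch of the same resolution with that class follows. The name resolveImageUrl is hypothetical, and the relative paths in the examples are made up for illustration.

```javascript
// Hypothetical helper: resolves an <img> src against the URL of the page it
// was found on, using the WHATWG URL class that is global in Node.js.
function resolveImageUrl(src, pageUrl) {
  return new URL(src, pageUrl).href;
}

// Protocol-relative src, as in the imgur example from the question
console.log(resolveImageUrl("//i.imgur.com/pK0thAm.jpg", "http://imgur.com/gallery/epqDj"));
// http://i.imgur.com/pK0thAm.jpg

// Root-relative src
console.log(resolveImageUrl("/someimage.png", "https://stackoverflow.com/something/asdasd"));
// https://stackoverflow.com/someimage.png

// Path-relative src
console.log(resolveImageUrl("images/logo.png", "https://example.com/blog/post"));
// https://example.com/blog/images/logo.png
```

Note that new URL throws a TypeError if pageUrl is not an absolute URL, so a page URL without a scheme, such as stackoverflow.com/something, would need one added first.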