how to print response json from website using puppeteer? - puppeteer

I try to get google translation website to do some work for me, the website returns a blank web page with a json file. Using web brower, I can save the json file and open it in a text editor.
I am trying to use puppeteer to get this done automatically. Here is my code:
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless:false, args: ["--no-sandbox"]});
const page = await browser.newPage();
// Approach 1:
const response = await page.goto('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report');
let text = await response.text();
console.log(text);
let json = await response.json();
console.log(json);
await browser.close();
})();
When I run this code, brower is launched, but the returned json file still get automatically saved to the disk instead of printing to the console. What puppeteer class I should use for this task?

Since it is an API call and the expected result is JSON, you can use a simple nodsJS or Jquery code to return the response as below.
$.get('https://translate.googleapis.com/translate_a/single?`client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) =>`
{
console.log(data);
});
but if you are particular about using puppeteer and want to return the response. you would do the following.
Add a Jquery dependency to your project, by running
npm install jquery
Import the JQuery to the project.
Invoke the below code, without launching the browser.
$.get('https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=zh&dt=t&q=Edit%20Report', (data) =>
{
console.log(data);
});
Here is the link to JSfiddle code https://jsfiddle.net/faizmagic/0h6cm1o4/latest/
I hope this helps.

Related

How to use a Javascript file to refresh/reload a div from an HTML file?

I am using Node JS and have a JS file, which opens a connection to an API, works with the receving API data and then saves the changed data into a JSON file. Next I have an HTML file, which takes the data from the JSON file and puts it into a table. At the end I open the HTML file in my browser to look at the visualized table and its data.
What I would like to happen is, that the table (or more specific a DIV with an ID inside the table) from the HTML file refreshes itself, when the JSON data gets updated from the JS file. Kinda like a "live table/website", that I can watch change over time without the need to presh F5.
Instead of just opening the HTML locally, I have tried it by using the JS file and creating a connection with the file like this:
const http = require('http');
const path = require('path');
const browser = http.createServer(function (request, response) {
var filePath = '.' + request.url;
if (filePath == './') {
filePath = './Table.html';
}
var extname = String(path.extname(filePath)).toLowerCase();
var mimeTypes = {
'.html': 'text/html',
'.css': 'text/css',
'.png': 'image/png',
'.js': 'text/javascript',
'.json': 'application/json'
};
var contentType = mimeTypes[extname] || 'application/octet-stream';
fs.readFile(filePath, function(error, content) {
response.writeHead(200, { 'Content-Type': contentType });
response.end(content, 'utf-8');
});
}).listen(3000);
This creates a working connection and I am able to see it in the browser, but sadly it doesn't update itself like I wish. I thought about some kind of function, which gets called right after the JSON file got saved and tells the div to reload itself.
I also read about something like window.onload, location.load() or getElementById(), but I am not able to figure out the right way.
What can I do?
Thank you.
Websockets!
Though they might sound scary, it's very easy to get started with websockets in NodeJS, especially if you use Socket.io.
You will need two dependencies in your node application:
"socket.io": "^4.1.3",
"socketio-wildcard": "^2.0.0"
your HTML File:
<script type="module" src="https://cdnjs.cloudflare.com/ajax/libs/socket.io/4.0.0/socket.io.js"></script>
Your CLIENT SIDE JavaScript file:
var socket = io();
socket.on("update", function (data) { //update can be any sort of string, treat it like an event name
console.log(data);
// the rest of the code to update the html
})
your NODE JS file:
import { Server } from "socket.io";
// other code...
let io = new Server(server);
let activeConnections = {};
io.sockets.on("connection", function (socket) {
// 'connection' is a "magic" key
// track the active connections
activeConnections[socket.id] = socket;
socket.on("disconnect", function () {
/* Not required, but you can add special handling here to prevent errors */
delete activeConnections[socket.id];
})
socket.on("update", (data) => {
// Update is any sort of key
console.log(data)
})
})
// Example with Express
app.get('/some/api/call', function (req, res) {
var data = // your API Processing here
Object.keys(activeConnections).forEach((conn) => {
conn.emit('update', data)
}
res.send(data);
})
Finally, shameful self promotion, here's one of my "dead" side projects using websockets, because I'm sure I forgot some small detail, and this might help. https://github.com/Nhawdge/robert-quest

YouTube's get_video_info endpoint no longer working

I was using the YouTube's get_video_info unofficial endpoint to get the resolution of a video. Sometime very recently, this has stopped working and gives me:
We're sorry...... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
Is there another public endpoint via which I can get the size/aspect ratio of the video? I need the endpoint to be accessible without having to use some login or developer key.
I spent a couple of hours debugging youtube-dl, and I found out that for all videos, you can simply get the content of the web page, i.e.
(async () => {
// From the back-end
const urlVideo = "https://www.youtube.com/watch?v=aiSla-5xq3w";
const html = await (await fetch(urlVideo)).text();
console.log(html);
})();
And then you match the content against the following RegExp:
/ytInitialPlayerResponse\s*=\s*({.+?})\s*;\s*(?:var\s+meta|<\/script|\n)/
Sample HTML:
(async () => {
const html = await (await fetch("https://gist.githubusercontent.com/avi12/0255ab161b560c3cb7e341bfb5933c2f/raw/3203f6e4c56fff693e929880f6b63f74929cfb04/YouTube-example.html")).text();
const matches = html.match(/ytInitialPlayerResponse\s*=\s*({.+?})\s*;\s*(?:var\s+meta|<\/script|\n)/);
const json = JSON.parse(matches[1]);
const [format] = json.streamingData.adaptiveFormats;
console.log(`${format.width}x${format.height}`);
})();

how to execute a script in every window that gets loaded in puppeteer?

I need to execute a script in every Window object created in Chrome – that is:
tabs opened through puppeteer
links opened by click()ing links in puppeteer
all the popups (e.g. window.open or "_blank")
all the iframes contained in the above
it must be executed without me evaluating it explicitly for that particular Window object...
I checked Chrome's documentation and what I should be using is Page.addScriptToEvaluateOnNewDocument.
However, it doesn't look to be possible to use through puppeteer.
Any idea? Thanks.
This searches for a target in all browser contexts.
An example of finding a target for a page opened
via window.open() or popups:
await page.evaluate(() => window.open('https://www.example.com/'))
const newWindowTarget = await browser.waitForTarget(async target => {
await page.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})
})
via browser.pages() or tabs
This script run evaluation of a script in the second tab:
const pageTab2 = (await browser.pages())[1]
const runScriptOnTab2 = await pageTab2.evaluate(() => {
runTheScriptYouLike()
console.log('Hello StackOverflow!')
})
via page.frames() or iframes
An example of getting eval from an iframe element:
const frame = page.frames().find(frame => frame.name() === 'myframe')
const result = await frame.evaluate(() => {
return Promise.resolve(8 * 7);
});
console.log(result); // prints "56"
Hope this may help you

Generating HTML report in gulp using lighthouse

I am using gulp for a project and I added lighthouse to the gulp tasks like this:
gulp.task("lighthouse", function(){
return launchChromeAndRunLighthouse('http://localhost:3800', flags, perfConfig).then(results => {
console.log(results);
});
});
And this is my launchChromeAndRunLighthouse() function
function launchChromeAndRunLighthouse(url, flags = {}, config = null) {
return chromeLauncher.launch().then(chrome => {
flags.port = chrome.port;
return lighthouse(url, flags, config).then(results =>
chrome.kill().then(() => results));
});
}
It gives me the json output in command line. I can post my json here and get the report.
Is there any way I can generate the HTML report using gulp ?
You are welcome to start a bounty if you think this question will be helpful for future readers.
The answer from #EMC is fine, but it requires multiple steps to generate the HTML from that point. However, you can use it like this (written in TypeScript, should be very similar in JavaScript):
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
Then call it:
await write(results, 'html', 'report.html');
UPDATE
There have been some changes to the lighthouse repo. I now enable programmatic HTML reports as follows:
const { write } = await import(root('./node_modules/lighthouse/lighthouse-cli/printer'));
const reportGenerator = await import(root('./node_modules/lighthouse/lighthouse-core/report/report-generator'));
// ...lighthouse setup
const raw = await lighthouse(url, flags, config);
await write(reportGenerator.generateReportHtml(raw.lhr), 'html', root('report.html'));
I know it's hacky, but it solves the problem :).
I've run into this issue too. I found somewhere in the github issues that you can't use the html option programmatically, but Lighthouse does expose the report generator, so you can write simple file write and open functions around it to get the same effect.
const ReportGenerator = require('../node_modules/lighthouse/lighthouse-core/report/v2/report-generator.js');
I do
let opts = {
chromeFlags: ['--show-paint-rects'],
output: 'html'
}; ...
const lighthouseResults = await lighthouse(urlToTest, opts, config = null);
and later
JSON.stringify(lighthouseResults.lhr)
to get the json
and
lighthouseResults.report.toString('UTF-8'),
to get the html
You can define the preconfig in the gulp as
const preconfig = {logLevel: 'info', output: 'html', onlyCategories: ['performance','accessibility','best-practices','seo'],port: (new URL(browser.wsEndpoint())).port};
The output option can be used as the html or json or csv. This preconfig is nothing but the configuration for the lighthouse based on how we want it to run and give us the solution.

Using Artoo.js with Google Puppeteer for Web Scraping

I can't seem to be able to use Artoo.js with Puppeteer.
I tried using it through npm install artoo-js, but it did not work.
I also tried injecting the build path distribution using the Puppeteer command page.injectFile(filePath), but I had no luck.
Was anyone able to implement these two libraries successfully?
If so, I would love a code snippet of how Artoo.js was injected.
I just tried Puppeteer for another answer, I figured I could try Artoo too, so here you go :)
(Step 0 : Install Yarn if you don't have it)
yarn init
yarn add puppeteer
# Download latest artoo script, not as a yarn dependency here because it won't be by the Node JS runtime
wget https://medialab.github.io/artoo/public/dist/artoo-latest.min.js
Save this in index.js:
const puppeteer = require('puppeteer');
(async() => {
const url = 'https://news.ycombinator.com/';
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Go to URL and wait for page to load
await page.goto(url, {waitUntil: 'networkidle'});
// Inject Artoo into page's JS context
await page.injectFile('artoo-latest.min.js');
// Sleeping 2s to let Artoo initialize (I don't have a more elegant solution right now)
await new Promise(res => setTimeout(res, 2000))
// Use Artoo from page's JS context
const result = await page.evaluate(() => {
return artoo.scrape('td.title:nth-child(3)', {
title: {sel: 'a'},
url: {sel: 'a', attr: 'href'}
});
});
console.log(`Result has ${result.length} items, first one is:`, result[0]);
browser.close();
})();
Result:
$ node index.js
Result has 30 items, first one is: { title: 'Headless mode in Firefoxdeveloper.mozilla.org',
url: 'https://developer.mozilla.org/en-US/Firefox/Headless_mode' }
This is too funny to miss: right now the top article of HackerNews is about Firefox Headless...