I am trying to download an HLS video once I am logged in (cookies).
HLS videos are made of an .m3u8 playlist and a sequence of segment files: 00001.ts, 00002.ts, ...
I tried two methods.
Method 1: download the files using the browser's downloader:
const client = await page.target().createCDPSession();
await client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: '/',
});
await page.evaluate((link) => {
  location.href = link;
}, link);
This works for the .m3u8 file, but it does not work for 00001.ts: the browser opens these files inline instead of downloading them.
I have no idea how to force the .ts files to download rather than open.
I also tried another approach (from https://help.apify.com/en/articles/1929322-handling-file-download-with-puppeteer), but it did not work either:
await page.setRequestInterception(true);
await page.goto('https://eedols.com/m3u8/20221205130402-SiBOhsUsJbTdZLP-DfqGGLn5cF-6-l.m3u8');
const xRequest = await new Promise(resolve => {
  page.on('request', interceptedRequest => {
    interceptedRequest.abort(); // stop the request; we resend it ourselves below
    resolve(interceptedRequest);
  });
});
const options = {
  encoding: null,
  method: xRequest.method(),   // use the public accessors, not _method etc.
  uri: xRequest.url(),
  body: xRequest.postData(),
  headers: xRequest.headers()
};
/* add the cookies */
const cookies = await page.cookies();
options.headers.Cookie = cookies.map(ck => ck.name + '=' + ck.value).join(';');
/* resend the request */
const response = await request(options);
It gets stuck.
Any ideas?
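For what it's worth, one way around the browser's inline handling is to skip the download machinery entirely: read the playlist text, resolve the segment URLs, and fetch each segment directly with the session cookies. A sketch under those assumptions (`parseSegments` and the cookie-handling lines are mine, not from the question, and global `fetch` assumes Node 18+):

```javascript
// Extract absolute segment URLs from an m3u8 playlist.
// Lines starting with '#' are tags/comments; the rest are segment paths.
function parseSegments(m3u8Text, playlistUrl) {
  return m3u8Text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line && !line.startsWith('#'))
    .map(line => new URL(line, playlistUrl).href); // resolve relative paths
}

// Usage sketch inside a Puppeteer script (names are assumptions):
// const cookieHeader = (await page.cookies())
//   .map(c => `${c.name}=${c.value}`).join('; ');
// for (const segUrl of parseSegments(playlistText, playlistUrl)) {
//   const res = await fetch(segUrl, { headers: { Cookie: cookieHeader } });
//   fs.writeFileSync(new URL(segUrl).pathname.split('/').pop(),
//                    Buffer.from(await res.arrayBuffer()));
// }
```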
Related
When a page is rendered with the page.setContent method from static HTML content, what is the current folder for relative attributes such as the src of img tags?
For example, for:
await page.setContent('<img src="./pic.jpg" />');
where is the folder ./?
Maybe it's undefined; here are my test results:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
  const page = await browser.newPage();
  page.on('request', request => console.log('send request: ' + request.url()));
  page.on('console', message => console.log('console: ' + message.text()));
  await page.setContent('<img src="./test.jpg" /><script>console.log("href="+window.location.href);</script>');
  await browser.close();
})();
output:
console: href=about:blank
The page URL is about:blank and no requests are sent.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
  const page = await browser.newPage();
  page.on('request', request => console.log('send request: ' + request.url()));
  page.on('console', message => console.log('console: ' + message.text()));
  await page.setContent('<base href="https://www.google.com"><img src="./test.jpg" /><script>console.log("href="+window.location.href);</script>');
  await browser.close();
})();
output:
console: href=about:blank
send request: https://www.google.com/test.jpg
console: Failed to load resource: the server responded with a status of 404 ()
The browser requests test.jpg after a base element is appended, while the URL is still about:blank.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
  const page = await browser.newPage();
  page.on('request', request => console.log('send request: ' + request.url()));
  page.on('console', message => console.log('console: ' + message.text()));
  // set base href to a local URL
  await page.setContent('<base href="file:///abc/index.html"><img src="./test.jpg" /><script>console.log("href="+window.location.href);</script>');
  await browser.close();
})();
output:
console: href=about:blank
console: Not allowed to load local resource: file:///abc/test.jpg
send request: file:///abc/test.jpg
The folder is resolved relative to the URL of the page you are visiting.
For example, if the URL is
mydomain.com/directory1/page.html
the image will be looked up at mydomain.com/directory1/pic.jpg
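This resolution rule matches what the WHATWG URL constructor does, which you can check outside the browser:

```javascript
// A relative src resolves against the page URL, replacing the file name:
const resolved = new URL('./pic.jpg', 'https://mydomain.com/directory1/page.html').href;
// → 'https://mydomain.com/directory1/pic.jpg'
```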
I use the GoLogin service. GoLogin is a browser anti-detect service where I can fake my browser identity / manage the browser fingerprint, so I can do web scraping freely without being detected.
In this case I want to load my extension into that browser using the puppeteer.connect() method.
Here's the code:
const puppeteer = require('puppeteer-core');
const GoLogin = require('gologin');

(async () => {
  const GL = new GoLogin({
    token: 'yU0token',
    profile_id: 'yU0Pr0f1leiD',
  });
  const { status, wsUrl } = await GL.start();
  const browser = await puppeteer.connect({
    browserWSEndpoint: wsUrl.toString(),
    ignoreHTTPSErrors: true,
  });
  const page = await browser.newPage();
  await page.goto('https://myip.link/mini');
  console.log(await page.content());
  await browser.close();
  await GL.stop();
})();
I don't know how to do this. Please help me load my extension when using puppeteer.connect().
Assuming your goal is to load a Chrome extension into your Puppeteer browser:
Find the chrome-extension working directory (see "Where does Chrome store extensions?").
Find your extension ID by going to chrome://extensions/.
Note that --load-extension is a startup flag, so it only takes effect with puppeteer.launch(); puppeteer.connect() attaches to a browser that is already running, so the flags must be passed to that browser (here, the GoLogin profile) before it starts.
Sample code:
const puppeteer = require('puppeteer-core');

const MY_EXTENSION_PATH = '~/Library/Application Support/Google/Chrome/Default/Extensions/cdockenadnadldjbbgcallicgledbeoc/0.3.38_0';

async function loadExtension() {
  return puppeteer.launch({
    headless: false, // extensions generally require a non-headless browser
    args: [
      `--disable-extensions-except=${MY_EXTENSION_PATH}`,
      `--load-extension=${MY_EXTENSION_PATH}`,
    ],
  });
}
Are there any existing samples of how to use Puppeteer with NordVPN?
I tried this:
page = await browser.newPage();
await useProxy(page, `socks5://login:password@fr806.nordvpn.com:1080`);
I also tried:
'--proxy-server=socks5://login:password@fr806.nordvpn.com:1080'
This script works; you just need to change the user/pass to yours. Note that these are not your normal NordVPN user/pass: you need the service/API credentials from your account settings. Change the server to whichever one you need to use.
#!/usr/bin/env node
// Screengrab generator
// outputs a JSON object with a base64 encoded image of the screengrab
// eg;
const puppeteer = require('puppeteer');

let conf = {};
conf.url = "https://www.telegraph.co.uk";

// VPN (service credentials, not your normal login)
conf.vpnUser = conf.vpnUser || 'USERNAME';
conf.vpnPass = conf.vpnPass || 'PASSWORD';
conf.vpnServer = conf.vpnServer || "https://uk1785.nordvpn.com:89";
(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--disable-dev-shm-usage',
      '--proxy-server=' + conf.vpnServer
    ]
  });
  try {
    const page = await browser.newPage();
    await page.authenticate({
      username: conf.vpnUser,
      password: conf.vpnPass,
    });
    await page.goto(conf.url, { waitUntil: 'networkidle2' });
  } catch (error) {
    console.error(error);
  } finally {
    await browser.close();
  }
})();
I have a Node.js application that creates dynamic content which I want users to download:
static async downloadPDF(res, html, filename) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({
    headless: true
  });
  const page = await browser.newPage();
  await page.setContent(html, {
    waitUntil: 'domcontentloaded'
  });
  const pdfBuffer = await page.pdf({
    format: 'A4'
  });
  res.set("Content-Disposition", "attachment;filename=" + filename + ".pdf");
  res.setHeader("Content-Type", "application/pdf");
  res.send(pdfBuffer);
  await browser.close();
}
Is there a way to speed up the whole process? It takes about 10 seconds to create a PDF file of about 100 kB.
I read somewhere that I can launch the headless browser once and then only create a new page instead of launching a browser every time I request the file, but I cannot find the correct way of doing it.
You could move page creation to a util module and hoist the instance to re-use it:
// getPage.js
const puppeteer = require('puppeteer');

let page;

const getPage = async () => {
  if (page) return page;
  const browser = await puppeteer.launch({
    headless: true,
  });
  page = await browser.newPage();
  return page;
};

module.exports = getPage;
Then in the handler, reuse it and skip the browser.close() call:

const getPage = require('./getPage');

static async downloadPDF(res, html, filename) {
  const page = await getPage();
  // setContent / pdf / send as before, without closing the browser
}
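One caveat with the hoisted getPage above: two requests arriving before `page` is set will each launch a browser. Caching the promise itself avoids that race. A minimal sketch (the `once` helper is my own name, and the commented usage lines assume puppeteer is installed):

```javascript
// once(factory): run an async factory a single time and cache the promise,
// so concurrent callers share one in-flight browser launch instead of racing.
function once(factory) {
  let cached;
  return () => (cached ??= factory());
}

// Usage sketch with Puppeteer (names are assumptions):
// const getBrowser = once(() => puppeteer.launch({ headless: true }));
// async function renderPdf(html) {
//   const page = await (await getBrowser()).newPage();
//   await page.setContent(html, { waitUntil: 'domcontentloaded' });
//   const buf = await page.pdf({ format: 'A4' });
//   await page.close(); // close the page, keep the browser alive
//   return buf;
// }
```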
Yes, there is no reason to launch a browser every time. You can point Puppeteer at a new URL and fetch the content; without relaunching each time it will be much faster.
How to implement this? Split your function into three steps:
Create a browser instance. It does not matter whether it is headless or not. If you run the app in an X environment, you can launch a visible window to watch what Puppeteer does.
Create a function that does the main task in a loop.
After each block is done, call await page.goto(url) (where "page" is the instance from browser.newPage()) and run your function again.
Here is one possible solution in a functional style:
Create the instances:

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 1024 });

I put this inside an async IIFE, like (async () => {})();
Get the data.
In my case, a set of URLs was stored in MongoDB; after fetching them, I ran a loop:

for (const entrie of entries) {
  const url = entrie[1];
  const id = entrie[0];
  await get_aplicants_data(page, url, id, collection);
}
In get_aplicants_data() I implemented the logic for the loaded page:

await page.goto(url); // navigate to the url
// ... code to process the page data

You can also load URLs in a loop and then apply your logic. Hope this helps.
My app records an audiofile from the microphone in the browser, typically Chrome; sends it to Firebase Storage; then a Firebase Cloud Function sends the audiofile to Google Cloud Speech-to-Text. Everything works with IBM Cloud Speech-to-Text. With Google Cloud Speech-to-Text it works if I send the audio/flac sample file "several tornadoes touched down as a line of severe thunderstorms swept through Colorado on Sunday". But when I send an audiofile recorded in the browser I get back an error message:
Error: 3 INVALID_ARGUMENT: Request contains an invalid argument.
Here's the browser code. The audio settings are at the top: audio/webm;codecs=opus and 48000 bits per second. This is the only media file format and encoding that Chrome supports.
navigator.mediaDevices.getUserMedia({ audio: true, video: false })
  .then(stream => {
    var options = {
      audioBitsPerSecond: 48000, // switch to 8000 on slow connections?
      mimeType: 'audio/webm;codecs=opus' // only options on Chrome
    };
    const mediaRecorder = new MediaRecorder(stream, options);
    mediaRecorder.start();

    const audioChunks = [];
    mediaRecorder.addEventListener("dataavailable", event => {
      audioChunks.push(event.data);
    });

    mediaRecorder.addEventListener("stop", () => {
      const audioBlob = new Blob(audioChunks);
      firebase.storage().ref('Users/' + $scope.user.uid + '/Pronunciation_Test').put(audioBlob) // upload to Firebase Storage
        .then(function(snapshot) {
          firebase.storage().ref(snapshot.ref.location.path).getDownloadURL() // get downloadURL
            .then(function(url) {
              firebase.firestore().collection('Users').doc($scope.user.uid).collection("Pronunciation_Test").doc('downloadURL').set({ downloadURL: url })
                .then(function() {
                  console.log("Document successfully written!");
                })
                .catch(function(error) {
                  console.error("Error writing document: ", error);
                });
            })
            .catch(error => console.error(error));
        })
        .catch(error => console.error(error));

      // play back the audio blob
      const audioUrl = URL.createObjectURL(audioBlob);
      const audio = new Audio(audioUrl);
      audio.play();
    });

    setTimeout(() => {
      mediaRecorder.stop();
    }, 3000);
  })
  .catch(function(error) {
    console.log(error.name + ": " + error.message);
  });
Firebase Storage converts the audiofile's content type from webm/opus to application/octet-stream.
Here's my Firebase Cloud Function that gets an audiofile from Firebase Storage and sends it to Google Cloud Speech-to-Text.
exports.Google_Speech_to_Text = functions.firestore.document('Users/{userID}/Pronunciation_Test/downloadURL').onUpdate((change, context) => {
  // Imports the Google Cloud client library
  const speech = require('@google-cloud/speech');

  // Creates a client
  const client = new speech.SpeechClient();

  const downloadURL = change.after.data().downloadURL;
  const gcsUri = downloadURL;
  const encoding = 'application/octet-stream';
  const sampleRateHertz = 48000;
  const languageCode = 'en-US';

  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
  };
  const audio = {
    uri: gcsUri,
  };
  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file
  return client.recognize(request)
    .then(function(response) {
      const [responseArray] = response;
      const transcription = responseArray.results
        .map(result => result.alternatives[0].transcript)
        .join('\n');
      console.log(`Transcription: `, transcription);
    })
    .catch((err) => { console.error(err); });
}); // close Google_Speech_to_Text
Here's the list of supported media formats and encodings for Google Cloud Speech-to-Text:
MP3
FLAC
LINEAR16
MULAW
AMR
AMR_WB
OGG_OPUS
SPEEX_WITH_HEADER_BYTE
webm/opus and application/octet-stream aren't on the list.
Am I missing something or is it impossible to record an audiofile in Chrome, save it in Firebase Storage, and then send it to Google Cloud Speech-to-Text? It seems strange that Google products wouldn't work together. Do I have to recode the audiofile with ffmpeg before I send it to Google Cloud Speech-to-Text?
You can change the file type before uploading it:
https://developer.mozilla.org/en-US/docs/Web/API/Blob
mediaRecorder.addEventListener("stop", () => {
  // const audioBlob = new Blob(audioChunks);
  const audioBlob = new Blob(audioChunks, { type: 'audio/webm;codecs=opus' }); // a valid MIME type, unlike 'webm/opus'
  // try changing the type to something supported by Google
});
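Separately, on the Cloud Function side: the RecognitionConfig `encoding` field takes an AudioEncoding enum name, not a MIME type. Recent versions of the Speech-to-Text API also list a WEBM_OPUS encoding, which may fit Chrome recordings. A config sketch under that assumption (not verified against your client library version):

```javascript
// Hypothetical config for browser-recorded audio; WEBM_OPUS availability
// depends on your @google-cloud/speech / API version.
const config = {
  encoding: 'WEBM_OPUS',    // enum name, not 'application/octet-stream'
  sampleRateHertz: 48000,   // matches the MediaRecorder settings above
  languageCode: 'en-US',
};
```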