How to bypass CSP (Content-Security-Policy) using puppeteer's API page.addScriptTag? - google-chrome

scenario:
I use puppeteer to launch Chrome in headless mode and call page.addScriptTag with a cross-domain JavaScript file. If the opened site has a CSP set that restricts script tags to the same origin, how can I bypass this using the puppeteer API?

Use:
await page.setBypassCSP(true)
Documentation
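Note that setBypassCSP takes effect at CSP initialization, so it must be called before the page navigates. A minimal sketch (the script URL is a hypothetical stand-in):
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setBypassCSP(true); // must run before goto()
await page.goto('https://example.com');
await page.addScriptTag({ url: 'https://cdn.example.net/lib.js' }); // hypothetical cross-origin script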

This is my first Stack Overflow contribution, so have mercy on me. I found this workaround to allow you to get past CSP, here.
The basic idea is that you intercept page requests, use a library like node-fetch to make the request yourself, and strip the CSP header when passing the response back to Chrome.
Here's the snippet, which originally came from the GitHub issue tracker.
Replace "example.com" with the website that needs CSP disabled.
const fetch = require('node-fetch')

const requestInterceptor = async (request) => {
  try {
    const url = request.url()
    const requestHeaders = request.headers()
    const acceptHeader = requestHeaders.accept || ''

    if (url.includes("example.com") && acceptHeader.includes('text/html')) {
      const cookiesList = await page.cookies(url)
      const cookies = cookiesList.map(cookie => `${cookie.name}=${cookie.value}`).join('; ')
      delete requestHeaders['x-devtools-emulate-network-conditions-client-id']
      if (requestHeaders.Cookie) {
        requestHeaders.cookie = requestHeaders.Cookie
        delete requestHeaders.Cookie
      }
      const theseHeaders = Object.assign({'cookie': cookies}, requestHeaders, {'accept-language': 'en-US,en'})
      const init = {
        body: request.postData(),
        headers: theseHeaders,
        method: request.method(),
        follow: 20,
      }
      const result = await fetch(url, init)
      const resultHeaders = {}
      result.headers.forEach((value, name) => {
        if (name.toLowerCase() !== 'content-security-policy') {
          resultHeaders[name] = value
        } else {
          console.log('CSP', `omitting CSP`, {originalCSP: value})
        }
      })
      const buffer = await result.buffer()
      await request.respond({
        body: buffer,
        headers: resultHeaders, // respond() expects the key `headers`
        status: result.status,
      })
    } else {
      request.continue();
    }
  } catch (e) {
    console.log("Error while disabling CSP", e);
    request.abort();
  }
}

await page.setRequestInterception(true)
page.on('request', requestInterceptor)

Related

Is There A Way To Intercept Requests Before They Are Fired In Puppeteer?

As of right now, whenever I would like to intercept a request, I'm able to view it through logging, but the request still goes through:
if (req.method() == 'GET' && req.url().indexOf('url/Im/checkingout') > -1) {
  let reqHeaders = req.headers()
  cookie = reqHeaders['cookie']
  var bodyRequest = req.postData()
  console.log(chalk.greenBright(`URL: ${req.url()}`))
  console.log(chalk.greenBright(`Method: ${req.method()}`))
  console.log(chalk.green(`Headers: ${JSON.stringify(reqHeaders)}`))
  console.log(chalk.blue(`Body: ${bodyRequest}`))
  console.log(chalk.magenta(`Cookie: ${cookie}`))
  console.log('\n')
  req.abort()
}
req.continue()
I was wondering if there was a way Puppeteer could act like Burp Suite, where you're able to see the request before it's fired and pause it without it going through.
I've tried this:
if (certainCondition) {
  req.continue()
}
and this pauses the request, but the logs above do not print anything.
Example:
Instructions: once the page navigates, click on 'Find A Shop'.
I'd like for there to be logs while it stays exactly in this state, but there currently aren't any.
reproduce.mjs
import puppeteer from 'puppeteer'

const targetURL = async () => 'https://www.icecream.com/us/en/brands/haagen-dazs/products/banana-peanut-butter-chip-ice-cream';

(async () => {
  const browser = await puppeteer.launch({ headless: false, defaultViewport: null, args: ["--disable-site-isolation-trials"] });
  let page = await browser.newPage();
  const target = await targetURL()
  await page.goto(target);
  let testing = false
  await page.setRequestInterception(true)
  page.on('request', (req) => {
    (async () => {
      if (req.method() == 'GET') {
        let reqHeaders = req.headers()
        let cookie = reqHeaders['cookie']
        let bodyRequest = req.postData()
        console.log(`URL: ${req.url()}`)
        console.log(`Method: ${req.method()}`)
        console.log(`Headers: ${JSON.stringify(reqHeaders)}`)
        console.log(`Body: ${bodyRequest}`)
        console.log(`Cookie: ${cookie}`)
        console.log('\n')
        req.abort()
        //browser.close()
      }
    })()
    if (testing) {
      req.continue();
    }
  })
})()
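For what it's worth, once interception is enabled, an intercepted request simply stalls until continue(), respond(), or abort() is called, so "pausing" can be done by holding onto the request object. A minimal sketch (not from the original post, reusing the URL check from the question):
const held = [];
await page.setRequestInterception(true);
page.on('request', (req) => {
  if (req.method() === 'GET' && req.url().indexOf('url/Im/checkingout') > -1) {
    console.log(`Holding: ${req.url()}`);
    held.push(req); // the request stays pending, like a paused Burp request
  } else {
    req.continue();
  }
});
// ...later, release (or drop) whatever was held:
// held.forEach((req) => req.continue());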

How do I write an async request to get a markdown file's content? Svelte

I'm having a great time building my blog with Svelte, but I'm switching the structure to be accessed through a JSON API.
Right now it's easy to get the markdown metadata and path, but I'd love to also get the content.
How would I modify this posts.json.js file to also get the content?
const allPostFiles = import.meta.glob('../blog/posts/*.md')
const iterablePostFiles = Object.entries(allPostFiles)

const allPosts = await Promise.all(
  iterablePostFiles.map(async ([path, resolver]) => {
    const { metadata } = await resolver()
    const postPath = path.slice(2, -3)
    return {
      meta: metadata,
      path: postPath
    }
  })
)

const sortedPosts = allPosts.sort((a, b) => {
  return new Date(b.meta.date) - new Date(a.meta.date)
})

return {
  body: sortedPosts
}
Install and enable the vite-plugin-markdown
// svelte.config.js
import { plugin as markdown, Mode } from "vite-plugin-markdown";

/** @type {import('@sveltejs/kit').Config} */
export default {
  kit: {
    vite: {
      plugins: [markdown({ mode: Mode.HTML })],
    },
  },
};
Then the content will be available as html, and the frontmatter data as attributes:
iterablePostFiles.map(async ([path, resolver]) => {
  const { attributes, html } = await resolver();
  return {
    attributes,
    html,
    path: path.slice(2, -3),
  };
})
(I suggest adding the metadata into the markdown files via frontmatter.)
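For example, a post's frontmatter block might look like this (field names are illustrative; the date field is what the sorting code above expects):
---
title: Hello World
date: 2021-06-01
---

Post content goes here.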
The answer above works perfectly, but you can also tweak the API with this code:
const allPosts = await Promise.all(
  iterablePostFiles.map(async ([path, resolver]) => {
    const { metadata } = await resolver()
    // because we know every path will start with '..' and end with '.md',
    // we can slice from the beginning and the end
    const postPath = path.slice(2, -3)
    const post = await resolver()
    const content = post.default.render()
    return {
      meta: metadata,
      path: postPath,
      text: content
    }
  })
)
The important addition is this:
const post = await resolver()
const content = post.default.render()
using these intermediate variables to avoid binding the JS reserved word default as a bare identifier.
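An equivalent shortcut, if you prefer: destructure with a rename, since default can't be used directly as a variable name (a sketch assuming the same module shape as above):
const { metadata, default: renderer } = await resolver()
const content = renderer.render()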

DialogflowSDK middleware return after resolving a promise

I'm currently playing around with the actions-on-google Node SDK and I'm struggling to work out how to wait for a promise to resolve in my middleware before it executes my intent handler. I've tried using async/await and returning a promise from my middleware function, but neither method appears to work. I know you typically wouldn't override the intent like I'm doing here, but this is to test what's going on.
const {dialogflow} = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow({debug: true});

function promiseTest() {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve('Resolved');
    }, 2000)
  })
}

app.middleware(async (conv) => {
  let r = await promiseTest();
  conv.intent = r
})

app.fallback(conv => {
  const intent = conv.intent;
  conv.ask("hello, you're intent was " + intent);
});
It looks like I should at least be able to return a promise: https://actions-on-google.github.io/actions-on-google-nodejs/interfaces/dialogflow.dialogflowmiddleware.html
but I'm not familiar with TypeScript, so I'm not sure if I'm reading those docs correctly.
Is anyone able to advise how to do this correctly? For instance, a real-life example might be that I need to make a DB call and wait for it to return in my middleware before proceeding to the next step.
My function is using the Node.js 8 beta in Google Cloud Functions.
The output of this code is whatever the actual intent was (e.g. the Default Welcome Intent) rather than "Resolved", but there are no errors. So the middleware fires, but then moves on to the fallback intent before the promise resolves, i.e. before setting conv.intent = r.
Async stuff is really fiddly with the V2 API, and for me it only worked properly with Node.js 8. The reason is that from V2 onwards, unless you return the promise, the action returns empty, as it has finished before the rest of the function is evaluated. There is a lot to work through to figure it out; here's some sample boilerplate I have that should get you going:
'use strict';

const functions = require('firebase-functions');
const {WebhookClient} = require('dialogflow-fulfillment');
const {BasicCard, MediaObject, Card, Suggestion, Image, Button} = require('actions-on-google');
var http_request = require('request-promise-native');

process.env.DEBUG = 'dialogflow:debug'; // enables lib debugging statements

exports.dialogflowFirebaseFulfillment = functions.https.onRequest((request, response) => {
  const agent = new WebhookClient({ request, response });
  console.log('Dialogflow Request headers: ' + JSON.stringify(request.headers));
  console.log('Dialogflow Request body: ' + JSON.stringify(request.body));

  function welcome(agent) {
    agent.add(`Welcome to my agent!`);
  }

  function fallback(agent) {
    agent.add(`I didn't understand`);
    agent.add(`I'm sorry, can you try again?`);
  }

  function handleMyIntent(agent) {
    let conv = agent.conv();
    let key = request.body.queryResult.parameters['MyParam'];
    var myAgent = agent;
    return new Promise((resolve, reject) => {
      http_request('http://someurl.com').then(async function(apiData) {
        if (key === 'Hey') {
          conv.close('Howdy');
        } else {
          conv.close('Bye');
        }
        myAgent.add(conv);
        return resolve();
      }).catch(function(err) {
        conv.close(' \nUh, oh. There was an error, please try again later');
        myAgent.add(conv);
        return resolve();
      })
    })
  }

  let intentMap = new Map();
  intentMap.set('Default Welcome Intent', welcome);
  intentMap.set('Default Fallback Intent', fallback);
  intentMap.set('myCustomIntent', handleMyIntent);
  agent.handleRequest(intentMap);
});
A brief overview of what you need:
you have to return the promise resolution (see the sketch after this list)
you have to use the 'request-promise-native' package for HTTP requests
you have to upgrade your plan to allow outbound HTTP requests (https://firebase.google.com/pricing/)
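To illustrate the first point, a minimal sketch of a handler that returns the promise chain itself (same shape as handleMyIntent above; the URL is a placeholder):
function handleSimpleIntent(agent) {
  // returning the promise keeps the function alive until the HTTP call finishes
  return http_request('http://someurl.com').then(function (apiData) {
    agent.add('Got a response');
  });
}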
So it turns out my issue was to do with an outdated version of the actions-on-google SDK. The Dialogflow Firebase example was using v2.0.0; changing this to 2.2.0 in the package.json resolved the issue.
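For reference, a minimal sketch of the relevant package.json change (version number taken from the answer above):
{
  "dependencies": {
    "actions-on-google": "^2.2.0"
  }
}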

Empty GET response on requesting JSON file's content | koa2

I am new to koa2, and I am trying to GET the contents of a JSON file with koa2:
app.use(async (ctx) => {
  let url = ctx.request.url;
  if (url == "list") {
    let res = ctx.request.get('http://domain/hello.json');
    ctx.body = res.body;
  }
})
The JSON file hello.json looks like the following:
{"da": "1212", "dad": "12addsf12"}
I want the route /list to return the contents of hello.json; however, the response is empty. What do I do?
Update:
Change the following lines of code:
let res = ctx.request.get('http://domain/hello.json');
ctx.body = res.body;
to:
let res = ctx.get('http://domain/hello.json');
ctx.body = res;
You should get the content now.
Koa by itself does not support routing, only middleware; you need a router middleware for that. Try koa-router.
Your app would look something like
const route = require('koa-route')

app.use(route.get('/list', ctx => {
  // Route handling logic
}))
Also note that ctx.get is an alias for ctx.request.get, which returns header information.
This may not be Koa's way of doing things, but this is what I tried and it worked for me (complete code for noobs like me):
// jshint ignore: start
const koa2 = require("koa2");
const router = require('koa-simple-router');
const app = new koa2();
const request = require('request-promise-native');

// response
app.use(router(_ => {
  _.get('/list', async (ctx) => {
    const options = {
      method: 'GET',
      uri: 'http://www.mocky.io/v2/5af077a1310000540096c672'
    }
    await request(options, function (error, response, body) {
      // I am leaving out error handling on purpose,
      // for brevity's sake. You must handle it in your code.
      ctx.body = body;
    })
  });
}));

app.listen(3000);
And, as J Pichardo's answer points out, Koa by itself does not support routing; you need to use some routing middleware.
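For completeness, a minimal sketch of the same route using koa-router and node-fetch (both assumed installed; the JSON URL is the hypothetical one from the question):
const Koa = require('koa');
const Router = require('koa-router');
const fetch = require('node-fetch');

const app = new Koa();
const router = new Router();

router.get('/list', async (ctx) => {
  // fetch the remote JSON and relay its parsed body to the client
  const res = await fetch('http://domain/hello.json');
  ctx.body = await res.json();
});

app.use(router.routes());
app.listen(3000);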

Possible to run Headless Chrome/Chromium in a Google Cloud Function?

Is there any way to run Headless Chrome/Chromium in a Google Cloud Function? I understand I can include and run statically compiled binaries in GCF. Can I get a statically compiled version of Chrome that would work for this?
The Node.js 8 runtime for Google Cloud Functions now includes all the necessary OS packages to run Headless Chrome.
Here is a code sample of an HTTP function that returns screenshots:
Main index.js file:
const puppeteer = require('puppeteer');

exports.screenshot = async (req, res) => {
  const url = req.query.url;
  if (!url) {
    return res.send('Please provide URL as GET parameter, for example: ?url=https://example.com');
  }
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();
  await page.goto(url);
  const imageBuffer = await page.screenshot();
  await browser.close();
  res.set('Content-Type', 'image/png');
  res.send(imageBuffer);
}
and package.json
{
  "name": "screenshot",
  "version": "0.0.1",
  "dependencies": {
    "puppeteer": "^1.6.2"
  }
}
I've just deployed a GCF function running headless Chrome. A couple of takeaways:
you have to statically compile Chromium and NSS on Debian 8
you have to patch environment variables to point to NSS before launching Chromium
performance is much worse than what you'd get on AWS Lambda (3+ seconds)
For 1, you should be able to find plenty of instructions online.
For 2, the code that I'm using is the following:
static executablePath() {
  let bin = path.join(__dirname, '..', 'bin', 'chromium');
  let nss = path.join(__dirname, '..', 'bin', 'nss', 'Linux3.16_x86_64_cc_glibc_PTH_64_OPT.OBJ');

  if (process.env.PATH === undefined) {
    process.env.PATH = path.join(nss, 'bin');
  } else if (process.env.PATH.indexOf(nss) === -1) {
    process.env.PATH = [path.join(nss, 'bin'), process.env.PATH].join(':');
  }

  if (process.env.LD_LIBRARY_PATH === undefined) {
    process.env.LD_LIBRARY_PATH = path.join(nss, 'lib');
  } else if (process.env.LD_LIBRARY_PATH.indexOf(nss) === -1) {
    process.env.LD_LIBRARY_PATH = [path.join(nss, 'lib'), process.env.LD_LIBRARY_PATH].join(':');
  }

  if (fs.existsSync('/tmp/chromium') === true) {
    return '/tmp/chromium';
  }

  return new Promise(
    (resolve, reject) => {
      try {
        fs.chmod(bin, '0755', () => {
          fs.symlinkSync(bin, '/tmp/chromium');
          return resolve('/tmp/chromium');
        });
      } catch (error) {
        return reject(error);
      }
    }
  );
}
You also need to use a few required arguments when starting Chrome (see the launch sketch after this list), namely:
--disable-dev-shm-usage
--disable-setuid-sandbox
--no-first-run
--no-sandbox
--no-zygote
--single-process
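If you are driving Chrome through puppeteer, these map directly onto launch arguments; a minimal sketch:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: [
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });
  // ...use the browser as usual
  await browser.close();
})();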
I hope this helps.
As mentioned in the comment, work is being done on a possible solution to running a headless browser in a cloud function. A directly applicable discussion, "headless chrome & aws lambda", can be read on Google Groups.
The question at hand was: can you run headless Chrome or Chromium in Firebase Cloud Functions? The answer is NO, since the Node.js project will not have access to any Chrome/Chromium executables and will therefore fail! (TRUST ME - I've tried!)
A better solution is to use the phantom npm package, which uses PhantomJS under the hood:
https://www.npmjs.com/package/phantom
Docs and info can be found here:
http://amirraminfar.com/phantomjs-node/#/
or
https://github.com/amir20/phantomjs-node
The site I was trying to crawl had implemented screen-scraping countermeasures; the trick is to wait for the page to load by searching for an expected string or regex match, i.e. I do a regex for a … If you need a regex of any complexity made for you, get in touch at https://AppLogics.uk/ - starting at £5 (GBP).
Here is a TypeScript snippet to make the HTTP or HTTPS call:
const phantom = require('phantom');
const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
const page: any = await instance.createPage();
const status = await page.open('https://somewebsite.co.uk/');
const content = await page.property('content');
Same again in JavaScript (wrapped in an async IIFE so that await is valid):
const phantom = require('phantom');

(async () => {
  const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page = await instance.createPage();
  const status = await page.open('https://somewebsite.co.uk/');
  const content = await page.property('content');
})();
That's the easy bit! If it's a static page, you're pretty much done and you can parse the HTML with something like the cheerio npm package: https://github.com/cheeriojs/cheerio - an implementation of core jQuery designed for servers!
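For instance, a minimal cheerio sketch (assuming content holds the HTML string from above; the selector is hypothetical):
const cheerio = require('cheerio');
const $ = cheerio.load(content);
const heading = $('h1').first().text(); // jQuery-style querying on the server
console.log(heading);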
However, if it is a dynamically loading page (lazy loading, or even anti-scraping methods), you will need to wait for the page to update by looping, calling the page.property('content') method, and running a text search or regex to see if your page has finished loading.
I have created a generic asynchronous function that returns the page content (as a string) on success and throws an exception on failure or timeout. It takes as parameters the page, text (a string to search for that indicates success), error (a string that indicates failure, or null to not check for errors), and timeout (a number, self-explanatory):
TypeScript:
import { isNullOrUndefined } from 'util'; // isNullOrUndefined comes from Node's util module

async function waitForPageToLoadStr(page: any, text: string, error: string, timeout: number): Promise<string> {
  const maxTime = timeout ? (new Date()).getTime() + timeout : null;
  let html: string = '';
  html = await page.property('content');

  async function loop(): Promise<string> {
    async function checkSuccess(): Promise<boolean> {
      html = await page.property('content');
      if (!isNullOrUndefined(error) && html.includes(error)) {
        throw new Error(`Error string found: ${error}`);
      }
      if (maxTime && (new Date()).getTime() >= maxTime) {
        throw new Error(`Timed out waiting for string: ${text}`);
      }
      return html.includes(text)
    }
    if (await checkSuccess()) {
      return html;
    } else {
      return loop();
    }
  }

  return await loop();
}
JavaScript:
function waitForPageToLoadStr(page, text, error, timeout) {
  return __awaiter(this, void 0, void 0, function* () {
    const maxTime = timeout ? (new Date()).getTime() + timeout : null;
    let html = '';
    html = yield page.property('content');

    function loop() {
      return __awaiter(this, void 0, void 0, function* () {
        function checkSuccess() {
          return __awaiter(this, void 0, void 0, function* () {
            html = yield page.property('content');
            if (!isNullOrUndefined(error) && html.includes(error)) {
              throw new Error(`Error string found: ${error}`);
            }
            if (maxTime && (new Date()).getTime() >= maxTime) {
              throw new Error(`Timed out waiting for string: ${text}`);
            }
            return html.includes(text);
          });
        }
        if (yield checkSuccess()) {
          return html;
        } else {
          return loop();
        }
      });
    }

    return yield loop();
  });
}
I have personally used this function like this:
TypeScript:
try {
  const phantom = require('phantom');
  const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page: any = await instance.createPage();
  const status = await page.open('https://somewebsite.co.uk/');
  await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
} catch (error) {
  console.error(error);
}
JavaScript:
(async () => {
  try {
    const phantom = require('phantom');
    const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page = await instance.createPage();
    await page.open('https://vehicleenquiry.service.gov.uk/');
    await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
  } catch (error) {
    console.error(error);
  }
})();
Happy crawling!