Is there any way to run Headless Chrome/Chromium in a Google Cloud Function? I understand I can include and run statically compiled binaries in GCF. Can I get a statically compiled version of Chrome that would work for this?
The Node.js 8 runtime for Google Cloud Functions now includes all the necessary OS packages to run Headless Chrome.
Here is a code sample of an HTTP function that returns screenshots:
Main index.js file:
const puppeteer = require('puppeteer');

exports.screenshot = async (req, res) => {
  const url = req.query.url;
  if (!url) {
    return res.send('Please provide URL as GET parameter, for example: ?url=https://example.com');
  }
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();
  await page.goto(url);
  const imageBuffer = await page.screenshot();
  await browser.close();
  res.set('Content-Type', 'image/png');
  res.send(imageBuffer);
};
and package.json
{
  "name": "screenshot",
  "version": "0.0.1",
  "dependencies": {
    "puppeteer": "^1.6.2"
  }
}
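Note that headless Chrome is memory-hungry, so it can be worth deploying the function with extra memory. Something along these lines should work (adjust the memory limit to taste; you may also need --region):
gcloud functions deploy screenshot --runtime nodejs8 --trigger-http --memory 1024MB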
I've just deployed a GCF function running headless Chrome. A couple of takeaways:
1. you have to statically compile Chromium and NSS on Debian 8
2. you have to patch environment variables to point to NSS before launching Chromium
3. performance is much worse than what you'd get on AWS Lambda (3+ seconds)
For 1, you should be able to find plenty of instructions online.
For 2, the code that I'm using is the following:
static executablePath() {
  let bin = path.join(__dirname, '..', 'bin', 'chromium');
  let nss = path.join(__dirname, '..', 'bin', 'nss', 'Linux3.16_x86_64_cc_glibc_PTH_64_OPT.OBJ');

  // Prepend the NSS tools to PATH so Chromium can find them.
  if (process.env.PATH === undefined) {
    process.env.PATH = path.join(nss, 'bin');
  } else if (process.env.PATH.indexOf(nss) === -1) {
    process.env.PATH = [path.join(nss, 'bin'), process.env.PATH].join(':');
  }

  // Prepend the NSS shared libraries to LD_LIBRARY_PATH.
  if (process.env.LD_LIBRARY_PATH === undefined) {
    process.env.LD_LIBRARY_PATH = path.join(nss, 'lib');
  } else if (process.env.LD_LIBRARY_PATH.indexOf(nss) === -1) {
    process.env.LD_LIBRARY_PATH = [path.join(nss, 'lib'), process.env.LD_LIBRARY_PATH].join(':');
  }

  // The bundled binary is not executable, so chmod it and expose it
  // through a symlink in the writable /tmp directory.
  if (fs.existsSync('/tmp/chromium') === true) {
    return '/tmp/chromium';
  }
  return new Promise((resolve, reject) => {
    try {
      fs.chmod(bin, '0755', () => {
        fs.symlinkSync(bin, '/tmp/chromium');
        return resolve('/tmp/chromium');
      });
    } catch (error) {
      return reject(error);
    }
  });
}
You also need to use a few required arguments when starting Chrome, namely:
--disable-dev-shm-usage
--disable-setuid-sandbox
--no-first-run
--no-sandbox
--no-zygote
--single-process
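Putting these together, here is a minimal sketch of the launch call, assuming you're launching via Puppeteer and using the executablePath() helper above:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    executablePath: await executablePath(), // the helper shown above
    args: [
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });
  // ... use the browser as usual, then:
  await browser.close();
})();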
I hope this helps.
As mentioned in the comment, work is being done on a possible solution to running a headless browser in a cloud function. A directly applicable discussion, "headless chrome & aws lambda", can be read on Google Groups.
The question at hand was: can you run headless Chrome or Chromium in Firebase Cloud Functions? The answer is NO, since the Node.js project will not have access to any Chrome/Chromium executables and will therefore fail (trust me, I've tried!).
A better solution is to use the phantom npm package, which uses PhantomJS under the hood:
https://www.npmjs.com/package/phantom
Docs and info can be found here:
http://amirraminfar.com/phantomjs-node/#/
or
https://github.com/amir20/phantomjs-node
The site I was trying to crawl had implemented screen-scraping countermeasures, so the trick is to wait for the page to load by searching for an expected string or a regex match (i.e. I do a regex for a ...).
Here is a TypeScript snippet to make the HTTP or HTTPS call:
const phantom = require('phantom');
const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
const page: any = await instance.createPage();
const status = await page.open('https://somewebsite.co.uk/');
const content = await page.property('content');
And the same again in JavaScript (inside an async function, since it uses await):
const phantom = require('phantom');
const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
const page = await instance.createPage();
const status = await page.open('https://somewebsite.co.uk/');
const content = await page.property('content');
That's the easy bit! If it's a static page, you're pretty much done and you can parse the HTML with something like the cheerio npm package: https://github.com/cheeriojs/cheerio - an implementation of core jQuery designed for servers!
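For instance, a minimal sketch of feeding the content string from above into cheerio (the selector is a placeholder for whatever you're after):
const cheerio = require('cheerio');

// 'content' is the HTML string returned by page.property('content') above.
const $ = cheerio.load(content);
const title = $('h1').first().text(); // query with familiar jQuery-style selectors
console.log(title);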
However, if it is a dynamically loading page (lazy loading, or even anti-scraping methods), you will need to wait for the page to update by looping, calling the page.property('content') method, and running a text search or regex to see if the page has finished loading.
I have created a generic asynchronous function that returns the page content (as a string) on success and throws an exception on failure or timeout. It takes as parameters the page, text (a string to search for that indicates success), error (a string that indicates failure, or null to not check for an error), and timeout (a number, self-explanatory):
TypeScript:
const { isNullOrUndefined } = require('util');

async function waitForPageToLoadStr(page: any, text: string, error: string, timeout: number): Promise<string> {
    const maxTime = timeout ? (new Date()).getTime() + timeout : null;
    let html: string = '';
    html = await page.property('content');
    async function loop(): Promise<string> {
        async function checkSuccess(): Promise<boolean> {
            html = await page.property('content');
            if (!isNullOrUndefined(error) && html.includes(error)) {
                throw new Error(`Error string found: ${error}`);
            }
            if (maxTime && (new Date()).getTime() >= maxTime) {
                throw new Error(`Timed out waiting for string: ${text}`);
            }
            return html.includes(text);
        }
        if (await checkSuccess()) {
            return html;
        } else {
            return loop();
        }
    }
    return await loop();
}
JavaScript:
async function waitForPageToLoadStr(page, text, error, timeout) {
    const maxTime = timeout ? (new Date()).getTime() + timeout : null;
    let html = '';
    html = await page.property('content');
    async function loop() {
        async function checkSuccess() {
            html = await page.property('content');
            if (error !== null && error !== undefined && html.includes(error)) {
                throw new Error(`Error string found: ${error}`);
            }
            if (maxTime && (new Date()).getTime() >= maxTime) {
                throw new Error(`Timed out waiting for string: ${text}`);
            }
            return html.includes(text);
        }
        if (await checkSuccess()) {
            return html;
        }
        return loop();
    }
    return loop();
}
I have personally used this function like this:
TypeScript:
try {
    const phantom = require('phantom');
    const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page: any = await instance.createPage();
    const status = await page.open('https://somewebsite.co.uk/');
    await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
} catch (error) {
    console.error(error);
}
JavaScript:
try {
    const phantom = require('phantom');
    const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page = await instance.createPage();
    await page.open('https://vehicleenquiry.service.gov.uk/');
    await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
} catch (error) {
    console.error(error);
}
Happy crawling!
I have created a URL scraper function, working and tested on Google Cloud, but I am really drawing a blank on how to invoke it. I have tried two methods: one using the cloud_functions package, and the other using a standard HTTPS GET. I've tried looking online, but none of the solutions/guides involve functions that take an input from the Flutter app and return an output back to the app.
Here's the structure of the function (which is working alright). I've named this function Parse in Google Cloud Platform.
<PYTHON PACKAGE IMPORTS>

def Parser(url):
    <URL PARSE FUNCTIONS>
    return source, datetime, imageurl, keyword

def invoke_parse(request):
    request_json = request.get_json(silent=True)
    file = Parser(request_json['url'])
    return jsonify({
        "source": file[0],
        "datetime": file[1],
        "imageurl": file[2],
        "keyword": file[3],
    })
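Note that invoke_parse reads the URL from a JSON body, so any client calling it has to send a POST with that shape. A minimal sketch of such a request, shown in Node.js purely for illustration (the function URL is a placeholder):
const https = require('https');

// Hypothetical test call; the Flutter client must send the same shape:
// a POST with a JSON body containing the url.
const functionUrl = '<FUNCTION URL>';
const body = JSON.stringify({ url: 'https://example.com/article' });
const req = https.request(functionUrl, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Content-Length': Buffer.byteLength(body),
  },
}, (res) => {
  let data = '';
  res.on('data', (chunk) => (data += chunk));
  res.on('end', () => console.log(JSON.parse(data)));
});
req.write(body);
req.end();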
The first method I tried was using an HTTP call to GET the function's URL. But that isn't working, even though there are no errors - I suspect it's just returning nothing.
parser(String url) async { // Here I honestly don't know where to use the url input within the function
  var uri = Uri.parse(<Function URL String>);
  HttpClient client;
  try {
    var request = await client.getUrl(uri);
    var response = await request.close();
    if (response.statusCode == HttpStatus.ok) {
      var json = await response.transform(utf8.decoder).join();
      Map data = jsonDecode(json) as Map;
      source = data['source']; // These are the variables used in the main Flutter app
      postedAt = data['datetime'];
      _imageUrl = data['image'];
      keyword = data['keyword'];
    } else {
      print('Error running parse:\nHttp status ${response.statusCode}');
    }
  } catch (exception) {
    print('Failed invoking the parse function.');
  }
}
That didn't work, so I thought I might alternatively use the cloud_functions package as follows (in lieu of the previous):
parser(String url) async {
  var functionUrl = <FUNCTION URL>;
  HttpsCallable callable = CloudFunctions.instance.getHttpsCallable(functionName: 'Parse')
    ..timeout = const Duration(seconds: 30);
  try {
    final HttpsCallableResult result = await callable.call(
      <String, dynamic>{
        'url': url,
      }
    );
    setState(() {
      source = result.data['source']; // THESE ARE VARIABLES USED IN THE FLUTTER APP
      postedAt = result.data['datetime'];
      _imageUrl = result.data['image'];
      keyword = result.data['keyword'];
    });
  } on CloudFunctionsException catch (e) {
    print('caught firebase functions exception');
    print(e.code);
    print(e.message);
    print(e.details);
  } catch (e) {
    print('caught generic exception');
    print(e);
  }
}
In the latter case, the code ran without errors but doesn't work. My flutter log states the following error:
I/flutter ( 2821): caught generic exception
I/flutter ( 2821): PlatformException(functionsError, Cloud function failed with exception., {code: NOT_FOUND, details: null, message: NOT_FOUND})
which I'm assuming is also an error from not being able to reach the function.
Any help on how I should go about processing my function would be appreciated. Apologies if something is a really obvious solution, but I am not familiar as much with HTTP requests and cloud platforms.
Thanks and cheers.
Node.js backend function
const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

exports.test = functions.https.onCall(async (data, context) => {
  functions.logger.info("Hello logs: ", {structuredData: true});
  functions.logger.info(data.token, {structuredData: true});
});
Flutter frontend
1 - pubspec.yaml
cloud_functions: ^1.1.2
2 - Code
HttpsCallable callable = FirebaseFunctions.instance.httpsCallable('test');
final HttpsCallableResult results = await callable.call<Map>( {
'token': token,
});
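If you also need data back in the app, the callable can simply return a map from the Node.js side; whatever is returned becomes result.data in Flutter. A minimal sketch (parseUrl here is a hypothetical stand-in for your parsing logic):
const functions = require("firebase-functions");

exports.parse = functions.https.onCall(async (data, context) => {
  // 'data.url' arrives from callable.call({'url': url}) on the Flutter side.
  const result = await parseUrl(data.url); // hypothetical parser
  // Whatever is returned here becomes result.data in Flutter.
  return {
    source: result.source,
    datetime: result.datetime,
    imageurl: result.imageurl,
    keyword: result.keyword,
  };
});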
I'm currently playing around with the actions-on-google Node.js SDK and I'm struggling to work out how to wait for a promise to resolve in my middleware before it executes my intent handler. I've tried using async/await and returning a promise from my middleware function, but neither method appears to work. I know you typically wouldn't override the intent like I'm doing here, but this is to test what's going on.
const {dialogflow} = require('actions-on-google');
const functions = require('firebase-functions');
const app = dialogflow({debug: true});
function promiseTest() {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve('Resolved');
    }, 2000);
  });
}

app.middleware(async (conv) => {
  let r = await promiseTest();
  conv.intent = r;
});

app.fallback(conv => {
  const intent = conv.intent;
  conv.ask("hello, your intent was " + intent);
});
It looks like I should at least be able to return a promise: https://actions-on-google.github.io/actions-on-google-nodejs/interfaces/dialogflow.dialogflowmiddleware.html
but I'm not familiar with TypeScript, so I'm not sure if I'm reading those docs correctly.
Is anyone able to advise how to do this correctly? A real-life example might be that I need to make a DB call and wait for it to return in my middleware before proceeding to the next step.
My function is using the Node.js 8 beta runtime in Google Cloud Functions.
The output of this code is whatever the actual intent was (e.g. the Default Welcome Intent) rather than "Resolved", but there are no errors. So the middleware fires, but then moves on to the fallback intent before the promise resolves, i.e. before setting conv.intent = r.
Async stuff is really fiddly with the V2 API, and for me it only worked properly with Node.js 8. The reason is that from V2 onwards, unless you return the promise, the action returns empty because it finishes before the rest of the function is evaluated. There is a lot to work through to figure it out; here's some sample boilerplate I have that should get you going:
'use strict';

const functions = require('firebase-functions');
const {WebhookClient} = require('dialogflow-fulfillment');
const {BasicCard, MediaObject, Card, Suggestion, Image, Button} = require('actions-on-google');
var http_request = require('request-promise-native');

process.env.DEBUG = 'dialogflow:debug'; // enables lib debugging statements

exports.dialogflowFirebaseFulfillment = functions.https.onRequest((request, response) => {
  const agent = new WebhookClient({ request, response });
  console.log('Dialogflow Request headers: ' + JSON.stringify(request.headers));
  console.log('Dialogflow Request body: ' + JSON.stringify(request.body));

  function welcome(agent) {
    agent.add(`Welcome to my agent!`);
  }

  function fallback(agent) {
    agent.add(`I didn't understand`);
    agent.add(`I'm sorry, can you try again?`);
  }

  function handleMyIntent(agent) {
    let conv = agent.conv();
    let key = request.body.queryResult.parameters['MyParam'];
    var myAgent = agent;
    return new Promise((resolve, reject) => {
      http_request('http://someurl.com').then(async function(apiData) {
        if (key === 'Hey') {
          conv.close('Howdy');
        } else {
          conv.close('Bye');
        }
        myAgent.add(conv);
        return resolve();
      }).catch(function(err) {
        conv.close(' \nUh, oh. There was an error, please try again later');
        myAgent.add(conv);
        return resolve();
      });
    });
  }

  let intentMap = new Map();
  intentMap.set('Default Welcome Intent', welcome);
  intentMap.set('Default Fallback Intent', fallback);
  intentMap.set('myCustomIntent', handleMyIntent);
  agent.handleRequest(intentMap);
});
A brief overview of what you need:
you have to return the promise resolution.
you have to use the 'request-promise-native' package for HTTP requests
you have to upgrade your plan to allow for outbound HTTP requests (https://firebase.google.com/pricing/)
So it turns out my issue was to do with an outdated version of the actions-on-google SDK. The Dialogflow Firebase example was using v2.0.0; changing this to 2.2.0 in package.json resolved the issue.
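For reference, a minimal sketch of the package.json change (the rest of the file stays as the sample had it):
{
  "dependencies": {
    "actions-on-google": "^2.2.0"
  }
}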
Being new to Elasticsearch, I am exploring it by integrating it with Node and trying to execute the following online git example on Windows.
https://github.com/sitepoint-editors/node-elasticsearch-tutorial
While trying to import the data of 1000 items from data.json, the execution of 'node index.js' is failing with the following error.
By enabling the trace, I now see the following as the root cause, coming from the bulk function:
"error": "Content-Type header [application/x-ldjson] is not supported",
"status": 406
I see a changelog entry at https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/changelog.html which says the following:
13.0.0 (Apr 24 2017): bulk and other APIs that send line-delimited JSON bodies now use the Content-Type: application/x-ndjson header (#507)
Any idea how to resolve this content type issue in index.js?
index.js
(function () {
  'use strict';

  const fs = require('fs');
  const elasticsearch = require('elasticsearch');
  const esClient = new elasticsearch.Client({
    host: 'localhost:9200',
    log: 'error'
  });

  const bulkIndex = function bulkIndex(index, type, data) {
    let bulkBody = [];
    data.forEach(item => {
      bulkBody.push({
        index: {
          _index: index,
          _type: type,
          _id: item.id
        }
      });
      bulkBody.push(item);
    });
    esClient.bulk({body: bulkBody})
      .then(response => {
        console.log(`Inside bulk3...`);
        let errorCount = 0;
        response.items.forEach(item => {
          if (item.index && item.index.error) {
            console.log(++errorCount, item.index.error);
          }
        });
        console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);
      })
      .catch(console.error);
  };

  // only for testing purposes
  // all calls should be initiated through the module
  const test = function test() {
    const articlesRaw = fs.readFileSync('data.json');
    const articles = JSON.parse(articlesRaw);
    console.log(`${articles.length} items parsed from data file`);
    bulkIndex('library', 'article', articles);
  };

  test();

  module.exports = {
    bulkIndex
  };
}());
my local windows environment:
java version 1.8.0_121
elasticsearch version 6.1.1
node version v8.9.4
npm version 5.6.0
The bulk function doesn't return a promise. It accepts a callback function as a parameter.
esClient.bulk(
  {body: bulkBody},
  function (err, response) {
    if (err) { console.error(err); return; }
    console.log(`Inside bulk3...`);
    let errorCount = 0;
    response.items.forEach(item => {
      if (item.index && item.index.error) {
        console.log(++errorCount, item.index.error);
      }
    });
    console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);
  }
)
or use util.promisify to convert a function accepting an (err, value) => ... style callback into a function that returns a promise:
const util = require('util');

// bind() keeps the client as `this` when the method is invoked internally
const esClientBulk = util.promisify(esClient.bulk.bind(esClient));
esClientBulk({body: bulkBody})
  .then(...)
  .catch(...)
EDIT: Just found out that elasticsearch-js supports both callbacks and promises. So this should not be an issue.
By looking at the package.json of the project you've linked, it uses elasticsearch-js version ^11.0.1, which is an old version that sends requests with the application/x-ldjson header for bulk upload; this is not supported by newer Elasticsearch versions. Upgrading elasticsearch-js to a newer version (the current latest is 14.0.0) should fix it.
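For example, bumping the client in the project's package.json (the client is published on npm as elasticsearch) and re-running npm install:
{
  "dependencies": {
    "elasticsearch": "^14.0.0"
  }
}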
Scenario:
I use Puppeteer to launch Chrome in headless mode and call page.addScriptTag with a cross-domain JavaScript file. Now, if the site being opened has CSP set and restricts script tags to same-origin JavaScript only, how can I bypass this using the Puppeteer API?
Use:
await page.setBypassCSP(true)
See the Puppeteer documentation for setBypassCSP for details.
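A minimal sketch; note that setBypassCSP has to be called before navigating to the page (the script URL here is a placeholder):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setBypassCSP(true); // must be set before page.goto()
  await page.goto('https://example.com');
  // The cross-origin script tag is no longer blocked by CSP.
  await page.addScriptTag({ url: 'https://cdn.example.com/some-script.js' });
  await browser.close();
})();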
This is my first Stack Overflow contribution, so have mercy on me. I found this workaround to allow you to get past CSP, here.
The basic idea is that you intercept page requests and use a library like node-fetch to make the request yourself, stripping the CSP header when passing the response back to Chrome.
Here's the snippet that originally came from the GitHub issue tracker.
Replace "example.com" with the website that needs to have CSP disabled.
const fetch = require('node-fetch')

const requestInterceptor = async (request) => {
  try {
    const url = request.url()
    const requestHeaders = request.headers()
    const acceptHeader = requestHeaders.accept || ''
    if (url.includes("example.com") && (acceptHeader.includes('text/html'))) {
      // Re-issue the request ourselves, carrying over cookies and headers.
      const cookiesList = await page.cookies(url)
      const cookies = cookiesList.map(cookie => `${cookie.name}=${cookie.value}`).join('; ')
      delete requestHeaders['x-devtools-emulate-network-conditions-client-id']
      if (requestHeaders.Cookie) {
        requestHeaders.cookie = requestHeaders.Cookie
        delete requestHeaders.Cookie
      }
      const theseHeaders = Object.assign({'cookie': cookies}, requestHeaders, {'accept-language': 'en-US,en'})
      const init = {
        body: request.postData(),
        headers: theseHeaders,
        method: request.method(),
        follow: 20,
      }
      const result = await fetch(url, init)
      // Copy the response headers, dropping Content-Security-Policy.
      const resultHeaders = {}
      result.headers.forEach((value, name) => {
        if (name.toLowerCase() !== 'content-security-policy') {
          resultHeaders[name] = value
        } else {
          console.log('CSP', `omitting CSP`, {originalCSP: value})
        }
      })
      const buffer = await result.buffer()
      await request.respond({
        body: buffer,
        headers: resultHeaders,
        status: result.status,
      })
    } else {
      request.continue();
    }
  } catch (e) {
    console.log("Error while disabling CSP", e);
    request.abort();
  }
}

await page.setRequestInterception(true)
page.on('request', requestInterceptor)
I am running a simple node script which starts chromedriver pointed at my website, scrolls to the bottom of the page, and writes the trace to trace.json.
This file is around 30MB.
I can't seem to load this file in chrome://tracing/, which is what I assume I would do in order to view the profile data.
What are my options for making sense of my trace.json file?
Here is my node script, in case that helps clarify what I am up to:
'use strict';

var fs = require('fs');
var wd = require('wd');
var b = wd.promiseRemote('http://localhost:9515');

b.init({
    browserName: 'chrome',
    chromeOptions: {
        perfLoggingPrefs: {
            'traceCategories': 'toplevel,disabled-by-default-devtools.timeline.frame,blink.console,disabled-by-default-devtools.timeline,benchmark'
        },
        args: ['--enable-gpu-benchmarking', '--enable-threaded-compositing']
    },
    loggingPrefs: {
        performance: 'ALL'
    }
}).then(function () {
    return b.get('http://www.example.com');
}).then(function () {
    // We only want to measure interaction, so getting a log once here
    // flushes any previous tracing logs we have.
    return b.log('performance');
}).then(function () {
    // Smooth scroll to bottom.
    return b.execute(`
        var height = Math.max(document.documentElement.scrollHeight, document.body.scrollHeight, document.documentElement.clientHeight);
        chrome.gpuBenchmarking.smoothScrollBy(height, function (){});
    `);
}).then(function () {
    // Wait for the above action to complete.
    return b.sleep(5000);
}).then(function () {
    // Get all the trace logs since last time log('performance') was called.
    return b.log('performance');
}).then(function (data) {
    // Write the file to disk.
    return fs.writeFileSync('trace.json', JSON.stringify(data.map(function (s) {
        return JSON.parse(s.message); // This is needed since Selenium outputs logs as strings.
    })));
}).fin(function () {
    return b.quit();
}).done();
Your script doesn't generate the correct format. The required data for each entry is located in message.message.params.
To generate a trace that can be loaded in chrome://tracing:
var fs = require('fs');
var webdriver = require('selenium-webdriver');

var driver = new webdriver.Builder()
    .withCapabilities({
        browserName: 'chrome',
        loggingPrefs: { performance: 'ALL' },
        chromeOptions: {
            args: ['--enable-gpu-benchmarking', '--enable-threaded-compositing'],
            perfLoggingPrefs: {
                'traceCategories': 'toplevel,disabled-by-default-devtools.timeline.frame,blink.console,disabled-by-default-devtools.timeline,benchmark'
            }
        }
    }).build();

driver.get('https://www.google.com/ncr');
driver.sleep(1000);

// generate a trace file loadable in chrome://tracing
driver.manage().logs().get('performance').then(function (data) {
    fs.writeFileSync('trace.json', JSON.stringify(data.map(function (d) {
        return JSON.parse(d['message'])['message']['params'];
    })));
});

driver.quit();
The same script with python:
import json, time
from selenium import webdriver
driver = webdriver.Chrome(desired_capabilities = {
'loggingPrefs': { 'performance': 'ALL' },
'chromeOptions': {
"args" : ['--enable-gpu-benchmarking', '--enable-thread-composting'],
"perfLoggingPrefs" : {
"traceCategories": "toplevel,disabled-by-default-devtools.timeline.frame,blink.console,disabled-by-default-devtools.timeline,benchmark"
}
}
})
driver.get('https://stackoverflow.com')
time.sleep(1)
# generate a trace file loadable in chrome://tracing
with open(r"trace.json", 'w') as f:
f.write(json.dumps([json.loads(d['message'])['message']['params'] for d in driver.get_log('performance')]))
driver.quit()
In case you don't know it, the recommended library for parsing these traces is https://github.com/ChromeDevTools/devtools-frontend
Also, the recommended categories are __metadata,benchmark,devtools.timeline,rail,toplevel,disabled-by-default-v8.cpu_profiler,disabled-by-default-devtools.timeline,disabled-by-default-devtools.timeline.frame,blink.user_timing,v8.execute,disabled-by-default-devtools.screenshot
It's a very old question, but I hope this helps others.