Node Elasticsearch - Bulk indexing not working - Content-Type header [application/x-ldjson] is not supported

I am new to Elasticsearch and am exploring it by integrating it with Node, trying to run the following online Git example on Windows.
https://github.com/sitepoint-editors/node-elasticsearch-tutorial
While trying to import the data of 1000 items from data.json, running 'node index.js' fails with the following error.
By enabling the trace, I now see the following from the bulk function as the root cause:
"error": "Content-Type header [application/x-ldjson] is not supported",
"status": 406
I see a change log at https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/changelog.html which says the following:
13.0.0 (Apr 24 2017) bulk and other APIs that send line-delimited JSON bodies now use the Content-Type: application/x-ndjson header #507
Any idea how to resolve this content type issue in index.js?
index.js
(function () {
  'use strict';

  const fs = require('fs');
  const elasticsearch = require('elasticsearch');

  const esClient = new elasticsearch.Client({
    host: 'localhost:9200',
    log: 'error'
  });

  const bulkIndex = function bulkIndex(index, type, data) {
    let bulkBody = [];

    data.forEach(item => {
      bulkBody.push({
        index: {
          _index: index,
          _type: type,
          _id: item.id
        }
      });
      bulkBody.push(item);
    });

    esClient.bulk({body: bulkBody})
      .then(response => {
        console.log(`Inside bulk3...`);
        let errorCount = 0;
        response.items.forEach(item => {
          if (item.index && item.index.error) {
            console.log(++errorCount, item.index.error);
          }
        });
        console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);
      })
      .catch(console.error); // note: console.error, not console.err
  };

  // only for testing purposes
  // all calls should be initiated through the module
  const test = function test() {
    const articlesRaw = fs.readFileSync('data.json');
    const articles = JSON.parse(articlesRaw);
    console.log(`${articles.length} items parsed from data file`);
    bulkIndex('library', 'article', articles);
  };

  test();

  module.exports = {
    bulkIndex
  };
}());
My local Windows environment:
Java version 1.8.0_121
Elasticsearch version 6.1.1
Node version v8.9.4
npm version 5.6.0

The bulk function doesn't return a promise. It accepts a callback function as a parameter.
esClient.bulk(
  {body: bulkBody},
  function (err, response) {
    if (err) {
      console.error(err); // note: console.error, not console.err
      return;
    }
    console.log(`Inside bulk3...`);
    let errorCount = 0;
    response.items.forEach(item => {
      if (item.index && item.index.error) {
        console.log(++errorCount, item.index.error);
      }
    });
    console.log(`Successfully indexed ${data.length - errorCount} out of ${data.length} items`);
  }
)
or use util.promisify to convert a function accepting an (err, value) => ... style callback into a function that returns a promise:
const util = require('util');
// bind so that bulk keeps the client as its `this` context
const esClientBulk = util.promisify(esClient.bulk.bind(esClient));
esClientBulk({body: bulkBody})
  .then(...)
  .catch(...)
EDIT: Just found out that elasticsearch-js supports both callbacks and promises. So this should not be an issue.
Looking at the package.json of the project you've linked, it uses elasticsearch-js version ^11.0.1, which is an old version that sends bulk uploads with the application/x-ldjson Content-Type header; that header is not supported by newer Elasticsearch versions. So upgrading elasticsearch-js to a newer version (the current latest is 14.0.0) should fix it.
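The fix is just a dependency bump. A minimal sketch (per the changelog quoted above, any 13.x or later client sends Content-Type: application/x-ndjson; 14.0.0 is the latest at the time of writing):
npm install elasticsearch@latest --save
or edit the entry in package.json by hand ("elasticsearch": "^14.0.0") and re-run npm install.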

Is it possible to perform an action with `context` on the init of the app?

I'm simply looking for something like this:
app.on('init', async context => {
  ...
})
Basically I just need to make two calls to the GitHub API, but I'm not sure there is a way to do it without using the API client inside the Context object.
I ended up using probot-scheduler:
const createScheduler = require('probot-scheduler')

module.exports = app => {
  createScheduler(app, {
    delay: false
  })

  app.on('schedule.repository', context => {
    // this is called on startup and can access context
  })
}
I tried probot-scheduler but it didn't exist - perhaps removed in an update?
In any case, I managed to do it after lots of digging by using the actual app object - its .auth() method returns a promise containing the GitHubAPI interface:
https://probot.github.io/api/latest/classes/application.html#auth
module.exports = app => {
  const router = app.route(); // an express router exposed by probot
  router.get('/hello-world', async (req, res) => {
    const github = await app.auth();
    const result = await github.repos.listForOrg({'org': 'org name'});
    console.log(result);
  })
}
.auth() takes the ID of the installation if you wish to access private data. If called empty, the client can only retrieve public data.
You can get the installation ID by calling .auth() without parameters and then listInstallations():
const github = await app.auth();
const result = await github.apps.listInstallations();
console.log(result);
You get an array including IDs that you can then pass to .auth().
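Putting it together, a minimal sketch (assuming an @octokit/rest-style response where the installation array lives on .data; field names may differ between client versions):
const github = await app.auth();
const { data: installations } = await github.apps.listInstallations();
// authenticate as a specific installation to access its private data
const installationClient = await app.auth(installations[0].id);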

Table to JSON using node.js

I am trying to convert a table to JSON so I can search the data easily. The URL is: http://www.tppcrpg.net/rarity.html
I found this package:
https://www.npmjs.com/package/tabletojson
I tried to use it like this:
'use strict';
const tabletojson = require('tabletojson');

tabletojson.convertUrl(
  'http://www.tppcrpg.net/rarity.html',
  { useFirstRowForHeadings: true },
  function(tablesAsJson) {
    console.log(tablesAsJson[1]);
  }
);
However, it returns undefined in the console. Are there any alternative options, or am I using the package wrong?
Hey, you are actually getting data; just change the console.log. Your output contains only one table in total, but you are logging tablesAsJson[1], and array indexes start at [0].
'use strict';
const tabletojson = require('tabletojson');

tabletojson.convertUrl(
  'http://www.tppcrpg.net/rarity.html',
  function(tablesAsJson) {
    console.log(tablesAsJson[0]);
  }
);
For better-looking code, use the promise-based API:
const url = 'http://www.tppcrpg.net/rarity.html';

tabletojson.convertUrl(url)
  .then((data) => {
    console.log(data[0]);
  })
  .catch((err) => {
    console.log('err', err); // catch any error
  });
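Once the table is converted, searching the data is a plain array operation. A small sketch (the Rarity and Name keys are hypothetical; log rows[0] to see the real headers of this table):
tabletojson.convertUrl(url)
  .then((data) => {
    const rows = data[0];
    // 'Rarity' and 'Name' are assumed column names, not verified against the page
    const common = rows.filter(row => row['Rarity'] === 'Common');
    console.log(common.map(row => row['Name']));
  });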

DialogflowSDK middleware return after resolving a promise

I'm currently playing around with the actions-on-google Node SDK and I'm struggling to work out how to wait for a promise to resolve in my middleware before it then executes my intent. I've tried using async/await and returning a promise from my middleware function, but neither method appears to work. I know you typically wouldn't override the intent like I'm doing here, but this is to test what's going on.
const {dialogflow} = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow({debug: true});

function promiseTest() {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve('Resolved');
    }, 2000)
  })
}

app.middleware(async (conv) => {
  let r = await promiseTest();
  conv.intent = r
})

app.fallback(conv => {
  const intent = conv.intent;
  conv.ask("hello, your intent was " + intent);
});
It looks like I should at least be able to return a promise (https://actions-on-google.github.io/actions-on-google-nodejs/interfaces/dialogflow.dialogflowmiddleware.html), but I'm not familiar with TypeScript, so I'm not sure if I'm reading those docs correctly.
Is anyone able to advise how to do this correctly? For instance, a real-life example might be that I need to make a DB call and wait for it to return in my middleware before proceeding to the next step.
My function is using the Node.js 8 beta runtime in Google Cloud Functions.
The output of this code is whatever the actual intent was (e.g. the default welcome intent) rather than "Resolved", but there are no errors. So the middleware fires, but then moves on to the fallback intent before the promise resolves, i.e. before conv.intent = r is set.
Async stuff is really fiddly with the V2 API, and for me it only worked properly with Node.js 8. The reason is that from V2 onwards, unless you return the promise, the action returns empty because it finishes before the rest of the function is evaluated. There is a lot to work through to figure it out; here's some sample boilerplate I have that should get you going:
'use strict';

const functions = require('firebase-functions');
const {WebhookClient} = require('dialogflow-fulfillment');
const {BasicCard, MediaObject, Card, Suggestion, Image, Button} = require('actions-on-google');
var http_request = require('request-promise-native');

process.env.DEBUG = 'dialogflow:debug'; // enables lib debugging statements

exports.dialogflowFirebaseFulfillment = functions.https.onRequest((request, response) => {
  const agent = new WebhookClient({ request, response });
  console.log('Dialogflow Request headers: ' + JSON.stringify(request.headers));
  console.log('Dialogflow Request body: ' + JSON.stringify(request.body));

  function welcome(agent) {
    agent.add(`Welcome to my agent!`);
  }

  function fallback(agent) {
    agent.add(`I didn't understand`);
    agent.add(`I'm sorry, can you try again?`);
  }

  function handleMyIntent(agent) {
    let conv = agent.conv();
    let key = request.body.queryResult.parameters['MyParam'];
    var myAgent = agent;

    return new Promise((resolve, reject) => {
      http_request('http://someurl.com').then(async function(apiData) {
        if (key === 'Hey') {
          conv.close('Howdy');
        } else {
          conv.close('Bye');
        }
        myAgent.add(conv);
        return resolve();
      }).catch(function(err) {
        conv.close(' \nUh, oh. There was an error, please try again later');
        myAgent.add(conv);
        return resolve();
      });
    });
  }

  let intentMap = new Map();
  intentMap.set('Default Welcome Intent', welcome);
  intentMap.set('Default Fallback Intent', fallback);
  intentMap.set('myCustomIntent', handleMyIntent);
  agent.handleRequest(intentMap);
});
A brief overview of what you need:
you have to return the promise resolution.
you have to use the 'request-promise-native' package for HTTP requests
you have to upgrade your plan to allow for outbound HTTP requests (https://firebase.google.com/pricing/)
So it turns out my issue was to do with an outdated version of the actions-on-google SDK. The Dialogflow Firebase example was using v2.0.0; changing this to 2.2.0 in the package.json resolved the issue.
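For reference, the change is a one-line bump in package.json (a sketch; other dependencies stay as they were), followed by re-running npm install:
{
  "dependencies": {
    "actions-on-google": "^2.2.0"
  }
}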

Elasticsearch js Bulk Index TypeError

Background
Recently, I have been working with the Elasticsearch Node.js API to bulk-index a large JSON file. I have successfully parsed the JSON file, so now I should be able to pass the index-ready array into the Elasticsearch bulk command. However, judging by the console log, the array shouldn't be causing any problems.
Bugged Code
The following code is supposed to take an API URL (with a JSON response) and parse it using the Node.js HTTP library. Then using the Elasticsearch Node.js API, it should bulk-index every entry in the JSON array into my Elasticsearch index.
var APIUrl = /* The url to the JSON file on the API providers server. */
var bulk = [];

/*
 * Used to ready JSON file for indexing
 */
var makebulk = function(ParsedJSONFile, callback) {
  var JSONArray = path.to.array; /* The array was nested... */
  var action = { index: { _index: 'my_index', _type: 'my_type' } };
  for (const item of JSONArray) {
    var doc = { "id": `${item.id}`, "name": `${item.name}` };
    bulk.push(action, doc);
  }
  callback(bulk);
}
/*
 * Used to index the output of the makebulk function
 */
var indexall = function(madebulk, callback) {
  client.bulk({
    maxRetries: 5,
    index: "my_index",
    type: "my_type",
    body: makebulk
  }, function(err, resp) {
    if (err) {
      console.log(err);
    } else {
      callback(resp.items);
    }
  });
}
/*
 * Gets the specified URL, parses the JSON object,
 * extracts the needed data and indexes it into the
 * specified Elasticsearch index
 */
http.get(APIUrl, function(res) {
  var body = '';
  res.on('data', function(chunk) {
    body += chunk;
  });
  res.on('end', function() {
    var APIURLResponse = JSON.parse(body);
    makebulk(APIURLResponse, function(resp) {
      console.log("Bulk content prepared");
      indexall(resp, function(res) {
        console.log(res);
      });
      console.log("Response: ", resp);
    });
  });
}).on('error', function(err) {
  console.log("Got an error: ", err);
});
When I run node bulk_index.js on my web server, I receive the following error: TypeError: Bulk body should either be an Array of commands/string, or a String. However, this doesn't make any sense, because the console.log(res) command (from the indexall call inside the http.get handler) outputs the following:
Bulk content prepared
Response: [ { index: { _index: 'my_index', _type: 'my_type', _id: '1' } },
{ id: '5', name: 'The Name' }, ... },
... 120690 more items ]
The above console output appears to show the array in the correct format.
Question
What does TypeError: Bulk body should either be an Array of commands/string, or a String indicate is wrong with the array I am passing into the client.bulk function?
Notes
My server is currently running Elasticsearch 6.2.4 and Java Development Kit version 10.0.1. Everything works as far as the Elasticsearch server goes, and even my Elasticsearch mappings (I didn't provide the client.indices.putMapping code; however, I can if it is needed). I have spent multiple hours reading over every scrap of documentation I could find regarding this TypeError, but I couldn't find much about why it is thrown, so I am not sure where else to look.
It seems there is a typo in your code:
var indexall = function(madebulk, callback) {
  client.bulk({
    maxRetries: 5,
    index: "my_index",
    type: "my_type",
    body: makebulk
Check the madebulk & makebulk: the parameter is named madebulk, but the body is set to makebulk, which is the function defined earlier, not the prepared array - hence the TypeError about the bulk body.
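A corrected version passes the parameter through instead (a minimal sketch of the fix):
var indexall = function(madebulk, callback) {
  client.bulk({
    maxRetries: 5,
    index: "my_index",
    type: "my_type",
    body: madebulk // the prepared array that was passed in, not the makebulk function
  }, function(err, resp) {
    if (err) {
      console.log(err);
    } else {
      callback(resp.items);
    }
  });
}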

Possible to run Headless Chrome/Chromium in a Google Cloud Function?

Is there any way to run Headless Chrome/Chromium in a Google Cloud Function? I understand I can include and run statically compiled binaries in GCF. Can I get a statically compiled version of Chrome that would work for this?
The Node.js 8 runtime for Google Cloud Functions now includes all the necessary OS packages to run Headless Chrome.
Here is a code sample of an HTTP function that returns screenshots:
Main index.js file:
const puppeteer = require('puppeteer');

exports.screenshot = async (req, res) => {
  const url = req.query.url;

  if (!url) {
    return res.send('Please provide URL as GET parameter, for example: ?url=https://example.com');
  }

  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();
  await page.goto(url);
  const imageBuffer = await page.screenshot();
  await browser.close();

  res.set('Content-Type', 'image/png');
  res.send(imageBuffer);
}
and package.json:
{
  "name": "screenshot",
  "version": "0.0.1",
  "dependencies": {
    "puppeteer": "^1.6.2"
  }
}
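To try it out, deploy it as an HTTP function and call it with the url query parameter the code expects. A sketch of the deploy step (flags as accepted by the gcloud CLI for the Node.js 8 runtime; project and region setup omitted):
gcloud functions deploy screenshot --trigger-http --runtime nodejs8 --memory 1024MB
and then open https://<region>-<project>.cloudfunctions.net/screenshot?url=https://example.com in a browser.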
I've just deployed a GCF function running headless Chrome. A couple of takeaways:
you have to statically compile Chromium and NSS on Debian 8
you have to patch environment variables to point to NSS before launching Chromium
performance is much worse than what you'd get on AWS Lambda (3+ seconds)
For 1, you should be able to find plenty of instructions online.
For 2, the code that I'm using is the following:
static executablePath() {
  let bin = path.join(__dirname, '..', 'bin', 'chromium');
  let nss = path.join(__dirname, '..', 'bin', 'nss', 'Linux3.16_x86_64_cc_glibc_PTH_64_OPT.OBJ');

  // make sure the NSS binaries and libraries are on the path before Chromium starts
  if (process.env.PATH === undefined) {
    process.env.PATH = path.join(nss, 'bin');
  } else if (process.env.PATH.indexOf(nss) === -1) {
    process.env.PATH = [path.join(nss, 'bin'), process.env.PATH].join(':');
  }
  if (process.env.LD_LIBRARY_PATH === undefined) {
    process.env.LD_LIBRARY_PATH = path.join(nss, 'lib');
  } else if (process.env.LD_LIBRARY_PATH.indexOf(nss) === -1) {
    process.env.LD_LIBRARY_PATH = [path.join(nss, 'lib'), process.env.LD_LIBRARY_PATH].join(':');
  }

  if (fs.existsSync('/tmp/chromium') === true) {
    return '/tmp/chromium';
  }

  // /tmp is writable in GCF, so expose an executable copy of the binary there
  return new Promise(
    (resolve, reject) => {
      try {
        fs.chmod(bin, '0755', () => {
          fs.symlinkSync(bin, '/tmp/chromium');
          return resolve('/tmp/chromium');
        });
      } catch (error) {
        return reject(error);
      }
    }
  );
}
You also need to use a few required arguments when starting Chrome, namely the following (see the launch sketch after this list):
--disable-dev-shm-usage
--disable-setuid-sandbox
--no-first-run
--no-sandbox
--no-zygote
--single-process
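For example, with puppeteer the launch call might look like this (a sketch; executablePath() is the helper above, and the call must sit inside an async function):
const browser = await puppeteer.launch({
  executablePath: await executablePath(), // resolves to /tmp/chromium, see above
  args: [
    '--disable-dev-shm-usage',
    '--disable-setuid-sandbox',
    '--no-first-run',
    '--no-sandbox',
    '--no-zygote',
    '--single-process'
  ]
});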
I hope this helps.
As mentioned in the comment, work is being done on a possible solution to running a headless browser in a cloud function. A directly applicable discussion, "headless chrome & aws lambda", can be read on Google Groups.
The question at hand was: can you run headless Chrome or Chromium in Firebase Cloud Functions? The answer is NO, since the Node.js project will not have access to any Chrome/Chromium executables and will therefore fail (trust me - I've tried!).
A better solution is to use the Phantom npm package, which uses PhantomJS under the hood:
https://www.npmjs.com/package/phantom
Docs and info can be found here:
http://amirraminfar.com/phantomjs-node/#/
or
https://github.com/amir20/phantomjs-node
The site I was trying to crawl had implemented screen-scraping countermeasures; the trick is to wait for the page to load by searching for an expected string or regex match.
Here is a TypeScript snippet to make the HTTP or HTTPS call:
const phantom = require('phantom');
const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
const page: any = await instance.createPage();
const status = await page.open('https://somewebsite.co.uk/');
const content = await page.property('content');
The same again in JavaScript (note that yield only works here inside a generator, e.g. the __awaiter-wrapped functions shown further down):
const phantom = require('phantom');
const instance = yield phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
const page = yield instance.createPage();
const status = yield page.open('https://somewebsite.co.uk/');
const content = yield page.property('content');
That's the easy bit! If it's a static page, you're pretty much done, and you can parse the HTML with something like the cheerio npm package (https://github.com/cheeriojs/cheerio - an implementation of core jQuery designed for servers).
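A minimal sketch of that parsing step with cheerio (the h1 selector is hypothetical; use whatever matches your target page):
const cheerio = require('cheerio');

// `content` is the HTML string returned by page.property('content') above
const $ = cheerio.load(content);
console.log($('h1').first().text()); // hypothetical selector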
However, if it is a dynamically loading page (lazy loading, or even anti-scraping methods), you will need to wait for the page to update by looping, calling the page.property('content') method, and running a text search or regex to see if the page has finished loading.
I have created a generic asynchronous function that returns the page content (as a string) on success and throws an exception on failure or timeout. It takes as parameters the page, text (a string to search for that indicates success), error (a string that indicates failure, or null to not check for an error), and timeout (a number - self-explanatory):
TypeScript:
import { isNullOrUndefined } from 'util'; // used below; comes from Node's util module

async function waitForPageToLoadStr(page: any, text: string, error: string, timeout: number): Promise<string> {
  const maxTime = timeout ? (new Date()).getTime() + timeout : null;
  let html: string = '';
  html = await page.property('content');

  async function loop(): Promise<string> {
    async function checkSuccess(): Promise<boolean> {
      html = await page.property('content');
      if (!isNullOrUndefined(error) && html.includes(error)) {
        throw new Error(`Error string found: ${ error }`);
      }
      if (maxTime && (new Date()).getTime() >= maxTime) {
        throw new Error(`Timed out waiting for string: ${ text }`);
      }
      return html.includes(text);
    }

    if (await checkSuccess()) {
      return html;
    } else {
      return loop();
    }
  }

  return await loop();
}
JavaScript:
function waitForPageToLoadStr(page, text, error, timeout) {
  return __awaiter(this, void 0, void 0, function* () {
    const maxTime = timeout ? (new Date()).getTime() + timeout : null;
    let html = '';
    html = yield page.property('content');

    function loop() {
      return __awaiter(this, void 0, void 0, function* () {
        function checkSuccess() {
          return __awaiter(this, void 0, void 0, function* () {
            html = yield page.property('content');
            if (!isNullOrUndefined(error) && html.includes(error)) {
              throw new Error(`Error string found: ${error}`);
            }
            if (maxTime && (new Date()).getTime() >= maxTime) {
              throw new Error(`Timed out waiting for string: ${text}`);
            }
            return html.includes(text);
          });
        }
        if (yield checkSuccess()) {
          return html;
        } else {
          return loop();
        }
      });
    }

    return yield loop();
  });
}
I have personally used this function like this:
TypeScript:
try {
  const phantom = require('phantom');
  const instance: any = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page: any = await instance.createPage();
  const status = await page.open('https://somewebsite.co.uk/');
  await waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
} catch (error) {
  console.error(error);
}
JavaScript:
try {
  const phantom = require('phantom');
  const instance = yield phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
  const page = yield instance.createPage();
  yield page.open('https://vehicleenquiry.service.gov.uk/');
  yield waitForPageToLoadStr(page, '<div>Welcome to somewebsite</div>', '<h1>Website under maintenance, try again later</h1>', 1000);
} catch (error) {
  console.error(error);
}
Happy crawling!