How to scrape badly formed HTML

I'm trying to scrape a really really old page that looks like it was built with FrontPage or even just pasted from a Word document. It's full of font tags that can spontaneously stop and start in the middle of a word, or similar elements at randomly different tree depths.
I am not looking for tools that can parse poorly formed XML; I am already using Html Agility Pack. When I say badly formed HTML, I mean it was not output from a database and has no consistent patterns, but on screen it looks fine.
What techniques and tools can I use?

I would use cheerio in Node.js. It replicates the jQuery API, which makes it very easy to parse badly formatted HTML. Scraping with JavaScript makes sense for many reasons.
This is an example taken from node.io:
var request = require('request')
  , cheerio = require('cheerio')
  , async = require('async')
  , format = require('util').format;

var reddits = [ 'programming', 'javascript', 'node' ]
  , concurrency = 2;

async.eachLimit(reddits, concurrency, function (reddit, next) {
    var url = format('http://reddit.com/r/%s', reddit);
    request(url, function (err, response, body) {
        if (err) throw err;
        var $ = cheerio.load(body);
        $('a.title').each(function () {
            console.log('%s (%s)', $(this).text(), $(this).attr('href'));
        });
        next();
    });
});

Related

How do I parse a html page using nodejs to find a qr code?

I want to parse a web page, searching for QRcodes in the page. When I find them, I am going to read them using the QRcode npm module.
The hard part is, I don't know how to parse the HTML page in a way that detects only the image tags that contain a QR code.
I tried finding some kind of pattern in the images that contain a QR code; it usually starts with "?qr", but I think the ending is different every time.
I'm using the module request-promise to get the raw HTML, and then I parse through it:
const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
    .then(function(html){
        // success!
        console.log(html);
    })
    .catch(function(err){
        // handle error
    });
I want to be able to download the image of the QRcode.
You need to pass the html returned into something like https://www.npmjs.com/package/node-html-parser
const rp = require('request-promise');
const parser = require('node-html-parser');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
    .then(function(html){
        const data = parser.parse(html);
        console.log(JSON.stringify(data));
    })
    .catch(function(err){
        // handle error
    });
Then you can access things off the data object to find the QR code
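If a full HTML parser feels heavy, the "?qr" marker mentioned in the question can also be matched with a small regex scan over the raw HTML. A minimal sketch (the sample markup and the "?qr" naming convention are assumptions taken from the question, and regex matching is a simplification, not robust HTML parsing):

```javascript
// Find <img> tags whose src looks like a QR code.
// The "?qr" marker is an assumption about how this particular
// page names its QR images.
function findQrImageUrls(html) {
    var urls = [];
    // Match src attributes of img tags (single or double quotes)
    var re = /<img\b[^>]*\bsrc=["']([^"']*\?qr[^"']*)["']/gi;
    var m;
    while ((m = re.exec(html)) !== null) {
        urls.push(m[1]);
    }
    return urls;
}

var sample = '<p><img src="/codes/ticket?qr=123"></p>' +
             '<img src="/logo.png">';
console.log(findQrImageUrls(sample)); // [ '/codes/ticket?qr=123' ]
```

Each returned URL could then be resolved against the page URL and downloaded for the QRcode npm module to decode.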

How to parse json newline delimited in Angular 2

I am writing an Angular 2 app (built with angular cli), and trying to use AWS Polly text-to-speech API.
According to the API you can request audio output as well as "Speech Marks", which can describe word timing, visemes, etc. The audio is delivered in "mp3" format, and the speech marks as "application/x-json-stream", which I understand to be newline-delimited JSON. It cannot be parsed with JSON.parse() due to the new lines, and I have so far been unable to read/parse this data. I have looked at several libs for "JSON streaming", but they are all built for Node.js and won't work with Angular 2. My code is as follows...
onClick() {
    AWS.config.region = 'us-west-2';
    AWS.config.accessKeyId = 'xxxxx';
    AWS.config.secretAccessKey = 'yyyyy';

    let polly = new AWS.Polly();
    var params = {
        OutputFormat: 'json',
        Text: 'Hello world',
        VoiceId: 'Joanna',
        SpeechMarkTypes: ['viseme']
    };

    polly.synthesizeSpeech(params, (err, data) => {
        if (err) {
            console.log(err, err.stack);
        } else {
            var uInt8Array = new Uint8Array(data.AudioStream);
            var arrayBuffer = uInt8Array.buffer;
            var blob = new Blob([arrayBuffer]);
            var url = URL.createObjectURL(blob);
            this.audio.src = url;
            this.audio.play(); // works fine
            // speech marks info displays "application/x-json-stream"
            console.log(data.ContentType);
        }
    });
}
Strangely enough Chrome browser knows how to read this data and displays it in the response.
Any help would be greatly appreciated.
I had the same problem. I saved the file so I could then read it line by line, accessing the JSON objects when I need to highlight words being read. Mind you this is probably not the most effective way, but an easy way to move on and get working on the fun stuff.
I am trying out different ways to work with Polly, will update answer if I find a better way
You can do it with:
https://www.npmjs.com/package/ndjson-parse
That worked for me.
But I can't play audio; I tried your code and it says:
DOMException: Failed to load because no supported source was found.
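For reference, newline-delimited JSON can also be parsed by hand without any library: split the text on newlines, skip blank lines, and JSON.parse each line individually. A minimal sketch (the sample speech-mark fields follow the question's description but are illustrative):

```javascript
// Parse newline-delimited JSON ("application/x-json-stream") by hand:
// one JSON object per line, so split and parse each line separately.
function parseNdjson(text) {
    return text
        .split(/\r?\n/)
        .filter(function (line) { return line.trim() !== ''; })
        .map(function (line) { return JSON.parse(line); });
}

// Example speech-mark payload (field names are illustrative)
var body = '{"time":0,"type":"viseme","value":"p"}\n' +
           '{"time":125,"type":"viseme","value":"E"}\n';
var marks = parseNdjson(body);
console.log(marks.length);   // 2
console.log(marks[1].value); // E
```

This avoids any Node-only streaming dependency, so it works the same in an Angular 2 browser build.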

render JSON response using Jade

I have gone back to basics to try and create a simple example of calling a REST API, receiving some JSON back, and rendering the JSON data in HTML using Jade.
I have tried many approaches to this but cannot get any to work.
What code would I need to add to my main script file (below, lxrclient.js) to achieve this? I know I need to add the express module and render the view, but no matter how many approaches I have tried, I cannot get it to work. I have also added the Jade file I am using further down. I'd really appreciate any help anyone can provide with this.
//this is my main script file lxrclient3.js
var http = require('http');
var express = require('express');
var app = express();
app.set('view engine', 'jade');

var options = {
    host: '41.193.214.130',
    port: 2510,
    path: '/eiftidemo/clt_list',
    method: 'GET'
};

http.request(options, function(res) {
    var body = '';
    // none of these statements execute
    res.on('data', function(chunk) {
        body += chunk;
    });
    res.on('end', function() {
        var clientsData = JSON.parse(body);
        debugger;
    });
}).end();

app.get("/clientlist", function(req, res){
    res.render('listlxr', {clientd: clientsData});
});

var server = app.listen(3000, function() {
    console.log('Our App is running at http://localhost:3000');
});
here is my Jade view
html
  head
    title List of Clients
    link(rel="stylesheet", href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css")
  body
    div.row
      div.col-md-3
      div.col-md-6
        div.panel.panel-primary
          div.panel-heading.text-center
            h1 Client List for Hyposure
          div.panel-body
            table.table
              each clients in clientsData
                tr
                  td.media
                    span.bg-info= clients.clientName
                  td.media
                    span.bg-info= clients.clientSurname
Thanks to anyone who can help.
First of all, when you say Jade I hope you mean Pug; if not, you should update to the newest version, which is now called Pug.
Now, to send data to your template for the engine to render, you pass it a JSON object, like so:
app.get("/", function(req, res) {
    res.render("page", {greet : "Hello"});
});
This is a standard way of rendering a page and sending some data alongside it.
Now in your Pug (Jade, same thing) template you can access the sent variable like so:
html
  head
    title List of Clients
  body
    //- this will print "Hello"
    h1 #{greet}
This should give you a basic idea on how to render data onto a site, you can also send nested objects and you just work it the same way as in this example, but you point to the correct key. So for example if you're sending the following object:
{
    greet: {
        message : "Hello",
        who : "Adrian"
    }
}
Then you can print values from that using:
#{greet.message} #{greet.who}

How do I deal with static navbars in Node?

A bit about my background: I haven't done much web development for a while, and am only recently starting to get back in the swing of things. I remember the days when I used to have static header/footer files in PHP, and code like so:
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <?php include("header.php") ?>
        Body text here
        <?php include("footer.php") ?>
    </body>
</html>
I'm currently making the switch to Node for my backend (for various reasons not pertinent to this post). I'm still quite new at this, so I'm wondering if there is a simple way to have static headers/footers (which contain navbars and such) for my front end. I'm trying to use Bootstrap for my front-end framework.
I think using Jade, or other template engines, might be one way to go about doing this, though I'm not necessarily sure if I want to use Jade just yet as dealing with js and HTML is troublesome enough without adding another pseudo-language/format/template into the mix. So I'm wondering if there is a solution that does not use template engines.
Here's what I currently have for my app.js/web.js file:
var express = require('express')
var app = express()
var path = require('path')
var fs = require('fs');
var bodyparser = require('body-parser')

app.use(bodyparser.urlencoded({extended: false}))
app.use(express.static(path.join(__dirname)));

app.get('/submit', function(req,res) {
    [functions omitted for brevity]
})

[other processes omitted for brevity]

app.listen(8080)
Thank you!
I would highly recommend using a view engine for this. Node is not like PHP. PHP scripts are processed sequentially and the rendered output is sent as the response. If you took the same approach in Node, you would be building a string in JavaScript and then sending it out to the client.
app.get('/submit', function(req,res) {
    var output = '<html>';
    output += '<head>';
    ...
    res.send(output);
});
Now imagine that you have to query a database:
app.get('/submit', function(req,res) {
    var output = '<html>';
    output += '<head>';
    ...
    db.query(query, function(err, result) {
        output += 'the result is ' + result;
        ...
        res.send(output);
    });
});
So the lesson is: use a templating engine, as they are designed for building the HTML output for you. Start with EJS, as its syntax will be more familiar coming from PHP.

How write and immediately read a file nodeJS

I have to obtain a JSON object that is embedded inside a script tag on a certain page, so I can't use regular scraping techniques like cheerio.
Easy way out: write the file (download the page) to the server, then read it and use string manipulation to extract the JSON objects (there are several), work on them, and save to my DB happily.
The thing is that I'm too new to Node.js and can't get the code to work. I think I'm trying to read the file before it is fully written; if I read it too early I get [object Object]...
Here's what I have so far...
var http = require('http');
var fs = require('fs');
var request = require('request');

var localFile = 'tmp/scraped_site_.html';
var url = "siteToBeScraped.com/?searchTerm=foobar"

// writing
var file = fs.createWriteStream(localFile);
var request = http.get(url, function(response) {
    response.pipe(file);
});

// reading
var readedInfo = fs.readFileSync(localFile, function (err, content) {
    callback(url, localFile);
    console.log("READING: " + localFile);
    console.log(err);
});
So first of all I think you should understand what went wrong.
The http request operation is asynchronous. This means that the callback code in http.get() will run sometime in the future, but the fs.readFileSync, due to its synchronous nature will execute and complete even before the http request will actually be sent to the background thread that will execute it, since they are both invoked in what is commonly known as the (same) tick. Also fs.readFileSync returns a value and does not use a callback.
Even if you replace fs.readFileSync with fs.readFile instead the code still might not work properly since the readFile operation might execute before the http response is fully read from the socket and written to the disk.
I strongly suggest reading: stackoverflow question and/or Understanding the node.js event loop
The correct place to invoke the file read is when the response stream has finished writing to the file, which would look something like this:
var request = http.get(url, function(response) {
    response.pipe(file);
    file.once('finish', function () {
        fs.readFile(localFile, /* fill encoding here */, function(err, data) {
            // do something with the data if there is no error
        });
    });
});
Of course this is a very raw and not recommended way to write asynchronous code but that is another discussion altogether.
Having said that, if you download a file, write it to the disk and then read it all back again to the memory for manipulation, you might as well forgo the file part and just read the response into a string right away. Your code will then look something like so (this can be implemented in several ways):
var request = http.get(url, function(response) {
    var data = '';
    function read() {
        var chunk;
        while ( chunk = response.read() ) {
            data += chunk;
        }
    }
    response.on('readable', read);
    response.on('end', function () {
        console.log('[%s]', data);
    });
});
What you really should do, IMO, is create a transform stream that pulls out just the data you need from the response while not consuming too much memory, yielding this more elegant-looking code:
var request = http.get(url, function(response) {
    response.pipe(yourTransformStream).pipe(file)
});
Implementing this transform stream, however, might prove slightly more complex. So if you're a node beginner and you don't plan on downloading big files or lots of small files, then maybe loading the whole thing into memory and doing string manipulations on it might be simpler.
For further information about transformation streams:
node.js stream api
this wonderful guide by substack
this post from strongloop
Lastly, see if you can use any of the million node.js crawlers already out there :-) take a look at these search results on npm
According to the http module documentation, 'get' does not return the response body.
This is modified from the request example on the same page.
What you need to do is process the response within the callback (function) passed into http.request, so it can be called when it is ready (async):
var http = require('http')
var fs = require('fs')

var localFile = 'tmp/scraped_site_.html'
var file = fs.createWriteStream(localFile)

var req = http.request('http://www.google.com.au', function(res) {
    res.pipe(file)
    res.on('end', function(){
        file.end()
        fs.readFile(localFile, function(err, buf){
            console.log(buf.toString())
        })
    })
})

req.on('error', function(e) {
    console.log('problem with request: ' + e.message)
})

req.end();
EDIT
I updated the example to read the file after it is created. This works by having a callback on the end event of the response which closes the pipe and then it can reopen the file for reading. Alternatively you can use
req.on('data', function(chunk){...})
to process the data as it arrives without putting it into a temporary file
My impression is that you are serializing a JS object into JSON by reading it from a stream that's downloading a file containing HTML. This is doable, yet hard. It's difficult to know when your search expression has been found, because if you parse the chunks as they come in, you never know whether you received only part of it; what you're looking for could be split into two or more chunks that are never analyzed as a whole.
You could try something like this:
http.request('u/r/l', function(res) {
    res.on('data', function(data) {
        // parse data as it comes in
    });
});
This allows you to read data as it comes in. You can handle it to save to disk or a DB, or even parse it, provided you accumulate the contents within the script tags into a single string first and then parse the objects out of that.
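Putting the pieces together for the original question: accumulate the whole body first (so a match can never be split across chunks), then extract and parse the JSON embedded in <script> tags. A sketch, where the regex and the sample page are illustrative assumptions about the target site:

```javascript
// Extract JSON embedded in <script> tags from a full HTML string.
// Doing this after the whole body has arrived avoids the split-chunk
// problem described above. The regex is a simplification.
function extractScriptJson(html) {
    var results = [];
    var re = /<script[^>]*>\s*([\s\S]*?)\s*<\/script>/gi;
    var m;
    while ((m = re.exec(html)) !== null) {
        try {
            results.push(JSON.parse(m[1]));
        } catch (e) {
            // this script block is code, not pure JSON; skip it
        }
    }
    return results;
}

var page = '<html><script>{"id": 7, "name": "foobar"}</script>' +
           '<script>var notJson = 1;</script></html>';
console.log(extractScriptJson(page)); // [ { id: 7, name: 'foobar' } ]
```

The function would be called from the response's 'end' handler once the accumulated string is complete, and the resulting objects can then be saved to the DB.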