Crawling: web scraping stops due to structural changes (HTML)

While crawling a webpage, its structure keeps changing (the page is dynamic), which leads to a scenario where my crawler stops working. Is there a mechanism to identify structural changes to a webpage before running the full crawler, so I can tell whether the structure has changed or not?

If you can run your own JavaScript code in the webpage, you can use MutationObserver, which provides the ability to watch for changes being made to the DOM tree.
Something like:
waitForDomStability(timeout: number) {
  return new Promise(resolve => {
    const waitResolve = observer => {
      observer.disconnect();
      resolve();
    };
    let timeoutId;
    const observer = new MutationObserver((mutationList, observer) => {
      for (let i = 0; i < mutationList.length; i += 1) {
        // we only care if new nodes have been added
        if (mutationList[i].type === 'childList') {
          // restart the countdown timer
          window.clearTimeout(timeoutId);
          timeoutId = window.setTimeout(waitResolve, timeout, observer);
          break;
        }
      }
    });
    timeoutId = setTimeout(waitResolve, timeout, observer);
    // start observing document.body
    observer.observe(document.body, { attributes: true, childList: true, subtree: true });
  });
}
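Assuming the method above runs in the page context, usage is a one-liner (a hedged sketch; the 500 ms window is an arbitrary stability threshold you would tune):
// wait until no new nodes have been added for 500 ms
await waitForDomStability(500);
// the DOM is now considered stable; take a snapshot or start extracting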
I'm using this approach in the open source scraping extension get-set-fetch. For the full code, look at /packages/background/src/ts/plugins/builtin/FetchPlugin.ts in the repo.

You can certainly use "snapshots" for comparing two versions of the same page. I've implemented something similar to Java's String.hashCode to achieve this.
Code in JavaScript:
/*
  returns a DOM element snapshot as an innerText hash code
  starting point is Java's String.hashCode: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
  computed iteratively as hash = hash * 31 + s[i], keeping everything fast:
  only work with a 32-bit hash and replace the multiplication by 31
  with shift-and-subtract, since ((hash << 5) - hash) === hash * 31
*/
function getSnapshot() {
  const snapshotSelector = 'body';
  const nodeToBeHashed = document.querySelector(snapshotSelector);
  if (!nodeToBeHashed) return 0;
  const { innerText } = nodeToBeHashed;
  let hash = 0;
  if (innerText.length === 0) {
    return hash;
  }
  for (let i = 0; i < innerText.length; i += 1) {
    // an integer between 0 and 65535 representing the UTF-16 code unit
    const charCode = innerText.charCodeAt(i);
    // multiply by 31 and add the current charCode
    hash = ((hash << 5) - hash) + charCode;
    // truncate to 32 bits, as bitwise operators treat their operands as 32-bit sequences
    hash |= 0;
  }
  return hash;
}
If you can't run JavaScript code in the page, you can use the entire HTML response as the content to be hashed in your favorite language.
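If you go that route, a minimal sketch in Node.js (assuming Node 18+ for the global fetch; the URL is a placeholder):
// hash the raw HTML response and compare it to the hash from the previous run
const crypto = require('crypto');

async function getHtmlHash(url) {
  const res = await fetch(url);
  const html = await res.text();
  return crypto.createHash('sha256').update(html).digest('hex');
}

// usage: if the hash differs from the stored one, the page content changed
getHtmlHash('https://example.com').then(hash => console.log(hash));
Note that any dynamic content (timestamps, session tokens) changes the hash too, so you may want to strip or ignore those parts before hashing.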

Related

accessing table cells from puppeteer

I need to be able to validate the values in specific table cells, and later, to click on a particular cell that holds a link. I get that Node and the browser (or emulator) are two different process spaces, so I can't pass references. I was hoping that Puppeteer would hide this fact in a read-only manner, for example by having return someArray; in a function run in the browser context be "magically" replicated by Puppeteer on the Node side, but alas.
test("get a certain row from a certain table", async function getRow() {
await page.waitForSelector("#actionItemsView-table");
const
cellText = await page.evaluate(function getCells() {
const
row = Array.from(document.querySelectorAll("#actionItemsView-table tbody tr"));
for (let i = row.length - 1; i >= 0; --i) {
// the row we want is the one that has text of interest in cells[2]
if (row[i].cells[2] === "some text that identifies the row of interest") {
return row[i]; // we can't pass this back to Node, so this is wrong
// but some version of this what we need to do
}
}
return null; // no such row
});
console.log(cellText); // cellText is an empty array
}, testTimeout);
Lacking that, I have run through various intermediate experiments, all the way down to this seemingly simplest case, just to get something that works and then work my way back up to what I need, but this doesn't work either:
test("get the text from a single cell", async function getInnerText() {
await page.waitForSelector("#actionItemsView-table");
const
cellText = await page.evaluate(function getText() {
let
ct,
row = Array.from(document.querySelectorAll("#actionItemsView-table tbody tr"));
for (let i = row.length - 1; i >= 0; --i) {
if (row[i].cells[2] === "some text that identifies the row of interest") {
ct = row[i].cells[3].innerText; // the text of the next cell to the right
break;
}
}
return ct; // ct is not a string!
});
console.log(cellText); // cellText is undefined!
}, testTimeout);
If I do things like
document.querySelector("#actionItemsView-table").rows[2].cells[3].innerText
they work, so my selectors and JavaScript syntax seem to be correct.
There has to be a way to do this and it has to be way easier than I have made it -- what am I missing? Why is the above not working but something like this does work:
await page.$eval("input[name=emailAddr]", function setId(el, id) { el.value = id; return id; }, id);
Here's an easier way to find that cell:
await page.evaluate(() => {
  let td = [...document.querySelectorAll("td")].find(td => td.innerText === "something")
  return td?.nextElementSibling?.innerText
})
This will return the text, or undefined if no matching cell exists.
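If you later need to click the link in that cell, one hedged sketch (reusing the same "something" marker text) is to return the element through page.evaluateHandle, which wraps DOM nodes in an ElementHandle that the Node side can act on:
const handle = await page.evaluateHandle(() => {
  const td = [...document.querySelectorAll("td")].find(td => td.innerText === "something");
  // return the anchor in the neighbouring cell, or null if there is none
  return td ? td.nextElementSibling.querySelector("a") : null;
});
const link = handle.asElement();
if (link) await link.click();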

Execution order in NodeJS [duplicate]

This question already has answers here:
JavaScript closure inside loops – simple practical example
(44 answers)
I am running an event loop of the following form:
var i;
var j = 10;
for (i = 0; i < j; i++) {
  asynchronousProcess(function callbackFunction() {
    alert(i);
  });
}
I am trying to display a series of alerts showing the numbers 0 through 9. The problem is that by the time the callback function is triggered, the loop has already gone through several iterations and it displays a higher value of i. Any recommendations on how to fix this?
The for loop runs immediately to completion while all your asynchronous operations are started. When they complete some time in the future and call their callbacks, the value of your loop index variable i will be at its last value for all the callbacks.
This is because the for loop does not wait for an asynchronous operation to complete before continuing on to the next iteration of the loop and because the async callbacks are called some time in the future. Thus, the loop completes its iterations and THEN the callbacks get called when those async operations finish. As such, the loop index is "done" and sitting at its final value for all the callbacks.
To work around this, you have to uniquely save the loop index separately for each callback. In Javascript, the way to do that is to capture it in a function closure. That can either be done by creating an inline function closure specifically for this purpose (first example shown below), or you can create an external function that you pass the index to and let it maintain the index uniquely for you (second example shown below).
As of 2016, if you have a fully up-to-spec ES6 implementation of Javascript, you can also use let to define the for loop variable and it will be uniquely defined for each iteration of the for loop (third implementation below). But note that this was a late feature in ES6 implementations, so you have to make sure your execution environment supports it.
Use .forEach() to iterate since it creates its own function closure
someArray.forEach(function(item, i) {
  asynchronousProcess(function() {
    console.log(i);
  });
});
Create Your Own Function Closure Using an IIFE
var j = 10;
for (var i = 0; i < j; i++) {
  (function(cntr) {
    // here the value of i was passed in as the argument cntr
    // and will be captured in this function closure so each
    // iteration of the loop can have its own value
    asynchronousProcess(function() {
      console.log(cntr);
    });
  })(i);
}
Create or Modify External Function and Pass it the Variable
If you can modify the asynchronousProcess() function, then you could just pass the value in there and have the asynchronousProcess() function pass the cntr back to the callback, like this:
var j = 10;
for (var i = 0; i < j; i++) {
  asynchronousProcess(i, function(cntr) {
    console.log(cntr);
  });
}
Use ES6 let
If you have a Javascript execution environment that fully supports ES6, you can use let in your for loop like this:
const j = 10;
for (let i = 0; i < j; i++) {
  asynchronousProcess(function() {
    console.log(i);
  });
}
let declared in a for loop declaration like this will create a unique value of i for each iteration of the loop (which is what you want).
Serializing with promises and async/await
If your async function returns a promise, and you want to serialize your async operations to run one after another instead of in parallel and you're running in a modern environment that supports async and await, then you have more options.
async function someFunction() {
  const j = 10;
  for (let i = 0; i < j; i++) {
    // wait for the promise to resolve before advancing the for loop
    await asynchronousProcess();
    console.log(i);
  }
}
This will make sure that only one call to asynchronousProcess() is in flight at a time and the for loop won't even advance until each one is done. This is different than the previous schemes that all ran your asynchronous operations in parallel so it depends entirely upon which design you want. Note: await works with a promise so your function has to return a promise that is resolved/rejected when the asynchronous operation is complete. Also, note that in order to use await, the containing function must be declared async.
Run asynchronous operations in parallel and use Promise.all() to collect results in order
function someFunction() {
  let promises = [];
  for (let i = 0; i < 10; i++) {
    promises.push(asynchronousProcessThatReturnsPromise());
  }
  return Promise.all(promises);
}

someFunction().then(results => {
  // array of results in order here
  console.log(results);
}).catch(err => {
  console.log(err);
});
async/await is here (ES2017), so you can do this kind of thing very easily now.
async function run() {
  var j = 10;
  for (var i = 0; i < j; i++) {
    // await is only valid inside an async function
    await asynchronousProcess();
    alert(i);
  }
}
run();
Remember, this works only if asynchronousProcess returns a Promise.
If asynchronousProcess is not in your control, then you can make it return a Promise yourself, like this:
function asyncProcess() {
  return new Promise((resolve, reject) => {
    asynchronousProcess(() => {
      resolve();
    });
  });
}
Then replace the line await asynchronousProcess(); with await asyncProcess();
Understanding Promises before even looking into async/await is a must.
(Also read about support for async/await.)
Any recommendation on how to fix this?
Several. You can use bind:
for (i = 0; i < j; i++) {
  asynchronousProcess(function (i) {
    alert(i);
  }.bind(null, i));
}
Or, if your browser supports let (it is part of the next ECMAScript version, and Firefox has already supported it for a while), you could have:
for (i = 0; i < j; i++) {
  let k = i;
  asynchronousProcess(function() {
    alert(k);
  });
}
Or, you could do the job of bind manually (in case the browser doesn't support it, though you could implement a shim in that case; it should be in the link above):
for (i = 0; i < j; i++) {
  asynchronousProcess(function(i) {
    return function () {
      alert(i);
    };
  }(i));
}
I usually prefer let when I can use it (e.g. for a Firefox add-on); otherwise bind or a custom currying function (one that doesn't need a context object).
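As an illustration of that last option, a minimal currying sketch (curryAlert is a hypothetical helper; it is the external-function pattern without bind's context argument):
function curryAlert(value) {
  // the returned callback has captured the current value in its closure
  return function () {
    alert(value);
  };
}

for (i = 0; i < j; i++) {
  asynchronousProcess(curryAlert(i));
}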
var i = 0;
var length = 10;

function for1() {
  console.log(i);
  for2();
}

function for2() {
  if (i == length) {
    return false;
  }
  setTimeout(function() {
    i++;
    for1();
  }, 500);
}

for1();
This is a sample recursive, callback-driven approach to what is expected here.
ES2017: You can wrap the async code inside a function (say, XHRpost) returning a promise (the async code lives inside the promise).
Then call the function (XHRpost) inside the for loop, but with the magical await keyword. :)
let http = new XMLHttpRequest();
let url = 'http://sumersin/forum.social.json';

function XHRpost(i) {
  return new Promise(function(resolve) {
    let params = 'id=nobot&%3Aoperation=social%3AcreateForumPost&subject=Demo' + i + '&message=Here%20is%20the%20Demo&_charset_=UTF-8';
    http.open('POST', url, true);
    http.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    http.onreadystatechange = function() {
      console.log("Done " + i + "<<<<>>>>>" + http.readyState);
      if (http.readyState == 4) {
        console.log('SUCCESS :', i);
        resolve();
      }
    };
    http.send(params);
  });
}

(async () => {
  for (let i = 1; i < 5; i++) {
    await XHRpost(i);
  }
})();
JavaScript code runs on a single thread, so you cannot simply block to wait for the first loop iteration to complete before beginning the next without seriously impacting page usability.
The solution depends on what you really need. If the example is close to exactly what you need, #Simon's suggestion to pass i to your async process is a good one.

node.js: put asynchronous returns in one object or array

I am missing something fundamental in terms of callbacks/async in the code below: why do I get:
[,,'[ {JSON1} ]']
[,,'[ {JSON2} ]']
(= 2 console returns) instead of only one console return with one proper table, which is what I want and would look like:
[,'[ {JSON1} ]','[ {JSON2} ]']
or ideally:
[{JSON1},{JSON2}]
See my code below. getPTdata is a function I created to retrieve some JSON via a REST API (HTTPS request). I cannot get everything at once since the API I'm talking to has a limit, hence the limit and offset parameters of my calls.
offsets = [0, 1]
res = []

function goGetData(callback) {
  for (var a = 0; a < offsets.length; a++) {
    getPTdata('stories',
      '?limit=1&offset=' + offsets[a] + '&date_format=millis',
      function(result) {
        // called once getPTdata is done
        res[a] = result
        callback(res)
      });
  }
}

goGetData(function(notgoingtowork) {
  // called once goGetData is done
  console.log(res)
})
Solved like this:
offsets = [0, 1]
res = []

function goGetData(callback) {
  var nb_returns = 0
  for (var a = 0; a < offsets.length; a++) {
    getPTdata('stories', '?limit=1&offset=' + offsets[a] + '&date_format=millis', function(result) {
      // note: because of the loop closure I cannot use a here anymore
      // called once getPTdata is done, therefore we know result and can store it
      nb_returns++
      res.push(JSON.parse(result))
      if (nb_returns == offsets.length) {
        callback(res)
      }
    });
  }
}

goGetData(function(consolidated) {
  // called once goGetData is done
  console.log(consolidated)
})
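Note that with the counter approach the results end up in res in completion order, not request order. A hedged alternative (assuming the same getPTdata signature) wraps each call in a Promise and lets Promise.all preserve the offset order:
function getPTdataAsync(type, query) {
  return new Promise(function(resolve) {
    getPTdata(type, query, resolve);
  });
}

Promise.all(offsets.map(function(offset) {
  return getPTdataAsync('stories', '?limit=1&offset=' + offset + '&date_format=millis');
})).then(function(results) {
  // results are ordered like offsets, regardless of completion order
  console.log(results.map(function(r) { return JSON.parse(r); }));
});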

JSON streaming with Oboe.js, MongoDB and Express.js

I'm experimenting with JSON streaming through HTTP with Oboe.js, MongoDB and Express.js.
The point is to do a query in MongoDB (with Node.js's native mongodb driver), pipe it (a JavaScript array) through Express.js and parse it in the browser with Oboe.js.
The benchmarks I did compared streaming vs. blocking, both in the server-side MongoDB query and in the client-side JSON parsing.
Here is the source code for the two benchmarks. The first number is the number of milliseconds for 1000 queries of 100 items (pagination) in a collection of 10 million documents, and the second number, in parentheses, represents the number of milliseconds before the very first item in the MongoDB result array is parsed.
The streaming benchmark server-side:
// Oboe.js - 20238 (16.887)
// Native  - 16703 (16.69)
collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .stream()
  .pipe(JSONStream.stringify())
  .pipe(res);
The blocking benchmark server-side:
// Oboe.js - 17418 (14.267)
// Native  - 13706 (13.698)
collection
  .find()
  .skip(+req.query.offset)
  .limit(+req.query.limit)
  .toArray(function (e, docs) {
    res.json(docs);
  });
These results really surprise me because I would have thought that:
Streaming would be quicker than blocking every single time.
Oboe.js would be quicker to parse the entire JSON array compared to the native JSON.parse method.
Oboe.js would be quicker to parse the first element in the array compared to the native JSON.parse method.
Does anyone have an explanation?
What am I doing wrong?
Here is the source code for the two client-side benchmarks too.
The streaming benchmark client-side:
var limit = 100;
var max = 1000;
var oboeFirstTimes = [];
var oboeStart = Date.now();

function paginate(i, offset, limit) {
  if (i === max) {
    console.log('> OBOE.js time:', (Date.now() - oboeStart));
    console.log('> OBOE.js avg. first time:', (
      oboeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }
  var parseStart = Date.now();
  var first = true;
  oboe('/api/spdy-stream?offset=' + offset + '&limit=' + limit)
    .node('![*]', function () {
      if (first) {
        first = false;
        oboeFirstTimes.push(Date.now() - parseStart);
      }
    })
    .done(function () {
      paginate(i + 1, offset + limit, limit);
    });
}

paginate(0, 0, limit);
The blocking benchmark client-side:
var limit = 100;
var max = 1000;
var nativeFirstTimes = [];
var nativeStart = Date.now();

function paginate(i, offset, limit) {
  if (i === max) {
    console.log('> NATIVE time:', (Date.now() - nativeStart));
    console.log('> NATIVE avg. first time:', (
      nativeFirstTimes.reduce(function (total, time) {
        return total + time;
      }, 0) / max
    ));
    return true;
  }
  var parseStart = Date.now();
  var first = true;
  var req = new XMLHttpRequest();
  req.open('GET', '/api/spdy-stream?offset=' + offset + '&limit=' + limit, true);
  req.onload = function () {
    var json = JSON.parse(req.responseText);
    json.forEach(function () {
      if (first) {
        first = false;
        nativeFirstTimes.push(Date.now() - parseStart);
      }
    });
    paginate(i + 1, offset + limit, limit);
  };
  req.send();
}

paginate(0, 0, limit);
Thanks in advance!
I found these comments in the Oboe documentation, at the end of the "Why Oboe?" section:
Because it is a pure Javascript parser, Oboe.js requires more CPU time than JSON.parse. Oboe.js works marginally more slowly for small messages that load very quickly but for most real-world cases using i/o effectively beats optimising CPU time.
SAX parsers require less memory than Oboe’s pattern-based parsing model because they do not build up a parse tree. See Oboe.js vs SAX vs DOM.
If in doubt, benchmark, but don’t forget to use the real internet, including mobile, and think about perceptual performance.

Limit the size of a file upload (html input element)

I would like to simply limit the size of a file that a user can upload.
I thought maxlength = 20000 = 20k but that doesn't seem to work at all.
I am running on Rails, not PHP, but was thinking it'd be much simpler to do it client side in the HTML/CSS, or as a last resort using jQuery. This is so basic though that there must be some HTML tag I am missing or not aware of.
Looking to support IE7+, Chrome, FF3.6+. I suppose I could get away with just supporting IE8+ if necessary.
Thanks.
var uploadField = document.getElementById("file");

uploadField.onchange = function() {
  if (this.files[0].size > 2097152) {
    alert("File is too big!");
    this.value = "";
  }
};
This example should work fine. I set it up for roughly 2 MB; 1 MB in bytes is 1,048,576, so you can multiply that by the limit you need.
Here is the jsfiddle example for more clarity:
https://jsfiddle.net/7bjfr/808/
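The arithmetic, as a tiny sketch (MAX_MB is a name introduced here for illustration):
// bytes = megabytes * 1024 * 1024
var MAX_MB = 5;
var maxBytes = MAX_MB * 1024 * 1024; // 5,242,880 bytes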
This is completely possible. Use JavaScript.
I use jQuery to select the input element, and I have it set up with an onChange event.
$("#aFile_upload").on("change", function (e) {
var count=1;
var files = e.currentTarget.files; // puts all files into an array
// call them as such; files[0].size will get you the file size of the 0th file
for (var x in files) {
var filesize = ((files[x].size/1024)/1024).toFixed(4); // MB
if (files[x].name != "item" && typeof files[x].name != "undefined" && filesize <= 10) {
if (count > 1) {
approvedHTML += ", "+files[x].name;
}
else {
approvedHTML += files[x].name;
}
count++;
}
}
$("#approvedFiles").val(approvedHTML);
});
The code above saves all the file names that I deem worthy of persisting to the submission page, before the submission actually happens. I add the "approved" files to an input element's val using jQuery, so a form submit will send the names of the files I want to save. All the files will be submitted, however, so on the server side we still have to filter them out. I haven't written any code for that yet, but use your imagination. I assume one can accomplish this with a for loop, matching the names sent over from the input field against the $_FILES variable (a PHP superglobal; sorry, I don't know the Ruby equivalent).
My point is you can do checks for files before submission. I do this and then output it to the user before they submit the form, to let them know what they are uploading to my site. Anything that doesn't meet the criteria does not get displayed back to the user, so they should know that files that are too large won't be saved. This should work on all browsers because I'm not using the FormData object.
You can't do it client-side. You'll have to do it on the server.
Edit: This answer is outdated!
When I originally answered this question in 2011, HTML File API was nothing but a draft. It is now supported on all major browsers.
I'd provide an update with solution, but #mark.inman.winning has already answered better than I could.
Keep in mind that even though it's now possible to validate on the client, you should still validate on the server too. All client-side validations can be bypassed.
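As one illustration of such a server-side check (the question uses Rails, so this Node.js/Express sketch with multer is only an analogy; the route and field names are placeholders):
const express = require('express');
const multer = require('multer');

const app = express();
// reject any uploaded file larger than 5 MB before the handler runs
const upload = multer({ limits: { fileSize: 5 * 1024 * 1024 } });

app.post('/upload', upload.single('file'), (req, res) => {
  res.send('File accepted');
});

// multer reports an oversized file as an error with code LIMIT_FILE_SIZE
app.use((err, req, res, next) => {
  if (err.code === 'LIMIT_FILE_SIZE') return res.status(413).send('File too large');
  next(err);
});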
const input = document.getElementById('input')

input.addEventListener('change', (event) => {
  const target = event.target
  if (target.files && target.files[0]) {
    /* Maximum allowed size in bytes
       5 MB example
       Change the first operand (multiplier) for your needs */
    const maxAllowedSize = 5 * 1024 * 1024;
    if (target.files[0].size > maxAllowedSize) {
      // Here you can ask your users to load a correct file
      target.value = ''
    }
  }
})

<input type="file" id="input" />
If you need to validate file type, write in comments below and I'll share my solution.
(Spoiler: the accept attribute is not a bulletproof solution.)
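To sketch why it isn't (my own hedged illustration, not the answerer's solution): accept only filters the file picker dialog, and file.type is guessed from the file extension, so the server must still validate for real.
<input type="file" id="video-input" accept="video/*" />

const videoInput = document.getElementById('video-input');
videoInput.addEventListener('change', function () {
  const file = this.files[0];
  // file.type is derived from the extension; an empty string means unknown
  if (file && !file.type.startsWith('video/')) {
    this.value = ''; // reject anything that doesn't claim to be video
  }
});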
Video file example (HTML + Javascript):
function upload_check() {
  var upl = document.getElementById("file_id");
  var max = document.getElementById("max_id").value;
  if (upl.files[0].size > max) {
    alert("File too big!");
    upl.value = "";
  }
}

<form action="some_script" method="post" enctype="multipart/form-data">
  <input id="max_id" type="hidden" name="MAX_FILE_SIZE" value="250000000" />
  <input onchange="upload_check()" id="file_id" type="file" name="file_name" accept="video/*" />
  <input type="submit" value="Upload"/>
</form>
I made a solution using just JavaScript, and it supports multiple files:
const input = document.querySelector("input")
const result = document.querySelector("p")
const maximumSize = 10 * 1024 * 1024 // 10 MB, in bytes

input.addEventListener("change", function(e) {
  const files = Array.from(this.files)
  const approvedFiles = new Array
  if (!files.length) return result.innerText = "No selected files"
  for (const file of files) if (file.size <= maximumSize) approvedFiles.push(file)
  if (approvedFiles.length) result.innerText = `Approved files: ${approvedFiles.map(file => file.name).join(", ")}`
  else result.innerText = "No approved files"
})

<input type="file" multiple>
<p>Result</p>
This question is from a long time ago, but maybe this could help someone struggling.
If you are working with forms, the easiest way to do this is by creating a new FormData from your form. For example:
form.addEventListener("submit", function(e) {
  e.preventDefault()
  const fd = new FormData(this)
  for (let key of fd.keys()) {
    if (fd.get(key).size >= 2000000) {
      return console.log(`The file ${fd.get(key).name} is bigger than 2MB.`)
    } else if (fd.get(key).size < 2000000) {
      console.log(`The file ${fd.get(key).name} is less than 2MB.`)
    } else {
      // non-file fields have no size, so both checks fail; log them as-is
      console.log(key, fd.get(key))
    }
  }
  this.reset()
})
As you can see, you can get the size of a file submitted with the form by typing:
fd.get(key).size
And the file name is also reachable:
fd.get(key).name
Hope this was helpful!
<script type="text/javascript">
$(document).ready(function () {
var uploadField = document.getElementById("file");
uploadField.onchange = function () {
if (this.files[0].size > 300000) {
this.value = "";
swal({
title: 'File is larger than 300 KB !!',
text: 'Please Select a file smaller than 300 KB',
type: 'error',
timer: 4000,
onOpen: () => {
swal.showLoading()
timerInterval = setInterval(() => {
swal.getContent().querySelector('strong')
.textContent = swal.getTimerLeft()
}, 100)
},
onClose: () => {
clearInterval(timerInterval)
}
}).then((result) => {
if (
// Read more about handling dismissals
result.dismiss === swal.DismissReason.timer
) {
console.log('I was closed by the timer')
}
});
};
};
});
</script>
A PHP solution to verify the size server-side:
<?php
if ($_FILES['name']['size'] > 16777216) {
?>
  <script type="text/javascript">
    alert("The file is too big!");
    history.back();
  </script>
<?php
  die();
}
?>
16777216 Bytes = 16 Megabytes
Convert units: https://convertlive.com/u/convert/megabytes/to/bytes#16
Adapted from https://www.php.net/manual/en/features.file-upload.php