How to Determine If Webpage Has Been Modified?

How to Determine If Webpage Has Been Modified? - html

I know I can check the response header's 'last-modified' value to determine when the web page was last modified, but in many instances that header is NOT provided. Also, in many instances the content itself hasn't changed, but the current time/date is displayed on the page, thus giving the appearance of a modification.
Any suggestions on how to overcome the above issues and determine if a web page has been (truly) modified?
Thanks.

Sure. Define for yourself what counts as a "modification" (for example, only things in the "content" div) and only look at that.
If you can't find a way to decide whether something's been changed, then you can't expect a computer to…

You are asking two question here:
When was it modified?
Was it modified?
To answer question #1, you'd have to check the page every so often to meet your granularity requirements e.g. every hour, every day, every week, etc. This could be quite resource intensive. This will depend on if you really need to know this.
To answer question #2, you need to compare something. You could do what #Paul Rosnia suggested, but if they as much as added a comma, it will be considered modified.
Then, you might also want to see what has been modifed. Then you you'd have to save the content and compare them to each other in order to highlight the changes.
You could use http://php.net/manual/en/function.file-get-contents.php and a CRON job to cache the page on your server and then perdiodically compare your cache. The comparing part will be the tricky part, since you have to write specific code to ignore the things that don't matter to you e.g. date/time stamp, header changes, menu changes, etc.

The sure-fire way to detect page changes is to download and checksum it. If the checksum changes, the page has been edited (with extremely high certainty).
Here's an example that works on the command line:
curl -s news.ycombinator.com | md5 #=> d86582bec138c051b0d8322f7823a23c
That was a few minutes ago. If you run it now you'll get a different answer!

Related

How does Chrome update URL bar completions?

I really enjoy using Chrome's URL bar because it remembers commonly-visited sites and often suggests a good completion based on what I've typed and/or visited before. So, for example, I can type t in the URL bar and Chrome will automatically fill it in with twitter.com, or I can type maps and Chrome will fill in the .google.com. This gives me the convenience of data-driven domain name shortcuts without having to maintain an explicit list.
What I'm wondering, though, is how Chrome determines that an old shortcut should be replaced with a new one. For example, if I visit twitter.com often, then that becomes the completion when I type t. But if I then start visiting twilio.com often enough, then, after some time, Chrome will start to fill that in as the default completion for t. What I can't figure out is how or when that transition takes place. It also seems that there are (at least) two cases involved : one for domain names, and another for path strings, because if I visit a certain full URL often, and then want to get to the root of the same domain, I end up having to type the entire domain name out to get Chrome to ignore the full-URL completion.
If I had to guess, I'd imagine that Chrome stores the things that I type in the URL bar in a trie whose values are the number of times that a particular string has been typed (and/or visited ?). Then I'd imagine it has some sort of exponential decay model for the "counts" in the trie. But this is just a guess. Does anyone know how this updating process happens ?

Well, I ended up finding some answers by having a look at the Chromium source code ; I'd imagine that Chrome itself uses this code without too much modification.
When you type something into the search/URL bar (which is apparently called the "Omnibox"), Chrome starts looking for suggestions and completions that match what you've typed. To do this, there are several "providers" registered with the browser, each of which knows how to make a particular type of suggestion. The URL history provider is one of these.
The querying process is pretty cool, actually. It all happens asynchronously, with particular attention paid to which activity happens in which thread (the main thread being especially important not to block). When the providers find suggestions, they call back to the omnibox, which appears to merge and sort things before updating the UI widget.
History provider
It turns out that URLs in Chrome are stored in at least one, and probably two, sqlite databases (one is on disk, and the second, which I know less about, seems to be in memory).
This comment at the top of HistoryURLProvider explains the lookup process, complete with multithreaded ASCII art !
Sqlite lookup
Basically, typing in the omnibox causes sqlite to run this SQL query for looking up URLs by prefix. The suggestions are ordered by the number of visits to the URL, as well as by the number of times that a URL has been typed.
Interestingly, this is not a trie ! The lookup is indeed based on prefix, but the scoring of those lookups does not appear to be aggregated by prefix, like I'd imagined.
I had a little less success in determining how the scores in the database are updated. This part of the code updates a URL after a visit, but I haven't yet run across where the counts are decremented (if at all ?).
Updating suggestions
What I think is happening regarding the updating of suggestions -- and this is still just a guess right now -- is that the in-memory sqlite database essentially has priority over the on-disk DB, and then whenever Chrome restarts or otherwise flushes the contents of the in-memory DB to disk, the visit and typed counts for each URL get updated at that time. Again, just a guess, but I'll keep looking as I get time.
The code is really nice to read through, actually. I definitely recommend it if you have similar questions about Chrome.

Detecting what changed in an HTML Textfield

For a major school project I am implementing a real-time collaborative editor. For a little background, basically what this means is that two(or more) users can type into a document at the same time, and their changes are automatically propagated to one another (similar to Etherpad).
Now my problem is as follows:
I want to be able to detect what changes a user carried out onto an HTML textfield. They could:
Insert a character
Delete a character
Paste a string of characters
Cut a string of characters
I want to be able to detect which of these changes happened and then notify other clients similar to "insert character 'c' at position 2" etc.
Anyway I was hoping to get some advice on how I would go about implementing the detection of these changes?
My first attempt was to consider the carot position before and after a change occurred, but this failed miserably.
For my second attempt I was thinking about doing a diff on the entire contents of the textfields old and new value. Am I missing anything obvious with this solution? Is there something simpler?

It is a really hard work make this working today, for several reasons, but
maybe you will need to restrict only to some browsers. read: https://developer.mozilla.org/en/XUL/Attribute/oninput the alternative to "oninput" is listen to all input events (keyboard, mouse, dragdrop) I suggest to use "oninput"
html is not perfect... even html5. input and textareas supports only single-range
selections. you can solve this using designmode/contenteditable instead of
textareas/textfield
detecting offsets of what changed is a hard work: read
-- https://developer.mozilla.org/en/Document_Object_Model_%28DOM%29/window.getSelection
-- http://www.quirksmode.org/dom/range_intro.html -- http://msdn.microsoft.com/en-us/library/ms535869%28v=VS.85%29.aspx -- http://msdn.microsoft.com/en-us/library/ms535872%28v=VS.85%29.aspx
you may need a "diff" algorithm written in javascript! http://ejohn.org/projects/javascript-diff-algorithm/
one personal note: detecting words, characters changes may be totally non-sense and not useful, detect instead paragraphs changes, or in case of an excel-like worksheet, the single cell
I hope this helps
feel free to correct my English!

My pseudocode/written out response would be (if I understand your question exactly) to use jQuery to detect keyup events and then save the input to the server via ajax, then also take the response and post it back to the input. This isn't very efficient, but basically the idea is that you're constantly posting and checking what else has been posted. If you want to see what someone else is doing in real time, you can ping the server every second or so and update with the response.
All of this of course can be optimized, but it still is kind of taxing for a server. You could also see if you can implement Google Topeka Wave for your project, or get in touch with Google Topeka to see how they do it :)

How can I improve the subjective speed of my application?

Today my co-worker noticed that when adding a decimal place to a progress indicator leads to the impression that the program is running faster than without. (i.e. instead of 1,2,3... it shows 1, 1.2, 1.4, 1.6, ...) I checked it and I was surprised that I got the same impression even though I knew it was faked.
That makes me wonder: What other things are there to create the impression of a fast application?
Of course the best way is to actually make the application faster, but from an algorithmic point of view often there's not much you can do. Additionally I think making a user less frustrated is a good thing, even though it is more or less a psychologic trick.

This effect can be very dramatic: doing relatively large amounts of work to give users a correct and often updating status of progress can of course slow down the actual running time of the application (screen updates, progress display needed calculations, etc) while still giving the user the feeling it takes less time.
Some of the things you could do in GUIs:
make sure your application remains responsive (resizing the forms remains possible, perhaps give a cancel button for the operation?) while background processing is occurring
be very consistent in showing status messages/hourglass cursors throughout the application
if you have something updating during an operation, make sure it updates often (like the almost ridiculous showing of filenames and registry keys during an install), or make sure there's an option to make it do this for users that like this behavior

Present some intermediate, interesting results first. "We've found 2,359 zetuyls matching your request, we're just calculating their future value".
I've seen transport reservation systems do that sort of thing quite nicely.

Showing details (such as the names of files being copied in an installation process) can often make things seem like they're going faster because there's constant, noticeable activity (as opposed to a slowly-creeping progress bar).
If your algorithm is such that it generates a list of results, and you have some way of displaying results as they're generated (as opposed to all at once at the end), do so - the sooner the user has something else to look at besides a spinner, the better.

Allow the user to do something else, while your application is processing data or waiting for a result. In application-scope you could allow to do some refinement of a search query or collect information for preparing next steps. Or just present some other "work" necessary to do or just some hints, documentation, statistics, entertainment..

Use one of those animated progress bars which look like they are doing something even when they aren't progressing. Also, as peSHIr said - print each filename that you copy and update it really fast - you could even fake it by cycling through a large string array N times a second.

I've read somewhere that if the process seems to be speeding up, it seems to be faster than when it's progressing at a steady pace. I can't find the reference right now, but it should be simple to implement.
(10 minutes later...)
A further look down Google lane unearthed the following references:
http://www.azarask.in/blog/post/hacking-memory/
http://blogs.msdn.com/time/

Here is an article about "Expressing time in your UI" and user perception of time. I do not know if it is exactly what you expect as an answer, but it is definitely worth the read.

Add a thread sleep at critical points. With each passing version, reduce the delay.

Change config values on a specific time

I just got a mail saying that I have to change a config value at 2009-09-01 (new taxes). Our normal approach for this would be to to awake at 2009-08-31 at 23:59 and then just change the value manually. Which not is a big problem since this don't happens to often. But it makes me wonder how other people handle issues like this.
So! How do you handle date specific config changes?
(We are working in asp.net but I don't think this has to be language specific)
Br
Carl Bergquist

I'd normally store this kind of data in a database table like this
Key, Value, EffectiveFrom, EffectiveTo
-----------------------------------------
VAT, 15.0, 20081201, 20091231
VAT, 17.5, 20100101, NULL
I'd then use the EffectiveFrom and EffectiveTo dates to chose the value that is effective at the given time. If the rate is open ended then the effecive to could either by NULL or 99991231.
This also allows you to go back without having to change the config. E.g. if someone asks you to recalculate the tax for the previous month before the rate change.

In linux, there is a command "at" for batch execution.
See "man at" for details.

To be honest, waking up near the time and changing it seems to be the simplest and cheapest approach. All of the technical solutions are fine, but it depends where you work.
In our environment it would be cheaper and simpler to get someone to wake up and make the change than to redevelop the functionality of a piece of software that already works. It certainly involves less testing, development overhead and costs which means we would tend to solve the problem as you do, manually.

That depends totally on the situation and the technology.
pjp's idea is good, if you get your config from a database, or as metadata to define the valid time for whole config sets/files.
Another might be: just prepare a new configfile with the new entries and swap them at midnight (probably with a restart of the service/program) whatever.
Swapping them would be possible with at (as given bei Neeraj) ...
If timing is a problem you should handle the change, or at least the timing of the change on the running server (to avoid time out of synch problems).

We got same kind of problem some time before and handled using the following approach.
this is suitable if you are well known to the source that orginates the configuration changes..
In our case, the source exposed a webservice (actualy a third party) which will return a modified config details. And there is a windows service running on our server which keeps on polling the webservice and will update the configuration file if there is any change.
this works perfectly in our case..
You can make use of this approach by changing the polling webservice part to your source of config change (say reading changes from some disk path). But am not sure how this is possible reading config changes from email.

Why not just make a shell script to swap out the files. run it in cron and switch the files out a minute before and send an alert text if NOT successful and an email if successful.
This is an example on a Linux box but I think you get the point and can do this on a Windows box.
Script:
cp /path/to/old/config /path/to/backup/dir/config.timestamp
cp /path/to/new/config
if(/path/to/new/config exsits) {
sendSuccessEmail();
} else {
sendPanicTextAlert();
}
cron:
59 23 31 8 * /path/to/script.sh
you could test this as well before hand just point to some dummy directories and file

I've seen the hybrid approach. Instead of actually changing the data model to include EffectiveDate/EndDate or manually changing the values yourself, schedule a script to change the values automatically. Also, be sure to have a solid test plan that will validate all changes.
However, this type of manual change can have a dramatic impact on reporting. If previous transactions join directly to the tables being changed, numbers in historical reports could change in a very bad way. There really is no "right" answer.

If I'm not able to do something like pjp's solution, I'd use either a scheduled task or a server job to update it automatically at the right time.
But...I'd probably still be awake checking it had worked.

Look the best solution would be to parameterise your config file and add things like when a certain entry should be used from. This would negate the need for any copying or swapping of files and your application would simply deal with it. (That goes for a config file approach or a database)
If you cannot change the current systems and you have to go with swapping the config files, then you also have two options:
Use a scheduled task to kick off a batch job or even a VBScript or PowerShell script (which ever you feel comfortable with) Make sure you set up the correct credentials to be able to do this at the middle of the night and you could also add some checking and mitigation into this approach.
Write a windows Service that does this for you. Here you have all the flexibility you need. Code it to do whatever it needs to do, do all the checks you need to (so that you can keep sleeping rather than making sure it actually worked) etc, etc. You service would then even take care of the scheduling aspect and all will be good. Here you could use xml DOM object and xPath and not replace the file, but simply update the specific entries as required.
Remember that any change to the config file would cause your site to restart, so make sure you take care of all the other housekeeping stuff that this could cause. (Although this would be exactly the same if you where sitting there in the middle of the night copying file around)

What can I do to prevent write-write conflicts on a wiki-style website?

On a wiki-style website, what can I do to prevent or mitigate write-write conflicts while still allowing the site to run quickly and keeping the site easy to use?
The problem I foresee is this:
User A begins editing a file
User B begins editing the file
User A finishes editing the file
User B finishes editing the file, accidentally overwriting all of User A's edits
Here were some approaches I came up with:
Have some sort of check-out / check-in / locking system (although I don't know how to prevent people from keeping a file checked out "too long", and I don't want users to be frustrated by not being allowed to make an edit)
Have some sort of diff system that shows an other changes made when a user commits their changes and allows some sort of merge (but I'm worried this will hard to create and would make the site "too hard" to use)
Notify users of concurrent edits while they are making their changes (some sort of AJAX?)
Any other ways to go at this? Any examples of sites that implement this well?

Remember the version number (or ID) of the last change. Then read the entry before writing it and compare if this version is still the same.
In case of a conflict inform the user who was trying to write the entry which was changed in the meantime. Support him with a diff.
Most wikis do it this way. MediaWiki, Usemod, etc.

Three-way merging: The first thing to point out is that most concurrent edits, particularly on longer documents, are to different sections of the text. As a result, by noting which revision Users A and B acquired, we can do a three-way merge, as detailed by Bill Ritcher of Guiffy Software. A three-way merge can identify where the edits have been made from the original, and unless they clash it can silently merge both edits into a new article. Ideally, at this point carry out the merge and show User B the new document so that she can choose to further revise it.
Collision resolution:
This leaves you with the scenario when both editors have edited the same section. In this case, merge everything else and offer the text of the three versions to User B - that is, include the original - with either User A's version in the textbox or User B's. That choice depends on whether you think the default should be to accept the latest (the user just clicks Save to retain their version) or force the editor to edit twice to get their changes in (they have to re-apply their changes to editor A's version of the section).
Using three-way merging like this avoids lock-outs, which are very difficult to handle well on the web (how long do you let them have the lock?), and the aggravating 'you might want to look again' scenario, which only works well for forum-style responses. It also retains the post-respond style of the web.
If you want to Ajax it up a bit, dynamically 3-way merge User A's version into User B's version while they are editing it, and notify them. Now that would be impressive.

In Mediawiki, the server accepts the first change, and then when the second edit is saved a conflicts page comes up, and then the second person merges the two changes together. See Wikipedia: Help:Edit Conflicts

Using a locking mechanism will probably be the easiest to implement. Each article could have a lock field associated with it and a lock time. If the lock time exceeded some set value you'd consider the lock to be invalid and remove it when checking out the article for edit. You could also keep track of open locks and remove them on session close. You'd also need to implement some concurrency control in the database (autogenerated timestamps, perhaps) so that you could make sure that you are checking in an update to the version that you checked out, just in case two people were able to edit the article at the same time. Only the one with the correct version would be able successfully check in an edit.
You might also be able to find a difference engine that you could just use to construct differences, though displaying them in a wiki editor may be problematic -- actually displaying the differences is probably harder than constructing the diff. You'd rely on the versioning system to detect when you needed to reject an edit and perform a diff.

In Gmail, if we are writing a reply to a mail and someone else sends a reply while we are still typing it, a popup appears indicating that there is a new update and the update itself appears as another post without a page reload. This approach would suit your needs and if you can use Ajax to show the exact post with a link to diff of what was just updated while User B is still busy typing his entry that would be great.

As Ravi (and others) have said, you could use an AJAX approach and inform the user when another change is in progress. When an edit is submitted, just indicate the textual differences and let the second user work out how to merge the two versions.
However, I'd like to add on with something new you could try in addition to that: Open a chat dialog between the editors while they're doing their edits. You could use something like embedded Gabbly for that, for instance.
The best conflict resolution is direct dialog, I say.

Your problem (lost update) is solved best using Optimistic Concurrency Control.
One implementation is to add a version column in each editable entity of the system. On user edit you load the row and display the html form on the user. A hidden field gives the version, let's say 3. The update query needs to look something like:
update articles set ..., version=4 where id=14 and version=3;
If rows returned is 0 then someone has already updated article 14. All you need to do then is how to deal with the situation. Some common solutions:
last commit wins
first commit wins
merge conflicting updates
let the user decide
Instead of an incrementing version int/long you can use a timestamp but it's not suggested because:
retrieving the current time from the JVM isn't necessarily safe in a clustered environment, where nodes may not be time synchronized.
(quote from Java Persistence with Hibernate)
Some more info at the hibernate documentation.

At my office, we have a policy that all data tables contain 4 fields:
CreatedBy
CreatedDate
LastUpdateBy
LastUpdateDate
That way there is a nice audit trail on who has done what to the records, at least most recently.
But most importantly, it becomes easy enough to compare the LastUpdateDate of the current or edited record on the screen (requires you to store it on the page, in a cookie, whatever, with the value in the database. If the values don't match, you can decide what to do from there.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008