Can a web app give desktop notification - html

Well, A small idea of making an application ran through my mind. But this is the first time I would try to make an application. Till now I have worked on PHP, C++, Ruby with Rails frame work, but all at a beginner level.
I am ready to take a bigger leap and learn them extensively if my idea demands so. The requirement is such that the web app gives you a desktop notification. I have seen that happen with a Gmail chat box. All I need to know is, what could be right tools to start with? Does PHP go that far? Or working on Rails would be a better option? Also, is it possible for an average beginner level programmer to do such a thing? What all does it take to make an app like that?

The desktop notification is not language dependant. There is no server-side language that can make a notification appear on a desktop. It doesn't matter what you put on the server, you can't make that code put notifications on the desktop unless the client supports such a notification API.
Client support means there has to be some sort of application running on the client that can put up the notification dialog on the desktop. This is the number one requirement.
If you don't have anything running on the client with such support, then no matter what magic you put on the server (PHP, Ruby, Perl, ...) nothing will ever happen on the client.
That's why, as you see from the answer by KCiebiera, the clients that can make notifications possible are Chrome and Firefox.
What you do is run code on the client (Javascript in this case), that looks for something (a message from the server for example), and then instructs the browser (Chrome or Firefox), to launch the notification.
It's not more complicated than that. Look at the tutorial KCiebiera posted, and that should get you started.
Hope this helps you understand the problem better.

AFAIK desktop notifications are available only in Chrome/Chromium and Firefox browser. There is a working draft http://www.w3.org/TR/notifications/ and also a great tutorial http://www.html5rocks.com/en/tutorials/notifications/quick/ on how to use them. Basically you have to learn a JavaScript API.

What goes on the server-side is irrelevant here, as the notifications happen client-side. Desktop notifications are currently part of a W3C working draft, and are implemented in the current version of Chrome and the next release versions of Firefox, Safari, & IE; see the compatibility table for more details. There are plenty of tutorials out there on how to implement this.
You would have to implement on the server some method that the client-side can use to get (or be pushed) the notifications so that it can display them using the above mentioned API. It doesn't matter if your use Rails, PHP, C++, whatever, they can all do it with enough effort.

Related

Architectures to access Smart Card from a generic browser? Or: How to bridge the gap from browser to PC/SC stack? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 13 days ago.
The community reviewed whether to reopen this question 13 days ago and left it closed:
Original close reason(s) were not resolved
Improve this question
What are the existing client-side architectures to access a local Smart Card thru a PC/SC Smart Card reader (ISO 7816-3, ISO 14443) from a generic browser (connected to a server through http(s)), preferably from Javascript, with the minimum installation hassle for the end user? The server needs to be able to at least issue APDUs of its choice to the card (or perhaps delegate some of that to client-side code that it generates). I am assuming availability on the client side of a working PC/SC stack, complete with Smart Card reader. That's a reasonable assumption at least on Windows since XP, modern OS X and Unixes.
I have so far identified the following options:
Some custom ActiveX. That's what my existing application uses (we developed it in-house), deployment is quite easy for clients with IE once they get the clearance to install the ActiveX, but it does not match the "generic browser" requirement.
Update: ActiveX is supported mostly by the deprecated IE, including IE11; but not by Edge.
Some PC/SC browser extension using the Netscape Plugin API, which seems like a smooth extension of the above. The only ready-made one I located is SConnect (webarchive). It's no longer promoted (Update: thought still actively maintained and used late 2020 in at least one application), it's API documentation (webarchive) is no longer officially available, and it has strong ties to a particular Smart Card and reader vendor. The principle may be nice, but making such a plugin for every platform would be a lot of work.
Update: NPAPI support is dropped by many browsers, including Chrome and Firefox.
A Java Applet, running on top of Oracle's JVM (1.)6 or better, which comes with javax.smartcardio. That's fine from a functional point of view, well documented, I can live with the few known bugs, but I'm afraid of an irresistible downwards spiral regarding acceptance of Java-as-a-browser-extension.
[update, Feb 2021]: This answer considered the WebUSB API as a promising solution solution in 2015, then reported in 2019 that can't work or is abandoned. I made a question about it there.
Any other idea?
Also: is there some way to prevent abuse of whatever PC/SC interface the browser has by a rogue server (e.g. presenting 3 wrong PINs to block a card, just for the nastiness of it; or making some even more evil things).
The fact is that browsers can't talk to (cryptographic) smart cards for other purposes than establishing SSL.
You shall need additional code, executed by the browser, to access smart cards.
There are tens of custom and proprietary plugins (using all three options you mentioned) for various purposes (signing being the most popular, I guess) built because there is no standard or universally accepted way, at least in Europe and I 'm sure elsewhere as well.
Creating, distributing and maintaining your own shall be a blast, because browsers release every month or so and every new release changes sanboxing ir UI tricks, so you may need to adjust your code quite often.
And you probably would want to have GUI capabilities, at least for asking the permission of the user to access a card or some functionality on it.
For creating a multiple-platform, multiple browser plugin, something like firebreath could be used.
Personally, I don't believe that exposing PC/SC to the web is any good. PC/SC is by nature qute a low level protocol that when exposing this, you could as well expose block level access to your disk and hope that "applications on the web are mine only and they behave well" (this should answer your "Also"). At the same time a thin shim like SConnect is the easiest to create, for providing a javscript plugin.sendAPDU()-style code (or just wrap all the PC/SC API and let the javascript caller take care of the same level of details as in native PC/SC API use case).
Creating a plugin for this purpose is usually driven by acute current deficiencies.
Addressing the future (mobile etc) is another story, where things like W3C webcrypto and OpenMobile API will probably finally somehow create something that exposes client-side key containers to web applications. If your target with smart cards is cryptography, my suggestion is to avoid PC/SC and use platform services (CryptoAPI on Windows, Keychain on OSX, PKCS#11 on Linux)
Any kind of design has requirements. This all applies if you're thinking of using keys rather than arbitrary APDU-s. If your requirement is to send arbitrary APDU-s, do create a plugin and just go with it.
Update (8/2016): A new API for the Web called WebUSB API is being discussed. You can already use it with Chrome v54+.
This standard will be implemented in all major browsers and will replace the need for third-party applications or extensions for Smard Cards :-)
So the new answer is YES!
And the OSI-like architecture stack is:
PC/SC
CCID v1.1
WebUSB API
USB driver, i.e. libusb.
2019 Update: As #vlp commented, it seems that it doesn't work any in Chrome because they decided to block WebUSB for smartcards for some specious reasons :-(
Note: Google annonced that they will abandon Chrome Apps in 2017.
Previous anwser:
Now (2015) you can create a Google Chrome App, using the chrome.usb API.
Then you access the smartcard reader via its CCID-compliant interface.
It's not cross-browser but JavaScript programmable & cross-platform.
Anyway Netscape Plugin API (NPAPI) is not supported any more by modern browsers. And Java applets are being dismissed by browser vendors.
I have just released a beta plugin addressing this problem.
This beta code is available here:
https://github.com/ubinity/webpcsc-firebreath
This plugin is based on the firebreath framework and has been beta-tested with Fireofx and Chrome under Linux/WinXP/Win7. Source code and extension pack are provided.
The basic idea is to provide a PCSLite API access and then develop a more friendly JS-api on top of this.
This plugin is under active development, so feel free to send any report and request.
For your first question I have little hope: either you are satisied with a very small subset of smart card functionality (like signing e-Mail or PDFs), then you may use some ready-made software (like PKCS), ideally maintained by the smart card company, or you want broader functionality and need to invest considerable effort on your own. Surely PCSC is the starting point to choose.
At least for your "also:" there is some hope.
1) Note, that some specifications (e.g. ICAO/German BSI TR-3110) request a method, where a PIN is not blocked, but uses a substantial amount of time as soon as the error counter hits 1 before replying. The final attempt must be enabled using a different command, otherwise no further comparison and error counter adjustment is done.
2) Simply protect the Verify command by requiring secure messaging. Sensitive applications use secure messaging for everything, so first step a session key is negtiated, which is second applied to all succeeding commands and responses. The effect would be, that the command is rejected due to incorrect MACs long before a comparison or modification of error counter is done.
There is another browser plugin similar to the one proposed by #cslashm available at http://github.com/cardid/WebCard. Is also open source and can be installed with "minimum installation hassle" as required in the original question. You can see an example of use visiting http://plugin.cardid.org
WebCard has been tested in IE 8 through 11, Chrome and Firefox in Windows and in Chrome and Safari in Mac OS X. Since is just a wrapper for PC/SC it requires in Mac OS X the installation of SmartCard Services from http://smartcardservices.macosforge.com
As chrome and firefox going to stop the support of NPAPI Plugin, there is no secure solution available to maintain the session for the smart card reading instead your certificate of the card have support for mutual ssl ,I answered for the similar question source,It might help
Its dirty, but if its acceptable / viable to install a bridge daemon/service on the client machine, then you can write a local bridge service (e.g. in python / pyscard) that exposes the smartcard via a REST interface, then have javascript in the browser that mediates between that local service (facade) and the remote server API.
Web Serial API (draft) can be used to communicate with a serial smart card reader from some browsers.
Buyer beware: This API is a draft and may be changed/abandoned at any time.
Speaking about Chrome, you can now use the Smart Card Connector app provided by Google which bundles the PC/SC-Lite port and the generic CCID driver.
The app itself works through the chrome.usb API, that was mentioned by the previous commenters.
So, instead of rewriting the whole stack (starting from the lowest level - raw USB), it's now possible for developers to code only the part that works on top of PC/SC API - which is exposed by the Connector app.
Clients,clients,clients...plugins,..JSApis..
Well..
For certain we know this : All browsers, when communicating to an Apache or IIS servers, are actually signing "something" when a https/SSL handshake process is needed.
For instance, a typical Apache configuration like this:
SSLVerifyClient require
SSLVerifyDepth 10
SSLOptions +FakeBasicAuth +StdEnvVars +ExportCertData +OptRenegotiate
Initiates a PIN pad pop up and the user must insert the smartcard pin to go on.
Well, my idea is : why not make the turn to the server, and tweak that behaviour, in order to upload a bytestream of stuff to sign something when a handshake is initiaded?
I have a setup where a smartcard reader is scanned to login a user. The PC/SC library work great on desktop. Somebody had mentioned to use
Emscripten (https://github.com/kripken/emscripten) compiler which compiles c++ into JavaScript code. But that didn't work well because some of the functions being used by PC/SC are only available server side.
After much research. I finally gave up on a client side solution, chrome web usb API also couldn't recognize the reader.
I then decided to give signalR a try and set up a hub on the PC connected to the smartcard reader and this approach worked out very well.

Can HTML5 communicate with peripherals like scanners and credit card readers?

My company writes software that installs on client machines to perform point-of-sale transactions. The software interfaces with a variety of external peripherals (receipt printers, bar code scanners, credit-card readers, etc). We do this with a WinForms app that we created in Visual Studio using the Microsoft OPOS library, which in turn communicates with our cloud server.
There are obvious inefficiencies in this model, primarily with updates. I'm researching other ways to communicate with these peripherals over the web, preferably via web browser. So far as I can tell, Java is one of the only technologies out there that can do what we're looking for (via applet), and I assume Adobe Flash can as well (via the Air platform). These are viable, but not preferable because we want to run our software on web-enabled mobile devices.
Does anybody have suggestions for other ways to communicate with external peripherals over the web?
UPDATE (Jan 16th, 2019): The Credential Management API has been announced. It's currently only supported on Chrome and Opera but it's looking promising. Google Developers wrote an article elaborating on the spec.
UPDATE (Dec 28th, 2016): Another couple years gone, and another update. This one will be more focused on two new developments than anything else. See the new "WebUSB & Web BlueTooth" section under "Full Device API". But the answer remains the same.
UPDATE (Nov 3rd, 2014): It's been just over two years since the original post was made, but the answer remains mostly the same for now. We are, however, closer to your original goal in several areas.
ORIGINAL ANSWER:
There would be a number of ways to go about this.
Background
The HTML5 specification has entered into the "Recommendation" state. This means that HTML5 is pretty much set for what it looks like. However, I will be using HTML5 in the same way that every marketing person in the world has decided is best. That is, I will not be talking about HTML. Well, I will, in so far as you will utilize it from an HTML page, but not really. What I'll actually be discussing is JavaScript (JS) and that's a horse of a different color. But for all intents and purposes, we're putting it all under the same heading as HTML5, which has been decided to mean "shiny and new" now.
Also, the items which I am discussing will vary in support. Some are very browser dependent projects (like Chromium specific implementations), and some are more standards driven projects that may not have browsers implementing or experimenting with them yet. I'll try to distinguish between the two as I go along.
Full Device API
Status: Incoming, but not ready
Being able to access devices from the browser is making slow but steady progress. Right now, many modern browsers have access to some of the more common devices like the camera or gamepads, but they are all high level APIs. Browser vendors, the standards groups, and lots of companies involved with the web are all trying to make webapps just as powerful as your local applications.
But the APIs you are looking for are still in progress and a ways off. For your particular case, and for the more general case of connecting your webapp to most devices, we're still a few years away from something we can use. If you want to see what awesome things are coming up in that field, here are just a select few items that may help you directly:
Web Near Field Communication (NFC) API
This one unfortunately may be dead in the water for now. But it looks like originally some folks at the W3C (mostly Intel it looks like) were looking at adding a NFC API to the web.
Media Capture Streams
The WebRTC group is working on programmatic access to media streams like the camera which would allow to integrate things like barcode scanning or other features. This has reached CR status and is available in browsers, but is less helpful on its own.
Web Bluetooth
If you had bluetooth capable tools, this API would help you connect with them from computers and devices that were able to listen and connect. The primary driver for this at the moment seems like it is the Chrome team, including an experimental implementation, but I wouldn't consider it anywhere ready to use yet (See "WebUSB & Web BlueTooth" section).
WebUSB
This would allow full access to low level USB information including listing devices and interacting with them. Same as Web BlueTooth, this seems to be current Chrome pet project, but I also wouldn't rely on it (See "WebUSB & Web BlueTooth" section).
Network Service Discovery
If you have other devices or items on the network which broadcast and use HTTP, this API would allow you to discover and interact with these services. No browser implementation, but it is in a working draft for the W3C.
Originally, Mozilla was pushing a number of these forward because of Boot2Gecko (or Firefox OS). However, with that project officially cancelled, we aren't seeing much progress from them in these areas now.
Members of the Chrome team, however, seem to have decided to dive in and start not only working towards these, but putting them live in browsers. Which leads us to...
WebUSB & Web BlueTooth
Like sausages, it's better to not know how Web Standards are made
-Abraham Lincoln (probably)
There's been a little bit of buzz in this area as it looks like the Chrome team snuck in these as experimental features and developed their own specification for it. Which is great! Just maybe not in the way that you were hoping for.
Each browser vendor and W3C contributor group has their own style and makes contributions towards the specs in their own way. The result is usually a fairly decent specification that the browsers have agreed upon. But getting from nothing to something is... messy. Real messy. And is quite a process a lot of times. It doesn't always result in a good spec (yeah, I'm talking about you Florian compromise...) but even when it does, it takes a while.
However, It seems like Google developed this version of the spec all on their own. And, in my experience, Google's approach to the specs is always a little... well... setting my personal opinions aside we'll say "gung-ho". They tend to just dive right into the deep end. And that seems to be what they've done here.
I highly doubt these specs or implementations will look anything like this when they become standards. And there's nothing wrong with that. That's part of the process. But I wouldn't go relying on this implementation or developing any code or products against it. This is an unprecedented feature on the web and all the browser vendors are gonna want a big say in this.
That said, this is actually good. One of the things Google often does (for better or worse) with situations like this is forces the conversation and it can push things along. And having a feature shipped in the browser, even an experimental feature, can turn up the heat on the demand for it. So we may see more progress in this area soon.
PhoneGap Apache Cordova. You know, for your phone
Status: Not fully featured and phone only
Apache Cordova, previously Adobe PhoneGap, is a way to write your program in HTML, CSS, and JS that allows you to access lower level functionality on things like phones, and compile across devices. This would be a way to implement your program, but it would be a phone application, not necessarily a desktop one. An option to consider, and something I figured I would mention.
Cordova implements a few of the above features already, but doesn't have some of the more powerful ones like NFC or BlueTooth.
The Native-App solution (for Windows 8)
Status: Possible, but OS specific and desktop app
Windows 8 offers the ability to build applications in HTML and JS. This would allow you to easily access lower level functionality on the OS via their API. From the looks of it, it is pretty extensive and you can do a lot. You mentioned cross OS support, however, and this obviously limits you to one OS.
It's so Flash-y!
Status: Dying/Dead, not possible as a web app
Flash won't have direct access to the system through the web. You could create an AIR application, but that will sort of defeat the purpose of having it web based. In addition, Flash support on mobile, and on the web it would seem, is on the decline.
NodeJS
Status: Can be a bit of a pain and only possible as a desktop app
NodeJS and JS applications have sort of been a hot topic the last couple years. I didn't discuss it in my original post because I felt it wasn't quite there yet. However, things have progressed and it is much closer to being ready for this sort of thing, and has the support and power of a growing user base. That said, for your particular case, I wouldn't recommend using it. It would have to be local on the users machine, and because of how NodeJS (and similar engines) are at the moment, it would require a lot of extra configuration and setup that would complicate things a bit.
So you could build an app using HTML, CSS and JS with NodeJS or similar engines and have low level access to what you need, but it has to be local, and it would take more work than I'm sure you want to do every time you'd like to install it for a customer.
... Now where was I?
So where does that leave us? Well, simple: if you want a single language/set of code as your code base, HTML/CSS/JS aren't a great option... yet. But they could be some day. For now, your options are limited to what you feel is best for your customer. Java is a stable option you listed, but obviously comes with its own drawbacks. As the web develops, I think we'll see a lot of really cool things coming out of the new functionality, but we've got a ways to go still.
More reading:
Brian.IO: Beyond HTML5
HTML5 Apps on Windows 8
Wikipedia list of projects built using JS
This is possible, but it would have to be done indirectly. In theory, you could write a socket-server in a low level language, which gets I/O, and sends the I/O through the socket (relaying, I guess). HTML5 uses WebSockets, or some equivalent to communicate with this socket-server.
Now it can be achieved with WebUSB API.
It is available in Chrome since version 54.
It is a W3C editor's draft so we can expect (hope) that it will be adopted by other browser vendors...
I've been thinking about this a lot lately... have a POS app mostly written in VB6, considering what to do next. HTML5 is an option and I was thinking I'd use VSPE to get the serial stuff into the JS.
http://www.eterlogic.com/Products.VSPE.html
Love this product! Works very well for getting serial traffic where you need it, so I think it would work well, at least as a proof-of-concept to get you going. You'll want to use a combination of "connector" types along with the "tcpclient" and "tcpserver".
Just for the record, a method that works well in 2016 (since chrome 26), but is to be withdrawn over the next 2 years is to deploy your html5 as a chrome app and use chrome.usb (or chrome.serial or chrome.bluetooth).
I am currently using chrome.usb and planning to migrate to a web app using WebUSB API (see Supersharp's answer), which I hope will be adopted by the time Google discontinue chrome apps 🤞.

Looking to develop chat server that works in HTML5? Technology available?

I was a big fan of AIM and live chat/buddy lists back in the day. With the rise of HTML5 and its use becoming more common in modern browsers, I'd like to develop an HTML5 messaging system.
What technologies do I have to look up? At the start, I won't care about the design (CSS), just functionality.
I'll most likely have a standard registration and store users in a MySQL Database.
Additionally, "friends" will also easily be stored in a database, populating a user's buddy list based on which user ID's he/she has marked as "friend".
The actual server and client connectivity is what most interests me. Is this technology available for HTML5 yet? Point me in the right direction and I'll be good to go!
For the chat, you would probably like to look in to Websockets (as you talk about HTML5).
There are also examples like this where NodeJS is used. To use node, you would have to run a node-server. For examples and more info: nodejs.org
I think the websockets API will be your first port of call for a messaging app in HTML5. You'll be wanting the server to notify the client rather than the client poll or rely on callbacks and this would be the start i think.
However, I don't think this is very well supported in even the most modern browser. In fact i believe firefox and opera have pulled support because of security concerns.
I haven't done any work in this myself but just though it looked interesting stuff. So I guess I just wish you luck with your dev. Exciting cutting edge stuff I think.

Where can I find clientside Websockets examples?

I'm not currently trying to set up a server with Websockets. Eventually I will, but I first just want to see a short working example of websockets, connecting to a third party server.
I found a short example here: http://www.websockets.org/about.html and here http://blog.chromium.org/2009/12/web-sockets-now-available-in-google.html but couldn't get either of them to work.
Do you know where I can find a couple/few confirmed working client side examples of websockets? I believe websockets.org provides an echo address ws://websockets.org:8787 (so please don't tell me "just set up a server yourself!). Thanks!
phpwebsocket works quite well for me, takes 2 minutes to set up and start testing
jWebSocket is a custom implementation of web sockets but they're compatible with native browser web sockets and they have a number of demos available, including a really basic hello world and a chat client. Terrible framed website though, so you'll have to go there and click the links on the left side to get to the demos.

What are the pros and cons of various ways of analyzing websites?

I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Chose a scripting language. I suggest Perl or Python: also curl+bash but it bad no exception handling.
b) Load the home page via a script, using a python or perl library.
Try Perl WWW::Mechanize module.
Python has plenty of built-in module, try a look also at www.feedparser.org
c) Inspect the server header (via the HTTP HEAD command) to find application server name. If you are lucky you will also find the CMS name (i.d. WordPress, etc).
d) Use Google XML API to ask something like "link:sitedomain.com" to find out links pointing to the site: again you will find code examples for Python on google home page. Also asking domain ranking to Google can be helpful.
e)You can collect the data in a SQLite db, then post process them in Excel.
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language. From your own computer that is connected to Internet.
iframe is a widget for displaying HTML content, it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, PHP are certainly more powerful for your tasks than Javascript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object" etc.
I think an writing an extension to Firebug would proabably be one of the easiest way to do with. For instance YSlow has been developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and Javascript-summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need and you could then build your tool as a website or such where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a firefox addon).
I think you should probably try and stay away from Option #1 (the Iframe) because of the cross-site scripting issues you already are aware of.
Also, I have used Option #3 (Grab the site on the server side) and one problem I've ran into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have ran into that also, see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the ajax issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated javascript from the content being grabbed works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrap content from a dynamic AJAX enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also here is a general article on using PHP and the Curl library to scrap all the links from a web page. However, I'm not sure if this article and the Curl library covers the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc..
[OPTIONAL] save the content as a local page on your server .
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for it's news feeds, etc...
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I’m right in the middle of similar project. It has to analyze the DOM of a page generated using Javascript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.
Fx addon is great, if you don’t need the automation. A script can analyze the page, show you the results, ask you to correct the parts, that it is uncertain of and finally post the data to some backend. You have access to all of the DOM, so you don’t need to write a JS/CSS/HTML/whatever parser (that would be hell of a job!)
Another way is Adobe AIR. Here, you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is — you don’t have access to all DOM of the pages. The only way to go pass this is to set up a simple proxy, that fetches target URL, adds some Javascript (to create a trusted-untrusted sandbox bridge)… It’s a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create HTMLLoader object, and feed the response into it (loadString method IIRC)
Create an iframe, and load the site in untrusted sandbox.
I don’t remember why, but the first method failed for me, so I had to use the other one (i think there was some security reasons involved, that I couldn’t workaround). And I had to create a sandbox, to access site’s DOM. Here’s a bit about dealing with sandbox bridges. The idea is to create a proxy, that adds a simple JS, that creates childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents is something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(be careful — there are limitations of what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML. All other was just passed through unchanged. I’ve done this using Apache + PHP, but could be done with a real proxy with some plugins/custom modules for sure. This way I had the access to DOM of any site.
end of edit.
The third way I know of, the hardest way — set up an environment similar to those on browsershots. Then you’re using firefox with automation. If you have a Mac OS X on a server, you could play with ActionScript, to do the automation for you.
So, to sum up:
PHP/server-side script — you have to implement your own browser, JS engine, CSS parser, etc, etc. Fully under control and automated instead.
Firefox Addon — has access to DOM and all stuff. Requires user to operate it (or at least an open firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR — requires a working desktop computer, more difficult than creating a Fx addon, but more powerful.
Automated browser — more of a desktop programming issue that webdevelopment. Can be set up on a linux terminal without graphical environment. Requires master hacking skills. :)
Being primarily a .Net programmer these days, my advice would be to use C# or some other language with .Net bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URL's and use the HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might be interested in including in your stats could include backlinks / pagerank (via Google API), whether the page validates as HTML or XHTML, what percentage of links link to URL's in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
whatever language you are most comfortable with. A basic stand alone script/app keeps you from needing to worry too much about browser integration and security issues.