Windows tool to view website client content without a browser - html

Per the title, I am looking for a tool, or some existing effort by other developers, that simply grabs data off of websites so one can navigate them without viewing them in a browser. I am fully aware of how most pages work, so what I would like to do is just look at the data being pulled from them, ideally using Windows technology that has (hopefully) already been written. Here is an example of what I would like to see in a tool:
a Windows interface that gives me data about a webpage (menus, submenus, button names/captions, etc.)
the ability to execute transactions on those pages by specifying what to do through the tool's interface (click a button, download an image, etc.)
Does anyone know of a tool out there that does such things?

The closest "program" that comes to mind is WWW::Mechanize, advertised as "Handy web browsing in a Perl object". It can in fact be used on Windows, but you will need Perl installed.
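For a feel of what that kind of scripted browsing looks like, here is a minimal sketch using Python's mechanize library (a port of the same idea); the URL, link text, and image path are placeholders, not a real site:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt handling for this sketch
br.open("http://example.com/")  # placeholder URL

# list the page's "menu": every link's visible text and target URL
for link in br.links():
    print(link.text, link.url)

# follow a link by its visible text, then download an image (both hypothetical)
br.follow_link(text="Some menu item")
image_bytes = br.open("http://example.com/logo.png").read()
with open("logo.png", "wb") as f:
    f.write(image_bytes)
Those few calls cover both bullet points above: inspecting what is on the page and acting on it without ever rendering it.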


Can I make CNC Editor with HTML5?

I would like to make my own CNC Editor, and I am looking for some guidance. I don't know if it is even possible with HTML5, but it would be great if it were. If possible, please list what I should research and learn.
I don't need it to be accessible online; I will only have it on my computer and will access it over the local network from multiple different computers. I don't want it to depend on the internet, because the internet is not always available.
Desired Features:
⁃ Read and Write files with different extensions (all files used are easily opened in notepad)
⁃ Store and retrieve data from a simple database file.
⁃ Make calculations
⁃ Have a text Editor window
⁃ Have a Display area for simple vector graphics depending on data loaded and provided by user.
It is possible but requires a lot of work. I would say that these are technologies you would need to master in order to pull this off:
Node.js (with Express.js) - for storing and retrieving data from a database and for reading/writing local files with whatever extensions you want (server-side)
Vue.js, Angular, or React - for building the frontend interface to manipulate your vector graphics. It can also handle the calculations, and it's good with SVGs and that kind of thing.
Electron (not mandatory) - wraps it all in a native-app-like experience. This framework actually lets you write desktop apps for any OS and architecture.
So, as I said, it would be a lot of work, but it is possible in the end.
Funny coincidence: my brother is planning to build a CNC machine, so I might be doing this as well in the next couple of months. Feel free to contact me if you need any further help!
UPDATE: You can't do it with HTML5 alone. That would be like trying to build a wooden space shuttle.

Chrome extensions communication with Arduino

My idea is to have an Arduino board that communicates with the browser.
I want the Arduino board to react (e.g. blink an LED) when the user is connected to a certain website.
User inputs on the board (e.g. pressing a button) will affect the browser (e.g. close a tab, switch tabs).
I started learning and creating simple examples using the Chrome extensions tutorial.
However, since I am not a skilled programmer myself, I'd like to know whether it is possible to achieve the things mentioned above.
How I imagine it right now:
The Chrome extension writes to a JSON file. The Arduino reads data from the JSON file -> blink LED.
The Arduino writes values to the JSON file. The Chrome extension automatically sees changes in the file and reacts accordingly -> close tab (so, without the user having to reinstall the extension each time).
Are these scenarios possible? What would be the easiest way to achieve this?
I would recommend taking a look at Firebase. It's a really easy real-time database with a lot of good tutorials (even some by Google).
Then you should use a library like this one on your Arduino.
(there are some examples to look at)
And in your Chrome extension you can fetch the values from the database really easily.
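If you want to prototype the data flow before touching the board, the Firebase Realtime Database also has a plain REST interface; here is a rough Python sketch of the polling logic (the database URL and key names are made up, and the database rules are assumed to allow access):
import time
import requests

DB_URL = "https://your-project.firebaseio.com"  # hypothetical database URL

def set_value(path, value):
    # write a value for the other side (extension or board) to react to
    requests.put(f"{DB_URL}/{path}.json", json=value, timeout=5)

def get_value(path):
    # read the current value back out
    return requests.get(f"{DB_URL}/{path}.json", timeout=5).json()

# pretend the extension noticed the user opened the target website
set_value("led", True)

# pretend we are the board, polling for changes
for _ in range(10):
    if get_value("led"):
        print("blink LED")  # on the real Arduino this would drive a pin
    if get_value("button_pressed"):  # hypothetical key written by the board
        print("close tab")  # the extension side would close the tab here
        set_value("button_pressed", False)
    time.sleep(1)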

Simple program to automate mouse and keyboard movements

Background:
I check my grades for school online every week. I then use Print Screen and MS Paint to save images of the website with my grades. (I know this is really time-consuming and inefficient, but my computer has connectivity problems with my printer and doesn't recognize any PDF makers.)
Possible Solution:
Write some program or something that automatically:
opens a browser
goes to the website
logs in
opens each individual subject
creates an image (I don't care what process it uses to do so)
Thanks, everybody.
Don't use a browser. Send HTTP requests in Perl or something like that. You can simply parse the HTML instead of grabbing screenshots.
Alternatively, ask the school to provide your grades via email or on bloody paper?
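As a rough Python sketch of that approach (the URLs, form field names, and the assumption that grades sit in table cells are all guesses; check your school site's actual HTML):
import requests
from html.parser import HTMLParser

# log in and fetch the grades page (field names are hypothetical)
session = requests.Session()
session.post("https://school.example.com/login",
             data={"username": "me", "password": "secret"})
html = session.get("https://school.example.com/grades").text

class CellDumper(HTMLParser):
    # print the text of every table cell, which is where grades usually live
    def __init__(self):
        super().__init__()
        self.in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell and data.strip():
            print(data.strip())

CellDumper().feed(html)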
Personally I'd go with curl from the command line and then parse the HTML you get back:
curl http://website.com/grades
However, if you need to do it through a web browser (they might have some funky JavaScript), there's a framework called Selenium which might be usable: http://seleniumhq.org/
It's generally used for testing but can be used for your scenario. It's got clients for lots of different languages too.
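If the Selenium route fits better (for example, because the login needs JavaScript to work), a minimal Python sketch looks something like this; the URLs and element names are placeholders you would have to adapt:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("https://school.example.com/login")  # placeholder URL

# fill in the login form (field names are hypothetical)
driver.find_element(By.NAME, "username").send_keys("me")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.NAME, "submit").click()

# visit each subject page and save an image of it
for subject in ["math", "history"]:  # hypothetical page names
    driver.get(f"https://school.example.com/grades/{subject}")
    driver.save_screenshot(f"{subject}.png")

driver.quit()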

What are the pros and cons of various ways of analyzing websites?

I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Some ideas about how I might proceed:
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl+bash is also an option, but it's a poor one since it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try Perl's WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org
c) Inspect the response headers (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress). See the sketch after this list.
d) Use the Google XML API to ask for something like "link:sitedomain.com" to find links pointing to the site; again, you will find Python code examples on Google's pages. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in an SQLite database, then post-process it in Excel.
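A small Python sketch of steps (b) and (c), loading the page and inspecting the response headers for the server name and a CMS hint (the URL is a placeholder):
import requests

url = "http://example.com/"  # placeholder site

# (c) HEAD request: headers often reveal the app server and sometimes the CMS
head = requests.head(url, allow_redirects=True, timeout=10)
print("Server:", head.headers.get("Server"))
print("X-Powered-By:", head.headers.get("X-Powered-By"))

# (b) then fetch the home page itself for further parsing
html = requests.get(url, timeout=10).text
if "wp-content" in html:  # a common (but not guaranteed) WordPress hint
    print("Looks like WordPress")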
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it's not a technology for data analysis. You can analyze data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, or PHP are certainly more powerful for your task than JavaScript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters, no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object", etc.
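For example, a minimal Python sketch that just counts those asset tags in the fetched source, using only the standard library (the URL is a placeholder):
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

class AssetCounter(HTMLParser):
    # count the tags that usually point at page assets and links
    def __init__(self):
        super().__init__()
        self.counts = Counter()
    def handle_starttag(self, tag, attrs):
        if tag in ("img", "object", "script", "link", "a"):
            self.counts[tag] += 1

html = urlopen("http://example.com/").read().decode("utf-8", "replace")  # placeholder URL
counter = AssetCounter()
counter.feed(html)
print(counter.counts)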
I think writing an extension for Firebug would probably be one of the easiest ways to do this. For instance, YSlow was developed on top of Firebug, and it provides some of the features you're looking for (e.g. image, CSS, and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason is that it looks like this might get you all the data you need, and you could then build your tool as a website (or similar) where you could get info about a site without actually having to visit the page in your browser. If YQL works for what you need, it looks like this option would give you the most flexibility.
If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox add-on).
I think you should probably try to stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.
Also, I have used option #3 (grab the site on the server side), and one problem I've run into in the past is the grabbed site loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that as well; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content along with the evalScripts:true parameter. See the following articles for more info, and for an issue you might need to be aware of regarding how evaluated JavaScript from the grabbed content works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic, AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure whether this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc.
[OPTIONAL] save the content as a local page on your server.
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
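A small Python sketch of that fix-up step, combining a regular expression with urljoin; the pattern is a simplification that only handles double-quoted src/href attributes:
import re
from urllib.parse import urljoin

def absolutize(html, base_url):
    # rewrite src="..." and href="..." values so they point back at the original site
    def fix(match):
        attr, value = match.group(1), match.group(2)
        return f'{attr}="{urljoin(base_url, value)}"'
    return re.sub(r'(src|href)="([^"]+)"', fix, html)

page = '<img src="/images/logo.png"><a href="about.html">About</a>'
print(absolutize(page, "http://example.com/section/"))
# -> <img src="http://example.com/images/logo.png"><a href="http://example.com/section/about.html">About</a>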
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX: Digg.com does, MSN.com does for its news feeds, etc.
That really depends on the scale of your project. If it's just casual and not fully automated, I'd strongly suggest a Firefox add-on.
I'm right in the middle of a similar project. It has to analyze the DOM of a page generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox add-ons, userscripts, etc.
A Firefox add-on is great if you don't need the automation. A script can analyze the page, show you the results, ask you to correct the parts it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don't need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here, you have more control over the application: you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don't have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)... It's a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create an HTMLLoader object, and feed the response into it (the loadString method, IIRC).
Create an iframe, and load the site in an untrusted sandbox.
I don't remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn't work around). And I had to create a sandbox to access the site's DOM. Here's a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS script, which creates childSandboxBridge and exposes some methods to the parent (in this case, the AIR application). The script contents are something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(Be careful: there are limitations on what can be passed via the sandbox bridge; no complex objects for sure! Use only primitive types.)
So, the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I've done this using Apache + PHP, but it could surely be done with a real proxy plus some plugins/custom modules. This way I had access to the DOM of any site.
end of edit.
The third way I know of, the hardest way: set up an environment similar to the one on Browsershots. Then you're using Firefox with automation. If you have Mac OS X on a server, you could play with ActionScript to do the automation for you.
So, to sum up:
PHP/server-side script: you have to implement your own browser, JS engine, CSS parser, etc. Fully under your control and automated, though.
Firefox add-on: has access to the DOM and everything else. Requires a user to operate it (or at least an open Firefox session with some kind of auto-reload). A nice interface for a user to guide the whole process.
Adobe AIR: requires a working desktop computer; more difficult than creating a Firefox add-on, but more powerful.
Automated browser: more of a desktop programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in.
Additional items you might be interested in including in your stats could be backlinks/PageRank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (I don't know if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
or whatever language you are most comfortable with. A basic standalone script/app keeps you from needing to worry too much about browser integration and security issues.

How to take screenshot of rendered HTML page

Our web analytics package includes detailed information about users' activity within a page, and we show (click/scroll/interaction) visualizations in an overlay atop the web page. Currently this is an iframe containing a live rendering of the page.
Since pages change over time, older data no longer corresponds to the current layout of the page. We would like to run a spider to occasionally take snapshots of the pages, allowing us to maintain a record of interactions with various versions of the page.
We have a working implementation of this (Linux), but the snapshot process is a hideous Python/JavaScript/HTML hack which opens a Firefox window, screenshotting and scrolling and merging and saving to a file. This requires us to install the X stack on our normally headless servers, and takes over a minute per page.
We would prefer a headless implementation with performance closer to that of the rendering time in a regular web browser, but haven't found anything.
There's some movement towards building something using Mozilla source as a starting point, but that seems like overkill to me, as well as a maintenance nightmare if we try to keep it up to date.
Suggestions?
An article on Digital Inspiration points towards CutyCapt, which is cross-platform and uses the WebKit rendering engine, as well as IECapt, which uses the current IE rendering engine and requires Windows, natch. Nothing that uses Gecko, Firefox's rendering engine, comes to mind off the top of my head.
I doubt you're going to be able to get away from X, however. Since CutyCapt requires Qt, it needs either X or a Windows installation. Similarly, IECapt requires Windows (or Wine if you want to try running it under Linux, and then you're back to needing X). I doubt you'll find a rendering engine that doesn't require Qt, GTK, GDI, or Cocoa, so you'll need a full install of display libraries either way.
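If CutyCapt ends up being the tool, the spider can just shell out to it per page. A rough Python sketch, assuming CutyCapt is on the PATH with its --url/--out options (check cutycapt --help for the exact flags) and using xvfb-run to supply a throwaway virtual X display on the headless servers (you still need the X libraries installed, just not a real display):
import subprocess

def snapshot(url, out_path):
    # xvfb-run spins up a virtual X server so no physical display is needed
    subprocess.run(
        ["xvfb-run", "--auto-servernum",
         "CutyCapt", f"--url={url}", f"--out={out_path}"],
        check=True, timeout=120)

snapshot("http://example.com/", "example.png")  # placeholder page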
Why not store the HTML that is sent out to the client? You could then use that to redisplay it in a web browser as a page, to show what it looked like.
Using your web analytics data about user actions, you could then use that to default the combo boxes, fields, etc. to the values the client would have had, and even change the CSS on buttons, etc., to mark them as pressed.
As a benefit, you don't need the X stack, and you don't need to do any crawling or storing of images.
EDIT (Re Andrew Moore):
This is where you store the current CSS/images under a version number. Place an easily parsable version number in a comment in the HTML. If you change your CSS/images but keep the existing names, increment the version number in the HTML output you send out.
The system that stores the HTML will know that it needs to grab a new copy and store it under a new number. When redisplaying, it simply uses the version number to determine which CSS/image set to use.
We currently have a very similar system here so we can track user actions and provide better support when users call our help desk, as support staff can bring up the user's session and follow what they did, even somewhat live.
You can even code it to auto-censor sensitive fields when the page is stored.
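As a tiny illustration of the versioning idea, assuming the version is embedded as an HTML comment and each asset set lives in a directory named after it (the comment format and paths here are made up):
import re

def asset_dir_for(stored_html):
    # the stored HTML embeds something like <!-- assets-version: 12 -->
    match = re.search(r"<!--\s*assets-version:\s*(\d+)\s*-->", stored_html)
    if match is None:
        raise ValueError("no version comment found")
    return f"/static/versions/{match.group(1)}/"  # hypothetical layout

html = "<html><!-- assets-version: 12 --><body>...</body></html>"
print(asset_dir_for(html))  # -> /static/versions/12/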
Depending on the specifics of your needs, perhaps you could get away with using one of the many free webpage thumbnail services? SnapCasa, for example, lets you generate thousands per month, with no charge and no advertising (never used it; I just googled 'free thumbnail service' to find it).
Just a thought.