I'm looking for a way to allow a user to upload a large file (~1 GB) to my Unix server using a web page and browser.
There are a lot of examples that illustrate how to do this with a traditional POST request, but that doesn't seem like a good idea when the file is this large.
I'm looking for recommendations on the best approach.
Bonus points if the method includes a way of providing progress information to the user.
For now security is not a major concern, as most users who will be using the service can be trusted. We can also assume that the connection between client and host will not be interrupted (or if it is they have to start over).
We can also assume the user is running a browser supporting most modern features (JavaScript, Flash, etc.).
Edit:
No language requirements. Just looking for the best solution.
There are several ways to handle this:
1. Flash Uploader
There are plenty of Flash uploaders that improve the user's GUI, letting them watch the upload and see factors such as time left, KB done, etc.
This is a good option if you are comfortable modifying the Flash source code for later development.
2. Ajax
There are a few ways to do this with Ajax and PHP. Although PHP does not support upload progress natively, you can use the PECL uploadprogress extension to accomplish it: http://pecl.php.net/package/uploadprogress. This is only needed if you wish to show percentage information, etc.
3. Basic JavaScript
This method would be just the regular form, but with some Ajax styling so that when the form is submitted you can show a basic loader saying "please wait while you send us the file..."
If you're using ASP.NET, you can take a look at: http://neatupload.codeplex.com/
Hope there's some good information here to get you on your way.
Regards
Not sure about your language requirements, but you can look into, e.g.,
http://pypi.python.org/pypi/gp.fileupload/
It supports progress information too, by the way.
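To give a sense of what such tools do under the hood, here is a minimal sketch (not gp.fileupload's actual API) of a WSGI app that streams a large upload to disk in small chunks and records how many bytes have arrived, so a separate AJAX request could poll for progress. It assumes the file is sent as the raw request body with an upload id in the query string, a convention invented purely for the example:

    # Rough sketch only: stream the request body to disk in chunks and track
    # bytes received so a progress endpoint can report on it. This is NOT the
    # gp.fileupload API; the upload-id-in-query-string convention is invented.
    import os

    PROGRESS = {}  # upload id -> (bytes_received, total_bytes)

    def upload_app(environ, start_response):
        upload_id = environ.get('QUERY_STRING') or 'anon'
        total = int(environ.get('CONTENT_LENGTH') or 0)
        received = 0
        with open(os.path.join('/tmp', 'upload-' + upload_id), 'wb') as out:
            while received < total:
                chunk = environ['wsgi.input'].read(min(64 * 1024, total - received))
                if not chunk:
                    break
                out.write(chunk)
                received += len(chunk)
                PROGRESS[upload_id] = (received, total)  # read by a progress endpoint
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'done']

    if __name__ == '__main__':
        from wsgiref.simple_server import make_server
        make_server('', 8080, upload_app).serve_forever()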
I have used the dojo FileUploader widget to reliably upload audio files greater than a gigabyte with a progress bar. Though you said security was not an issue, I'd like to add that I got HTTPS uploads with cookie-based authentication hooked up flawlessly.
See: http://www.sitepen.com/blog/2008/09/02/the-dojo-toolkit-multi-file-uploader/ and
http://api.dojotoolkit.org/jsdoc/1.3/dojox.form.FileUploader
This all goes back to some of my original questions of trying to "index" a webpage. I was originally trying to do it specifically in java but now I'm opening it up to any language.
Previously I tried using HtmlUnit and other methods in Java to get the information I needed, but I wasn't successful.
The information I need to get from a webpage I can very easily find with Firebug, and I was wondering if there was any way to duplicate what Firebug does, specifically for my needs. When I open up Firebug I go to the Net tab, then to the XHR tab, and it shows a constantly updating list of requests with the information the server is updating. When I click on a request and look at the response, it has the information I need, and this is all without ever refreshing the webpage, which is what I am trying to do (not to mention the variables it outputs do not show up in the HTML of the webpage).
So can anyone point me in the right direction of how they would go about this?
(I will be putting this information into a MySQL database, which is why I added it as a tag; I still don't know what language would be best to use though.)
Edit: These requests on the server are somewhat random, and although Firebug shows the URL they come from, when I try to visit the URL in Firefox it tries to open something called application/json.
Jon, I am fairly certain that you are confusing several technologies here, and the simple answer is that it doesn't work like that. Firebug works specifically because it runs as part of the browser, and (as far as I am aware) runs under a more permissive set of instructions than a JavaScript script embedded in a page.
JavaScript is, for the record, different from Java.
If you are trying to log AJAX calls, your best bet is for the server-side application to log the invoking IP, user agent, cookies, and complete URI to your database on receipt. It will be far better than any client-side solution.
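As a concrete illustration of that server-side logging, here is a sketch only, assuming a Flask application; SQLite is used just to keep it self-contained (swap in your MySQL driver), and the table and route names are made up:

    # Log every incoming request's IP, user agent, cookies and URI before the
    # normal handler runs.
    import sqlite3
    from flask import Flask, request

    app = Flask(__name__)
    db = sqlite3.connect('requests.db', check_same_thread=False)
    db.execute('CREATE TABLE IF NOT EXISTS ajax_log '
               '(ip TEXT, user_agent TEXT, cookies TEXT, uri TEXT)')

    @app.before_request
    def log_request():
        db.execute('INSERT INTO ajax_log VALUES (?, ?, ?, ?)',
                   (request.remote_addr,
                    request.headers.get('User-Agent', ''),
                    str(dict(request.cookies)),
                    request.url))
        db.commit()

    @app.route('/data')
    def data():
        return '{"value": 42}'  # placeholder for whatever the endpoint normally returns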
On a note more related to your question, it is not good practice to assume that everyone has read other questions you have posted. Generally speaking, "we" have not. "We" is in quotes because, well, you know. :) It also wouldn't hurt for you to go back and accept a few answers to questions you've asked.
So, the problem is:
With someone else's web-page, hosted on someone else's server, you want to extract select information?
Using cURL, Python, Java, etc. is too painful because the data is continually updating via AJAX (requires a JS interpreter)?
Plain jQuery or iFrame intercepts will not work because of XSS security.
Ditto, a bookmarklet -- which has the added disadvantage of needing to be manually triggered every time.
If that's all correct, then there are 3 other approaches:
Develop a browser plugin... More difficult, but has the power to do everything in one package.
Develop a userscript. This is much easier to do and technologies such as Greasemonkey deal with the XSS problem.
Use a browser macro technology such as Chickenfoot. These all have plusses and minuses -- which I won't get into.
Using Greasemonkey:
Depending on the site, this can be quite easy. The big drawback, if you want to record data, is that you need your own web-server and web-application. But this server can be locally hosted on an XAMPP stack, or whatever web-application technology you're comfortable with.
Sample code that intercepts a page's AJAX data is at: Using Greasemonkey and jQuery to intercept JSON/AJAX data from a page, and process it.
Note that if the target page does NOT use jQuery, the library in use (if any) usually has similar intercept capabilities. Or, listening for DOMSubtreeModified always works, too.
If you're using a library such as jQuery, you may have an option such as the jQuery ajaxSend and ajaxComplete callbacks. These could post requests to your server to log these events (being careful not to end up in an infinite loop).
I am building an online game, similar to a Flash online game, but using HTML only (HTML5).
I would like to prevent people from copying it and putting the game on their site.
With Flash, I used to do this by adding a URL check to make sure it was running on my site, but this seems useless if I can only put the check in JavaScript.
Is there anything that can prevent someone from simply copying the game?
As a follow-up:
Can I protect the source at all (aside from obfuscating the JavaScript)?
Thanks!
Unfortunately, if the game engine is 100% JavaScript, the most practical way to do this would be to obfuscate your code. Here's a link to a site that does just that.
However, by using Ajax, which allows your client-side script to communicate with your server, you can keep the majority of your game's logic and functionality on the server, written in any server-side language such as PHP.
Basically, your PHP (or other language) files on your site would get requests from the user's machine whenever logic is needed for the game to function, make the decision, and respond, all through Ajax. Then someone could only really copy half of your game: the part stored in JavaScript.
The one downside with this method is that it may slow down the user's game drastically due to having to communicate with the server constantly.
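For illustration, here is a rough sketch of that idea in Python rather than the PHP the answer mentions; the route, field names and damage numbers are placeholders. The client sends its move via Ajax and only ever sees the outcome, so the rules never ship to the browser:

    # The damage rules live only on the server, so a copied client is useless
    # without this endpoint. Purely illustrative names and numbers.
    import random
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/attack', methods=['POST'])
    def attack():
        weapon = request.form.get('weapon', 'fists')
        damage = {'fists': 1, 'sword': 5, 'bow': 3}.get(weapon, 0)
        hit = random.random() < 0.8  # hit chance decided server-side
        return jsonify(hit=hit, damage=damage if hit else 0)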
Hope this helped and made sense!
There's no way to do this; it's the nature of the web. Your best bet is obfuscating your code.
I really need your help with this. We are planning on developing a real-time web application. We have looked at different libraries and concepts and are a little confused.
What we need is: clients connect to the website and send data (usually an integer + client machine name) whenever they want (usually every 1-5 seconds). Also, the same clients must receive data (the data received from other clients) from the server in real time (maximum 0.5 seconds). This data must also be stored in the database.
We were thinking about using different technologies, but cannot decide which one to use.
We need this web application to be supported on iPhones and Android phones (maybe BlackBerry),
and, of course, desktop browsers.
Polling doesn't seem like a very good idea in this situation, due to the high load.
HTML5 WebSockets are kind of new, and probably not supported by all browsers.
Has anyone used Node.js?
Or Twisted Matrix: http://twistedmatrix.com/trac/?
Or Orbited (cannot post more than one link)?
Or Tornado?
Or XMPP (Jabber; I did not find good examples)?
Or something else?
What technology is the best to use in this type of project? Also, we would probably prefer the technology that has some community support and free to use.
Thanks a lot!
There are a lot of things to consider here. I would say that HTML5 is not an option, simply due to support across platforms.
Running with Node.js is most likely possible, but the communication methods are really complicated. Pushing data to a page isn't really something that HTML/web apps are designed to do....
To get a valid answer you are going to need to get someone to come in and sit with you to really iron out details and implementation.
When you say that clients "connect to a website", do you really need it to be a website? It sounds like all the client is sending is a number and for that you don't need a website. Just pick the language of your choice, open up a socket, and go from there.
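For example, a bare-bones sketch of that approach: a TCP server that accepts a line like "machine-name:42" from each client and relays it to every other connected client. The line-based protocol is invented for the example, and there is no persistence or error handling:

    # Minimal relay server: each client sends "name:value" lines, and every
    # line is forwarded to all other connected clients.
    import socket
    import threading

    clients = []
    lock = threading.Lock()

    def handle(conn):
        with lock:
            clients.append(conn)
        try:
            for line in conn.makefile():
                with lock:
                    for other in clients:
                        if other is not conn:
                            other.sendall(line.encode())
        finally:
            with lock:
                clients.remove(conn)
            conn.close()

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('', 9000))
    server.listen(5)
    while True:
        conn, _addr = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()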
Are you streaming data to be visualized? You might want to take a look at graphite (and/or "pyped" which is part of graphite).
What kind of data? What is the purpose?
For real-time you're not going to get a web site unless you use some type of RIA, but even then, it isn't going to be enough. Services aren't going to be good enough either. You're going to end up doing some type of polling, which will only ever be pseudo-real-time unless you do duplex mode, which won't be supported on most of the platforms you want to support.
Sockets are the way to go, but that requires a client for each platform you want to handle. Maybe you should rethink your requirements.
I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl+bash is also possible, but it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try Perl's WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org
c) Inspect the server header (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress, etc). See the sketch after this list.
d) Use the Google XML API to ask something like "link:sitedomain.com" to find links pointing to the site; again, you will find code examples for Python on the Google developer pages. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in an SQLite db, then post-process it in Excel.
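A small sketch covering steps b, c and e, using only the Python standard library (the table layout and example URL are arbitrary):

    # b) fetch the page, c) read the Server header via HEAD, e) store in SQLite.
    import sqlite3
    import urllib.request

    url = 'http://example.com/'

    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

    head = urllib.request.urlopen(urllib.request.Request(url, method='HEAD'))
    server = head.headers.get('Server', '')
    powered_by = head.headers.get('X-Powered-By', '')  # often hints at the CMS/stack

    db = sqlite3.connect('sites.db')
    db.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, server TEXT, powered_by TEXT, html TEXT)')
    db.execute('INSERT INTO pages VALUES (?, ?, ?, ?)', (url, server, powered_by, html))
    db.commit()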
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, and PHP are certainly more powerful for your tasks than JavaScript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object", etc.
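As a minimal sketch of that approach with the Python standard library (a real tool would also resolve relative URLs and crawl further, but this shows the basic tag scan; the URL is just an example):

    # Fetch a page and collect img/object sources and link targets.
    import urllib.request
    from html.parser import HTMLParser

    class AssetCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.assets = []
            self.links = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'img' and 'src' in attrs:
                self.assets.append(attrs['src'])
            elif tag == 'object' and 'data' in attrs:
                self.assets.append(attrs['data'])
            elif tag == 'a' and 'href' in attrs:
                self.links.append(attrs['href'])  # useful later for tracing menus

    page = urllib.request.urlopen('http://example.com/').read().decode('utf-8', errors='replace')
    collector = AssetCollector()
    collector.feed(page)
    print(collector.assets, collector.links)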
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow has been developed on top of Firebug and provides some of the features you're looking for (e.g. image, CSS and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need and you could then build your tool as a website or such where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox addon).
I think you should probably try and stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.
Also, I have used option #3 (grab the site on the server side), and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that also; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated JavaScript from the grabbed content works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic, AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure if this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc..
[OPTIONAL] save the content as a local page on your server.
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
Note: If saving a local version, you will want to convert relative link paths (for images especially) so they resolve correctly.
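If you happen to do that post-processing server-side in Python, urljoin from the standard library does the path resolution for you and is usually less fragile than hand-rolled regular expressions (the URLs below are only examples):

    # Resolve relative src/href values against the page's own URL.
    from urllib.parse import urljoin

    page_url = 'http://example.com/articles/today.html'
    print(urljoin(page_url, 'images/chart.png'))  # http://example.com/articles/images/chart.png
    print(urljoin(page_url, '/logo.png'))         # http://example.com/logo.png
    print(urljoin(page_url, '../index.html'))     # http://example.com/index.html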
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc...
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I’m right in the middle of a similar project. It has to analyze the DOM of a page generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.
The Fx addon is great if you don’t need the automation. A script can analyze the page, show you the results, ask you to correct the parts that it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don’t need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here, you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is — you don’t have access to all the DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)… It’s a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create HTMLLoader object, and feed the response into it (loadString method IIRC)
Create an iframe, and load the site in untrusted sandbox.
I don’t remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn’t work around). And I had to create a sandbox to access the site’s DOM. Here’s a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS that creates childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents are something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(be careful — there are limitations of what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I’ve done this using Apache + PHP, but it could be done with a real proxy with some plugins/custom modules for sure. This way I had access to the DOM of any site.
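To make the proxy idea concrete, here is a rough sketch in Python rather than Apache + PHP: it fetches whatever page the browser asks for and, if the response looks like HTML, splices a script tag for the bridge script in before the closing body tag. The port and bridge URL are placeholders:

    # Tiny injecting proxy: configure the browser to use localhost:8080 as its
    # HTTP proxy; HTML responses get the sandbox-bridge script appended.
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BRIDGE_TAG = b'<script src="http://localhost:8000/bridge.js"></script>'

    class InjectingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            # when used as a proxy, self.path carries the full target URL
            upstream = urllib.request.urlopen(self.path)
            body = upstream.read()
            content_type = upstream.headers.get('Content-Type', '')
            if 'html' in content_type:
                body = body.replace(b'</body>', BRIDGE_TAG + b'</body>')
            self.send_response(200)
            self.send_header('Content-Type', content_type or 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 8080), InjectingProxy).serve_forever()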
end of edit.
The third way I know of, the hardest way — set up an environment similar to those on Browsershots. Then you’re using Firefox with automation. If you have Mac OS X on a server, you could play with AppleScript to do the automation for you.
So, to sum up:
PHP/server-side script — you have to implement your own browser, JS engine, CSS parser, etc., etc. On the other hand, it is fully under your control and automated.
Firefox Addon — has access to the DOM and all the stuff. Requires a user to operate it (or at least an open Firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR — requires a working desktop computer; more difficult than creating a Fx addon, but more powerful.
Automated browser — more of a desktop programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)
Being primarily a .Net programmer these days, my advice would be to use C# or some other language with .Net bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might be interested in including in your stats could include backlinks / pagerank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links link to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
or whatever language you are most comfortable with. A basic standalone script/app keeps you from needing to worry too much about browser integration and security issues.
File uploads through web pages using the standard HTML input always seem clunky to me. If the user tries to upload a large file, it can go on forever and they get no cue that the file is actually being uploaded.
I have tried things like providing an animated GIF progress bar, but it doesn't give the user any indication of how much has been uploaded. I have even tried to do a progress bar with AJAX, but those were always ugly and never seemed to work right.
This has been an issue with many of my clients, and often I'm asked if there is a better way. Sometimes I'll just provide them an FTP site so they can upload it there, but that's not a practical solution either.
What do you think the best way to handle HTTP file uploads from HTML is? What are some good ideas / examples you have seen around the internet?
There are several techniques for asynchronous file transfer with a progress bar over HTTP, most of which involve either Flash or XMLHttpRequest.
There are a number of client side controls that one can use.
You can
Build your own ActiveX control. Windows/IE only
Use Flash to queue up files and upload them one at a time to the server using the standard file upload protocol.
Use a signed java applet to upload.
Write a browser plugin.
Some random links from google:
http://www.element-it.com/MultiPowUpload.aspx
http://www.codeproject.com/KB/aspnet/FlashUpload.aspx
http://www.dmxzone.com/forum/go/?36564
I'll add swfupload to this. It's an open-source Flash uploader that can degrade gracefully if the user doesn't have Flash.
There's really only the one mechanism for uploading via a browser. You can, however, dress it up and make it more user friendly by providing a progress bar to show that the upload is progressing and at what speed.
This is typically done by targeting the upload form at a hidden iframe and using AJAX calls to find out how much of the file has reached the server.
Here's one example of this:
Megaupload
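The server-side half of that pattern can be quite small. Here is a sketch (Flask, with invented route and names) of the progress endpoint the page would poll; it assumes the upload handler, not shown, updates the shared PROGRESS dict as it reads the request body:

    # Poll this by upload id from the page to drive the progress bar.
    from flask import Flask, jsonify

    app = Flask(__name__)
    PROGRESS = {}  # upload id -> {'received': bytes_so_far, 'total': content_length}

    @app.route('/upload-progress/<upload_id>')
    def upload_progress(upload_id):
        info = PROGRESS.get(upload_id, {'received': 0, 'total': 0})
        percent = 100 * info['received'] // info['total'] if info['total'] else 0
        return jsonify(received=info['received'], total=info['total'], percent=percent)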
If you're running mod_perl2 on Apache, there is the Apache2::UploadProgress module. This adds an ID to the HTTP upload request; you then query the server for the progress of that upload. It has built-in support for creating an AJAX progress bar in a popup window or within the page doing the upload. If you want to build your own progress display, you can get the info back as XML or JSON data.
The YUI Uploader utility uses a Flash-based uploader, is well documented, and has several examples for you to try. I've used it on several projects, and would recommend it.
I use this one for a fairly simple and complete tool. The base sourcecode is good and you can easily customize it if necessary.
AJAX File Upload
Interesting that no one has mentioned the NeatUpload component by Dean Brettle; it has lots of interesting features and runs on Mono, too.