How were HTML forms interpreted in the early 90s?

In the modern web an HTML <form> element is submitted and then interpreted by scripting. Either it is interpreted by a server-side programming language (usually PHP) or it is interpreted by a client-side script (almost always JavaScript).
Forms existed even in the early 90s. How were they interpreted back then?
According to this Wikipedia article there was an email-based HTML form submission back then, but it was unreliable. Was this all there was? Why did HTML even have forms if they were so useless without scripting? Or was it a chicken-and-egg sort of situation?

Before server-side scripting (PHP, Ruby, Node.js) there was server-side programming.
One of the original interfaces between web servers and back-end processes was the Common Gateway Interface (CGI). It was introduced in the early 90s by the NCSA back-end team, at the same time forms were introduced into HTML by Tim Berners-Lee (who was at CERN at the time). So forms were introduced at roughly the same time CGI was invented.
Initially a lot of people wrote CGI programs in C. I was one of those who had to do so as a homework assignment. Instead of a giant all-encompassing framework, we wrote small C programs that read from stdin and printed to stdout (we printed the response headers as well, not just the HTML, as the CGI spec requires). A website had lots of these small programs, each doing one small thing and updating some database (sometimes that database was just a flat file).
Almost as soon as CGI was introduced, people also started writing CGI scripts in Perl, so there was really no transition period between C programs and scripting languages. People simply stopped writing CGI scripts in C because it was faster to do so in a scripting language.
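For illustration, here is a minimal sketch of that stdin/stdout contract, written in Python rather than C purely for brevity (the form field "name" is a made-up example): the web server passes the POSTed form body on stdin along with environment variables such as CONTENT_LENGTH, and everything printed to stdout (headers, a blank line, then the body) goes back to the browser.

#!/usr/bin/env python3
# Minimal CGI sketch (hypothetical): read the submitted form data from stdin,
# write a Content-Type header, a blank line, and an HTML body to stdout.
import os
import sys
from urllib.parse import parse_qs

length = int(os.environ.get("CONTENT_LENGTH") or 0)
form = parse_qs(sys.stdin.read(length))
name = form.get("name", ["world"])[0]   # "name" is a made-up form field

sys.stdout.write("Content-Type: text/html\r\n\r\n")
sys.stdout.write("<html><body><p>Hello, %s!</p></body></html>\n" % name)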

The server side was actually always in the picture.
The Apache HTTP Server has been available since 1995, and by 1996 it also had Perl support (which was used as a server-side programming language).
JavaScript was created in 1995, and Netscape was the first browser to support the client-side language (other browser vendors' implementations were based on the work that was done at Netscape).
In 1993 the Mosaic browser was released with support for images, nested lists and fill-out forms.
Basically, every application that an HTTP server can hand a request to (no matter what language that application is written in) is a server-side application. It can be written in a scripting language (Perl/Python/PHP/Ruby), a high-level language (Java/C#), and, if you really want, even assembly. All you need to do is make sure you "follow the protocol".

JavaScript wasn't so advanced (hell, Ajax wasn't even out yet), so it was pure server side: mostly CGI (usually Perl) and PHP.
There was also ColdFusion, but it was never a popular favorite.
Eventually, at the end of 1999 and in the early 2000s, ASP.NET (.aspx) and JavaServer Pages (.jsp) came out, and a lot of commercial sites used aspx and jsp for obvious reasons.
Note that Java applets also existed (mostly for rendering, though), but they had to be downloaded separately and supported by the browser.

Additionally, I stumbled on an interesting piece of history on Wikipedia. HTML forms could also be sent by e-mail, using a mailto: address in the action attribute. It didn't seem to be popular, but still cool!
Quoting the Wikipedia article:
User-agent support for email based HTML form submission, using a 'mailto' URL as the form action, was proposed in RFC 1867 section 5.6, during the HTML 3.2 era. Various web browsers implemented it by invoking a separate email program or using their own rudimentary SMTP capabilities. Although sometimes unreliable, it was briefly popular as a simple way to transmit form data without involving a web server or CGI scripts.
And RFC 1867 (November 1995):
5.6 Allow form ACTION to be "mailto:"
Independent of this proposal, it would be very useful for HTML interpreting user agents to allow a ACTION in a form to be a "mailto:" URL. This seems like a good idea, with or without this proposal. Similarly, the ACTION for a HTML form which is received via mail should probably default to the "reply-to:" of the message.
These two proposals would allow HTML forms to be served via HTTP servers but sent back via mail, or, alternatively, allow HTML forms to be sent by mail, filled out by HTML-aware mail recipients, and the results mailed back.

Related

Will HTML5 completely remove the use of backend languages?

I keep hearing about HTML5 applications. When they say that, do they mean applications that are built using HTML5, CSS3, and Javascript only and without the help of MySQL, PHP and other backend tools?
Most web applications are three-tier, which means the heart of the application is the business logic. I don't think any business will be ready to write its business logic in HTML, for security reasons, and many more concerns such as code reliability and maintainability come into the picture. So no one can remove the use of back-end languages.
"HTML 5 Application" is whishy-washy marketing speak. It might mean something that stands alone (probably using non-HTML 5 APIs such as web storage), it might mean something that just uses a little bit of HTML 5.
Certainly you still need a server if any data needs to travel beyond the browser (e.g. to share data with other people / devices, back data up on the cloud, get news feeds, etc).

Importing dynamic content to a HTML page without JS or server-side scripting

A client requires my company to create a web-based learning resource to distribute to a large number of users. As such, they have some strict standards to ensure that everyone is able to access it (so it must conform to WCAG 2.0 and their own internal requirements). Since there is a great deal of content, I'd like to set up some kind of system that will store the data externally and load it into the page dynamically. That way, if I have to change something like a menu item name, I won't have to change it a thousand times.
I can't use server-side languages since this resource will be distributed on CD as well as the internet and I can't use JavaScript since a requirement is that the "resource must be operable with JavaScript disabled".
Does this leave me with any options or am I essentially stuck hard-coding every page in static HTML? All assistance is appreciated.
Well, I'd push back on the 'no javascript' requirement. Traditionally, requiring JS was considered an accessibility problem. However, we've progressed a long way and we're even building accessibility standards for JS (look up the ARIA work).
That said...
If this has to be put on CD (which, in and of itself, seems to indicate this client is woefully out of date), then I think your best bet is to put all of the automation on the 'compile' side.
One way to do this would be to build a standard site with any server-side technology you prefer, launch it, then use a web site archiver/downloader/spider to grab the rendered HTML from the site for distribution offline.
There are also many CMS products that do that...the CMS spits out static HTML that is then published to the server.
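One way to sketch that 'compile' side idea (a rough illustration only; the menu.html fragment, the pages/ directory, and the {{MENU}} placeholder are made-up names) is a small build script that stamps shared fragments into every page and writes out plain static HTML for the CD:

# Hypothetical build step: inject a shared menu fragment into each page
# template so the distributed copies are plain static HTML.
from pathlib import Path

menu = Path("menu.html").read_text(encoding="utf-8")   # shared menu fragment
out_dir = Path("dist")
out_dir.mkdir(exist_ok=True)

for template in Path("pages").glob("*.html"):
    html = template.read_text(encoding="utf-8").replace("{{MENU}}", menu)
    (out_dir / template.name).write_text(html, encoding="utf-8")

Change menu.html once, re-run the script, and every generated page picks up the new menu.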

How can you access the info on a website via a program?

Suppose I want to write a program to read movie info from IMDb, or music info from last.fm, or weather info from weather.com, etc. Just reading the webpage and parsing it is quite tedious. Often websites have an XML feed (such as last.fm) set up exactly for this.
Is there a particular link/standard that websites follow for this feed? For example, robots.txt is such a standard; is there a similar standard for information feeds, or does each website have its own?
This is the kind of problem RSS or Atom feeds were designed for, so look for a link for an RSS feed if there is one. They're both designed to be simple to parse too. That's normally on sites that have regularly updated content though, like news or blogs. If you're lucky, they'll provide many different RSS feeds for different aspects of the site (the way Stackoverflow does for questions, for instance)
Otherwise, the site may have an API you can use to get the data (like Facebook, Twitter, Google services etc). Failing that, you'll have to resort to screen-scraping and the possible copyright and legal implications that are involved with that.
Websites provide different ways to access this data, like web services, feeds, and endpoints for querying their data.
There are also programs used to collect data from pages without using such standard techniques. These programs are called bots. They use different techniques to get data from websites (note: be careful, the data may be copyright protected).
The most common such standards are RSS and the related Atom. Both are formats for XML syndication of web content. Most software libraries include components for parsing these formats, as they are widespread.
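As a rough illustration of how little code a feed takes to parse, here is a sketch using Python's standard library (the feed URL is a placeholder, and it assumes a plain RSS 2.0 feed with channel/item/title/link elements):

# Fetch an RSS feed and print each item's title and link.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

with urlopen("https://example.com/feed.rss") as response:   # placeholder URL
    root = ET.parse(response).getroot()

for item in root.iter("item"):            # RSS 2.0 <channel><item> entries
    print(item.findtext("title", default="(no title)"),
          item.findtext("link", default=""))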
Yes: the RSS standard, and the XML standard.
Sounds to me like you're referring to RSS or Atom feeds. These are specified for a given page in the source; for instance, open the source html for this very page and go to line 22.
Both Atom and RSS are standards. They are both XML based, and there are many parsers for each.
You mentioned screen scraping as the "tedious" option; it is also normally against the terms of service for the website. Doing this may get you blocked. Feed reading is by definition allowed.
There are a number of standards websites use for this, depending on what they are doing, and what they want to do.
RSS is a format for sending out formatted chunks of data in machine-parsable form. It stands for "Really Simple Syndication" and is usually used for news feeds, blogs, and other things where there is new content on a periodic or sporadic basis. There are dozens of RSS readers which allow one to subscribe to multiple RSS sources and periodically check them for new data. It is intended to be lightweight.
AJAX is a technique for sending commands from web pages to the web server and getting results back in a machine-parsable form. It is designed to work with JavaScript on the web client. There is no single standard for what the requests and replies look like, so it tends to be up to the developers to document what commands are available via AJAX.
SOAP is another approach along these lines, but its uses tend to be more program-to-program, rather than from web client to server. SOAP allows for auto-discovery of what commands are available by use of a machine-readable file in WSDL format, which essentially specifies in XML the method signatures and types used by a particular SOAP interface.
Not all sites use RSS, AJAX, or SOAP. Last.fm, one of the examples you listed, does not seem to support RSS and uses its own web-based API for getting information from the site. In those cases, you have to find out what their API is (Last.fm appears to be well documented, however).
Choosing the method of obtaining data depends on the application. If it's a public/commercial application, screen scraping won't be an option. (E.g. if you want to use IMDb information commercially, then you will need to make a contract, paying them $15,000 or more according to their website's usage policy.)
I think your problem isn't that you don't know the standard procedure for obtaining website information, but rather that your inability to obtain data is due to websites not wanting to provide that data.
If a website wants you to be able to use their information, then there will almost certainly be a well documented api interface with various standard protocols for queries.
A list of APIs can be found here.
Data formats listed at that particular site are: CSV, GeoRSS, HTML, JSON, KML, OPML, OpenSearch, PHP, RDF, RSS, Text, XML, XSPF, and YAML.

Basic client/server programming

I am new to web programming... I have been asked to create a simple Internet search application which would transmit to the browser some data stored remotely on the server.
Considering the client/server architecture (which I am new to), I would like to know whether the "client" is represented only by the Internet browser, and therefore whether the entire code of the web application should be stored on the server. As it's a very generic question, a generic answer is also acceptable.
As you note, this is a very generic and broad question. You'd be well-served by more complete requirements. Regardless:
Client/server architecture generally means that some work is done by the client, and some by the server. The client may be a custom application (such as iTunes or Outlook), or it might be a web browser. Even if it's a web browser, you typically still have some code executing client-side, Javascript usually, to do things like field validation (are all fields filled out?).
Much of the code, as you note, will be running on the server, and some of this may duplicate your client-side code. Validation, for instance, should be performed on the client-side, to improve performance (no need to communicate with the server to determine if the password meets length requirements), but should be performed on the server as well, since client-side code is easily bypassed.
Either you can put all the code on the server, and have it generate HTML to send back to the browser. Or you can include JavaScript in the HTML pages, so some of the logic runs inside the browser. Many web applications mix the two techniques.
You can do this with all the code stored on the server.
1) The user will navigate to a page on your web server using a URL you provide.
2) When the web server gets the request for that page, instead of just returning a standard HTML file, it will run your code, perhaps some PHP, which inserts the server information, perhaps from a database, into an HTML template.
3) The resulting, fully complete HTML file is sent to the client. To the client's browser, it looks like any other HTML page.
For an example of PHP that dynamically inserts information into HTML, see the following (this won't be exactly what you will do, but it will give you an idea of how PHP can work):
code:
http://www.php-scripts.com/php_diary/example1.phps
see the result (refresh a few times to see it in action):
http://www.php-scripts.com/php_diary/example1.php3
You can see from this that the "code file" looks just like a normal HTML file, except that what is between the angle brackets is actually PHP code. In this case it puts the time into the position it occupies in the HTML file; in your case you would write code to pull the data you want into the file.
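To show the same idea without PHP, here is a comparable sketch in Python (a hypothetical CGI-style script; the template text is made up): the placeholder in the HTML is filled in on the server before the page is sent, so the browser only ever sees finished HTML.

#!/usr/bin/env python3
# Hypothetical CGI-style script: insert the current server time into an
# HTML template, then print the headers and the finished page.
import datetime

TEMPLATE = """<html><body>
<p>The time on the server is {time}.</p>
</body></html>"""

now = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Content-Type: text/html")
print()                                  # blank line ends the headers
print(TEMPLATE.format(time=now))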

What are the pros and cons of various ways of analyzing websites?

I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl+bash is also possible, but it's bad because it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try the Perl WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org.
c) Inspect the server header (via the HTTP HEAD command) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress); see the sketch after this list.
d) Use the Google XML API to ask something like "link:sitedomain.com" to find links pointing to the site; again, you will find code examples for Python on Google's pages. Asking Google for the domain ranking can also be helpful.
e) You can collect the data in an SQLite DB, then post-process it in Excel.
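As a small sketch of point (c), here is one way to read the response headers with Python's standard library (the URL is a placeholder, and the Server / X-Powered-By / X-Generator headers are only sometimes present):

# Send an HTTP HEAD request and print headers that often reveal the stack.
from urllib.request import Request, urlopen

req = Request("https://example.com/", method="HEAD")   # placeholder URL
with urlopen(req) as response:
    for header in ("Server", "X-Powered-By", "X-Generator"):
        print(header + ":", response.headers.get(header))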
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, PHP are certainly more powerful for your tasks than Javascript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object" etc.
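A bare-bones sketch of that fetch-and-parse approach, using only Python's standard library (the URL is a placeholder), counting the <a href> and <img src> tags on a page:

# Fetch a page and tally its links and images by scanning the raw HTML.
from html.parser import HTMLParser
from urllib.request import urlopen

class AssetCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.images = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])

with urlopen("https://example.com/") as response:       # placeholder URL
    html = response.read().decode("utf-8", errors="replace")

counter = AssetCounter()
counter.feed(html)
print(len(counter.links), "links,", len(counter.images), "images")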
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow has been developed on top of Firebug, and it provides some of the features you're looking for (e.g. image, CSS and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need and you could then build your tool as a website or such where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a firefox addon).
I think you should probably try and stay away from Option #1 (the Iframe) because of the cross-site scripting issues you already are aware of.
Also, I have used Option #3 (grab the site on the server side) and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that also; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the ajax issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated javascript from the content being grabbed works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure if this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc.
[OPTIONAL] save the content as a local page on your server.
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
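As a sketch of that clean-up step: instead of hand-written regular expressions, Python's standard urllib.parse.urljoin resolves relative paths against the page's address (the URL and paths below are made up):

# Resolve relative asset paths against the address of the page they came from.
from urllib.parse import urljoin

page_url = "https://example.com/articles/page.html"           # made-up address
print(urljoin(page_url, "images/logo.png"))   # -> https://example.com/articles/images/logo.png
print(urljoin(page_url, "/css/site.css"))     # -> https://example.com/css/site.css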
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc.
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I'm right in the middle of a similar project. It has to analyze the DOM of a page generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox addons, userscripts, etc.
A Fx addon is great if you don't need the automation. A script can analyze the page, show you the results, ask you to correct the parts that it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don't need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here, you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don't have access to all of the DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)… It's a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create an HTMLLoader object, and feed the response into it (the loadString method, IIRC)
Create an iframe, and load the site in untrusted sandbox.
I don't remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn't work around). And I had to create a sandbox to access the site's DOM. Here's a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS that creates childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents are something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(be careful — there are limitations of what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I've done this using Apache + PHP, but it could surely be done with a real proxy with some plugins/custom modules. This way I had access to the DOM of any site.
end of edit.
The third way I know of, and the hardest way, is to set up an environment similar to the one on browsershots. Then you're using Firefox with automation. If you have Mac OS X on a server, you could play with ActionScript to do the automation for you.
So, to sum up:
PHP/server-side script — you have to implement your own browser, JS engine, CSS parser, etc. On the other hand, it is fully under your control and automated.
Firefox addon — has access to the DOM and all that stuff. Requires a user to operate it (or at least an open Firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR — requires a working desktop computer; more difficult than creating a Fx addon, but more powerful.
Automated browser — more of a desktop programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might want to include in your stats are backlinks / pagerank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
or whatever language you are most comfortable with. A basic stand-alone script/app keeps you from needing to worry too much about browser integration and security issues.