Pull information from a website into my website - html

I am trying to add information into my website that is drawn from another website. This is information like the current local temperature from weather.com or the current stock price from yahoo finance or the current exchange rate from some other site. I need the numbers, not just an iframe or screenshot.
Is this possible?

Yes it is.
But you can't achieve it with pure HTML. You need use either a server-side scripting languages such as PHP or a client-side scripting language such as JavaScript. But JavaScript can cause problems because most browsers requests a Access-Control-Allow-Origin header. Yahoo's YQL (Yahoo Query Language) allows all requests from anywhere so you can use it. Yahoo has weather and currency (I don't know for sure)

So you basically have two options here. Find free or paid APIs and get the information from them such as this API for weather:
http://openweathermap.org/api
The other option is called web-scraping which is basically just scanning the source code of a site with some kind of server side or client side programming language to find the data that you want. The data is usually wrapped in an HTML tag with a certain class or id that you can scan for.
I would suggest trying to go the API route first though because web-scraping is a grey area and it will easily break unless the website you are scraping never makes design/layout changes.

Related

SQL drop down selections in HTML

I'm new to HTML and coding period. I've created a basic HTML page. In that page i want to create dropdown selections that produce outputs from my SQL database. MSSQL not MySQL.
EX: If I select a table or a column from dropdown one and then input a keyword for selection box 2. I want it to produce a table that shows the information in that table/column with that keyword.
If I select a medical name from dropdown and I want it to show only medical names that are equal to Diabetes. and then show me those rows from my database to a table. How would I show that in HTMl from connecting to the database, to creating the dropdown selection linked to the database, and then being able to select the criteria for what I want to be displayed. and then showing that in a table or list format.
Thank you in advance
OK, Facu Carbonel's answer is a bit... chaotic, so since this question (suprisingly) isn't closed yet, I'll write one myself and try to do better.
First of all - this is a VERY BROAD topic which I cannot answer directly. I could give a pile of code, but walking through it all would take pages of text and in the end you'd just have a solution for this one particular problem and could start from scratch with the next one.
So instead I'll take the same path that Factu Carbonel took and try to show some directions. I'll put keywords in bold that you can look up and research. They're all pieces of the puzzle. You don't need to understand each of them completely and thoroughly from the beginning, but be aware what they are and what they do, so that you can google finer details when you need them.
First of all, you need to understand the roles of the "server side" and "client side".
The client side is the browser (Chrome, Firefox, Internet Explorer, what have you). When you type an address in the address bar (or click a link or whatever), what the browser does is it parses the whole thing and extracts the domain name. For example, the link to this question is https://stackoverflow.com/questions/59903087/sql-drop-down-selections-in-html?noredirect=1#comment105933697_59903087 and the domain part of that is stackoverflow.com. The rest of this long jibberish (it's called an "URL" by the way) is also relevant, but later.
With the domain in hand the browser then uses the DNS system to convert that pretty name into an IP address. Then it connects via network to the computer (aka "server") designated by that IP address and issues a HTTP request (HTTP, not HTML - don't mix these up, they're not the same thing).
HTTP, by the way, is the protocol that is used on the web to communicate between the server and the browser. It's like a language that they both understand, so that the browser can tell the server hey, give me the page /questions/59903087/sql-drop-down-selections-in-html. And the server then returns the HTML for that page.
This, by the way, is another important point to understand about HTTP. First the browser makes its request, and the server listens. Then the server returns its response, and the browser listens. And then the connection is closed. There's no chit-chat back and forth. The browser can do another request immediately after that, but it will be a new request.
Now, the browser is actually pretty limited in what it can do. Through these HTTP requests it gets from the server the HTML code, the CSS code and the Javascript code. It also can get pictures, videos and sound files. And then it can display them according to the HTML and CSS. And Javascript code runs inside the browser and can manipulate the HTML and CSS as needed, to respond to the user's actions. But that's all.
It might seem that the Javascript code that runs inside the browser is all powerful, but that is only an illusion as well. It's actually quite limited, and on purpose. In order to prevent bad webpages from doing bad things, the Javascript in each page is essentially limited to that page only.
Note a few things that it CANNOT do:
It cannot connect to something that doesn't use HTTP. Like an SQL server.
It can make HTTP requests, but only to the same domain as the page (you can get around this via CORS, but that's advanced stuff you don't need to worry about)
It cannot access your hard drive (well, it can if the user explicitly selects a file, but that's it)
It cannot affect other open browser tabs
It cannot access anything in your computer outside the browser
This, by the way, is called "sandboxing" - like, the Javascript code in the browser is only allowed to play in its sandbox, which is the page in which it was loaded.
OK, so here we can see, that accessing your SQL server directly from HTML/CSS/Javascript is impossible.
Fortunately, we still need to talk about the other side of the equation - the web server which responded to the browser's requests and gave it the HTML to display.
It used to be, far back in the early days of the internet, that web servers only returned static files. Those days are long gone. Now we can make the webserver return -- whatever we want. We can write a program that inspects the incoming request from the browser, and then generates the HTML on the fly. Or Javascript. Or CSS. Or images. Or whatever. The good thing about the server side is - we have FULL CONTROL over it. There are no sandboxes, no limits, your program can do anything.
Of course, it can't affect anything directly in the browser - it can only respond to the browsers requests. So to make a useful application, you actually need to coordinate both sides. There's one program running in the browser and one program running on the web server. They talk through HTTP requests and together they accomplish what they need to do. The browser program makes sure to give the user a nice UI, and the server program talks to all the databases and whatnot.
Now, while in browser you're basically limited to just Javascript and the features the browser offers you, on the server side you can choose what web server software and what programming language you use. You can use the same Javascript, or you can go for something like PHP, Java (not the same as Javasctipt!), C#, Ruby, Python, and thousands of others. Each language is different and does things its own way, but at the end of the day what it will do is that it will receive the incoming requests from the browser and generate some sort of output that the browser expects.
So, I hope that this at least gives you some starting point and outlines where to go from here.
First of all there is something that you need to know to do this, and that is the difference between a front-end and a back-end.
Html is a front-end technology, they are called like that because that's what is shown to the user and the back-end it's all mechanisms that run behind the hood.
The thing is, in your front-end you can't do things of back-end, like do querys from a database, manage sessions and that kind of thing.
For that you need a back-end running behind, like php, ruby, node.js or some technology like that.
From the html you can only call functions on the server using things like <form action="/log" method="POST"> this wold call the action /log that you should have already program on your back-end. Don't get confuse with this, there is plenty of ways to sending request to your back-end and this is just one way to do it.
For your specific case I recommend you to look up for ajax, to do the query on your database with no need of the browser to refresh after the query is made.
Some topics you need to know to understand this is:
-what's front-end and back-end and their differences.
-what is client-server architecture
-ajax
-http requests
-how to work with a back-end, doing querys to the database, making routes, etc.
-and for last, wile your server it's not open to the world with your own domain name, what is localhost and how to use it.
I hope that this clarify a bit this, that is no easy thing, but with a bit of research and practice you will accomplish!

Getting same information firebug can get?

This all goes back to some of my original questions of trying to "index" a webpage. I was originally trying to do it specifically in java but now I'm opening it up to any language.
Before I tried using HTML unit and other methods in java to get the information I needed but wasn't successful.
The information I need to get from a webpage I can very easily find with firebug and I was wondering if there was anyway to duplicate what firebug was doing specifically for my needs. When I open up firebug I go to the NET tab, then to the XHR tab and it shows a constantly updating page with the information the server is updating. Then when I click on the request and look at the response it has the information I need, and this is all without ever refreshing the webpage which is what I am trying to do(not to mention the variables it is outputting do not show up in the html of the webpage)
So can anyone point me in the right direction of how they would go about this?
(I will be putting this information into a mysql database which is why i added it as a tag, still dont know what language would be best to use though)
Edit: These requests on the server are somewhat random and although it shows the url that they come from when I try to visit the url in firefox it comes up trying to open something called application/jos
Jon, I am fairly certain that you are confusing several technologies here, and the simple answer is that it doesn't work like that. Firebug works specifically because it runs as part of the browser, and (as far as I am aware) runs under a more permissive set of instructions than a JavaScript script embedded in a page.
JavaScript is, for the record, different from Java.
If you are trying to log AJAX calls, your best bet is for the serverside application to log the invoking IP, useragent, cookies, and complete URI to your database on receipt. It will be far better than any clientside solution.
On a note more related to your question, it is not good practice to assume that everyone has read other questions you have posted. Generally speaking, "we" have not. "We" is in quotes because, well, you know. :) It also wouldn't hurt for you to go back and accept a few answers to questions you've asked.
So, the problem is?:
With someone else's web-page, hosted on someone else's server, you want to extract select information?
Using cURL, Python, Java, etc. is too painful because the data is continually updating via AJAX (requires a JS interpreter)?
Plain jQuery or iFrame intercepts will not work because of XSS security.
Ditto, a bookmarklet -- which has the added disadvantage of needing to be manually triggered every time.
If that's all correct, then there are 3 other approaches:
Develop a browser plugin... More difficult, but has the power to do everything in one package.
Develop a userscript. This is much easier to do and technologies such as Greasemonkey deal with the XSS problem.
Use a browser macro technology such as Chickenfoot. These all have plusses and minuses -- which I won't get into.
Using Greasemonkey:
Depending on the site, this can be quite easy.   The big drawback, if you want to record data, is that you need your own web-server and web-application. But this server can be locally hosted on an XAMPP stack, or whatever web-application technology you're comfortable with.
Sample code that intercepts a page's AJAX data is at: Using Greasemonkey and jQuery to intercept JSON/AJAX data from a page, and process it.
Note that if the target page does NOT use jQuery, the library in use (if any) usually has similar intercept capabilities. Or, listening for DOMSubtreeModified always works, too.
If you're using a library such as jQuery, you may have an option such as the jQuery ajaxSend and ajaxComplete callbacks. These could post requests to your server to log these events (being careful not to end up in an infinite loop).

How can you access the info on a website via a program?

Suppose I want to write a program to read movie info from IMDb, or music info from last.fm, or weather info from weather.com etc., just reading the webpage and parsing it is quiet tedious. Often websites have an xml feed (such as last.fm) set up exactly for this.
Is there a particular link/standard that websites follow for this feed? Such as robot.txt, is there a similar standard for information feeds, or does each website have its own standard?
This is the kind of problem RSS or Atom feeds were designed for, so look for a link for an RSS feed if there is one. They're both designed to be simple to parse too. That's normally on sites that have regularly updated content though, like news or blogs. If you're lucky, they'll provide many different RSS feeds for different aspects of the site (the way Stackoverflow does for questions, for instance)
Otherwise, the site may have an API you can use to get the data (like Facebook, Twitter, Google services etc). Failing that, you'll have to resort to screen-scraping and the possible copyright and legal implications that are involved with that.
Websites provide different ways to access this data. Like web services , Feeds, Endpoints to query their data.
And there are programs used to collect data from pages without using standard techniques. These programs are called Bots. These programs use different techniques to get data from websites (NOTE: Be careful Data may be copyright protected)
The most common such standards are RSS and the related Atom. Both are formats for XML syndication of web content. Most software libraries include components for parsing these formats, as they are widespread.
yes rss standard. And xml standard.
Sounds to me like you're referring to RSS or Atom feeds. These are specified for a given page in the source; for instance, open the source html for this very page and go to line 22.
Both Atom and RSS are standards. They are both XML based, and there are many parsers for each.
You mentioned screen scraping as the "tedious" option; it is also normally against the terms of service for the website. Doing this may get you blocked. Feed reading is by definition allowed.
There are a number of standards websites use for this, depending on what they are doing, and what they want to do.
RSS is a protocol for sending out formatted chunks of data in machine-parsable form. It stands for "Real Simple Syndication" and is usually used for news feeds, blogs, and other things where there is new content on a periodic or sporadic basis. There are dozens of RSS readers which allow one to subscribe to multiple RSS sources and periodically check them for new data. It is intended to be lightweight.
AJAX is a protocol for sending commands from websites to the web server and getting results back in a machine-parsable form. It is designed to work with JavaScript on the web client. The AJAX standard specifies how to format and send a request and how to format and send a reply, as well as how to parse the requests and replies. It tends to be up to the developers to know what commands are available via AJAX.
SOAP is another protocol like AJAX, but it's uses tend to be more program-to-program, rather than from web client to server. SOAP allows for auto-discovery of what commands are available by use of a machine-readable file in WSTL format, which essentially specifies in XML the method signatures and types used by a particular SOAP interface.
Not all sites use RSS, AJAX, or SOAP. Last.fm, one of the examples you listed, does not seem to support RSS and uses it's own web-based API for getting information from the site. In those cases, you have to find out what their API is (Last.fm appears to be well documented, however).
Choosing the method of obtaining data depends on the application. If its a public/commercial application screen scraping won't be an option. (E.g. if you want to use IMDB information commercially then you will need to make contract paying them 15000$ or more according to their website's usage policy)
I think your problem isn't not knowing the standard procedure for obtaining website information but rather not knowing that your inability to obtain data is due to websites not wanting to provide that data.
If a website wants you to be able to use their information, then there will almost certainly be a well documented api interface with various standard protocols for queries.
A list of APIs can be found here.
Dataformats listed at this particular sites are: CSV, GeoRSS, HTML, JSON, KML, OPML, OpenSearch, PHP, RDF, RSS, Text, XML, XSPF, YAML, CSV, GEORSS.

What are the pros and cons of various ways of analyzing websites?

I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Chose a scripting language. I suggest Perl or Python: also curl+bash but it bad no exception handling.
b) Load the home page via a script, using a python or perl library.
Try Perl WWW::Mechanize module.
Python has plenty of built-in module, try a look also at www.feedparser.org
c) Inspect the server header (via the HTTP HEAD command) to find application server name. If you are lucky you will also find the CMS name (i.d. WordPress, etc).
d) Use Google XML API to ask something like "link:sitedomain.com" to find out links pointing to the site: again you will find code examples for Python on google home page. Also asking domain ranking to Google can be helpful.
e)You can collect the data in a SQLite db, then post process them in Excel.
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language. From your own computer that is connected to Internet.
iframe is a widget for displaying HTML content, it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, PHP are certainly more powerful for your tasks than Javascript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object" etc.
I think an writing an extension to Firebug would proabably be one of the easiest way to do with. For instance YSlow has been developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and Javascript-summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need and you could then build your tool as a website or such where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a firefox addon).
I think you should probably try and stay away from Option #1 (the Iframe) because of the cross-site scripting issues you already are aware of.
Also, I have used Option #3 (Grab the site on the server side) and one problem I've ran into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have ran into that also, see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the ajax issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated javascript from the content being grabbed works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrap content from a dynamic AJAX enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also here is a general article on using PHP and the Curl library to scrap all the links from a web page. However, I'm not sure if this article and the Curl library covers the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc..
[OPTIONAL] save the content as a local page on your server .
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for it's news feeds, etc...
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I’m right in the middle of similar project. It has to analyze the DOM of a page generated using Javascript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.
Fx addon is great, if you don’t need the automation. A script can analyze the page, show you the results, ask you to correct the parts, that it is uncertain of and finally post the data to some backend. You have access to all of the DOM, so you don’t need to write a JS/CSS/HTML/whatever parser (that would be hell of a job!)
Another way is Adobe AIR. Here, you have more control over the application — you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is — you don’t have access to all DOM of the pages. The only way to go pass this is to set up a simple proxy, that fetches target URL, adds some Javascript (to create a trusted-untrusted sandbox bridge)… It’s a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create HTMLLoader object, and feed the response into it (loadString method IIRC)
Create an iframe, and load the site in untrusted sandbox.
I don’t remember why, but the first method failed for me, so I had to use the other one (i think there was some security reasons involved, that I couldn’t workaround). And I had to create a sandbox, to access site’s DOM. Here’s a bit about dealing with sandbox bridges. The idea is to create a proxy, that adds a simple JS, that creates childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents is something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(be careful — there are limitations of what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML. All other was just passed through unchanged. I’ve done this using Apache + PHP, but could be done with a real proxy with some plugins/custom modules for sure. This way I had the access to DOM of any site.
end of edit.
The third way I know of, the hardest way — set up an environment similar to those on browsershots. Then you’re using firefox with automation. If you have a Mac OS X on a server, you could play with ActionScript, to do the automation for you.
So, to sum up:
PHP/server-side script — you have to implement your own browser, JS engine, CSS parser, etc, etc. Fully under control and automated instead.
Firefox Addon — has access to DOM and all stuff. Requires user to operate it (or at least an open firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR — requires a working desktop computer, more difficult than creating a Fx addon, but more powerful.
Automated browser — more of a desktop programming issue that webdevelopment. Can be set up on a linux terminal without graphical environment. Requires master hacking skills. :)
Being primarily a .Net programmer these days, my advice would be to use C# or some other language with .Net bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URL's and use the HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might be interested in including in your stats could include backlinks / pagerank (via Google API), whether the page validates as HTML or XHTML, what percentage of links link to URL's in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
whatever language you are most comfortable with. A basic stand alone script/app keeps you from needing to worry too much about browser integration and security issues.

how to browse web site with script to get informations

I need to write a script that go to a web site, logs in, navigates to a page and downloads (and after that parse) the html of that page.
What I want is a standalone script, not a script that controls Firefox. I don't need any javascript support in that just simple html navigation.
If nothing easy to do this exists.. well then something that acts though a web browser (firefox or safari, I'm on mac).
thanks
I've no knowledge of pre-built general purpose scrapers, but you may be able to find one via Google.
Writing a web scraper is definitely doable. In my very limited experience (I've written only a couple), I did not need to deal with login/security issues, but in Googling around I saw some examples that dealt with them - afraid I don't remember URL's for those pages. I did need to know some specifics about the pages I was scraping; having that made it easier to write the scraper, but, of course, the scrapers were limited to use on those pages. However, if you're just grabbing the entire page, you may only need the URL(s) of the page(s) in question.
Without knowing what language(s) would be acceptable to you, it is difficult to help much more. FWIW, I've done scrapers in PHP and Python. As Ben G. said, PHP has cURL to help with this; maybe there are more, but I don't know PHP very well. Python has several modules you might choose from, including lxml, BeautifulSoup, and HTMLParser.
Edit: If you're on Unix/Linux (or, I presume, CygWin) You may be able to achieve what you want with wget.
If you wanted to use PHP, you could use the cURL functions to build your own simple web page scraper.
For an idea of how to get started, see: http://us2.php.net/manual/en/curl.examples-basic.php
This is PROBABLY a dumb question, since I have no knowledge of mac but what language are we talking about here, and also is this a website that you have control over, or something like a spider bot that google might use when checking page content? I know that in C# you can load in objects on other sites using an HttpWebRequest and a stream reader... In java script (this would only really work if you know what is SUPPOSED to be there) you could open the web page as the source of an iframe, and using java script traverse the contents of all the elements on the page... or better yet, use jquery.
I need to write a script that go to a web site, logs in, navigates to a page and downloads (and after that parse) the html of that page.
To me this just sounds like a POST or GET request to the URL of the login page could do the job.With the proper parameters username and password (depending on the form input names used on the page) set in the request, the result will be the html of the page that you can then parse as you please.
This can be done with virtually any language. What language do you want to use?
I recently did exactly what you’re asking for in a C# project. If login is required your first request is likely to be a post and include credentials. The response will usually include cookies which persist the identity across subsequent requests. Use Fiddler to look at what form data (field names and values) is being posted to the server when you logon normally with your browser. Once you have this you can construct an HttpWebRequest with the form data and store the cookies from the response in a CookieContainer.
The next step is to make the request for the content you actually want. This will be another HttpWebRequest with the CookieContainer attached. The response can be read by a StreamReader which you can than read and convert to a string.
Each time I’ve done this it has usually been a pretty laborious process to identify all the relevant form data and recreate the requests manually. Use Fiddler extensively and compare the requests your browser is making when using the site normally with the requests coming from your script. You may also need to manipulate the request headers; again, use Fiddler to construct these by hand, get them submitting correctly and the response as you expect then code it. Good luck!