I'm trying to make a view with Swift on Xcode that shows the schedule of an event day.
The problem is that, while I'm trying to make the schedule dynamic, so that it changes depending on the data from a website, the way to do so is elusive.
For example, if a schedule is already posted on a website, what could I do so that the schedule shown in the app changes along with the schedule on the website?
Or, as an alternative, could I maybe bring data from a google doc that lists a schedule to be shown on the schedule view of my app?
Merely using a webview and linking the website page isn't good enough, as it would be no better than posting a link on the app.
I know these are very broad questions, but I was wondering if the community has better ideas to offer, or some efficient ways to carry out those two aforementioned procedures.
Yes. Using WebView is a lazy method and is not efficient.
The best solution is to scrape the data from that website. There are lots of scraping libraries available for many languages, usually called HTML DOM parsers. I prefer this one: http://simplehtmldom.sourceforge.net/ It's written in PHP. Though its name says "simple", it's very powerful. Read their docs. These scrapers read data from HTML code and parse it based on rules such as id, class, etc. You can then convert this data to JSON/XML and use it in your app.
Before doing this, check whether that website has any sort of API
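As a rough illustration of the scrape-then-convert idea, here is a minimal sketch in Python using only the standard library (not the PHP library linked above). The sample markup, the "session" class name, and the helper names are all assumptions made up for the example; a real page will need its own selection rules.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup; a real schedule page will look different.
SAMPLE_HTML = """
<ul>
  <li class="session"><span class="time">09:00</span> Registration</li>
  <li class="session"><span class="time">10:00</span> Keynote</li>
</ul>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextParser(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.items = []     # finished text snippets
        self._depth = 0     # > 0 while inside a matching element
        self._buffer = []   # text pieces of the current match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def parse_schedule(html, target_class="session"):
    parser = ClassTextParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.items


def schedule_to_json(html):
    """Serialize the scraped entries so an app can consume them."""
    return json.dumps({"schedule": parse_schedule(html)})
```

The app would then fetch this JSON from your own server instead of parsing the site's HTML on the device.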
I am trying to add information into my website that is drawn from another website. This is information like the current local temperature from weather.com or the current stock price from yahoo finance or the current exchange rate from some other site. I need the numbers, not just an iframe or screenshot.
Is this possible?
Yes it is.
But you can't achieve it with pure HTML. You need to use either a server-side scripting language such as PHP or a client-side scripting language such as JavaScript. JavaScript can cause problems, though, because browsers block cross-origin requests unless the target server sends an Access-Control-Allow-Origin header. Yahoo's YQL (Yahoo Query Language) allows requests from anywhere, so you can use it. I believe Yahoo has weather and currency data, but I don't know for sure.
So you basically have two options here. Find free or paid APIs and get the information from them such as this API for weather:
http://openweathermap.org/api
The other option is called web-scraping which is basically just scanning the source code of a site with some kind of server side or client side programming language to find the data that you want. The data is usually wrapped in an HTML tag with a certain class or id that you can scan for.
I would suggest trying to go the API route first though because web-scraping is a grey area and it will easily break unless the website you are scraping never makes design/layout changes.
I've had a thought of using Wordpress as a CMS backend, because well a lot of people know it and it is easy to use and then using Node.JS as the front-end. You're probably thinking now why would I want to do that in the first place, what is the advantage?
I want to use websockets and the wonderful Socket.io library for Node.JS provides beautiful cross-browser websockets support. Essentially I want a user to come to a site, a websocket is created and then content is fed to the frontend asynchronously as JSON and then decoded on the frontend all without page refreshing.
Effectively I am making Wordpress become a real-time CMS. You visit a site, but every link you click fetches the page as JSON and returns it via a websocket to save multiple requests and of course, page size.
How do I go about getting Node.JS talking to a MySQL database, pulling out info and then showing it? Any tutorials, resources and other useful tips would be gratefully appreciated. A few of my colleagues have wondered the same thing, so I think the answers will be a big help to everyone.
To be exact, you can't use Node.js for a front-end solution, since it runs on the server, not the browser (think of it like any other server-side language such as PHP, JSP etc).
You can, however, create the described solution with jQuery or any other Javascript library, you just have to implement data transfer with Socket.IO. On the server-side you'd need something to handle websockets, so the most native way would be to use Node.js, but since you want to use Wordpress, it gets really complicated, as Wordpress is not meant to be used in the way you described, so I'm afraid you'd have to write your CMS from ground up in Node.
Also, the way you described has a huge flaw. Search engine crawlers are still unable to parse and run Javascript, so if all of your content is loaded dynamically, it would seem empty to Google and others, so it would be impossible to ever make it in the search results rendering your site pretty much useless.
For MySQL and other modules for Node, you should check NPM registry and the Node modules page.
EDIT
After Dwayne explained his solution in comments, this is how I'd do it:
I'd use jQuery for front-end. Binding the document with .on(), and setting the selector to 'a', so that every anchor on the webpage would fire the handler.
The handler parses the a.href attribute and figures out whether it's an external link, which shouldn't be handled by Javascript, or if it's a link to the next page, to an article etc. You can prevent the default action by calling e.preventDefault() in the handler, which prevents the browser from redirecting to the location.
Then the handler would get the content in JSON by calling .getJSON() to the URL based on the article. The easiest way would be to have a certain pattern (such as all urls like www.domain.com/api) redirect to the Node service via .htaccess, to prevent cross-domain problems.
Node would then see the request, extract the parameters and figure out what the user wants. Then connect to the MySQL database with this module (it's as simple as it can get) and return the corresponding content formatted as JSON. Don't forget to set the Content-Type headers to 'application/json'.
jQuery gets the response, figures out the type of the request and updates the content accordingly. Profit.
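The server-side step described above (extract the parameters, query the database, return JSON with the right Content-Type) could look roughly like this. This is only a sketch: it is written in Python with sqlite3 standing in for Node and MySQL, and the /api/article path, the articles table, and the function names are all hypothetical.

```python
import json
import sqlite3
from urllib.parse import urlparse, parse_qs

JSON_HEADERS = {"Content-Type": "application/json"}


def make_demo_db():
    """In-memory stand-in for the CMS database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.execute("INSERT INTO articles VALUES (1, 'Hello', 'First post')")
    return conn


def handle_api_request(url, conn):
    """Return (status, headers, body) for a URL such as /api/article?id=1."""
    parsed = urlparse(url)
    if parsed.path != "/api/article":
        return 404, JSON_HEADERS, json.dumps({"error": "not found"})
    try:
        article_id = int(parse_qs(parsed.query)["id"][0])
    except (KeyError, ValueError):
        return 400, JSON_HEADERS, json.dumps({"error": "bad id"})
    # Parameterized query: never build SQL from request data by string pasting.
    row = conn.execute(
        "SELECT title, body FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    if row is None:
        return 404, JSON_HEADERS, json.dumps({"error": "not found"})
    return 200, JSON_HEADERS, json.dumps({"title": row[0], "body": row[1]})
```

Keeping the handler a plain function that returns status, headers, and body makes it easy to plug into whatever HTTP server you end up using.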
As you can see, I wouldn't use WebSockets in this case, since you wouldn't really benefit much from them. They are mostly meant for small, two-way real-time updates (with no large HTTP headers, which reduces bandwidth). This means that the server could also push data into the browser without the browser asking for it. In a blog context, this is not required, and you won't have too many requests, so the difference in bandwidth wouldn't be noticeable anyway. If, however, you would like to use it for educational purposes, just replace the getJSON part with Socket.IO. I'm not sure whether Apache supports proxying WebSockets, though. Extra information about Socket.IO basics is here.
Edit: I overlooked the part with 'using Node.js on the front-end'. As Vahur Roosimaa said, Node.js is on the server-side (think of it as Nginx / Apache + PHP combination). Node isn't a frontend library like jQuery.
If you want you can use it just for the websockets functionality (I suggest using Socket.IO).
Nice tutorials about Node.js and MySQL:
http://www.giantflyingsaucer.com/blog/?p=2596
http://mclear.co.uk/2011/01/26/very-simple-nodejs-mysql-select-query-example/
http://www.hacksparrow.com/using-mysql-with-node-js.html
This SO question might also help: MySQL with Node.js
Also check the examples from the github repo of node-mysql.
If you want something more advanced like an ORM, I recommend Sequelize.
Another good question from SO: Which ORM should I use for Node.js and MySQL?
You should check out Wordscript, to which I recently added a Node.js example that can act as a simple front end for basic post retrieval from a Wordpress database.
It uses a common mysql library for node, and generates MySQL queries from get parameters and renders data as it is retrieved from the database; including tags.
Wordscript aims to free backend/frontend developers from being forced to work with the Wordpress PHP codebase, but still allows Wordpress's administrative interface to be used when needed (and prudent to do so). APIs have been written in Ruby and PHP that both return JSON feeds and function generally the same way the Node version does, so that's an additional option where a scripting language is available.
One option you have, if you want to have wordpress as the CMS and keep its admin UI, is to write your wordpress templates to output JSON instead of HTML.
In contrast to Wordscript, this is more solution specific, since you will need to write your JSON output for every template/data you want. The upside is that you can create the JSON specifically for your needs.
On the node side, you write a small server that will consume the JSON, letting you use whatever javascript template language you want. Nodejs will also help out with performance, since you can save the rendered content and/or the JSON output in memory, saving you roundtrips to the wordpress templates.
I wrote a blog about this, which describes more of the benefits of using nodejs and wordpress together.
http://www.1001.io/improve-wordpress-with-nodejs/
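The in-memory caching idea mentioned above (keep the fetched JSON around so you don't hit the Wordpress templates on every request) can be sketched as a small time-limited cache. This is in Python rather than Node, and the class name, the TTL, and the fetch callback are all hypothetical.

```python
import time


class TTLCache:
    """Keep fetched values in memory and refetch them once they are too old."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the cache can be tested
        self._store = {}        # key -> (time_stored, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # still fresh: no round trip to the backend
        value = fetch(key)      # e.g. request the JSON from the CMS template
        self._store[key] = (now, value)
        return value
```

In the setup described above, fetch would be whatever function requests the JSON template output for a given page slug.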
Suppose I want to write a program to read movie info from IMDb, or music info from last.fm, or weather info from weather.com etc. Just reading the webpage and parsing it is quite tedious. Often websites have an XML feed (such as last.fm) set up exactly for this.
Is there a particular link/standard that websites follow for this feed? Is there a standard for information feeds, similar to robots.txt, or does each website have its own standard?
This is the kind of problem RSS or Atom feeds were designed for, so look for a link for an RSS feed if there is one. They're both designed to be simple to parse too. That's normally on sites that have regularly updated content though, like news or blogs. If you're lucky, they'll provide many different RSS feeds for different aspects of the site (the way Stackoverflow does for questions, for instance)
Otherwise, the site may have an API you can use to get the data (like Facebook, Twitter, Google services etc). Failing that, you'll have to resort to screen-scraping and the possible copyright and legal implications that are involved with that.
Websites provide different ways to access this data, such as web services, feeds, and endpoints for querying their data.
There are also programs that collect data from pages without using these standard techniques. These programs are called bots, and they use various techniques to get data from websites. (NOTE: Be careful, the data may be copyright protected.)
The most common such standards are RSS and the related Atom. Both are formats for XML syndication of web content. Most software libraries include components for parsing these formats, as they are widespread.
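To give an idea of how little code feed parsing takes, here is a minimal sketch using Python's standard xml.etree.ElementTree on a made-up RSS 2.0 document. The sample feed and the function name are assumptions for illustration; Atom uses different element names and XML namespaces, so it would need its own handling.

```python
import xml.etree.ElementTree as ET

# Made-up feed for the example; a real one would be downloaded first.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First</title><link>http://example.com/1</link></item>
    <item><title>Second</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]
```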
Yes: the RSS standard, and the XML standard.
Sounds to me like you're referring to RSS or Atom feeds. These are specified for a given page in the source; for instance, open the source html for this very page and go to line 22.
Both Atom and RSS are standards. They are both XML based, and there are many parsers for each.
You mentioned screen scraping as the "tedious" option; it is also normally against the terms of service for the website. Doing this may get you blocked. Feed reading is by definition allowed.
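Feeds are usually advertised in the page source through link elements with rel="alternate" and an RSS or Atom MIME type. A minimal sketch of finding them with Python's standard library follows; the class and function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the URLs of feeds advertised via <link rel="alternate">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds
```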
There are a number of standards websites use for this, depending on what they are doing, and what they want to do.
RSS is an XML-based format for publishing chunks of data in machine-parsable form. It stands for "Really Simple Syndication" and is usually used for news feeds, blogs, and other sources where new content appears on a periodic or sporadic basis. There are dozens of RSS readers which allow one to subscribe to multiple RSS sources and periodically check them for new data. It is intended to be lightweight.
AJAX is a technique rather than a formal protocol or standard: JavaScript on the web client sends requests to the web server in the background and gets results back in a machine-parsable form, typically XML or JSON. Each site defines its own request and reply formats, so it tends to be up to the developers to know what commands are available via AJAX.
SOAP is an XML-based messaging protocol; compared with AJAX, its uses tend to be more program-to-program, rather than from web client to server. SOAP allows for auto-discovery of what commands are available by use of a machine-readable file in WSDL format, which essentially specifies in XML the method signatures and types used by a particular SOAP interface.
Not all sites use RSS, AJAX, or SOAP. Last.fm, one of the examples you listed, does not seem to support RSS and uses its own web-based API for getting information from the site. In those cases, you have to find out what their API is (Last.fm's appears to be well documented, however).
Choosing the method of obtaining data depends on the application. If it's a public/commercial application, screen scraping won't be an option. (E.g. if you want to use IMDb information commercially, you will need to sign a contract and pay them $15,000 or more, according to their website's usage policy.)
I think your problem isn't not knowing the standard procedure for obtaining website information but rather not knowing that your inability to obtain data is due to websites not wanting to provide that data.
If a website wants you to be able to use their information, then there will almost certainly be a well documented api interface with various standard protocols for queries.
A list of APIs can be found here.
The data formats listed at this particular site are: CSV, GeoRSS, HTML, JSON, KML, OPML, OpenSearch, PHP, RDF, RSS, Text, XML, XSPF, YAML.
I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load site into an iFrame. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python. curl+bash also works, but it is a poor choice because it has no exception handling.
b) Load the home page via a script, using a python or perl library.
Try Perl WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org
c) Inspect the server headers (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress).
d) Use Google XML API to ask something like "link:sitedomain.com" to find out links pointing to the site: again you will find code examples for Python on google home page. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in a SQLite db, then post-process it in Excel.
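For step c), a very rough sketch of guessing the CMS from the response headers and the page source might look like the following. The heuristics here (an X-Generator header, a generator meta tag, and a few well-known path fragments) are assumptions chosen for illustration, not a reliable fingerprinting method, and the function name is made up.

```python
import re

# Assumes name= comes before content=; real pages may order attributes differently.
GENERATOR_RE = re.compile(
    r"""<meta\s+name=["']generator["']\s+content=["']([^"']*)["']""",
    re.IGNORECASE,
)

# Path fragments that commonly show up in asset URLs (illustrative, incomplete).
PATH_MARKERS = {
    "WordPress": ("wp-content", "wp-includes"),
    "Drupal": ("sites/default/files",),
}


def guess_cms(headers, html):
    """Return a best-effort CMS hint, or None if nothing matches."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-generator" in lowered:
        return lowered["x-generator"]
    match = GENERATOR_RE.search(html)
    if match:
        return match.group(1)
    for cms, markers in PATH_MARKERS.items():
        if any(marker in html for marker in markers):
            return cms
    return None
```

The headers dictionary would come from the HEAD request and the html string from the page fetched in step b).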
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer, as long as it is connected to the Internet.
iframe is a widget for displaying HTML content, it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, PHP are certainly more powerful for your tasks than Javascript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object" etc.
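Looking for specific tags really is that simple. Here is a minimal sketch with Python's built-in html.parser that lists assets by tag; the tag-to-attribute table and the function names are just assumptions for the example, so extend them to whatever you count as an asset.

```python
from collections import Counter
from html.parser import HTMLParser

# Which attribute holds the asset URL for each tag of interest.
ASSET_ATTRS = {"img": "src", "script": "src", "link": "href", "object": "data"}


class AssetCollector(HTMLParser):
    """Record (tag, url) for every asset-bearing tag in the document."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_name = ASSET_ATTRS.get(tag)
        if attr_name is None:
            return
        url = dict(attrs).get(attr_name)
        if url:
            self.assets.append((tag, url))


def summarize_assets(html):
    """Return (count per tag, list of (tag, url)) for the given HTML string."""
    collector = AssetCollector()
    collector.feed(html)
    collector.close()
    return Counter(tag for tag, _ in collector.assets), collector.assets
```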
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow has been developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and Javascript-summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need and you could then build your tool as a website or such where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a firefox addon).
I think you should probably try and stay away from Option #1 (the Iframe) because of the cross-site scripting issues you already are aware of.
Also, I have used Option #3 (Grab the site on the server side) and one problem I've run into in the past is that the site being grabbed loads content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that also, see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the ajax issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated javascript from the content being grabbed works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic AJAX enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also here is a general article on using PHP and the Curl library to scrape all the links from a web page. However, I'm not sure if this article and the Curl library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc.
[OPTIONAL] save the content as a local page on your server.
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) to be correct.
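For the relative-path note, one possible approach is to use a regular expression only to locate the src/href attributes and let urllib.parse.urljoin do the actual resolution. This is a sketch in Python; the function name and the sample URLs are made up, and the pattern only handles quoted attribute values.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href='...' and captures the prefix, the quote, and the URL.
ATTR_RE = re.compile(r"""(\b(?:src|href)\s*=\s*)(["'])(.*?)\2""", re.IGNORECASE)


def absolutize(html, base_url):
    """Rewrite quoted src/href values so they are absolute URLs."""

    def fix(match):
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{urljoin(base_url, url)}{quote}"

    return ATTR_RE.sub(fix, html)
```

Here base_url should be the URL the page was originally fetched from, so that the saved copy still points at the original images and links.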
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc.
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I’m right in the middle of a similar project. It has to analyze the DOM of a page generated using Javascript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.
A Fx addon is great if you don’t need the automation. A script can analyze the page, show you the results, ask you to correct the parts that it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don’t need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!)
Another way is Adobe AIR. Here, you have more control over the application: you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don’t have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some Javascript (to create a trusted-untrusted sandbox bridge). It’s a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create HTMLLoader object, and feed the response into it (loadString method IIRC)
Create an iframe, and load the site in untrusted sandbox.
I don’t remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn’t work around). And I had to create a sandbox to access the site’s DOM. Here’s a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS snippet, which creates childSandboxBridge and exposes some methods to the parent (in this case, the AIR application). The script’s contents are something like:
window.childSandboxBridge = {
// ... some methods returning data
}
(be careful — there are limitations of what can be passed via the sandbox bridge — no complex objects for sure! use only the primitive types)
So, the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I’ve done this using Apache + PHP, but it could surely be done with a real proxy with some plugins/custom modules. This way I had access to the DOM of any site.
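The rewriting step of such a proxy can be sketched as follows. This is in Python rather than the Apache + PHP setup I used, and the function name and the /bridge.js URL are hypothetical placeholders for wherever the bridge script is served from.

```python
import re

HTML_TYPES = ("text/html", "application/xhtml+xml")
BODY_CLOSE_RE = re.compile(r"</body\s*>", re.IGNORECASE)


def inject_bridge(content_type, body, script_src="/bridge.js"):
    """Add a script tag to HTML/XHTML responses; pass everything else through."""
    mime = content_type.split(";")[0].strip().lower()
    if mime not in HTML_TYPES:
        return body                      # images, CSS, JSON, ... stay untouched
    tag = f'<script src="{script_src}"></script>'
    matches = list(BODY_CLOSE_RE.finditer(body))
    if not matches:
        return body + tag                # no closing body tag: just append
    last = matches[-1].start()           # insert right before the last </body>
    return body[:last] + tag + body[last:]
```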
end of edit.
The third way I know of, and the hardest, is to set up an environment similar to those on browsershots. Then you’re using Firefox with automation. If you have Mac OS X on a server, you could play with AppleScript to do the automation for you.
So, to sum up:
PHP/server-side script: you have to implement your own browser, JS engine, CSS parser, etc. In return, it is fully under your control and automated.
Firefox Addon: has access to the DOM and all that stuff. Requires a user to operate it (or at least an open Firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR: requires a working desktop computer; more difficult than creating a Fx addon, but more powerful.
Automated browser: more of a desktop programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use the HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might be interested in including in your stats could include backlinks / pagerank (via Google API), whether the page validates as HTML or XHTML, what percentage of links link to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
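The same-domain versus off-site statistic, including the BASE tag handling, is language independent. Here is a minimal sketch in Python (not .NET), with a made-up function name, that takes the hrefs you have already extracted from the page.

```python
from urllib.parse import urljoin, urlparse


def link_stats(page_url, hrefs, base_href=None):
    """Count links pointing to the page's own host versus other hosts."""
    # A <base href> on the page, if present, changes how relative links resolve.
    base = urljoin(page_url, base_href) if base_href else page_url
    site_host = urlparse(page_url).hostname
    internal = external = 0
    for href in hrefs:
        host = urlparse(urljoin(base, href)).hostname
        if host is None:
            continue                     # mailto:, javascript:, and similar
        if host == site_host:
            internal += 1
        else:
            external += 1
    total = internal + external
    percent = 100.0 * internal / total if total else 0.0
    return {"internal": internal, "external": external, "internal_pct": percent}
```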
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
whatever language you are most comfortable with. A basic stand alone script/app keeps you from needing to worry too much about browser integration and security issues.