Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails) - html

I've come across an issue which I can't seem to get past. I'm also a newcomer to Ruby on Rails, hence the number of questions.
I am attempting to scrape a webpage such as the following:
http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx
I would like to scrape the addresses, the phone numbers and the URL of the next page, which in this case is
http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx
I've been trying just about anything I could think of, but nothing seems to work, apparently because the elements are set to invisible.
The address is within an h3 tag but it does not appear to be scrapable. I've also been looking into ScRUBYt at http://www.rubyrailways.com/ajax-scraping-with-scrubyt-linkedin-google-analytics-yahoo-suggestions/, but I really can't make heads or tails of how to apply it in this case.
I would really appreciate any pointers, as this is an obstacle I need to get past in order to move forward on my assignment. Thanks in advance for any help.

In the particular example you have given, the elements are not hidden; they are loaded via AJAX after the page loads. So what you need is an HTTP client that can run JavaScript (in practice, a web browser) to see the addresses and the other content.
If you really want to automate the process and scrape data that is loaded through AJAX or JavaScript, you can try Selenium. Even though it was not developed for that purpose, it serves your needs.

I don't have an answer to your specific question, but I thought I'd point to Ryan Bates' Railscast episode on screen scraping with ruby: http://railscasts.com/episodes/173-screen-scraping-with-scrapi
He uses a library called scrAPI instead of ScRUBYt, since he couldn't get ScRUBYt working. scrAPI seems to be a bit easier maybe?
I hope this helps somewhat, good luck with your assignment! :)
-John

There is a good script posted at the google group. It seems to extract address, etc. You may want to look at the code for the script page.txt.

Related

How can I pass form data from one html file to another without JS/PHP?

I'm learning basic web dev and started with HTML, CSS, Bootstrap. Haven't touched PHP or anything server side yet.
What I've done so far is create a pretty basic registration form with 5 fields, and what I'm trying to do is display the input of those fields in a table that I've created on another page. The form has its "method" and "action" set. I've Googled a ton to find a solution and have gone through most of the questions on this site, but I still can't find out how to achieve what I'm trying to do without using PHP/JS.
So, is it even possible to read form data from another page like this without the use of JS/PHP? If so, how do I proceed and what needs to be done? I can post the source code but I don't think it's going to help since there isn't much there, everything else is working fine except for finding a solution to this.
Thank you.
You need a programming language.
If you want to do it entirely client-side, then that has to be JS.
If you want to do it server-side (which allows you to access the data and, optionally, make it available to other people, instead of limiting it to the user of the browser) then you can use any programming language at all (although JS and PHP are among the most common choices).
Since you are trying to create a registration page, you'll need to use server-side programming.
You necessarily need to use JavaScript / PHP.
Since you are just starting, I would highly recommend that you check out the W3Schools tutorials on HTML, CSS, JavaScript, PHP, Bootstrap and jQuery.
:)
So this is long gone, but I was actually able to resolve my problem without using anything other than basic HTML, so here's how I did it for anyone else who's trying to find the answer to this problem (probably no one, since you don't usually do this professionally and this was basically a challenge from a friend).
So, two things.
sessionStorage
localStorage
These are built into your browser, and you can use them for simple tasks by assigning values to them. The values stay there and you can use them however you want.
But, as the name implies, sessionStorage only retains those values for the page session (until the tab or browser window is closed), while localStorage keeps them until they are explicitly cleared. Not sure if I can link other sites over here so just Google these terms to learn more and how to use them.
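For anyone who wants to see the storage approach spelled out, here is a minimal sketch. Note that reading and writing these stores does go through a few lines of JavaScript. The key name and the field names are made up for illustration, and the storage object is passed in as a parameter so the same functions work with either sessionStorage or localStorage.

```javascript
// Hypothetical key name; any string works as long as both pages agree on it.
const FORM_KEY = "registrationForm";

// Page 1: call this from the form's submit handler.
// `storage` is expected to be window.sessionStorage or window.localStorage.
function saveForm(storage, fields) {
  // Web Storage only holds strings, so the object is serialized as JSON.
  storage.setItem(FORM_KEY, JSON.stringify(fields));
}

// Page 2: call this on load and write the returned values into the table.
function loadForm(storage) {
  const raw = storage.getItem(FORM_KEY);
  // getItem returns null when nothing has been stored under the key.
  return raw === null ? {} : JSON.parse(raw);
}
```

On the form page you would call something like saveForm(sessionStorage, { name: nameInput.value }), and on the table page loadForm(sessionStorage).name gives the value back. Both pages have to be served from the same origin for this to work.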

How to block people from viewing source code?

So I take this class, and I'm way ahead of everyone else, and a lot of people steal code from my website. I have already disabled right-clicking, but it's rather easy to get around this. Is there any way to stop people from being able to view my source code?
tl;dr: Nope.
You could look into obfuscation, as well as CSS & JS minification.
"If you steal from one author, it’s plagiarism; Steal from many, it’s art."
No. If someone wants it, they will get it. You can make it harder, but you will just cut your users off from normal functionality. Focus on your backend code.
If they steal your code, your lecturer will hopefully notice. Either way, they only hurt themselves.
Afaik the only way to hide your source code is if you put it on the server-side.
It is not possible to hide client-side source code from users - sorry.
One suggestion would be stopping the user from right-clicking but that might cause you more problems...
You could render the html pages server side and convert them into images which get sent to the client. You could then have some image maps that handle clicking on the various locations.
There isn't a perfect (100% bulletproof) solution for protecting your JavaScript code on the client side. However, there are some tools on the market that can help:
Code compression/minification (these usually don't protect the code):
Google Closure (free)
Uglify JS (free)
Code obfuscation/compression/minification:
JScrambler (paid, but in my opinion the best one on the market)
Jasob (paid)
Stunnix (paid, seems to be outdated)
Hope this answers your question!

Always avoid using <iframe>?

Some days ago, some friends of mine told me to avoid using <iframe> for virtually anything, which of course includes Google Maps. That made me do some research and, among other things, find this thread in Quora (http://www.quora.com/Google-Maps/What-are-best-practices-and-recommendations-to-implement-Google-maps-within-an-iframe-on-a-webpage), which I think isn't conclusive, at least in my case. I've made a simple site which includes displaying a Google Map. I used an <iframe> because it is very simple and, as pointed out before, it is the option that Google offers within every map, so I guessed it was the optimal one.
My question is: is using an <iframe> always a bad solution, or is it recommended in a simple case like mine (only displaying a location map)?
Thank you all, please let me hear your thoughts on this,
João
Using an iframe is like having another page loaded in your browser, which takes resources. I think this is what the suggestion to avoid it is based on. But naturally, the solution is to avoid those who suggest that you should always avoid something. Just use it when it makes sense and know where to stop.

Getting html content from one page and adding it to my website

I am an Expedia affiliate and I am using their API system. One of their requirements for launching the site is adding the terms and agreements to my page, and they give us this page: http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx. I do not want to send users to a different site, and I cannot copy and paste the information because it gets updated. I would also prefer not to use an iframe. Does anyone have any ideas on how to do this? Here is a webpage that does this on their own site with their own domain: http://www.helloweekends.com/terms.htm. Does anyone know how they did this? Any help would be greatly appreciated!
Since it originates from another domain, it wouldn't be possible to use JavaScript, due to the same origin policy. Also, relying on JavaScript for the update would be trouble for users who have JavaScript disabled, as they wouldn't see the terms. Since you don't want to use an iframe or copy the content, I guess your best shot would be to scrape their page with a server-side language of your choice, and then display it on your page.
Scraping can be a bit tricky though, if you rely on their markup. If they change their markup, there is a chance that your script will break and stop updating the terms.
There are various tutorials available on how to scrape sites. Here are a few PHP examples:
Web scrape with PHP
PHP Screen Scraping Tutorial
Note: Make sure that they allow you to scrape the page prior to implementing it, so that you don't violate their rules.
Do you know if their API serves something with JSON? A JSONP call can get the values to you, but it will make your page rely on javascript for the users to see the updated page.
Another option is to use PHP or any other server-side language to get the contents of the URL, process it and return the block you require.
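A rough sketch of that server-side option, written in JavaScript for a runtime that has a global fetch (for example Node 18 or later) rather than PHP. The function names and the start/end markers are hypothetical; the real markers depend on the markup of the page you are fetching, which I have not inspected.

```javascript
// Return the text between the first occurrence of startMarker and the next
// occurrence of endMarker, or null if either marker is missing.
function extractBlock(html, startMarker, endMarker) {
  const start = html.indexOf(startMarker);
  if (start === -1) return null;
  const from = start + startMarker.length;
  const end = html.indexOf(endMarker, from);
  if (end === -1) return null;
  return html.slice(from, end).trim();
}

// Fetch the remote page on the server and pull out the wanted block.
// Assumes a global fetch is available in the server runtime.
async function fetchTerms(url, startMarker, endMarker) {
  const response = await fetch(url);
  if (!response.ok) throw new Error("Request failed: " + response.status);
  return extractBlock(await response.text(), startMarker, endMarker);
}
```

Because the request is made by your server and not by the visitor's browser, the same origin policy does not get in the way. Plain string markers are fragile, so a null result should be treated as "their markup changed" and handled by falling back to a cached copy or a link.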
I would suggest the load() function offered by jQuery. It makes a simple AJAX call to retrieve a file, and you could even use a selector to only grab part of the page. Keep in mind that it is subject to the same origin policy, so the file has to be served from your own domain. For example, load the contents of an HTML page into a div:
$('#div_id').load('my_file.html');
Or just load a part of the page:
$('#div_id').load('my_file.html #main_text_id');

template removal/detection/difference utility for HTML and other text

I remember reading a while back on some random website about a program that would look at multiple pages on an HTML site and detect the differences/similarities between the pages to automatically detect which parts were template "boilerplate" and which parts were new content, and then based on this, automatically spit out just the parts that are content.
Unfortunately, I don't remember enough details about this utility to actually find it on Google, so I wonder if any of you guys have run across anything like this and CAN remember the name of it.
Thanks.
Murphy's Law (or is it some other law) has struck, and I've found it just moments after I'd given up and posted this question. The project I was thinking of is this:
http://code.google.com/p/boilerpipe/
Thanks.
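For anyone curious what the basic idea from the question looks like in code, here is a toy sketch. It is not how boilerpipe works; it only treats any line that occurs on every page as template boilerplate and keeps the rest. The function name is made up, and it needs at least two pages to compare.

```javascript
// pages: array of strings, one per page. Returns, for each page, the lines
// that do not occur on every page (the likely content).
function stripSharedLines(pages) {
  // Split each page into trimmed, non-empty lines.
  const split = pages.map((page) =>
    page.split("\n").map((line) => line.trim()).filter((line) => line !== "")
  );
  // Count on how many pages each distinct line occurs.
  const counts = new Map();
  for (const lines of split) {
    for (const line of new Set(lines)) {
      counts.set(line, (counts.get(line) || 0) + 1);
    }
  }
  // Keep only the lines that are missing from at least one page.
  return split.map((lines) =>
    lines.filter((line) => counts.get(line) < pages.length)
  );
}
```

Real pages rarely share template lines byte for byte (think of a highlighted menu entry or a date in the footer), so this is only a starting point for experimenting.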