How to display content from other sites on my own page? - language-agnostic

How does a website like http://www.dogpile.com display search results from Google and other search engines on its own page? The only way I can think of doing something like this is with iframes, but of course then the content won't really be on my page.

They use the public APIs of the different search engines and build their pages from the results.
See:
Google's Search API
Bing Search API
Yahoo! Search API

Have a look at curl. There are plenty of examples of its usage on that page.

On the server side, download the content of their page, make all the relative references absolute, add their head entries to yours, add their body to yours, and hope you don't get caught stealing someone else's content.
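The "make all the relative references absolute" step can be sketched roughly like this in Python. It is only a minimal sketch: the regular expression is a shortcut that assumes quoted href, src and action attributes, the function name absolutize is made up for illustration, and a real page would be better served by a proper HTML parser.

```python
import re
from urllib.parse import urljoin

# Rough pattern for quoted href/src/action attributes. This is an assumption
# that covers common markup only; unquoted attributes are not handled.
URL_ATTR_RE = re.compile(r'''(\b(?:href|src|action)\s*=\s*)(["'])(.*?)\2''',
                         re.IGNORECASE)


def absolutize(html, base_url):
    """Return html with relative URLs resolved against base_url."""
    def fix(match):
        prefix, quote, value = match.groups()
        # urljoin leaves already-absolute URLs unchanged.
        return prefix + quote + urljoin(base_url, value) + quote
    return URL_ATTR_RE.sub(fix, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a> <img src="img/logo.png">'
    print(absolutize(page, "http://example.com/dir/page.html"))
```

Here base_url is the address the page was downloaded from; example.com is a placeholder.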

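You could use AJAX and just treat the remote page as a web service that returns HTML. That is, assign the response directly to innerHTML in your DOM. Note that the browser's same-origin policy blocks such cross-domain requests unless the remote site allows them (for example via CORS), so in practice you usually have to proxy the request through your own server.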

When someone requests a search:
perform that search against the various search engines from your server,
extract the content with XPath, regular expressions, etc.,
and then display the results on your own webpage.
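The last two steps above could look something like the following Python sketch. It assumes the results from each engine have already been fetched and extracted into (title, url) pairs. The fetching itself is left out, and the function names merge_results and render_results are invented for illustration.

```python
from html import escape
from itertools import zip_longest


def merge_results(results_by_engine):
    """Interleave per-engine result lists, dropping duplicate URLs.

    results_by_engine maps an engine name to a list of (title, url) pairs.
    """
    seen = set()
    merged = []
    # Take the first hit of every engine, then the second, and so on.
    for row in zip_longest(*results_by_engine.values()):
        for item in row:
            if item is None:          # this engine has run out of results
                continue
            title, url = item
            if url not in seen:
                seen.add(url)
                merged.append((title, url))
    return merged


def render_results(merged):
    """Build an HTML list for embedding in your own page."""
    items = ['<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title))
             for title, url in merged]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Interleaving is just one possible ranking choice; a real metasearch site would probably weight engines or score results.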

Related

Indexing an Angular site by search robots when $locationProvider.html5Mode(true) is used (without !#)

Friends,
I have a site written in Angular.
How do Google and other search robots index such sites?
What techniques can be used to let robots see the page content?
The advice I find online explains how to optimise an Angular site for indexing when $locationProvider.html5Mode(false) is set.
There is a method for indexing a site whose page URLs look like: http://site.com/!#audio/1234
But my site uses $locationProvider.html5Mode(true);
That means the page URLs have no !#:
http://site.com/audio/1234.
But all of the page content is loaded with JavaScript.
I need help. Thank you!
Your server will receive a request at http://site.com/audio/1234. Somehow you will need to be able to return some content.
For now the only strategy I know is to also render that page on the server side, which involves some (or a lot of) work.
It's also good for browsers that do not support pushState yet. That way these users can still access the data.
You can also take a look at this: http://docs.meteor.com/#spiderable. They render the page on the server side by executing the client-side JavaScript on the server.
Here's a similar question: pushState and SEO
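To illustrate the "return some content at that URL" idea, here is a minimal WSGI sketch in Python. It is not how Meteor's spiderable works. It simply renders a small HTML page on the server for /audio/<id>, so that a crawler (or a browser without pushState) gets real content before the Angular app takes over. The TRACKS data, the page template and /app.js are all made-up placeholders.

```python
import html
import re

# Hypothetical in-memory data; a real site would query its database or API.
TRACKS = {"1234": "Example track title"}

# Hypothetical page shell; /app.js stands in for the Angular bundle.
PAGE = """<!DOCTYPE html>
<html><head><title>{title}</title></head>
<body><div id="app"><h1>{title}</h1></div>
<script src="/app.js"></script></body></html>"""


def app(environ, start_response):
    """WSGI app that serves server-rendered HTML for /audio/<id> URLs."""
    match = re.fullmatch(r"/audio/(\d+)", environ.get("PATH_INFO", ""))
    title = TRACKS.get(match.group(1)) if match else None
    if title is None:
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]
    body = PAGE.format(title=html.escape(title)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]
```

For a quick local try, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() can serve it.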

Getting html content from one page and adding it to my website

I am an Expedia affiliate and I am using their API system. One of their requirements for launching the site is adding the terms and agreements to my page, and they give us this page: http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx. I do not want to send users to a different site, and I cannot copy and paste the information because it gets updated. I would also prefer not to use an iframe. Does anyone have any ideas on how to do this? Here is a webpage that shows these terms on its own site, under its own domain: http://www.helloweekends.com/terms.htm. Does anyone know how they did this? Any help would be greatly appreciated!
Since it originates from another domain, you can't simply fetch it with JavaScript, due to the same-origin policy (unless the remote server explicitly allows cross-origin requests). Also, relying on JavaScript for the update would be trouble for users who have JavaScript disabled, as they wouldn't see the terms. Since you don't want to use an iframe or copy the content, I guess your best shot would be to scrape their page with a server-side language of your choice and then display it on your page.
Scraping can be a bit tricky, though, if you rely on their markup. If they change their markup, there is a chance that your script will break and stop updating the terms.
There are various tutorials available on how to scrape sites. Here are a few PHP examples:
Web scrape with PHP
PHP Screen Scraping Tutorial
Note: Make sure that they allow you to scrape the page before implementing it, so that you don't violate their rules.
Do you know if their API serves something with JSON? A JSONP call can get the values to you, but it will make your page rely on javascript for the users to see the updated page.
Another option is to use PHP or any other server-side language to get the contents of the URL, process it, and return the block you require.
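As a rough idea of what "get the contents of the URL, process it and return the block you require" can look like, here is a Python sketch using only the standard library. It assumes the terms sit inside an element with a known id. That id, and the names fetch_html and extract_block_text, are hypothetical, and the extraction breaks if the remote markup changes, as noted above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockTextExtractor(HTMLParser):
    """Collects the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target; 0 means outside
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_block_text(html, target_id):
    """Return the whitespace-normalised text of the element with target_id."""
    parser = BlockTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def fetch_html(url):
    """Download a page and decode it (falls back to UTF-8 if no charset is sent)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

You would call extract_block_text(fetch_html(terms_url), "some-id") from your page template and probably cache the result, so that a remote outage doesn't take your terms page down with it.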
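I would suggest the load() function offered by jQuery. It makes a simple AJAX call to retrieve a file, and you can even use a selector to grab only part of the page. Keep in mind that it is subject to the same-origin policy, so it only works for files on your own domain or on servers that allow cross-origin requests. For example, load the contents of an HTML page into a div: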
$('#div_id').load('my_file.html');
Or just load a part of the page:
$('#div_id').load('my_file.html #main_text_id');

Prevent site configuration info from showing up on Google

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason, all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google, without having that Flash player config info show up?
And can I customize what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I want to show up, I can probably do most of the edits myself; I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!
If you don't want Google or other crawlers to access certain parts of your website, you should use a robots.txt file. In it you specify which parts are accessible and which aren't; well-behaved crawlers look for this file for instructions when they get to your website.
You can check some documentation on how to do it here and here
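To get a feel for how robots.txt rules are interpreted, you can check them locally with Python's standard urllib.robotparser before deploying. The rules and paths below are made-up examples (I don't know where the Flash player's config actually lives on this site), so treat this as a sketch only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; adjust the paths to your own site layout.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /player/config/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/player/config/settings.xml",
                "http://example.com/about/"):
        print(url, is_allowed("Googlebot", url))
```

Note that robots.txt only asks crawlers not to fetch those paths; it is not access control.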
To influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
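Some more information on controlling which parts of a page get indexed: apparently there are googleoff/googleon tags, though as far as I know these are only honoured by the Google Search Appliance, not by Google's public web search.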
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
Hope this helps.
If you want Google to index only part of your pages, you can't follow normal SEO routines. You should provide a mechanism to understand whether the current client (requester) is a robot or not. If yes, then don't render that part. This is the only way. Otherwise, a robot either gets the whole rendered content, or doesn't have access based on robots.txt file (Robot Exclusion Protocol).
Another way (which is not really smart, and can't be guaranteed to work) is to dynamically inject your content into the page via JavaScript, because AFAIK robots don't run JavaScript.
As search spiders generally won't render JavaScript-generated markup (JS runs client-side in the browser, and the spider does not execute it), a quick fix would be not to output any of the Flash markup initially in the HTML document and then use JS to add the Flash stuff on load.
Note: as far as I'm aware, Google is currently testing a JS reading spider so this may not work long term.
Google is returning this data because it simply can't find any content where it normally would. Search engines require content - they're not advanced enough to process your multimedia to determine what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (of which there are only iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines). Ensure it's not some dinky text like "download flash here", but a lengthy, descriptive piece of content about your site or the media they would normally see if they had Flash.
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and inline JS on your site (I'd recommend WP-minify, since it's so obvious that you love plugins).

Display a "No Javascript" div, but not to google / facebook share service

I would like to show a div near the top of a site to suggest to visitors that do not have javascript enabled that they should enable their javascript. I thought I had found a good method by using the noscript tag.
Unfortunately I found that this solution was less than ideal because of services like Google's indexer and Facebook's link sharing functionality. These services scrape the page and read the text in the noscript div as the summary of the page. This happens because these services are not utilising javascript (obviously).
So, my question to the masses is: what techniques do you prefer for keeping your "please enable your javascript" messages out of Google's results, etc.?
Ideally, I'm hoping to discover the best practice for solving this issue, but I am interested in hearing about any techniques you have used, successfully or unsuccessfully, in the past.
Thanks!
In a pure HTML scenario (as tagged), consider placing your message at the bottom of the page, and using CSS to position it visibly at the top. This should push your warning far enough down the page as to avoid it showing up in typical search results.
If your HTML is generated by server script, then you may be able to conditionally present the element based on the client UserAgent. A good search engine user agent list would be convenient in this case.
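A server-side check along those lines might look like this in Python. The token list is a small illustrative sample rather than a complete crawler list, and the function names and banner markup are made up. User-agent sniffing is easy to fool, so take it as a sketch only.

```python
# Substrings seen in some well-known crawler user agents. This is an
# illustrative sample, not an exhaustive or authoritative list.
CRAWLER_TOKENS = ("googlebot", "bingbot", "facebookexternalhit", "slurp",
                  "duckduckbot")

NO_JS_BANNER = ('<noscript><div class="no-js-warning">'
                'Please enable JavaScript to use this site.</div></noscript>')


def is_known_crawler(user_agent):
    """Return True if the User-Agent header contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def render_no_js_banner(user_agent):
    """Return the noscript banner for normal visitors, nothing for crawlers."""
    return "" if is_known_crawler(user_agent) else NO_JS_BANNER
```

The page template would call render_no_js_banner with the request's User-Agent header and emit whatever it returns near the top of the body.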

How to implement a search in a HTML site?

I am wondering whether I can add search to an HTML website. The pages are static. I just want users to be able to search the contents of my site and see the results within my site itself. Is there any way I can achieve this? I can use PHP on my server.
Google search can be implemented, but it takes you to Google's page to show the results.
You're better off not creating your own search engine - there's loads of good ones that can be integrated into your site, which will be better than you can write yourself.
Google is the most popular search engine, so you might as well use that. As an alternative to customising the HTML results page, you could use the Google AJAX Search API: it runs your search and inserts the results into a specific element on your page. (DON'T forget people with JavaScript turned off, however...)
I like easy and fast, so consider Google Custom Search
Possibilities are:
Database Extraction (http://www.ibm.com/developerworks/library/os-php-sphinxsearch/)
Search Indexer (http://framework.zend.com/manual/en/zend.search.lucene.html#zend.search.lucene.overview)
Custom Search Crawler
As for customising Google Custom Search: http://googlecustomsearch.blogspot.com/2009/10/structured-custom-search.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FSyga+%28Google+Custom+Search%29
I think you can customize Google's result page.
Well, if you have PHP available, I would definitely suggest using that. If I were you, I would go through some PHP tutorials and learn the basics.
W3Schools has some great tutorials.
Then, I would do some searches on building a text based database on your site, or use a clever solution like this one. You can build a small database with metadata and store it in a text file, and it should get you going. Good luck.
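To give an idea of what such a small home-grown index involves, here is a sketch in Python (the same idea carries over to PHP). It builds an inverted index from page text and answers queries where every word must match. The function names are made up, and extracting the text from your HTML files is assumed to have happened already.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of pages containing it.

    pages maps a URL path to that page's plain text.
    """
    index = defaultdict(set)
    for path, text in pages.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths of pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)
```

The index could be rebuilt whenever the static pages change and saved to a text file (for example as JSON, with the sets turned into lists) so the search script only has to load it.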