Prevent site configuration info from showing up on Google - html

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google, without having that Flash player config info show up?
And can I customize what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I want to show up, I can probably do most of the edits myself, I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!

If you don't want Google, or other crawlers, to access certain parts of your website, you should use a robots.txt file. Inside it you specify which parts are accessible and which aren't; when crawlers reach your website, they will always look for this file for instructions.
You can check some documentation on how to do it here and here
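For example, a minimal robots.txt (assuming, purely for illustration, that the player config and iframe content live under paths like /player/ and /embed/) could look like this:

User-agent: *
Disallow: /player/
Disallow: /embed/

The file goes at the root of the site (e.g. http://example.com/robots.txt), and the paths above are placeholders; they would need to match wherever that content actually sits in your WordPress install.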

In order to influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
Some more information on controlling which parts of a page get indexed: apparently there are googleon/googleoff tags.
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
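For reference, the article describes comment tags along these lines; as far as I know they are only officially honoured by the Google Search Appliance rather than the public crawler, so treat this as a sketch, not a guarantee:

<!--googleoff: index-->
<div id="player-config">
  ... the Flash player configuration you don't want indexed ...
</div>
<!--googleon: index-->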
Hope this helps.

If you want Google to index only part of your pages, you can't follow normal SEO routines. You would need a mechanism to detect whether the current client (requester) is a robot or not; if it is, don't render that part. This is the only way. Otherwise, a robot either gets the whole rendered content, or has no access at all based on the robots.txt file (Robots Exclusion Protocol).
Another way (which is not really smart, and can't be guaranteed to work) is to dynamically inject your content into the page via JavaScript, because AFAIK robots don't run JavaScript.
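A minimal sketch of that second approach, assuming a hypothetical placeholder element and a made-up player URL:

<div id="player"></div>
<script>
// Inject the embed markup at runtime only, so it is not present
// in the initial HTML that most crawlers read.
document.addEventListener('DOMContentLoaded', function () {
  var holder = document.getElementById('player');
  var iframe = document.createElement('iframe');
  iframe.src = '/player-embed.html'; // hypothetical URL serving the Flash embed
  iframe.width = 640;
  iframe.height = 360;
  holder.appendChild(iframe);
});
</script>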

As search spiders won't render JavaScript-generated markup (JS is not run, since it is client-side in the browser), a quick fix would be not to output any of the Flash markup in the initial HTML document, and then use JS to add the Flash stuff on load.
Note: as far as I'm aware, Google is currently testing a JS reading spider so this may not work long term.

Google is returning this data because it simply can't find any content where it normally would. Search engines require content - they're not advanced enough to process your multimedia to determine what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (which, in this case, consists only of iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines) - make sure it's not some dinky text like "download flash here", but a lengthy, descriptive piece of content about your site or the media they would normally experience if they could (see the sketch at the end of this answer).
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and inline JS on your site (I'd recommend WP-Minify, since it's so obvious that you love plugins).
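A rough sketch of the SWFObject idea, with made-up file names and dimensions; the text inside the target div is the alternate content that search engines and Flash-less visitors get:

<div id="player">
  <!-- Alternate content: seen by search engines and users without Flash.
       Put a real, descriptive summary of the media here, not a one-liner. -->
  <p>A full description of the show, the tracks, the schedule, and so on.</p>
</div>
<script src="swfobject.js"></script>
<script>
// swfobject.embedSWF(swfUrl, targetElementId, width, height, minFlashVersion)
swfobject.embedSWF('player.swf', 'player', '640', '360', '9.0.0');
</script>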

Related

How to enable other websites to embed my website's content using only a content link?

I have spent the past week on the Internet trying to find at least one hint about this. There are no tutorials or even SO questions available. What I am trying to figure out is this: when a website uses something like oEmbed to embed content from other websites, it fetches the embed code from the content's link. For example, when you post a YouTube link on Facebook or other social networks, they automatically fetch its embed code. I know how to fetch embed code, but what I don't know is how to provide embed code that can be fetched by other websites from a link to my website's content.
I want my article to be embedded in some special way, not with the default layout of the embedding website. So is there any META tag or something in HTML where I can put embed code for other websites?
I don't think what you want is possible. You can use special meta tags that specific sites (e.g. Facebook, Twitter, LinkedIn) will interpret, and that will help you customize the share a little (still using the "host site" style). But as far as I know, there's nothing you can do to provide style/code of your own.
And it makes sense from a security point of view: embedding external code from an unknown source is potentially dangerous and no site would/should allow you to do it. Even if they do allow it, they should pre-process the code and sanitize it (adapting your style/code to their style/code) to prevent possible threats.
As suggested by Alvaro Montoro, I searched on the Internet about how to become an oembed provider. Following are the links I found:
https://timnash.co.uk/becoming-oembed-provider/
http://freear.org.uk/content/5-steps-being-oembed-provider
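For context, the pattern those guides describe is that each content page advertises an oEmbed endpoint via a discovery link, and that endpoint returns JSON describing how the URL should be embedded. A minimal sketch, with an illustrative endpoint and fields taken from the oEmbed spec:

<link rel="alternate" type="application/json+oembed"
      href="https://example.com/oembed?url=https%3A%2F%2Fexample.com%2Farticles%2F42&format=json"
      title="Example article oEmbed" />

{
  "version": "1.0",
  "type": "rich",
  "provider_name": "Example",
  "html": "<iframe src=\"https://example.com/embed/articles/42\" width=\"600\" height=\"400\"></iframe>",
  "width": 600,
  "height": 400
}

Whether a consuming site actually uses the html field, and how much of your markup or styling it allows, is still up to that site, which is the limitation described in the answer above.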
You may want to use the CSS !important directive.
http://css-tricks.com/when-using-important-is-the-right-choice/

How to force Chrome to prerender more pages?

I'm learning about Chrome and Native Client.
Basically I want to increase the number of pages that are prerendered by Chrome (right now it's just one page).
I was thinking about creating an extension that would allow prerendering more pages.
Is this a way to go, or am I left with hardcoding it into Chrome and building from scratch?
EDIT
I started a bounty for this question. I would really appreciate some input.
No, there is no other way; you would need to hardcode it in Chrome and rebuild, as you noted.
As you probably already know, Chrome explicitly states that they currently limit the number of pages that can be prerendered:
Chrome will only prerender at max one URL per instance of Chrome (not one per tab). This limit may change in the future in some cases.
There is nothing in their supported APIs or their experimental APIs that will give you access to modify this. There is no toggle in chrome://settings/ or chrome://flags/ or anywhere else in Chrome currently that allows you to change this. You can, however, use the Page Visibility API to determine if a page is being prerendered.
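A small sketch of that check (older Chrome builds exposed the state behind a webkit prefix, so the prefixed property is included as a fallback):

<script>
function isPrerendered() {
  // 'prerender' is reported while Chrome renders the page in the background
  var state = document.visibilityState || document.webkitVisibilityState;
  return state === 'prerender';
}

document.addEventListener('visibilitychange', function () {
  if (!isPrerendered()) {
    // The prerendered page has now actually been shown to the user
    console.log('Page is now visible');
  }
});
</script>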
In addition to the one resource on the page that you specify using the <link rel="prerender" href="http://example.org/index.html"> you could also consider prefetching resources.
The issue with prefetching is that it will only load the top-level resource at the specified URL. So, if you tried to prefetch the other pages, like so:
<link rel="prefetch" href="http://example.org/index.html">
...then only the index.html resource would be prefetched, not all of the CSS and JavaScript links contained in the document. One approach could be to write out link tags for all of the contained resources in the page, but that could get messy and difficult to maintain.
Another approach would be to wait for the current page to finish loading, and then use JavaScript to create an iframe that is hidden off the page, targeted at the URLs you want to prefetch all the assets for. The browser would then load all of the content of those URLs, and it would be in the user's cache when they go to visit those pages.
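A sketch of that approach (the URL list is obviously just an example):

<script>
window.addEventListener('load', function () {
  // Pages whose assets we want warmed up in the browser cache
  var urls = ['http://example.org/page2.html', 'http://example.org/page3.html'];

  urls.forEach(function (url) {
    var iframe = document.createElement('iframe');
    iframe.src = url;
    // Keep it out of sight and out of the layout
    iframe.style.cssText = 'position:absolute; left:-9999px; width:1px; height:1px;';
    document.body.appendChild(iframe);
  });
});
</script>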
There is also a Chrome extension that combines these two approaches by searching for link tags that define a prefetch and then creates the hidden iframe causing the assets to be downloaded and cached.
If the goal is to optimize client performance for navigating your site, there might be other alternatives, like creating a web application that uses a Single Page Application (SPA) style of development to reduce the number of times JS and CSS are loaded and executed. If you are a fan of Google, then you might check out their framework for building SPAs, called AngularJS.
If it was a good idea to pre-render more pages, Chrome would probably already do it. Pre-rendering too many pages will drain website bandwidth and possibly end up slowing down the whole web, which is the opposite of what you're trying to achieve. So it's most likely intentional that you can only pre-render a single page and you shouldn't try to break that.
Not possible. Chrome manages pre-rendering based on many factors. If this was possible, it could also be easily abused by many sites. You could, depending on your page, keep all content on one page.

Is there an effective way to serve specific thumbnail images to fetcher bots?

My site uses some aggressive caching techniques to keep requests to a minimum, among them being:
.htaccess redirects to cached HTML files;
Automatic merging of content images into CSS sprites.
This works great for human traffic, but when an article is posted on Facebook, Pinterest, Google+, Reddit, etc., the bot fails to find a suitable thumbnail since the page images are all big sprite JPEGs.
One solution would be .htaccess rules that bypass the cache when a bot is making the request. Preferably without having to specifically name every possible bot user-agent. I am unsure how to accomplish that.
Another solution would be to embed one good thumbnail image on every page that a bot would download but a real web browser would not. Any ideas how to accomplish that?
Other suggestions are welcome. If all else fails I'll rework my script to exclude the first image of every post from the autosprites, but that will effectively double the number of image requests my poor overworked server must accommodate.
Showing different things to bots than to humans is a very bad approach regardless of the problem you're trying to solve. Google will sometimes even punish sites that do this with a low search ranking. A better way to do this would be to go to each bot's website and see if there is a way to tell that bot to display an image that is relevant to that page.
For example, Facebook accepts the following meta tag in the head of your html to tell it an image that is relevant to your page:
<meta property="og:image" content="[url to the image]">
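A slightly fuller sketch, with placeholder URLs; most of the aggregators mentioned above read the Open Graph tags, and Twitter has its own equivalent card tags:

<meta property="og:title" content="Article title" />
<meta property="og:image" content="http://example.com/thumbs/article-42.jpg" />
<meta property="og:url" content="http://example.com/articles/42" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:image" content="http://example.com/thumbs/article-42.jpg" />

The thumbnail referenced here can live outside your sprite pipeline, so regular visitors never request it; only the bots that read these tags will fetch it.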

Getting html content from one page and adding it to my website

I have affiliated with Expedia and I am using their API system. One of their requirements for launching the site is adding the terms and agreements to my page, and they give us this page: http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx. I do not want to send visitors to a different site, and I cannot just copy and paste the information, because it gets updated. I also prefer not to use an iframe. Does anyone have any ideas on how to do this? Here is a webpage using this on their site, with their own domain: http://www.helloweekends.com/terms.htm. Does anyone know how they did this? Any help would be greatly appreciated!
Since it originates from another domain, it wouldn't be possible to use JavaScript, due to the same-origin policy. Also, relying on JavaScript for the update would be trouble for users who have JavaScript disabled, as they wouldn't see the terms. Since you don't want to use an iframe or copy the content, I guess your best shot would be to scrape their page with a server-side language of your choice, and then display it on your page.
Scraping can be a bit tricky though, if you rely on their markup. If they change their markup, there is a chance that your script will break, thus stop updating the terms.
There are various tutorials available on how to scrape sites. Here are a few PHP examples:
Web scrape with PHP
PHP Screen Scraping Tutorial
Note: Make sure that they allow you to scrape the page before you implement this, so that you don't violate their rules.
Do you know if their API serves something with JSON? A JSONP call can get the values to you, but it will make your page rely on JavaScript for users to see the updated page.
Another option is to use PHP or any other server-side language to get the contents of the URL, process it and return the block you require.
I would suggest the load() function offered by jQuery. It makes a simple AJAX call to retrieve a file, and you could even use a selector to only grab part of the page. For example, load the contents of an HTML page into a div:
$('#div_id').load('my_file.html');
Or just load a part of the page:
$('#div_id').load('my_file.html #main_text_id');

Cleaning up HTML from textarea

I have a page with two textareas, where registered users can fill them with HTML code. The first one has TinyMCE (so the HTML is cleaned up), but the other one does not, since I expect the code to be pasted in as embed codes from other sites (mostly sites that provide maps, e.g. Google Maps, MapMyRace.com, etc). The problem is that those other sites may provide different tags, not just <embed> or <iframe>. So I can't strip tags, because then I might strip tags that I didn't know other sites provided. I will save the HTML from these two textareas in my database, to be retrieved and displayed as parts of some other pages.
Do you have any suggestions to make this setup more secure? Or should I disallow free input of HTML in the second textarea altogether? (Or... I could let the users tick a checkbox saying "I accept full responsibility for the behavior of the code I am inserting"... LOL)
Your opinion is highly appreciated :)
Thanks
The short answer is: free HTML is insecure and must be avoided. Nothing stops your user from creating an iframe that redirects visitors to some harmful page, puts ads on your page, or defaces your site.
My favorite approach to this problem is to allow the user to paste a link (not the "embed on page" iframe code) into a text box. Then I use a regex to identify the pasted link (is it YouTube, Bing Maps, ...) and I create the HTML from the pasted link, which isn't too complex for most iframe providers. It's much more work for you, and it restricts the APIs you can put on your page, but it's secure.
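A minimal sketch of that idea for a single provider (YouTube); the regex and dimensions are simplified, and you would add one branch per provider you decide to support:

<script>
function buildEmbed(link) {
  // Very simplified: match youtube.com/watch?v=VIDEOID or youtu.be/VIDEOID
  var m = link.match(/(?:youtube\.com\/watch\?v=|youtu\.be\/)([\w-]{11})/);
  if (m) {
    return '<iframe width="560" height="315" frameborder="0" ' +
           'src="https://www.youtube.com/embed/' + m[1] + '"></iframe>';
  }
  return null; // unknown provider: reject instead of trusting the input
}
</script>

The same check belongs on the server before anything is saved, since validation done only in the browser can be bypassed.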
Letting your users use arbitrary HTML is dangerous. You may want to have a blacklist and a whitelist of tags that you disallow and allow, respectively.