Will this method of placing a transparency over an image prevent Google from copying it?
Or will it just find the image from the unencrypted code above (image.png)?
Am I wasting my time using this method of masking the image to fool Google?
For quick removal
Use the Remove URLs tool. You should see results fairly quickly.
For non-emergency image removal
To prevent images from your site appearing in Google's search results, add a robots.txt file to the root of the server that blocks the image. It takes longer to remove an image from search results than the Remove URLs tool, but it is an Internet standard that applies to all search engines, and it gives you more flexible control through the use of wildcards or subpath blocking.
For example, if you want Google to exclude the dogs.jpg image that appears on your site at www.yoursite.com/images/dogs.jpg, add the following to your robots.txt file:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
The next time Google crawls your site, it will see this directive and drop your image from its search results.
To remove all the images on your site from Google's index, place the following robots.txt file in your server root:
User-agent: Googlebot-Image
Disallow: /
Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to remove .gif images while keeping .jpg images), you'd use the following robots.txt entry:
User-agent: Googlebot-Image
Disallow: /*.gif$
By specifying Googlebot-Image as the user agent, the images will be excluded from Google Image Search. This will also prevent cropping of the image for display within Mobile Image Search, since the image will be completely removed from Google's image index. If you would like to exclude the images from all Google searches (including Google web search and Google Images), specify Googlebot as the user agent.
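For example, a minimal robots.txt sketch that removes an entire images folder from all Google searches, not just Google Images (the /images/ path is a placeholder):
User-agent: Googlebot
Disallow: /images/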
I am working on Ruby on Rails web apps. My webpage URL looks like the one below:
http://dev.ibiza.jp:3000/facebook/report?advertiser_id=2102#/dashboard
I understand that advertiser_id is 2102, but I can't figure out what #/dashboard is pointing to.
The portion of the URL which follows the # symbol is not normally sent to the server in the request for the page. If you open your web inspector and watch the request for the page, you will see that the #/dashboard portion is not included in the request at all.
On a normal (basic HTML) web page, the # symbol can be used to link to a section within the page, so that the browser jumps down to that section after the page loads.
In fancy JavaScript-heavy web applications, the # symbol is commonly followed by more URL path segments, for example www.example.com/some-path#/other-path/etc. The /other-path/etc portion of the URL is never seen by the server, but it is available for JavaScript to read in the browser, which can then presumably display something different based on that path.
So in your case, the first part of the URL is a request to the server:
http://dev.ibiza.jp:3000/facebook/report?advertiser_id=2102
and the second part of the URL could be for Javascript to display a specific view of the page once it has loaded:
#/dashboard
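A minimal sketch of how the page's JavaScript might read that fragment once the page has loaded (the route check below is hypothetical, based on your URL):
<script>
// location.hash returns everything from the "#" onward, e.g. "#/dashboard"
var route = window.location.hash.slice(1); // "/dashboard"
if (route === "/dashboard") {
  // render the dashboard view; the server never saw this part of the URL
}
// react to later fragment changes without reloading the page
window.addEventListener("hashchange", function () {
  console.log("new route:", window.location.hash.slice(1));
});
</script>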
The # symbol creates a fragment identifier, which is typically used to link to a specific piece of content within a web page (for example, to make the browser jump down to a particular section of the page).
As others have mentioned, this has SEO implications. In order to index pages such as this, you may have to employ different techniques to allow the content that is "behind the # symbol" to be accessible to search engines.
The # symbol is called an anchor; it jumps to a specific position on the HTML page.
It's also used as a crawling technique; you can read more in Google's documentation on making AJAX applications crawlable.
Here's another example: a request to GitHub for the source code of a Java class:
https://github.com/spring-cloud/spring-cloud-consul/blob/master/spring-cloud-consul-discovery/src/main/java/org/springframework/cloud/consul/serviceregistry/ConsulServiceRegistry.java
By appending "#L90" to this URL, the web browser makes the same request, then scrolls to line 90 and highlights that line of code:
https://github.com/spring-cloud/spring-cloud-consul/blob/master/spring-cloud-consul-discovery/src/main/java/org/springframework/cloud/consul/serviceregistry/ConsulServiceRegistry.java#L90
Your web browser made the same request to the GitHub server, but in the anchored case it performed the additional action of highlighting the selected line after the response was received.
After the # is the hash of the location; the ! that follows it is used by search engines to help index AJAX content. Anything can come after that, but it is usually rendered to look like a path (hence the /).
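For context, this refers to Google's AJAX crawling scheme (since deprecated): a #! URL was mapped to an _escaped_fragment_ query parameter that the crawler could actually fetch. A sketch with placeholder URLs:
Browser URL:  www.example.com/page#!key=value
Crawled as:   www.example.com/page?_escaped_fragment_=key=value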
I have a website with some large images. They are resized by default, but when you click on them, they open in a lightbox and become larger. I'd like to let the search engines know about the original (bigger) images, instead of the smaller resized images included in the source code. Is there a way to let them index the bigger images?
You can't decide what Google chooses to index (though you can make it easier for Google, as in thepiyush13's answer), but you can tell it what NOT to index.
Put this in your robots.txt file:
User-agent: Googlebot-Image
Disallow: /images/myImage.jpg  # list each image, or disallow the containing folder instead
Source : https://support.google.com/webmasters/answer/35308?hl=en
(to be adapted for other search engines)
An image sitemap is the way to go for Google.
You can use Google's image extensions for sitemaps to give Google more information about the images available on your pages. Image sitemap information helps Google discover images that it might not otherwise find (such as images your site reaches with JavaScript code), and it allows you to indicate which images on your site you want Google to crawl and index.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>http://example.com/sample.html</loc>
    <image:image>
      <image:loc>http://example.com/image.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>http://example.com/photo.jpg</image:loc>
    </image:image>
  </url>
</urlset>
source : https://support.google.com/webmasters/answer/178636?hl=en
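Once the sitemap is in place, you can point crawlers at it from your robots.txt (the filename and location below are assumptions; use whatever you named your sitemap), or submit it through Google Webmaster Tools:
Sitemap: http://example.com/sitemap.xml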
I've built an admin control panel for my website. I don't want the control panel app to end up in a search engine, since there's really no need for it. I did some research and I've found that by using the following tag, I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up if I use the wrong method, and I do want search engines to search my site, just not the control panel...
Thanks
This is true, but on top of doing that, for even more safety, you can also send the directive as an HTTP header. In the .htaccess file of the control panel's directory (so the rest of your site is unaffected), set this:
Header set X-Robots-Tag "noindex, nofollow"
And you can create a file named robots.txt in the root of your domain that blocks crawlers from the control panel's path (shown here as a hypothetical /admin/; note that Disallow: / would block your entire site):
User-agent: *
Disallow: /admin/
That should keep crawlers away from your control panel ;) (One caveat: if robots.txt blocks crawling, Google may never see the noindex tag, so the blocked URL can still appear, without content, if something external links to it; see the next answer.)
Google will honor the meta tag by completely dropping the page from their index (source); however, other crawlers might simply decide to ignore it.
In that particular sense, meta tags are more reliable with Google: if you rely on robots.txt alone, any external source that explicitly links to your admin page (for whatever reason) can still make the URL appear in Google's index (though without any content, which will probably result in some SERP leeching).
Can anyone help me add a disallow rule to my robots.txt file that will stop crawlers from indexing any link containing %2C, which is the URL encoding for a comma (,)?
I think what I'm looking for is the wildcard character, if one exists in robots.txt.
So far I have this:
Disallow: %2C
But cannot seem to see it working.
Any suggestions?
Cheers
The best way to test robots.txt against the search engines is to use the tools they provide. Google Webmaster Tools has a robots.txt tester under "Health > Blocked URLs". If you use
User-agent: *
Disallow: *,*
this will block any requests for http://example.com/url%2Cpath/. I tried Disallow: *%2C* but apparently that doesn't block Googlebot from crawling the URL-escaped path. My guess is that Googlebot decodes it in the queuing process.
As for Bing, they apparently removed their robots.txt validation tool. So really the only sure way of testing it is to deploy a robots.txt on a test site, and then use Bing Webmaster Tools to fetch a page with the ",". It'll tell you at that point whether it's blocked by robots.txt or not.
Remember when using robots.txt that it doesn't prevent the search engines from displaying the URL in the search results. It just prevents them from crawling the URL. If you simply don't want those types of URLs in the search results, but don't mind them crawling the page (meaning you must not block those URLs with robots.txt, or the crawler will never see the directive), you can add a meta tag or an x-robots-tag in the HTTP headers with a value of NOINDEX to prevent it from being added to the search results.
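If you go the x-robots-tag route for the comma URLs, here is a minimal Apache sketch (assuming mod_rewrite and mod_headers are enabled; the COMMA_URL variable name is arbitrary). It sends a noindex header for any request whose decoded path contains a comma:
RewriteEngine On
# flag requests whose decoded path contains a comma
RewriteCond %{REQUEST_URI} ,
RewriteRule ^ - [E=COMMA_URL:1]
# send the noindex directive only for flagged requests
Header set X-Robots-Tag "noindex" env=COMMA_URL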
Regarding one of the other comments about using the "nofollow" standard: nofollow doesn't actually prevent the search engines from crawling those URLs. It's better understood as a way of disavowing any endorsement of the link's destination. Google and Bing have suggested using nofollow to indicate sponsored links or untrusted UGC links.
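For reference, a minimal sketch of those link annotations in HTML (the URLs are placeholders); all three tell the search engines not to pass endorsement, but none of them prevents crawling on its own:
<a href="http://example.com/page" rel="nofollow">untrusted link</a>
<a href="http://example.com/offer" rel="sponsored">paid or affiliate link</a>
<a href="http://example.com/comment" rel="ugc">user-generated link</a>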
I have a webpage that cannot be accessed through my website.
Say my website is www.google.com, and the webpage that I cannot reach through the site is something like www.google.com/iamaskingthis/asdasd. This webpage appears in the Google results when I type its content, but there is nothing on my website that sends me to that page.
I've already tried analyzing the page source to find its parent location, but I can't seem to find it. I want to delete that page, but since I cannot find it, I can't destroy it either.
Thank you
You can use a robots.txt file to prevent search engine bots from visiting a page, and thus not showing search results for it.
For example, you can create a robots.txt file in the root of your website and add the following content to it:
User-agent: *
Disallow: /mysecretpage.html
More details at: http://www.robotstxt.org/robotstxt.html
There is no such concept as a 'parent page'. If you mean the link through which Google found the page, please keep in mind that it need not be under your control: if I put a link to www.google.com/iamaskingthis/asdasd on a page on my website and the Googlebot crawls it, Google will know about it.
To make it short: there is no reliable way of hiding a page on a website. Use authentication if you want to restrict access.
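For example, a minimal sketch of HTTP Basic authentication in an Apache .htaccess file (the AuthUserFile path is a placeholder; the .htpasswd file is created with the htpasswd utility):
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user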
Google will crawl the page even if the button is gone, as it already has the page stored in its records. The only way to stop Google from crawling it is either robots.txt or simply deleting it off the server (via FTP or your hosting's control panel).