Disallow subdomain URLs using robots.txt

I would like to ask a question.
I have a domain, kiosban.com, and a subdomain, store.kiosban.com,
and I want to disallow
store.kiosban.com/template/*
I have this rule in store.kiosban.com/robots.txt,
but when I look at Google Webmaster Tools, under the Health menu >> Blocked URLs, I get:
robots.txt file: http://www.store.kiosban.com/robots.txt | Blocked URLs: - | Downloaded: Never
Did I do something wrong?

www.store.kiosban.com and store.kiosban.com are different hosts. You should provide a robots.txt on both hosts, or better still, 301-redirect one to the other.
But that doesn't seem to be the issue in your case. It looks like Google just needs some time to crawl your site and its robots.txt.
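If you do go the redirect route, a minimal sketch in Apache (assuming mod_rewrite is available and that store.kiosban.com, without the www, is the host you want to keep) could look like this:
# Send www.store.kiosban.com requests to store.kiosban.com with a permanent redirect
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.store\.kiosban\.com$ [NC]
RewriteRule ^(.*)$ http://store.kiosban.com/$1 [R=301,L]
The robots.txt served at store.kiosban.com/robots.txt would then typically contain something like this (an assumption, since the actual file wasn't posted):
# Block crawling of everything under /template/ on this host
User-agent: *
Disallow: /template/
Standard robots.txt matching is by prefix, so the trailing * in /template/* isn't required, although Google does understand it.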

Related

Email thumbnail URL changed to googleusercontent.com in Gmail

I have a system where, whenever a user uploads an image, an email is sent to the registered user's Gmail address. But in the email the thumbnail is not viewable.
I inspected the element and found the src linked to this URL:
https://ci5.googleusercontent.com/proxy/VI2cPXWhfKZEIarh-iyKNz1j9q7Ymh8ty4Yz19lXh82RjSlACBzS0aRajfIj913uXAsX2ylcLEDs5FBsj4cR9TcU75Pw5djdHx4htxdCAQxs_ue1Q1wi5TV43uLLBpigpjH1xN747mUHSRdTBJmXQWFyykInJCRXicM1KhNk=s0-d-e1-ft#https://www.somedomain.com/files/1658/thumbnail_71JtDozxS1L._SY450_.jpg
Obviously it is being cached by the Google proxy.
But I can view the image without googleusercontent by accessing https://www.somedomain.com/files/1658/thumbnail_71JtDozxS1L._SY450_.jpg directly (I masked the domain, so the image might not be available to you).
I tried clearing the browser cache, but the problem still persists. How can I bypass the googleusercontent proxy, or at least make the thumbnail display?
I checked this link, Images not displayed for Gmail, but I'm not using localhost and the image itself is accessible outside of my local network.
How does the Google Image Proxy work?
The Google Image Proxy is a caching proxy server. Every time an image link is included in an email, the request goes to the Google Image Proxy first to check whether the image has already been cached; if so, the image is served from the proxy, otherwise the proxy fetches it and caches it afterwards.
The solution for most issues
The Google Image Proxy server will fetch your images if the images:
have an extension like .png, .jpg/.jpeg or .gif only. Possibly .webp too, but not .svg.
do not use any kind of query string in the image URL, like ?id=123.
have a URL that maps directly onto the image.
do not have an overly long name.
Requirements for the image server (see the configuration sketch after this list):
The response from the image server/proxy server must include the correct header, e.g. Content-Type: image/jpeg.
The file extension and the Content-Type header must match.
The status code in the server response must be 200, not 403, 500, etc.
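As a sketch of the server-side part, on Apache (assuming mod_mime is enabled; the extensions listed are just the common ones from above) you can make sure each image extension is served with a matching Content-Type like this:
# Serve common image extensions with matching Content-Type headers
AddType image/jpeg .jpg .jpeg
AddType image/png .png
AddType image/gif .gif
A HEAD request against the image URL should then return status 200 together with a Content-Type header that matches the file extension.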
What else could help?
Google support answer:
Set up an image URL proxy whitelist
When your users open email messages, Gmail uses Google’s secure proxy
servers to serve images that might be included in these messages. This
protects your users and domain against image-based security
vulnerabilities.
Because of the image proxy, links to images that are dependent on
internal IPs and sometimes cookies are broken. The Image URL proxy
whitelist setting lets you avoid broken links to images by creating
and maintaining a whitelist of internal URLs that'll bypass proxy
protection.
When you configure the Image URL proxy whitelist, you can specify a
set of domains and a path prefix that can be used to specify large
groups of URLs. See the guidelines below for examples.
Configure the Image URL proxy whitelist setting:
Sign in to your Google Admin console. Sign in using your administrator account (it does not end in @gmail.com).
From the Admin console Home page, go to Apps > G Suite > Gmail > Advanced settings. Tip: To see Advanced settings,
scroll to the bottom of the Gmail page.
On the left, select your top-level organization.
Scroll to the Image URL proxy whitelist section.
Enter image URL proxy whitelist patterns. Matching URLs will bypass image proxy protection. See the guidelines below for more details and
instructions.
At the bottom, click Save.
It can take up to an hour for changes to propagate to user accounts.
You can track prior changes under Admin console audit log.
Guidelines for applying the Image URL proxy whitelist setting
Security considerations
Consult with your security team before configuring the Image URL proxy
whitelist setting. The decision to bypass image proxy whitelist
protection can expose your users and domain to security risks if not
used with care.
In general, if you have a domain that needs authentication via cookie,
and if that domain is controlled by an administrator within your
organization and is completely trusted, then whitelisting that URL
should not expose your domain to image-based attacks.
Important: Disabling the image proxy is not recommended. This option is available to provide flexibility for administrators, but
disabling the image proxy can leave your users vulnerable to malicious
attacks.
Entering Image URL patterns
To maintain a whitelist of internal URLs that'll bypass proxy
protection, enter the image URL patterns in the Image URL proxy
whitelist setting. Matching URLs will bypass the image proxy.
A pattern can contain the scheme, the domain, and a path. The pattern
must always have a forward slash (/) present between the domain and
path. If the URL pattern specifies a scheme, then the scheme and the
domain must fully match. Otherwise, the domain can partially match the
URL suffix. For example, the pattern google.com matches
www.google.com, but not gle.com. The URL pattern can specify a
path that's matched against the path prefix.
Important: Enter your actual domain name as you enter the image URL pattern. Always include a trailing forward slash (/) after the
domain name.
Examples of Image URL patterns
The following patterns are examples only. These patterns:
http://rule_fixed_scheme_domain.com/
rule_flex_scheme_domain.com/
rule_fixed_subpath.com/cgi-bin/
... will match the following URLs:
http://rule_fixed_scheme_domain.com/
http://rule_fixed_scheme_domain.com/test.jpg?foo=bar#frag
http://rule_fixed_scheme_domain.com
rule_flex_scheme_domain.com/
t.rule_flex_scheme_domain.com/test.jpg
http://t.rule_flex_scheme_domain.com/test.jpg
https://t.rule_flex_scheme_domain.com/test.jpg
http://rule_fixed_subpath.com/cgi-bin/
http://rule_fixed_subpath.com/cgi-bin/people
Note: The URL scheme (http://) is optional. If the scheme is omitted, the pattern can match any scheme, and allows partial matches
on the domain suffix.
Previewing the image URL patterns
Click Preview to see if the URLs match the image URL patterns
you've set. If the image URL matches a pattern, you'll see a
confirmation message. If the image URL does not match, an error
message appears.
Bharata has a great and detailed answer on this, but I just wanted to add one thing I identified while dealing with a similar issue.
We had an x-webkit-csp content security header that turned out to be the culprit.
After removing it, everything worked through the image proxy.
Google's response was that x-webkit-csp is deprecated and that the Content-Security-Policy header should be used instead.
However, it seems like a bug that an unsupported header causes a fatal error rather than simply being ignored.
TL;DR: Make sure your server isn't blocking external connections (through AWS or .htaccess or some other firewall)!
I was having this problem too. I ran through every solution I could think of and every one I found online. Nothing fixed it.
Finally, I inspected the image in Gmail so that I could get the full CDN address Google creates for it. I tried to open that in a new tab and it failed. So I realized that Google wasn't able to grab the image.
In the end, I'd forgotten that I had the server locked down against all traffic except my own (just a basic .htaccess IP deny). It's a simple security layer I use while in development. Keep in mind that you might have it locked down within AWS or something more rugged like that.
I opened up all IPs for a minute, tested it, and sure enough it worked as expected. The old emails that were previously failing also worked, so it seems that Google tries to work its magic any time the email is opened and it doesn't have the image saved. Once I closed off the IP addresses again, the image continued to work regardless. I'm guessing that once they write it to their CDN it remains there indefinitely.
So if you're certain that you've done everything else correctly, I would suggest making sure that the server is indeed open to the outside world so Google can process the image.
I had the same problem and I solved it by specifying the "https://" protocol in the "src" URL of the img element; otherwise "http" is prepended by default.

Can't get rid of "Site Contains Malware" warnings

We've been having this problem for the last month, and it is getting really old. :( I'd love any help or advice.
We have a Facebook app that provides users a simple way to make tabs for pages. Some malicious actors were using our app to host redirects - which we have now blocked. As far as we can tell, there is not any more redirect abuse. Okay, here's where it gets weird.
We've got 12 "apps", each of which has identical functionality but different paths on our domain. For example:
http://raw2.statichtmlapp.com/tab/1/...
http://raw2.statichtmlapp.com/tab/2/...
http://raw2.statichtmlapp.com/tab/3/...
All URLs beginning with the path /tab/2 are getting the warning, and all the other URLs are fine. Gah.
We have read the documentation thoroughly about how to rectify this sort of thing, to no avail. https://developers.google.com/webmasters/hacked/docs/request_review suggests that we should use the webmaster console to request review for Malware or Spam, but our console says there is no problem with the domain.
We have submitted requests for review of phishing multiple times at http://www.google.com/safebrowsing/report_error/, but nothing happens.
I suspect that part of the issue may be an anti-abuse measure we have in place. The content in our app is only available when embedded inside Facebook, with a signed request that comes along with the iframe URL from Facebook. So if a Google system attempts to directly crawl URLs that have been flagged, it will get either empty pages or errors. But we don't want to make the content available on the open internet for fear of phishing abuse (which is why we lock it down now), and we don't want to try to detect Google and just serve them the content, because that feels like something they would likely detect as suspicious and cause further flagging.
Any advice on what to do? It is incredibly frustrating to have a bunch of walls come slamming down like this, with very little we can do. Thank you so much for any help!
Aha! After a helpful reply on the Google Webmaster Central help forum, we got it cleared up. We needed to add the blocked url as a new site in the search console, not just the root domain. That made a security issue show up, and that enabled us to request a review, and finally get this cleared up.

should rel-canonical also include protocol (http/https)?

I'm migrating my website from http to https (although it will still support access via http)
Currently all of my pages have accurate rel-canonical meta tags set in the HTML, but obviously they all point to the canonical http:// URL.
Should I now be updating those to https:// too, or is it OK to leave them as http?
I'm wondering whether Google will penalise me, or start detecting duplicate content, if I start mixing them.
Yes, Google sees http and https as different sites, so you should update them.
A redirect on the server might be sufficient in the short term, but personally I would look to update the pages as soon as you can.
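For the server-side redirect, a minimal Apache sketch (assuming mod_rewrite and that the same virtual host answers both protocols) might be:
# Permanently redirect all plain-HTTP requests to their HTTPS equivalent
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
Once the redirect is in place, the rel-canonical tags should point at the https:// URLs so that both signals agree.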

Finding the list of all subdirectories in a link

Is there any way I can find all the subdirectories under a URL? Do I need permission for that? For example, in a lecture the instructor opened the solutions by entering some keyword after the domain, as in www.site.com/keyword. Now I cannot remember the word, and whatever I try, I cannot find it, but I know there is a file there. That's why I want to see the files and other pages under that URL.
The only way to find out what resources are available on an HTTP server is to request a resource that tells you. There isn't anything particularly standard about web servers that will provide that, so you'll need to do something specific to the webserver you want the details from.
Note that not all servers will provide something like this.
The closest thing to a standard is that most servers, for a URL that maps onto a directory on their file system, will generate an HTML document containing a list of links to the resources in that directory if there isn't an index file in it.
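Whether that automatic listing is generated at all is a server-side setting. On Apache, for instance, it is controlled by the Indexes option of mod_autoindex; a sketch, with an assumed path:
# Allow auto-generated directory listings under this path
<Directory "/var/www/html/files">
    Options +Indexes
</Directory>
Most administrators turn this off with Options -Indexes, which is why you generally cannot enumerate a server's directories from the outside.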

How do I stop search engines indexing a maintenance page

I need to setup a maintenance page for a website I'm running, e.g. for display when I'm performing site maintenance (scheduled downtime) or if something really breaks and I need to put up a holding page.
Is there anything special I need to do to ensure that search engine crawlers don't index it and think that it's my site? Or should I return a 404, add a temporary robots.txt file, or something else? I basically don't want them to index it as my site, but I also don't want them to think my site is dead and not come back.
Edit: Here's what I did in Apache:
ErrorDocument 503 /.server-maintenance.html
RewriteEngine On
RewriteRule !^.server-maintenance.html /server-maintenance
Redirect 503 /server-maintenance
You should send a 503 Service Unavailable HTTP status code, and not a 404. Use this in conjunction with a Retry-After header to tell the robots when to come back.
You may use a robots.txt
http://www.robotstxt.org/
Also, google has a validator in their webmasters tools.
https://www.google.com/webmasters/tools/
Returning 503 Service Unavailable tells Google bots to come back later. There's a Google support page describing the HTTP error codes and how they are interpreted by them.
You can also use the Retry-After response header to suggest the minimum time after which your site should be re-checked for availability.
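Putting the two together, a minimal Apache sketch (assuming mod_rewrite and mod_headers are enabled and the holding page lives at /server-maintenance.html) could look like this:
# Answer every request except the holding page itself with 503 and a Retry-After hint
ErrorDocument 503 /server-maintenance.html
Header always set Retry-After "3600"
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/server-maintenance\.html$
RewriteRule ^ - [R=503,L]
The Retry-After value (one hour here) is only an example; set it to roughly how long you expect the downtime to last.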
Another approach would be to not link the maintenance page from any other page on your website (or any other website).