How to prevent traffic from China? - google-compute-engine

I have set up a test machine (g1-small) in us-west-1c. It runs just a Node.js web site. There are no known users other than myself doing testing, though I understand that anyone can now hit the web site.
My monthly bill shows non-trivial traffic under Compute Engine Network Internet Egress from Americas to China. As I am still testing, I have no need to open the web server to China. Is there a way to cut off requests from China, however China is defined? Am I right to assume that egress to China is the result of requests coming from China?

It sounds like you might be getting crawled by bots, search engine or otherwise. This related question has some ideas about locking things down at the application layer, and that would help with the rest of the accidental traffic, too.

You might be getting crawled by search engine bots. Add the following to your /robots.txt:
User-agent: Baiduspider
Disallow: /
User-agent: Sogou web spider
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: ChinasoSpider
Disallow: /
User-agent: Sosospider
Disallow: /
There are many search engines in China. You can visit https://www.baidu.com/robots.txt to see more user agents of Chinese search engines' spiders.
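Keep in mind that robots.txt only deters well-behaved crawlers. For everything else, blocking can be done at the application layer. Here is a minimal sketch for a Node.js site that refuses requests from those same spider user agents (the pattern list is only illustrative, and the port is arbitrary):

const http = require('http');

// Illustrative list of user-agent substrings for Chinese search engine spiders.
const blockedBots = ['Baiduspider', 'Sogou web spider', '360Spider', 'ChinasoSpider', 'Sosospider'];

const server = http.createServer((req, res) => {
  const ua = req.headers['user-agent'] || '';
  if (blockedBots.some((bot) => ua.includes(bot))) {
    res.writeHead(403); // refuse the request outright
    return res.end('Forbidden');
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello from the test site');
});

server.listen(8080);

A stricter option is a Compute Engine firewall rule that denies traffic from known source IP ranges, but keeping such ranges up to date is its own maintenance chore.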

Related

Lighthouse report: insecure requests found

When I test my website (it is hosted outside my computer) with the Lighthouse tool in Chrome, I get this report:
All sites should be protected with HTTPS, even ones that don't handle sensitive data. HTTPS prevents intruders from tampering with or passively listening in on the communications between your app and your users and is a prerequisite for HTTP/2 and many new web platform APIs. Learn more.
What I don't understand is why this comes up: my website uses HTTPS, yet the report says that my images and URLs do not use HTTPS.
Screenshot of this warning:
I have tested my website for HTTPS mistakes with https://www.whynopadlock.com/f73e9366-da69-4ebf-a73f-6ceff2161cd6
Screenshot of that:
As you can see, everything looks fine there, but the Lighthouse tool gives me a similar result every time...
Can anyone please help me with this problem? Thanks!
The transfer protocol your site currently uses is HTTP/1.1,
which has been the de-facto standard since 1997. In an effort to ensure a secure, encrypted connection between browsers and websites, giants like Google and Let's Encrypt have been pushing sites to use HTTPS.
Google came up with a new networking protocol, SPDY, which is considered a precursor to HTTP/2, which rolled out in 2015. Google suggests websites use HTTP/2 as the new de-facto standard protocol; it brings multiplexing, header compression, binary transfer of data and much more.
The message that you have received should go away once you enable HTTP/2 on your server.
As you haven't mentioned your web server, here's how you enable it for Nginx and Apache.
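If your site happens to run on Node.js instead, a minimal sketch using the built-in http2 module looks like this (the certificate paths are hypothetical; browsers only speak HTTP/2 over TLS, so a certificate is required):

const http2 = require('http2');
const fs = require('fs');

// Browsers require TLS for HTTP/2, so use createSecureServer with a certificate.
const server = http2.createSecureServer({
  key: fs.readFileSync('privkey.pem'),  // hypothetical paths to your TLS key and cert
  cert: fs.readFileSync('cert.pem'),
});

server.on('stream', (stream, headers) => {
  stream.respond({ ':status': 200, 'content-type': 'text/html' });
  stream.end('<h1>Served over HTTP/2</h1>');
});

server.listen(443);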

Should I use multiple files or combine pages into one file?

This question betrays how much of a novice I am. I'm making a website, and I'm wondering - is it okay to have separate html files for the distinct pages of my website, or should I try to combine them into one html file? I'm curious about the general way of doing things.
You have to take two things into account here:
#1 HTTP request headers (a single file is better)
For each request the client makes to display your website, some information is sent in addition to the content (for example, headers).
They look like this (from here):
GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1
Host: net.tutsplus.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=r2t5uvjq435r4q7ib3vtdjq120
Pragma: no-cache
Cache-Control: no-cache
So every additional file (HTML, CSS, images, JS, ...) adds more headers and metadata, and slows the page down.
#2 Browser cache (multiple files are better)
Files that never change on your website (like the logo or the main CSS file) do not need to be reloaded on every page and can be kept in the browser cache.
So creating separate files for "global" code is a good way to avoid loading the same code on every page.
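As a rough illustration of the second point in Node.js, a shared stylesheet can be served with a long Cache-Control lifetime so browsers keep it across pages, while the pages themselves stay fresh (the file name and max-age are just examples):

const http = require('http');
const fs = require('fs');

const server = http.createServer((req, res) => {
  if (req.url === '/global.css') {
    // Shared, rarely changing asset: let browsers cache it for up to a year.
    res.writeHead(200, {
      'Content-Type': 'text/css',
      'Cache-Control': 'public, max-age=31536000',
    });
    return fs.createReadStream('global.css').pipe(res); // hypothetical file on disk
  }
  // Individual pages change more often, so force revalidation on each visit.
  res.writeHead(200, { 'Content-Type': 'text/html', 'Cache-Control': 'no-cache' });
  res.end('<link rel="stylesheet" href="/global.css"><h1>A page</h1>');
});

server.listen(8080);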
Conclusion
Both are good; every case has its own best solution.
As a proxy administrator, I am sometimes astounded by how many separate files are requested for a single web page. In the old days, that could be trouble. Good browsers today use an HTTP CONNECT to establish a single TCP tunnel in which to pass those requests. Also, it's quite common for services to distribute content across multiple servers, each of which would require its own CONNECT. And, for more heavily accessed services online, there will often be a content delivery network in use to better manage the load and abuse from malicious sources. These days, it's best to organize content across files according to the structure of the document, along with version control and other management considerations.
The answer to this is more an art than a science. You can break your content up in whatever seems to you a logical fashion, but as much as possible you want to try to view the site through the eyes of your users. That can mean actual research (surveys, focus groups, etc.) if you can afford it, or trying to draw inferences from Google Analytics or some other tracking system.
There's also an SEO angle. More pages can equate to greater presence in search engines, but if you overdo it you wind up with what Google terms "thin content" — pages with very little meat to them which don't convey much information.
This isn't really a question with a right or wrong answer. It's very much dependent on what you wish to accomplish both aesthetically and in terms of usability, performance, etc. The best answer comes with experience, but you can always copy sites you like until you get a feel for it.

Why did Google.com switch to SPDY (HTTP/2+QUIC/35) instead of HTTP/2

Several days ago I saw Google.com was using HTTP/2, but yesterday I became aware that Google.com had switched to SPDY (HTTP/2+QUIC/35).
Two questions:
As you know, HTTP/2 extends SPDY, so why did Google.com roll back to SPDY?
What's the difference between SPDY and SPDY (HTTP/2+QUIC/35)?
HTTP/2+QUIC/35 is not SPDY; it is a new communication protocol, based on UDP instead of TCP, named QUIC.
Let's quote https://www.chromium.org/quic :
Key advantages of QUIC over TCP+TLS+HTTP2 include:
Connection establishment latency
Improved congestion control
Multiplexing without head-of-line blocking
Forward error correction
Connection migration
A good presentation is available in this blog article.
In fact, the whole QUIC project is a way to bypass the TCP standards in a more reactive way. Google has been experimenting with QUIC for years, transparently in the Chrome browsers of billions of users, and has now switched to it by default where it works (with a fallback to "classical" HTTP/2 over TCP).
From the developer's point of view, QUIC has an HTTP/2 interface, with all its features.
To my knowledge, only LiteSpeed supports QUIC outside of Google - not the OpenLiteSpeed version yet (sadly) - plus the Go-based Caddy server.
Are you sure they did? Or is the tool you are using to display this info (this extension perhaps?) choosing to display it that way? Check the Network tab in Chrome's developer tools to see what protocol Chrome really thinks it's talking.
HTTP/2 is the standardised version of SPDY, so saying something is "SPDY-enabled (HTTP/2)" doesn't make sense. Unless it means it can talk SPDY ("SPDY-enabled") but has chosen in this case to talk HTTP/2 as that's better?
Finally, QUIC is a new protocol Google is experimenting with, which replaces the TCP network layer that SPDY and HTTP/2 are built on top of. So both can use QUIC instead of TCP, and it's usually faster than TCP (hence the name, which sounds like "quick" and is an acronym of "Quick UDP Internet Connections").
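Beyond the Network tab, a quick way to check what the page itself negotiated is the Resource Timing API, which current browsers expose in the console (the exact protocol strings reported, such as "h2" or a QUIC identifier, vary by browser and version):

// Run this in the browser console on the page you are inspecting.
// nextHopProtocol reports the ALPN protocol negotiated for the document request.
const nav = performance.getEntriesByType('navigation')[0];
console.log(nav.nextHopProtocol); // e.g. "http/1.1", "h2", or a QUIC identifier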

Why do browser implementations of HTTP/2 require TLS?

Why do most modern browsers require TLS for HTTP/2?
Is there a technical reason behind this? Or simply just to make the web more secure?
http://caniuse.com/#feat=http2
It is partly about making more things use HTTPS and encouraging users and servers to go HTTPS. Both Firefox and Chrome developers have stated this to be generally good, for the sake of users and users' security and privacy.
It is also about broken "middle boxes" deployed on the Internet that assume TCP traffic over port 80 (that might look like HTTP/1.1) means HTTP/1.1, and then interfere in order to "improve" or filter the traffic in some way. Doing HTTP/2 in clear text over such networks ends up with a much worse success rate. Insisting on encryption means those middle boxes never get the chance to mess up the traffic.
Further, there is a certain percentage of deployed HTTP/1.1 servers that will return an error response to an Upgrade: header with an unknown protocol (such as "h2c", which is HTTP/2 in clear text), which would also complicate an implementation in a widely used browser. Doing the negotiation over HTTPS is much less error prone, since "not supporting it" simply means switching down to the safe old HTTP/1.1 approach.
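For what it's worth, that negotiate-then-fall-back behaviour is easy to see server-side. A minimal sketch in Node.js (the certificate paths are hypothetical): the TLS handshake carries the protocol choice via ALPN, and clients that don't offer "h2" simply get HTTP/1.1 on the same port.

const http2 = require('http2');
const fs = require('fs');

// allowHTTP1 lets clients that don't negotiate "h2" via ALPN
// fall back to HTTPS/1.1 on the same socket -- the safe old approach.
const server = http2.createSecureServer({
  key: fs.readFileSync('privkey.pem'),  // hypothetical paths
  cert: fs.readFileSync('cert.pem'),
  allowHTTP1: true,
});

server.on('request', (req, res) => {
  // httpVersion is "2.0" for HTTP/2 clients and "1.1" for the fallback.
  res.end(`You negotiated HTTP/${req.httpVersion}\n`);
});

server.listen(8443);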

When should one use a 'www' subdomain?

Browsing the internet over the last few years, I've seen more and more sites getting rid of the 'www' subdomain.
Are there any good reasons to use or not to use the 'www' subdomain?
There are a ton of good reasons to include it, the best of which is here:
Yahoo Performance Best Practices
Due to the dot rule with cookies, if you don't have the 'www.' then you can't set two-dot cookies or cross-subdomain cookies a la *.example.com. There are two pertinent impacts.
First it means that any user you're giving cookies to will send those cookies back with requests that match the domain. So even if you have a subdomain, images.example.com, the example.com cookie will always be sent with requests to that domain. This creates overhead that wouldn't exist if you had made www.example.com the authoritative name. Of course you can use a CDN, but that depends on your resources.
Also, you then don't have the ability to set a cross-subdomain cookie. This seems evident, but this means allowing authenticated users to move between your subdomains is more of a technical challenge.
So ask yourself some questions. Do I set cookies? Do I care about potentially needless bandwidth expenditure? Will authenticated users be crossing subdomains? If you're really concerned with inconveniencing the user, you can always configure your server to take care of the www/no www thing automatically.
See dropwww and yes-www (saved).
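As a rough sketch of the cookie mechanics behind this, in Node.js (example.com and the cookie values are placeholders): a cookie scoped with Domain=example.com rides along on requests to every subdomain, while a host-only cookie set from www stays on www.

const http = require('http');

const server = http.createServer((req, res) => {
  res.writeHead(200, {
    'Set-Cookie': [
      // Domain cookie: sent to example.com and all subdomains
      // (www.example.com, images.example.com, ...).
      'session=abc123; Domain=example.com; Path=/',
      // Host-only cookie: sent back only to the exact host that set it.
      'prefs=dark; Path=/',
    ],
  });
  res.end('cookies set\n');
});

server.listen(8080);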
Just after asking this question I came across the no-www page, which says:
...Succinctly, use of the www subdomain
is redundant and time consuming to
communicate. The internet, media, and
society are all better off without it.
Take it from a domainer: use both www.domainname.com and the plain domainname.com,
otherwise you are just throwing your traffic away to the browser's search engine (DNS error).
Actually, it is amazing how many domains out there, especially amongst the top 100, correctly resolve for www.domainname.com but not for domainname.com.
There are MANY reasons to use the www sub-domain!
When writing a URL, it's easier to handwrite and type "www.stackoverflow.com" than "http://stackoverflow.com". Most text editors, email clients, word processors and WYSIWYG controls will automatically recognise both of the above and create hyperlinks. Typing just "stackoverflow.com" will not result in a hyperlink; after all, it's just a domain name. Who says there's a web service there? Who says the reference to that domain is a reference to its web service?
What would you rather write/type/say: "www." (4 chars) or "http://" (7 chars)?
"www." is an established shorthand way of unambiguously communicating the fact that the subject is a web address, not a URL for another network service.
When verbally communicating a web address, it should be clear from the context that it's a web address so saying "www" is redundant. Servers should be configured to return HTTP 301 (Moved Permanently) responses forwarding all requests for #.stackoverflow.com (the root of the domain) to the www subdomain.
In my experience, people who think WWW should be omitted tend to be people who don't understand the difference between the web and the internet and use the terms interchangeably, like they're synonymous. The web is just one of many network services.
If you want to get rid of www, why not change your HTTP server to use a different port as well? TCP port 80 is sooo yesterday... Let's change that to port 1234. YAY, now people have to say and type "http://stackoverflow.com:1234" (aitch tee tee pee colon slash slash stack overflow dot com colon one two three four), but at least we don't have to say "www", eh?
There are several reasons, here are some:
1) The person wanted it this way on purpose
People use DNS for many things, not only the web. They may need the main DNS name for some other service that is more important to them.
2) Misconfigured DNS servers
If someone does a lookup of www against your DNS server, your DNS server needs a record for it in order to resolve it.
3) Misconfigured web servers
A web server can host many different web sites. It distinguishes which site you want via the Host header. You need to specify which host names you want to be used for your website.
4) Website optimization
It is better to not handle both, but to forward one with a moved permanently http status code. That way the 2 addresses won't compete for inbound link ranks.
5) Cookies
To avoid problems with cookies not being sent back by the browser. This can also be solved with the moved permanently http status code.
6) Client side browser caching
Web browsers may not reuse a cached image if one request goes to www and another goes without it. This can also be solved with the moved permanently HTTP status code.
There is no huge advantage to including it or not including it, and no single objectively best strategy. "no-www.org" is a silly load of old dogma trying to present itself as definitive fact.
If the “big organisation that has many different services and doesn't want to have to dedicate the bare domain name to being a web server” scenario doesn't apply to you (and in reality it rarely does), which address you choose is a largely cultural matter. Are people where you are used to seeing a bare “example.org” domain written on advertising materials, would they immediately recognise it as a web address without the extra ‘www’ or ‘http://’? In Japan, for example, you would get funny looks for choosing the non-www version.
Whichever you choose, though, be consistent. Make both www and non-www versions accessible, but make one of them definitive, always link to that version, and make the other redirect to it (permanently, status code 301). Having both hostnames respond directly is bad for SEO, and serving any old hostname that resolves to your server leaves you open to DNS rebinding attacks.
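A minimal sketch of that canonical-host redirect in Node.js (the host name is a placeholder; in practice this usually lives in the web server or load balancer configuration):

const http = require('http');

const CANONICAL = 'www.example.com'; // hypothetical canonical host

const server = http.createServer((req, res) => {
  if (req.headers.host !== CANONICAL) {
    // Permanent redirect so links and search engines consolidate on one hostname.
    res.writeHead(301, { Location: `https://${CANONICAL}${req.url}` });
    return res.end();
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('canonical host\n');
});

server.listen(8080);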
Apart from the load optimization regarding cookies, there is also a DNS-related reason for using the www subdomain: you can't use a CNAME on the naked domain. On yes-www.org (saved) it says:
When using a provider such as Heroku or Akamai to host your web site, the provider wants to be able to update DNS records in case it needs to redirect traffic from a failing server to a healthy server. This is set up using DNS CNAME records, and the naked domain cannot have a CNAME record. This is only an issue if your site gets large enough to require highly redundant hosting with such a service.
As jdangel points out the www is good practice in some cookie situations but I believe there is another reason to use www.
Isn't it our responsibility to care for and protect our users? As most people expect www, you will give them a less-than-perfect experience by not accounting for it.
To me it seems a little arrogant not to set up a DNS entry just because in theory it's not required. There is no overhead in carrying the DNS entry, and through redirects etc. visitors can be forwarded to the non-www address.
Seriously, don't lose valuable traffic by leaving your potential visitor with an unnecessary "site not found" error.
Additionally, on a Windows-only network you might be able to set up a Windows DNS server to avoid the following problem, but I don't think you can in a mixed environment of Mac and Windows. If a Mac does a DNS query against a Windows DNS server, mydomain.com will return all the available name servers, not the web server. So if you type mydomain.com in your browser, the browser will query a name server, not a web server; in that case you need a subdomain (e.g. www.mydomain.com) to point to the specific web server.
Some sites require it because the service is configured, on that particular setup, to deliver web content via the www sub-domain only.
This is correct as www is the conventional sub-domain for "World Wide Web" traffic.
Just as port 80 is the standard port. Obviously there are other standard services and ports as well (http tcp/ip on port 80 is nothing special!)
Imagine mycompany...
mx1.mycompany.com 25 smtp, etc
ftp.mycompany.com 21 ftp
www.mycompany.com 80 http
Sites that don't require it basically have forwarding in DNS or redirection of some kind.
e.g.
*.mycompany.com 80 http
The only reason to do it, as far as I can see, is if you prefer it and you want to.