This question betrays how much of a novice I am. I'm making a website, and I'm wondering - is it okay to have separate html files for the distinct pages of my website, or should I try to combine them into one html file? I'm curious about the general way of doing things.
You need to take two things into account here:
#1 HTTP Request Header (A single file is better)
For each request the client makes to display your website, some information is sent in addition to the content itself (for example, headers).
A typical request looks like this (from here):
GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1
Host: net.tutsplus.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=r2t5uvjq435r4q7ib3vtdjq120
Pragma: no-cache
Cache-Control: no-cache
So every additional file (HTML, CSS, images, JS, ...) adds another request with its own headers and metadata, and slows things down.
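For instance, a page that references several separate resources forces the browser to repeat that request/header exchange for each one; the file names below are just placeholders:
<link rel="stylesheet" href="main.css">
<link rel="stylesheet" href="theme.css">
<script src="app.js"></script>
<img src="logo.png">
Each of those four lines costs one extra request on top of the HTML itself.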
#2 Browser cache (Multiple files are better)
Files that never change on your website (such as your logo or main CSS file) do not need to be reloaded on every page and can be kept in the browser cache.
So creating separate files for "global" code is a good way to avoid loading the same code on every page.
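For example, a static asset such as your main stylesheet can be served with a long-lived cache header so returning visitors don't fetch it again; the values below are only illustrative:
HTTP/1.1 200 OK
Content-Type: text/css
Cache-Control: public, max-age=31536000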
Conclusion
Both approaches are valid; each case calls for its own solution.
As a proxy administrator, I am sometimes astounded by how many separate files are requested for a single web page. In the old days, that could be trouble. Good browsers today use an HTTP CONNECT to establish a single TCP tunnel through which to pass those requests. Also, it's quite common for services to distribute content across multiple servers, each of which requires its own CONNECT. And for more heavily accessed online services, a content delivery network is often used to better manage the load and abuse from malicious sources. These days, it's best to organize content across files according to the structure of the document, along with version control and other management considerations.
The answer to this is more an art than a science. You can break your content up in whatever seems to you a logical fashion, but as much as possible you want to try to view the site through the eyes of your users. That can mean actual research (surveys, focus groups, etc.) if you can afford it, or trying to draw inferences from Google Analytics or some other tracking system.
There's also an SEO angle. More pages can equate to greater presence in search engines, but if you overdo it you wind up with what Google terms "thin content" — pages with very little meat to them which don't convey much information.
This isn't really a question with a right or wrong answer. It's very much dependent on what you wish to accomplish both aesthetically and in terms of usability, performance, etc. The best answer comes with experience, but you can always copy sites you like until you get a feel for it.
Related
I have a video in my HTML, embedded with the video tag like <video src='xxxx.mp4'>, but sometimes the video loads very slowly.
I checked the media request and found that it tries to load about 1 MB of data in the first video request. The request headers are below, with no Range settings.
Some videos' first request is very small, so they can show the first frame quickly. How do video media requests work, and can I influence the request size each time?
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Connection: keep-alive
Cookie: UM_distinctid=17e6bd4dd5c30-0f869b25cdf7158-4c3e207f-151800-17e6bd4dd5d5ce
Host: concert-cdn.jzurl.cn
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
TE: trailers
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0
If you are using just the video tag then the particular browser implements the logic to decide what size ranges to request. Some browsers actually request the entire file first, then abort that request and follow with individual range requests. You will also see requests from a browser with no end range as it 'looks' through the file - the logic again seems to be that a request can simply be cancelled if the rest of the data is not needed.
If you are using a Javascript player, like video.js etc., the player can in theory control this type of thing, but in practice for mp4 files, I think, many players just leverage the browser's HTML video tag functionality anyway.
Focusing on what you are trying to achieve, there are a couple of things you can do to speed initial playback.
First, check that your server accepts range requests, which it sounds like you have already done.
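For reference, a range-capable exchange looks roughly like this; the byte values and file name are illustrative:
GET /xxxx.mp4 HTTP/1.1
Range: bytes=0-1048575

HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Range: bytes 0-1048575/52428800
Content-Length: 1048576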
Next, assuming you are streaming an mp4 file, make sure the metadata is at the start - the 'moov' atom, as it is known. There are several tools that will allow you to make this change, including ffmpeg:
https://ffmpeg.org/ffmpeg-formats.html#toc-mov_002c-mp4_002c-ismv
The mov/mp4/ismv muxer supports fragmentation. Normally, a MOV/MP4 file has all the metadata about all packets stored in one location (written at the end of the file, it can be moved to the start for better playback by adding faststart to the movflags, or using the qt-faststart tool). A fragmented file consists of a number of fragments, where packets and metadata about these packets are stored together. Writing a fragmented file has the advantage that the file is decodable even if the writing is interrupted (while a normal MOV/MP4 is undecodable if it is not properly finished), and it requires less memory when writing very long files (since writing normal MOV/MP4 files stores info about every single packet in memory until the file is closed). The downside is that it is less compatible with other applications.
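In practice, relocating the moov atom with ffmpeg is usually a one-liner along these lines (input and output names are placeholders):
ffmpeg -i input.mp4 -c copy -movflags +faststart output.mp4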
If the above does not address your needs then you may want to look at using an Adaptive Bit Rate streaming protocol. Nearly all major streaming services use this approach, but it does require more work on the server side and generally a special streaming packager server, although there are open source ones available (e.g. https://github.com/shaka-project/shaka-packager).
ABR creates multiple different-bandwidth versions of the video and breaks each into chunks of equal duration, e.g. 2-second chunks. The client device or player downloads the video one chunk at a time and selects the next chunk from the bit rate most appropriate to the current network conditions. It can choose a low-bandwidth chunk to allow a quick start, and you will often see this on commercial streaming services, where the video quality is lower at start-up and then improves over time as higher-bandwidth chunks are requested.
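As an illustration, with HLS (one common ABR protocol) the client first fetches a master playlist listing the available renditions, something like the sketch below; the bandwidths, resolutions and paths are made up:
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
mid/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
high/index.m3u8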
More info here: https://stackoverflow.com/a/42365034/334402
I'm building a REST-based web service which will serve a couple of hundred clients that will upload/request little bursts of information throughout the day and make one larger cache update (about 100-200 KB) once a day.
While testing the large update on the production machine (a Linux virtual machine in the cloud running Apache/PHP) I discovered, to my utter dismay, that the data gets to the client corrupted (i.e. with one or more wrong characters) literally MOST of the time.
Example of corrupted JSON, parser says SyntaxError: JSON.parse: expected ':' after property name in object at line 1 column 81998 of the JSON data:
"nascita":"1940-12-17","attiva":true,","cognome":"MILANI"
should be
"nascita":"1940-12-17","attiva":"true","cognome":"MILANI"
These are the HTTP headers of the response:
Connection: Keep-Alive
Content-Type: application/json
Date: Fri, 02 Jun 2017 16:59:39 GMT
Keep-Alive: timeout=5, max=100
Server: Apache/2.4.18 (Ubuntu)
Transfer-Encoding: chunked
I am certainly not an expert when it comes to networking but I used to think that such occurrences, failures of both IP and TCP error detection, were extremely rare. (I found this post interesting:
Can a TCP checksum produce a false positive? If yes, how is this dealt with?)
So... what's going on here? Am I missing something?
I started to think of possible solutions.
The quickest I could think of was using HTTP compression: if the client is unable to decompress the content (which is very likely in case of data corruption) then I can ask for the content again.
I enabled that on Apache and, to my surprise, all responses completed with valid data.
Could it be that web browsers (I'm using good old Firefox for testing the web service) have some built-in mechanism for re-requesting corrupt compressed data? Or MAYBE the smaller, less regular nature of compressed data makes TCP/IP mistakes less likely??
The other quick solution that came to my mind was to calculate a checksum of the content, something I could do for smaller requests that don't really benefit from compression.
I am trying to figure out if and how the Content-MD5 field in HTTP could help me... The web browser seems to ignore it, so I guess I will have to compute and compare it explicitly on my client...
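If you do go down that route, the check itself is simple. Here is a minimal sketch, assuming a Python client; the endpoint URL is made up, and Content-MD5 is the base64-encoded MD5 of the body (RFC 1864):
import base64
import hashlib

import requests  # assumed HTTP client library

# Hypothetical endpoint for the daily cache update
resp = requests.get("https://example.com/api/cache-update")

expected = resp.headers.get("Content-MD5")  # base64-encoded MD5 digest set by the server
actual = base64.b64encode(hashlib.md5(resp.content).digest()).decode("ascii")

if expected is not None and actual != expected:
    # The body was corrupted (or truncated) in transit, so request it again
    raise IOError("Content-MD5 mismatch, retry the request")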
Using TLS may be another good idea, possibly the best.
Or again.... am I missing something HUGE?
Like, I don't know, for some reason my Apache is using UDP??
All these errors didn't make any sense.
So I got Wireshark to capture all TCP segments incoming from the web server and see what could be wrong with them. Again, Firefox showed a mistake at a random column but.... it turned out that there was no such error in the corresponding TCP segment!
I then tried Chrome (which doesn't come with a built-in parser), installed the JSONView extension, and everything there was fine! Did the same with Firefox, installed JSONView, and... no errors!
Turns out there's some kind of bug with the latest Firefox built-in JSON viewer. I'm running 53.0.3 right now.
Why do most modern browsers require TLS for HTTP/2?
Is there a technical reason behind this? Or simply just to make the web more secure?
http://caniuse.com/#feat=http2
It is partly about making more things use HTTPS and encouraging users and servers to go HTTPS. Both Firefox and Chrome developers have stated that this is generally a good thing, for the sake of users and their security and privacy.
It is also about broken "middle boxes" deployed on the Internet that assume TCP traffic on port 80 (which might look like HTTP/1.1) means HTTP/1.1, and then interfere in order to "improve" or filter the traffic in some way. Doing HTTP/2 in clear text over such networks ends up with a much worse success rate. Insisting on encryption means those middle boxes never get the chance to mess up the traffic.
Further, a certain percentage of deployed HTTP/1.1 servers will return an error response to an Upgrade: header with an unknown protocol (such as "h2c", which is HTTP/2 in clear text), which would also complicate an implementation in a widely used browser. Doing the negotiation over HTTPS is much less error-prone, as "not supporting it" simply means switching down to the safe old HTTP/1.1 approach.
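For reference, the cleartext negotiation a client would have to attempt looks roughly like this (the HTTP2-Settings value is a base64url-encoded SETTINGS frame, omitted here):
GET / HTTP/1.1
Host: example.com
Connection: Upgrade, HTTP2-Settings
Upgrade: h2c
HTTP2-Settings: <base64url-encoded SETTINGS payload>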
This Yahoo Developer Network article says that browsers handle non-cacheable resources that are referenced more than once in a single HTML page differently. I didn't find any rule about this in the HTTP/1.1 cache RFC.
I made some experiments in Chrome, but I couldn't figure out the exact rules. It loaded a duplicated non-cacheable script tag only once. Then I referenced the same script in 3 iframes. The first one triggered a network request, but the others were served from the cache. I tried to reference the same URL as the src of an image, and that triggered a network request again.
Is there any documentation about this behavior? How does this differ between browsers?
When a client decides to retrieve a resource, it's RFC 2616 that governs the rules of whether that resource can be returned from a cache or needs to be revalidated/reloaded from the origin server (mostly section 14.9, but you really need to read the whole thing).
However, when you have multiple copies of the same resource on the same page, after the first copy has been retrieved following the rules of RFC2616, the decision as to whether to retrieve additional copies is now covered by the HTML5 spec (mostly specified in the processing model for fetching resources).
In particular, note step 10:
If the resource [...] is already being downloaded for other reasons (e.g. another invocation of this algorithm), and this request would be identical to the previous one (e.g. same Accept and Origin headers), and the user agent is configured such that it is to reuse the data from the existing download instead of initiating a new one, then use the results of the existing download instead of starting a new one.
This clearly describes a number of factors that could come into play in deciding whether a resource may be reused or not. Some key points:
Same Accept and Origin headers: While most browsers use the same Accept headers everywhere, in Internet Explorer they're different for an image vs a script vs HTML. And every browser sends a different Referer when frames are involved, and while Referer isn't directly mentioned, Accept and Origin were only given as examples.
Already being downloaded: Note that that is something quite different from already downloaded. So if the resource occurs multiple times on the page, but the first occurrence is finished downloading before the second occurrence is encountered, then the option to reuse may not be applicable.
The user agent is configured to reuse the data: That implies to me that the decision to reuse or re-retrieve the data is somewhat at the discretion of the user-agent, or at least a user option.
The end result, is that every single browser handles caching slightly differently. And even in a particular browser, the results may differ based on timing.
I created a test case with three nested frames (i.e. a page containing an iframe, which itself contained an iframe) and 6 copies of the same script, 2 on each page (using Cache-Control:no-cache to make them non-cacheable, but also tested with other variations, including max-age=0).
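A stripped-down sketch of such a test page might look like this (file names are hypothetical; shared.js is served with Cache-Control: no-cache, and frame1.html repeats the same pattern while nesting frame2.html):
<script src="shared.js"></script>
<script src="shared.js"></script>
<iframe src="frame1.html"></iframe>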
Chrome loaded only 1 copy.
Internet Explorer tended to vary, assumedly based on load, but was between 1 and 3.
Safari loaded 3 copies, one for each frame (i.e. with different Referer headers).
Opera and Firefox loaded all 6 copies.
When I reused the same resource in a couple of images (one on the root page, one in the first iframe) and a couple of other images for reference, the behaviour changed.
Chrome now loaded 5 copies, 1 of each type on each page. While Accept headers in Chrome are the same for images and scripts, the header order is different, which suggests they may be treated differently, and potentially cached differently.
Internet Explorer loaded 2 copies, 1 of each type which was to be expected for them. Assumedly that could have varied though, given their behaviour when it was just scripts.
Safari was still only 3 copies, one per frame.
Opera was inexplicably still 6. Couldn't tell what portion of those were scripts and which were images. But possibly this is also something that could vary based on load or timing.
Firefox loaded 8 copies, which was to be expected for them. The 6 scripts, plus the 2 new images.
Now this was what happened when viewing the page normally - i.e. just entering the page url into the address bar. Forcing a reload with F5 (or whatever the equivalent on Safari) produced a whole different set of results. And in general, the whole concept of reloading, F5 vs Ctrl-F5, what headers get sent by the client, etc. also differs wildly from one browser to the next. But that's a subject for another day.
The bottom line is caching is very unpredictable from one browser to the next, and the specs somewhat leave it up to the implementors to decide what works best for them.
I hope this has answered your question.
Additional Note: I should mention that I didn't go out of my way to test the latest copy of every browser (Safari in particular was an ancient v4, Internet Explorer was v9, but the others were probably fairly up to date). I doubt it makes much difference though. The chances that all browsers have suddenly converged on consistent behaviour in this regard is highly unlikely.
Molnarg, if you read the article properly it will become clear why this happens.
Unnecessary HTTP requests happen in Internet Explorer, but not in
Firefox. In Internet Explorer, if an external script is included twice
and is not cacheable, it generates two HTTP requests during page
loading. Even if the script is cacheable, extra HTTP requests occur
when the user reloads the page.
This behavior is unique to Internet Explorer. If you ask me why this happens, I would say that the IE developers chose to ignore the HTTP/1.1 cache RFC, or at least could not implement it. Maybe it is a work in progress. But then again, there are a lot of aspects in which IE differs from most other browsers (JavaScript, HTML5, CSS). This can't be helped unless the devs update it.
The Yahoo Dev article you gave lists best practices for high performance. Those practices must accommodate all the IE users, who are impaired by this. That is why including the same script multiple times, though OK for other browsers, hurts IE users and should be avoided.
Update
Non-cacheable resources will generate a network request, whether they are referenced once or multiple times.
From 2. Overview of Cache Operation from the HTTP/1.1 cache RFC
Although caching is an entirely OPTIONAL feature of HTTP, we assume
that reusing the cached response is desirable and that such reuse
is the default behavior when no requirement or locally-desired
configuration prevents it.
So using the cache means attempting to reuse, and non-cacheable means the opposite. Think of it like this: a non-cacheable request is like an HTTP request with the cache turned off (a fallback from HTTP with the cache on).
Cache-Control: max-age=n does not prevent the cache from storing a resource; it merely states that the cached item becomes stale after n seconds. To prevent the cache from being used, use these headers for the image:
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0
Note: There are existing questions that look like duplicates (linked below), but most of them are from a few years ago. I'd like to get a clear and definitive answer that proves things either way.
Is making an entire website run in HTTPS not an issue today from a best practice and performance / SEO perspective?
UPDATE: I am looking for more information with sources, especially around the impact on SEO. Bounty added.
Context:
The conversation came up when we wanted to introduce some buttons that spawn lightboxes with forms in them that collect personal information (some of them even allow users to login). This is on pages that make up a big portion of the site. Since the forms would need to collect and submit information securely and the forms are not on pages of their own, the easiest way we could see to make this possible was to make the pages themselves be HTTPS.
What I would like is for an answer that covers issues with switching a long running popular site to HTTPS such as the ones listed below:
Would a handshake be negotiated on every request?
Will all assets need to be encrypted?
Would browsers not cache HTTPS content, including assets?
Is it still an issue that downstream transparent proxies don't cache HTTPS content, including assets (css, js etc.)?
Would all external assets (tracking pixels, videos, etc) need to have HTTPS version?
HTTPS and gzip might not be happy together?
Backlinks and organic links will always be HTTP so you will be 301'ing all the time, does this impact SEO / performance? Any other SEO impact of changing this sitewide?
There's a move with some of the big players to always run HTTPS, see Always on SSL, is this setting a precedent / best practice?
Duplicate / related questions:
Good practice or bad practice to force entire site to HTTPS?
Using SSL Across Entire Site
SSL on entire site or just part of it?
Not sure I can answer all points in one go with references, but here goes. Please edit as appropriate:
Would a handshake be negotiated on every request?
No, SSL connections are typically reused for a number of consecutive requests. The overhead once associated with SSL is mostly gone these days. Computers have also gotten a lot faster.
Will all assets need to be encrypted?
Yes, otherwise the browser will not consider the entire site secure.
Would browsers not cache HTTPS content, including assets?
I do not think so, caching should work just fine.
Is it still an issue that downstream transparent proxies don't cache HTTPS content, including assets (css, js etc.)?
For the proxy to cache SSL encrypted connections/assets, the proxy would need to decrypt the connection. That largely negates the advantage of SSL. So yes, proxies would not cache content.
It is possible for a proxy to be an SSL endpoint to both client and server, so it has separate SSL sessions with each and can see the plaintext being transmitted. One SSL connection would be between the proxy and the server, the proxy and the client would have a separate SSL connection signed with the certificate of the proxy. That requires that the client trusts the certificate of the proxy and that the proxy trusts the server certificate. This may be set up this way in corporate environments.
Would all external assets (tracking pixels, videos, etc) need to have HTTPS version?
Yes.
HTTPS and gzip might not be happy together?
Being on different levels of protocols, it should be fine. gzip is negotiated after the SSL layer is put over the TCP stream. For reasonably well behaved servers and clients there should be no problems.
Backlinks and organic links will always be HTTP so you will be 301'ing all the time, does this impact SEO?
Why will backlinks always be HTTP? That's not necessarily a given. How it impacts SEO very much depends on the SE in question. An intelligent SE can recognize that you're simply switching protocols and not punish you for it.
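If you do end up redirecting all plain-HTTP traffic, on Apache that is typically a single permanent redirect per entry point, along these lines (the domain is a placeholder):
<VirtualHost *:80>
    ServerName www.example.com
    Redirect permanent / https://www.example.com/
</VirtualHost>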
1- Would a handshake be negotiated on every request?
There are two issues here:
Most browsers don't need to re-establish a new connection between requests to the same site, even with plain HTTP. HTTP connections can be kept alive, so, no, you don't need to close the connection after each HTTP request/response: you can re-use a single connection for multiple requests.
You can also avoid performing multiple handshakes when parallel or subsequent SSL/TLS connections are required. There are multiple techniques explained in ImperialViolet - Overclocking SSL (definitely relevant for this question), written by Google engineers, in particular session resumption and False Start. As far as I know, most modern browsers support at least session resumption.
These techniques don't get rid of new handshakes completely, but reduce their cost. Apart from session-reuse, OCSP-stapling (to check the certificate revocation status) and elliptic curves cipher suites can be used to reduce the key exchange overhead during the handshake, when perfect forward-secrecy is required. These techniques also depend on browser support.
There will still be an overhead, and if you need massive web-farms, this could still be a problem, but such a deployment is possible nowadays (and some large companies do it), whereas it would have been considered inconceivable a few years ago.
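As a concrete illustration, with Apache's mod_ssl, session caching and OCSP stapling are typically enabled with directives along these lines; the paths and cache sizes here are only illustrative:
SSLSessionCache        "shmcb:/var/run/apache2/ssl_scache(512000)"
SSLSessionCacheTimeout 300
SSLUseStapling         on
SSLStaplingCache       "shmcb:/var/run/apache2/ssl_stapling(32768)"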
2- Will all assets need to be encrypted?
Yes, as always. If you serve a page over HTTPS, all the resources it uses (iframe, scripts, stylesheets, images, any AJAX request) need to be using HTTPS. This is mainly because there is no way to show the user which part of the page can be trusted and which can't.
3- Would browsers not cache HTTPS content, including assets?
Yes, they will; you can either serve your assets with Cache-Control: public explicitly, or assume that the browser will cache them anyway. (In fact, you should prevent caching for sensitive resources.)
4- Is it still an issue that downstream transparent proxies don't cache HTTPS content, including assets (css, js etc.)?
HTTP proxy servers merely relay the SSL/TLS connection without looking into it. However, some CDNs also provide HTTPS access (all the links on Google Libraries API are available via https://), which, combined with in-browser caching, allows for better performance.
5- Would all external assets (tracking pixels, videos, etc) need to have HTTPS version?
Yes, this goes with point #3. The fact that YouTube supports HTTPS access helps.
6- HTTPS and gzip might not be happy together?
They're independent. HTTPS is HTTP over TLS, the gzip compression happens at the HTTP level. Note that you can compress the SSL/TLS connection directly, but this is rarely used: you might as well use gzip compression at the HTTP level if you need (there's little point compressing twice).
7- Backlinks and organic links will always be HTTP so you will be 301'ing all the time, does this impact SEO?
I'm not sure why these links should use http://. URL shortening services are a problem generally speaking for SEO if that's what you're referring to.
I think we'll see more and more usage of HTTP Strict Transport Security, so more https:// URLs by default.
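For reference, HSTS is just a response header sent over HTTPS, for example:
Strict-Transport-Security: max-age=31536000; includeSubDomains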