What is the complete process from entering a url to the browser's address bar to get the rendered page in browser? - html

I have been thinking about this question for a long time. It is a big question, since it touches almost every corner of web development.
As I understand it, the process goes like this:
1. Enter the URL into the address bar.
2. A request is sent to the DNS server, based on your network configuration.
3. DNS returns the real IP address of the domain name.
4. A request (with complete HTTP headers) is sent to port 80 of the server identified by the IP from step 3 (assuming we don't specify another port).
5. The server looks up its listening ports and forwards the request to the app listening on port 80 (let's say nginx), or to another server (in which case the server from step 3 acts as a load balancer).
6. nginx tries to match the URL against its configuration and either serves a static page directly, or invokes the corresponding script interpreter (e.g. PHP/Python) or another app to produce the dynamic content (with DB queries or other logic).
7. An HTML document is sent back to the browser with complete HTTP response headers.
8. The browser parses the HTML into a DOM using its parser.
9. External resources (JS/CSS/images/Flash/videos...) are requested in sequence (or not?).
10. JS is executed by the JS engine.
11. CSS is rendered by the CSS engine, and the HTML's display is adjusted based on the CSS (also in sequence or not?).
12. If there is an iframe in the DOM, the same process runs separately for it, from step 1 to step 12.
The above is my understanding, but I don't know whether it is correct. How precise is it? Did I miss something?
If it's correct (or almost correct), I hope you can:
Make the description of each step more precise in your own words, or write your own steps if there is a big change.
Give a deep explanation of the step you are most familiar with.
Post one answer per step. Others can add supplements in each answer's comments.
I hope this thread can help all web developers better understand what we do every day.
I will update this question based on the answers.
Thanks.

As you say this is a broad question where it's possible to go into great detail on a number of topics. There's nothing wrong with the sequence you described, but you're leaving out a lot of detail. To mention a few:
The DNS layer can help direct clients to different servers based on geographical location to help with load balancing and latency minimization, and one server can respond to requests from many different DNS names.
A browser can make different types of requests (GET, POST, HEAD, etc), and usually includes several different headers including cookies, browser capabilities, language preferences, etc.
Most browsers usually maintain a cache in order to avoid downloading stuff many times, and use various techniques to determine whether the cached version of a file is valid.
In modern webpages there's often complex interaction between many different kinds of files (HTML, CSS, images, JavaScript, video, Flash, ...), and web developers often need detailed knowledge of differences among browsers in order to keep their pages working for everyone.
Each of these topics, and many more, could be discussed at length. Perhaps it's more practical to ask more specific questions about the topics you're interested in?
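To make the caching point a bit more concrete, here is a minimal sketch of how a browser-like cache might decide between reusing a stored copy and revalidating it. The class and function names are made up for illustration, and real browsers follow far more detailed rules; only the If-None-Match header and the 200/304 status codes are standard HTTP.

```python
from dataclasses import dataclass


@dataclass
class CachedResponse:
    body: bytes
    etag: str          # validator the server sent with the original response
    stored_at: float   # time the response was cached, in seconds
    max_age: float     # freshness lifetime, as in "Cache-Control: max-age"


def is_fresh(entry: CachedResponse, now: float) -> bool:
    """A fresh entry can be reused without contacting the server."""
    return now - entry.stored_at < entry.max_age


def revalidation_headers(entry: CachedResponse) -> dict:
    """Headers for a conditional GET once the entry has gone stale."""
    return {"If-None-Match": entry.etag}


def origin_status(current_etag: str, request_headers: dict) -> int:
    """What a server would answer to the conditional request."""
    if request_headers.get("If-None-Match") == current_etag:
        return 304  # Not Modified: the browser keeps using its cached body
    return 200      # the resource changed, so the full response is sent again


entry = CachedResponse(body=b"body { color: red }", etag='"v1"',
                       stored_at=0.0, max_age=60.0)
```

While the entry is fresh no request is made at all; once it is stale, the browser sends the validator and downloads the body again only if the server answers 200.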

You type maps.google.com (a Uniform Resource Locator, or URL) into the address bar of your browser and press Enter.
The hostname in a URL maps to one or more IP addresses. The mapping is stored on name servers, and the system that performs the lookup is called the Domain Name System (DNS).
The browser checks its cache to find the IP address for the hostname.
If it doesn't find it there, it asks the OS to resolve the name (e.g. via gethostbyname or getaddrinfo).
It then checks the router's cache.
It then checks the ISP's cache. If the record is not available there, the ISP's resolver makes a recursive request to different name servers.
It asks the com name server (there are many such name servers, e.g. for 'in', 'mil', 'us', etc.), which refers it to the google.com name server.
The google.com name server finds the matching IP address for maps.google.com in its DNS records and returns it to your DNS recursor, which sends it back to your browser.
The browser initiates a TCP connection with the server, using a three-way handshake:
The client machine sends a SYN packet to the server over the internet, asking whether it is open for new connections.
If the server has open ports that can accept new connections, it responds by acknowledging the SYN packet with a SYN/ACK packet.
The client will receive the SYN/ACK packet from the server and will acknowledge it by sending an ACK packet.
Then a TCP connection is established for data transmission!
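Application code never sends SYN or ACK packets itself; the operating system's TCP stack performs the handshake when a program asks for a connection. As a small sketch (the function name is made up, and it talks only to a listener on the local machine so that it runs without network access):

```python
import socket


def handshake_on_loopback() -> tuple:
    """Open a TCP connection to a local listener and return both views of it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    server_addr = listener.getsockname()

    # create_connection() returns once the SYN, SYN/ACK, ACK exchange has
    # completed; the kernel sends and receives those packets for us.
    client = socket.create_connection(server_addr)
    accepted, peer_addr = listener.accept()

    client_addr = client.getsockname()
    for sock in (client, accepted, listener):
        sock.close()
    return client_addr, peer_addr
```

Both return values describe the same client endpoint, once as seen by the client and once as seen by the server, which is only possible after the connection has been established.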
The browser will send a GET request asking for maps.google.com web page. If you’re entering credentials or submitting a form this could be a POST request.
The server sends the response.
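For illustration, here is roughly what the request looks like on the wire and how the first line of the response can be read. This is a simplified sketch with made-up helper names; a real browser sends many more headers (cookies, language preferences, and so on).

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",          # required in HTTP/1.1
        "Accept: text/html",
        "Connection: close",
        "",                       # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(line: str) -> tuple:
    """Split a response status line into (version, status code, reason)."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason
```

A form submission would use POST instead of GET on the first line and carry the form data in a body after the empty line.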
Once the server supplies the resources (HTML, CSS, JS, images, etc.) to the browser, the browser processes them as follows:
Parsing - HTML, CSS, JS
Rendering - Construct DOM Tree → Render Tree → Layout of Render Tree → Painting the render tree
The rendering engine starts getting the contents of the requested document from the networking layer. This will usually be done in 8 kB chunks.
A DOM tree is built from the chunks of the response as they arrive.
New requests are made to the server for each new resource that is found in the HTML source (typically images, style sheets, and JavaScript files).
When parsing is finished, the browser marks the document as interactive and starts executing scripts that are in "deferred" mode: those that should be executed after the document is parsed. The document state is then set to "complete" and a "load" event is fired.
Each CSS file is parsed into a StyleSheet object, where each object contains CSS rules with selectors and objects corresponding to the CSS grammar. The resulting tree is called the CSSOM.
On top of DOM and CSSOM, a rendering tree is created, which is a set of objects to be rendered. Each of the rendering objects contains its corresponding DOM object (or a text block) plus the calculated styles. In other words, the render tree describes the visual representation of a DOM.
After the construction of the render tree it goes through a "layout" process. This means giving each node the exact coordinates where it should appear on the screen.
The next stage is painting: the render tree will be traversed and each node will be painted using the UI backend layer.
Repaint: When changing element styles which don't affect the element's position on a page (such as background-color, border-color, visibility), the browser just repaints the element again with the new styles applied (that means a "repaint" or "restyle" is happening).
Reflow: When the changes affect document contents or structure, or element position, a reflow (or relayout) happens.
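The step where the parser discovers further resources in the HTML can be sketched with Python's standard HTML parser. This is only an approximation under simplifying assumptions: the class and the tag table are made up for this example, and a real browser looks at many more elements and attributes (and only fetches some link types).

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects the URLs of sub-resources referenced by a document."""

    # tag -> attribute that carries the resource URL (simplified)
    URL_ATTRIBUTES = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.resources: list[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRIBUTES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.resources.append(value)


def find_resources(html: str) -> list[str]:
    """Return referenced resource URLs in document order."""
    collector = ResourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources
```

Each URL found this way would lead to a new request, going through the cache check, DNS lookup and connection steps described earlier.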

I was also searching for the same thing and found this awesome, detailed answer being built collaboratively on GitHub.

I can describe one point here -
Determining which file/resource to execute, which language interpreter to load.
Pardon me if I am wrong in using interpreter here. There may be other mistakes in my answer, I will try to correct them later and include proper technical terms for things.
When the web server (e.g. Apache) has received the URI, it checks whether any existing rewrite rule matches it. If so, the rewritten URI is used. In either case, if the URI does not end in a file name, the default file is used, which is generally index.html, index.php, etc. Based on the extension of the file name, the appropriate Apache module for server-side language support handles the request, e.g. mod_php for PHP or mod_python for Python. The server-side language interpreter (for interpreted languages like PHP) then prepares the final HTML, or output in some other form, for the web server, which finally sends it as the HTTP response.
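The decision sequence described above (rewrite rule, default file, handler by extension) might be sketched like this. The tables and the function are hypothetical stand-ins and not Apache's actual configuration format or API; only the module names mod_php and mod_python come from the answer.

```python
import posixpath

# Hypothetical configuration, standing in for rewrite rules and module setup.
REWRITES = {"/home": "/index.php"}
DEFAULT_FILE = "index.html"
HANDLERS = {".php": "mod_php", ".py": "mod_python"}


def choose_handler(uri: str) -> tuple:
    """Return (file path, handler name) for a request URI."""
    path = REWRITES.get(uri, uri)        # 1. apply a matching rewrite rule
    if path.endswith("/"):               # 2. no file name: use the default file
        path += DEFAULT_FILE
    extension = posixpath.splitext(path)[1]
    handler = HANDLERS.get(extension, "static file")  # 3. pick by extension
    return path, handler
```

For example, choose_handler("/home") is rewritten to /index.php and handed to the PHP module, while choose_handler("/") falls back to the static index.html.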


Related

Will browser pull from cache if the same resource is being requested by a different origin?

Let's say you have a resource, could be an image, could be jQuery from a cdn. This resource is hosted at some 3rd party url, like https://example-cdn.com/resource.ext. Let's also assume it is cacheable (whatever that means--let me know if that is a non-trivial detail).
When https://website-a.com requests the resource (let's assume it was included in the html directly), it takes some time to load, but then the browser caches it for faster load next time.
Now, https://website-b.com is also including that resource in its html, using the exact same url (https://example-cdn.com/resource.ext).
My question is this: will the browser reach for the cached resource (because it was already fetched while loading https://website-a.com), or is there some reason that it would not be able to find it in the cache and have to load it over the network all over again?
Edit: This stackexchange answer seems to contain some related information. Can anyone verify that this answer is correct in all its assertions about caching? https://webmasters.stackexchange.com/a/84685
Yes, as far as HTTP semantics are concerned, the cached copy can be reused.
This follows from the semantics of HTTP and URLs. A URL is a Uniform Resource Locator: it provides the location of the resource, in a form that can be used anywhere, and which always indicates the same resource: in <a> elements of different web sites, on business cards, on advertising posters. An HTTP client (a web browser) knows that a URL on one web site refers to the same resource when used on a different web site, and so it is safe to reuse a cached copy.
The exception to this is when a URL is a relative URL (your example uses absolute URLs). To make use of a relative URL the client must resolve the URL, using some context, to produce an absolute URL. Different web sites have different contexts and thus resolve to different absolute URLs. It is the absolute URL that the client must use to fetch resources and which is used as the key in its cache. Be aware that current versions of the major browsers additionally partition their HTTP cache by the top-level site for privacy reasons, so in practice a copy cached while visiting one site may not be reused on another.
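The difference between absolute and relative references can be shown with urljoin from Python's standard library, which performs the same kind of resolution a browser does. The cache_key helper is made up for this sketch, and it models only the simple case where the absolute URL alone identifies a cache entry.

```python
from urllib.parse import urljoin


def cache_key(page_url: str, reference: str) -> str:
    """Resolve a reference found on a page to the absolute URL used as cache key."""
    return urljoin(page_url, reference)


# The same absolute URL referenced from two different sites: identical key.
key_a = cache_key("https://website-a.com/index.html",
                  "https://example-cdn.com/resource.ext")
key_b = cache_key("https://website-b.com/index.html",
                  "https://example-cdn.com/resource.ext")

# The same relative reference on two different sites: different keys.
rel_a = cache_key("https://website-a.com/index.html", "resource.ext")
rel_b = cache_key("https://website-b.com/index.html", "resource.ext")
```

Here key_a and key_b are equal, while rel_a and rel_b point at each site's own copy of resource.ext.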

Where is the Data stored on Website

I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
Look at the company name, Agilent Technologies Inc.
It's neither in the page source nor in any JSON response that I could find.
But it does show up in the DOM in Chrome Developer Tools.
I have looked at and analysed almost every request that the page sent, but still couldn't find where this data is saved.
By "where the data is saved" I mean: where can I scrape that data from,
for example using python-requests and BeautifulSoup?
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
I am still learning Python, and it would be very useful if someone could help me with this.
Thanks in advance.
After the HTML is loaded, JavaScript requests the data through an XMLHttpRequest, and the data is inserted into the page as soon as the response arrives at your client. That's why you can see the DOM element in the element inspector.
You didn't mention what goal you want to achieve or what tool you are using. Please be specific in your question. If you are not familiar with this kind of pattern, search for AngularJS and look at some examples.
"I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer."
It means that JavaScript embedded in the page is sending an extra HTTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and the JavaScript in the page is then injecting the text into the DOM in the appropriate place.
"Where is the Data stored on Website"
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.
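For scraping, this means the data has to be read from the XMLHttpRequest's response rather than from the page source. The sketch below assumes the response is JSON; the payload, its field names and the placeholder markup are invented for illustration, since the real endpoint and format are not known here.

```python
import json

# Hypothetical body of the XMLHttpRequest response, as it might appear in the
# Network tab of the developer tools. The real field names are unknown.
xhr_body = '{"results": [{"companyName": "Agilent Technologies Inc."}]}'

# The initial HTML only contains an empty placeholder that the script fills in.
page_source = '<div id="results"></div>'


def company_names(body: str) -> list[str]:
    """Extract company names from the (assumed) JSON payload."""
    payload = json.loads(body)
    return [item["companyName"] for item in payload["results"]]
```

With python-requests, the usual approach would be to copy that request's URL and headers from the Network tab, fetch it directly, and parse the response (for JSON, with the response's json() method) instead of running BeautifulSoup over the page source.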

Could we pass GET data to css?

I just came across a website pagesource and saw this in the header:
<link href="../css/style.css?V1" rel="stylesheet" type="text/css" />
Could we actually pass GET data to CSS? I tried searching but found no results apart from using PHP. Could anyone help me make sense of the ?V1 after the .css?
I know this forum is for asking programming problems, however I decided to ask this since I have found no results in my searches
First of all, no, you can't pass GET parameters to CSS. Sorry. That would have been great, though.
As for the example URL: it could be a CSS page generated dynamically by a web server (it doesn't have to be PHP). In that case the server can serve different pages, or different versions of the same page, which might explain the meaning of V1: Version 1. The server can also generate the page dynamically with a server-side template. This is an example from the Jade documentation:
http://cssdeck.com/labs/learning-the-jade-templating-engine-syntax
It can also just be a cache buster, used for versioning purposes. When you request a URL, the browser fetches it only if it doesn't already have a cached copy for that specific URL. If you have changed your content (in this instance the CSS file) and want the browser to use it rather than the cached version, you can change the URL and trick the browser into thinking it's a new resource that is not cached, so it fetches the new content from the server. V1 can then have a semantic meaning to the developer, serving as a note (i.e. I've changed this file once, twice, etc.), without doing anything other than breaking the cache. This question addresses cache busting.
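A common variation on this idea derives the version from the file's content, so the URL changes exactly when the file does. A minimal sketch, assuming a hypothetical helper name and a query parameter called v (neither comes from the site in the question):

```python
import hashlib


def versioned_url(path: str, content: bytes) -> str:
    """Append a short content hash so that changed content gets a new URL."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"


old_url = versioned_url("/css/style.css", b"body { color: red }")
new_url = versioned_url("/css/style.css", b"body { color: blue }")
```

Because old_url and new_url differ, a browser holding a cached copy under the old URL treats the new one as an uncached resource and fetches it.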
There are different concepts.
At first, it only is a link - it has a name, it might have an extension, but this is just a convention for humans, and nothing more than a resource identifier for the server. Once the browser requests it, it becomes a server request for a resource. The server then decides how to handle this request. It might be a simple file it just has to return, it might be a server side script, which has to be executed by a server side scripting interpreter, or basically anything else you can imagine.
Again, do not trick yourself into thinking "this is a CSS file" just because it has a css extension or is called style.
Whatever runs at the server, and actually answers the request, will return something. And this something then is given a meaning. It might be CSS, it might be HTML, it might be JavaScript, or an image or just a binary download. To help the browser to understand what it is, the server returns a Content-Type header.
If no content type is given, the browser has to guess what it is. Or the nice web author gave a hint about what to expect as the response; in this case he gave the hint text/css. Again, this is how the returned content should be interpreted by the client/browser, not how that content is supposed to be created on the server side.
And about the ?V1? This could mean different things. Maybe the user can configure a style (theme) for the website and this method is used to dispatch different styles. Or it can be used for something called "cache busting" (look it up).
You can pass whatever you want; the server decides what to do with the data.
After all, PHP isn't your only option for creating a server. If I wrote a server in Node.js, set up a route for /css/style.css and made it return different things depending on what query was given, neither the server nor the browser would bat an eyelid.
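The same idea, sketched in Python rather than Node.js to keep it short: a handler for /css/style.css that looks at the query string and decides what to return. The theme parameter and the stylesheet contents are invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical stylesheets the server can choose between.
THEMES = {
    "dark": "body { background: black }",
    "light": "body { background: white }",
}


def serve_stylesheet(url: str) -> tuple:
    """Return (Content-Type, body) for a stylesheet request."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    theme = query.get("theme", ["light"])[0]
    # Unknown or absent parameters (such as a bare "V1") fall back to the default.
    body = THEMES.get(theme, THEMES["light"])
    return "text/css", body
```

The browser only sees the Content-Type header and the body; whether the server ignored the query, used it to pick a theme, or generated the CSS from a template is invisible to it.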

Handling HTML PDFs with Auth Required Images

I'm currently creating PDF documents server-side with wkhtmltopdf and Node.js. The client sends the HTML to be rendered (which may include img tags with a source). When the user previews the HTML in the browser, the images they uploaded to their account show fine, because the user is authenticated via the browser and the Node route can simply look up the image based on the user id (saved in the session) and the image id (passed in each image request).
The issue arises when wkhtmltopdf's WebKit renderer tries to load the images: the renderer is not authenticated, because it makes the image requests from a separate process that Node starts by exec'ing wkhtmltopdf. A request to something like GET /user/images/<imageId> fails because no session is set when the request is made inside the headless wkhtmltopdf renderer.
Is there a way to pass authentication via some wkhtmltopdf option or possibly a different way of authentication for images? The only restriction is not making images public.
I asked a similar question a while back that might help you:
Generate PDF Behind Authentication Wall
wkhtmltopdf has --cookie-jar, which should get you what you need. Note that it didn't work for me, and I wound up answering my own question with an alternate solution. In a nutshell, I wound up accessing the page via cURL (much more flexible), then writing a temporary file that I converted to PDF, then deleting the temporary file.
A little round-a-bout, but it got the job done.
To implement authentication I allowed a cookie id flag (with connect the key defaults to connect.sid) as a query option in my image routes. The only "gotcha" is that, since images are requested from the server's perspective, you must ensure all your image paths are absolute domain paths rather than relative to your application (unless those two are the same, of course).
Steps for Express.js:
Set up the id flag middleware, which checks for, say, sid in the query via req.query (e.g. ?id=abc123, where abc123 is req.cookies['connect.sid'], or req.signedCookies['connect.sid'] if you're using a secret, as you probably should). You may need to ensure the query middleware is set up first.
Ensure req.headers contains this session id key and value before the cookie parser runs, so the session is set up properly (e.g. if a cookie header already exists, append the new cookie to it; if none does, set it directly: req.headers.cookie = 'connect.sid=abc123;').
Ensure all image paths contain the full URL (e.g. https://www.yourdomain.com/images/imageId?id=abc123).
Some extra tidbits: the image source replacement should probably happen at the server level, to ensure the user does not copy/paste the image URL with the session id and, say, email it to a friend, which obviously leaves the door open to account hijacking.
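The middleware step above boils down to copying the session id from the query string into the Cookie header before session handling runs. Here is a language-neutral sketch of that logic in Python; it is not Express code, the function name is made up, and only the connect.sid cookie name and the id query parameter come from the answer.

```python
from urllib.parse import parse_qs

SESSION_COOKIE = "connect.sid"  # cookie name from the answer


def inject_session_cookie(query_string: str, headers: dict) -> dict:
    """Copy a session id from the query string into the Cookie header."""
    session_id = parse_qs(query_string).get("id", [None])[0]
    if session_id is None:
        return headers                      # nothing to do for normal requests
    cookie = f"{SESSION_COOKIE}={session_id}"
    existing = headers.get("cookie")
    updated = dict(headers)                 # leave the caller's dict untouched
    # Append to an existing Cookie header, or create one if there is none.
    updated["cookie"] = f"{existing}; {cookie}" if existing else cookie
    return updated
```

The cookie parser and session middleware that run afterwards then see the request as if the headless renderer had sent the session cookie itself.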

Apache and HTML, post requests and actions - does an absolute URL leading to the same server get parsed as a local URL?

Not 100% sure if this is the right SE site to ask this, so feel free to move/warn me.
If I have a site www.mysite.com with a form on it and define its action as "http://www.mysite.com/handlepost" instead of "/handlepost", does it still get parsed as a local address by apache? That is, will apache figure out that I'm trying to send my form data to the same server the form resides on and do an automatic local post, or will the data be forced to make a round trip, going online, looking up the domain and actually being sent as an outside request?
Apache does not look at this information. It's your browser that does this job.
On the Apache side the job is only outputting content (HTML in this case); Apache does not care how you write the URLs in this content.
On the browser side the page is analysed and GET requests (images, etc.) are sent automatically to all collected URLs. The browser SHOULD know that a relative URL /foo is in fact http://currentsite/foo, or it's a really dumb browser. That is its job. It is then also its job to send the request to the right server (and to know whether it should make a new DNS request, open a new HTTP connection, reuse an existing open connection, open several connections, usually at most 3 per DNS name, etc.). Apache does nothing in this part of the job.
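To illustrate that the two ways of writing the form action end up as the same request, here is a small sketch using Python's urljoin, which resolves references the way a browser does. The page URL and the helper name are assumptions for the example.

```python
from urllib.parse import urljoin, urlsplit


def request_target(page_url: str, action: str) -> tuple:
    """Resolve a form action against the page URL, as a browser would."""
    parts = urlsplit(urljoin(page_url, action))
    return parts.scheme, parts.netloc, parts.path


page = "http://www.mysite.com/form.html"   # hypothetical page holding the form
relative = request_target(page, "/handlepost")
absolute = request_target(page, "http://www.mysite.com/handlepost")
```

Both calls yield the same scheme, host and path, so the request the server receives is the same either way.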
So why are absolute URLs bad? Not because of the work the browser has to do to handle them (which is in fact nothing, since its job is transforming relative URLs into absolute ones). It's because if your web application uses only relative URLs, the admin of the web server has far more options for proxying your application. For example:
he will be able to serve your web application on several different DNS domains
(and so make the browser think it's talking to several servers, parallelizing static file downloads)
he could also use this multi-domain setup to run the application for different customers
he could build HTTPS access for external network access and HTTP (without the S) access on a local name for the local network
And if your application builds absolute URLs, these tasks become much harder.
Don't use absolute URLs. I feel it will make a round trip in your case, since you used an absolute URL for the action. So it's better to use relative URLs.