Retrieving part of the page body

Retrieving part of the page body - html

Is there a way to retrieve only part of a page? I need to access a certain number of characters after the sexdecillionth character, and I know that it is possible. I can't request the page, because it returns response body too large. What should I do?
The platform and language don't matter to me, because I can access them all.

Retrieve it from what?
Generally speaking, you'll need to open an HTTP connection and use the Range: header to request the start location of the page download. I don't know that every web server supports it, though, so your mileage may vary.

Related

W3 Total Cache - Json expire header

I need to set up a specific expire header to json files, much much lower than the rest of files. Can I do this in W3TV? I couldn't find a way.
The default of 31536000 seconds is ok for all other file types. But I use the JSON REST API to deliver data to an AngularJS+Cordova App, and was having a problem with content not being updated. We figured out that It was the json expire header when we manually configured 300 seconds, problem is W3 TC constantly overrides this change.
Is there a way to tell W3 TC to use a lower expire header for json files? Or a way manually enter a value that's not overridden by W3TC?
The only idea I've come up with is to rewrite the json expire header rule at the bottom of .htaccess, but I don't know if this is going to prevent W3TC to edit or erase it. Also, having a repeated rule seams just plain wrong from the begging.
Or is there any way to tell Angular to re download the json file even if the cache header tells it to keep it for a year?
What do you think?
Thanks!
FG

Add a random property to the end of the URL to fetch the JSON file. This is what jQuery does to ensure that the cache is not used for JSON requests.
Assume your file URL is http://example.com/myfile.json, then you would fetch it with http://example.com/myfile.json?__random=1 the first time and http://example.com/myfile.json?__random=2 the second time etc. of course you should use totally random numbers instead of 1, 2 etc.

W3TC offers 3 groups of header policies to manage for user agent (browser) caching. The third group on the browser cache settings page has an "Other" section where the headers available there would be applied to the json response from WP unless there's a bug or implementation that prevents this behavior from occurring. W3TC sets the directives for the nginx or apache web server to apply the headers specified on this page and there are ways that this policy could fail to be applied, however the intent is as indicated.

Where is the Data stored on Website

I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
If you see the company name - Agilent Technologies Inc.
Its neither there in page source, nor in any json format.
But it does show in the Dom of Chrome Developer tool.
I have looked and analysed almost every requests that it sent, but still couldn't find where this data is saved.
By where the data is saved - I am looking to find where I can scrape that data from?
If by using python-requests and BeautifulSoup
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
I am still learning python, and it would be a very useful information if someone helps me with this.
Thanks in advance.

After the HTML is loaded, js requests for the data through an XMLHTTPREQUEST which is loaded right after the request is received on your client. That's why you see the DOM element right there using element inspector.
You didn't mention what goal you want to achieve or what tool you are using. Please be specific on your question. If you do not have any idea about this kind of pattern, google out angularjs, see some example.

do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
It means that javascript embedded in the page is sending an extra HHTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and the javascript in the page is then injecting the text into the DOM in the appropriate place.
Where is the Data stored on Website
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.

Could we pass GET data to css?

I just came across a website pagesource and saw this in the header:
<link href="../css/style.css?V1" rel="stylesheet" type="text/css" />
Could we actually pass GET data to css? I tried searching but found no results apart from using PHP. Could anyone help make meaning of the ?V1 after the .css
I know this forum is for asking programming problems, however I decided to ask this since I have found no results in my searches

First of all, no you can't pass GET parameters to CSS. Sorry. That would have been great though.
As for the example url. It can either be a CSS page generated by any web server (doesn't have to be PHP). In this case the server can serve different pages or versions of the same page which might explain the meaning of V1, Version 1. The server can also dynamically generate the page with a server-side template. This is an example from the Jade documentaion:
http://cssdeck.com/labs/learning-the-jade-templating-engine-syntax
It can also just be used as cache buster, for versioning purposes. Whenever you enter a url the browser will try to fetch it only if it doesn't already have a cached copy which is specific to that URL. If you have made a change in your content (in this instance the css file) and you want the browser to use it and not the cached version you can change the url and trick the browser to think it's a new resource that is not cached, so it'll fetch the new content from the server. V1 can then have a symantic meaning to the developer serving as a note (ie I've changed this file once...twice..etc) but not actually do anything but break the cache. This question addresses cache busting.

There are different concepts.
At first, it only is a link - it has a name, it might have an extension, but this is just a convention for humans, and nothing more than a resource identifier for the server. Once the browser requests it, it becomes a server request for a resource. The server then decides how to handle this request. It might be a simple file it just has to return, it might be a server side script, which has to be executed by a server side scripting interpreter, or basically anything else you can imagine.
Again, do not trick yourself in thinking "this is a CSS file", just because it has a css extension, or is called style.
Whatever runs at the server, and actually answers the request, will return something. And this something then is given a meaning. It might be CSS, it might be HTML, it might be JavaScript, or an image or just a binary download. To help the browser to understand what it is, the server returns a Content-Type header.
If no content type is given, the browser has to guess what it is. Or the nice web author gave a hint on what to expect as response - in this case he gave the hint of text/css. Again, this is how the returned content should be interpreted by the client/browser, not how that content is supposed to created on the server side.
And about the ?V1? This could mean different things. Maybe the user can configure a style (theme) for the website and this method is used to dispatch different styles. Or it can be used for something called "cache busting" (look it up).

You can pass whatever you want; the server decides what to do with the data.
After all, PHP isn't your only option for creating a server. If i wrote a server in Node.js, set up a route for /css/style.css and made it return different things depending on what query was given, neither the server nor browser will bat an eyelid.

Issues with images, security and cache

I've got some issues regarding images and secure access to the cloud (S3, in this case).
I do want to have direct access to avoid stressing my server so I do have two options:
1.) Generate a signed url on my server and send it back to client to load from
2.) Generate an Authorization-Header sent within the request from the client
Now, 1. & 2. work just fine, however, I am in trouble regarding images in html.
For 1.), images will be re-requested each time as the url changes every time and thus renders caching useless.
For 2.), caching would work as the url'd stay constant though I've seeked the net to figure how to add a custom http request header to image request including my security token without luck.
So, what options are left for me? How do others resolve that issue?
thanks
Alex

What is the complete process from entering a url to the browser's address bar to get the rendered page in browser?

I'm thinking about this question for a long time. It is a big question, since it almost covers all corners related to web developing.
In my understanding, the process should be like:
enter the url to the address bar
a request will be sent to the DNS server based on your network configuration
DNS will route you to the real IP of the domain name
a request(with complete Http header) will be sent to the server(with 3's IP to identify)'s 80 port(suppose we don't specify another port)
server will search the listening ports and forward the request to the app which is listening to 80 port(let's say nginx here) or to another server(then 3's server will be like a load balancer)
nginx will try to match the url to its configuration and serve as an static page directly, or invoke the corresponding script intepreter(e.g PHP/Python) or other app to get the dynamic content(with DB query, or other logics)
a html will be sent back to browser with a complete Http response header
browser will parse the DOM of html using its parser
external resources(JS/CSS/images/flash/videos..) will be requested in sequence(or not?)
for JS, it will be executed by JS engine
for CSS, it will be rendered by CSS engine and HTML's display will be adjusted based on the CSS(also in sequence or not?)
if there's an iframe in the DOM, then a separate same process will be executed from step 1-12
The above is my understanding, but I don't know whether it's correct or not? How much precise? Did I miss something?
If it's correct(or almost correct), I hope:
Make the step's description more precise in your words, or write your steps if there is a big change
Make a deep explanation for each step which you are most familiar with.
One answer per step. Others can make supplement in each answer's comment.
And I hope this thread can help all web developers to have a better understanding about what we do everyday.
And I will update this question based on the answers.
Thanks.

As you say this is a broad question where it's possible to go into great detail on a number of topics. There's nothing wrong with the sequence you described, but you're leaving out a lot of detail. To mention a few:
The DNS layer can help direct clients to different servers based on geographical location to help with load balancing and latency minimization, and one server can respond to requests from many different DNS names.
A browser can make different types of requests (GET, POST, HEAD, etc), and usually includes several different headers including cookies, browser capabilities, language preferences, etc.
Most browsers usually maintain a cache in order to avoid downloading stuff many times, and use various techniques to determine whether the cached version of a file is valid.
In modern webpages there's often complex interaction between many different kinds of files (HTML, CSS, images, JavaScript, video, Flash, ...), and web developers often need detailed knowledge of differences among browsers in order to keep their pages working for everyone
Each of these topics, and many more, could be discussed at length. Perhaps it's more practical to ask more specific questions about the topics you're interested in?

You type maps.google.com(Uniform Resource Locator) into the address bar of your browser and press enter.
Every URL has a unique IP address associated with it. The mapping is stored in Name Servers and this procedure is called Domain Name System.
The browser checks its cache to find the IP Address for the URL.
If it doesn't find it, it checks its OS to find the IP address (gethostname);
It then Checks the router's cache.
It then checks the ISP's cache. If it is not available there the ISP makes a recursive request to different name servers.
It Checks the com name server (we have many name servers such as 'in', 'mil', 'us' etc) and it will redirect to google.com
google.com name server will find the matching IP address for maps.google.com in its’ DNS records and return it to your DNS recursor which will send it back to your browser.
Browser initiates a TCP connection with the server.It uses a three way handshake
Client machine sends a SYN packet to the server over the internet asking if it is open for new connections.
If the server has open ports that can accept and initiate new connections, it’ll respond with an ACKnowledgment of the SYN packet using a SYN/ACK packet.
The client will receive the SYN/ACK packet from the server and will acknowledge it by sending an ACK packet.
Then a TCP connection is established for data transmission!
The browser will send a GET request asking for maps.google.com web page. If you’re entering credentials or submitting a form this could be a POST request.
The server sends the response.
Once the server supplies the resources (HTML, CSS, JS, images, etc.) to the browser it undergoes the below process:
Parsing - HTML, CSS, JS
Rendering - Construct DOM Tree → Render Tree → Layout of Render Tree → Painting the render tree
The rendering engine starts getting the contents of the requested document from the networking layer. This will usually be done in 8kB chunks.
A DOM tree is built out of the broken response.
New requests are made to the server for each new resource that is found in the HTML source (typically images, style sheets, and JavaScript files).
At this stage the browser marks the document as interactive and starts parsing scripts that are in "deferred" mode: those that should be executed after the document is parsed. The document state is set to "complete" and a "load" event is fired.
Each CSS file is parsed into a StyleSheet object, where each object contains CSS rules with selectors and objects corresponding CSS grammar. The tree built is called CSSCOM.
On top of DOM and CSSOM, a rendering tree is created, which is a set of objects to be rendered. Each of the rendering objects contains its corresponding DOM object (or a text block) plus the calculated styles. In other words, the render tree describes the visual representation of a DOM.
After the construction of the render tree it goes through a "layout" process. This means giving each node the exact coordinates where it should appear on the screen.
The next stage is painting–the render tree will be traversed and each node will be painted using the UI backend layer.
Repaint: When changing element styles which don't affect the element's position on a page (such as background-color, border-color, visibility), the browser just repaints the element again with the new styles applied (that means a "repaint" or "restyle" is happening).
Reflow: When the changes affect document contents or structure, or element position, a reflow (or relayout) happens.

i was also searching for the same thing and found this awesome detailed answer being built collaboratively at github

I can describe one point here -
Determining which file/resource to execute, which language interpreter to load.
Pardon me if I am wrong in using interpreter here. There may be other mistakes in my answer, I will try to correct them later and include proper technical terms for things.
When the web server (e.g. apache) has received the URI it checks if there is any existing rewrite rule matching it. In that case the rewritten URI is taken. In either case, if there is no file name to end the URI, the default file is loaded, which is generally index.html or index.php etc. According to the extension of the file name, the appropriate apache module for server-side programming language support is loaded, e.g. mod_php for PHP, mod_python in case of python. The appropriate server side language interpreter (considering interpreted languages like PHP) then prepares the final HTML or output in some other form for the web server which finally sends it as the HTTP response.

I hope above image help you to understand whole process.
Full article is here

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008