How to deal with crawlers and outdated assets? - yii2

I've got the following error with my web application:
2017-12-02 22:32:39
[10.133.0.13][-][-][error][yii\web\HttpException:404]
yii\base\InvalidRouteException: Unable to resolve the request
"assets/7adcf7ba/site.css". in
/var/www/html/my-website/vendor/yiisoft/yii2/base/Module.php:537
It was caused by Google's web crawler, as I can see from HTTP_USER_AGENT. The folder 7adcf7ba does not exist (anymore), so I think the crawler is somehow using cached data.
How can I prevent the crawler from trying to access this outdated resource file and make it use the current one instead?
I don't want a solution based on Google's Search Console, since Google is of course not the only web crawler, and I don't want to maintain settings for several crawlers.
Can I use the robots.txt? Meta tags? Special attributes? How should I do it?

You must set the $forceCopy property of yii\web\AssetManager to true:
'components' => [
    ...
    'assetManager' => [
        'class' => 'yii\web\AssetManager',
        'forceCopy' => true,
    ],
    ...
],
"You may want to set this to be true during the development stage to make sure the published directory is always up-to-date. Do not set this to true on production servers as it will significantly degrade the performance." from: $forceCopy
[Edit]
More explanation is required.
"A crawler (also called a spider or robot), is a software that analyzes the data of a network in a methodical and automatic way, usually on behalf of a search engine. Crawlers typically acquire a textual copy of all the documents visited and insert them into an index.
An extremely common use of crawlers is on the Web. On the Web, the crawler relies on a list of URLs to be visited provided by the search engine (which initially is based on the addresses suggested by users or on a list pre-filled by the programmers themselves ). When analyzing a URL, it identifies all the hyperlinks in the document and adds them to the list of URLs to visit. The process can be concluded manually or after a certain number of links has been followed. In addition, Internet crawlers have the option to be directed by the "robots.txt" file located in the root of the site. Within this file, you can indicate which pages should not be analyzed. The crawler has the right to follow the advice, but not the obligation." from: Web crawler
So you can configure robots.txt to allow or disallow a crawler from indexing specific pages in its search engine, but not to avoid asset errors. The publication of the assets is a separate matter.
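For completeness, a minimal robots.txt sketch that asks crawlers to skip the published asset directories could look like the following (the /assets/ path is an assumption based on the error message above):
User-agent: *
Disallow: /assets/
Note that this only advises well-behaved crawlers not to fetch those URLs in the future; it does not make an already-indexed, outdated asset URL work again.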

Related

Questions using an HTML5 video-tag 'src' URL in Google cloud storage

I'm designing/developing a simple HTML5-based webpage.
But, rather than having the videos (e.g. MP4 and/or WEBM files)
based locally on the web-server, I want to store them all
in Google 'cloud-storage', by referencing them with a full
URL in the 'src' attribute of a <video> tag.
So, my first question is simply whether it's possible to derive
such a reference URL, to a video file that I've uploaded into my
Google acct's basic 15-GB of free storage? (Or do I need
to first buy an 'official' starter unit of Google Cloud Storage?)
Secondly, could someone please point me to a tutorial or 'recipe'
for how to compute such a URL, so that I can build a simple initial
prototype to validate such a design approach.
TIA...
Dave
It's actually almost trivial (once I bit the bullet and registered
for a 60-day free trial of "Google Cloud Platform").
It seems those older-style URLs (full of long strings of hex-chars)
are a thing of the past. That actually makes sense, since the
'bucket name' that you create to store your files in, must be
"globally-unique" and becomes part of the URL.
https://storage.googleapis.com/your-bucket-name/Steve_Jobs-2mins.mp4
So, it becomes as simple as just using their 'console' tool to create
a bucket, upload your file(s) into that bucket, declare each 'public/shared',
and then reference the resulting URL in the 'src' attribute of your
video or source HTML tag.
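For example, a minimal snippet referencing such a bucket URL from an HTML5 video tag might be sketched like this (the bucket and file names are simply the illustrative ones from the URL above):
<video controls width="640">
  <source src="https://storage.googleapis.com/your-bucket-name/Steve_Jobs-2mins.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>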
You can view my working example here:
http://weasel.firmfriends.us/HTMLVideoFromCloud/
[ For details, you can 'view page source' on the HTML. ]
Cheers...
Dave

View cached Appcache data

My application is currently using HTML5 appcache.
I want to get the hash of files that I get from update() events. However, I can't seem to find out how to access the resources I downloaded.
I want to do something like
$.get( "/sunflowers.png", function( data ) {
hash(data)
});
I know that I can view the cached resources via Chrome's internals pages; however, I hope to automate this process.
PS: Bump for Chromium devs! Please advise.
AppCache is effectively deprecated at this point, so it's not likely that the answer for this original use case would still be relevant.
But in general, it's worth pointing out that there's a more "official" way of confirming that the contents of a downloaded subresource match the expected local hashes: using subresource integrity.
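As a rough sketch, subresource integrity lets the browser verify a fetched file against a hash you declare in the markup (the URL and the hash placeholder below are purely illustrative):
<script src="https://example.com/framework.js"
        integrity="sha384-BASE64_HASH_OF_THE_FILE"
        crossorigin="anonymous"></script>
If the downloaded file does not match the declared hash, the browser refuses to use it, which covers the "does my downloaded copy match the expected content?" use case without any manual hashing.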

Caching all resources of the html page with HTML5 app cache

Is there a way to specify (in the cache manifest file) that all the resources included in the html page are to be cached?
I'm building a dynamic web app and want to give the user the ability to view the app while offline. Therefore I need all the images (for which the source is set from file names stored in the database according to the query string provided in the request) in the page cached. Basically what I need is something like * which can be used in the NETWORK and FALLBACK sections.
If there is no such way to specify this in the manifest file, what is the best approach to solve this? For example, making the manifest itself dynamic and including the resources based on a query string passed to that might work, but it involves getting the list of resources from the db again.
Any help is greatly appreciated!
You can't use a wildcard in the CACHE section.
The approach you described seems practicable. But why retrieve the resources from the DB again? Once you've fetched them all, give them to a listener which does the generation, or store them in a session attribute from which you can fetch them to generate the manifest.
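For reference, a dynamically generated manifest might end up looking roughly like this (the file names are made up; every cached resource has to be listed explicitly, while * is allowed in the NETWORK section):
CACHE MANIFEST
# version 1 - change this comment to force clients to re-download

CACHE:
/css/site.css
/js/app.js
/images/sunflowers.png

NETWORK:
*

FALLBACK:
/ /offline.html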

Why doesn't Wikipedia have extensions?

Look at a random wikipedia article like http://en.wikipedia.org/wiki/Impostor_syndrome, I see that there's no .html attached to the end of the address. In fact, if I do try to put a .html after it, Wikipedia tells me "Wikipedia does not have an article with this exact name." How come it doesn't need any file extensions?
More a superuser question?
There is no law saying that an HTML file has to end in .html or .htm, and since the wiki generates pages from a database there is really no file there anyway (except in a cache).
Not having .htm or .php is more sensible - why should you care what technology they use when you ask for a URL? It would be like having to put the operating system of the recipient at the end of their email address.
If you make a call to a website, it probably looks like
www.example.com/siteA/index.html
This request just tells the web server that you want to see a resource called index.html in siteA.
The website that runs on this server has to determine what you want to see and how the data is loaded.
index.html could be a file in the siteA directory,
or
it could be a row with the key "index.html" in the siteA table of your database.
So the part siteA/index.html is just a resource identifier. The grammar of this resource identifier is completely free and is determined per website.
URL rewriting is also common, to make URLs easier to read and remember.
For example, there could be a rewrite rule to accomplish the following:
if the user enters something like
www.example.com/download/demo.zip
rewrite it so your website sees it as:
www.example.com/download.php?file=demo.zip
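With Apache's mod_rewrite, such a rule might be sketched roughly like this (the paths simply mirror the example above):
# rewrite /download/<name> to download.php?file=<name>
RewriteEngine On
RewriteRule ^download/(.+)$ download.php?file=$1 [L,QSA]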
Wikipedia's servers map the URL to the page you want. .html is just a naming convention that today is mostly historical, from the period of static pages when URLs actually were names of files on the server. In fact, there may be no file at all: the server queries the database and a web framework sends out the HTML on the fly.
Wikipedia is most likely using the Apache module mod_rewrite in order not to have to map URLs directly to file system paths.
See: http://en.wikipedia.org/wiki/Rewrite_engine#Web_frameworks
However, programming languages can also take control of the incoming URLs and return data depending on the structure of the link according to some set of rules; for example, the Django web framework employs a URL dispatcher.
That's because Wikipedia uses MediaWiki's short URL feature.
Actually, when you search for an article it really loads a PHP file. Try searching for a word that doesn't exist, for example "Pazaz". The URL is http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=pazaz . Notice index.php in the URL.
To tell the truth, it's not a MediaWiki feature, it's Apache. For further info see http://www.mediawiki.org/wiki/Manual:Short_URL .
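The Apache configuration described on that manual page essentially boils down to a rewrite rule along these lines (a sketch; the exact paths depend on where the wiki is installed):
# map /wiki/Article_Name onto MediaWiki's index.php entry point
RewriteEngine On
RewriteRule ^/?wiki(/.*)?$ %{DOCUMENT_ROOT}/w/index.php [L]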
URL routing is your answer. For example, in ASP.NET, read the excerpt below:
The ASP.NET MVC framework includes a flexible URL routing system that enables you to define URL mapping rules within your applications. The routing system has two main purposes:
Map incoming URLs to the application and route them so that the right Controller and Action method executes to process them
Construct outgoing URLs that can be used to call back to Controllers/Actions (for example: form posts, links, and AJAX calls)
I would suggest that sites like this use some sort of Model View Controller framework similar to Ruby on Rails where the url 'directories' form a part of a request/url route...
In frameworks that are MVC based, the url 'directories' can dictate what View/Controller to utilise as well as what action should be taken with the data.
eg: shop.com/product/carrots
Where product is a view/controller and carrots is the data. The framework then analyses which action/route to take; the default could be viewing the product information and the price of the carrots.
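Since the question that started this thread is about Yii2, here is a hedged sketch of the same idea using Yii2's URL manager (the rule and the product controller/action are assumptions for illustration, not something from the original posts):
'components' => [
    'urlManager' => [
        'enablePrettyUrl' => true,
        'showScriptName' => false,
        'rules' => [
            // shop.com/product/carrots -> ProductController::actionView('carrots')
            'product/<name:\w+>' => 'product/view',
        ],
    ],
],
With a rule like this, the 'directory' segments of the URL are parsed into a route (product/view) plus parameters, rather than pointing at any physical file.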

What is the complete process from entering a url to the browser's address bar to get the rendered page in browser?

I've been thinking about this question for a long time. It is a big question, since it covers almost every corner of web development.
In my understanding, the process should be like:
enter the URL into the address bar
a request will be sent to a DNS server based on your network configuration
DNS will resolve the domain name to the real IP address
a request (with a complete HTTP header) will be sent to port 80 of the server identified by the IP from step 3 (assuming we don't specify another port)
the server will check its listening ports and forward the request to the app which is listening on port 80 (let's say nginx here) or to another server (in which case the server from step 3 acts as a load balancer)
nginx will try to match the URL against its configuration and either serve a static page directly, or invoke the corresponding script interpreter (e.g. PHP/Python) or another app to get the dynamic content (with DB queries, or other logic)
an HTML document will be sent back to the browser with a complete HTTP response header
the browser will parse the HTML into a DOM using its parser
external resources (JS/CSS/images/flash/videos...) will be requested in sequence (or not?)
for JS, it will be executed by the JS engine
for CSS, it will be rendered by the CSS engine and the HTML's display will be adjusted based on the CSS (also in sequence, or not?)
if there's an iframe in the DOM, then the same process will be executed again for it, from steps 1-12
The above is my understanding, but I don't know whether it's correct or how precise it is. Did I miss something?
If it's correct (or almost correct), I hope you will:
Make each step's description more precise in your own words, or write out your own steps if there is a big change
Give a deep explanation of the steps you are most familiar with
One answer per step. Others can add supplements in each answer's comments.
And I hope this thread can help all web developers get a better understanding of what we do every day.
And I will update this question based on the answers.
Thanks.
As you say this is a broad question where it's possible to go into great detail on a number of topics. There's nothing wrong with the sequence you described, but you're leaving out a lot of detail. To mention a few:
The DNS layer can help direct clients to different servers based on geographical location to help with load balancing and latency minimization, and one server can respond to requests from many different DNS names.
A browser can make different types of requests (GET, POST, HEAD, etc), and usually includes several different headers including cookies, browser capabilities, language preferences, etc.
Most browsers maintain a cache in order to avoid downloading things many times, and use various techniques to determine whether the cached version of a file is still valid (see the example exchange below).
In modern webpages there's often complex interaction between many different kinds of files (HTML, CSS, images, JavaScript, video, Flash, ...), and web developers often need detailed knowledge of differences among browsers in order to keep their pages working for everyone.
Each of these topics, and many more, could be discussed at length. Perhaps it's more practical to ask more specific questions about the topics you're interested in?
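To illustrate the cache-validation point above, a conditional revalidation exchange might look roughly like this (the path is borrowed from the question at the top of the page; the host and ETag value are made up):
GET /assets/7adcf7ba/site.css HTTP/1.1
Host: www.example.com
If-None-Match: "5e1d4c8a"

HTTP/1.1 304 Not Modified
ETag: "5e1d4c8a"
Cache-Control: max-age=86400
A 304 tells the client its cached copy is still good; if the resource had changed (or disappeared, as in the original question), the server would instead answer with a fresh 200 or an error such as 404.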
You type maps.google.com (a Uniform Resource Locator) into the address bar of your browser and press enter.
Every domain name has an IP address associated with it. The mapping is stored in name servers, and this system is called the Domain Name System (DNS).
The browser checks its cache to find the IP address for the URL.
If it doesn't find it, it asks the OS to find the IP address (e.g. via gethostbyname);
it then checks the router's cache;
it then checks the ISP's cache. If it is not available there, the ISP makes a recursive request to different name servers.
It checks the com name server (there are many top-level name servers, such as 'in', 'mil', 'us', etc.), which points it to the google.com name server.
The google.com name server will find the matching IP address for maps.google.com in its DNS records and return it to your DNS recursor, which will send it back to your browser.
The browser initiates a TCP connection with the server, using a three-way handshake:
Client machine sends a SYN packet to the server over the internet asking if it is open for new connections.
If the server has open ports that can accept and initiate new connections, it’ll respond with an ACKnowledgment of the SYN packet using a SYN/ACK packet.
The client will receive the SYN/ACK packet from the server and will acknowledge it by sending an ACK packet.
Then a TCP connection is established for data transmission!
The browser will send a GET request asking for the maps.google.com web page. If you're entering credentials or submitting a form, this could be a POST request.
The server sends the response.
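To make the network steps above concrete, here is a rough PHP sketch of the same sequence: resolving a host name, opening a TCP connection, and sending a plain GET request (example.com is used purely as an illustrative host, and real browsers do much more, including TLS):
<?php
// DNS lookup step: resolve the host name to an IP address.
$host = 'example.com'; // illustrative host, not from the original post
$ip = gethostbyname($host);
echo "Resolved $host to $ip\n";

// TCP connection step: the three-way handshake happens inside fsockopen().
$socket = fsockopen($ip, 80, $errno, $errstr, 10);
if ($socket === false) {
    die("Connection failed: $errstr ($errno)\n");
}

// HTTP request step: send a minimal GET request.
$request = "GET / HTTP/1.1\r\n";
$request .= "Host: $host\r\n";
$request .= "Connection: close\r\n\r\n";
fwrite($socket, $request);

// Response step: read and print the status line, headers and body.
while (!feof($socket)) {
    echo fgets($socket, 1024);
}
fclose($socket);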
Once the server supplies the resources (HTML, CSS, JS, images, etc.) to the browser, they undergo the process below:
Parsing - HTML, CSS, JS
Rendering - Construct DOM Tree → Render Tree → Layout of Render Tree → Painting the render tree
The rendering engine starts getting the contents of the requested document from the networking layer. This will usually be done in 8kB chunks.
A DOM tree is built out of the response, chunk by chunk.
New requests are made to the server for each new resource that is found in the HTML source (typically images, style sheets, and JavaScript files).
At this stage the browser marks the document as interactive and starts parsing scripts that are in "deferred" mode: those that should be executed after the document is parsed. The document state is set to "complete" and a "load" event is fired.
Each CSS file is parsed into a StyleSheet object, where each object contains CSS rules with selectors and objects corresponding to the CSS grammar. The tree that is built is called the CSSOM.
On top of DOM and CSSOM, a rendering tree is created, which is a set of objects to be rendered. Each of the rendering objects contains its corresponding DOM object (or a text block) plus the calculated styles. In other words, the render tree describes the visual representation of a DOM.
After the construction of the render tree it goes through a "layout" process. This means giving each node the exact coordinates where it should appear on the screen.
The next stage is painting: the render tree will be traversed and each node will be painted using the UI backend layer.
Repaint: When changing element styles which don't affect the element's position on a page (such as background-color, border-color, visibility), the browser just repaints the element again with the new styles applied (that means a "repaint" or "restyle" is happening).
Reflow: When the changes affect document contents or structure, or element position, a reflow (or relayout) happens.
I was also searching for the same thing and found this awesome, detailed answer being built collaboratively on GitHub.
I can describe one point here:
determining which file/resource to execute and which language interpreter to load.
Pardon me if I am wrong in using "interpreter" here. There may be other mistakes in my answer; I will try to correct them later and include proper technical terms for things.
When the web server (e.g. Apache) has received the URI, it checks whether any existing rewrite rule matches it. In that case the rewritten URI is taken. In either case, if there is no file name at the end of the URI, the default file is loaded, which is generally index.html or index.php etc. According to the extension of the file name, the appropriate Apache module for server-side programming language support is loaded, e.g. mod_php for PHP or mod_python in the case of Python. The appropriate server-side language interpreter (considering interpreted languages like PHP) then prepares the final HTML, or output in some other form, for the web server, which finally sends it as the HTTP response.
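As a minimal sketch of the two Apache pieces mentioned above (the default file and handing .php files to the PHP module), the relevant directives might look something like this; the values are illustrative and depend on the actual server setup:
# Serve index.php or index.html when the URI ends in a directory
DirectoryIndex index.php index.html

# Hand .php files to the PHP module (classic mod_php setup)
AddHandler application/x-httpd-php .php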