I am trying to read the contents of a website using read_html in R. However, for some websites, such as http://benchmarkrealestate.com/, I get this error: Error in open.connection(x,"rb") : HTTP error 406
What does this error mean? This only happens in some websites. I tried to look it up online, but wasn't able to find the exact reason why I get this error.
How do I fix this?
406 Not Acceptable
The requested resource is capable of generating only content not
acceptable according to the Accept headers sent in the request.
The sentence above is lifted right off of Wikipedia.
Basically, whenever a client such as a web crawler makes a request to a website, it usually identifies itself, its application type and other information by sending a characteristic identification string to the web server. This identification is transmitted in a header field called User-Agent. Some servers appear to reject requests whose User-Agent does not look like a browser, and may answer with a 406 even when the Accept header itself is not the problem.
One way to have the content of the web page returned to your console is to set your user-agent information to something identifiable with the help of the curl package:
library(xml2)
library(rvest)
library(curl)
web_content <- read_html(curl('http://benchmarkrealestate.com/', handle = new_handle("useragent" = "Mozilla/5.0")))
You may also want to read up on header fields.
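For comparison, the same idea can be sketched outside R with Python's standard library. This is an illustration rather than part of the original answer: the header values are assumptions, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Build (but do not send) a request that identifies itself like a browser.
# The User-Agent and Accept values are illustrative assumptions.
req = Request(
    "http://benchmarkrealestate.com/",
    headers={
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html,application/xhtml+xml",
    },
)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Accept"))
```

Passing req to urllib.request.urlopen would actually send the request with these headers.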
In a JSON-REST service architecture (following these patterns for methods and response codes) we often need to generate a deliberate 404 response - for example, if GET /users/123 is routed to a controller, which is then unable to find a User entity with ID 123, we return a 404 response, which in many cases will include a JSON payload with an error message/code/etc.
Now, when we provide a client for a specific API, we want the client to behave differently under different conditions. For example, if we point the client to the wrong host, we might get a 404 not found from that host - as opposed to the 404 we might get for an invalid User ID if we do reach the service.
In this case, a "404 User ID not found" is not an error, as far as the client is concerned - as opposed to any other "404 Not Found", which should cause the client to throw an exception.
My question is, how do you distinguish between these 404 errors?
Solely based on the response?
By adding a header to indicate a valid response?
Or some other way?
It is OK to return 404 in both cases. As 4xx codes are client relevant codes, it is also OK to return content even if there was an error.
Now, what kind of 404 it was can be decided based on the body of the response. Remember that the response should carry a MIME type that is compatible with the Accept header the client supplied. So if the client "knows" your specific error-describing format, your server can answer with a more detailed description.
This way the server can decide whether the client would understand a detailed response with the 404, and the client can tell whether it just got a regular 404 or one with a message it can process.
This would be both semantically correct, and compatible with HTTP.
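A minimal client-side sketch of that idea, assuming the service marks its deliberate 404s with an agreed media type. The media type, the "code" field and the "USER_NOT_FOUND" value are all hypothetical names invented for this example.

```python
import json

# Hypothetical media type that client and server agree on for error bodies.
ERROR_MEDIA_TYPE = "application/vnd.example.error+json"


def classify_404(content_type, body):
    """Return 'entity-not-found' for a 404 the API produced on purpose,
    or 'unexpected' for any other 404 (wrong host, wrong path, proxy)."""
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != ERROR_MEDIA_TYPE:
        return "unexpected"
    try:
        payload = json.loads(body)
    except ValueError:
        return "unexpected"
    # "code" and "USER_NOT_FOUND" are made-up names for this sketch.
    if isinstance(payload, dict) and payload.get("code") == "USER_NOT_FOUND":
        return "entity-not-found"
    return "unexpected"
```

A client built this way could return "no such user" to its caller for the first case and raise an exception for the second.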
In a classic form-based webapp, if a user submits an HTML form that contains validation errors, assuming no JavaScript, what's the correct thing to do?
1) Respond with HTTP 200 + the page content (including error info for the user)
2) Respond with HTTP 400 + the page content (including error info for the user)
Does it matter?
Your app is talking to human beings, not other machines. Therefore you should do the right thing and handle exceptions in a user-friendly manner.
Your user doesn't care about HTTP return codes, and so it should not even be a consideration for you either. You are confusing business-logic problems with HTTP protocol problems.
In fact, by throwing a 400 error at a web browser, you are only likely to make the browser show an ugly message to the user.
If you were coding a REST api, then the answer would be different. But you're not.
1) would be the correct approach because you want to display a page of content to the user that highlights the invalid input values.
The trouble with 2) is that some browsers may display their own 'friendly' error page that is designed to help users understand 4xx errors. Here's some information about when IE displays 'friendly' error pages:
http://support.microsoft.com/kb/294807
On the one hand, if it is a web app for human consumption, a 200 with a useful error message will work. Making web sites for humans is easier in that sense because they can read and understand the content and do not have to depend on the status code to interact with the application.
On the other hand, if you are thinking of a REST API, it would be more appropriate to return a 4xx error because it is a client-side error. In that case, you have several options.
According to RFC2616, a 400 means
The request could not be understood by the server due to malformed
syntax. The client SHOULD NOT repeat the request without
modifications.
This doesn't seem to be appropriate as it's not due to malformed syntax.
However, RFC2616 is now obsoleted by RFC7230-7235. The newer RFC7231 defines the meaning of 400 more broadly.
Client Error 4xx The 4xx (Client Error) class of status code indicates
that the client seems to have erred. Except when responding to a HEAD
request, the server SHOULD send a representation containing an
explanation of the error situation, and whether it is a temporary or
permanent condition.
400 Bad Request
The 400 (Bad Request) status code indicates that the server cannot or
will not process the request due to something that is perceived to be
a client error (e.g., malformed request syntax, invalid request
message framing, or deceptive request routing)
So this seems acceptable even though still generic. Another option would be to use 422 status code defined by RFC4918 (WebDAV).
422 Unprocessable Entity The 422 (Unprocessable Entity) status code
means the server understands the content type of the request entity
(hence a 415(Unsupported Media Type) status code is inappropriate),
and the syntax of the request entity is correct (thus a 400 (Bad
Request) status code is inappropriate) but was unable to process the
contained instructions. For example, this error condition may occur
if an XML request body contains well-formed (i.e., syntactically
correct), but semantically erroneous, XML instructions.
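To make the 400 versus 422 distinction concrete, here is a small sketch for a JSON API. The validation rule (an "email" field that must contain "@") is a made-up example, not something taken from the RFCs.

```python
import json


def status_for_submission(raw_body):
    """Pick a status code for a JSON request body."""
    try:
        data = json.loads(raw_body)
    except ValueError:
        return 400  # body is not syntactically valid JSON
    # Hypothetical rule: the payload must be an object with a plausible email.
    if not isinstance(data, dict) or "@" not in str(data.get("email", "")):
        return 422  # well-formed JSON, but semantically invalid
    return 200
```

Under the broader RFC7231 wording, returning 400 in the second case would also be defensible; 422 is simply more specific.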
I want to confirm that I can return an image or a CSS file when I generate a 403 error.
From the documentation, it sounds like I can, as per: any included representation from section 6.5.
6.5. Client Error 4xx
The 4xx (Client Error) class of status code indicates that the client
seems to have erred. Except when responding to a HEAD request, the
server SHOULD send a representation containing an explanation of the
error situation, and whether it is a temporary or permanent
condition. These status codes are applicable to any request method.
User agents SHOULD display any included representation to the user.
Source: https://www.rfc-editor.org/rfc/rfc7231#section-6.5
Would you agree that we do not have to return HTML on a 403 error?
Yes, it's perfectly fine to return something else than HTML for an error.
Image hosts sometimes return errors as images so they would show up when embedded with <img>. Web APIs will often return an error description as JSON/XML. So it's not only perfectly fine, but also common.
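As a rough sketch of what a non-HTML 403 looks like on the wire, the function below assembles a raw HTTP/1.1 response with a JSON body. The function name and the "error" field are assumptions made for this illustration.

```python
import json
from http import HTTPStatus


def forbidden_json_response(reason):
    """Build a raw HTTP/1.1 403 response whose body is JSON, not HTML."""
    body = json.dumps({"error": reason}).encode("utf-8")
    status = HTTPStatus.FORBIDDEN
    head = (
        f"HTTP/1.1 {status.value} {status.phrase}\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body
```

The same shape works for an image or a CSS file: only the Content-Type header and the body bytes change.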
I'm trying to find the correct HTTP status code for a page where the content is temporarily unavailable. There is no redirect; instead, a message is displayed on the page informing the user that the content is temporarily unavailable.
307 Temporary Redirect isn't applicable as there is no redirect.
404 Not Found might possibly be applicable, however I'm not sure if this is the correct response to give as the content is found, just not available.
410 Gone isn't applicable as the content will be available again some time in the future.
None of the other codes seemed even remotely applicable. Does anyone know the correct code to use and can explain why?
It sounds like the 4XX series of responses are appropriate here. From the RFC:
The 4xx class of status code is intended for cases in which the
client seems to have erred. Except when responding to a HEAD request,
the server SHOULD include an entity containing an explanation of the
error situation, and whether it is a temporary or permanent
condition.
With this in mind, I think 403 forbidden is the most appropriate:
10.4.4 403 Forbidden
The server understood the request, but is refusing to fulfill it.
Authorization will not help and the request SHOULD NOT be repeated.
If the request method was not HEAD and the server wishes to make
public why the request has not been fulfilled, it SHOULD describe the
reason for the refusal in the entity. If the server does not wish to
make this information available to the client, the status code 404
(Not Found) can be used instead.
I suggest this for three reasons:
It's not an exotic code, so it will work fine in the browser. This, to me, is the most important reason - you will be able to serve a page that explains why the content isn't available, and you can be fairly certain it will be displayed correctly.
It's appropriate for the server to say "I understand your request, but I won't serve you that content at this time", and that's exactly what the first two lines of the description say.
It doesn't explicitly say "forget you ever knew about this content" to any robots (or for that matter, people).
For completeness, here's why I ruled out the other response code categories:
2XX Success: This class of status code indicates that the client's request was
successfully received, understood, and accepted.
But, we're not accepting the request in this case. I don't think 2XX is right.
3XX Redirection: This class of status code indicates that further action needs to be
taken by the user agent in order to fulfill the request.
I suppose that you could argue "further action" to mean "please wait until the content is available before trying again", but reading the other 3XX codes, "further action" usually means "immediate redirect", which as you've already pointed out, isn't appropriate.
5XX Server error: Response status codes beginning with the digit "5" indicate cases in
which the server is aware that it has erred or is incapable of
performing the request.
Nothing has gone wrong on the server, you just don't want to serve the content right now.
HTTP STATUS CODE 204
i.e. NO CONTENT
Read More about it here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
section 10.2.5
i.e.
The server successfully processed the request, but is not returning any content.
I'd like to know if the POST method on HTTP sends data as a QueryString, or if it use a special structure to pass the data to the server.
In fact, when I analyze the communication with POST method from client to server (with Fiddler for example), I don't see any QueryString, but a Form Body context with the name/value pairs.
The best way to visualize this is to use a packet analyzer like Wireshark and follow the TCP stream. HTTP simply uses TCP to send a stream of data starting with a few lines of HTTP headers. Often this data is easy to read because it consists of HTML, CSS, or XML, but it can be any type of data that gets transferred over the internet (executables, images, video, etc).
For a GET request, your computer requests a specific URL and the web server usually responds with a 200 status code, and the content of the webpage is sent directly after the HTTP response headers. This content is the same content you would see if you viewed the source of the webpage in your browser. The query string you mentioned is just part of the URL and gets included in the HTTP GET request header that your computer sends to the web server. Below is an example of an HTTP GET request to http://accel91.citrix.com:8000/OA_HTML/OALogout.jsp?menu=Y, followed by a 302 redirect response from the server. Some of the HTTP headers are wrapped due to the size of the viewing window (these really only take one line each), and the 302 redirect includes a simple HTML webpage with a link to the redirected webpage (most browsers will automatically redirect any 302 response to the URL listed in the Location header instead of displaying the HTML response):
For a POST request, you may still have a query string, but this is uncommon and does not have anything to do with the data that you are POSTing. Instead, the data is included directly after the HTTP headers that your browser sends to the server, similar to the 200 response that the web server uses to respond to a GET request. In the case of POSTing a simple web form this data is encoded using the same URL encoding that a query string uses, but if you are using a SOAP web service it could also be encoded using a multi-part MIME format and XML data.
For example here is what an HTTP POST to an XML based SOAP web service located at http://192.168.24.23:8090/msh looks like in Wireshark Follow TCP Stream:
Post uses the message body to send the information back to the server, as opposed to Get, which uses the query string (everything after the question mark). It is possible to send both a Get query string and a Post message body in the same request, but that can get a bit confusing so is best avoided.
Generally, best practice dictates that you use Get when you want to retrieve data, and Post when you want to alter it. (These rules aren't set in stone, the specs don't forbid altering data with Get, but it's generally avoided on the grounds that you don't want people making changes just by clicking a link or typing a URL)
Conversely, you can use Post to retrieve data without changing it, but using Get means you can bookmark the page, or share the URL with other people, things you couldn't do if you'd used Post.
http://en.wikipedia.org/wiki/GET_%28HTTP%29
http://en.wikipedia.org/wiki/POST_%28HTTP%29
As for the actual format of the data sent in the message body, that's entirely up to the sender and is specified with the Content-Type header. If not specified, the default content-type for HTML forms is application/x-www-form-urlencoded, which means the server will expect the post body to be a string encoded in a similar manner to a GET query string. However this can't be depended on in all cases. RFC2616 says the following on the Content-Type header:
Any HTTP/1.1 message containing an entity-body SHOULD include a
Content-Type header field defining the media type of that body. If
and only if the media type is not given by a Content-Type field, the
recipient MAY attempt to guess the media type via inspection of its
content and/or the name extension(s) of the URI used to identify the
resource. If the media type remains unknown, the recipient SHOULD
treat it as type "application/octet-stream".
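To illustrate the default form encoding, here is a sketch that builds the text of a simple form POST with Python's standard library. The path, host and field names are made up for the example.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields for illustration.
fields = {"name": "Ada Lovelace", "city": "London"}
body = urlencode(fields)

# The request line carries no query string; the data travels in the body,
# after the blank line that ends the headers.
request = (
    "POST /submit HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
    f"{body}"
)
print(request)

# A server decodes the body the same way it would decode a query string.
decoded = parse_qs(body)
```

The body uses the same name=value&name=value encoding as a query string, which is why the two are easy to confuse.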
A POST request can include a query string, however normally it doesn't - a standard HTML form with a POST action will not normally include a query string for example.
GET will send the data as a querystring, but POST will not. Rather it will send it in the body of the request.
If your POST request is sent to the following URL
mypage.php?id=1
you will have the POST data but also GET data.
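A short sketch of that situation: the query string and the body are parsed separately, even though they use the same encoding. The form body here is invented for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Request target from the example above, plus a made-up form body.
target = "mypage.php?id=1"
post_body = "comment=hello&rating=5"

query_params = parse_qs(urlsplit(target).query)  # what PHP exposes as $_GET
body_params = parse_qs(post_body)                # what PHP exposes as $_POST

print(query_params)
print(body_params)
```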