I'm reading email from a maildir and some emails have weird sets of characters in them:
=3D
=09
I think =3D is = and =09 is a space. There are some others, but I'm not sure:
=E2
=80
=93
Does anyone know what these are and what encoding issues I'm dealing with here?
BTW, I tried fetching these emails via POP3 and it's the same thing. The reason I'm posting this on SO is that I'm not reading the data with a regular mail client; I'm reading it out of maildir files via PHP. Perhaps a regular email client would detect what encoding this is and deal with it.
Thanks!
That looks like quoted-printable encoding.
This is a form of encoding for sending 8-bit character data over a medium that may not preserve the high bit, i.e., one that is not 8-bit clean. In the olden days, some mail servers did not preserve all 8 bits of a byte.
If you're seeing these in the message source but not in your email client, then this is normal.
If you're seeing these in your email client then something is messed up in whatever software the sender is using - most likely, the Content-Transfer-Encoding header has not been properly specified (which tells the email client how to decode it).
If you're writing an email client and want to be able to deal with this, you'll need to read the Content-Transfer-Encoding header. Of course, if you're doing that, you're also going to come up against multipart messages/attachments, base64, and much more.
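If you only need to decode the payload once you know it is quoted-printable, the standard library will do it for you (PHP has `quoted_printable_decode` for the same job). This Python sketch, using an illustrative input string, also shows what the exact sequences from the question stand for:

```python
import quopri

# In quoted-printable, "=XX" encodes the byte 0xXX: "=3D" is "=",
# "=09" is a tab (not a space), and "=E2=80=93" is the UTF-8 byte
# sequence for an en dash (U+2013).
raw = b"x =3D y,=09indented, en dash: =E2=80=93"
decoded = quopri.decodestring(raw).decode("utf-8")
print(repr(decoded))  # 'x = y,\tindented, en dash: –'
```

Note that the bytes are decoded in two stages: quoted-printable undoes the transfer encoding, and only then does the charset (here UTF-8, which should come from the message's Content-Type header) turn the bytes into text.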
I recently had to debug an (old) web app that had to check a session cookie and was failing under some circumstances. It turns out that the library code that had to parse the HTTP Cookie header did not like the fact that one of the cookies in the header had a value that was a JSON object:
Cookie: lt-session-data={"id":"0.198042fc1767138e549","lastUpdatedDate":"2020-12-17T10:22:25Z"}; sessionid=a7f2f57d0b9a3247a350d9157fcbf9c2
(The lt-session-data cookie comes from LucidChart and ends up tagged with our domain because of the user visiting one of our Confluence pages with an embedded LucidChart diagram.)
It seems clear that the library I am using is applying the rules of RFC6265 in a strict manner:
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
According to this, the JSON cookie value breaks at least two rules (no DQUOTE, no comma), and possibly more, depending on the data items in the JSON object (e.g. strings that may contain arbitrary characters).
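To make the mismatch concrete, here is a minimal sketch (the `is_valid_cookie_value` helper is my own, not from any library) that applies RFC 6265's cookie-octet grammar to values like the ones above:

```python
def is_valid_cookie_value(value: str) -> bool:
    """Check a cookie value against RFC 6265's cookie-value grammar."""
    # cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
    def is_cookie_octet(c: str) -> bool:
        o = ord(c)
        return (o == 0x21 or 0x23 <= o <= 0x2B or 0x2D <= o <= 0x3A
                or 0x3C <= o <= 0x5B or 0x5D <= o <= 0x7E)

    # A single pair of surrounding DQUOTEs is permitted by the grammar.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return all(is_cookie_octet(c) for c in value)

json_value = '{"id":"0.198042fc1767138e549","lastUpdatedDate":"2020-12-17T10:22:25Z"}'
print(is_valid_cookie_value(json_value))                          # False: embedded DQUOTE and comma
print(is_valid_cookie_value("a7f2f57d0b9a3247a350d9157fcbf9c2"))  # True
```

Interestingly, the braces themselves are legal (%x7B and %x7D fall inside %x5D-7E); it is the quotes and commas inside the JSON that violate the grammar.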
My question is, is this common practice? Both Firefox and Chrome seem to accept this cookie without any issue, even though it goes against the RFC standard. I tried Googling for standards, but only RFC6265 turns up. It seems people just started putting JSON values in cookies.
If this is now common practice, and not a misguided effort by people who didn't bother to read the relevant standard docs, is there an updated standard?
By chance I discovered that the Django admin interface always uses enctype="multipart/form-data".
I would like to adopt this pattern, but I am unsure whether I see all the consequences this has.
Why not use enctype="multipart/form-data" always?
Update
We have now been using enctype="multipart/form-data" in some forms for more than a year. Works fine.
From the RFC that defines multipart/form-data:
Many web applications use the "application/x-www-form-urlencoded"
method for returning data from forms. This format is quite compact,
for example:
name=Xavier+Xantico&verdict=Yes&colour=Blue&happy=sad&Utf%F6r=Send
However, there is no opportunity to label the enclosed data with a
content type, apply a charset, or use other encoding mechanisms.
Many form-interpreting programs (primarily web browsers) now
implement and generate multipart/form-data, but a receiving
application might also need to support the
"application/x-www-form-urlencoded" format.
Aside from letting you upload files, multipart/form-data also allows you to use other charsets and encoding mechanisms. So the only reasons not to use it are:
If you want to save a bit of bandwidth (bearing in mind that this becomes much less of an issue if the request body is compressed).
If you need to support really old clients that can't handle file uploads and only know application/x-www-form-urlencoded, or that have issues handling anything other than ASCII.
There's a bit of overhead to using multipart/form-data for simple text forms. Compare a simple form with name and email fields:
Default (x-www-form-urlencoded)
Content-Type: application/x-www-form-urlencoded; charset=utf-8
name=Nomen+Nescio&email=foo%40bar.com
multipart/form-data
Content-Type: multipart/form-data; boundary=96a188ad5f9d4026822dacbdde47f43f
--96a188ad5f9d4026822dacbdde47f43f
Content-Disposition: form-data; name="name"
Nomen Nescio
--96a188ad5f9d4026822dacbdde47f43f
Content-Disposition: form-data; name="email"
foo@bar.com
--96a188ad5f9d4026822dacbdde47f43f--
As you can see, you need to transmit quite a few additional bytes in the body when using multipart encoding (37 bytes vs. 252 bytes in this example).
But once you add the HTTP headers and apply compression, the relative difference in payload will in most real-life cases be much smaller.
The reason to prefer urlencoded over multipart is a small saving in http request size.
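The size comparison above can be reproduced with a short sketch; the boundary string is arbitrary, and the multipart body is built by hand here just to count bytes:

```python
from urllib.parse import urlencode

fields = {"name": "Nomen Nescio", "email": "foo@bar.com"}

# application/x-www-form-urlencoded body
urlencoded = urlencode(fields)
print(urlencoded)       # name=Nomen+Nescio&email=foo%40bar.com
print(len(urlencoded))  # 37

# multipart/form-data body with an arbitrary boundary
boundary = "96a188ad5f9d4026822dacbdde47f43f"
parts = []
for name, value in fields.items():
    parts.append(f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="{name}"\r\n'
                 f"\r\n{value}\r\n")
multipart = "".join(parts) + f"--{boundary}--\r\n"
print(len(multipart))   # several times larger than the urlencoded body
```

The exact multipart byte count depends on the boundary length and line-ending conventions, but the urlencoded form is always the more compact of the two for plain-text fields.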
TL;DR
There's almost certainly no problem if you're targeting any modern browser and using SSL for any confidential data.
Background
The form-data type was originally developed as an experimental extension for file uploads in browsers, as explained in RFC 1867. There were compatibility issues at the time, but if your target browsers support HTML 4.x and hence this enctype, you're fine. As you can see here, that's not an issue for any mainstream browser.
As already noted in other answers, it is a more verbose format, but that is also not an issue when you can compress the request or even just rely on the improved speed of communications in the last 20 years.
Finally, you should also consider the potential for abuse of this format. Since it was designed for uploading files, there was the potential for it to be used to extract information from the user's machine without their knowledge, or to send confidential information unencrypted, as noted in the HTML spec. Once again, though, modern browsers are so field-hardened that I would be stunned if such low-hanging fruit were left for hackers to abuse, and you can use HTTPS for confidential data.
The enctype attribute specifies how the form data should be encoded when submitting it to the server, and enctype="multipart/form-data" is used when a user wants to upload a file (images, text files, etc.) to the server.
Is it necessary to percent-encode a URI before using it in the browser? That is, when we write a URI in a browser, should it already be percent-encoded, or is it the browser's responsibility to encode the URI before sending the request to the server?
You'll find that most modern browsers will accept a non-encoded URL and they will generally be able to encode reserved characters themselves.
However, it is bad practice to rely on this because you can end up with unpredictable results. For instance, if you were sending form data to a server using a GET request and someone had typed in a # symbol, the browser will interpret the request differently depending on whether that symbol was encoded or not.
In short, it's always best to encode data manually to get predictable results if you're expecting reserved characters in a request. Fortunately, most programming languages used on the web have built-in functions for this.
Just to add, you don't need to encode the whole URL - it's usually the data you're sending in a GET request which gets encoded. For example:
http://www.foo.com?data=This%20is%20my%20encoded%20string%20%23
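In Python, for example, the standard library handles the encoding of the data portion (the URL and parameter name here are just the ones from the example above):

```python
from urllib.parse import quote, unquote

data = "This is my encoded string #"

# Percent-encode the value; safe="" also encodes "/" so that
# no reserved character slips through unencoded.
encoded = quote(data, safe="")
print(encoded)  # This%20is%20my%20encoded%20string%20%23

url = "http://www.foo.com?data=" + encoded
print(url)

# The receiving side can reverse it:
print(unquote(encoded))  # This is my encoded string #
```

PHP's `rawurlencode` and JavaScript's `encodeURIComponent` do the equivalent job in those languages.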
I have a problem with how IE (7 & 8) handles security certificate errors.
Our application needs to send out a secure link to the user's email, consisting of a randomly generated token which may have special characters. So before sending out, we encode the token. The sample URL would be like this:
localhost:8080/myapp?t=7f%26DX%243q9a
When the user opens this in IE, it shows the certificate error page ("There is a problem with this website's security certificate."). The continue link on that page re-encodes our token into something else:
localhost:8080/myapp?t=7f%2526DX%25243q9a
(Thus the user would be sent to a slightly different URL than what we're expecting, as you can see.)
Here you can see that the "%"s I'd sent get turned into "%25"s. How can I decode the token correctly after this?
Nasty!
If this is a reproducible bug and not funny behaviour caused by some character set issues or something - it doesn't look like it! - then I think your only way to work around it is to use an encoding method for the parameter that uses only letters and numbers, like base64.
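A sketch of both options, using the token from the question: percent-decoding twice recovers the mangled value, while a URL-safe Base64 token sidesteps the problem entirely because it never contains a "%" for IE to re-encode:

```python
import base64
from urllib.parse import quote, unquote

token = "7f&DX$3q9a"

# What the app sends vs. what IE's certificate-error page produces:
sent = quote(token, safe="")    # 7f%26DX%243q9a
mangled = quote(sent, safe="")  # 7f%2526DX%25243q9a ("%" re-encoded as "%25")

# Workaround 1: decode twice on the server when double encoding is detected.
assert unquote(unquote(mangled)) == token

# Workaround 2: put only letters, digits, "-" and "_" in the URL
# via URL-safe Base64 (padding stripped, to be restored when decoding).
safe_token = base64.urlsafe_b64encode(token.encode()).decode().rstrip("=")
print(safe_token)
```

Detecting double encoding reliably is fiddly (a legitimate token could contain the literal sequence "%25"), which is why the Base64 route is the safer fix.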
I was wondering if somebody could shed some light on this browser behaviour:
I have a form with a textarea that is submitted to the server either via XHR (using jQuery; I've also tried plain XMLHttpRequest just to rule jQuery out, and the result is the same) or the "old-fashioned" way via form submit. In both cases method="POST" is used.
Both ways submit to the same script on the server.
Now the funny part: if you submit via XHR, newline characters are transferred as "%0A" (i.e. \n, if I am not mistaken), and if you submit the regular way they are transferred as "%0D%0A" (i.e. \r\n).
This, of course, causes some problems on the server side, but that is not the question here.
I'd just like to know why this difference? Shouldn't new lines be transferred the same no matter what method of submitting you use? What other differences are there (if any)?
When sending XML, XMLHttpRequest strips the CR characters from the stream. This is in accordance with the XML specification, which says that CRLF must be normalized to a plain LF.
Hence, if you package your content as XML and send it via XHR, you will lose the CRs.
Section 3.7.1 of RFC 2616 (HTTP/1.1) allows any of \r\n, \r, or \n to represent a newline:
HTTP relaxes this requirement and allows the
transport of text media with plain CR or LF alone representing a line
break when it is done consistently for an entire entity-body. HTTP
applications MUST accept CRLF, bare CR, and bare LF as being
representative of a line break in text media received via HTTP.
But this does not apply to control structures:
This flexibility regarding
line breaks applies only to text media in the entity-body; a bare CR
or LF MUST NOT be substituted for CRLF within any of the HTTP control
structures (such as header fields and multipart boundaries).
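On the receiving side, a common way to cope with all of this (a sketch; the helper name is my own) is to normalize every line-break convention to a single one before processing the text:

```python
import re

def normalize_newlines(text: str) -> str:
    """Collapse CRLF and bare CR to a plain LF, leaving existing LFs alone."""
    return re.sub(r"\r\n|\r", "\n", text)

mixed = "line1\r\nline2\rline3\nline4"
print(repr(normalize_newlines(mixed)))  # 'line1\nline2\nline3\nline4'
```

With this in place, it no longer matters whether the textarea content arrived via XHR (\n) or a regular form submit (\r\n).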