An application I'm maintaining loads user agents extracted from web logs into a MySQL table column using the 'latin1' charset. Occasionally, it fails to load a user agent that looks like this:
Mozilla/5.0 (Iâ?; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML^C like Gecko) Version
I suspect it's choking on Iâ?. I'm working to figure out if this should be supported, or if it's corruption introduced by the upstream logging system. Is this a legal user agent in a HTTP header?
RFC 2616 (HTTP 1.1) describes message header field content as "consisting of either *TEXT or combinations of token, separators, and quoted-string". If you look at the definitions of TEXT and CTL, you will find that the only illegal octets are the control characters, i.e. byte values in the [0, 31] range or equal to 127; characters such as â are therefore, as far as I can tell, legal per the spec.
Technically, octets > 127 are allowed in comments. RFC 2616 makes them default to ISO-8859-1, but HTTPbis (the upcoming revision of RFC 2616) has removed that rule so that sometimes in the distant future, we may be able to move to a sane encoding.
Recommendation: strip all octets > 127.
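If you go that route, here is a minimal sketch (Python, with a made-up helper name and an invented sample byte string) of stripping the high octets before the value reaches the latin1 column:
def strip_high_octets(raw: bytes) -> str:
    # Keep only octets <= 127, i.e. drop everything outside 7-bit ASCII,
    # which also removes the bytes that show up as "â" etc. in mangled logs.
    return bytes(b for b in raw if b < 128).decode("ascii")

print(strip_high_octets(b"Mozilla/5.0 (I\xc3\xa2?; CPU iPhone OS 5_0_1 like Mac OS X)"))
# Mozilla/5.0 (I?; CPU iPhone OS 5_0_1 like Mac OS X)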
HTTP/1.1 (RFC 2616) refers to ISO-8859-1, which is a Latin-based single-byte character set.
Given that HTTP header traffic is supposed to be single-byte, I am also using the latin1 character set for my similar logs. The decision was made simply to keep my indexes smaller.
If you use utf8 with VARCHAR, only the characters that are actually multi-byte require additional bytes, so in table space it's not much extra. However, indexes are stored fixed-width, padded to the maximum character width just in case you need it, so utf8 indexes are three times as large as latin1 indexes.
It doesn't affect me if the occasional odd header is unreadable. However, if you're not indexing the column, you may as well use UTF8.
ERC-1155 states that "The string format of the substituted hexadecimal ID MUST be leading zero padded to 64 hex characters length if necessary."
In what situation is a 0-padded hex ID necessary? It is odd they chose to use the keyword MUST here as it seems like the choice of whether to use 64 hex character padding is completely arbitrary.
I understand that there cannot exist more than 2^256 ids (64 hex digits), but wouldn't the choice of metadata URI for an ERC-1155 token be implementation-dependent?
For example, if I wanted to create an ERC-1155 token composed only of 64 NFTs, I'd much prefer defining metadata URLs as follows:
https://{DOMAIN}/1.json
https://{DOMAIN}/2.json
...
https://{DOMAIN}/40.json (64 in hex)
I suspect that ERC-1155 was built with uint256 in mind as the standard for numeric types and that requiring ID to be padded to 64 hex characters means that all 256 bits of information are specified explicitly. Maybe this alleviates potential issues with dirty leading bits?
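For context, the padding requirement applies to the {id} substitution in the metadata URI template; a minimal sketch of that substitution (the domain and function name here are made up):
def substitute_id(uri_template: str, token_id: int) -> str:
    # Per the ERC-1155 metadata spec, the ID is lowercase hex with no 0x prefix,
    # left-padded with zeros to 64 characters.
    return uri_template.replace("{id}", format(token_id, "064x"))

print(substitute_id("https://example.com/{id}.json", 64))
# https://example.com/0000000000000000000000000000000000000000000000000000000000000040.json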
Padding doesn't appear to be strictly necessary to function - I have seen smart contracts which use unpadded metadata URLs, such as Mining.game (contract: https://mumbai.polygonscan.com/address/0x1a3d0451f48ebef398dd4c134ae60846274b7ce0#code, metadata: https://api.mining.game/1.json).
This is on the Polygon testnet, not a mainnet, so keep in mind that code quality may not be stellar. But regardless, it appears to work.
What is the difference between UTF-8 and ISO-8859-1?
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
Wikipedia explains both reasonably well: UTF-8 and Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte fixed-length encoding.
Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0-127 are encoded identically; code points 128-255 differ, becoming a 2-byte sequence in UTF-8 where they are single bytes in Latin-1.
UTF
UTF is a family of multi-byte encoding schemes that can represent Unicode code points; the original design allowed for up to 2^31 (roughly 2 billion) characters. UTF-8 is a flexible encoding system that uses between 1 and 4 bytes per code point, covering the first 2^21 (roughly 2 million) code points, far more than the roughly 1.1 million code points Unicode actually defines.
Long story short: any character with a code point below 128, aka 7-bit-safe ASCII, is represented by the same single byte as in most other single-byte encodings. Any character with a code point of 128 or above is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
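A quick Python illustration of those variable byte lengths (the sample characters are chosen arbitrarily):
# UTF-8 byte length grows with the code point.
for ch in ["A", "©", "€", "𝄞"]:
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "byte(s)")
# A 0x41 1 byte(s)
# © 0xa9 2 byte(s)
# € 0x20ac 3 byte(s)
# 𝄞 0x1d11e 4 byte(s)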
ISO-8859
ISO-8859 is a family of single-byte encoding schemes used to represent alphabets whose additional characters fit within the byte range of 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of which part is used.
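To illustrate the "parts" idea, the same byte decodes to different characters depending on which part is chosen (a small Python sketch; the byte is arbitrary):
b = b"\xe0"
print(b.decode("iso-8859-1"))  # 'à' in Latin-1 (Western European)
print(b.decode("iso-8859-5"))  # 'р' (a Cyrillic letter) in Part 5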
The drawback of this scheme is its inability to accommodate languages with more than 128 additional symbols, or to safely display more than one family of symbols at a time. ISO-8859 encodings have also fallen out of favor with the rise of UTF; the ISO working group in charge of them disbanded in 2004, leaving maintenance to its parent subcommittee.
Windows Code Pages
It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft has been pushing its recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into these code pages.
For example, cp1252 is a superset of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol € and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up.
Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows Code Pages have at least some fundamental conflicts, if not differing entirely from their 8859 equivalent.
ASCII: 7 bits. 128 code points.
ISO-8859-1: 8 bits. 256 code points.
UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.
Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:
#!/usr/bin/env python3
c = chr(0xa9)                  # U+00A9 COPYRIGHT SIGN
print(c)                       # the character itself
print(c.encode('utf-8'))       # two bytes in UTF-8
print(c.encode('iso-8859-1'))  # a single byte in ISO-8859-1
Output:
©
b'\xc2\xa9'
b'\xa9'
ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages in the Western world, and even for many of its supported languages some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters, you will see weird results. In other words, don't use it. Unicode has taken over the world, and UTF-8 is pretty much the standard these days unless you have some legacy reason (like HTTP headers, which need to be compatible with everything).
One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead.
For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).
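That particular difference is easy to demonstrate in Python:
b = b"\x85"
print(repr(b.decode("cp1252")))      # '…' (U+2026 HORIZONTAL ELLIPSIS)
print(repr(b.decode("iso-8859-1")))  # '\x85' (U+0085, a C1 control character)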
The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.
Also of interest, HTML numeric character references essentially use Windows-1252 for 8-bit values rather than Unicode code points; per https://html.spec.whatwg.org/#numeric-character-reference-end-state, a numeric reference to 0x85 (&#x85;) will produce U+2026 rather than U+0085.
From another perspective: files that both UTF-8 and ASCII decoders fail to read because they contain a byte such as 0xC0 do get read by ISO-8859-1 without error, since every byte value maps to a character. The caveat, of course, is that the file shouldn't actually contain multi-byte Unicode characters.
My reason for researching this question was to work out in what way the two are compatible. The Latin-1 charset (ISO-8859-1) is 100% compatible with being stored in a UTF-8 datastore: ASCII characters stay single-byte, while the extended (128-255) characters are stored as two-byte sequences.
Going the other way, from UTF-8 to the Latin-1 charset, may or may not work: any characters beyond code point 255 cannot be stored in a Latin-1 datastore.
I was just curious because 65 is the same as the letter A
If this is the wrong stack sorry.
"65 is the same as the letter A": It is true if you say it is. But not saying more than that isn't very useful.
There is no text but encoded text. There are no numbers but encoded numbers. To the CPU, some number encodings are native, everything else is just undifferentiated data.
(Some data is just data for programs, other data is the CPU instructions of programs. It's a security problem if a CPU executes data as instructions inappropriately. Some architectures keep program data and instructions separate.)
Common native number encodings are signed and unsigned integers of 1, 2, 4, and 8 bytes and IEEE-754 single and double precision floating point numbers. Signed integers are usually two's-complement. Multi-byte integers have a byte ordering (or endianness) because on typical machines each byte is individually addressable. If a number encoding is not native, a program library is needed to process such data.
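As a small illustration of native integer layouts and byte ordering (Python; the values are arbitrary):
import struct

n = 65
print(n.to_bytes(4, "little"))   # b'A\x00\x00\x00' - least significant byte first
print(n.to_bytes(4, "big"))      # b'\x00\x00\x00A' - most significant byte first
print(struct.pack("<d", 65.0))   # the same value as a little-endian IEEE-754 double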
Text is a sequence of encoded characters from a character set. There are hundreds of character sets. A character set is an assignment of a conceptual character to a number called a codepoint. Sometimes the conceptual characters are categorized as lowercase letter, digit, symbol, etc. A codepoint value is mapped to bytes using a character encoding. Most character sets have one encoding, but Unicode has several. Some character sets are subsets of other character sets—such relationships are not generally useful because exactly one character set is used in any one context.
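The 65/'A' relationship from the question is exactly this character-set-plus-encoding bookkeeping; a short Python sketch:
print(ord("A"))                  # 65  - the code point assigned to the character 'A'
print(chr(65))                   # 'A' - the character assigned to code point 65
print("A".encode("ascii"))       # b'A' (a single byte, 0x41)
print("é".encode("utf-8"))       # b'\xc3\xa9' - two bytes under one encoding...
print("é".encode("iso-8859-1"))  # b'\xe9'     - ...one byte under another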
A program is a set of instructions that operate on data. It must apply the correct operations to the right data. So, it is the program that differentiates between text and number, usually by its location or flow path.
Stored data must be in a known layout of encoded text and numbers. Sometimes the layout is stored also. The layout is called metadata. Without the metadata accompanying the data, or being agreed upon, the data cannot be used.
It's all quite simple with appropriate bookkeeping. But there are several methods of bookkeeping so there is no general solution to how to handle data without metadata. Methods include: Well-known and/or registered file extensions, HTTP headers, MIME types, HTML meta charset tag, XML encoding declaration. Some methods only work in a certain context, such as audio/video codecs having a four-character code (FourCC), and unix shell scripts with a shebang. Some methods only help narrow guessing, such as file signatures. Needless to say, guessing should be avoided; it leads to security issues and data loss.
Unfortunately, text files are often without metadata. It is particularly important to agree upon or separately communicate the metadata.
Data without metadata is "binary". So the writer of text must agree with the reader on which character encoding is to be used. Similarly, for all types of data. Here reader and writer are both humans and programs.
Short answer: they don't. Longer answer: every binary combination between 00000000 and 01111111 has a character representation in the ASCII character set, and 01000001 just happens to be the first capital letter of the Latin alphabet, designated over 30 years ago. Other character sets and code pages assign the remaining combinations to different letters, numbers, non-printable characters and accented letters. It's entirely possible that the binary 01000001 could be a lower-case z with a tilde over the top in a different character set. Computers don't know (or care) what a particular binary representation means to humans.
In a multi-part (i.e. Content-Type=multipart/form-data) form, is there an upper limit on the length of the boundary string that an HTTP server should accept?
As far as I can tell, the relevant RFCs say 70 chars:
RFC2616 (HTTP/1.1) section "3.7 Media Types" says that the allowed types in the Content-Type header are defined by RFC1590 (Media Type Registration Procedure).
RFC1590 updates RFC-1521(MIME).
RFC1521 says that a boundary "must be no longer than 70 characters, not counting the two leading hyphens".
The same text also appears in RFC2046 which supposedly obsoletes RFC1521.
So can I be certain all the major HTTP/1.1 browsers out there today adhere to this limit? Are there any browsers (or other HTTP clients/libraries) known to break this limit?
Is there some other spec or common rule-of-thumb I'm missing that says the string will be shorter than 70 chars? In Chrome(ium) I get something like this: ----WebKitFormBoundaryLu4dNSGEhJZUgoe5, which is obviously shorter than 70 chars.
I'm asking this question because my server is running in an extremely memory-constrained environment, so "malloc a buffer large enough to hold the entire header string" is not an ideal answer.
As you note, RFC 2046 updated the MIME spec, but kept the restriction of the maximum boundary string to 70 characters, not counting the two leading hyphens.
I think it's a fair assumption that the spec is followed by all major browsers (and all MIME-using clients, like mail programs) since otherwise passing around multipart data would be very risky indeed.
To be sure, I've experimentally verified it for you using the latest versions of:
curl: ----------------------------5a56a6c893f2 (40)
Chrome 30 (WebKit): ----WebKitFormBoundarym0vCJKBpUYdCIWQG (38)
Safari 6 (WebKit, and same as Chrome): ----WebKitFormBoundaryFHUXvJBZwO2JKkNa (38)
FireFox 24: ---------------------------7096603861379320641089344535 (55)
IE 10: ---------------------------7dd1961640278 (40) - same technique as curl!
Apache HttpClient: -----------------------------1294919323195 (42)
Thus not only does every major browser/client conform, but all would allow you to save 15 allocated bytes per boundary per buffer from the theoretical max. If you could trivially switch on user agent, you could squeeze even more performance out. ;-)
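If it helps, here is a rough sketch (Python for brevity; the helper name and regex are my own, not from any spec) of enforcing the RFC 2046 cap while pulling the boundary out of a Content-Type value, using the Chrome boundary from the question:
import re

MAX_BOUNDARY = 70  # RFC 2046: 1 to 70 characters, not counting the two leading hyphens

def extract_boundary(content_type: str) -> str:
    match = re.search(r'boundary="?([^";]+)"?', content_type)
    if match is None:
        raise ValueError("no boundary parameter")
    boundary = match.group(1)
    if not 1 <= len(boundary) <= MAX_BOUNDARY:
        raise ValueError("boundary length outside RFC 2046 limits")
    return boundary

ct = "multipart/form-data; boundary=----WebKitFormBoundaryLu4dNSGEhJZUgoe5"
print(len(extract_boundary(ct)))  # 38, comfortably under 70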
I want to store a hashed password (using BCrypt) in a database. What would be a good type for this, and which would be the correct length? Are passwords hashed with BCrypt always of same length?
EDIT
Example hash:
$2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu
After hashing some passwords, it seems that BCrypt always generates 60 character hashes.
EDIT 2
Sorry for not mentioning the implementation. I am using jBCrypt.
The modular crypt format for bcrypt consists of
$2$, $2a$ or $2y$ identifying the hashing algorithm and format
a two digit value denoting the cost parameter, followed by $
a 53-character base-64-encoded value (using the alphabet ., /, 0–9, A–Z, a–z, which differs from the standard Base64 alphabet) consisting of:
22 characters of salt (effectively only 128 bits of the 132 decoded bits)
31 characters of encrypted output (effectively only 184 bits of the 186 decoded bits)
Thus the total length is 59 or 60 bytes respectively.
As you use the 2a format, you'll need 60 bytes. And thus for MySQL I'd recommend using CHAR(60) BINARY or BINARY(60) (see The _bin and binary Collations for information about the difference).
CHAR is not binary safe and equality does not depend solely on the byte value but on the actual collation; in the worst case A is treated as equal to a. See The _bin and binary Collations for more information.
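Splitting the example hash from the question confirms those lengths (a quick Python check):
h = "$2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu"
prefix, cost, rest = h[1:].split("$")         # '2a', '10', 53-character salt+digest
salt, digest = rest[:22], rest[22:]
print(prefix, cost, len(salt), len(digest))   # 2a 10 22 31
print(len(h))                                 # 60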
A Bcrypt hash can be stored in a BINARY(40) column.
BINARY(60), as the other answers suggest, is the easiest and most natural choice, but if you want to maximize storage efficiency, you can save 20 bytes by losslessly deconstructing the hash. I've documented this more thoroughly on GitHub: https://github.com/ademarre/binary-mcf
Bcrypt hashes follow a structure referred to as modular crypt format (MCF). Binary MCF (BMCF) decodes these textual hash representations to a more compact binary structure. In the case of Bcrypt, the resulting binary hash is 40 bytes.
Gumbo did a nice job of explaining the four components of a Bcrypt MCF hash:
$<id>$<cost>$<salt><digest>
Decoding to BMCF goes like this:
$<id>$ can be represented in 3 bits.
<cost>$, 04-31, can be represented in 5 bits. Put these together for 1 byte.
The 22-character salt is a (non-standard) base-64 representation of 128 bits. Base-64 decoding yields 16 bytes.
The 31-character hash digest can be base-64 decoded to 23 bytes.
Put it all together for 40 bytes: 1 + 16 + 23
You can read more at the link above, or examine my PHP implementation, also on GitHub.
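For readers who prefer code to prose, here is a hedged Python sketch of the deconstruction above. The scheme-id numbering and header packing are illustrative only; the authoritative bit layout is the BMCF document linked above.
import base64

# bcrypt uses its own base-64 ordering (./A-Za-z0-9), not the standard alphabet.
BCRYPT_ALPHABET = "./ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
STD_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
TO_STD = str.maketrans(BCRYPT_ALPHABET, STD_ALPHABET)

def bcrypt_b64decode(s: str) -> bytes:
    # Translate to the standard alphabet, pad, and decode.
    translated = s.translate(TO_STD)
    return base64.b64decode(translated + "=" * (-len(translated) % 4))

SCHEME_IDS = {"2": 0, "2a": 1, "2x": 2, "2y": 3, "2b": 4}  # illustrative numbering only

def bcrypt_to_binary(mcf_hash: str) -> bytes:
    _, scheme, cost, rest = mcf_hash.split("$")       # '', '2a', '10', salt+digest
    header = (SCHEME_IDS[scheme] << 5) | int(cost)    # 3 bits of scheme + 5 bits of cost
    salt = bcrypt_b64decode(rest[:22])                # 22 characters -> 16 bytes
    digest = bcrypt_b64decode(rest[22:])              # 31 characters -> 23 bytes
    return bytes([header]) + salt + digest            # 1 + 16 + 23 = 40 bytes

packed = bcrypt_to_binary("$2a$10$KssILxWNR6k62B7yiX0GAe2Q7wwHlrzhF3LqtVvpyvHZf0MwvNfVu")
print(len(packed))  # 40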
If you are using PHP's password_hash() with the PASSWORD_DEFAULT algorithm to generate the bcrypt hash (which I would assume is a large percentage of people reading this question), be sure to keep in mind that in the future password_hash() might use a different algorithm as the default, and this could therefore affect the length of the hash (though it may not necessarily be longer).
From the manual page:
Note that this constant is designed to change over time as new and
stronger algorithms are added to PHP. For that reason, the length of
the result from using this identifier can change over time. Therefore,
it is recommended to store the result in a database column that can
expand beyond 60 characters (255 characters would be a good choice).
Using bcrypt, even if you have 1 billion users (i.e. you're currently competing with Facebook), storing 255-byte password hashes would take only ~255 GB of data - about the size of a smallish SSD. It is extremely unlikely that storing the password hash is going to be the bottleneck in your application. However, in the off chance that storage space really is an issue for some reason, you can use PASSWORD_BCRYPT to force password_hash() to use bcrypt, even if that's not the default. Just be sure to stay informed about any vulnerabilities found in bcrypt, and review the release notes every time a new PHP version is released. If the default algorithm is ever changed, it would be good to review why and make an informed decision about whether to use the new algorithm or not.
I don't think that there are any neat tricks you can do storing this as you can do for example with an MD5 hash.
I think your best bet is to store it as a CHAR(60) as it is always 60 chars long
I think the best choice is a nonbinary type, because comparisons then involve fewer possible values per byte and should be faster. If the data is encoded with base64_encode, each byte position has only 64 possible values; if it is encoded with bin2hex, each byte has only 16 possible values, but the string is much longer. A binary column has 256 possible values per byte.
For hashes stored in base64 form I use a VARCHAR(255) column with the ascii character set and matching collation.
VARBINARY causes comparison problems, as described in the MySQL documentation; I don't know why the answers advising VARBINARY have so many upvotes.
I checked this on my own site, where I measure the timing (just refresh to see).