MySQL "binary" vs "char character set binary" - mysql

What's the difference between binary(10) and char(10) character set binary?
And between varbinary(10) and varchar(10) character set binary?
Are they synonymous in all MySQL engines?
Is there any gotcha to watch out for?

There isn't a difference.
However, there is a caveat if you're storing a string.
If you only want to store a byte array or other binary data, such as a stream or file, then use the binary types, as that is what they are meant for.
Quote from the MySQL manual:
The use of CHARACTER SET binary in the definition of a CHAR, VARCHAR,
or TEXT column causes the column to be treated as a binary data type.
For example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary
BINARY(10)
VARCHAR(10) CHARACTER SET binary
VARBINARY(10)
TEXT CHARACTER SET binary
BLOB
So, technically there is no difference.
However, when storing a string it must be converted from characters to byte values using a character set. You can either do this yourself before the data reaches the MySQL server, or leave it to MySQL to do for you. MySQL will perform this by casting the string to BINARY using the binary character set.
If you want to store the text in a different encoding, let's say you have a business requirement that says you must use 4 bytes per character (MySQL doesn't do this by default), you could then apply CHARACTER SET BINARY to a textual column and perform the character set encoding yourself.
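As a rough illustration of "performing the character set encoding yourself", here is a minimal Python sketch. It assumes the 4-bytes-per-character requirement is met with UTF-32 (big-endian, no byte-order mark); the variable names are made up for the example, and the database call itself is left out.

```python
# Encode text yourself before handing the bytes to a binary column.
text = "h\u00e9llo"  # "héllo", five characters

# UTF-32 always uses exactly 4 bytes per character.
# The "-be" variant is big-endian and adds no byte-order mark.
encoded = text.encode("utf-32-be")

# `encoded` is what would be bound to the
# CHAR/VARCHAR ... CHARACTER SET binary (or BINARY/VARBINARY) parameter.
print(len(text), len(encoded))

# Reading the value back requires decoding with the same character set,
# because the column itself no longer knows what the bytes mean.
decoded = encoded.decode("utf-32-be")
```

The trade-off is that the server can no longer collate or case-fold the column as text; it only sees bytes.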
It is also worth reading The BINARY and VARBINARY Types from the MySQL manual as this details important information such as padding.
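To make the padding point concrete: the manual says values stored in a BINARY(N) column are right-padded with 0x00 up to the column length. The function below is only a Python imitation of that rule, not MySQL code; the name binary_column_store is hypothetical, and raising on over-long values is a simplification of what the server does.

```python
def binary_column_store(value: bytes, width: int) -> bytes:
    """Imitate storing into BINARY(width): right-pad short values with 0x00."""
    if len(value) > width:
        # Simplification: treat an over-long value as an error.
        raise ValueError("value is longer than the column width")
    return value.ljust(width, b"\x00")


stored = binary_column_store(b"abc", 10)

# The padding bytes are part of the stored value, so a byte-for-byte
# comparison against the unpadded input no longer matches.
matches_original = stored == b"abc"
```

VARBINARY does not pad, which is one reason to prefer it for variable-length values.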
Summary:
There is no technical difference, as one is a synonym for the other. In my opinion it makes logical sense to store binary strings in data types that would normally hold a string, using CHARACTER SET BINARY, and to store byte arrays, streams, etc. that cannot be represented by transforming the data through a character set in BINARY fields.

Related

CGI script having trouble sending emojis characters from database

I am storing emojis in a MySQL database expressed as UTF-8 bytes, like "\xf0\x9f\x98\x80", which is the Unicode character U+1F600 GRINNING FACE.
It is fine if I copy and paste it in and test it like this
print MAIL "Subject: \xf0\x9f\x98\x80\n";
It works and sends me the emoji.
But if I tell the script to get it from the database and plug it in like this:
print MAIL "Subject: $subject\n";
It will give me the subject: \xf0\x9f\x98\x80
What do I need to do? I thought if I was storing it in bytes it would see it as plain text and it would work.
It seems most likely that you have added the value to the database wrongly.
If you use Perl code and write the string '\xf0\x9f\x98\x80' to the database (note the single quotes) then you will get exactly the symptoms you describe. Your database will contain the sixteen-character ASCII string \xf0\x9f\x98\x80 and it will be displayed as such.
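The same distinction can be reproduced outside Perl. The sketch below uses Python, where a raw string plays the role of Perl's single quotes; it is only meant to show why the stored value is sixteen characters long rather than four bytes.

```python
# Escape sequences left uninterpreted: backslash, 'x', 'f', '0', ... as text.
# This is the Python counterpart of Perl's single-quoted '\xf0\x9f\x98\x80'.
literal = r"\xf0\x9f\x98\x80"

# Escape sequences interpreted: four actual bytes.
raw = b"\xf0\x9f\x98\x80"

print(len(literal), len(raw))

# Only the four real bytes decode to the emoji.
emoji = raw.decode("utf-8")
```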
You shouldn't need to deal with the UTF-8 encoded bytes; it is far better to specify the Unicode code point either by name or by number.
All of these produce the same Perl UTF-8-encoded string:
$s = "\N{U+1F600}";
$s = "\N{GRINNING FACE}";
$s = "\x{1F600}";
The corresponding encoded bytes are irrelevant to the programmer, but if you must, you can use the Encode module like this:
use Encode 'decode_utf8';
$s = decode_utf8 "\xf0\x9f\x98\x80";
Another way is to enter the character directly into your code. You will need the use utf8 pragma to indicate to the compiler that the source contains non-ASCII UTF-8-encoded characters, like this:
use utf8;
$s = "😀";
All of these assignments to $s will produce exactly the same result, and the values will compare as being equal using eq
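For comparison, here is a sketch of the same idea in Python rather than Perl: the character written by name, by number, built with chr, and decoded from its UTF-8 bytes. The variable names are arbitrary.

```python
by_name = "\N{GRINNING FACE}"       # by Unicode character name
by_number = "\U0001F600"            # by code point, as an escape
by_chr = chr(0x1F600)               # by code point, computed
from_bytes = b"\xf0\x9f\x98\x80".decode("utf-8")  # from the encoded bytes

# All four are the same one-character string.
all_equal = by_name == by_number == by_chr == from_bytes
```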
On the database side you need a MySQL column with a four-byte UTF-8 character set, for instance
column VARCHAR(50) CHARACTER SET utf8mb4
Note that the character set must be utf8mb4. If you use the earlier utf8 you are restricted to three-byte encodings, whereas most emoji, including U+1F600, need four bytes.
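A quick way to see which characters need the fourth byte is to measure their UTF-8 encoded length. This is a small Python sketch; utf8_len and fits_three_byte_utf8 are names invented for the example, not MySQL or library functions.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    """True if every character fits in the three-byte utf8 character set."""
    return all(utf8_len(ch) <= 3 for ch in text)


pound = utf8_len("\u00a3")         # POUND SIGN
smiley = utf8_len("\u263a")        # WHITE SMILING FACE, an older BMP symbol
grinning = utf8_len("\U0001F600")  # GRINNING FACE
```

A string for which fits_three_byte_utf8 returns False needs a utf8mb4 column.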

Proper way to store BCrypt Hashes on MySQL

Searching for the proper way to store BCrypt hashes in MySQL, I found this question, and it only made me more confused.
The accepted answer points out that we should use:
CHAR(60) BINARY or BINARY(60)
But other people in the comments argue that we should instead use:
CHAR(60) CHARACTER SET latin1 COLLATE latin1_bin
or even:
COLLATE latin1_general_cs
I am not a database specialist, so could anyone explain to me the difference between all these options and which one is truly better for storing BCrypt hashes?
My answer is along the lines of "what is proper", rather than "what will work".
Do not use latin1. Sure, it might work, but it is ugly to claim that the hashed string is text when it is not.
Ditto for saying CHAR....
Simply say BINARY(...) if fixed length or VARBINARY(...) if it can vary in length.
However, there is a gotcha... Whose BCrypt are you using? Does it return binary data? Or a hex string? Or maybe even Base64?
My above answer assumed it returns binary data.
If it returns 60 hex digits, then store UNHEX(60_hex_digits) into BINARY(30) so that it is packed smaller.
If it is Base64, then CHARACTER SET ascii COLLATE ascii_bin would be "proper". (latin1 with a case-sensitive collation would also work.)
If it is binary, then, again, BINARY(60) is the 'proper' way to do it.
The link you provided looks like Base64, but is it? And is it up to 60 characters? Then I would use
VARCHAR(60) CHARACTER SET ascii COLLATE ascii_bin
And explicitly state the charset/collation for the column, thereby overriding the database and/or table "defaults".
All the Base64 chars (and $) are ascii; no need for a more complex charset. Collating with a ..._bin means "compare bytes exactly"; more specifically "don't do case folding". Since Base64 depends on distinguishing between upper and lower case letters, you don't want case folding.
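The points above can be checked with a few lines of Python. This is only a sketch: the sample string is a placeholder with the shape of a BCrypt result, not a real hash, and the helper names are invented. The lower-casing comparison is a rough stand-in for a case-insensitive collation.

```python
# Placeholder shaped like a BCrypt string (prefix + 22 + 31 chars). NOT a real hash.
sample = "$2b$12$" + "AbCdEfGhIjKlMnOpQrStUv" + "wXyZ0123456789AbCdEfGhIjKlMnOpQ"


def fits_ascii_column(value: str, max_len: int = 60) -> bool:
    """Would the value fit VARCHAR(max_len) CHARACTER SET ascii?"""
    return value.isascii() and len(value) <= max_len


def equal_bin(a: str, b: str) -> bool:
    """Byte-exact comparison, as a ..._bin collation would do."""
    return a.encode("ascii") == b.encode("ascii")


def equal_case_folded(a: str, b: str) -> bool:
    """Rough stand-in for a case-insensitive collation."""
    return a.lower() == b.lower()


# Two different strings that differ only in letter case.
other = sample.swapcase()

# The hex variant: 60 hex digits pack into 30 bytes, like UNHEX() would.
hex_digits = "ab" * 30
packed = bytes.fromhex(hex_digits)
```

With case folding, sample and other would be treated as equal even though they are different hashes, which is why the _bin collation matters.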

Excel CSV String Not Fully Uploading To Excel

I have this string in Excel (I've UTF-encoded it). When I save it as CSV and import it into MySQL, I get only the text shown below. I know it's probably a charset issue, but could you explain why? I'm having difficulty understanding it.
In Excel Cell:
PARTY HARD PAYDAY SPECIAL â UPTO £40 OFF EVENT PACKAGES INCLUDING HOTTEST EVENTS! MUST END SUNDAY! http://bit.ly/1Gzrw9H
Ends up in DB:
PARTY HARD PAYDAY SPECIAL
The column is defined as VARCHAR(10000) with the utf8_general_ci collation.
MySQL's utf8 character set does not support the full Unicode range. Some characters need four bytes in UTF-8, and these cannot be stored in regular utf8. I am assuming that on import the value is being truncated after SPECIAL because MySQL cannot process or store the character that follows it.
To handle full UTF-8, including four-byte characters, you will have to switch to utf8mb4.
This is from the mysql documentation:
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters...
You can read more in the MySQL documentation (dev.mysql).
Also, here is a great detailed explanation of the regular utf8 issues in MySQL and how to switch to utf8mb4.
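Before re-importing, it can help to find out which characters in a value actually fall outside the BMP, since those are the ones the three-byte utf8 cannot hold. A minimal Python sketch, with a made-up helper name:

```python
def supplementary_chars(text: str) -> list[tuple[int, str]]:
    """List (position, code point) for every character outside the BMP."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF  # beyond the Basic Multilingual Plane
    ]


with_emoji = supplementary_chars("ok \U0001F600")
bmp_only = supplementary_chars("UPTO \u00a340 OFF")  # contains a pound sign
```

An empty result means every character fits in the BMP, so the cause of a truncation would have to be looked for elsewhere, for example in the encoding the CSV file was actually saved in.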

What is the difference between plaintext and binary data?

Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes, isn't it similar to plaintext which is just a series of bytes interpreted as characters? So, can plaintext store the same data formats / protocols as binary?
Plain text is human readable; a binary file is usually unreadable by a human, since it's composed of both printable and non-printable characters.
Try to open a JPEG file with a text editor (e.g. Notepad or vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand; a binary file is not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary file organized in byte sequences that a particular computer system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning, is that this system must be a "simple" text editor/viewer, usually implying ASCII encoding. But, really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, vs. the vi text editor reading an ASCII-encoded file.
Within a programming context, your programming environment (comprising the OS, the system APIs, and your language's capabilities) defines both a set of "text" characters and a set of encodings it is able to read and convert to these "text" characters. Please note that this does not necessarily imply ASCII, English, or 8 bits; as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certain control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagically append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, \r\n, or \n\r. Sometimes this is OS-dependent (*nix likes \n, while Windows likes \r\n). The line endings may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects
probably some other conversions should be added to this list, but I can't think of any more at this moment
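Python is one environment where the line-ending adjustment is easy to observe. The sketch below reads the same bytes once as binary and once through a text wrapper with universal newlines enabled (newline=None); the in-memory buffers stand in for a file.

```python
import io

data = b"one\r\ntwo\nthree\rfour"

# Binary mode: the bytes come back exactly as they were written.
as_binary = io.BytesIO(data).read()

# Text mode with universal newlines: \r\n and lone \r are both
# translated to \n, and the bytes are decoded into a str.
as_text = io.TextIOWrapper(
    io.BytesIO(data), encoding="ascii", newline=None
).read()
```

Passing newline="" instead would keep the original line endings while still decoding the bytes to text.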
Technically nothing. Plain text is a form of binary data. However, a major difference is how values are stored. Think of how an integer might be stored. In binary data it would use a two's complement format, probably taking 32 bits of space. In text format the number would instead be stored as a series of digit characters. So the number 50 would be stored as 0x32 (padded to take up 32 bits) in binary, but as the characters '5' and '0' (bytes 0x35 0x30 in ASCII) in plain text.
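The two representations of 50 can be produced side by side in Python. This sketch assumes a little-endian 32-bit signed integer for the binary form; other byte orders and widths are equally valid "binary" choices.

```python
import struct

# Binary form: 32-bit signed, little-endian. The low byte is 0x32 (= 50),
# followed by three zero bytes of padding.
as_binary = struct.pack("<i", 50)

# Text form: the digit characters '5' and '0'.
as_text = str(50).encode("ascii")

# Two's complement shows up with negative numbers: -1 is all bits set.
minus_one = struct.pack("<i", -1)

# Each form needs its own way of reading the value back.
from_binary = struct.unpack("<i", as_binary)[0]
from_text = int(as_text.decode("ascii"))
```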

MySQL encoding issue

For some reason, my MySQL table is converting single and double quotes into strange characters. E.g.
"aha"
is changed into:
“ahaâ€
How can I fix this, or detect this in PHP and decode everything??
The encodings of your MySQL client and your server don't match. Use SET NAMES to match the character set of the connection to the one used in your PHP files.
It seems that the UTF-8 encoded string “aha” (binary 0xE2809C 0x61 0x68 0x61 0xE2809D) is interpreted with Windows-1252. There this byte sequence represents the character sequence “ahaâ€.
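The misinterpretation can be reproduced in a few lines of Python, which may help when checking whether stored data is damaged in this particular way. This is a sketch using Python's cp1252 codec as the Windows-1252 table; since that codec has no character at byte 0x9D, errors="replace" is used for the misread.

```python
original = "\u201caha\u201d"  # the curly-quoted string: “aha”
utf8_bytes = original.encode("utf-8")

# Reading the UTF-8 bytes as Windows-1252 turns each curly quote into
# three characters. Byte 0x9D has no mapping, so it becomes U+FFFD here.
misread = utf8_bytes.decode("cp1252", errors="replace")

# When every byte does map to a cp1252 character, the damage is
# reversible by undoing the wrong decode. Example: the pound sign.
pound_mojibake = "\u00a3".encode("utf-8").decode("cp1252")
repaired = pound_mojibake.encode("cp1252").decode("utf-8")
```

Repairing data after the fact like this is a last resort; matching the connection character set to the application's encoding prevents the problem in the first place.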