UTF-8 supplementary characters in MySQL table names?

What I'm doing
I'm working on a chat application (written in PHP), which allows users to create their own chat rooms. A user may name a chat room anything they like and this name is passed on to the MySQL database in a prepared statement as the table name for that respective chat room.
It is understood that there is no login / security measure for this application, and the table holding the chat log simply consists of records with the user-submitted text and a timestamp (2 columns, not counting an AUTO_INCREMENT primary key).
What I'm facing
Given the simple nature of this application, I don't intend to change the structure of the database, but I'm now running into an issue when a user enters emoji (or other supplementary characters) as the name for their chat room. Passing such a name to the database as-is converts the characters into question marks, due to the way MySQL handles identifiers internally (https://dev.mysql.com/doc/refman/5.7/en/identifiers.html):
Identifiers are converted to Unicode internally. [..] ASCII NUL (U+0000) and supplementary characters (U+10000 and higher) are not permitted in quoted or unquoted identifiers.
What should / can I do to avoid this problem? Is there a best practice for "escaping" / "sanitizing" user input in a situation like this? I put the respective words in quotation marks because I know it is not the proper / typical way of handling user input in a database.
What I'm trying
An idea I had was using rawurlencode() to literally break down the supplementary characters into unique sequences that I can pass on to the database and still be sure that a chat room with the name 🤠 is not confused with 🤔. However, I have the impression based on this answer that this is not good practice: https://stackoverflow.com/a/8700296/1564356.
Tackling this issue another way, I thought of base64_encode(), but again based on this answer it is not an ideal approach: https://stackoverflow.com/a/24175941/1564356. I'm wondering however, if in this case it would still be an acceptable one.
A third option would be to construct the database differently, issuing unique identifiers as the table names for each respective chat room and storing the utf8mb4-compatible string in a column. A second table with the actual chat log could then be linked with a foreign key. This, however, complicates the structure of the database and doubles the number of tables required. I'm not a fan of this approach.
Any ideas? Thanks!

Dynamically created tables, regardless of their naming scheme, are very rarely a sensible design choice. They make every single query you write more complicated, and eliminate a large part of the usefulness of SQL as a language and relational databases as a concept.
Furthermore, allowing users to directly choose table names sounds like a security disaster waiting to happen. Prepared statements will not save you in any way, because the table name is considered part of the query, not part of the data.
Unless you have a very compelling reason for such an unusual design, I would strongly recommend changing to have a single table of chat_logs, with a column of chat_room_id which references a chat_rooms table. The chat_rooms table can then contain the name, which can contain any characters the user wants, along with additional data about the room - creation date, description, extra features, etc. This approach requires exactly 2 tables, however many chat rooms are created.
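A minimal sketch of that two-table layout (all names and column sizes here are illustrative, chosen to match the text + timestamp structure described in the question):

-- Room names live in an ordinary data column, so any utf8mb4
-- character (including emoji) is allowed.
CREATE TABLE chat_rooms (
    chat_room_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(255) NOT NULL,
    created_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE chat_logs (
    chat_log_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    chat_room_id INT UNSIGNED NOT NULL,
    message      TEXT NOT NULL,
    posted_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (chat_room_id) REFERENCES chat_rooms (chat_room_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Creating a room is then an ordinary INSERT with a bound parameter, so prepared statements actually protect the name, and fetching a room's log is a simple WHERE chat_room_id = ? query.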
If you really think you need the separate table for each chat room, because you're trying to do some clever partitioning / sharding, I would still recommend having a chat_rooms table, and then you can simply name the tables after the chat_room_id, e.g. chat_logs_1, chat_logs_2, etc. This approach requires exactly one more table than your current approach, i.e. num_tables = num_chat_rooms + 1.

CHARACTER SET utf8mb4 is needed end-to-end for MySQL in order to store Emoji and some Chinese characters.
In this you will find more on "best practice", plus debugging tips for when you fail to follow it. It's not just the column charset; it is also the client's charset.
Do not use any encode/decode routines; it only makes the mess worse.
It is best to put the actual characters in MySQL tables, not Unicode strings like U+1F914 or \u1F914, etc.
🤔 is 4 bytes of hex F09FA494 when encoded in UTF-8 (aka MySQL's utf8mb4).
And, I agree with IMSoP; don't dynamically create tables.
SQL Injection should be countered with mysqli_real_escape_string (or equivalent, depending on the API), not urlencode or base64.
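As a rough illustration of "end-to-end" (the table name is made up; in PHP the connection side is usually handled with mysqli_set_charset('utf8mb4') or the charset option in the PDO DSN, which amounts to the same thing as SET NAMES):

-- Column/table charset: convert an existing table to utf8mb4.
ALTER TABLE chat_log CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Client/connection charset: must also be utf8mb4, otherwise 🤔
-- (bytes F0 9F A4 94) gets mangled on the way in or out.
SET NAMES utf8mb4;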

Related

Sort customers by name according to country/language specific collation rules in a MySQL database

We have customers from different countries using our Grails web application in their own native language (Swedish, Norwegian, Polish, German, Spanish etc) and they save data that is local to them. An example is a Customer-table having columns for first name and last name that need to be sorted as expected in the local language. This means that:
A Swedish customer wants to sort the list of customers according to the collation utf8mb4_swedish_ci, which will sort a/o/å/ä/ö as expected. Örjan will be sorted last and not in the same place as Olof.
A German customer wants to sort the list of customers according to the collation utf8mb4_german2_ci, which will sort ß/ss/u/ü as expected.
Similar cases for other languages like Norwegian, Polish etc.
All our columns have the character set utf8mb4 to be able to support the storage of characters from multiple languages.
Previously we used utf8mb4_swedish_ci as the collation for all columns that we could sort on, but because we are getting customers from other countries and languages and are moving to an international market, we need to implement changes to support customers globally.
We are investigating the following solutions:
Use utf8mb4_unicode_ci as collation in the database but add a collate expression like “order by firstname collate utf8mb4_swedish_ci” on all our queries depending on what language/location is used in the application.
Use multiple columns in the database that have the target collation, like "firstname_swedish" (utf8mb4_swedish_ci) and "firstname_german" (utf8mb4_german2_ci), or reference a specific table with different columns.
Implement sorting in the application layer instead of the database.
Which solutions above would be the best approach regarding performance, implementation time and maintainability?
Let's try to summarize.
I would immediately give up the idea of sorting records in the application layer. All table data would have to be transferred from the database to the application, which will quickly become a bottleneck, and it also requires extra programming.
Applying a collation to a particular SELECT query requires a minimal amount of programming, but MySQL has to copy all records to a temporary table, sort it with the given collation and only then take, for example, the first 30 records. This is still done far more efficiently than it would be by your application, but as the table grows it takes more and more time and memory. However, for, say, a few thousand customers this is a perfectly acceptable approach.
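For example (table and column names are assumptions based on the question's description):

-- Swedish users get Swedish ordering ...
SELECT first_name, last_name
FROM   customer
ORDER  BY last_name COLLATE utf8mb4_swedish_ci
LIMIT  30;

-- ... German users get German ordering, from the same utf8mb4 column.
SELECT first_name, last_name
FROM   customer
ORDER  BY last_name COLLATE utf8mb4_german2_ci
LIMIT  30;

The COLLATE clause in ORDER BY prevents MySQL from using an index for the sort, which is exactly the temporary-table/filesort cost described above.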
If you expect more customers and wish to optimize performance and server load, you can use additional columns, indexed with the required collations: name_swe, name_ger, etc. Your application stores every customer's name in all of these columns and selects only from the one which has the required collation. This requires additional programming and redundant storage, but you read only the required data in the required order, without temporary tables or additional processing.
Here are some thoughts on how to make these additional columns transparent to your application:
you can fill the columns with different collations automatically using MySQL generated columns or triggers - thus the application has to insert/update data only for the "name" column (see the sketch after this list)
when performing SELECT queries you can alias the "name_swe" / "name_ger" column as "name" - thus the application has to read only a single result column
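A rough sketch of the generated-column variant (MySQL 5.7+; column names and lengths are illustrative, and on older versions BEFORE INSERT/UPDATE triggers that copy the value achieve the same effect):

-- One extra stored column per collation, kept in sync by MySQL itself,
-- each with its own index so the sort can be satisfied from the index.
ALTER TABLE customer
  ADD COLUMN name_swe VARCHAR(191)
      CHARACTER SET utf8mb4 COLLATE utf8mb4_swedish_ci
      GENERATED ALWAYS AS (name) STORED,
  ADD COLUMN name_ger VARCHAR(191)
      CHARACTER SET utf8mb4 COLLATE utf8mb4_german2_ci
      GENERATED ALWAYS AS (name) STORED,
  ADD INDEX idx_name_swe (name_swe),
  ADD INDEX idx_name_ger (name_ger);

-- The application writes only "name" and reads the alias back as "name":
SELECT name_swe AS name FROM customer ORDER BY name_swe LIMIT 30;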
Another version of this approach is to split the customers table into several tables by customer country, each with the required collation. You can UNION these tables to select all customers.
Some DBMSs can probably have multiple indexes with different collations on the same column, which would solve the problem with minimal effort, but as far as I can see MySQL does not allow this.
I personally would start by setting the collation per SELECT query and take performance optimization measures only when they become necessary.

Make MySQL table FIXED by splitting TEXT field into chunks of type CHAR(255)

A FIXED MySQL table has well-known performance advantages over a DYNAMIC table.
There is a table tags with a single description TEXT field. An idea is to split this field into 4-8 CHAR(255) fields. For INSERT/UPDATE queries, just divide the description into chunks (PHP's str_split() function). That would make the table FIXED.
Has anybody practiced this technique? Is it worth it?
OK, this is done in practice, but where it is done I have only seen it done for historical reasons, such as a particular client-server model that requires it, or for legacy reports where the segments are de facto fields in the layout.
The examples I have seen were for free-form text entries (remarks, notes, contact logs) in insurance/collections applications and the like, where formatting on a printed report was important or there was a need to avoid any confusion in post-processing to dress the format when multiple platforms are involved (\r\n vs \n and EBCDIC vertical tabs).
So not generally for space/performance/recovery purposes.
If the row is "mostly" this field, an alternative would be to create a row for each segment and add a low-range sequence number to the key.
In this way you would have only 1 row for short values and up to 8 for long. Consider your likely statistics.
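A rough sketch of that row-per-segment layout (MyISAM assumed, since FIXED/DYNAMIC are MyISAM row formats; all names are illustrative):

-- One row per 255-character segment instead of 4-8 CHAR(255) columns;
-- short descriptions take one row, long ones up to 8.
CREATE TABLE tag_description (
    tag_id INT UNSIGNED NOT NULL,
    seq    TINYINT UNSIGNED NOT NULL,  -- segment order, 1..8
    chunk  CHAR(255) NOT NULL,
    PRIMARY KEY (tag_id, seq)
) ENGINE=MyISAM ROW_FORMAT=FIXED;

-- Reassembling the description; note the trailing-space caveat below,
-- since CHAR values come back with trailing spaces stripped.
SELECT GROUP_CONCAT(chunk ORDER BY seq SEPARATOR '') AS description
FROM   tag_description
WHERE  tag_id = 42;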
Caveats :
Always be acutely aware of MySQL indexes dropping trailing spaces. Concatenating these should take this into account if used in an index.
This is not a recommendation, but "tags" sounds like a candidate for a single varchar field for full text indexing. If the data is so important that forensic recovery is a requirement, then normalising the model to store the tags in a separate table may be another way to go.

MySQL can I use regex to denote the type of a row?

I have a 'User' table with a column:
USERNAME varchar(40)
A 'User' can either be a management user or a standard user. Management users always have purely numeric usernames.
I can then use a regex to pull out all the users with numeric names to get the management users, and vice versa to get the non-management users.
Is it okay practice to therefore use regex in my select queries to get this information, or should I instead add an additional column:
MANAGEMENT TINYINT(1)
Which stores whether a user is a management user or not.
What is best practice and why?
It is really not very safe or efficient to rely on this sort of "hidden" meaning in a field. Add a new field. Really. There is no sensible reason not to.
I would also recommend reading this article on Falsehoods Programmers Believe About Names.
Some examples:
9: People’s names are written in ASCII.
15: People’s names do not contain numbers.
24: My system will never have to deal with names from China.
25: Or Japan.
26: Or Korea.
Can your regex handle all of those? For sure? Every time?
It is generally preferable to design the schema to be normalized. This includes not having multiple columns that would store the same information. However, the username and being a management user are definitely distinct pieces of information, and you need both.
You can use the BIT type for the new column.
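For instance, with the TINYINT(1) flag suggested in the question (BIT works the same way; the column name matches the constraint example below):

-- Store the role explicitly instead of inferring it from the username.
ALTER TABLE User
  ADD COLUMN is_management_user TINYINT(1) NOT NULL DEFAULT 0;

-- Queries become a plain comparison rather than a regex over every row.
SELECT USERNAME FROM User WHERE is_management_user = 1;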
In case you want to enforce particular formats of user name, the standard database solution would be to define a check constraint as follows:
ALTER TABLE YourTable ADD CONSTRAINT YourConstraint CHECK ( is_management_user = 0 OR ...)
(where the ... would include REGEXP_LIKE on Oracle, cast to an integer in TSQL or something like that).
However, MySQL does not enforce CHECK constraints, and you have to emulate them using triggers if you consider the form of the USERNAME somehow important for the security or operation of your application - which, I would hope, it really isn't.
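If the username format really does need to be enforced on a MySQL version that ignores CHECK constraints, the usual emulation is a pair of BEFORE INSERT / BEFORE UPDATE triggers; a sketch of the insert-side one (names follow the examples above):

DELIMITER //
CREATE TRIGGER trg_user_check_username
BEFORE INSERT ON User
FOR EACH ROW
BEGIN
    -- Reject management users whose usernames are not purely numeric.
    IF NEW.is_management_user = 1 AND NEW.USERNAME NOT REGEXP '^[0-9]+$' THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Management usernames must be numeric';
    END IF;
END//
DELIMITER ;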

What is the Ideal Encoding Setting for Database Supporting Multi-language?

According to MySQL manual, MySQL includes character set support that enables us to store data using a variety of character sets and perform comparisons according to a variety of collations. Character sets can be specified at four different levels:
Server
Database
Table
Column
Assuming I have a database that stores the following:
User ID (INT)
Email Address (VARCHAR 50)
User profile (TEXT - multi-language)
System flag (CHAR 1 - a-z only)
Between Latin1 and UTF-8, which should I choose for the four different levels to achieve the best possible performance?
ADD NOTE: This is just a simplified example. In a real scenario, I would expect several columns storing (a-zA-Z0-9) text and one or two columns storing multilingual text. That is why I am concerned about performance.
ADD NOTE2: I am referring to a database that stores millions of records. That is why performance matters to me.
I might be wrong, but in my experience the character set of your choice doesn't really have a big impact on your overall database performance (if you start mixing character sets across different tables, now that might affect query performance).
If you want to support multiple languages, go for utf8 (or even utf16).
You should choose the same encoding for the whole database. Otherwise you as a developer will be confused later. And since the text is multilingual, that only leaves utf8 as the encoding of choice.
Note that you can choose an encoding for the database connection, too.
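To illustrate the per-column choice with the question's own columns (utf8mb4 shown for the multilingual text since it covers all of Unicode, latin1 for the columns that only ever hold single-byte data; all names are illustrative):

CREATE TABLE users (
    user_id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    email_address VARCHAR(50) CHARACTER SET latin1 NOT NULL,  -- single-byte is enough
    user_profile  TEXT CHARACTER SET utf8mb4,                 -- multi-language text
    system_flag   CHAR(1) CHARACTER SET latin1 NOT NULL       -- a-z only
) DEFAULT CHARSET=utf8mb4;

-- And, as noted above, the connection encoding can be set separately:
SET NAMES utf8mb4;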

MySQL - Necessity of surrounding table name with ` (tickmarks)?

In MySQL, is it necessary to surround table names with tickmarks? I've often seen code snippets that use the tickmarks, but I've not encountered a case where omitting them behaves any differently from including them.
It looks like the framework I work with automatically wraps reserved-word key names in tickmarks.
The tickmarks are there to distinguish the table (or column!) names from SQL reserved words. For example, without the tickmarks, a table or column named update could clash with the SQL command update.
It's best practice not to name tables and columns with reserved words in the first place, of course, but if you port an application to a new database engine (or possibly even a new MySQL version), it might have a different set of reserved words, so you may have just broken your app anyway. The tickmarks are your insurance against this sort of oops.
Tickmarks are necessary when table or column names use MySQL keywords.
I once worked at a job where they had a database of "locations". They called it states, and had everything in tables, with each table named by its state code, so Missouri was MO and Arkansas was AR.
Several state codes are also reserved words: OR (Oregon), IN (Indiana), and ON (Ontario) ... I know, not exactly a state.
Even though I think there were better ways of organizing their data, there are cases where people want to name things after a reserved word. This is an example where the `` marks kept the code working.
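For example, with the state-code tables from that anecdote:

-- OR and IN are reserved words; without the tickmarks these statements
-- are syntax errors.
SELECT * FROM `OR`;   -- the Oregon table
SELECT * FROM `IN`;   -- the Indiana table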
It's only necessary if your table name is composed of non-"word" characters (ie: anything besides letters, numbers, and underscores), or happens to be the same as a keyword in MySQL. It's never "bad" to quote the name, but it's bad not to when you have to, so many programs that generate MySQL queries will be conservative and quote the name -- it's just easier that way.