Make MySQL table FIXED by splitting TEXT field into chunks of type CHAR(255)

A FIXED MySQL table has well-known performance advantages over a DYNAMIC table.
There is a table tags with only one description TEXT field. An idea is to split this field into 4-8 CHAR(255) fields. For INSERT/UPDATE queries, just divide the description into chunks (PHP function str_split()). That will make the table FIXED.
Has anybody practiced this technique? Is it worth it?
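For what it's worth, the chunking step the question describes (PHP's str_split()) is trivial in any language. Here is a Python sketch of it; the 8-column layout and the split_into_chunks name are my own illustration, not anything prescribed by MySQL:

```python
def split_into_chunks(description, size=255, max_chunks=8):
    """Split text into fixed-size chunks and pad out to max_chunks columns.

    Note: this splits by *characters*; CHAR(255) is also measured in
    characters, but with a multi-byte charset the byte length per column
    will be larger than 255.
    """
    chunks = [description[i:i + size] for i in range(0, len(description), size)]
    if len(chunks) > max_chunks:
        raise ValueError("description too long for the fixed layout")
    # Empty strings fill the unused CHAR(255) columns.
    return chunks + [""] * (max_chunks - len(chunks))

parts = split_into_chunks("x" * 600)
# 600 chars -> two full 255-char chunks, one 90-char chunk, five empty columns
```

One caveat from the answer below applies here too: CHAR columns silently drop trailing spaces, so a chunk boundary that lands just after a space will not round-trip exactly.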

OK, this is done in practice, but where I have seen it done it was for historical reasons, such as a particular client-server model that requires it, or for legacy reports where the segments are de facto fields in the layout.
The examples I have seen were for free-form text entries (remarks, notes, contact log) in insurance/collections applications and the like, where formatting on a printed report was important, or where there was a need to avoid any confusion in post-processing to dress the format when multiple platforms are involved (\r\n vs \n and EBCDIC vertical tabs).
So not generally for space/performance/recovery purposes.
If the row is "mostly" this field, an alternative would be to create a row for each segment and add a low-range sequence number to the key.
In this way you would have only 1 row for short values and up to 8 for long. Consider your likely statistics.
Caveats :
Always be acutely aware of MySQL indexes dropping trailing spaces. Concatenating these should take this into account if used in an index.
This is not a recommendation, but "tags" sounds like a candidate for a single varchar field for full text indexing. If the data is so important that forensic recovery is a requirement, then normalising the model to store the tags in a separate table may be another way to go.

Related

UTF-8 supplementary characters in MySQL table names?

What I'm doing
I'm working on a chat application (written in PHP), which allows users to create their own chat rooms. A user may name a chat room anything they like and this name is passed on to the MySQL database in a prepared statement as the table name for that respective chat room.
It is understood that there is no login/security measure for this application, and the table holding the chat log is composed of records with simply the user-submitted text and a timestamp (2 columns, not counting an AUTO_INCREMENT primary key).
What I'm facing
Given the simple nature of this application, I don't have the intention of changing the structure of the database, but I'm now running into the issue when a user enters emoji (or other supplementary characters) as the name for their own chat room. Passing such information on to the database as is will convert the characters into question marks, due to the way MySQL works internally (https://dev.mysql.com/doc/refman/5.7/en/identifiers.html):
Identifiers are converted to Unicode internally. [..] ASCII NUL (U+0000) and supplementary characters (U+10000 and higher) are not permitted in quoted or unquoted identifiers.
What should / can I do to avoid this problem? Is there a best practice for "escaping" / "sanitizing" user input in a situation like this? I put the respective words in quotation marks because I know it is not the proper / typical way of handling user input in a database.
What I'm trying
An idea I had was using rawurlencode() to literally break down the supplementary characters into unique sequences that I can pass on to the database and still be sure that a chat room with the name 🤠 is not confused with 🤔. However, I have the impression based on this answer that this is not good practice: https://stackoverflow.com/a/8700296/1564356.
Tackling this issue another way, I thought of base64_encode(), but again based on this answer it is not an ideal approach: https://stackoverflow.com/a/24175941/1564356. I'm wondering however, if in this case it would still be an acceptable one.
A third option would be to construct the database in a different way by issuing unique identifiers as the table names for each respective chat room and storing the utf8mb4 compatible string in a column. Then a second table with the actual chat log can be linked with a foreign key. This however complicates the structure of the database and doubles the amount of tables required. I'm not a fan of this approach.
Any ideas? Thanks!
Dynamically created tables, regardless of their naming scheme, are very rarely a sensible design choice. They make every single query you write more complicated, and eliminate a large part of the usefulness of SQL as a language and relational databases as a concept.
Furthermore, allowing users to directly choose table names sounds like a security disaster waiting to happen. Prepared statements will not save you in any way, because the table name is considered part of the query, not part of the data.
Unless you have a very compelling reason for such an unusual design, I would strongly recommend changing to have a single table of chat_logs, with a column of chat_room_id which references a chat_rooms table. The chat_rooms table can then contain the name, which can contain any characters the user wants, along with additional data about the room - creation date, description, extra features, etc. This approach requires exactly 2 tables, however many chat rooms are created.
If you really think you need the separate table for each chat room, because you're trying to do some clever partitioning / sharding, I would still recommend having a chat_rooms table, and then you can simply name the tables after the chat_room_id, e.g. chat_logs_1, chat_logs_2, etc. This approach requires exactly one more table than your current approach, i.e. num_tables = num_chat_rooms + 1.
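A minimal sketch of the recommended two-table layout, shown with SQLite so it is runnable as-is (table and column names follow the answer's suggestion; the exact columns are illustrative). The point is that the emoji-laden room name becomes ordinary data in a column, not an identifier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chat_rooms (
    chat_room_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name         TEXT NOT NULL,          -- any characters, including emoji
    created_at   TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE chat_logs (
    chat_log_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    chat_room_id INTEGER NOT NULL REFERENCES chat_rooms(chat_room_id),
    message      TEXT NOT NULL,
    sent_at      TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# The room name is ordinary *data* now, so prepared-statement placeholders
# protect it and no identifier restrictions apply.
conn.execute("INSERT INTO chat_rooms (name) VALUES (?)", ("🤠 saloon",))
room_id = conn.execute(
    "SELECT chat_room_id FROM chat_rooms WHERE name = ?", ("🤠 saloon",)
).fetchone()[0]
conn.execute(
    "INSERT INTO chat_logs (chat_room_id, message) VALUES (?, ?)",
    (room_id, "howdy"),
)
```

In MySQL the same layout works provided the name column and the connection both use utf8mb4, as the next answer explains.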
CHARACTER SET utf8mb4 is needed end-to-end in MySQL in order to store emoji and some Chinese characters.
In this you will find more on "best practice", plus debugging tips for when you fail to follow it. It's not just the column charset; it is also the client's charset.
Do not use any encode/decode routines; it only makes the mess worse.
It is best to put the actual characters in MySQL tables, not Unicode strings like U+1F914 or \u1F914, etc.
🤔 is 4 bytes of hex F09FA494 when encoded in UTF-8 (aka MySQL's utf8mb4).
And, I agree with IMSoP; don't dynamically create tables.
SQL Injection should be countered with mysqli_real_escape_string (or equivalent, depending on the API), not urlencode or base64.
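The byte-level claim above is easy to verify in any language; a quick Python check:

```python
# "🤔" is U+1F914, a supplementary character (above U+FFFF).
s = "🤔"
encoded = s.encode("utf-8")

print(encoded.hex())          # f09fa494 -- the same 4 bytes utf8mb4 stores
print(len(s), len(encoded))   # 1 code point, 4 bytes
```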

Would it be effective to store text of different sizes in different mysql tables?

I am building a database system that will be storing large amounts of text.
The text will be stored in a table with an id column and one varchar/text column.
I was wondering if it would be more effective to use a single table with a large varchar, or multiple tables, each with a different text type.
The multiple-table option would contain several different tables, each using a different kind of text column (tinytext, text, etc.), and the system would store each text in the most appropriate one based on its length.
I am concerned with both speed and storage space, and would like a solution that balances both.
Edit -
The text table will not be searched on, but it may be joined (usually an id number will be determined, then a single row accessed).
Size will typically be smaller than text, but some will be large enough to require mediumtext. I doubt that longtext will be needed.
Keep it simple! Seriously.
Unless you have an overwhelming majority of text items that are 255 characters or shorter, just use TEXT or LONGTEXT. Spend your time doing interesting things with your text, not fiddling around with complex data structures. Get your project done now; optimize later.
Disk drives and RAM are getting cheaper much faster than your time is.
If your app requirements absolutely need you to use varchar data, for its brevity and searchability, instead of text data, you can do the following.
Create an article table, with one row per text article. It will have all the stuff you need to manage an article, including let's say the title, author, and an article_id.
Create a second table called something like article_text. It will have, possibly, four columns.
article_id foreign key to article table.
language a language code, if you happen to store translations of articles
ordinal a sequence number
textfrag varchar(255) part of the text.
Store each article's text in a series of article_text rows with ascending ordinal values. Each textfrag will hold up to 255 characters of your text. To retrieve an article's text you'll use a query like this:
SELECT textfrag
FROM article_text
WHERE language = 'en_US' /* or whatever */
AND article_id = 23456 /* or whatever */
ORDER BY ordinal
Then, you'll fetch a bunch of rows, concatenate the contents of the textfrag items, and there's your article with no effective length limit. If you create an index with all the fields in it, your retrieval time will be really fast because all retrievals will come from the index.
(article_id, language, ordinal, textfrag)
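The fragment-and-reassemble step can be sketched as follows (Python for a runnable illustration; the fragment() helper is my own, but the 255-character width mirrors the textfrag column above):

```python
def fragment(article, width=255):
    """Greedy split into pieces of at most `width` characters, cutting at the
    last space inside each window when possible, so joining is lossless."""
    frags = []
    while len(article) > width:
        cut = article.rfind(" ", 1, width + 1)  # last space within the window
        if cut == -1:
            cut = width                         # no space at all: hard cut
        frags.append(article[:cut])
        article = article[cut:]                 # the space leads the next frag
    frags.append(article)
    return frags

text = "the quick brown fox jumps over the lazy dog " * 20
rows = [(23456, "en_US", i, f) for i, f in enumerate(fragment(text))]

# Reassembly is just ORDER BY ordinal plus concatenation:
reassembled = "".join(f for (_, _, _, f) in sorted(rows, key=lambda r: r[2]))
```

Cutting just before a space means each fragment ends on a non-space character, which sidesteps the trailing-space behavior of MySQL string comparisons mentioned elsewhere on this page.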
If you do your best to split the text into fragments at word boundaries, and you use MyISAM, and you use FULLTEXT indexes, you'll get a very effective fulltext search system.

Handling numerous lengthy text fields

I'm designing an SQL Server database that needs to have quite a few (roughly 15) long varchar fields, most of which I want to allocate a length of at least 1024 or 2048. Since this will obviously go well beyond the page size of 8060, I realize the database will probably take a big performance hit any time this table is accessed.
I have also considered grouping these narratives into similar subjects and making 3 or 4 separate tables, or just creating a single table with a varchar(max) field and an int representing the narrative type:
create table Narrative
(
narrative varchar(max),
narrativeType int
)
Will there be a significant performance hit with the original design?
What types of best practices may be used when dealing with large text fields?
I don't claim my answer to be the final word on the subject. Perhaps others could add to this.
You will not be able to build a clustered or non-clustered index on your VARCHAR(MAX) column because of its size. This will make searches on your table quite slow.
However, you will be able to use Full Text Search which will benefit performance significantly.
Personally, I would not split the data up in to multiple tables if it can be avoided. The reason for this is that it makes querying homogeneous data troublesome.
In the case of Full Text Search/Indexing, you may be tempted to create multiple tables if your text is in multiple languages (FTS is language-dependent). I've worked around this by creating multiple Indexed Views on my table and building Full Text Indexes on those Indexed Views.
If you are expecting large volumes of data, you may want to consider Partitions
It would probably be best to read more about the topics and then further refine your question.
I've decided to go ahead and split these narratives into their own table with a 1:1 relationship with the primary table. I don't suspect I will ever be querying the values in these varchar fields, and there is no need to do any indexing on them. Further, they will be accessed much less frequently than any of the other fields in the original table, so pulling them into a separate table helps focus the database design and may even improve performance, since they will only need to be dealt with when absolutely necessary.

Does splitting TEXT fields into multiple tables provide performance optimization in multi-language application?

I'm building a project and I have a question about MySQL databases. The application is multi-language, and we are wondering if we will get better performance by splitting the different types of text fields (varchar, text, mediumtext) into different tables. Or is it just better to create one table with just a text field?
With this question and the multi-language constraint in mind, I am wondering if performance will improve if I split the different types of text fields into separate tables. When you just have one table with all the texts and the language, you can search it easily (give me the text with this value in an item column and that language). When you have different tables for different types of text, you will save space in your database, because you don't need a full text column for a varchar(200), but you will need multiple tables to connect the item, the type of text, and the languages you have for your text.
What do you think is best? Or are there some possibilities that I didn't consider?
I find it better for performance reasons to keep columns with blob and text data types in a separate table from the other data types, even if it breaks normalization.
Consider a person table with columns name varchar, address varchar, dob date and picture blob. A picture can easily be about 1MB, while the remaining columns may not take any more than 1KB. Imagine how many blocks of data need to be read, even if you only want to list the name and address of people living in a certain city, if you are keeping everything in the same table.
If you are not bound to MySQL, I would suggest you use some sort of text-search engine, such as Apache Lucene, if you want to do full-text searches. Because as far as I know, MySQL does not provide as much full-text search performance as Lucene can.
In case you are bound to MySQL, let me try to provide some information based on current definition of the problem (which is actually not much yet).
MySQL reference documentation states that:
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types.
So, if you run your queries using SELECT * on a table that contains a text field, you can either separate the queries that really need the text field from the ones that don't, or alternatively you can separate the text field from the table itself. Saving the text field in a secondary table will cost you the extra overhead of the duplicated key storage and also the indexes for that secondary table. However, depending on your database design, you may also be suffering overhead from unnecessary index updates, which can be eliminated by moving the text field to another table; but this is just a proposition, since we don't know your schema and data access patterns.

MySQL varchar(2000) vs text?

I need to store on average a paragraph of text, which would be about ~800 characters in the database. In some rare cases it may go up to 2000-2500~ characters. I've read the manual and I know there are many of these questions already, but I've read over 10+ questions on stackoverflow and I still find it a bit hard to figure out whether I should simply use text or something like varchar(2000). Half seem to say use varchar, while the other half say text. Some people say always use text if you have more than 255 characters (yes, this was after 5.0.3, which allowed varchar up to 65k). But then I thought to myself: if I were to use text every time the characters were over 255, then why did MySQL bother increasing the size at all if that was always the best option?
They both have a variable size in storage I've read, so would there be no difference in my situation? I was personally leaning towards varchar(2000) then I read that varchar stores the data inline while text doesn't. Does this mean that if I constantly select this column, storing the data as varchar would be better, and conversely if I rarely select this column then using text would be better? If that is true, I think I would now choose the text column as I won't be selecting this column many of the times I run a query on the table. If it matters, this table is also frequently joined to as well (but won't be selecting the column), would that also further the benefit of using text?
Are my assumptions correct that I should go with text in this case?
When a table has TEXT or BLOB columns, the table can't be stored in memory. This means every query (which doesn't hit the cache) has to access the file system, which is orders of magnitude slower than memory.
Therefore you should store this TEXT column in a separate table which is only accessed when you actually need it. That way the original table can be stored in memory and will be much faster.
Think of it as separating the data into one "memory table" and one "file table". The reason for doing this is to avoid accessing of the filesystem except when neccessary (i.e. only when you need the text).
You don't earn anything by storing the text in multiple tables. You still have to access the file system.
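A sketch of that split, using SQLite so the example is runnable as-is (table and column names are illustrative; the same layout applies to MySQL, where it keeps the TEXT column out of the frequently-scanned table):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE posts (               -- the hot, frequently-scanned columns
    post_id    INTEGER PRIMARY KEY,
    author     TEXT,
    created_at TEXT
);
CREATE TABLE post_text (           -- the bulky TEXT body, 1:1 with posts
    post_id INTEGER PRIMARY KEY REFERENCES posts(post_id),
    body    TEXT
);
""")
db.execute("INSERT INTO posts VALUES (1, 'alice', '2015-01-01')")
db.execute("INSERT INTO post_text VALUES (1, 'a very long post body ...')")

# Listing queries never touch the wide table:
authors = [r[0] for r in db.execute("SELECT author FROM posts")]

# The body is joined in only when a single post is displayed:
body = db.execute("""SELECT body FROM posts
                     JOIN post_text USING (post_id)
                     WHERE posts.post_id = ?""", (1,)).fetchone()[0]
```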
Sorry, what I meant was, for example, a forum script: in the posts table they might be storing 20 columns of post data, and they also store the actual post as a text field in the same table. So that post column should be separated out?
Yes.
It seems weird to have a table called post, but the actual post isn't stored there, maybe in another table called "actual_post", not sure lol.
You can try (posts, post_text) or (post_details, posts) or something like that.
I have a tags table that has just three fields, tag_id, tag, and description. So that description column should also be separated out? So I need a tags table and a tags_description table just to store 3 columns?
If the description is a TEXT column and you run queries against this table that doesn't need the description it would certainly be preferable.
I think you summarized it well. Another thing you could consider is just moving the "text" to another table and joining back to the master record. That way, whenever you use the master table, that extra text data isn't even taking up space in the master record; when you need it, you can join to the other table. This way you can also store it as a varchar, just in case you want to do something like "WHERE text LIKE ...".