Django MySQL - Setting an index on a TextField

I have a database of articles that I want to search through. I had been using the normal Django ORM to search, which was getting way too slow, and then I learned a little about indexes in Django. I'm using MySQL, and I now know that with MySQL I cannot put an index on a TextField, as described in this stack question, which matches the problem I was facing. However, in my case I can't change the field to a CharField.
I was reading through the MySQL docs, which state:
MySQL cannot index LONGTEXT columns specified without a prefix length on the key part, and prefix lengths are not permitted in functional key parts.
Since TextField in Django maps to LONGTEXT in MySQL, I was of the understanding that this was the cause. I then came across the Django-MySQL package here and thought that if I could change the LONGTEXT to a MEDIUMTEXT using this package, the issue might be resolved. So in my updated model I did this:
class MyModel(Model):
    ........
    document = SizedTextField(size_class=3)
However, I still see the same error when running python manage.py makemigrations:
django.db.utils.OperationalError: (1170, "BLOB/TEXT column 'document' used in key specification without a key length")
How can I go about resolving this?

The goal is to return all the articles that contain a given word passed by the client, i.e. something like SELECT * FROM articles WHERE text CONTAINS searchword.
Add
FULLTEXT(text)
and use
WHERE MATCH(text) AGAINST("searchword")
or perhaps
WHERE MATCH(text) AGAINST("+searchword" IN BOOLEAN MODE)
It will run very fast. There are caveats: short words and "stop" words (like "the") are ignored.
(If Django cannot facilitate that, then you have to do it with "raw SQL".)
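Putting the pieces together, a minimal sketch of the index and the two query forms (the articles table and text column are assumed from the example above):

ALTER TABLE articles ADD FULLTEXT INDEX ft_text (text);

SELECT * FROM articles
WHERE MATCH(text) AGAINST('searchword');

-- or, requiring the word and enabling boolean operators:
SELECT * FROM articles
WHERE MATCH(text) AGAINST('+searchword' IN BOOLEAN MODE);

Note that InnoDB has supported FULLTEXT indexes since MySQL 5.6; on older versions this requires a MyISAM table.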

All of these related types, TEXT, MEDIUMTEXT, and LONGTEXT, are too large to be indexed without specifying a prefix. An index prefix means that only the first N characters of the string are included in the index. Like this:
create table mytable (
  t text,
  index myidx (t(200))
);
The prefix length in this example is 200 characters, so only the first 200 characters are included in the index. Usually this is enough to help performance, unless you have a large number of strings that are identical in their first 200 characters.
The longest prefix that MySQL supports depends on the storage engine and the row format. Old versions of MySQL support index prefixes up to 767 bytes, which means fewer characters if you use a multi-byte character set like utf8 or utf8mb4. Recent versions of MySQL default to a more modern row format, which supports prefixes up to 3072 bytes, again reduced by 3 or 4 bytes per character for those character sets.
I'm not a regular Django user, so I skimmed the documentation about defining indexes on model classes. From that quick read, I don't see an option to declare a prefix length for an index on a long string column.
I think your options are one of the following:
Change the column to a shorter string column that can be indexed
Create the index using the MySQL client, not using Django migrations (see the sketch below)
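If you take the raw-SQL route, a hedged sketch of adding the prefix index by hand (myapp_mymodel is a guess at Django's default table name; adjust to yours):

ALTER TABLE myapp_mymodel
  ADD INDEX document_prefix_idx (document(191));
-- 191 utf8mb4 characters = 764 bytes, which fits even the old
-- 767-byte prefix limit.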

Related

Why/how am I able to exceed the configured length of a varchar in MySQL? [my mistake, I can't]

[Edit: my mistake, read the wrong column for the type, the one I'm inserting into is LONGTEXT, not varchar(190)]
I am working with an application that stores the majority of its information in a MySQL database (MySQL server 5.7). One particular value I'm looking at has a 255 character limit enforced by the GUI but, when I looked at that column in the table where it's stored, it's set to varchar(190). I confirmed that I can enter 255-character values in the GUI and that they are not truncated, as I expected.
How can a varchar(190) column store >190 characters? Are there any consequences to doing it this way?
I read 11.4.1 The CHAR and VARCHAR Types and it states that anything over the limit should be truncated.
The answer is that I can't. I misread the column types: that column is varchar(255) when the application builds its schema in PostgreSQL, but it's longtext* in MySQL, which explains why I was able to get past 190 characters. I tried inserting my 230-character test string into a real varchar(190) column and it threw an error, as expected.
Need more coffee.
*not sure why longtext, when the application GUI limits input to 255 characters, but I'll need to ask the people who built it.
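For reference, a quick sketch of that test (assuming strict SQL mode, the default in MySQL 5.7):

CREATE TABLE t (s VARCHAR(190));
INSERT INTO t VALUES (REPEAT('x', 230));
-- ERROR 1406 (22001): Data too long for column 's' at row 1
-- With strict mode disabled, the value would instead be truncated
-- to 190 characters with a warning.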

SQL UNIQUE key not treating international characters as distinct

I have just exported a MySQL database (in LATIN1) and converted to UTF-8 in the process, and imported on to a newer system.
It seemed to go OK, but I did hit a few instances where a UNIQUE key threw an error because two entries differed only in an international character, e.g.
"åle" was not considered distinct from "ale"
I did not find anything in the documentation on UNIQUE keys that mentioned character sets or encodings at all.
How can I configure the database to ensure that it considers these letters unique?
This depends on the "COLLATION" setting for the column in question. You can see the current collation with "SHOW FULL COLUMNS IN yourtablename".
For example, "utf8_general_ci" considers "ale", "åle" and "ALE" the same. Depending on your use case, something like "utf8_swedish_ci" or "utf8_bin" might be more appropriate. Note that changing collation will also change what ".. where yourcolumn=value" matches, and the ordering of "...order by yourcolumn".
You can change the collation with "ALTER TABLE" (for a single column), or database-wide. More information in the manual: http://dev.mysql.com/doc/refman/5.5/en/globalization.html
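For example, a hedged sketch of switching a single column to a binary collation that treats these letters as distinct (mytable and name are placeholders):

SHOW FULL COLUMNS IN mytable;

ALTER TABLE mytable
  MODIFY name VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_bin;
-- Under utf8_bin, 'ale' and 'åle' compare as different strings,
-- so the UNIQUE key will accept both.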

Creating key values mysql

I have worked with databases before where the key attributes for entities looked like
83NG92R8B202NG
I am trying to build a database myself and was wondering if there is a SQL command that automatically assigns a key to an added tuple, or whether I have to create some sort of random-key algorithm myself.
Why not use a UUID? More information in the MySQL docs:
http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
If that doesn't fit your needs, take a look at this topic:
Generating a random & unique 8 character string using MySQL
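A minimal sketch of both approaches (table and column names are just for illustration):

CREATE TABLE items (
  id   INT AUTO_INCREMENT PRIMARY KEY,  -- MySQL assigns 1, 2, 3, ... automatically
  uuid CHAR(36) NOT NULL,               -- filled per row by UUID()
  name VARCHAR(100)
);

INSERT INTO items (uuid, name) VALUES (UUID(), 'first row');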

MySQL Uncompress issue

I am facing an issue with MySQL's UNCOMPRESS function.
I have a table named user whose user_details column stores COMPRESSed values, so before searching those values I have to UNCOMPRESS them.
But the issue is that after I UNCOMPRESS, the search becomes case-sensitive.
For example, the SQL below only finds rows that contain the upper-case word TESTING and ignores the lower-case testing:
SELECT * FROM user WHERE UNCOMPRESS(user_details) LIKE '%TESTING%';
I want case-insensitive search.
But the issue is that after I UNCOMPRESS, the search becomes case-sensitive.
This is because COMPRESS() "Compresses a string and returns the result as a binary string." (emphasis mine)
When you perform a LIKE operation on a binary string, a binary comparison will be performed (which is case-sensitive).
You may be able to circumvent this by putting a CAST() around the UNCOMPRESS() call.
But you probably shouldn't be doing this in the first place. It's an extremely inefficient way to search through huge amounts of data. MySQL will have to uncompress every row for this operation, and has no chance of using any of its internal optimization methods like indexes.
Don't use COMPRESS() in the first place.
Try this:
SELECT * FROM `user` WHERE LOWER(UNCOMPRESS(user_details)) LIKE '%testing%'
But as Pekka rightly pointed out, it is very inefficient. If you are using the MyISAM engine, another alternative is myisampack, which compresses the whole table and leaves it still queryable.
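For completeness, a hedged sketch of the CAST() workaround mentioned above: converting the binary result of UNCOMPRESS() back to a nonbinary string should make LIKE use a case-insensitive collation again.

SELECT * FROM `user`
WHERE CAST(UNCOMPRESS(user_details) AS CHAR) LIKE '%testing%';
-- CAST(... AS CHAR) yields a nonbinary string, so the comparison
-- follows the (typically case-insensitive) default collation.
-- The efficiency caveat above still applies: every row is uncompressed.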

What will happen to existing data if I change the collation of a column in MySQL?

I am running a production application with a MySQL database server. I forgot to set a column's collation from latin1 to utf8_unicode, which resulted in strange data when multi-language text was saved to the column.
My question is, what will happen with my existing data if I change my collation to utf8_unicode now? Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?
I will change with phpMyAdmin web client.
The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.
Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.
A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....
A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two german sort orders latin1_german1_ci or latin1_german2_ci. Depending on your choice, Umlauts such as ö will either sort as oe or as o.
When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then move the old table into a hidden location, move the hidden table into place, and drop the old data, freeing up disk space. For some time in between, you will need twice the storage for that table.
When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.
When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.
MySQL tries to preserve data while doing this: as long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if data truncation occurs, and data which cannot be represented in the target character set will be replaced by ?.
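In SQL terms, the two operations discussed above look like this (t and str are placeholders; the first statement rewrites the whole table, the second only redefines how one column sorts and compares):

-- change the character set (and collation) of every string column, rewriting the data:
ALTER TABLE t CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

-- change only the collation of one column, keeping its character set:
ALTER TABLE t MODIFY str TEXT CHARACTER SET latin1 COLLATE latin1_german2_ci;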
Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin, I inserted some non-Latin characters:
INSERT INTO Test VALUES ('英國華僑');
I select them and get rubbish (as expected).
SELECT text from Test;
gives
text
????
I then changed the collation of the column to utf8_unicode and re-ran the SELECT, and it shows the same result:
text
????
This is what I would expect: it will keep the data, and the data will remain rubbish, because when the data was inserted the column lost the extra character information and just stored a ? for each non-Latin character, and there is no way for the ???? to become 英國華僑 again.
Your data will stay in place but it won't be fixed.
Valid data will be properly converted:
When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.
http://dev.mysql.com/doc/refman/5.5/en/alter-table.html
... and more specifically:
To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion to occur, one of the following conditions must apply: [...] If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html
So your problem is invalid data, e.g., data encoded in a different character set. I've tried the tip suggested by the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte chars had been inserted as 0x3F (i.e., the ? symbol in Latin1). My MySQL stack had been smart enough to detect that input data was not Latin1 and convert it into something "compatible". And once data is gone, you can't get it back.
To sum up:
Use HEX() to find out if you still have your data.
Make your tests in a copy of your table.
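A quick way to get such a copy for testing (a sketch; t is a placeholder):

CREATE TABLE t_copy LIKE t;          -- same structure, indexes, and charset
INSERT INTO t_copy SELECT * FROM t;  -- same data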
My question is, what will happen with my existing data if I change my
collation to utf8_unicode now?
Answer: If you change to utf8_unicode_ci, nothing will happen to your existing data (which is already corrupt and will remain corrupt until you fix it).
Will it destroy or corrupt the existing data or will the data remain,
but the new data will be saved as utf8 as it should?
Answer: After you change to utf8_unicode_ci, existing data will not be destroyed. It will remain the same as before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.
I will change with phpMyAdmin web client.
Answer: Sure, you can change the collation with phpMyAdmin by going to Operations > Table options.
CAUTION! Some problems are solved via
ALTER TABLE ... CONVERT TO ...
Some are solved via a 2-step process
ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...
If you do the wrong one, you will have a worse mess!
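For concreteness, a hedged sketch of that 2-step process (t, col, and the sizes are placeholders; restate your real column definition):

ALTER TABLE t MODIFY col VARBINARY(255);                      -- step 1: drop the charset interpretation
ALTER TABLE t MODIFY col VARCHAR(255) CHARACTER SET utf8mb4;  -- step 2: reinterpret the bytes as utf8mb4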
Do SELECT HEX(col), col ... to see what you really have.
Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases