Optimizing php/mysql translation lookup with huge database and hash indexes - mysql

I'm currently using a utf8 mysql database. My PHP script checks if a translation is already in the database and, if not, it does the translation and stores it in the database.
SELECT * FROM `translations` WHERE `input_text`=? AND `input_lang`=? AND `output_lang`=?;
(The other field is "output_text".) For a basic database, it would first compare, letter by letter, the input text with the "input_text" TEXT field. As long as the characters match it would keep comparing them. If they stop matching, it would go on to the next row.
I don't know how databases work at a low level but I would assume that for a basic database, it would search at least one character from every row in the database before it decides that the input text isn't in the database.
Ideally the input text would be converted to a hash code (e.g. using sha1) and each "input_text" would also be a hash. Then if the database is sorted properly it could rapidly find all of the rows that match the hash and then check the actual text. If there are no matching hashes then it would return no results even though each row wasn't manually checked.
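Something like this is what I have in mind (just a sketch; the extra "input_hash" column and the index name are made up):
ALTER TABLE `translations`
  ADD COLUMN `input_hash` CHAR(40) NOT NULL DEFAULT '',
  ADD INDEX `idx_hash_lookup` (`input_hash`, `input_lang`, `output_lang`);

-- backfill the hash for existing rows
UPDATE `translations` SET `input_hash` = SHA1(`input_text`);

-- lookup: the index narrows things down by hash, the last condition
-- re-checks the real text in case of a hash collision
SELECT * FROM `translations`
WHERE `input_hash` = SHA1(?)
  AND `input_lang` = ? AND `output_lang` = ?
  AND `input_text` = ?;
(The PHP that inserts new translations would also have to fill in "input_hash", of course.)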
Is there a type of mysql storage engine that can do something like this or is there some additional php that can optimize things? Should "input_text" be set to some kind of "index"? (PRIMARY/UNIQUE/INDEX/FULLTEXT)
Is there an alternative type of database that is compatible with php that is far superior to mysql?
edit:
This talks about B-Tree vs Hash indexes for MySQL:
http://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html
None of the limitations for hash indexes are a problem for me. It also says
They are used only for equality comparisons that use the = or <=> operators (but are very fast)
["very" was italicized by them]
NEW QUESTION:
How do I set up "input_text" TEXT to be a hash index? BTW multiple rows contain the same "input_text"... is that alright for a hash index?
http://dev.mysql.com/doc/refman/5.5/en/column-indexes.html
Says "The MEMORY storage engine uses HASH indexes by default" - does that mean I've just got to change the storage engine and set the column index to INDEX?

A normal INDEX clause should be enough (be sure to index all the fields you filter on; it'll be big on disk, but faster). FULLTEXT indexes are for word searches with MATCH ... AGAINST, not much use for an exact-equality lookup like this ;-)
Anyway, for this kind of lookup you should use a NoSQL store like Redis: it's blazingly fast, keeps its data in memory and also offers persistence through snapshots.
There is a PHP extension for it here: https://github.com/nicolasff/phpredis
You could use Redis keys of the form YOUR_PROJECT:INPUT_LANG:WORD:OUTPUT_LANG for better data management; just replace each part with your own values and you're good to go ;)

An index will speed up the lookups a lot.
By default, indexes in InnoDB and MyISAM are search trees (B-trees). There is a limit on the length of the index key, so you will have to index only the first ~700 bytes of the text.
CREATE INDEX txt_lookup ON translations (input_lang, output_lang, input_text(255));
This will create an index on input_lang, output_lang and the first 255 characters of input_text.
When you run your example query, MySQL will use the index to quickly find the rows with the appropriate languages and the same first 255 characters, and then it will do the slow string comparison against the full length of the column only on the small set of rows it got from the index.
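To check that the index is actually picked up, an EXPLAIN on the query should show the txt_lookup key being used (the values below are only examples):
EXPLAIN SELECT * FROM translations
WHERE input_text = 'hello world'
  AND input_lang = 'en'
  AND output_lang = 'de';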

Related

mysql partitioning by key internal hashing function

We have a table partitioned by key (binary(16))
Is there any way to calculate, outside of MySQL, which partition a record will go to?
What is the hash function (not the linear one)?
The reason is to sort the CSV files outside MySQL, insert them in parallel into the right partitions with LOAD DATA INFILE and then index in parallel too.
I can't find the function in the MySQL docs.
What's wrong with LINEAR? Are you trying to LOAD in parallel?
How many indexes do you have? If it is only that hash, sort the data, then load it into a non-partitioned InnoDB table with the PK already in place. Meanwhile, make sure every column uses the smallest possible datatype. How much RAM do you have?
If you are using MyISAM, consider MERGE. With that, you can load each partition-like table in a separate thread. When finished, construct the "merge" table that combines them.
What types of queries will you be using? Single row lookups by the BINARY(16)? Anything else might have big performance issues.
How much RAM? We need to tune either key_buffer_size or innodb_buffer_pool_size.
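For example (the numbers are placeholders; they depend entirely on how much RAM you have and which engine you end up on):
-- MyISAM index cache, e.g. 2 GB
SET GLOBAL key_buffer_size = 2147483648;
-- InnoDB buffer pool, e.g. 8 GB (only resizable at runtime from 5.7.5 on;
-- on older servers set it in my.cnf and restart)
SET GLOBAL innodb_buffer_pool_size = 8589934592;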
Be aware of the limitations. MyISAM defaults to a 7-byte data pointer and a 6-byte index pointer. 15TB would need only a 6-byte data pointer if the rows are DYNAMIC (byte pointer), or 5 bytes if they are FIXED (row number). So that could be 1 or 2 bytes to be saved. If anything is variable length, go with Dynamic; it would waste too much space (and probably not improve speed) to go fixed. I don't know if the index pointer can be shrunk in your case.
Are you on 5.7? MySQL 8.0 drops support for partitioned MyISAM tables. Meanwhile, MariaDB still handles them.
Will you first split the data by "partition"? Or send off INSERTs to different "partitions" one by one. (This choice adds some more wrinkles and possibly optimizations.)
Maybe...
Sort the incoming data by the binary version of MD5().
Split into chunks based on the first 4 bits. (Or do the split without sorting first) Be sure to run LOAD DATA for one 4-bit value in only one thread.
Have PARTITION BY RANGE with 16 partitions:
VALUES LESS THAN 0x1000000000000000
VALUES LESS THAN 0x2000000000000000
...
VALUES LESS THAN 0xF000000000000000
VALUES LESS THAN MAXVALUE
I don't know of a limit on the number of rows in a LOAD DATA, but I would worry about ACID locks having problems if you go over, say, 10K rows at a time.
This technique may even work for a non-partitioned table.
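A fleshed-out, untested sketch of that 16-partition layout (table and column names are invented; RANGE COLUMNS is used because plain RANGE needs an integer expression, and the boundaries are widened to cover the full 16 bytes of the BINARY(16) key):
CREATE TABLE hashed_rows (
  id      BINARY(16)   NOT NULL,   -- e.g. UNHEX(MD5(natural_key))
  payload VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (id) (
  PARTITION p00 VALUES LESS THAN (0x10000000000000000000000000000000),
  PARTITION p01 VALUES LESS THAN (0x20000000000000000000000000000000),
  PARTITION p02 VALUES LESS THAN (0x30000000000000000000000000000000),
  PARTITION p03 VALUES LESS THAN (0x40000000000000000000000000000000),
  PARTITION p04 VALUES LESS THAN (0x50000000000000000000000000000000),
  PARTITION p05 VALUES LESS THAN (0x60000000000000000000000000000000),
  PARTITION p06 VALUES LESS THAN (0x70000000000000000000000000000000),
  PARTITION p07 VALUES LESS THAN (0x80000000000000000000000000000000),
  PARTITION p08 VALUES LESS THAN (0x90000000000000000000000000000000),
  PARTITION p09 VALUES LESS THAN (0xA0000000000000000000000000000000),
  PARTITION p10 VALUES LESS THAN (0xB0000000000000000000000000000000),
  PARTITION p11 VALUES LESS THAN (0xC0000000000000000000000000000000),
  PARTITION p12 VALUES LESS THAN (0xD0000000000000000000000000000000),
  PARTITION p13 VALUES LESS THAN (0xE0000000000000000000000000000000),
  PARTITION p14 VALUES LESS THAN (0xF0000000000000000000000000000000),
  PARTITION p15 VALUES LESS THAN (MAXVALUE)
);
Each LOAD DATA thread then only ever writes into the one partition that matches its 4-bit chunk.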

mysql search optimization with hex function

I have an InnoDB table and need to search in a VARCHAR column, but searching is very, very slow. I cannot add a FULLTEXT index :( and MyISAM is not an option.
Some people advised me to use HEX like the code below.
Is that right? Does it perform better? I don't see any improvement in my application.
SELECT *
FROM order_line
WHERE HEX(description) LIKE '%6B616C6B7A616E64737465656E%'
Even if the table in question requires InnoDB and you are restricted to an outdated MySQL version for some reason, there are at least two possible improvements (and a third that builds on one of them):
Copy the primary key(s) and the searchable text column(s) into a MyISAM table with a fulltext index and run the searches against that. This will obviously require you to update the search table every time the original is updated.
You can create cache tables storing information about the executed searches and their hits for later reuse (one to track the cached searches and the other to cache the actual results). Again, this will require you to periodically update the cache or purge it completely.
Since you're using LIKE to search, you could use solution #2 with an additional hack: first check if there are cached searches for substrings of the current search query and limit the scope of your search to items that matched those previous searches only.
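A minimal sketch of the first suggestion, with invented names and assuming order_line has an integer primary key id (the copy has to be refreshed whenever order_line changes, e.g. from the application or periodically). The hex string in the question decodes to 'kalkzandsteen', which is used as the search term here:
CREATE TABLE order_line_search (
  order_line_id INT UNSIGNED NOT NULL PRIMARY KEY,
  description   VARCHAR(255) NOT NULL,
  FULLTEXT KEY ft_description (description)
) ENGINE=MyISAM;

-- initial fill / periodic refresh of the search copy
REPLACE INTO order_line_search (order_line_id, description)
SELECT id, description FROM order_line;

-- search against the copy, then join back to the real table
SELECT ol.*
FROM order_line_search s
JOIN order_line ol ON ol.id = s.order_line_id
WHERE MATCH(s.description) AGAINST ('kalkzandsteen');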

Fastest MySQL performance updating a single field in a single indexed row

I'm trying to get the fastest performance from an application that updates indexed rows repeatedly, replacing data in a varchar field. This varchar field is updated with data of equal size on each subsequent update (so a single row never grows). To my utter confusion I have found that the performance is directly related to the size of the field itself and is nowhere near the performance of replacing data in a filesystem file directly, i.e. a 1k field size is orders of magnitude faster than a 50k field size (within the row size limit). If the row exists in the database and the size is not changing, why would an update incur so much overhead?
i am using innodb and have disabled binary logging. i've ruled out communications overhead by using sql generated strings. tried using myisam and it was roughly 2-3x faster but still too slow. i understand the database has overhead but again i am simply replacing data in a single field with data that is of equal size. what is the db doing other than directly replacing bits?
rough performance #'s
81 updates/sec (60k string)
1111 updates/sec (1k string)
filesystem performance:
1428 updates/sec (60k string)
the updates i'm doing are insert...on duplicate key update. straight updates are roughly 50% faster but still ridiculously slow for what it is doing.
Can any experts out there enlighten me? Any way to improve these numbers?
I addressed a question in the DBA StackExchange concerning using CHAR vs VARCHAR. Please read all the answers, not just mine.
Keep something else in mind as well. Every InnoDB table is stored as a clustered index: on the primary key if you define one, or on the internal gen_clust_index row id if you don't. If you change anything in the clustered key, the clustered index gets a real workout being reorganized.
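As a sanity check, it is worth making sure the upsert only rewrites the varchar payload and never touches the key columns; a sketch with invented table and column names:
-- id is the primary key; only payload is rewritten when the row already exists
INSERT INTO blob_store (id, payload)
VALUES (?, ?)
ON DUPLICATE KEY UPDATE payload = VALUES(payload);
That way the clustered ordering is left alone and only the row contents change.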

Mysql - Index Performances

Are there any performance issues if you create an index with multiple columns, or should you create one index per column?
There's nothing inherently wrong with a multi-column index; it depends completely on how you're going to query the data. If you have an index on colA+colB, it will help for queries like WHERE colA='value' and WHERE colA='value' AND colB='value', but it's not going to help for queries like WHERE colB='value'.
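A quick illustration of that leftmost-prefix rule (table and index names are made up):
CREATE INDEX idx_a_b ON example_table (colA, colB);

-- these can use idx_a_b:
SELECT * FROM example_table WHERE colA = 'value';
SELECT * FROM example_table WHERE colA = 'value' AND colB = 'value';

-- this cannot, because colA is missing from the WHERE clause:
SELECT * FROM example_table WHERE colB = 'value';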
Advantages of MySQL Indexes
Generally speaking, MySQL indexing gives you three advantages:
Query optimization: Indexes make search queries much faster.
Uniqueness: Indexes like primary key index and unique index help to avoid duplicate row data.
Text searching: as of MySQL 3.23.23, full-text indexes give you the opportunity to optimize searching against even large amounts of text located in any field indexed as such.
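For example, a single table can use all three kinds of index at once (the names are invented; MyISAM is used because FULLTEXT on InnoDB only arrived in MySQL 5.6):
CREATE TABLE articles (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  slug VARCHAR(100) NOT NULL,
  body TEXT NOT NULL,
  PRIMARY KEY (id),           -- fast lookups by id
  UNIQUE KEY uk_slug (slug),  -- prevents duplicate slugs
  FULLTEXT KEY ft_body (body) -- word-based text search
) ENGINE=MyISAM;

SELECT * FROM articles WHERE MATCH(body) AGAINST ('indexing');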
Disadvantages of MySQL indexes
When an index is created on the column(s), MySQL also creates a separate file that is sorted, and contains only the field(s) you're interested in sorting on.
Firstly, indexes take up disk space. Usually the space usage isn't significant, but if you create an index on every column in every possible combination, the index file will grow much more quickly than the data file. If the table is very large, the index file could reach the operating system's maximum file size.
Secondly, indexes slow down writing queries such as INSERT, UPDATE and DELETE, because MySQL has to internally maintain the "pointers" to the inserted rows in the actual data file: every time a record is changed, the indexes must be updated as well. However, you may be able to write your queries in such a way that the performance degradation is not very noticeable.

MySQL: add a field to a large table

i have a table with about 200,000 records. i want to add a field to it:
ALTER TABLE `table` ADD `param_21` BOOL NOT NULL COMMENT 'about the field' AFTER `param_20`
but it seems to be a very heavy query and it takes a very long time, even on my quad-core AMD PC with 4GB of RAM.
i am running under windows/xampp and phpMyAdmin.
does mysql have to touch every record when adding a field?
or can i change the query so that it makes the change more quickly?
MySQL will, in almost all cases, rebuild the table during an ALTER**. This is because the row-based engines (i.e. all of them) HAVE to do this to retain the data in the right format for querying. It's also because there are many other changes you could make which would also require rebuilding the table (such as changing indexes, primary keys etc)
I don't know what engine you're using, but I will assume MyISAM. MyISAM copies the data file, making any necessary format changes - this is relatively quick and is not likely to take much longer than the IO hardware needs to get the old datafile in and the new one out to disc.
Rebuilding the indexes is really the killer. Depending on how you have it configured, MySQL will usually, for each index, put the indexed columns into a filesort buffer (which may be in memory but is typically on disc), sort that using its filesort() function (which does a quicksort by recursively copying the data between two files, if it's too big for memory) and then build the entire index based on the sorted data.
If it can't do the filesort trick, it will just behave as if you did an INSERT on every row, and populate the index blocks with each row's data in turn. This is painfully slow and results in far from optimal indexes.
You can tell which it's doing by using SHOW PROCESSLIST during the process. "Repairing by filesort" is good, "Repairing with keycache" is bad.
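If it keeps falling back to the keycache path, giving the repair more room to sort can help; a hedged example (the sizes are arbitrary, and these variables apply to MyISAM only):
-- allow up to ~10 GB of temporary sort file for index rebuilds
SET GLOBAL myisam_max_sort_file_size = 10737418240;
-- give each repair thread a bigger in-memory sort buffer (256 MB)
SET GLOBAL myisam_sort_buffer_size = 268435456;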
All of this will use AT MOST one core, but will sometimes be IO bound as well (especially copying the data file).
** There are some exceptions, such as dropping secondary indexes on innodb plugin tables.
You are adding a NOT NULL column, so the tuples need to be populated. That is why it will be slow...
This touches each of the 200,000 records, as each record needs to be updated with a new bool value which is not going to be null.
So yes, it's an expensive query... There is nothing you can do to make it faster.