The table DDL is as follows:
CREATE TABLE `video` (
`short_id` varchar(50) NOT NULL,
`prob` float DEFAULT NULL,
`star_id` varchar(50) NOT NULL,
`qipu_id` int(11) NOT NULL,
`cloud_url` varchar(100) DEFAULT NULL,
`is_identical` tinyint(1) DEFAULT NULL,
`quality` varchar(1) DEFAULT NULL,
PRIMARY KEY (`short_id`),
KEY `ix_video_short_id` (`short_id`),
KEY `sid` (`star_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The video table has 4.5 million rows.
I executed the same query twice in the mysql shell client, except that in one the star_id in the WHERE clause is compared to a quoted value and in the other it is not:
select * from video where star_id="215343405";
12914 rows in set (0.22 sec)
select * from video where star_id=215343405;
12914 rows in set (3.17 sec)
The one with quotation marks is over 10x faster than the other (I have created an index on star_id). I noticed that the slow one does not use the index. I just wonder how MySQL processes the query:
mysql> explain select * from video where star_id=215343405;
Thanks in advance!
This is answered in the manual:
For comparisons of a string column with a number, MySQL cannot use an
index on the column to look up the value quickly. If str_col is an
indexed string column, the index cannot be used when performing the
lookup in the following statement:
SELECT * FROM tbl_name WHERE str_col=1;
The reason for this is that there are many different strings that may convert to the value 1, such as '1', ' 1', or '1a'.
If you do not use quotation marks, MySQL treats the value as an int and must convert the column value for every record. That is why the query needs so much time.
The quotes define the expression as a string, whereas without the quotes it is evaluated as a number. This means that MySQL is forced to perform a type conversion on every row to do a proper comparison.
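You can see the difference directly with EXPLAIN; a quick sketch (plan details vary by version and data):
EXPLAIN SELECT * FROM video WHERE star_id = '215343405'; -- ref lookup on the sid index
EXPLAIN SELECT * FROM video WHERE star_id = 215343405;   -- index skipped: full table scan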
As the doc above says,
For comparisons of a string column with a number, MySQL cannot use an
index on the column to look up the value quickly. If str_col is an
indexed string column, the index cannot be used when performing the
lookup...
However, the inverse of that is not true: when an indexed numeric column is compared with a string, the index can be used, but it causes a poor execution plan (as illustrated by jkavalik's sqlfiddle) where "Using where" appears instead of the faster "Using index condition". The main difference between the two is that the former requires a row lookup, while the latter can evaluate the condition directly from the index.
You should definitely modify the column data type (assuming it truly is only meant to contain numbers) to the appropriate data type ASAP, but make sure that no queries are actually using single quotes, otherwise you'll be back where you started.
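If the column really does hold only digits, a minimal sketch of that fix (check the range of your actual values before settling on the integer type):
ALTER TABLE `video` MODIFY `star_id` BIGINT UNSIGNED NOT NULL;
After the change, both the quoted and the unquoted form should use the index, because a single string literal compared against a numeric column is converted once rather than per row.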
In one column of a database, we store the parameters that we used to hit an API, for example if the API call was sample.api/call?foo=1&bar=2&foobar=3 then the field will store foo=1&bar=2&foobar=3
It'd be easy enough to make a query to check 2 or 3 of those values if it was guaranteed that they'd be in that order, but that's not guaranteed. There's a possibility that call could have been made with the parameters as bar=2&foo=1&foobar=3 or any other combination.
Is there a way to make that query without saying:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
AND value LIKE "%bar=2%"
AND value LIKE "%foobar=3%"
I've also tried
SELECT * FROM table
WHERE "foo=1" IN (value)
but that didn't yield any results at all.
Edit: I should have previously mentioned that I won't necessarily be always looking for the same parameters.
But why?
The problem with doing simple LIKE statements is this:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
This will match the value asdffoo=1 and also foo=13. One hacky solution is to do this:
SELECT * FROM `api`
WHERE `params` REGEXP '(^|&)foo=1(&|$)'
AND `params` ...
Be aware, this does not use indexes. If you have a large dataset, this will need to do a row scan and be extremely slow!
Alternatively, if you can store your info in the database differently, you can utilize the FIND_IN_SET() function.
-- Store in DB as foo=1,bar=2,foobar=3
SELECT * FROM `api`
WHERE FIND_IN_SET('foo=1', `params`)
AND FIND_IN_SET('bar=2', `params`)
...
The only other solution would be to involve another table, something like the following, and then follow the solution on this page:
CREATE TABLE `endpoints` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`url` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `params` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`endpoint` int(6) unsigned NOT NULL,
`param` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
INDEX `idx_param` (`param`)
) DEFAULT CHARSET=utf8;
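With that schema, a lookup for several parameters becomes a join plus relational division. A sketch, assuming each params row stores one 'name=value' string:
SELECT e.`id`, e.`url`
FROM `endpoints` e
JOIN `params` p ON p.`endpoint` = e.`id`
WHERE p.`param` IN ('foo=1', 'bar=2', 'foobar=3')
GROUP BY e.`id`, e.`url`
HAVING COUNT(DISTINCT p.`param`) = 3;
The idx_param index makes each lookup cheap, and the HAVING clause keeps only endpoints that matched all three parameters.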
The final recommendation is to upgrade to MySQL 5.7 and use its JSON functionality: insert the data as a JSON object and search it as demonstrated in this question.
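A sketch of the JSON variant (MySQL 5.7+; the params_json column name is illustrative):
-- params_json holds e.g. '{"foo": "1", "bar": "2", "foobar": "3"}'
SELECT * FROM `api`
WHERE JSON_UNQUOTE(JSON_EXTRACT(`params_json`, '$.foo')) = '1'
AND JSON_UNQUOTE(JSON_EXTRACT(`params_json`, '$.bar')) = '2';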
This is completely impossible to do properly.
Problem 1: bar and foobar overlap.
If you search for bar=2, you will also match foobar=2. This is not what you want.
This can be fixed by prepending a leading & when storing the GET query string.
Problem 2: you don't know how many characters are in the value, so you must also have an end-of-string delimiter. That is the same & character, so you need it at both the beginning and the end.
You now see the issue.
Even if you sort the parameters before storing them, you still can't do LIKE "%&bar=2&%&foo=1&%&foobar=3&%", because the first match can overlap the second.
Even after the corrections, you still have to use three LIKEs to match the overlapping strings.
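With those corrections in place (parameters sorted and stored with a leading and trailing &, e.g. '&bar=2&foo=1&foobar=3&'), the three LIKEs would look like this (table name borrowed from the earlier example):
SELECT * FROM `api`
WHERE `params` LIKE '%&bar=2&%'
AND `params` LIKE '%&foo=1&%'
AND `params` LIKE '%&foobar=3&%'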
I have a very large table (500 million rows) with the following columns:
id - Bigint - Autoincrementing primary index.
date - Datetime - Approximately 1.5 million rows per date; data older than 1 year is deleted.
uid - VARCHAR(60) - A user ID
sessionNumber - INT
start - INT - epoch of start time.
end - INT - epoch of end time.
More columns not relevant for this query.
The combination of uid and sessionNumber forms a unique index. I also have an index on date.
Due to the sheer size, I'd like to partition the table.
Most of my accesses would be by date, so partitioning by date ranges seems intuitive, but as the date is not part of the unique index, this is not an option.
Option 1: RANGE PARTITION on Date and BEFORE INSERT TRIGGER
I don't really have a recurring problem with the uid and sessionNumber uniqueness being violated. The source data is consistent, but sessions that span two days may be inserted on two consecutive days, with midnight being the end time of the first and the start time of the second.
I'm trying to understand if I could remove the unique key and instead use a trigger that would:
check if there is a session with the same identifiers on the previous day, and if so,
update that session's end time, and
cancel the actual insert.
However, I am not sure if I can 1) trigger an update on the same table, or 2) prevent the actual insert.
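A sketch of how far the trigger idea can go (hypothetical table name sessions; uid and sessionNumber as above). MySQL lets a trigger read its own table, but it may not UPDATE it (that raises error 1442), and the only way to prevent the insert is to raise an error with SIGNAL, which aborts the whole statement rather than silently skipping the row:
DELIMITER //
CREATE TRIGGER `sessions_bi` BEFORE INSERT ON `sessions`
FOR EACH ROW
BEGIN
  -- Reading the subject table inside its own trigger is allowed;
  -- updating it here would fail with error 1442.
  IF EXISTS (SELECT 1 FROM `sessions`
             WHERE `uid` = NEW.`uid`
               AND `sessionNumber` = NEW.`sessionNumber`) THEN
    -- "Cancelling" is only possible by aborting the statement:
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'session continues an earlier row; merge in application code';
  END IF;
END//
DELIMITER ;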
Option 2: LINEAR HASH PARTITION on UID
My second option is to use a linear hash partition on the UID. However, I cannot find any example that takes a VARCHAR and converts it to an INTEGER for HASH partitioning, and I cannot find a permitted way to do that conversion either. For example
ALTER TABLE mytable
PARTITION BY HASH (CAST(md5(uid) AS UNSIGNED integer))
PARTITIONS 20
fails with an error saying the partition function is not allowed.
HASH partitioning must work with a 32-bit integer. But you can't convert an MD5 string to an integer simply with CAST().
Instead of MD5, CRC32() can take an arbitrary string and convert it to a 32-bit integer, but it is also not a valid function for partitioning:
mysql> alter table v partition by hash(crc32(uid));
ERROR 1564 (HY000): This partition function is not allowed
You could partition by the string using KEY Partitioning instead of HASH partitioning. KEY Partitioning accepts strings. It passes whatever input string through MySQL's built-in PASSWORD() function, which is basically related to SHA1.
However, this leads to another problem with your partitioning strategy:
mysql> alter table v partition by key(uid);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
Your table's primary key id does not include the column uid that you want to partition by. This is a restriction of MySQL's partitioning:
every unique key on the table must use every column in the table's partitioning expression.
Here's the table I'm testing with (it would have been a good idea for you to include this in your question):
CREATE TABLE `v` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`uid` varchar(60) NOT NULL,
`sessionNumber` int(11) NOT NULL,
`start` int(11) NOT NULL,
`end` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`,`sessionNumber`),
KEY `date` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
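Given that table, one escape hatch is to make every unique key include uid and then use KEY partitioning. A sketch (the partitioning clause must be its own ALTER, and note that PRIMARY KEY (id, uid) is a weaker guarantee than PRIMARY KEY (id), although AUTO_INCREMENT keeps id unique in practice):
ALTER TABLE v DROP PRIMARY KEY, ADD PRIMARY KEY (id, uid);
ALTER TABLE v PARTITION BY KEY (uid) PARTITIONS 20;
The existing UNIQUE KEY (uid, sessionNumber) already contains uid, so it satisfies the rule as-is.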
Before going any further, I have to wonder why you want to use partitioning anyway? "Sheer size" is not a reason to partition a table.
Partitioning, like any optimization, is done for the sake of specific queries you want to optimize for. Any optimization improves one query at the expense of other queries. Optimization has nothing to do with the table. The table is happy to sit there with 5 billion rows, and it doesn't care. Optimization is for the queries.
So you need to know which queries you want to optimize for. Then decide on a strategy. Partitioning might not be the best strategy for the set of queries you need to optimize!
I'll assume your 'uid' is a 128-bit UUID kind of value, which can be stored as a BINARY(16), because that is generally worth the trouble.
Next, stay away from the 'datetime' type, as it is stored like a packed string and doesn't hold any timezone information. Store date-time values either as pure numeric values (the number of seconds since the UNIX epoch), or let MySQL do that for you and use the timestamp(N) type.
Also don't call a column 'date', not just because that is a reserved word, but also because the value contains time details too.
Next, stay away from using anything other than latin1 as the CHARSET of (all) your tables. Only ever apply UTF-8 at the column level. This prevents unnecessarily byte-wide columns and indexes from creeping in over time. Adopt this habit and you'll happily look back on it after some years, promise.
This makes the table look like:
CREATE TABLE `v` (
`uuid` binary(16) NOT NULL,
`mysql_created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`visitor_uuid` BINARY(16) NOT NULL,
`sessionNumber` int NOT NULL,
`start` int NOT NULL,
`end` int NOT NULL,
PRIMARY KEY (`uuid`),
UNIQUE KEY (`visitor_uuid`,`sessionNumber`),
KEY (`mysql_created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY RANGE COLUMNS (`uuid`)
( PARTITION `p_0` VALUES LESS THAN (X'10')
, PARTITION `p_1` VALUES LESS THAN (X'20')
...
, PARTITION `p_9` VALUES LESS THAN (X'A0')
, PARTITION `p_A` VALUES LESS THAN (X'B0')
...
, PARTITION `p_F` VALUES LESS THAN (MAXVALUE)
);
To make the KEY (mysql_created_at) cover only the date part, you need a generated column, which can be added in place; an index on it is also cheap to add, so I'll leave that as homework.
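A sketch of that homework, assuming MySQL 5.7+ generated columns (the new column name is illustrative):
ALTER TABLE `v`
ADD COLUMN `created_date` DATE GENERATED ALWAYS AS (DATE(`mysql_created_at`)),
ADD KEY `ix_created_date` (`created_date`);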
I have two tables, identities and events.
identities has only two columns, identity1 and identity2 and both have a HASH INDEX.
events has ~50 columns and the column _p has a HASH INDEX.
CREATE TABLE `identities` (
`identity1` varchar(255) NOT NULL DEFAULT '',
`identity2` varchar(255) DEFAULT NULL,
UNIQUE KEY `uniques` (`identity1`,`identity2`),
KEY `index2` (`identity2`) USING HASH,
KEY `index1` (`identity1`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-
CREATE TABLE `events` (
`rowid` int(11) NOT NULL AUTO_INCREMENT,
`_p` varchar(255) NOT NULL,
`_t` int(10) NOT NULL,
`_n` varchar(255) DEFAULT '',
`returning` varchar(255) DEFAULT NULL,
`referrer` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
[...]
`fcc_already_sells_online` varchar(255) DEFAULT NULL,
UNIQUE KEY `_p` (`_p`,`_t`,`_n`),
KEY `rowid` (`rowid`),
KEY `IDX_P` (`_p`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=5231165 DEFAULT CHARSET=utf8;
So, why does this query:
SELECT SQL_NO_CACHE * FROM events WHERE _p IN (SELECT identity2 FROM identities WHERE identity1 = 'user#example.com') ORDER BY _t
takes ~40 seconds, while this one:
SELECT SQL_NO_CACHE * FROM events WHERE _p = 'user#example.com' OR _p = 'user2#example.com' OR _p = 'user3#example.com' OR _p = 'user4#example.com' ORDER BY _t
takes only 20ms when they are basically the same?
edit:
This inner query takes 3.3 ms:
SELECT SQL_NO_CACHE identity2 FROM identities WHERE identity1 = 'user#example.com'
The cause:
MySQL treats the conditions IN <static values list> and IN <sub-query> as different things. The documentation states that the second one is equivalent to an = ANY() query, which cannot use an index even if one exists; MySQL is simply not ingenious enough to do it. The first one, on the other hand, is treated as a simple range scan when the index is there, so MySQL can easily use it.
Possible ways to resolve:
As I see it, there are workarounds and you've already even mentioned one of them. So it may be:
Using JOIN. If there is a field to join by, this is most likely the best way to solve the problem (see the JOIN sketch after this list). Actually, since version 5.6 MySQL already tries to apply this optimization automatically where possible, but that does not work in complex cases or where there is no dependent sub-query (basically, when MySQL cannot "track" that reference). In your case the automatic conversion is not happening, so you would have to rewrite the query as an explicit JOIN yourself.
Querying the sub-resource from the application and forming a static list. Yes, although common practice is to avoid multiple queries because of connection/network/query-planning overhead, this is a case where it can actually work. Even with something like 200 ms of overhead on everything just recounted, it is still worth querying the sub-resource independently and substituting the static list into the next query from the application.
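For completeness, a sketch of the JOIN rewrite for this query (the UNIQUE key on (identity1, identity2) means the join cannot duplicate events rows):
SELECT SQL_NO_CACHE e.*
FROM `events` e
JOIN `identities` i ON e.`_p` = i.`identity2`
WHERE i.`identity1` = 'user#example.com'
ORDER BY e.`_t`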
This is already asked.
It's easier for the optimizer to manage the IN operator because it is only a construct that defines an OR of multiple = comparisons on the same operand. If you use the OR operator directly, the optimizer may not recognize that you're always applying = to the same operand.
Because your first query runs the inner query for each row in the events table.
In the second case the identities table is not used at all.
You should use a join instead.
I have an application which I am porting from Postgres to MySQL. Never mind why. The application uses Entity Framework 4 to query the database.
For various reasons, I have to use Guids in my C# code, save them to the database, and then query data based on the saved values of the Guids. I'm not very familiar with MySQL & how it handles what are essentially blobs.
First, there is no UUID type in MySQL. I have to save them as BINARY(16) values. OK, fine. I have created the columns as BINARY(16) and the data is written into the table. Good.
My problem is that I can't seem to match on the stored values of the Guids. I have written a unit test that writes data with a known Guid to the table and then tries to retrieve it. The data is going into the database fine, but when I try to read it back, I get no rows.
Here's a sample table schema:
CREATE TABLE `MyDatabase`.`MyTable` (
`id` INT NOT NULL AUTO_INCREMENT,
`guid` BINARY(16) NOT NULL,
`applicationId` INT(11) NOT NULL,
`name` VARCHAR(256) NOT NULL,
Description VARCHAR(256) NULL,
SessionTimeout INT NOT NULL,
DomainId INT NOT NULL,
PRIMARY KEY (`id`),
CONSTRAINT IX_aspnetx_groups UNIQUE ( `applicationId`, `id` )
);
Here's the Entity Framework code:
var g = ( from r in context.MyTable
where r.guid == id
select r ).Single();
Here's the query that's generated by Entity Framework:
SELECT
`Extent1`.`id`,
`Extent1`.`guid`,
`Extent1`.`applicationId`,
`Extent1`.`name`,
`Extent1`.`Description`,
`Extent1`.`SessionTimeout`,
`Extent1`.`DomainId`
FROM `aspnetx_groups` AS `Extent1`
WHERE `Extent1`.`guid` = '81d7de5e-4212-4ff8-b3d4-9f115261971d' LIMIT 2;
When this executes, it returns no rows, resulting in a "sequence contains no elements" exception being thrown in my C# code.
How do I make this work?
In the WHERE clause on your SELECT, it looks like you are comparing a BINARY(16) (guid on the left side of the equals) with a character string literal on the right side.
To perform a valid comparison, I would convert that character string literal into a BINARY(16), and then compare that to guid.
So, removing all the dash characters and then using the UNHEX function should do the trick:
WHERE `Extent1`.`guid` =
UNHEX(REPLACE('81d7de5e-4212-4ff8-b3d4-9f115261971d','-',''))
For performance, you'll want your query to reference the bare guid column on the left side (just like it does), and not wrap the guid column in any sort of functions. You'll want any conversion to be done on the literal side of the predicate, so that the conversion only has to be done once at the beginning of the query, rather than having to do a conversion for each row in the table.
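The same conversion works on the insert side too; a sketch against the sample schema above (the values are illustrative):
INSERT INTO `MyDatabase`.`MyTable`
( `guid`, `applicationId`, `name`, SessionTimeout, DomainId )
VALUES
( UNHEX(REPLACE('81d7de5e-4212-4ff8-b3d4-9f115261971d','-','')), 1, 'example', 30, 1 );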
It turns out that the problem I was having had to do with quirks of the Entity Framework connector in the MySQL Connector/Net package. There is a connection string setting that you need to add if you are using BINARY(16) as the data type of your Guids in the database:
Old Guids=True
Once you add that to the connection string, Entity Framework starts to emit code that really works when inserting, updating, or comparing Guids.
I've also come to the conclusion that UUID / Guid support in MySQL is only half-baked and needs some serious work to bring it up to a usable state.
I have a MySQL InnoDB table where I want to store long strings (the limit is 20k characters). Is there any way to create an index on this field?
You can put an MD5 of the field into another field and index that. Then, when you do a search, you match against both the full field (which is not indexed) and the MD5 field (which is indexed).
SELECT *
FROM my_table -- hypothetical table name; the original omitted it
WHERE large_field = "hello world hello world ..."
AND large_field_md5 = md5("hello world hello world ...")
large_field_md5 is indexed, so we go directly to the matching record. Once in a blue moon it might need to test two records if there is a duplicate MD5.
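The hash column has to be maintained on every write; a sketch with the same hypothetical names (plus an assumed id column):
INSERT INTO my_table (large_field, large_field_md5)
VALUES ('hello world hello world ...', MD5('hello world hello world ...'));

UPDATE my_table
SET large_field = 'new long text ...',
    large_field_md5 = MD5('new long text ...')
WHERE id = 42;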
You will need to limit the length of the index, otherwise you are likely to get error 1071 ("Specified key was too long"). The MySQL manual entry on CREATE INDEX describes this:
Indexes can be created that use only the leading part of column values, using col_name(length) syntax to specify an index prefix length:
Prefixes can be specified for CHAR, VARCHAR, BINARY, and VARBINARY columns.
BLOB and TEXT columns also can be indexed, but a prefix length must be given.
Prefix lengths are given in characters for nonbinary string types and in bytes for binary string types. That is, index entries consist of the first length characters of each column value for CHAR, VARCHAR, and TEXT columns, and the first length bytes of each column value for BINARY, VARBINARY, and BLOB columns.
It also adds this:
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
Here is an example of how you could do that. As @Gidon Wise mentioned in his answer, you can index an additional field; in this case it will be query_md5.
CREATE TABLE `searches` (
`id` int(10) UNSIGNED NOT NULL,
`query` varchar(10000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`query_md5` varchar(32) COLLATE utf8mb4_unicode_ci DEFAULT NULL
) ENGINE=InnoDB;
ALTER TABLE `searches`
ADD PRIMARY KEY (`id`),
ADD KEY `searches_query_md5_index` (`query_md5`);
To make sure you are not hit by identical MD5 hashes, you double-check by also adding AND `query` = '...' to the WHERE clause.
The query will look like this:
select * from `searches` where `query_md5` = "b6d31dc40a78c646af40b82af6166676" and `query` = 'long string ...'
b6d31dc40a78c646af40b82af6166676 is the MD5 hash of the 'long string ...' string. This, I think, can improve query performance, and you can be sure that you will get the right results.
Use the SHA2 function with a specific length. Add a generated hash column and a unique key to your table (your_table is a placeholder, like your_text):
ALTER TABLE `your_table`
ADD COLUMN `hash` varbinary(32) GENERATED ALWAYS AS (unhex(sha2(`your_text`,256))),
ADD UNIQUE KEY `ix_hash` (`hash`);
Read about the SHA2 function