I have two tables, identities and events.
identities has only two columns, identity1 and identity2, and both have a HASH index.
events has ~50 columns, and its column _p has a HASH index.
CREATE TABLE `identities` (
`identity1` varchar(255) NOT NULL DEFAULT '',
`identity2` varchar(255) DEFAULT NULL,
UNIQUE KEY `uniques` (`identity1`,`identity2`),
KEY `index2` (`identity2`) USING HASH,
KEY `index1` (`identity1`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `events` (
`rowid` int(11) NOT NULL AUTO_INCREMENT,
`_p` varchar(255) NOT NULL,
`_t` int(10) NOT NULL,
`_n` varchar(255) DEFAULT '',
`returning` varchar(255) DEFAULT NULL,
`referrer` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
[...]
`fcc_already_sells_online` varchar(255) DEFAULT NULL,
UNIQUE KEY `_p` (`_p`,`_t`,`_n`),
KEY `rowid` (`rowid`),
KEY `IDX_P` (`_p`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=5231165 DEFAULT CHARSET=utf8;
So, why does this query:
SELECT SQL_NO_CACHE * FROM events WHERE _p IN (SELECT identity2 FROM identities WHERE identity1 = 'user#example.com') ORDER BY _t
take ~40 seconds, while this one:
SELECT SQL_NO_CACHE * FROM events WHERE _p = 'user#example.com' OR _p = 'user2#example.com' OR _p = 'user3#example.com' OR _p = 'user4#example.com' ORDER BY _t
takes only 20ms when they are basically the same?
edit:
This inner query takes 3.3 ms:
SELECT SQL_NO_CACHE identity2 FROM identities WHERE identity1 = 'user#example.com'
The cause:
MySQL treats an IN (<static value list>) condition and an IN (<sub-query>) condition as different things. The documentation states that the latter is equivalent to an = ANY (subquery) comparison, which cannot use an index even if that index exists; MySQL is simply not clever enough to do it. The former, on the contrary, is treated as a simple range scan when the index is there, so MySQL can use it easily.
Possible ways to resolve:
As I see it, there are workarounds, and you've already mentioned one of them yourself. The options are:
Using JOIN. If there is a field to join on, this is most likely the best way to solve the problem. In fact, since version 5.6 MySQL already tries to apply this optimization itself where possible, but that does not work in complex cases or when there is no dependent sub-query (basically, when MySQL cannot "track" the reference). Looking at your case, this is exactly what is not happening for your sub-query.
Querying the sub-resource in the application and forming the static list. Yes, despite the common practice of avoiding multiple queries because of connection/network/query-planning overhead, this is a case where it can actually work. Even with something like 200 ms of overhead from everything recounted above, it is still worth querying the sub-resource independently and then substituting the static list into the next query from the application, as sketched below.
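A minimal sketch of that two-step approach, using the identifiers from the question (the application-side list assembly is omitted):
-- 1) Fetch the identities first (3.3 ms according to the question):
SELECT identity2 FROM identities WHERE identity1 = 'user#example.com';
-- 2) Feed the returned values back in as a static list, which MySQL
--    resolves with a simple range scan over IDX_P:
SELECT SQL_NO_CACHE * FROM events
WHERE _p IN ('user2#example.com', 'user3#example.com', 'user4#example.com')
ORDER BY _t;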
This has already been asked.
It's easier for the optimizer to manage the IN operator, because it is just a construct that expresses the OR of multiple conditions with the = operator on the same column. If you use the OR operator, the optimizer may not recognize that you're always using the = operator on the same column.
Because your query runs the inner query once for each row in the events table.
In the second case the identities table is not used at all.
You should use a join instead.
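For example, a sketch of that join rewrite against the tables from the question (equivalent here because the uniques key makes identity2 distinct per identity1):
SELECT SQL_NO_CACHE e.*
FROM identities i
JOIN events e ON e._p = i.identity2
WHERE i.identity1 = 'user#example.com'
ORDER BY e._t;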
Related
I am joining with a table and noticed that if the field I join on has a varchar size that's too large, then MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the explains and table definition below. This is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces EXPLAIN output (screenshots not reproduced here) showing that the derived table is not using the index on "lotNumber". In that example, the "lotNumber" field was a varchar(255). If I change it to a smaller size, e.g. varchar(45), the EXPLAIN instead shows the index being used, and the query runs orders of magnitude faster (2 seconds instead of 100). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also include that other column in the index."
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
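A hedged sketch of where those comments lead. The 191 is an assumption tied to the old 767-byte key limit (191 × 4 bytes = 764 for utf8mb4), and the DDL assumes id is not referenced elsewhere:
ALTER TABLE LotRecordsRaw
  DROP COLUMN id,
  DROP KEY idx_Lot_lotNumber,
  DROP KEY lotNumber_UNIQUE,
  MODIFY lotNumber varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  ADD PRIMARY KEY (lotNumber);
-- With one row per lotNumber, the derived table, MAX, and GROUP BY
-- in the original query can simply be dropped.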
I understand that a MySQL query involving a self-join might lead to a slow query and/or a CPU spike, but I have been struggling to come up with ways to improve it.
CREATE TABLE `tool` (
`tool_id` char(32) NOT NULL,
`provider` varchar(36) NOT NULL,
PRIMARY KEY (`tool_id`)
)
CREATE TABLE `edata` (
`e_data_id` char(32) NOT NULL,
`tool_id` char(32) DEFAULT NULL,
`ref_e_data_id` char(32) DEFAULT NULL,
PRIMARY KEY (`e_data_id`),
KEY `e_ref_e_data__06a0c1a7_fk` (`ref_e_data_id`),
KEY `edata_tool_id_61d6bb9b` (`tool_id`),
CONSTRAINT `e_tool_id_61d6bb9b` FOREIGN KEY (`tool_id`) REFERENCES `tool` (`tool_id`)
)
Here is the query in question (a fragment of a larger query):
mutdata
LEFT JOIN (
    SELECT e1.edata_id AS m_id, a1.provider AS m_cp
    FROM edata e1
    INNER JOIN tool a1 ON e1.tool_id = a1.tool_id
    WHERE a1.deleted = 0
) AS mapping
    ON mutdata.ref_e_data_id = mapping.m_id OR mutdata.e_data_id = map.m_id
In short, the subquery is first constructed as a lookup table, like a dictionary or map; then mutdata uses that lookup table to determine the corresponding provider (this query is part of an even larger query). Is there a way to optimize this part?
These indexes may help:
mutdata: INDEX(ref_e_data_id, e_data_id)
map: INDEX(m_id)
e1: INDEX(tool_id, edata_id)
a1: INDEX(deleted, tool_id, provider)
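Expressed as DDL, these would look roughly like the following. Note the assumptions: mutdata's columns are inferred from the query fragment, deleted appears in the WHERE clause but not in the CREATE TABLE shown, and map/mapping is a derived table, so nothing can be indexed on it directly:
ALTER TABLE mutdata ADD INDEX idx_mutdata_refs (ref_e_data_id, e_data_id);
ALTER TABLE edata ADD INDEX idx_edata_tool (tool_id, e_data_id);
ALTER TABLE tool ADD INDEX idx_tool_deleted (deleted, tool_id, provider);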
Try not to use the construct JOIN ( SELECT ... ); instead try to bump that up a level.
Do you really need LEFT in either place?
OR is terrible for performance. Sometimes it is practical to use two SELECTs connected by UNION DISTINCT as a workaround; that way, each SELECT may be able to take advantage of a different index. A sketch follows.
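A hedged sketch of that UNION rewrite for the fragment above; it assumes the LEFT join is not actually needed (per the previous point) and uses the e_data_id spelling from the CREATE TABLE:
SELECT m.*, mapping.m_cp
FROM mutdata m
JOIN (
    SELECT e1.e_data_id AS m_id, a1.provider AS m_cp
    FROM edata e1
    INNER JOIN tool a1 ON e1.tool_id = a1.tool_id
    WHERE a1.deleted = 0
) AS mapping ON m.ref_e_data_id = mapping.m_id
UNION DISTINCT
SELECT m.*, mapping.m_cp
FROM mutdata m
JOIN (
    SELECT e1.e_data_id AS m_id, a1.provider AS m_cp
    FROM edata e1
    INNER JOIN tool a1 ON e1.tool_id = a1.tool_id
    WHERE a1.deleted = 0
) AS mapping ON m.e_data_id = mapping.m_id;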
Where is map in the query?
In one column of a database, we store the parameters that we used to hit an API. For example, if the API call was sample.api/call?foo=1&bar=2&foobar=3, then the field will store foo=1&bar=2&foobar=3.
It'd be easy enough to make a query to check 2 or 3 of those values if it was guaranteed that they'd be in that order, but that's not guaranteed. There's a possibility that call could have been made with the parameters as bar=2&foo=1&foobar=3 or any other combination.
Is there a way to make that query without saying:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
AND value LIKE "%bar=2%"
AND value LIKE "%foobar=3%"
I've also tried
SELECT * FROM table
WHERE "foo=1" IN (value)
but that didn't yield any results at all.
Edit: I should have previously mentioned that I won't necessarily be always looking for the same parameters.
But why?
The problem with doing simple LIKE statements is this:
SELECT * FROM table
WHERE value LIKE "%foo=1%"
This will match the value asdffoo=1 and also foo=13. One hacky solution is to do this:
SELECT * FROM `api`
WHERE `params` REGEXP '(^|&)foo=1(&|$)'
AND `params` ...
Be aware, this does not use indexes. If you have a large dataset, this will need to do a row scan and be extremely slow!
Alternatively, if you can store your info in the database differently, you can utilize the FIND_IN_SET() function, which takes the needle first and the comma-separated list second.
-- Store in DB as foo=1,bar=2,foobar=3
SELECT * FROM `api`
WHERE FIND_IN_SET('foo=1', `params`)
AND FIND_IN_SET('bar=2', `params`)
...
The only other solution would be to involve another table, something like the following, and follow the solution on this page (a sample lookup query comes after the DDL):
CREATE TABLE `endpoints` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`url` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `params` (
`id` int(6) unsigned NOT NULL AUTO_INCREMENT,
`endpoint` int(6) NOT NULL,
`param` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
INDEX `idx_param` (`param`)
) DEFAULT CHARSET=utf8;
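A hedged sample of how the lookup could then work against those tables (relational division: require all three parameters to be present; table and column names come from the DDL above):
SELECT e.*
FROM endpoints e
JOIN params p ON p.endpoint = e.id
WHERE p.param IN ('foo=1', 'bar=2', 'foobar=3')
GROUP BY e.id
HAVING COUNT(DISTINCT p.param) = 3;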
The last recommendation is to upgrade to MySQL 5.7 and utilize its JSON functionality: insert the data as a JSON object, and search it as demonstrated in this question.
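A minimal sketch of that JSON approach, assuming MySQL 5.7+ and that the column is repopulated as a JSON object (table and column names reuse the earlier examples):
-- Stored as: {"foo": "1", "bar": "2", "foobar": "3"}
SELECT * FROM `api`
WHERE JSON_UNQUOTE(JSON_EXTRACT(`params`, '$.foo')) = '1'
  AND JSON_UNQUOTE(JSON_EXTRACT(`params`, '$.bar')) = '2'
  AND JSON_UNQUOTE(JSON_EXTRACT(`params`, '$.foobar')) = '3';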
This is completely impossible to do properly.
Problem 1. bar and foobar overlap
so if you search for bar=2, you will also match foobar=2. This is not what you want.
This can be fixed by prepending a leading & when storing the GET query string.
Problem 2. You don't know how many characters are in the value, so you must also have an end-of-string delimiter, which is the same & character; you need it at both the beginning and the end.
You now see the issue.
Even if you sort the parameters before storing them in the database, you still can't do LIKE "%&bar=2&%&foo=1&%&foobar=3&%", because the first match can overlap the second.
Even after the corrections, you still have to use three LIKEs to match the overlapping strings, as the sketch below shows.
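A hedged sketch of that delimiter workaround, assuming values are stored with a leading and trailing &, e.g. &foo=1&bar=2&foobar=3&:
SELECT * FROM `api`
WHERE `params` LIKE '%&foo=1&%'
  AND `params` LIKE '%&bar=2&%'
  AND `params` LIKE '%&foobar=3&%';
-- Like the REGEXP variant above, leading-wildcard patterns cannot use
-- an index, so this still scans every row.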
I am currently facing an issue with designing a database table and updating/inserting values into it.
The table is used to collect and aggregate statistics that are identified by:
the source
the user
the statistic
an optional material (e.g. item type)
an optional entity (e.g. animal)
My main issue is that my proposed primary key would be too large, because of the VARCHARs that are used to identify a statistic.
My current table is created like this:
CREATE TABLE `Statistics` (
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL)
In particular, the server_id is configurable, the player_id is a UUID, statistic is the representation of an enumeration that may change, material and entity likewise. The value is then aggregated using SUM() to calculate the overall statistic.
So far it works, but I have to use DELETE and INSERT statements whenever I want to update a value, because I have no primary key and I can't figure out how to create one within the constraints of MySQL.
My main question is: How can I efficiently update values in this table and insert them when they are not currently present without resorting to deleting all the rows and inserting new ones?
The main issue seems to be the restriction MySQL puts on the primary key. I don't think adding an id column would solve this.
Simply add an auto-incremented id:
CREATE TABLE `Statistics` (
statistics_id int auto_increment primary key,
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL
);
Voila! A primary key. But you probably want an index. One that comes to mind:
create index idx_statistics_server_player_statistic on statistics(server_id, player_id, statistic);
Depending on what your code looks like, you might want additional or different keys in the index, or more than one index.
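If the goal is INSERT ... ON DUPLICATE KEY UPDATE, it is a unique index over the identifying columns that makes it fire. A hedged sketch follows; the prefix lengths and the inserted values are made-up assumptions, and note that MySQL unique indexes treat NULLs as distinct, so the optional columns may need NOT NULL DEFAULT '' for deduplication to work:
ALTER TABLE `Statistics`
  ADD UNIQUE KEY `uq_stat` (`server_id`(50), `player_id`, `statistic`(50), `material`(50), `entity`(50));
-- Upsert: insert a new row, or add to the existing aggregate.
INSERT INTO `Statistics` (`server_id`, `player_id`, `statistic`, `material`, `entity`, `value`)
VALUES ('lobby-1', UNHEX('00112233445566778899AABBCCDDEEFF'), 'MINE_BLOCK', 'STONE', '', 10)
ON DUPLICATE KEY UPDATE `value` = `value` + VALUES(`value`);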
Follow the steps below; I hope it will solve your problem:
- First, add a numeric column, let's call it "detailed", to your table.
- In your project, before running an INSERT statement, get the current maximum of detailed (SELECT MAX(detailed)+1 AS maxid FROM TABLE_NAME) and use that number as the row's identifier, which will help you FETCH and DELETE the record.
- You can also UPDATE with it, but during an UPDATE the maximum of detailed is not required.
Hope you understand this and that it helps you.
I have dug a bit more through the internet and optimized my code a lot.
I asked this question because of bad performance, which I assumed was because of the DELETE and INSERT statements following each other.
I was thinking that I could try to reduce the load by doing INSERT IGNORE statements followed by UPDATE statements, or INSERT .. ON DUPLICATE KEY UPDATE statements. But these require keys to be useful, which I haven't had access to because of the constraints in MySQL.
I have fixed the performance issues though:
By reducing the number of statements generated asynchronously (I know JDBC is blocking, but it worked; it just blocked thousands of threads) and disabling auto-commit, I was able to improve the performance by a factor of 600 (from 60 seconds down to 0.1 seconds).
Next steps are to improve the connection string and gain even more performance.
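A hedged example of where that connection-string tuning might go; rewriteBatchedStatements and cachePrepStmts are standard MySQL Connector/J parameters, while the host and database names are placeholders:
jdbc:mysql://localhost:3306/stats?rewriteBatchedStatements=true&cachePrepStmts=true
rewriteBatchedStatements=true lets the driver collapse a JDBC batch into multi-row INSERT statements, which pairs well with the disabled auto-commit described above.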
I have a problem similar to
SQL: selecting rows where column value changed from previous row
The accepted answer by ypercube, which I adapted to:
CREATE TABLE `schange` (
`PersonID` int(11) NOT NULL,
`StateID` int(11) NOT NULL,
`TStamp` datetime NOT NULL,
KEY `tstamp` (`TStamp`),
KEY `personstate` (`PersonID`, `StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `states` (
`StateID` int(11) NOT NULL AUTO_INCREMENT,
`State` varchar(100) NOT NULL,
`Available` tinyint(1) NOT NULL,
`Otherstatuseshere` tinyint(1) NOT NULL,
PRIMARY KEY (`StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT
COALESCE((@statusPre <> s.Available), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*,
@statusPre := s.Available
FROM schange c
INNER JOIN states s USING (StateID),
(SELECT @statusPre := NULL) AS d
WHERE PersonID = 1 AND TStamp > "2012-01-01" AND TStamp < "2013-01-01"
ORDER BY TStamp;
The query itself worked just fine in testing, and with the right mix of temporary tables I was able to generate reports with daily sum availability from a huge pile of data in virtually no time at all.
The real problem came when I discovered that the tables were using the MyISAM engine, which we have completely abandoned; I recreated the tables with InnoDB and noticed the query no longer works as expected.
After some bashing of head into wall, I discovered that MyISAM seems to go over the columns of each row in order (selecting statusChanged before updating @statusPre), while InnoDB seems to do all the variable assignments first, and only after that populate the result rows, regardless of whether the assignment happens in the select or where clause, in functions (COALESCE, GREATEST, etc.), in subqueries, or otherwise.
Trying to accomplish this in a query without variables always seems to end the same way: a subquery requiring exponentially more time the more rows are in the set, resulting in an excruciating minutes- (or hours-) long wait to get the beginning and ending events for one status, while a finished report should include daily sums of several.
Can this type of query work on the InnoDB engine, and if so, how should one go about it?
Or is the only feasible option to go for a database product that supports WITH statements?
Removing
KEY personstate (PersonID, StateID)
fixes the problem.
No idea why, though, but it was not really required anyway; the timestamp key is the more important one and speeds up the query nicely.
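For reference, the fix as a statement, plus a hedged note: MySQL 8.0+ later added WITH and window functions, so the user-variable trick can be replaced entirely; the LAG() sketch below mirrors the original query's logic:
ALTER TABLE `schange` DROP KEY `personstate`;
-- On MySQL 8.0+, the same result without user variables:
SELECT
COALESCE(s.Available <> LAG(s.Available) OVER (ORDER BY c.TStamp), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*
FROM schange c
INNER JOIN states s USING (StateID)
WHERE c.PersonID = 1 AND c.TStamp > '2012-01-01' AND c.TStamp < '2013-01-01'
ORDER BY c.TStamp;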