How to optimize this query on a large database? - mysql

Query
SELECT id FROM `user_tmp`
WHERE `code` = '9s5xs1sy'
AND `go` NOT REGEXP 'http://www.xxxx.example.com/aflam/|http://xx.example.com|http://www.xxxxx..example.com/aflam/|http://www.xxxxxx.example.com/v/|http://www.xxxxxx.example.com/vb/'
AND check='done'
AND `dataip` <1319992460
ORDER BY id DESC
LIMIT 50
MySQL returns:
Showing rows 0 - 29 ( 50 total, Query took 21.3102 sec) [id: 2622270 - 2602288]
If I remove
AND dataip <1319992460
MySQL returns
Showing rows 0 - 29 ( 50 total, Query took 0.0859 sec) [id: 3637556 - 3627005]
And if there is no matching data, MySQL returns:
MySQL returned an empty result set (i.e. zero rows). ( Query took 21.7332 sec )
Explain plan:
SQL query: Explain SELECT * FROM `user_tmp` WHERE `code` = '93mhco3s5y' AND `too` NOT REGEXP 'http://www.10neen.com/aflam/|http://3ltool.com|http://www.10neen.com/aflam/|http://www.10neen.com/v/|http://www.m1-w3d.com/vb/' and checkopen='2010' and `dataip` <1319992460 ORDER BY id DESC LIMIT 50;
Rows: 1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE user_tmp index NULL PRIMARY 4 NULL 50 Using where
Example of the database used
CREATE TABLE IF NOT EXISTS user_tmp (
  id int(9) NOT NULL AUTO_INCREMENT,
  ip text NOT NULL,
  dataip bigint(20) NOT NULL,
  ref text NOT NULL,
  click int(20) NOT NULL,
  code text NOT NULL,
  too text NOT NULL,
  name text NOT NULL,
  checkopen text NOT NULL,
  contry text NOT NULL,
  vOperation text NOT NULL,
  vBrowser text NOT NULL,
  iconOperation text NOT NULL,
  iconBrowser text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=4653425;
--
-- Dumping data for table user_tmp
INSERT INTO `user_tmp` (`id`, `ip`, `dataip`, `ref`, `click`, `code`, `too`, `name`, `checkopen`, `contry`, `vOperation`, `vBrowser`, `iconOperation`, `iconBrowser`) VALUES
(1, '54.125.78.84', 1319506641, 'http://xxxx.example.com/vb/showthread.php%D8%AA%D8%AD%D9%85%D9%8A%D9%84-%D8%A7%D8%BA%D9%86%D9%8A%D8%A9-%D8%A7%D9%84%D8%A8%D9%88%D9%85-giovanni-marradi-lovers-rendezvous-3cd-1999-a-155712.html', 0, '4mxxxxx5', 'http://www.xxx.example.com/aflam/', 'xxxxe', '2010', 'US', 'Linux', 'Chrome 12.0.742 ', 'linux.png', 'chrome.png');
I want to know the correct way to write this query and optimize the database.

You don't have any indexes besides the primary key. You need to create indexes on the fields you use in your WHERE clause. Whether you need to index only one field or a combination of several fields depends on the other SELECTs you will be running against that table.
Keep in mind that REGEXP cannot use indexes at all; LIKE can use an index only when the pattern does not begin with a wildcard (so LIKE 'a%' can use an index, but LIKE '%a' cannot); and greater-than / less-than comparisons (<, >) usually don't use indexes either.
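To illustrate the LIKE rule on this table's code column (assuming an index on code exists; the patterns here are made up):

SELECT id FROM user_tmp WHERE code LIKE '9s5x%'; -- fixed prefix: the index can be used
SELECT id FROM user_tmp WHERE code LIKE '%s1sy'; -- leading wildcard: full scan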
So you are left with the code and check fields. I suppose many rows will have the same value for check, so I would begin the index with the code field. Multi-field indexes can be used only in the order in which they are defined.
Imagine an index created on the fields code, check. This index can be used in your query (whose WHERE clause contains both fields), and also in a query on the code field alone, but not in a query on the check field alone.
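A sketch of such an index on this table. Since code and checkopen (the check field in the query) are TEXT columns, MySQL requires prefix lengths; the lengths and index name below are illustrative:

ALTER TABLE `user_tmp` ADD INDEX `idx_code_check` (`code`(10), `checkopen`(10));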
Is it important to ORDER BY id? If not, leave it out: that avoids the sort pass, and your query will finish faster.

I will assume you are using MySQL <= 5.1.
The answers above fall into two basic categories:
1. You are using the wrong column type
2. You need indexes
I will deal with each, as both are relevant to performance, which is ultimately what I take your question to be about:
Column Types
The difference between bigint/int or int/char for the dataip question is basically not relevant to your issue; the fundamental issue has more to do with index strategy. However, when considering performance holistically, the fact that you are using MyISAM as the engine for this table leads me to ask whether you really need TEXT column types. If you have short character columns (say, under 255 characters), making them fixed-length will most likely increase performance. Keep in mind that if any one column is of variable length (varchar, text, etc.), then changing the others is not worth it.
Vertical Partitioning
The fact to keep in mind here is that even though you are only requesting the id column, from the standpoint of disk IO and memory you are getting the entire row back. Since so many of the columns are text, this could mean a massive amount of data. Any of these columns that are not used for lookups or are not often accessed could be moved into another table, where the foreign key has a unique key placed on it, keeping the relationship 1:1.
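A hedged sketch of such a split: the bulky display-only text columns move to a side table keyed 1:1 to user_tmp (the side table's name is made up; which columns to move depends on your other queries):

CREATE TABLE user_tmp_details (
  user_tmp_id int(9) NOT NULL,  -- references user_tmp.id
  ref text NOT NULL,
  name text NOT NULL,
  contry text NOT NULL,
  vOperation text NOT NULL,
  vBrowser text NOT NULL,
  iconOperation text NOT NULL,
  iconBrowser text NOT NULL,
  PRIMARY KEY (user_tmp_id)     -- the unique key that keeps the relationship 1:1
) ENGINE=MyISAM DEFAULT CHARSET=utf8;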
Index Strategy
Most likely the problem is simply indexing, as noted above. The reason adding the "AND dataip <1319992460" condition hurts is that it forces a full table scan.
As stated above, placing all the columns of the WHERE clause into a single composite index will help. For this query, the order of the columns in the index will not matter, so long as all of them appear in the WHERE clause.
However, the order can matter a great deal for other queries. A quick example would be an index on (colA, colB). A query with "where colA = 'foo'" will use this index, but a query with "where colB = 'bar'" will not, because colB is not the left-most column in the index definition. So, if you have other queries that use these columns in some combination, it is worth minimizing the number of indexes created on the table, because every index increases the cost of a write and uses disk space. Writes are expensive because of the necessary disk activity; don't make them more expensive.
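A minimal sketch of that left-most-prefix rule (the table and column names are made up):

CREATE INDEX idx_ab ON some_table (colA, colB);

SELECT * FROM some_table WHERE colA = 'foo';                  -- can use idx_ab
SELECT * FROM some_table WHERE colA = 'foo' AND colB = 'bar'; -- can use idx_ab
SELECT * FROM some_table WHERE colB = 'bar';                  -- cannot use idx_ab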

You need to add an index, like this:
ALTER TABLE `user_tmp` ADD INDEX(`dataip`);
And if your column 'dataip' contains only unique values, you can add a unique key like this:
ALTER TABLE `user_tmp` ADD UNIQUE(`dataip`);
Keep in mind that adding an index can take a long time on a big table, so don't do it on a production server without testing.

You need to create the index on the fields in the same order as they are used in your WHERE clause; otherwise the index will not be used. Index the fields of your WHERE clause.

Does dataip really need to be a BIGINT? According to MySQL, the signed range is -9223372036854775808 to 9223372036854775807 (it is a 64-bit number).
You need to choose the right column type for the job, and add the right type of index too. Else these queries will take forever.
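If dataip holds Unix timestamps, as the sample row's value 1319506641 suggests, an unsigned 32-bit INT (maximum 4294967295) is wide enough; a sketch of the change:

ALTER TABLE `user_tmp` MODIFY `dataip` INT UNSIGNED NOT NULL;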

Related

Why is Extra NULL in a MySQL EXPLAIN? Why does >= give Using index condition?

mysql> CREATE TABLE `t` (
`id` int(11) NOT NULL,
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `a` (`a`),
KEY `b` (`b`)
) ENGINE=InnoDB
There is a table named t, and it has two indexes, named a and b.
Insert 100,000 rows of data into t:
mysql> delimiter ;;
mysql> create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=100000)do
    insert into t values(i, i, i);
    set i=i+1;
  end while;
end;;
Query OK, 0 rows affected (0.01 sec)
mysql> delimiter ;
mysql> call idata();
I did some experiments; some are as follows.
Now, I want to know:
(1) Why is Extra "Using index condition" for explain select * from t where a >= 90000;? It has an index key, but it doesn't have an index filter or table filter, so why is it Using index condition?
(2) Why is Extra NULL for explain select * from t where a = 90000;? It needs to access the table; if the first case is Using index condition, why can't the second be?
(3) Why is Extra "Using where; Using index" for explain select a from t where a >= 90000;? I know it uses a covering index, so Extra has Using index; but why does Extra have Using where? That means the server needs to filter the data, but the storage engine has already returned the correct rows, so why does the server need to filter?
First, terminology...
"Using index" means that the (in this case) INDEX(a) contains all the columns needed. That is "the index is covering".
"Using index condition" is quite different. Internally, it is called ICP (Index Condition Pushdown). This refers to whether the "handler" checks the expression or whether the "condition" (a >= 90000) is handed off to the Engine (InnoDB) to do the work.
As for "Using where"; that is still a mystery to me, even after using MySQL for 20 years and looking thousands of Explains. I ignore it.
In all 3 of your cases, INDEX(a) is used. This is indicated primarily by "key" ("a", the name of the key, not the column) and "key_len" ("5": a 4-byte INT plus 1 for NULLable), and secondarily by "type" (which does not say "ALL").
Further
If you change the 90000 to 70000, you may find that it switches to a table scan. Why bounce back and forth between the index's BTree and the data's BTree (via the PRIMARY KEY)? The Optimizer will assume it is faster to simply scan the whole table, ignoring the rows that fail the WHERE clause.
EXPLAIN FORMAT=JSON SELECT ... gives you a lot more information (perhaps not much more for this simple query). One useful surprise is that it shows how many sorts a single mention of "filesort" really refers to. (A possibly easy way to trigger that is GROUP BY x ORDER BY y, that is, grouping and ordering by different columns.)
Explain rarely has such clean numbers as your "10001". Usually the "rows" column is an approximation, sometimes a terrible one.
The slowlog records "Rows examined"; it will probably say 10001 (or maybe only 10000) and 1 for your tests. For a table scan, it would be the full 100K.
Another way to get "Rows examined" is via the "Handler" STATUS values. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
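A quick sketch of reading those Handler counters for one of your queries:

FLUSH STATUS;                              -- zero the session counters
SELECT * FROM t WHERE a >= 90000;
SHOW SESSION STATUS LIKE 'Handler_read%';  -- rows touched show up here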
Your first and last queries use a WHERE with a range comparison against other rows; in that case MySQL makes use of the index and shows it in the Extra field (type range).
When your condition matches 0-1 rows, MySQL can access them directly (an O(1) lookup): no comparison or ordering happens, it just takes one row and returns it.

Can you create an index to determine if a text column is null in mysql?

I have a very large (> 100 million rows) table and a query that I need to write needs to check if a column of type TEXT is null. I don't care about the actual text, only if the column value is null or not.
I'm trying to write an index to make that query efficient, and I want the query to be satisfied by only looking at the index e.g. such that the explain output of that query shows Using index instead of Using index condition.
To be concrete, the table I'm dealing with has a structure like this:
CREATE TABLE foo (
id INT NOT NULL AUTO_INCREMENT,
overridden TEXT,
rowState VARCHAR(32) NOT NULL,
active TINYINT NOT NULL DEFAULT 0,
PRIMARY KEY (id)
) ENGINE=InnoDB
And the query I want is this:
SELECT
IF(overridden IS NULL, "overridden", rowState) AS state,
COUNT(*)
FROM foo
WHERE active=1
GROUP BY state
The overridden column is a description of why the column is overridden but, as I mentioned above, I don't care about its content, I only want to use it as a boolean flag.
I've tried creating an index on (active, rowState, overridden(1)) but the explain output still shows Using index condition.
Is there something better that I could do?
I suspect this would be a good application for a "prefix" index. (Most attempts at using such are a fiasco.)
The optimal order, I think, would be:
INDEX(active,         -- handle the entire WHERE first
      overridden(1),  -- IS NULL possibly works well
      rowState)       -- to make the index "covering"
Using index condition means that the selection of the row(s) is handled in the Engine, not the "Handler".
EXPLAIN FORMAT=JSON SELECT ... may provide more insight into what is going on. So might the "Optimizer trace".
Or...
If the MySQL version is new enough, create a "generated column" that persists the computed value of overridden IS NULL.
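A sketch of that generated-column route, assuming MySQL 5.7+; the column and index names here are made up:

ALTER TABLE foo
  ADD COLUMN overridden_flag TINYINT AS (overridden IS NULL) STORED,
  ADD INDEX idx_active_flag_state (active, overridden_flag, rowState);

The query can then test the indexed flag instead of touching the TEXT column at all.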

Mysql not using an index when selecting a non-indexed field

I have a table meta with the following structure (this is just an example with denormalized data):
`id` int(3) not null auto_increment primary key,
`category_id` int(3),
`subdomain` varchar(191),
`created_at` timestamp,
`updated_at` timestamp
The subdomain field can hold unique values, while repeating values like 'general' can occur many times.
Situation 1
I also have an index on subdomain. This index is used for the query
Select `id` from `table` where `subdomain` = 'general'
But when I try to select a non-indexed field, MySQL scans the whole table and the index is not used:
Select `created_at` from `table` where `subdomain` = 'general'
As far as I know, an InnoDB secondary (non-clustered) index stores a reference to the row, so there should be no need to perform a linear search over all rows to retrieve a field.
I also know the optimizer can choose a plan that looks unexpected to a human, but what could the reasons be in this case?
No matter how much data is in the table, the result is always the same.
This can happen when the filtering backed by the index is not very selective, i.e. the value you filter on matches a high percentage of the total rows (e.g. 90% of your rows match subdomain = 'general'). If you use the index under that condition, you end up processing more data than a full table scan would.
Example: you have 100 rows and 90 of them match subdomain = 'general'.
A full table scan needs to access all 100 rows to check the condition, and 90 values are read for the result.
An index-backed select needs to access 90 items in the index to fulfill the condition and then follow the pointer from each index entry to the actual row to read the non-indexed value, ending up with 90 lookups on the index + 90 reads from the rows = 180 operations. This is slower than the full table scan, where you just access some rows you don't need. The operations might not have identical costs, but you end up doing more work overall.
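If the created_at query matters, one way out (standard covering-index practice, not something the question tried) is to put the selected column into the index so no row lookup is needed; the index name here is made up:

ALTER TABLE `table` ADD INDEX idx_sub_created (subdomain, created_at);

With that index, SELECT created_at FROM `table` WHERE subdomain = 'general' can be answered from the index alone, however common 'general' is.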

Why does MySQL decide on wrong index?

I have a partitioned table in MySQL that looks like this:
CREATE TABLE `table1` (
`id` bigint(19) NOT NULL AUTO_INCREMENT,
`field1` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`field2_id` int(11) NOT NULL,
`created_at` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`,`created_at`),
KEY `index1` (`field2_id`,`id`)
) ENGINE=InnoDB AUTO_INCREMENT=603221206 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(created_at))
(PARTITION p_0 VALUES LESS THAN (730485) ENGINE = InnoDB,
..... lots more partitions .....
PARTITION p_20130117 VALUES LESS THAN (735250) ENGINE = InnoDB) */;
And this is a typical SELECT query on the table:
SELECT field1 from TABLE1 where field2_id = 12345 and id > 13314313;
Doing an EXPLAIN on it, MySQL sometimes decides to use PRIMARY instead of index1. This seems to be pretty consistent the first time you run the EXPLAIN; after a few repeated EXPLAINs, MySQL finally decides to use the index. The problem is that this table has millions of rows, and inserts and selects hit it several times per second. Choosing the wrong index was causing these SELECT queries to take up to ~40 seconds, instead of sub-second times. I can't really schedule downtime, so I can't run OPTIMIZE on the table (because of its size, it would probably take a long time), and I'm not sure it would help in this case anyway.
I fixed this by forcing the index, so it looks like this:
SELECT field1 from TABLE1 FORCE INDEX (index1) WHERE field2_id = 12345 and id > 13314313;
We're running this on MySQL 5.1.63, which we can't move away from at the moment.
My question is, why is MySQL choosing the wrong index? And is there something that can be done to fix it, besides forcing the index on all queries? Is partitioning confusing the InnoDB engine? I've worked a lot with MySQL and have never seen this behavior before. The query is as simple as can be, and the index is a perfect match. We have a lot of queries that assume the DB layer will do the right thing, and I don't want to go through all of them forcing the correct index.
Update 1:
This is the typical explain without the FORCE INDEX clause. Once that's put in, the possible_keys column shows only the forced index.
id select_type table type possible_keys key key_len ref rows
1 SIMPLE table1 range PRIMARY,index1 index1 12 NULL 207
I'm not 100% sure, but I think this sounds logical:
You partition your table BY RANGE (to_days(created_at)), and the created_at field is part of the primary key. Your SELECT queries use the other part of the primary key. This way the server's optimizer thinks this would be the speediest index: using the partition plus the id part of the primary key.
I suggest (without knowing the real cause that led to your choice) changing your partition range to the id and changing the order of your index1 key.
For more information on partitioning, have a look at the MySQL documentation.
I'm not sure why the engine would pick the incorrect index. I would think that an index with an EQUALITY test would supersede one with a >, <, or range test. However, another option that might help force the correct index would be to apply a "computed" value to the id column, so the engine cannot correlate it directly to the index. Something like
WHERE field2_id = 12345 and id > 13314313
changed to
WHERE field2_id = 12345 and id + 0 > 13314313

MySQL: Optimizing COUNT(*) and GROUP BY

I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, no explanation of why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster if the grouped column also has an index on it. Of course, this depends on the number of NULL values actually in that column, since those are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time a RANGE SCAN will be applied, which is not so fast. With separate indexes, I'd expect better performance than with multi-column ones.
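A sketch of that suggestion on the schema above (source currently has no index at all; the index name is illustrative):

ALTER TABLE `history` ADD INDEX `source` (`source`);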
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest you try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if it doesn't help, try forcing the index for this query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want the count for specific sources, then you can try this:
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
and you can do the ordering in your application code. Also try indexing event and source together.
This should definitely be faster than the original query.
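For reference, a sketch combining that conditional-count trick with the original query's filters (COUNT ignores NULLs, and source = 'A' OR NULL evaluates to NULL whenever the match fails):

SELECT COUNT(source = 'A' OR NULL) AS A,
       COUNT(source = 'B' OR NULL) AS B
FROM history
WHERE event = 2000 AND time >= 0 AND time < 1310563644;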