Effective indexing for a DB with millions of rows

I have a MyISAM MySQL table with many millions of rows that I've been asked to work with, but I need to make the queries faster first.
There was no indexing at all before! I added a new index on the 'type' column, which has helped, but I wanted to know whether any other columns would be worth indexing too.
Here is my CREATE TABLE:
CREATE TABLE `clicks` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`companyid` int(11) DEFAULT '0',
`type` varchar(32) NOT NULL DEFAULT '',
`contextid` int(11) NOT NULL DEFAULT '0',
`period` varchar(16) NOT NULL DEFAULT '',
`timestamp` int(11) NOT NULL DEFAULT '0',
`location` varchar(32) NOT NULL DEFAULT '',
`ip` varchar(32) DEFAULT NULL,
`useragent` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `companyid` (`companyid`,`type`,`period`),
KEY `type` (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
A typical SELECT statement would commonly filter by the companyid, type and contextid columns.
For example:
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
or
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND type IN('direct') AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
My last question is this: adding the index on type took about 1 hour. If I am adding or removing multiple indexes, can I do it in one query, or do I have to do them one by one and wait for each to finish?
Thanks for your thoughts.

Indexing is really powerful, but isn't as much of a black art as you might think. Learn about MySQL's EXPLAIN statement; it will help you systematically find where improvements can be made:
http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html
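As a rough illustration of that advice, prefixing one of the queries from the question with EXPLAIN shows the plan MySQL intends to use without running the query. This is only a sketch; the ORDER BY from the question is left out here to keep the plan easy to read.

```sql
-- Ask MySQL how it would execute the query (nothing is actually selected).
EXPLAIN
SELECT period, COUNT(period) AS count
FROM clicks
WHERE contextid IN (123)
  AND timestamp > 123123123
GROUP BY period;

-- Columns worth reading in the output:
--   type : ALL means a full table scan; ref or range means an index is used
--   key  : the index MySQL chose (NULL if none)
--   rows : MySQL's estimate of how many rows it must examine
--   Extra: notes such as "Using temporary" or "Using filesort"
```

Running the same EXPLAIN before and after adding an index is a simple way to check whether the index is actually picked up.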

Which indexes to add really depends on your queries. Anything that you're filtering on (WHERE), grouping on (GROUP BY) or sorting on (ORDER BY) is a good candidate for an index.
You may also want to have a look at how MySQL uses indexes.
As regards the time taken to add indexes, where you're sure you want to add multiple indexes, you could do mysqldump, manually edit the table structure in the .sql file, and then reimport. This can take a while, but at least you can do all the changes at once. However, this doesn't really fit with the idea of testing as you go... so use this approach with care. (I've done it when modifying a number of tables with the same structure, and wanting to add some indexes to all of them.)
Also, I'm not 100% sure, but I think that when you add an index, MySQL creates a copy of the table with the index and then deletes the original table, so make sure there's enough space on your server / partition for the current size of the table plus some margin.

Here's one of your queries, broken on multiple lines so it's easier to read.
SELECT period, count(period) as count
FROM clicks
WHERE contextid in (123)
AND timestamp > 123123123
GROUP BY period
ORDER BY timestamp ASC
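I'm not even sure this is a valid query. In standard SQL, ORDER BY timestamp is not allowed here because timestamp is neither in the GROUP BY nor inside an aggregate; MySQL accepts it only when ONLY_FULL_GROUP_BY is disabled, and then sorts on an arbitrary timestamp from each group. I think you would have to order on period, on count, or on an aggregate such as MIN(timestamp).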
The important part of the query for optimization is the WHERE clause. In this case, an index on contextid and timestamp would speed up the query.
Obviously, you can't index every WHERE clause. You index the most common WHERE clauses.
I'd add indexes to existing tables one at a time. Yes, it's slow. But you should only have to add the indexes once.
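A minimal sketch of the index suggested above, assuming the two queries from the question are the common ones. The index names are made up for illustration. The equality column (contextid) goes first and the range column (timestamp) last, so MySQL can seek to the contextid and then scan the matching time range.

```sql
-- For: WHERE contextid IN (123) AND timestamp > 123123123
ALTER TABLE clicks
  ADD INDEX idx_context_time (contextid, `timestamp`);

-- Optional second index, only if the variant that also filters on type
-- turns out to be frequent and slow:
-- WHERE contextid IN (123) AND type IN ('direct') AND timestamp > 123123123
ALTER TABLE clicks
  ADD INDEX idx_context_type_time (contextid, type, `timestamp`);
```

Each statement rebuilds the table on MyISAM, so it is worth checking with EXPLAIN after the first index whether the second one is needed at all.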

In my opinion, timestamp and period can be indexed, as they are used in the WHERE and GROUP BY clauses.
Also, instead of contextid in (123) use contextid = 123, and instead of type IN('direct') use type = 'direct'.

You can add multiple indexes in a single query. This will save some time overall, but the table will be inaccessible while you wait for the entire query to complete:
ALTER TABLE table1 ADD INDEX `Index1` (`col1`),
ADD INDEX `Index2` (`col2`);
Regarding indexes, it's a complex subject. However, adding indexes on single columns with high-cardinality that are included in your WHERE clause is a good place to start. MySQL will try to pick the best index for the query and use that.
To further tweak performance, you should consider multi-column indexes, which I see you've implemented with your 'companyid' index.
To be able to utilize an index all the way through to a GROUP BY or ORDER BY clause relies on a lot of conditions, that you might want to read up on.
To get the most out of indexes, your database server should have enough RAM to hold the indexes in memory, and it must be configured to actually use that memory (for MyISAM, the key cache sized by key_buffer_size).
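To get a feel for whether the indexes fit in memory, something along these lines can be used. It is a sketch that assumes the clicks table from the question lives in the currently selected database.

```sql
-- Size of the MyISAM key cache (index blocks only), in bytes.
SHOW VARIABLES LIKE 'key_buffer_size';

-- Approximate on-disk size of the table's indexes and data, in MB.
SELECT table_name,
       ROUND(index_length / 1024 / 1024) AS index_mb,
       ROUND(data_length / 1024 / 1024)  AS data_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'clicks';
```

If index_mb is far larger than key_buffer_size, index lookups will often go to disk. For InnoDB tables the relevant setting is innodb_buffer_pool_size instead, which caches both data and indexes.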

Related

Order By causing my query to run really slow

I have an sql query as follows
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
order by reported_at desc
limit 1;
This query at the moment takes 313.24 secs to run.
If I remove the order by so the query is
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
then it only takes 0.117 secs to run.
The reported_at column is indexed.
So, two questions: first, why is it taking so long with the ORDER BY clause, and second, how can I speed it up?
EDIT: In response to the questions below here is the output when using explain:
'1', 'SIMPLE', 'incidents', 'index', 'uniqueReportIndex,idx_incidents_remote_ip', 'incidentsReportedAt', '4', NULL, '1044', '100.00', 'Using where'
And the table create statement:
CREATE TABLE `incidents` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`incident_ip_id` int(10) unsigned DEFAULT NULL,
`remote_id` bigint(20) DEFAULT NULL,
`remote_ip` char(32) NOT NULL,
`is_infringement` tinyint(1) NOT NULL DEFAULT '0',
`messageBody` text,
`reported_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Formerly : created_datetime',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueReportIndex` (`remote_ip`,`host_id_1`,`licence_feature`,`app_end`),
UNIQUE KEY `uniqueRemoteIncidentId` (`remote_id`),
KEY `incident_ip_id` (`incident_ip_id`),
KEY `id` (`id`),
KEY `incidentsReportedAt` (`reported_at`),
KEY `idx_incidents_remote_ip` (`remote_ip`)
)
Note: I have omitted some of the non-relevant fields, so there are more indexes than fields, but you can safely assume the fields for all the indexes are in the table.

The output of EXPLAIN reveals that, because of the ORDER BY clause, MySQL decides to use the incidentsReportedAt index. It reads each row from the table data in the order provided by the index and checks the WHERE conditions on it. This requires reading a lot of information from the table data, information that is scattered through the entire table. Not a good workflow.
Update
The OP created an index on columns reported_at and remote_ip (as suggested in the original answer, see below) and the execution time went down from 313 to 133 seconds. An improvement, but not enough. I think the cause of this still large execution time is the access to table data for each row to verify the is_infringement = 1 part of the WHERE clause, but even adding it to the index won't help very much.
The OP says in a comment:
Ok after further research and changing the index to be the other way round (remote_ip, reported_at) the query is now super fast (0.083 sec).
This index is better, indeed, because the remote_ip = '192.168.1.1' condition filters out a lot of rows. The same effect can be achieved using the existing uniqueReportIndex index. It is possible that the original index on reported_at fooled MySQL into thinking it is better to use it to check the rows in the order required by ORDER BY instead of filtering first and sorting at the end.
I think MySQL uses the new index on (remote_ip, reported_at) for filtering (WHERE remote_ip = '192.168.1.1') and for sorting (ORDER BY reported_at DESC). The WHERE condition provides a small list of candidate rows that are easily identified and also sorted using this index.
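The index the OP ended up with can be sketched as follows. The index name is invented for illustration; the column order (equality column first, sort column second) is the part that matters.

```sql
-- remote_ip first (equality filter), reported_at second (sort order).
CREATE INDEX idx_incidents_ip_reported
  ON incidents (remote_ip, reported_at);

-- Re-check the plan: the key column should now name the new index
-- instead of incidentsReportedAt.
EXPLAIN
SELECT *
FROM incidents
WHERE remote_ip = '192.168.1.1' AND is_infringement = 1
ORDER BY reported_at DESC
LIMIT 1;
```

With this index MySQL can jump to the entries for one remote_ip and read them in reported_at order, checking is_infringement only on those few rows.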
The original answer follows.
The advice it provides is not correct but it helped the OP find the correct solution.
Create an index on columns reported_at and remote_ip, in this order, then see what EXPLAIN says and how the query performs. It should work faster.
You can even create the new index on columns reported_at, remote_ip and is_infringement (the order of columns in the index is very important).
The index on three columns helps MySQL identify the rows without the need to read the table data (because all the columns from WHERE and ORDER BY clauses are in the index). It needs to read the table data only for the rows it returns because of SELECT *.
After you create the new index (either on two or three columns), remove the old index incidentsReportedAt. It is not needed any more; it uses disk and memory space and takes time to be updated but it is not used. The new index (that has the reported_at column on the first position) will be used instead.
The index on two columns requires more reads of the table data for the is_infringement = 1 condition. The query probably runs a little slower than with the three-column index. On the other hand, there is a little gain on table updates and disk and memory space usage.
The decision to index on two or three columns depends on how often the query posted in the question runs and what it serves (visitors, admins, cron jobs etc).

High traffic table, optimal indexes?

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and time (unix_timestamp)
I have three issues:
1. After a number of months in dev, the table suddenly became very slow. Queries that previously finished in under a second can now take up to a minute. I'm using the standard settings in my.cnf since this is a dev machine, but the behavior still seems very strange to me.
2. I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3. If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Is there any explanation for this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;

SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index (time, id), it would do the scan over the times fine, but it would have to skip over any entries where id != 165. And it would have to read all those entries to discover that they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
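For an existing table, the change to a clustered primary key described above might look like the sketch below. It assumes there are no duplicate (monitor_id, monitor_data_time) pairs; if there are, the statement fails with a duplicate-key error and the duplicates have to be cleaned up first.

```sql
-- Replace the secondary index with a clustered PRIMARY KEY on the same
-- columns, equality column first and range column second.
-- This rebuilds the whole table, so expect it to take a while.
ALTER TABLE monitor_data
  DROP INDEX monitor_id_data_time,
  ADD PRIMARY KEY (monitor_id, monitor_data_time);
```

After the rebuild, the range query on one monitor_id reads a contiguous run of rows from the clustered index, with monitor_data_value available in the same rows.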

What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
or, equivalently, specify PRIMARY KEY in place of UNIQUE KEY and remove the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation to this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
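One way to test the duplicate-rows explanation is to ask the table directly. This sketch lists any (monitor_id, monitor_data_time) pair that occurs more than once; the alias copies is just an illustrative name.

```sql
-- Any row returned here is a pair that a PRIMARY KEY or UNIQUE KEY
-- on (monitor_id, monitor_data_time) would have rejected.
SELECT monitor_id, monitor_data_time, COUNT(*) AS copies
FROM monitor_data
GROUP BY monitor_id, monitor_data_time
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;
```

An empty result on a table that is not being written to would point toward one of the other explanations mentioned above.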
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of the index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)

How can I select a set of IDs from large table fast?

I have a large table with ID as the primary key, about 3 million rows, and I need to extract a small set of rows based on a given ID list.
Currently I am doing it with where ... in, but it's very slow, like 5 to 10s.
My code:
select id,fa,fb,fc
from db1.t1
where id in(15,213,156,321566,13,165,416,132163,6514361,... );
I tried to query one ID at a time but it is still slow. like
select id,fa,fb,fc from db1.t1 where id =25;
I also tried to use a temp table and insert the ID list and call Join. But no improvement.
select id,fa,fb,fc from db1.t1 inner join db1.temp on t1.id=temp.id
Is there any way to make it faster?
here is table.
CREATE TABLE `db1`.`t1` (
`id` int(9) NOT NULL,
`url` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
`title` varchar(1024) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastUpdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`lastModified` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Ok here is explain select.
id=1,
select_type='SIMPLE',
table='t1',
type='range',
possible_keys='PRIMARY',
key='PRIMARY',
key_len= '4',
ref= '',
rows=9,
extra='Using where'

Here are some tips on how you can speed up the performance of your table:
Try to avoid complex SELECT queries on MyISAM tables that are updated frequently, to avoid problems with table locking that occur due to contention between readers and writers.
To sort an index and data according to an index, use myisamchk --sort-index --sort-records=1 (assuming that you want to sort on index 1). This is a good way to make queries faster if you have a unique index from which you want to read all rows in order according to the index. The first time you sort a large table this way, it may take a long time.
For MyISAM tables that change frequently, try to avoid all variable-length columns (VARCHAR, BLOB, and TEXT). The table uses dynamic row format if it includes even a single variable-length column.
Strings are automatically prefix- and end-space compressed in MyISAM indexes. See “CREATE INDEX Syntax”.
You can increase performance by caching queries or answers in your application and then executing many inserts or updates together. Locking the table during this operation ensures that the index cache is only flushed once after all updates. You can also take advantage of MySQL's query cache to achieve similar results; see “The MySQL Query Cache”.
You can read further in these articles on optimizing your queries:
MySQL Query Cache
Query Cache SELECT Options
Optimizing MySQL queries with IN operator
Optimizing MyISAM Queries

First of all, clustered indexes are faster than non-clustered indexes, if I am not wrong.
Also, even when a table already has an index, it can help to rebuild the index or refresh its statistics.
I saw in a SQL execution plan that where ID in (...) gets converted to
Where (ID =1) or (ID=2) or (Id=3)..... so the bigger the list, the more ORs; for very big tables it may be worth avoiding IN ().
Try EXPLAIN on this SQL; it can tell you where the actual bottleneck is.
Check this link: http://dev.mysql.com/doc/refman/5.5/en/explain.html
Hope this helps.
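In MySQL, the "rebuild the index or refresh statistics" suggestion roughly corresponds to the two statements below. This is a sketch only: the EXPLAIN output in the question already shows a range scan on PRIMARY with a small row estimate, so stale statistics may well not be the cause here.

```sql
-- Refresh the key distribution statistics the optimizer relies on.
ANALYZE TABLE db1.t1;

-- For MyISAM: defragment the data file, sort the index pages and
-- update statistics. The table is locked while this runs.
OPTIMIZE TABLE db1.t1;
```

Both statements are cheap to try on a 3 million row table, but it is worth timing the query before and after to see whether they made any difference.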

It looks like the original SQL statement using 'in' should be fine, since the id column is indexed.
I think you basically need a faster computer. Are you doing this query on shared hosting?

What is better: select duplicates check OR making a unique index?

I have a table like this:
`id` int(11) NOT NULL AUTO_INCREMENT,
`torderid` int(11) DEFAULT NULL,
`tuid` int(11) DEFAULT NULL,
`tcontid` int(11) DEFAULT NULL,
`tstatus` varchar(10) DEFAULT 'pending',
The rule here is that the same user (tuid) can't have more than one pending order with the same tcontid.
So the first thing I could do is check whether there is already a pending order, like this:
select count(id) into @cnt from tblorders where tuid = 1 and tcontid=5 and tstatus like 'pending';
and if it is > 0, then don't insert.
Or I could just make a unique index on the three columns, so that the table won't accept duplicate records.
The question is:
Which way is faster? That's going to be a large database...

Few suggestions.
use tstatus = 'pending'; instead of tstatus like 'pending';
Creating a composite unique key on tuid, tcontid, tstatus may not work if the rule applies only to the 'pending' status. What about other statuses? The same key would also reject a second order in any other status for the same user and tcontid.
If you decide to index the columns, I would recommend you create a separate table for tstatus and use the foreign key reference here. So it will save the space for the indexed columns and also your query will always run on the indexed fields.

An index is clearly faster, as it is designed for that use case.
It speeds up the search for the tuple you are looking for, and if the constraint is violated the server returns an error, which your script can handle more easily (and faster) than fetching a result and checking it.

Assuming that a user will attempt to add a conflicting record less often than a valid record then the compound index will be faster. The SQL engine will maintain the index and throw an error when there is an index constraint violation.
Even if you did elect the select method you would need to maintain the index. Aside from that, pulling a result set from a select all the way back into your application's memory space and then checking the result is much slower than enforcing it on an index constraint.
For more info please see: https://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
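A sketch of the constraint-based approach, using the column names from the question. The key name and the sample values are hypothetical.

```sql
-- Let the server enforce the rule instead of a SELECT-then-INSERT check.
ALTER TABLE tblorders
  ADD UNIQUE KEY uq_user_cont_status (tuid, tcontid, tstatus);

-- First insert succeeds.
INSERT INTO tblorders (torderid, tuid, tcontid, tstatus)
VALUES (1001, 1, 5, 'pending');

-- Repeating the same tuid / tcontid / tstatus combination fails with
-- error 1062 (duplicate entry), which the application can catch.
INSERT INTO tblorders (torderid, tuid, tcontid, tstatus)
VALUES (1002, 1, 5, 'pending');
```

Two caveats apply. The key covers every status, not only 'pending', so two rows with the same user, tcontid and any other identical status are rejected as well. Also, the columns in the question allow NULL, and a unique index in MySQL permits multiple rows containing NULL in a key column, so declaring them NOT NULL would be needed for the rule to hold in all cases.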

How to optimize MySQL table containing 1.6+ million records for LIKE '%abc%' querying

I have a table with this structure and it currently contains about 1.6 million records.
CREATE TABLE `chatindex` (
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`roomname` varchar(90) COLLATE utf8_bin NOT NULL,
`username` varchar(60) COLLATE utf8_bin NOT NULL,
`filecount` int(10) unsigned NOT NULL,
`connection` int(2) unsigned NOT NULL,
`primaryip` int(10) unsigned NOT NULL,
`primaryport` int(2) unsigned NOT NULL,
`rank` int(1) NOT NULL,
`hashcode` varchar(12) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`timestamp`,`roomname`,`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Both the roomname and username columns can contain the same exact data, but the uniqueness and the important bit of each item comes from combining the timestamp with those two items.
The query that is starting to take a while (10-20 seconds) is this:
SELECT timestamp,roomname,username,primaryip,primaryport
FROM `chatindex`
WHERE username LIKE '%partialusername%'
What exactly can I do to optimize this? I can't do partialusername% because for some queries I will only have a small bit of the center of the actual username, and not the first few characters from the beginning of the actual value.
Edit:
Also, would sphinx be better for this particular purpose?

Use FULLTEXT indexes; they are designed for this purpose. InnoDB supports FULLTEXT indexes as of MySQL 5.6.4.

Create Index on table column username (full-text indexing).
As an idea, you can create some views on this table that will contain filtered data on the basis of alphabets or other criteria and based on that your code will decide which view to use to fetch the search results.

You should use a MyISAM table for full-text search, as it supports FULLTEXT indexes. MySQL 5.6+ is still in development as I write this, so you should not use it on production servers, and it may take about a year to go GA.
For now, you should convert this table to MyISAM and add a FULLTEXT index on the column referenced in the WHERE clause.
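The statements for that conversion are not shown in the answer; a plausible sketch, with a made-up index name and search term, would be:

```sql
-- Switch the storage engine, then add the full-text index.
ALTER TABLE chatindex ENGINE=MyISAM;
ALTER TABLE chatindex ADD FULLTEXT INDEX ft_username (username);

-- Full-text searches use MATCH ... AGAINST rather than LIKE.
-- In boolean mode, 'fred*' matches words that START with "fred".
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE MATCH(username) AGAINST ('fred*' IN BOOLEAN MODE);
```

Note that full-text search works on whole words and word prefixes, so this does not find "fred" in the middle of a name the way LIKE '%fred%' does. With MyISAM, words shorter than ft_min_word_len (4 by default) are not indexed at all.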
These links can be useful:
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
http://dev.mysql.com/doc/refman/5.1/en/fulltext-fine-tuning.html

On MSSQL this is a perfect case for fulltext indexes together with the CONTAINS predicate. The LIKE clause fails to perform well on such a big table with so many variants of text to search for.
Take a look at this link; there are many issues related to dynamic search conditions.

If you do an explain on the current query, you will see that you are doing a full table scan of the table which is why it is so slow. An index on username will materially speed up the search as the index can be cached by MySQL and the table row entries will only be accessed for matching users.
A fulltext index will not materially help searches like %fred% to match oldfredboy etc., so I am at a loss as to why others are recommending it. What a fulltext index does is create a wordlist-based index, so that when you search for something like "explain the current query" the fulltext engine intersects the row IDs containing "explain" with those containing "current" and those containing "query" to get a list of IDs which contain all three. Adding a fulltext index materially increases the insert, update and delete costs for the table, so it does add a performance penalty. Furthermore, you need to use the fulltext-specific "MATCH" syntax to make full use of a fulltext index.
Do a question search on "[mysql] fulltext like" to see further discussion of this.
A normal index will do everything that you need. Searches like '%fred%' require a full scan of the index whatever you do, so you need to keep the index as lean as possible. Also, if a high % of hits match 'fred%', then it might be worth trying a like 'fred%' search first, as this will do an index range scan.
One other point, why are you using the timestamp, roomname, username as the primary key? This doesn't make sense to me. If you don't use the primary key as an access path then an auto_increment id is easier. I would have thought roomname, timestamp, username would make some sense as you surely tend to access rooms within a time window.
Only add indexes that you will use.
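A sketch of the normal-index approach, with a hypothetical index name. Adding primaryip and primaryport is an assumption beyond the answer: it makes the index cover the whole query (InnoDB already stores the primary key columns timestamp, roomname and username in every secondary index entry), which makes it more likely that the optimizer scans the index rather than the table.

```sql
CREATE INDEX idx_username_ip_port
  ON chatindex (username, primaryip, primaryport);

-- Cheap attempt first: a prefix match can use an index range scan.
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE username LIKE 'fred%';

-- Fallback: the leading wildcard forces a scan of every index entry,
-- but the index is much smaller than the full table rows.
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE username LIKE '%fred%';
```

Running EXPLAIN on the second query and checking whether Extra reports "Using index" shows whether the scan stays inside the index.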

A table index (a full-text index) is a must for such high volumes of data.
Further, if possible, go for partitioning of the table. Both should improve performance.