How can I select a set of IDs from a large table fast? - mysql

I have a large table with ID as the primary key. It has about 3 million rows, and I need to extract a small set of rows based on a given ID list.
Currently I am doing it with WHERE ... IN, but it's very slow, like 5 to 10 seconds.
My code:
select id,fa,fb,fc
from db1.t1
where id in(15,213,156,321566,13,165,416,132163,6514361,... );
I tried to query one ID at a time, but it is still slow, like:
select id,fa,fb,fc from db1.t1 where id =25;
I also tried to use a temp table, insert the ID list into it, and join. But no improvement.
select id,fa,fb,fc from db1.t1 inner join db1.temp on t1.id=temp.id
Is there any way to make it faster?
Here is the table:
CREATE TABLE `db1`.`t1` (
`id` int(9) NOT NULL,
`url` varchar(256) COLLATE utf8_unicode_ci NOT NULL,
`title` varchar(1024) COLLATE utf8_unicode_ci DEFAULT NULL,
`lastUpdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`lastModified` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
OK, here is the EXPLAIN SELECT output:
id=1,
select_type='SIMPLE',
table='t1',
type='range',
possible_keys='PRIMARY',
key='PRIMARY',
key_len= '4',
ref= '',
rows=9,
extra='Using where'

Here are some tips on how you can speed up the performance of your table:
Try to avoid complex SELECT queries on MyISAM tables that are updated
frequently, to avoid problems with table locking that occur due to
contention between readers and writers.
To sort an index and data according to an index, use myisamchk
--sort-index --sort-records=1 (assuming that you want to sort on index 1). This is a good way to make queries faster if you have a
unique index from which you want to read all rows in order according
to the index. The first time you sort a large table this way, it may
take a long time.
For MyISAM tables that change frequently, try to avoid all
variable-length columns (VARCHAR, BLOB, and TEXT). The table uses
dynamic row format if it includes even a single variable-length
column.
Strings are automatically prefix- and end-space compressed in MyISAM
indexes. See “CREATE INDEX Syntax”.
You can increase performance by caching queries or answers in your
application and then executing many inserts or updates together.
Locking the table during this operation ensures that the index cache
is only flushed once after all updates (see the sketch after the links below). You can also take advantage
of MySQL's query cache to achieve similar results; see “The MySQL Query Cache”.
You can read further in these articles on optimizing your queries:
MySQL Query Cache
Query Cache SELECT Options
Optimizing MySQL queries with IN operator
Optimizing MyISAM Queries
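As a rough illustration of the batching-and-locking tip above (a minimal sketch, not from the original question; the values are made up and the target is the t1 table from the question), it could look like this:
LOCK TABLES db1.t1 WRITE;
INSERT INTO db1.t1 (id, url) VALUES (1, 'http://example.com/a');
INSERT INTO db1.t1 (id, url) VALUES (2, 'http://example.com/b');
-- ... many more batched inserts ...
UNLOCK TABLES;
For a MyISAM table the key cache is then flushed once when the lock is released rather than after every statement, which is where the speedup comes from.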

First of all, clustered indexes are faster than non-clustered indexes, if I am not wrong.
Then, even if you already have an index on a table, it sometimes helps to rebuild the index or recreate the statistics.
I saw in an SQL explain plan that when we use WHERE ID IN (...), it converts it to WHERE (ID = 1) OR (ID = 2) OR (ID = 3)..., so the bigger the list, the more ORs; for very big tables, avoid IN ().
Try "Explain" this SQL and it can tell you where is the actual bottle neck.
Check this link http://dev.mysql.com/doc/refman/5.5/en/explain.html
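If the long IN () list really is the bottleneck, one alternative (a sketch only; the OP already tried a temp table, the difference here is making sure the temporary table itself is indexed, and id_list is just an example name) is:
CREATE TEMPORARY TABLE id_list (id INT NOT NULL, PRIMARY KEY (id)) ENGINE=MEMORY;
INSERT INTO id_list (id) VALUES (15), (213), (156), (321566), (13);
SELECT t1.id, t1.fa, t1.fb, t1.fc
FROM db1.t1
INNER JOIN id_list ON id_list.id = t1.id;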
Hope this will work.

Looks like the original SQL statement using IN should be fine, since the id column is indexed.
I think you basically need a faster computer - are you doing this query on shared hosting?

Related

Order By causing my query to run really slow

I have an SQL query as follows:
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
order by reported_at desc
limit 1;
This query at the moment takes 313.24 secs to run.
If I remove the order by so the query is
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
then it only takes 0.117 secs to run.
The reported_at column is indexed.
So, two questions: firstly, why is it taking so long with this ORDER BY clause, and secondly, how can I speed it up?
EDIT: In response to the questions below, here is the output when using EXPLAIN:
'1', 'SIMPLE', 'incidents', 'index', 'uniqueReportIndex,idx_incidents_remote_ip', 'incidentsReportedAt', '4', NULL, '1044', '100.00', 'Using where'
And the table create statement:
CREATE TABLE `incidents` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`incident_ip_id` int(10) unsigned DEFAULT NULL,
`remote_id` bigint(20) DEFAULT NULL,
`remote_ip` char(32) NOT NULL,
`is_infringement` tinyint(1) NOT NULL DEFAULT '0',
`messageBody` text,
`reported_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Formerly : created_datetime',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueReportIndex` (`remote_ip`,`host_id_1`,`licence_feature`,`app_end`),
UNIQUE KEY `uniqueRemoteIncidentId` (`remote_id`),
KEY `incident_ip_id` (`incident_ip_id`),
KEY `id` (`id`),
KEY `incidentsReportedAt` (`reported_at`),
KEY `idx_incidents_remote_ip` (`remote_ip`)
)
Note: I have omitted some of the non-relevant fields, so there are more indexes than fields, but you can safely assume the fields for all the indexes are in the table.
The output of EXPLAIN reveals that, because of the ORDER BY clause, MySQL decides to use the incidentsReportedAt index. It reads each row from the table data in the order provided by the index and checks the WHERE conditions on it. This requires reading a lot of information from the table data, information that is scattered through the entire table. Not a good workflow.
Update
The OP created an index on columns reported_at and remote_ip (as suggested in the original answer, see below) and the execution time went down from 313 to 133 seconds. An improvement, but not enough. I think the cause of this still-large execution time is the access to the table data for each row to verify the is_infringement = 1 part of the WHERE clause, but even adding that column to the index won't help very much.
The OP says in a comment:
Ok after further research and changing the index to be the other way round (remote_ip, reported_at) the query is now super fast (0.083 sec).
This index is better, indeed, because the remote_ip = '192.168.1.1' condition filters out a lot of rows. The same effect can be achieved using the existing uniqueReportIndex index. It is possible that the original index on reported_at fooled MySQL into thinking it is better to use it to check the rows in the order required by ORDER BY instead of filtering first and sorting at the end.
I think MySQL uses the new index on (remote_ip, reported_at) for filtering (WHERE remote_ip = '192.168.1.1') and for sorting (ORDER BY reported_at DESC). The WHERE condition provides a small list of candidate rows that are easily identified and also sorted using this index.
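For concreteness, the index that finally worked for the OP would be created with something like this (a sketch; the index name is arbitrary):
CREATE INDEX idx_incidents_ip_reported ON incidents (remote_ip, reported_at);
-- optionally, once it is in use, drop the old single-column index as suggested below:
-- DROP INDEX incidentsReportedAt ON incidents;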
The original answer follows.
The advice it provides is not correct but it helped the OP find the correct solution.
Create an index on columns reported_at and remote_ip, in this order,
then see what EXPLAIN says and how the query performs. It should work faster.
You can even create the new index on columns reported_at, remote_ip and is_infringement (the order of columns in the index is very important).
The index on three columns helps MySQL identify the rows without the need to read the table data (because all the columns from WHERE and ORDER BY clauses are in the index). It needs to read the table data only for the rows it returns because of SELECT *.
After you create the new index (either on two or three columns), remove the old index incidentsReportedAt. It is not needed any more; it uses disk and memory space and takes time to be updated but it is not used. The new index (that has the reported_at column on the first position) will be used instead.
The index on two columns requires more reads of the table data for the is_infringement = 1 condition. The query probably runs a little slower than with the three-column index. On the other hand, there is a little gain in table updates and in disk and memory space usage.
The decision to index on two or three columns depends on how often the query posted in the question runs and what it serves (visitors, admins, cron jobs etc).

Why is mysql slow at performing UPDATE WHERE IN query (on the PK) as I iterate deeper into the table

I have two databases:
Database A
CREATE TABLE `jobs` (
`job_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`in_b` tinyint(1) DEFAULT 0,
PRIMARY KEY (`job_id`),
KEY `idx_inb` (`in_b`)
)
Database B
CREATE TABLE `jobs_copy` (
`job_id` int(11) unsigned NOT NULL,
`created` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`job_id`)
)
Performance Issue
I am performing a query where I get a batch of jobs (100 jobs) from Database A and create a copy in Database B, then mark them as in_b with a:
UPDATE jobs SET in_b=1 WHERE job_id IN (1,2,3.....)
This worked fine. The rows were being transferred fairly quickly until I reached job_id values > 2,000,000. The select query to get a batch of jobs was still quick (4ms), but the update statement was much slower.
Is there a reason for this? I searched the MySQL docs / Stack Overflow to see if converting the IN to an OR query would improve things, but the general consensus was that an IN query will be faster in most cases.
If anyone has any insight as to why this is happening and how I can avoid this slowdown as I reach 10mil + rows, I would be extremely grateful.
Thanks in advance,
Ash
P.S. I am completing these update/select/insert operations through two RESTful services (one attached to each DB), but this has been constant from job_id 1 through 2 million, etc.
Your UPDATE query is progressively slowing down because it's having to read many rows from your large table to find the rows it needs to process. It's probably doing a so-called full table scan because there is no suitable index.
Pro tip: when a query starts out running fast, but then gets slower and slower over time, it's a sign that optimization (possibly indexing) is required.
To optimize this query:
UPDATE jobs SET in_b=1 WHERE job_id IN (1,2,3.....)
Create an index on the job_id column, as follows.
CREATE INDEX job_id_index ON jobs(job_id)
This should allow your query to locate the records which it needs to update very quickly with its IN (2,3,6) search filter.
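To verify that the filter is actually resolved by an index, you can look at the plan (a sketch; EXPLAIN for UPDATE statements requires MySQL 5.6+, on older versions EXPLAIN the equivalent SELECT instead):
EXPLAIN UPDATE jobs SET in_b = 1 WHERE job_id IN (1, 2, 3);
EXPLAIN SELECT job_id FROM jobs WHERE job_id IN (1, 2, 3);
If type shows range and key shows PRIMARY (or the new index), the IN list is being resolved through the index rather than a full scan.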

How to optimize MySQL table containing 1.6+ million records for LIKE '%abc%' querying

I have a table with this structure and it currently contains about 1.6 million records.
CREATE TABLE `chatindex` (
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`roomname` varchar(90) COLLATE utf8_bin NOT NULL,
`username` varchar(60) COLLATE utf8_bin NOT NULL,
`filecount` int(10) unsigned NOT NULL,
`connection` int(2) unsigned NOT NULL,
`primaryip` int(10) unsigned NOT NULL,
`primaryport` int(2) unsigned NOT NULL,
`rank` int(1) NOT NULL,
`hashcode` varchar(12) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`timestamp`,`roomname`,`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Both the roomname and username columns can contain the same exact data, but the uniqueness and the important bit of each item comes from combining the timestamp with those two items.
The query that is starting to take a while (10-20 seconds) is this:
SELECT timestamp,roomname,username,primaryip,primaryport
FROM `chatindex`
WHERE username LIKE '%partialusername%'
What exactly can I do to optimize this? I can't do partialusername% because for some queries I will only have a small bit of the center of the actual username, and not the first few characters from the beginning of the actual value.
Edit:
Also, would sphinx be better for this particular purpose?
Use FULLTEXT indexes; these are actually designed for this purpose. InnoDB supports fulltext indexes as of MySQL 5.6.4.
Create an index on the username column (full-text indexing).
As an idea, you can create some views on this table that contain data filtered alphabetically or by other criteria, and based on that your code can decide which view to use to fetch the search results.
You should use a MyISAM table to do fulltext search, as MyISAM supports FULLTEXT indexes. MySQL 5.6+ is still in the development phase; you should not use it on production servers, and it may take about a year to go GA.
So, you should convert this table to MyISAM and add a FULLTEXT index on the column referenced in the WHERE clause.
These links can be useful:
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
http://dev.mysql.com/doc/refman/5.1/en/fulltext-fine-tuning.html
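A sketch of what the fulltext suggestion would look like (assuming MySQL 5.6.4+ InnoDB, or a MyISAM copy on older versions; note that MATCH works on whole words, so it still will not find an arbitrary fragment from the middle of a username):
ALTER TABLE chatindex ADD FULLTEXT INDEX ft_username (username);
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE MATCH(username) AGAINST('partialusername' IN BOOLEAN MODE);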
On MSSQL this is a perfect case for using fulltext indexes together with the CONTAINS clause. The LIKE clause fails to achieve good performance on such a big table with so many variants of text to search for.
Take a look at this link; there are many issues related to dynamic search conditions.
If you do an EXPLAIN on the current query, you will see that you are doing a full table scan, which is why it is so slow. An index on username will materially speed up the search, as the index can be cached by MySQL and the table row entries will only be accessed for matching users.
A fulltext index will not materially help searches like %fred% match oldfredboy etc., so I am at a loss as to why others are recommending it. What a fulltext index does is create a word-list-based index, so that when you search for something like "explain the current query", the fulltext engine intersects the row IDs containing "explain" with those containing "current" and those containing "query" to get a list of IDs which contain all three. Adding a fulltext index materially increases the insert, update, and delete costs for the table, so it does add a performance penalty. Furthermore, you need to use the fulltext-specific MATCH syntax to make full use of a fulltext index.
Do a question search on "[mysql] fulltext like" to see further discussion on this.
A normal index will do everything that you need. Searches like '%fred%' require a full scan of the index whatever you do, so you need to keep the index as lean as possible. Also, if a high % of hits match 'fred%', then it might be worth trying a LIKE 'fred%' search first, as this will do an index range scan.
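A sketch of that approach (the index name is arbitrary; how much the plain index helps the '%fred%' case depends on whether the optimizer chooses to scan it):
CREATE INDEX idx_chatindex_username ON chatindex (username);
-- fast path: a prefix search can use an index range scan
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE username LIKE 'fred%';
-- fallback: a substring search cannot use a range scan and has to examine every entry
SELECT timestamp, roomname, username, primaryip, primaryport
FROM chatindex
WHERE username LIKE '%fred%';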
One other point: why are you using timestamp, roomname, username as the primary key? This doesn't make sense to me. If you don't use the primary key as an access path, then an auto_increment id is easier. I would have thought roomname, timestamp, username would make more sense, as you surely tend to access rooms within a time window.
Only add indexes that you will use.
A table index (fulltext index) is a must for such high volumes of data.
Further, if possible, go for partitioning of the table. These steps will definitely improve the performance.

Effective indexing for a DB with millions of rows

I have a MyISAM MySQL table with many millions of rows which I've been asked to work with, but I need to make the queries faster first.
There was no indexing before at all! I added a new index on the type column, which has helped, but I wanted to know if there are any other columns that might be best indexed too.
Here is my CREATE TABLE:
CREATE TABLE `clicks` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`companyid` int(11) DEFAULT '0',
`type` varchar(32) NOT NULL DEFAULT '',
`contextid` int(11) NOT NULL DEFAULT '0',
`period` varchar(16) NOT NULL DEFAULT '',
`timestamp` int(11) NOT NULL DEFAULT '0',
`location` varchar(32) NOT NULL DEFAULT '',
`ip` varchar(32) DEFAULT NULL,
`useragent` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `companyid` (`companyid`,`type`,`period`),
KEY `type` (`type`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
A typical SELECT statement would commonly filter by the companyid, type and contextid columns.
For example:
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
or
SELECT period, count(period) as count FROM clicks WHERE contextid in (123) AND type IN('direct') AND timestamp > 123123123 GROUP BY period ORDER BY timestamp ASC
The last part of my question is this: when I added the index on type it took about 1 hour - if I am adding or removing multiple indexes, can you do it in one query, or do you have to do them one by one and wait for each to finish?
Thanks for your thoughts.
Indexing is really powerful, but isn't as much of a black art as you might think. Learn about MySQL's EXPLAIN PLAN capabilities; this will help you systematically find where improvements can be made:
http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html
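For example, a sketch running EXPLAIN on the first query from the question:
EXPLAIN SELECT period, COUNT(period) AS count
FROM clicks
WHERE contextid IN (123) AND timestamp > 123123123
GROUP BY period
ORDER BY timestamp ASC;
The type, key and rows columns of the output show which index (if any) is used and roughly how many rows must be examined.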
Which indexes to add really depends on your queries. Anything that you're sorting (GROUP BY) or selecting (WHERE) on is a good candidate for an index.
You may also want to have a look at how Mysql uses indexes.
As regards the time taken to add indexes, where you're sure you want to add multiple indexes, you could do mysqldump, manually edit the table structure in the .sql file, and then reimport. This can take a while, but at least you can do all the changes at once. However, this doesn't really fit with the idea of testing as you go... so use this approach with care. (I've done it when modifying a number of tables with the same structure, and wanting to add some indexes to all of them.)
Also, I'm not 100% sure, but I think that when you add an index, MySQL creates a copy of the table with the index and then deletes the original table - so make sure there's enough space on your server / partition for the current size of the table plus some margin.
Here's one of your queries, broken on multiple lines so it's easier to read.
SELECT period, count(period) as count
FROM clicks
WHERE contextid in (123)
AND timestamp > 123123123
GROUP BY period
ORDER BY timestamp ASC
I'm not even sure this is a valid query. I thought your GROUP BY and ORDER BY had to match up in SQL. I think that you would have to order on count, as the GROUP BY would order on period.
The important part of the query for optimization is the WHERE clause. In this case, an index on contextid and timestamp would speed up the query.
Obviously, you can't index every WHERE clause. You index the most common WHERE clauses.
I'd add indexes to existing tables one at a time. Yes, it's slow. But you should only have to add the indexes once.
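A hedged sketch of the contextid + timestamp index mentioned above (the index name is arbitrary):
ALTER TABLE clicks ADD INDEX idx_contextid_timestamp (contextid, `timestamp`);
With contextid first, the equality filter narrows the candidate rows before the range condition on timestamp is applied.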
In my opinion timestamp and period can be indexed as they are being used in the WHERE clause.
Also, instead of using contextid IN (123) use contextid = 123, and instead of type IN ('direct') use type = 'direct'.
You can add multiple indexes in a single query. This will save some time overall, but the table will be inaccessible while you wait for the entire query to complete:
ALTER TABLE table1 ADD INDEX `Index1` (`col1`),
ADD INDEX `Index2` (`col2`);
Regarding indexes, it's a complex subject. However, adding indexes on single columns with high cardinality that are included in your WHERE clause is a good place to start. MySQL will try to pick the best index for the query and use that.
To further tweak performance, you should consider multi-column indexes, which I see you've implemented with your 'companyid' index.
Being able to utilize an index all the way through to a GROUP BY or ORDER BY clause relies on a lot of conditions that you might want to read up on.
To best utilize indexes, your database server must have enough RAM to store the indexes entirely in memory and the server must be configured properly to actually utilize the memory.

"SELECT COUNT(*)" is slow, even with where clause

I'm trying to figure out how to optimize a very slow query in MySQL (I didn't design this):
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
+----------+
| COUNT(*) |
+----------+
| 3224022 |
+----------+
1 row in set (1 min 0.16 sec)
Comparing that to a full count:
select count(*) from change_event;
+----------+
| count(*) |
+----------+
| 6069102 |
+----------+
1 row in set (4.21 sec)
The explain statement doesn't help me here:
explain SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: me
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: NULL
rows: 4120213
Extra: Using where; Using index
1 row in set (0.00 sec)
OK, it still thinks it needs roughly 4 million entries to count, but I could count lines in a file faster than that! I don't understand why MySQL is taking this long.
Here's the table definition:
CREATE TABLE `change_event` (
`change_event_id` bigint(20) NOT NULL default '0',
`timestamp` datetime NOT NULL,
`change_type` enum('create','update','delete','noop') default NULL,
`changed_object_type` enum('Brand','Broadcast','Episode','OnDemand') NOT NULL,
`changed_object_id` varchar(255) default NULL,
`changed_object_modified` datetime NOT NULL default '1000-01-01 00:00:00',
`modified` datetime NOT NULL default '1000-01-01 00:00:00',
`created` datetime NOT NULL default '1000-01-01 00:00:00',
`pid` char(15) default NULL,
`episode_pid` char(15) default NULL,
`import_id` int(11) NOT NULL,
`status` enum('success','failure') NOT NULL,
`xml_diff` text,
`node_digest` char(32) default NULL,
PRIMARY KEY (`change_event_id`),
KEY `idx_change_events_changed_object_id` (`changed_object_id`),
KEY `idx_change_events_episode_pid` (`episode_pid`),
KEY `fk_import_id` (`import_id`),
KEY `idx_change_event_timestamp_ce_id` (`timestamp`,`change_event_id`),
KEY `idx_change_event_status` (`status`),
CONSTRAINT `fk_change_event_import` FOREIGN KEY (`import_id`) REFERENCES `import` (`import_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Version:
$ mysql --version
mysql Ver 14.12 Distrib 5.0.37, for pc-solaris2.8 (i386) using readline 5.0
Is there something obvious I'm missing? (Yes, I've already tried "SELECT COUNT(change_event_id)", but there's no performance difference).
InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages. In order to do a range scan you still have to scan through all of the potentially wide rows in data pages; note that this table contains a TEXT column.
Two things I would try:
run optimize table. This will ensure that the data pages are physically stored in sorted order. This could conceivably speed up a range scan on a clustered primary key.
create an additional non-primary index on just the change_event_id column. This will store a copy of that column in index pages, which will be much faster to scan. After creating it, check the explain plan to make sure it's using the new index.
(you also probably want to make the change_event_id column bigint unsigned if it's incrementing from zero)
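A sketch of those two suggestions (the index name is arbitrary; OPTIMIZE TABLE rebuilds the table and can take a while on 6 million rows):
OPTIMIZE TABLE change_event;
CREATE INDEX idx_change_event_id ON change_event (change_event_id);
EXPLAIN SELECT COUNT(*) FROM change_event WHERE change_event_id > '1212281603783391';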
Here are a few things I suggest:
Change the column from a "bigint" to an "int unsigned". Do you really ever expect to have more than 4.2 billion records in this table? If not, then you're wasting space (and time) with the extra-wide field. MySQL indexes are more efficient on smaller data types.
Run the "OPTIMIZE TABLE" command, and see whether your query is any faster afterward.
You might also consider partitioning your table according to the ID field, especially if older records (with lower ID values) become less relevant over time. A partitioned table can often execute aggregate queries faster than one huge, unpartitioned table.
EDIT:
Looking more closely at this table, it looks like a logging-style table, where rows are inserted but never modified.
If that's true, then you might not need all the transactional safety provided by the InnoDB storage engine, and you might be able to get away with switching to MyISAM, which is considerably more efficient on aggregate queries.
I've run into behavior like this before with IP geolocation databases. Past some number of records, MySQL's ability to get any advantage from indexes for range-based queries apparently evaporates. With the geolocation DBs, we handled it by segmenting the data into chunks that were reasonable enough to allow the indexes to be used.
Check to see how fragmented your indexes are. At my company we have a nightly import process that trashes our indexes, and over time it can have a profound impact on data access speeds. For example, we had a SQL procedure that took 2 hours to run; one day, after de-fragmenting the indexes, it took 3 minutes. We use SQL Server 2005; I'll look for a script that can check this on MySQL.
Update: Check out this link: http://dev.mysql.com/doc/refman/5.0/en/innodb-file-defragmenting.html
Run "analyze table_name" on that table - it's possible that the indices are no longer optimal.
You can often tell this by running "show index from table_name". If the cardinality value is NULL then you need to force re-analysis.
MySQL does say "Using where" first, since it does need to read all records/values from the index data to actually count them. With InnoDb it also tries to "grab" that 4 mil record range to count it.
You may need to experiment with different transaction isolation levels: http://dev.mysql.com/doc/refman/5.1/en/set-transaction.html#isolevel_read-uncommitted
and see which one is better.
With MyISAM it would be fast, but an intensive write model will result in lock issues.
To make the search more efficient, I recommend adding an index. I leave the command for you so you can try the metrics again:
CREATE INDEX ixid_1 ON change_event (change_event_id);
and repeat query
SELECT COUNT(*) FROM change_event me WHERE change_event_id > '1212281603783391';
-JACR
I would create a "counters" table and add "create row"/"delete row" triggers to the table you are counting. The triggers should increase/decrease count values on "counters" table on every insert/delete, so you won't need to compute them every time you need them.
You can also accomplish this on the application side by caching the counters but this will involve clearing the "counter cache" on every insertion/deletion.
For some reference take a look at this http://pure.rednoize.com/2007/04/03/mysql-performance-use-counter-tables/
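A minimal sketch of the trigger-maintained counter (table and trigger names are hypothetical; note this keeps only a grand total, so it would not by itself answer the WHERE change_event_id > ... variant):
CREATE TABLE change_event_counter (total BIGINT NOT NULL);
INSERT INTO change_event_counter (total) SELECT COUNT(*) FROM change_event;
CREATE TRIGGER change_event_ai AFTER INSERT ON change_event
FOR EACH ROW UPDATE change_event_counter SET total = total + 1;
CREATE TRIGGER change_event_ad AFTER DELETE ON change_event
FOR EACH ROW UPDATE change_event_counter SET total = total - 1;
SELECT total FROM change_event_counter;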