MySQL MyISAM INSERT slowness

I am having a performance issue when inserting data into a MySQL table.
The table has a bunch of columns, let's say DATE,A,B,C,D,E,F, where DATE,A,B,C,D,E is the primary key. Every day, I insert 70k rows into that table (with a different date), and the table contains 18 million rows now. The method I use to insert the rows is simply sending 70k INSERT queries.
The problem is that the queries have started to take a lot more time than they used to, going from a few minutes to a few hours. I profiled the inserts and these are the charts I got:
Speed of each insert (in sec) vs. Number of insert for that day:
A few strange facts:
Most queries take less than 2 ms to execute
The duration of the slow queries increases linearly with the number of rows in the table for that date
This behavior only happens at night, after a bunch of processes have run on the database. Inserting during the day is fast, as is inserting on weekends
The overall speed doesn't depend on what else is running on the database; in fact, nothing else is running on the database when this happens
Nothing in a query explains whether it is fast or not: the fast ones are very similar to the slow ones, and the slow set is not the same from one day to the next.
The behavior does not change from one day to the next.
Any idea what could cause this?
** Edit ** the columns in the index are in the following order:
DATE NOT NULL,
DATE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (3) NOT NULL,
VARCHAR (45) NOT NULL,
DOUBLE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
The dates are either the same as today or left empty; the double is always the same number (no clue who designed this table).

The short explanation is that you have an index that is non-incremental within the scope of a single day. Non-incremental indices are generally slower to insert/update because they will more often require rebalancing the index tree, and to a greater extent, than an incremental index.
To explain this further - assume the following schema:
a (int) | b (varchar)
And the index is (a, b)
Now we insert:
1, 'foo'
2, 'bar'
3, 'baz'
This will be quite fast because the index will append on each insert. Now let's try the following:
100, 'foo'
100, 'bar'
100, 'baz'
This won't be quite as fast, since 'bar' needs to be inserted before 'foo', and 'baz' needs to go between the other two. This often requires the index to rewrite part of the tree to accommodate the new entry, and this 'rebalancing' takes some time. The larger the components involved in the rebalancing (in this case, the subset where a=100), the more time it takes. Note that rebalancing does not necessarily happen on every insert; it just happens more often and more extensively. This is because the tree usually leaves some room within the leaves for expansion. When a leaf runs out of room, it is time to rebalance.
In your case, since your index is primarily based on the current date, you are constantly rebalancing your tree within the scope of the single day. Each day starts a new scope, and as such starts rebalancing within that day's scope. Initially this involves just a bit of rebalancing, but this will grow as your scope of existing entries for the day increases. The cycle starts over as you start a new day, which is the result you are seeing.
That this is happening to the primary key may make matters even worse, since instead of shifting some index pointers around, entire rows of data may need to be shifted to accommodate the new entry. (This last point assumes that MyISAM clustering is performed on the primary key, a point that I haven't gotten clarification on to this day, although anecdotal evidence does seem to support this. For example, see here and here.)
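The difference between appending and inserting in the middle can be illustrated with a small Python sketch. It models the index as a plain sorted list rather than a real B-tree (an assumption made purely for illustration), so it says nothing about actual rebalancing cost; it only counts how many keys land somewhere other than the end. The function name `count_mid_inserts` is made up for this example.

```python
import bisect

def count_mid_inserts(keys):
    """Count keys that land before the current end of a sorted index.

    The index is modelled as a sorted Python list, not a B-tree
    (illustrative assumption only).
    """
    index = []
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        if pos < len(index):
            mid_inserts += 1  # key belongs somewhere before the last entry
        index.insert(pos, key)
    return mid_inserts

incremental = [(1, "foo"), (2, "bar"), (3, "baz")]
same_day = [(100, "foo"), (100, "bar"), (100, "baz")]

print(count_mid_inserts(incremental))       # every key is appended
print(count_mid_inserts(same_day))          # 'bar' and 'baz' land before 'foo'
print(count_mid_inserts(sorted(same_day)))  # pre-sorting turns them into appends
```

If the day's rows can be sorted by the primary key columns before they are sent, most inserts become appends within that day's range. Whether that helps in practice would need to be measured.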

Related

Updating single table frequently vs using another table and CRON to import changes into main table in MySQL?

I have a login log table that is an EXTREMELY busy and large InnoDB table. New rows are inserted all the time, and the table is queried by other parts of the system; it is by far the busiest table in the DB. In this table, logid is the PRIMARY KEY, and it is generated as a random hash by software (not an auto-increment ID). I also want to store some data, such as the number of items viewed.
create table loginlogs
(
logid bigint unsigned primary key,
some_data varchar(255),
viewed_items bigint unsigned
)
viewed_items is a value that will get updated for multiple rows very often (assume thousands of updates / second). The dilemma I am facing now is:
Should I
UPDATE loginlogs SET viewed_items = XXXX WHERE logid = YYYYY
or should I create
create table loginlogs_viewed_items
(
logid bigint unsigned primary key,
viewed_items bigint unsigned,
exported tinyint unsigned default 0
)
and then execute with CRON
UPDATE loginlogs_viewed_items t
INNER JOIN loginlogs l ON l.logid = t.logid
SET
t.exported = 1,
l.viewed_items = t.viewed_items
WHERE
t.exported = 0;
e.g. every hour?
Note that either way the viewed_items counter will be updated MANY TIMES for one logid, it can be even 100 / hour / logid, and there are tons of rows. So whichever table I choose for this, either the main one or the separate one, it will be getting updated quite frequently.
I want to avoid unnecessary locking of loginlogs table and at the same time I do not want to degrade performance by duplicating data in another table.
Hmm, I wonder why you'd want to change log entries and not just add new ones...
But anyway, as you said either way the updates have to happen, whether individually or in bulk.
If you have less busy time windows, updating in bulk during them might have an advantage. Otherwise, the bulk update may have a more significant impact while it runs, in contrast to individual updates, which might "interleave" more with the other operations and make the impact less noticeable.
If the column you need to update is not needed all the time, you could think of having a separate table just for this column. That way queries that just need the other columns may be less affected by the updates.
"Tons of rows" -- To some people, that is "millions". To others, even "billions" is not really big. Please provide some numbers; the answer can be different. Meanwhile, here are some general principles.
I will assume the table is ENGINE=InnoDB.
UPDATEing one row at a time costs about 10 times as much per row as updating 100 rows at a time.
UPDATEing more than 1000 rows in a single statement is problematic. It will lock each row, potentially leading to delays in other statements and maybe even deadlocks.
Having a 'random' PRIMARY KEY (as opposed to AUTO_INCREMENT or something roughly chronologically ordered) is very costly when the table is bigger than the buffer_pool. How much RAM do you have?
"the table is queried by other parts of the system" -- by the random PK? One row at a time? How frequently?
Please elaborate on how exported works. For example, does it get reset to 0 by something else?
Is there a single client doing all the work? Or are there multiple servers throwing data and queries at the table? (Different techniques are needed.)
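The batching advice above can be sketched on the client side. This is a minimal Python illustration, assuming each update carries an absolute viewed_items value (as in the question's UPDATE ... SET viewed_items = XXXX), so only the latest value per logid needs to be written. The helper names `coalesce_updates` and `chunked` are hypothetical, and the batch size of 1000 is taken from the limit mentioned above.

```python
from itertools import islice

def coalesce_updates(events):
    """Keep only the latest viewed_items value seen for each logid."""
    latest = {}
    for logid, viewed_items in events:
        latest[logid] = viewed_items
    return latest

def chunked(items, size):
    """Yield lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Five raw updates collapse into two rows to write.
events = [(11, 1), (42, 5), (11, 2), (11, 3), (42, 6)]
pending = coalesce_updates(events)
for batch in chunked(sorted(pending.items()), 1000):
    # Each batch would become one multi-row statement in one transaction.
    print(len(batch), batch)
```

Coalescing matters here because the question expects up to 100 updates per hour for the same logid; collapsing them before writing reduces the number of rows touched, not just the number of statements.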

Improving MySQL Query Speeds - 150,000+ Rows Returned Slows Query

Hi, I currently have a query that takes 11 seconds to run. A report displayed on a website runs 4 similar queries, each of which takes about 11 seconds. I don't really want the customer to wait almost a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call an API to get the data I need. These all start at once, but the queries run one after another. If there were a way to run these queries at once (in parallel) so that the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up. Below are pictures of the table indexes and the EXPLAIN output for this query.
I think the above query is using the relevant indexes for the WHERE conditions.
If there is anything you can think of to speed this query up, please let me know. I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds at most. If I am wrong about the AJAX issue, please let me know, as this would also fix my issue.
" EDIT "
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7 seconds, but when I add one more day to that range (5th - 10th) and it returns over 150,000 rows, the query time is 13 seconds. I have run loads of different ranges and have come to the conclusion that returning more than 150,000 rows has a huge effect on the query time.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: This ALTER can take a while on a table of this size.
And then force it when you query:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When you have a partitioned table, you can benefit from selecting a particular PARTITION explicitly. In our case, since the table is partitioned by venue_id, just name the partition:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
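Here p46 is the letter p followed by the partition number. With PARTITION BY HASH (venue_id) and 100 partitions, that number is venue_id modulo 100, which is 46 for venue_id = 46.
Another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, but only if no other venue_id hashes to the same partition (146, 246, and so on would also land in p46). If such ids can exist, keep the condition.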
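As a quick check on the partition name, here is a small Python sketch of how a row maps to a partition under PARTITION BY HASH (venue_id) with 100 partitions. It relies on MySQL's HASH partitioning using the expression modulo the partition count and on the default partition names p0, p1, and so on; the function name `hash_partition_name` is made up, and non-negative venue ids are assumed.

```python
def hash_partition_name(venue_id, num_partitions=100):
    """Default partition name for PARTITION BY HASH (venue_id).

    Assumes non-negative ids and MySQL's default names p0, p1, ...
    """
    return f"p{venue_id % num_partitions}"

print(hash_partition_name(46))   # the venue from the question
print(hash_partition_name(146))  # a different venue, same partition
```

Because two different venue ids can share a partition, naming the partition is only equivalent to filtering on venue_id when every venue_id in the table is below the partition count.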
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index on one column (or a combination of columns) and by making sure that the first selector narrows the set down the most (at least for integers; for strings you need another tactic).
Sometimes a query is simply slow. When you have a lot of data (and/or not enough resources), you can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x 4 to your visitor. You can aggregate the data, e.g. hourly or every few minutes, and select from that much smaller table.
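To make the summary-table idea concrete, here is a minimal Python sketch of the aggregation such a table would hold. It assumes the report ultimately needs per-day, per-zone counts of unique devices and repeat devices, which the question does not state; the function name `summarize` and the sample rows are invented for illustration.

```python
from collections import defaultdict

def summarize(rows):
    """Collapse (day_epoch, zone_id, device_uuid, is_repeat) rows into
    per-day, per-zone counts of unique devices and repeat devices."""
    devices = defaultdict(set)
    repeats = defaultdict(set)
    for day_epoch, zone_id, device_uuid, is_repeat in rows:
        key = (day_epoch, zone_id)
        devices[key].add(device_uuid)
        if is_repeat:
            repeats[key].add(device_uuid)
    return {key: (len(devices[key]), len(repeats[key])) for key in devices}

# Invented sample data: device "a" is seen in two different hours.
rows = [
    (1552435200, 102, b"a", 0),
    (1552435200, 102, b"a", 0),
    (1552435200, 102, b"b", 1),
    (1552435200, 105, b"a", 0),
    (1552521600, 102, b"c", 1),
]
for key, counts in sorted(summarize(rows).items()):
    print(key, counts)
```

The report would then read a handful of summary rows per day and zone instead of 150,000 detail rows.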
Off-topic: putting an index on everything only slows you down when inserting/updating/deleting. Index as few columns as possible, just the ones you actually filter on (e.g. those used in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Is there some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
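The "over 1GB" figure for the hour column can be checked with a little arithmetic. This Python sketch assumes roughly 450 million rows (rounded from the AUTO_INCREMENT value, which is only an upper bound on the row count) and treats the saving in INDEX(hour) as equal to the saving in the data, which is a rough approximation.

```python
ROWS = 450_000_000   # "450M rows", rounded from the AUTO_INCREMENT value
INT_BYTES = 4        # any INT, regardless of display width
TINYINT_BYTES = 1

# Bytes saved by storing `hour` as TINYINT instead of INT, per copy of the column.
saved_per_copy = ROWS * (INT_BYTES - TINYINT_BYTES)

print(f"{saved_per_copy / 1e9:.2f} GB saved in the data")
print(f"{2 * saved_per_copy / 1e9:.2f} GB including INDEX(hour), roughly")
```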
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch, -- this order, as discussed above;
id) -- to make sure that the entire PK is unique.
INDEX(id) -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.

split table performance in mysql

Hi everyone. Here is a problem with my MySQL server.
I have a table with about 40,000,000 rows and 10 columns.
Its size is about 4GB, and the engine is InnoDB.
It is a master database, and it only executes one kind of SQL statement, like this:
insert into mytable ... on duplicate key update ...
About 99% of the statements execute the update part.
Now the server is becoming slower and slower.
I heard that splitting the table may enhance its performance. I tried it on my personal computer: splitting into 10 tables did not help, and neither did 100. The speed became slower instead. So I wonder, why didn't splitting the table enhance the performance?
Thanks in advance.
more details:
CREATE TABLE my_table (
id BIGINT AUTO_INCREMENT,
user_id BIGINT,
identifier VARCHAR(64),
account_id VARCHAR(64),
top_speed INT UNSIGNED NOT NULL,
total_chars INT UNSIGNED NOT NULL,
total_time INT UNSIGNED NOT NULL,
keystrokes INT UNSIGNED NOT NULL,
avg_speed INT UNSIGNED NOT NULL,
country_code VARCHAR(16),
update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY(id), UNIQUE KEY(user_id)
);
PS:
I also tried different computers with a solid-state drive and a hard disk drive, but that didn't help either.
Splitting up a table is unlikely to help at all. Ditto for PARTITIONing.
Let's count the disk hits. I will skip counting non-leaf nodes in BTrees; they tend to be cached; I will count leaf nodes in the data and indexes; they tend not to be cached.
IODKU does:
Read the index block containing the entry for any UNIQUE keys. In your case, that is probably user_id. Please provide a sample SQL statement. 1 read.
If the user_id entry is found in the index, read the record from the data as indexed by the PK(id) and do the UPDATE, and leave this second block in the buffer_pool for eventual rewrite to disk. 1 read now, 1 write later.
If the record is not found, do INSERT. The index block that needs the new row was already read, so it is ready to have a new entry inserted. Meanwhile, the "last" block in the table (due to id being AUTO_INCREMENT) is probably already cached. Add the new row to it. 0 reads now, 1 write later (UNIQUE). (Rewriting the "last" block is amortized over, say, 100 rows, so I am ignoring it.)
Eventually do the write(s).
Total, assuming essentially all take the UPDATE path: 2 reads and 1 write. Assuming the user_id follows no simple pattern, I will assume that all 3 I/Os are "random".
Let's consider a variation... What if you got rid of id? Do you need id anywhere else? Since you have a UNIQUE key, it could be the PK. That is replace your two indexes with just PRIMARY KEY(user_id). Now the counts are:
1 read
If UPDATE, 0 read, 1 write
If INSERT, 0 read, 0 write
Total: 1 read, 1 write. 2/3 as many as before. Better, but still not great.
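The two tallies above can be compared with a short Python sketch. The per-statement read and write counts come from the answer; the workload of one million statements, all taking the UPDATE path, is a made-up figure for illustration, and `random_ios` is a hypothetical helper.

```python
def random_ios(num_updates, reads_per_update, writes_per_update):
    """Total random I/Os when every statement takes the UPDATE path."""
    return num_updates * (reads_per_update + writes_per_update)

# PRIMARY KEY(id) plus UNIQUE(user_id): index read, data read, data write.
with_id_pk = random_ios(1_000_000, reads_per_update=2, writes_per_update=1)
# PRIMARY KEY(user_id) only: one read, one write.
with_user_id_pk = random_ios(1_000_000, reads_per_update=1, writes_per_update=1)

print(with_id_pk, with_user_id_pk, with_user_id_pk / with_id_pk)
```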
Caching
How much RAM do you have?
What is the value of innodb_buffer_pool_size?
SHOW TABLE STATUS -- What are Data_length and Index_length?
I suspect that the buffer_pool is not big enough and possibly could be raised. If you have more than 4GB of RAM, make it about 70% of RAM.
Others
SSDs should have helped significantly, since you appear to be I/O bound. Can you tell whether you are I/O-bound or CPU-bound?
How many rows are you updating at once? How long does it take? Is it batched, or one at a time? There may be a significant improvement possible here.
Do you really need BIGINT (8 bytes)? INT UNSIGNED is only 4 bytes.
Is a transaction involved?
Is the Master having a problem? The Slave? Both? I don't want to fix the Master in such a way that it messes up the Slave.
Try to split your database across several MySQL instances behind a proxy such as mysql-proxy or haproxy, instead of using one MySQL instance. You may get better performance.

MySQL table setup for stock information

I am collecting about 3 - 6 million lines of stock data per day and storing it in a MySQL database.
All of the data comes from Interactive Brokers. Every piece of information comes with these five fields: Symbol, Date, Time, Value and Type (Type indicating what kind of data I am receiving, such as price, volume, etc.).
Here is my create table statement. idticks is just my unique key, but I am almost never able to use it in queries.
CREATE TABLE `ticks` (
`idticks` int(11) NOT NULL AUTO_INCREMENT,
`symbol` varchar(30) NOT NULL,
`date` int(11) NOT NULL,
`time` int(11) NOT NULL,
`value` double NOT NULL,
`type` double NOT NULL,
KEY `idticks` (`idticks`),
KEY `symbol` (`symbol`),
KEY `date` (`date`),
KEY `idx_ticks_symbol_date` (`symbol`,`date`),
KEY `idx_ticks_type` (`type`),
KEY `idx_ticks_date_type` (`date`,`type`),
KEY `idx_ticks_date_symbol_type` (`date`,`symbol`,`type`),
KEY `idx_ticks_symbol_date_time_type` (`symbol`,`date`,`time`,`type`)
) ENGINE=InnoDB AUTO_INCREMENT=13533258 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (`date`)
PARTITIONS 1 */;
As you can see, I have no idea what I am doing because I just keep on creating indexes to make my queries go faster.
Right now the data is being stored on a rather slow computer for testing purposes so I understand that my queries are not nearly as fast as they could be (I have a 6 core, 64gig of ram, SSD machine arriving tomorrow which should help significantly)
That being said, I am running queries like this one
select time, value from ticks where symbol = "AAPL" AND date = 20150522 and type = 8 order by time asc
The query above, if I do not limit it, returns 12928 records for one of my test days and takes 10.2 seconds if I do it from cleared cache.
I am doing lots of graphing and eventually would like to be able to just query the data as I need to graph it. Right now I haven't noticed a lot of difference in speed between getting part of a day's worth of data and getting the entire day's. It would be cool to have those queries respond fast enough that there is barely any delay when I am moving to the next day/screen/whatever.
Another query I am using, for the usability of a program I am writing to interact with the data, is
String query = "select distinct `date` from ticks where symbol = '" + symbol + "' order by `date` desc";
But most of my need is the ability to pull a certain type of data from a certain day for a certain symbol like my first query.
I've googled all over the place and I think I understand that creating tons of indexes makes the database bigger and slows down the input speed (I get about 300 pieces of information per second on a busy day). Should I just index each column individually?
I am willing to throw more harddrives at things if it means responsive interface.
Basically, my questions relate to the creation/altering of my table. Based on the above query, can you think of anything I could do to make that faster? Or an indexing system that would help me out? Is InnoDB even the right engine? I tried googling this vs MyISAM, and after a couple of hours I still wasn't sure.
Thanks :)
Combine date and time into a DATETIME field
Assuming Price and Volume always come in together, put them together (2 columns) and get rid of type.
Get rid of the AUTO_INCREMENT; change to PRIMARY KEY(symbol, datetime)
Get rid of any indexes that are the left part of some other index.
Once you are using DATETIME, use date ranges to find everything in a single date (if you need such). Do not use DATE(datetime) = '...', performance will be terrible.
Symbol can probably be ascii, not utf8.
Use InnoDB, the clustering of the Primary Key can be beneficial.
Do you expect to collect (and use) more data than will fit in innodb_buffer_pool_size? If so, we need to discuss your SELECTs and look into PARTITIONing.
Make those changes, then come back for more advice/abuse.
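The advice about dropping indexes that are the left part of another index can be checked mechanically. Here is a small Python sketch that applies it to the index list from the question's CREATE TABLE; the function name `redundant_prefix_indexes` is made up, and only strict left prefixes are flagged (exact duplicates are not).

```python
def redundant_prefix_indexes(indexes):
    """Return names of indexes whose column list is a strict left prefix
    of another index's column list."""
    redundant = []
    for name, cols in indexes.items():
        for other_name, other_cols in indexes.items():
            if (name != other_name
                    and len(cols) < len(other_cols)
                    and other_cols[:len(cols)] == cols):
                redundant.append(name)
                break
    return redundant

# Index definitions copied from the ticks table in the question.
ticks_indexes = {
    "idticks": ("idticks",),
    "symbol": ("symbol",),
    "date": ("date",),
    "idx_ticks_symbol_date": ("symbol", "date"),
    "idx_ticks_type": ("type",),
    "idx_ticks_date_type": ("date", "type"),
    "idx_ticks_date_symbol_type": ("date", "symbol", "type"),
    "idx_ticks_symbol_date_time_type": ("symbol", "date", "time", "type"),
}

print(redundant_prefix_indexes(ticks_indexes))
```

For this table the sketch flags symbol, date and idx_ticks_symbol_date. Note that idx_ticks_date_type is not flagged, because (date, type) is not a left prefix of (date, symbol, type).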
You're creating a historical database, so MyISAM would work as well as InnoDB. InnoDB is a transactional storage engine, and is better suited for relational databases with multiple tables that must remain synchronized.
Your Stock table looks like this.
Stock
-----
Stock ID (idticks)
Symbol
Date
Time
Value
Type
It would be better if you combine the date and time into a time stamp column, and unpack the types like this.
Stock
-----
Stock ID
Symbol
Time Stamp
Volume
Open
Close
Bid
Ask
...
This makes it easier for the database to return rows for a query on a particular type, like the close value.
As far as indexes, you can create as many indexes as you want. You're adding (inserting) information, so the increased time to add information is offset by the decreased time to query the information.
I'd have a primary index on Stock ID, and a unique index on Symbol and Time Stamp descending. You could also have indexes on the values you query most often, like Close.

Extra column ruins MySQL performance

I have a warehouse table that looks like this:
CREATE TABLE Warehouse (
id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
eventId BIGINT(20) UNSIGNED NOT NULL,
groupId BIGINT(20) NOT NULL,
activityId BIGINT(20) UNSIGNED NOT NULL,
... many more ids,
"txtProperty1" VARCHAR(255),
"txtProperty2" VARCHAR(255),
"txtProperty3" VARCHAR(255),
"txtProperty4" VARCHAR(255),
"txtProperty5" VARCHAR(255),
... many more of these
PRIMARY KEY ("id"),
KEY "WInvestmentDetail_idx01" ("groupId"),
... several more indices
) ENGINE=INNODB;
Now, the following query spends about 0.8s in query time and 0.2s in fetch time, for a total of about one second. The query returns ~67,000 rows.
SELECT eventId
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
Adding more ids to the select clause doesn't really change the performance at all.
SELECT eventId, groupId, activityId, insertDate
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
However, adding a "property" column does change it to 0.6s fetch time and 1.8s query time.
SELECT eventId, txtProperty1
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
Now to really blow your socks off. Instead of txtProperty1, using txtProperty2 changes the times to 0.8s fetch, 24s query!
SELECT eventId, txtProperty2
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
The two columns are pretty much identical in the type of data they hold: mostly non-null, and neither are indexed (not that that should make a difference anyways). To be sure the table itself is healthy I ran analyze/optimize against it.
This is really mystifying to me. I can see why adding columns to the select clause can slightly increase fetch time, but it should not change query time, especially not significantly. I would appreciate any ideas as to what is causing this slowdown.
EDIT - More data points
SELECT * actually outperforms txtProperty2 - 0.8s query, 8.4s fetch. Too bad I can't use it because the fetch time is (expectedly) too long.
The MySQL documentation for the InnoDB engine suggests that if your varchar data doesn't fit on the page (i.e. the node of the b-tree structure), then the information will be referenced on overflow pages. So on your wide Warehouse table, it may be that txtProperty1 is on-page and txtProperty2 is off-page, thus requiring additional I/O to retrieve.
Not too sure as to why the SELECT * is better; it may be able to take advantage of reading data sequentially, rather than picking its way around the disk.
I'll admit that this is a bit of a guess, but I'll give it a shot.
You have id -- the first field -- as the primary key. I'm not 100% sure how MySQL does clustered indexes as far as lookups, but it is reasonable to suspect that, for any given ID, there is some "pointer" to the record with that ID.
It is relatively easy to find the beginnings of fields when all prior fields have fixed width. All your BIGINT(20) fields have a defined size that makes it easy for the db engine to find the field given a pointer to the start of the record; it's a simple calculation. Likewise, the start of the first VARCHAR(255) field is easy to find. After that, though, because the fields are VARCHAR fields, the db engine must take the data into account to find the start of the next field, which is much slower than simply calculating where that field should be. So, for any fields after txtProperty1, you will have this issue.
What would happen if you changed all the VARCHAR(255) fields to CHAR(255) fields? It is very possible that your query will be much faster, albeit at the cost of using the maximum storage for each CHAR(255) field regardless of the data it actually contains.
Fragmented tablespace? Try a null alter table:
ALTER TABLE tbl_name ENGINE=INNODB
Since I am a SQL Server user and not a MySQL guy, this is a long shot. In SQL Server the clustered index is the table. All the table data is stored in the clustered index. Additional indexes store redundant copies of the indexed data sorted in the appropriate sort order.
My reasoning is this. As you add more and more data to the query, the fetch time remains negligible. I presume this is because you are fetching all the data from the clustered index during the query phase and there is effectively nothing left to do during the fetch phase.
The reason the SELECT * works the way it does is because your table is so wide. As long as you are just requesting the key and one or two additional columns, it is best to just get everything during the query. Once you ask for everything, it becomes cheaper to segregate the fetching between the two phases. I am guessing that if you add columns to your query one at a time, you will discover the boundary where the query analyzer switches from doing all of the fetching in the query phase to doing most of the fetching in the fetching phase.
You should post the explain plans of the two queries so we can see what they are.
My guess is that the fast one is using a "Covering index", and the slow one isn't.
This means that the slow one must do 67,000 primary key lookups, which will be very inefficient if the table isn't all in memory (typically requiring 67k IO operations if the table is arbitrarily large and each row in its own page).
In MySQL, EXPLAIN will show "Using index" if a covering index is being used.
I had a similar issue, and creating additional right-sized indexes helped significantly. Partitioning the database tables and tuning the database's RAM settings also helps.
i.e. add an index to the table for (eventId, txtProperty2)
Note: I noticed that you said "Warehouse". Keep in mind that with a huge database table, some additional delay is to be expected with each added condition.