I have a warehouse table that looks like this:
CREATE TABLE `Warehouse` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`eventId` BIGINT(20) UNSIGNED NOT NULL,
`groupId` BIGINT(20) NOT NULL,
`activityId` BIGINT(20) UNSIGNED NOT NULL,
... many more ids,
`txtProperty1` VARCHAR(255),
`txtProperty2` VARCHAR(255),
`txtProperty3` VARCHAR(255),
`txtProperty4` VARCHAR(255),
`txtProperty5` VARCHAR(255),
... many more of these,
PRIMARY KEY (`id`),
KEY `WInvestmentDetail_idx01` (`groupId`),
... several more indices
) ENGINE=InnoDB;
Now, the following query spends about 0.8s in query time and 0.2s in fetch time, for a total of about one second. The query returns ~67,000 rows.
SELECT eventId
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
Adding more ids to the select clause doesn't really change the performance at all.
SELECT eventId, groupId, activityId, insertDate
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
However, adding a "property" column does change it to 0.6s fetch time and 1.8s query time.
SELECT eventId, txtProperty1
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
Now to really blow your socks off. Instead of txtProperty1, using txtProperty2 changes the times to 0.8s fetch, 24s query!
SELECT eventId, txtProperty2
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
The two columns are pretty much identical in the type of data they hold: mostly non-null, and neither is indexed (not that that should make a difference anyway). To be sure the table itself is healthy, I ran ANALYZE/OPTIMIZE against it.
This is really mystifying to me. I can see why adding columns to the select clause alone could slightly increase fetch time, but it should not change query time, especially not significantly. I would appreciate any ideas as to what is causing this slowdown.
EDIT - More data points
SELECT * actually outperforms txtProperty2 - 0.8s query, 8.4s fetch. Too bad I can't use it because the fetch time is (expectedly) too long.
The MySQL documentation for the InnoDB engine suggests that if your varchar data doesn't fit on the page (i.e. the node of the b-tree structure), then the information will be referenced on overflow pages. So on your wide Warehouse table, it may be that txtProperty1 is on-page and txtProperty2 is off-page, thus requiring additional I/O to retrieve.
Not too sure as to why the SELECT * is better; it may be able to take advantage of reading data sequentially, rather than picking its way around the disk.
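If you want to check whether long values are being stored off-page, the row format is a starting point. This is only a hedged suggestion based on the overflow-page behaviour described above, not a guaranteed fix:
-- COMPACT/REDUNDANT keep a 768-byte prefix of an overflowing VARCHAR in-page;
-- DYNAMIC/COMPRESSED move the whole value off-page and keep only a 20-byte pointer.
SHOW TABLE STATUS LIKE 'Warehouse';
-- Worth trying on a copy of the table (this rebuilds the table):
ALTER TABLE Warehouse ROW_FORMAT=DYNAMIC;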
I'll admit that this is a bit of a guess, but I'll give it a shot.
You have id -- the first field -- as the primary key. I'm not 100% sure how MySQL does clustered indexes as far as lookups, but it is reasonable to suspect that, for any given ID, there is some "pointer" to the record with that ID.
It is relatively easy to find the beginnings of fields when all prior fields have fixed width. All your BIGINT(20) fields have a defined size that makes it easy for the db engine to find the field given a pointer to the start of the record; it's a simple calculation. Likewise, the start of the first VARCHAR(255) field is easy to find. After that, though, because the fields are VARCHAR fields, the db engine must take the data into account to find the start of the next field, which is much slower than simply calculating where that field should be. So, for any fields after txtProperty1, you will have this issue.
What would happen if you changed all the VARCHAR(255) fields to CHAR(255) fields? It is very possible that your query will be much faster, albeit at the cost of using the maximum storage for each CHAR(255) field regardless of the data it actually contains.
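If you want to test that idea, converting just one of the columns on a copy of the table would look something like this (purely an experiment, with the storage cost noted above):
ALTER TABLE Warehouse MODIFY txtProperty2 CHAR(255);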
Fragmented tablespace? Try a null alter table:
ALTER TABLE tbl_name ENGINE=INNODB
Since I am a SQL Server user and not a MySQL guy, this is a long shot. In SQL Server the clustered index is the table. All the table data is stored in the clustered index. Additional indexes store redundant copies of the indexed data sorted in the appropriate sort order.
My reasoning is this. As you add more and more data to the query, the fetch time remains negligible. I presume this is because you are fetching all the data from the clustered index during the query phase and there is effectively nothing left to do during the fetch phase.
The reason the SELECT * works the way it does is because your table is so wide. As long as you are just requesting the key and one or two additional columns, it is best to just get everything during the query. Once you ask for everything, it becomes cheaper to segregate the fetching between the two phases. I am guessing that if you add columns to your query one at a time, you will discover the boundary where the query analyzer switches from doing all of the fetching in the query phase to doing most of the fetching in the fetching phase.
You should post the explain plans of the two queries so we can see what they are.
My guess is that the fast one is using a "Covering index", and the slow one isn't.
This means that the slow one must do 67,000 primary key lookups, which will be very inefficient if the table isn't all in memory (typically requiring 67k IO operations if the table is arbitrarily large and each row in its own page).
In MySQL, EXPLAIN will show "Using index" if a covering index is being used.
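As an illustration only (the index name is made up, and the column order is an assumption based on the WHERE / ORDER BY / SELECT columns shown above), a covering index for the first query could look like this:
ALTER TABLE Warehouse
ADD INDEX idx_scen_acct_date_event (scenarioId, accountId, insertDate, eventId);
EXPLAIN SELECT eventId
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
AND scenarioId IS NULL
AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;
-- "Using index" in the Extra column confirms the index covers the query.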
I had a similar issue, and creating additional right-sized indexes helped significantly. Partitioning the table and tuning the database's RAM can also help.
e.g. add an index to the table on (eventId, txtProperty2)
Note: I noticed that you said "Warehouse". Keep in mind that with a huge warehouse-style table, some additional delay is to be expected with each extra condition you add.
Related
Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each taking 11 seconds. I don't really want the customer having to wait a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call APIs to get the data I need, and these all start at once, but the queries run one after another. If there were a way to get these queries to all run at once (in parallel) so the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up, so below are pictures of the table indexes and the EXPLAIN output for this query.
I think the above query is using the relevant indexes in the WHERE conditions.
If there is anything you can think of to speed this query up, please let me know. I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds maximum. If I am wrong about the AJAX issue, please let me know, as that would also fix my issue.
EDIT
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7 seconds, but when I add one more day to that range (5th - 10th), which returns over 150,000 rows, the query time is 13 seconds. I have run loads of different ranges and have come to the conclusion that if the number of rows returned is over 150,000, it has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: this ALTER can take a while on a large table.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When you have a partitioned table, you can benefit from forcing a particular PARTITION. In our case, since the table is partitioned by venue_id, just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
where p46 is the concatenation of p and venue_id = 46.
And another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition.
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index (single-column or composite) and by making sure that the first selector narrows the result down the most (at least for integers; in the case of strings you need another tactic).
Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources) you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x 4 to your visitor. You can aggregate it, e.g., hourly or every few minutes, and select from that much smaller table (see the sketch below).
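A hedged sketch of that summary-table idea, with made-up table and column names, could look like this:
CREATE TABLE zone_daily_summary (
day_epoch INT NOT NULL,
venue_id INT NOT NULL,
zone_id INT NOT NULL,
device_count INT NOT NULL,
repeat_count INT NOT NULL,
PRIMARY KEY (venue_id, zone_id, day_epoch)
) ENGINE=InnoDB;
-- Refresh it periodically (e.g. hourly), so the report reads a few hundred
-- summary rows instead of 150,000+ detail rows:
INSERT INTO zone_daily_summary (day_epoch, venue_id, zone_id, device_count, repeat_count)
SELECT day_epoch, venue_id, zone_id, COUNT(*), SUM(is_repeat)
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200 AND day_epoch < 1553040000
GROUP BY day_epoch, venue_id, zone_id
ON DUPLICATE KEY UPDATE device_count = VALUES(device_count),
repeat_count = VALUES(repeat_count);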
Offtopic: Putting an index on everything only slows you down when inserting/updating/deleting. Index the least amount of columns, just the ones you actually filter on (e.g. use in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data
A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1 GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
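For example, a hedged sketch of those shrinks (verify the real value ranges first, and note that an ALTER on a 450M-row table is itself a big operation):
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
MODIFY hour TINYINT UNSIGNED NOT NULL,
MODIFY day_of_week TINYINT UNSIGNED NOT NULL COMMENT 'day of week, monday = 1',
MODIFY zone_id SMALLINT UNSIGNED NOT NULL;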
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above
            id),                           -- to make sure that the entire PK is unique
INDEX(id)                                  -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
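To make Plan A concrete, here is a hedged two-step sketch. The partition names and boundary values are placeholders only, and both ALTERs rebuild the 450M-row table, so this is not a quick operation:
-- Step 1: new PRIMARY KEY (must contain day_epoch so it can be the partition
-- column later), plus an index on id to keep AUTO_INCREMENT happy.
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
DROP PRIMARY KEY,
ADD PRIMARY KEY (venue_id, zone_id, day_epoch, id),
ADD KEY (id);
-- Step 2: repartition by date range (weekly boundaries shown only as an example).
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION BY RANGE (day_epoch) (
PARTITION p20190313 VALUES LESS THAN (1552435200),
PARTITION p20190320 VALUES LESS THAN (1553040000),
PARTITION pmax VALUES LESS THAN MAXVALUE
);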
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.
I am having difficulties in optimizing this SQL statement in MySQL. I have two tables that are populated independently and so the times logged in each table's column will not be the same. What I want is a single table (view) that lists all the records in the sensor_history with the current process information that was present at the sensor's measurement_time. If a process log time was not present, I can live with NULLs in the process fields in the resulting view for that particular record.
What I have here works but it is brute force and woefully inefficient. There are about 500k records in the sensor_history table and about 20k records in the process_history table. I have tried getting my head around different join methods but I run into syntax issues or bad results. I have tried some online optimizers without success and so I am hoping someone here can point me in the right direction.
For simplicity, I have removed the foreign key relations to other tables. There are no indices in use but feel free to suggest any that may help. Here are the basics:
CREATE TABLE `sensor_history` (
`measurement_time_utc` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`sensor_measurement_x` double NOT NULL,
`sensor_measurement_y` double NOT NULL,
`sensor_measurement_z` double NOT NULL,
`sensor_quality` int(11) NOT NULL
);
CREATE TABLE `process_history` (
`log_time_utc` int(11) NOT NULL,
`process_id` int(11) NOT NULL,
`process_speed` double NOT NULL,
`process_load` int(11) NOT NULL
);
CREATE VIEW `rollup` AS SELECT
`sensor_history`.`measurement_time_utc`,
`sensor_history`.`sensor_id`,
`sensor_history`.`sensor_measurement_x`,
`sensor_history`.`sensor_measurement_y`,
`sensor_history`.`sensor_measurement_z`,
`sensor_history`.`sensor_quality`,
(SELECT `process_history`.`process_id` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_id`,
(SELECT `process_history`.`process_speed` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_speed`,
(SELECT `process_history`.`process_load` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_load`
FROM `sensor_history`;
How can I make a more efficient rollup view? Thanks in advance.
Views are really hard to optimize in MySQL. Your best hope is for an index on:
process_history(log_time_utc, process_id, process_speed)
The last two columns are included so the index covers the query and doesn't need to refer to the data pages.
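A sketch of creating that index (the index name is made up; appending process_load as well would let the same index cover the third correlated subquery too):
CREATE INDEX idx_ph_log_time_covering
ON process_history (log_time_utc, process_id, process_speed, process_load);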
While you are trying to figure out what the Analysts really need, let's do some improvements that are easier to do now than later.
DOUBLE takes 8 bytes and delivers about 16 significant digits. That is gross overkill for every sensor I have heard of. Consider the 4-byte FLOAT, which gives you about 7 significant digits.
(Where am I going with this? Captured "sensor" data keeps coming, and it eventually fills up the disk and that makes things slow. So, let's shrink things soon.)
INT is 4 bytes and has a range of +/- 2 billion. Are you expecting that many sensors? How about a 1-byte TINYINT UNSIGNED with a range of 0..255? Or a 2-byte SMALLINT UNSIGNED (range 0..65,535)? Ditto for any other ids.
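For illustration only (the exact types depend on the real ranges of your sensor ids and measurements):
ALTER TABLE sensor_history
MODIFY sensor_measurement_x FLOAT NOT NULL,
MODIFY sensor_measurement_y FLOAT NOT NULL,
MODIFY sensor_measurement_z FLOAT NOT NULL,
MODIFY sensor_id SMALLINT UNSIGNED NOT NULL;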
Or... Do you really need to save all the data? Maybe day-old data can be summarized down to hourly min, max, avg, etc? And month-old data is needed only to a day's resolution?
We have lots to discuss once your analysts explain what they do want. Then you need to read between the lines to see what they will want. (I can help there, too.)
I am collecting about 3-6 million rows of stock data per day and storing it in a MySQL database.
All of the data comes from Interactive Brokers; every piece of information arrives with these five fields: Symbol, Date, Time, Value and Type (Type being information on what kind of data I am receiving, such as price, volume, etc.).
Here is my CREATE TABLE statement. idticks is just my unique key, but I am almost never able to use it in queries.
CREATE TABLE `ticks` (
`idticks` int(11) NOT NULL AUTO_INCREMENT,
`symbol` varchar(30) NOT NULL,
`date` int(11) NOT NULL,
`time` int(11) NOT NULL,
`value` double NOT NULL,
`type` double NOT NULL,
KEY `idticks` (`idticks`),
KEY `symbol` (`symbol`),
KEY `date` (`date`),
KEY `idx_ticks_symbol_date` (`symbol`,`date`),
KEY `idx_ticks_type` (`type`),
KEY `idx_ticks_date_type` (`date`,`type`),
KEY `idx_ticks_date_symbol_type` (`date`,`symbol`,`type`),
KEY `idx_ticks_symbol_date_time_type` (`symbol`,`date`,`time`,`type`)
) ENGINE=InnoDB AUTO_INCREMENT=13533258 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (`date`)
PARTITIONS 1 */;
As you can see, I have no idea what I am doing because I just keep on creating indexes to make my queries go faster.
Right now the data is being stored on a rather slow computer for testing purposes, so I understand that my queries are not nearly as fast as they could be (I have a 6-core, 64 GB RAM, SSD machine arriving tomorrow, which should help significantly).
That being said, I am running queries like this one
select time, value from ticks where symbol = "AAPL" AND date = 20150522 and type = 8 order by time asc
The query above, if I do not limit it, returns 12,928 records for one of my test days and takes 10.2 seconds if I run it from a cleared cache.
I am doing lots of graphing and eventually would like to be able to query the data just as I need it to graph. Right now I haven't noticed much difference in speed between getting part of a day's worth of data and getting the entire day's. It would be cool to have those queries respond fast enough that there is barely any delay when I'm moving to the next day/screen or whatever.
Another query I am using for usability of a program I am writing to interact with the data includes:
String query = "select distinct `date` from ticks where symbol = '" + symbol + "' order by `date` desc";
But most of my need is the ability to pull a certain type of data from a certain day for a certain symbol like my first query.
I've googled all over the place and I think I understand that creating tons of indexes makes the database bigger and slows down the insert speed (I get about 300 pieces of information per second on a busy day). Should I just index each column individually?
I am willing to throw more hard drives at things if it means a responsive interface.
Basically, my questions relate to the creation/altering of my table. Based on the above query, can you think of anything I could do to make that faster? Or an indexing scheme that would help me out? Is InnoDB even the right engine? I tried googling this vs. MyISAM and after a couple of hours of that, I still wasn't sure.
Thanks :)
Combine date and time into a DATETIME field
Assuming Price and Volume always come in together, put them together (2 columns) and get rid of type (see the sketch after this list).
Get rid of the AUTO_INCREMENT; change to PRIMARY KEY(symbol, datetime)
Get rid of any indexes that are the left part of some other index.
Once you are using DATETIME, use date ranges to find everything in a single date (if you need such). Do not use DATE(datetime) = '...', performance will be terrible.
Symbol can probably be ascii, not utf8.
Use InnoDB, the clustering of the Primary Key can be beneficial.
Do you expect to collect (and use) more data than will fit in innodb_buffer_pool_size? If so, we need to discuss your SELECTs and look into PARTITIONing.
Make those changes, then come back for more advice/abuse.
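Putting those suggestions together, a hedged sketch of the reworked table (the price/volume column names are assumptions, since the exact set of type values isn't shown; if two ticks can arrive in the same second, the key would need an extra tie-breaker):
CREATE TABLE ticks (
symbol VARCHAR(30) CHARACTER SET ascii NOT NULL,
dt DATETIME NOT NULL,          -- combined date + time
price DOUBLE NOT NULL,         -- assumes price and volume arrive together
volume INT UNSIGNED NOT NULL,
PRIMARY KEY (symbol, dt)
) ENGINE=InnoDB;
-- A single-day query then becomes one clustered-index range scan:
SELECT dt, price, volume
FROM ticks
WHERE symbol = 'AAPL'
AND dt >= '2015-05-22' AND dt < '2015-05-23';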
You're creating a historical database, so MyISAM would work as well as InnoDB. InnoDB is a transactional storage engine, and is better suited for relational databases with multiple tables that must remain synchronized.
Your Stock table looks like this.
Stock
-----
Stock ID (idticks)
Symbol
Date
Time
Value
Type
It would be better if you combined the date and time into a time stamp column and unpacked the types, like this.
Stock
-----
Stock ID
Symbol
Time Stamp
Volume
Open
Close
Bid
Ask
...
This makes it easier for the database to return rows for a query on a particular type, like the close value.
As far as indexes, you can create as many indexes as you want. You're adding (inserting) information, so the increased time to add information is offset by the decreased time to query the information.
I'd have a primary index on Stock ID, and a unique index on Symbol and Time Stamp descending. You could also have indexes on the values you query most often, like Close.
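A hedged sketch of those indexes on the unpivoted table (the table and column names are taken from the layout above and are otherwise assumptions; also note that before MySQL 8.0 the DESC keyword in an index definition is parsed but ignored):
ALTER TABLE stock
ADD PRIMARY KEY (stock_id),
ADD UNIQUE KEY uq_symbol_ts (symbol, time_stamp DESC),
ADD KEY idx_close (`close`);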
I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that it's often best to store this as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing it as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY index is on (`rowNum`, `colNum`), note the order, so it will be inefficient when it comes to selecting a whole column).
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add when adding a column).
Editing should be very fast, as the indexes wouldn't change and the value is of fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (later edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of caching (the hot data should actually already be in the InnoDB buffer pool).
A better idea, if the matrix is full of zeroes, is not to store the zero values and to treat 0 as the "default" in the client code. That way you may lighten up the storage (if needed; MySQL should actually be pretty fast responding to queries even on such a 5-billion-row table).
Another thing, if storage is an issue, is to use a single ID to identify both row and col: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000*col + row), though it needs BIGINT, so it may not be better than 2 columns.
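To illustrate that encoding (plain arithmetic with arbitrary row/col values; DIV and MOD recover the pair):
-- encode: id = 450000 * col + row; 450000 * 11000 is about 4.95 billion,
-- which overflows INT UNSIGNED (max about 4.29 billion), hence the need for BIGINT.
SELECT 450000 * 17 + 12345 AS id,
(450000 * 17 + 12345) DIV 450000 AS colNum,
(450000 * 17 + 12345) MOD 450000 AS rowNum;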
Don't do what's described below: don't reinvent the MySQL cache
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the SELECT result from the matrix table into cachedPartialMatrix
If the cached matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth row or Kth column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from cache entries because its `lastDate` is old enough, you may break the cache for some other entries) or to regularly clear the cache and only leave the recently selected values.
I am having a performance issue when inserting some data in a mysql table.
The table has a bunch of columns, let's say DATE,A,B,C,D,E,F where DATE,A,B,C,D,E is the primary key. Every day, I insert 70k rows in that table (with a different date), and this table contains 18 million rows now. The method I use to insert the rows is just sending 70k INSERT queries.
The problem I am having is that the queries have started to take a lot more time than they used to, going from a few minutes to a few hours. I profiled the inserts and this is the chart I got:
Speed of each insert (in sec) vs. Number of insert for that day:
A few strange facts:
Most queries take less than 2 ms to execute
The time taken by the slow queries increases linearly with the number of rows in the table for that date
This behavior only happens at night, after a bunch of processes have run on the database. Inserting during the day is fast, as are weekends
The overall speed doesn't depend on what else is running on the database; in fact, nothing else is running on the database when this happens
There is nothing in the query itself that explains whether it is fast or slow; the fast ones are very similar to the slow ones, and from one day to another they are not the same set.
The behavior does not change from one day to the next.
Any idea what could cause this?
** Edit ** the columns in the index are in the following order:
DATE NOT NULL,
DATE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (3) NOT NULL,
VARCHAR (45) NOT NULL,
DOUBLE NOT NULL,
VARCHAR (10) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
VARCHAR (45) NOT NULL,
The dates are either the same as today or left empty, and the double is always the same number (no clue who designed this table).
The short explanation is that you have an index that is non-incremental within the scope of a single day. Non-incremental indices are generally slower to insert/update because they will more often require rebalancing the index tree, and to a greater extent, than an incremental index.
To explain this further - assume the following schema:
a (int) | b (varchar)
And the index is (a, b)
Now we insert:
1, 'foo'
2, 'bar'
3, 'baz'
This will be quite fast because the index will append on each insert. Now lets try the following:
100, 'foo'
100, 'bar'
100, 'baz'
This won't be quite as fast, since 'bar' needs to be inserted before 'foo', and 'baz' needs to be inserted between the other two. This often requires the index to rewrite the tree to accommodate, and this 'rebalancing' act takes some time. The larger the components involved in the rebalancing (in this case, the subset where a=100), the more time it will take. Note that this rebalancing activity occurs more often and more extensively, but not necessarily on each insert. This is because the tree will usually leave some room within the leaves for expansion. When the leaves run out of room, it knows that it's time to rebalance.
In your case, since your index is primarily based on the current date, you are constantly rebalancing your tree within the scope of a single day. Each day starts a new scope, and as such starts rebalancing within that day's scope. Initially this involves just a bit of rebalancing, but it grows as the day's set of existing entries grows. The cycle starts over as you start a new day, which is the result you are seeing.
That this is happening to the primary key may make matters even worse, since instead of shifting some index pointers around, entire rows of data may need to be shifted to accommodate the new entry. (This last point assumes that MyISAM clustering is performed on the primary key, a point that I haven't gotten clarification on to this day, although anecdotal evidence does seem to support it.)
That this is happening to the primary key may make matters even worse, since instead of shifting some index pointers around, entire rows of data may need to be shifted to accommodate the new entry. (This last point assumes that MyISAM clustering is performed on the primary key, a point that I haven't gotten clarification on to this day, although anectodal evidence does seem to support this. For example, see here and here.)