I have a table "data" which holds around 100,000,000 records.
I have added a new column to it "batch_id" (Integer).
On the application layer, I'm updating the batch_id in batches of 10,000 records for each of the 100,000,000 records (the batch_id is always the same for 10k).
I'm doing something like this (application layer pseudo code):
loop {
$batch_id = $batch_id + 1;
mysql.query("UPDATE data SET batch_id='$batch_id' WHERE batch_id IS NULL LIMIT 10000");
}
I have an index on the batch_id column.
In the beginning, this update statement took ~30 seconds. I'm now halfway through the Table and it's getting slower and slower. At the moment the same statement takes around 10 minutes(!). It reached a point where this is no longer feasible as it would take over a month to update the whole table at the current speed.
What could I do to speed it up, and why is MySQL Getting slower towards the end of the table?
Could an index on the primary key help?
Is the primary key automatically indexed in MySQL? The answer is Yes
So instead one index for batch_id will help.
The problem is without index the engine do a full table scan. At first is easy find 10k with null values, but when more and more records are updated the engine have to scan much more to find those nulls.
But should be easier create batch_id as an autonumeric column
OTHER OPTION: Create a new table and then add the index and replace old table.
CREATE newTable as
SELECT IF(#newID := #newID + 1,
#newID DIV 10000,
#newID DIV 10000) as batch_id,
<other fields>
FROM YourTable
CROSS JOIN (SELECT #newID :=0 ) as v
Insert auto increment primary key to existing table
Do you have a monotonically increasing id in the table? And all rows for a "batch" have 'consecutive' ids? Then don't add batch_id to the table, instead, create another table Batches with one row per batch: (batch_id (PK), id_start, id_end, start_time, end_time, etc).
If you stick to exact chunks of 10K, then don't even materialize batch_id. Instead, compute it from id DIV 10000 whenever you need it.
If you want to discuss this further, please provide SHOW CREATE TABLE for the existing table, and explain what you will be doing with the "batches".
To answer your question about "slow near the end": It is having to scan farther and farther in the table to find the NULLs. You would be better to walk through the table once, fiddling with each 10K chunk as you go. Do this using the PRIMARY KEY, whatever it is. (That is, even if it is not AUTO_INCREMENT.) More Details .
Related
I have a very large table 20-30 million rows that is completely overwritten each time it is updated by the system supplying the data over which I have no control.
The table is not sorted in a particular order.
The rows in the table are unique, there is no subset of columns that I can be assured to have unique values.
Is there a way I can run a SELECT query followed by a DELETE query on this table with a fixed limit without having to trigger any expensive sorting/indexing/partitioning/comparison whilst being certain that I do not delete a row not covered by the previous select.
I think you're asking for:
SELECT * FROM MyTable WHERE x = 1 AND y = 3;
DELETE * FROM MyTable WHERE NOT (x = 1 AND y = 3);
In other words, use NOT against the same search expression you used in the first query to get the complement of the set of rows. This should work for most expressions, unless some of your terms return NULL.
If there are no indexes, then both the SELECT and DELETE will incur a table-scan, but no sorting or temp tables.
Re your comment:
Right, unless you use ORDER BY, you aren't guaranteed anything about the order of the rows returned. Technically, the storage engine is free to return the rows in any arbitrary order.
In practice, you will find that InnoDB at least returns rows in a somewhat predictable order: it reads rows in some index order. Even if your table has no keys or indexes defined, every InnoDB table is stored as a clustered index, even if it has to generate an internal key called GEN_CLUST_ID behind the scenes. That will be the order in which InnoDB returns rows.
But you shouldn't rely on that. The internal implementation is not a contract, and it could change tomorrow.
Another suggestion I could offer:
CREATE TABLE MyTableBase (
id INT AUTO_INCREMENT PRIMARY KEY,
A INT,
B DATE,
C VARCHAR(10)
);
CREATE VIEW MyTable AS SELECT A, B, C FROM MyTableBase;
With a table and a view like above, your external process can believe it's overwriting the data in MyTable, but it will actually be stored in a base table that has an additional primary key column. This is what you can use to do your SELECT and DELETE statements, and order by the primary key column so you can control it properly.
I have a MySQL table that contains millions of entries.
Each entry must be processed at some point by a cron job.
I need to be able to quickly locate unprocessed entries, using an index.
So far, I have used the following approach: I add a nullable, indexed processedOn column that contains the timestamp at which the entry has been processed:
CREATE TABLE Foo (
...
processedOn INT(10) UNSIGNED NULL,
KEY (processedOn)
);
And then retrieve an unprocessed entry using:
SELECT * FROM Foo WHERE processedOn IS NULL LIMIT 1;
Thanks to MySQL's IS NULL optimization, the query is very fast, as long as the number of unprocessed entries if small (which is almost always the case).
This approach is good enough: it does the job, but at the same time I feel like the index is wasted because it's only ever used for WHERE processedOn IS NULL queries, and never for locating a precise value or range of values for this field. So this has an inevitable impact on storage space and INSERT performance, as every single timestamp is indexed for nothing.
Is there a better approach? Ideally the index would just contain pointers to the unprocessed rows, and no pointer to any processed row.
I know I could split this table into 2 tables, but I'd like to keep it in a single table.
What comes to my mind is to create a isProcessed column, with default value = 'N' and you set to 'Y' when processed (at the same time you set the processedOn column). Then create an index on the isProcessed field. When you query (with the where clause WHERE isProcessed = 'N'), it will respond very fast.
UPDATE: ALTERNATIVE with partitioning:
Create your table with partitions and define a field that will have just 2 values 1 or 0. This will create one partition for records with the field = 1 and another for records with field = 0.
create table test (field1 int, field2 int DEFAULT 0)
PARTITION BY LIST(field2) (
PARTITION p0 VALUES IN (0),
PARTITION p1 VALUES IN (1)
);
This way, if you want to query only the records with the field equal to one of the values, just do this:
select * from test partition (p0);
The query above will show only records with field2 = 0.
And if you need to query all records together, you just query the table normally:
select * from test;
As far as I was able to understand, this will help you with your need.
I have multiple answers and comments on others' answers.
First, let me assume that the PRIMARY KEY for Foo is id INT UNSIGNED AUTO_INCREMENT (4 bytes) and that the table is Engine=InnoDB.
Indexed Extra column
The index for the extra column would be, per row, the width of the extra column and the PRIMARY KEY, plus a bunch of overhead. With your processedOn, you are talking about 8 bytes (2 INTs). With a simple flag, 5 bytes.
Separate table
This table would have only id for the unprocessed items. It would take extra code to populate it. It's size would stay at some "high-water mark". So, if there were a burst of unprocessed items, it would grow, but not shrink back. (Here's a rare case where OPTIMIZE TABLE is useful.) InnoDB requires a PRIMARY KEY, and id would work perfectly. So, one column, no extra index. It is a lot smaller than the extra index discussed above. Finding something to work on:
$id = SELECT id FROM tbl LIMIT 1; -- don't care which one
process it
DELETE FROM tbl where id = $id
2 PARTITIONs, one processed, one not
No. When you change a row from processed to unprocessed, the row must be removed from one partition and inserted into the other. This is done behind the scenes by your UPDATE ... SET flag = 1. Also, both partitions have the "high-water" issue -- they will grow but not shrink. And the space overhead for partitioning may be as much as the other solutions.
SELECT by PARTITION ... requires 5.6. Without that, you would need an INDEX, so you are back to the index issues.
Continual Scanning
This incurs zero extra disk space. (That's better than you had hoped for, correct?) And it is not too inefficient. Here's how it works. Here is some pseudo-code to put into your cron job. But don't make it a cron job. Instead, let it run all the time. (The reason will become clear, I hope.)
SELECT #a := 0;
Loop:
# Get a clump
SELECT #z := id FROM Foo WHERE id > #a ORDER BY id LIMIT 1000,1;
if no results, Set #z to MAX(id)
# Find something to work on in that clump:
SELECT #id := id FROM Foo
WHERE id > #a
AND id <= #z
AND not-processed
LIMIT 1;
if you found something, process it and set #z := #id
SET #a := #z;
if #a >= MAX(id), set #a := 0; # to start over
SLEEP 2 seconds # or some amount that is a compromise
Go Loop
Notes:
It walks through the table with minimal impact.
It works even with gaps in id. (It could be made simpler if there were no gaps.) (If the PK is not AUTO_INCREMENT, it is almost identical.)
The sleep is to be a 'nice guy'.
Selective Index
MariaDB's dynamic columns and MySQL 5.7's JSON can index things, and I think they are "selective". One state would be to have the column empty, the other would be to have the flag set in the dynamic/json column. This will take some research to verify, and may require an upgrade.
I have a table 'tbl' something like that:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER by ID LIMIT 600000, 1 takes 1.68 second
Query:
SELECT ID, field1 from tbl ORDER by ID LIMIT 600000, 1 takes 1.69 second
Query:
SELECT ID from tbl ORDER by ID LIMIT 600000, 1 takes 0.16 second
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 second
Those queries are tested in phpmyadmin.
And the result is query 3 and query 4 together return necessarily data.
Query 1 does the same jobs but much slower...
This doesn't look right for me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database, any time there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
I have a very large database of images and i need to run an update to increment the view count on the images. every hour there are over one million unique rows to update. Right now it takes about an hour to run this query is there anyway to have this run faster?
i'm creating a memory table:
CREATE TABLE IF NOT EXISTS tmp_views_table (
key VARCHAR(7) NOT NULL,
views INT NOT NULL,
primary key ( `key` )
) ENGINE = MEMORY
Then I insert 1000 views at a time using a loop that runs until all the views have been inserted into the memory table:
insert low_priority into tmp_views_table
values ('key', 'count'),('key', 'count'),('key', 'count'), etc...
Then i run an update on the actual table like this:
update images, tmp_views_table
set images.views = images.views+tmp_views_table.views
where images.key = tmp_views_table.key
this last update is the one that is taking around an hour, the memory table stuff runs pretty quickly.
Is there a faster way that i can do this update?
Are you using Innodb, right? Try general tuning of mysql and innodb engine to allow for faster data changes.
I suppose you have an index on the key field of images table. You can try your update query also without index on the memory table - in that case the query optimizer should choose full table scan of the memory table.
I have never used joins with UPDATE statements, so I don't know exactly it is executed, but maybe the JOIN is taking too long. Maybe you can post an EXPLAIN result of that query.
Here is what I have used in one project to do the something similar - insert/update real-time data to temp table and merge it to aggregate table once a day, so can try if it will execute faster.
INSERT INTO st_views_agg (pageid,pagetype,day,count)
SELECT pageid,pagetype,DATE(`when`) AS day, COUNT(*) AS count FROM st_views_pending WHERE (pagetype=4) GROUP BY pageid,pagetype,day
ON DUPLICATE KEY UPDATE count=count+VALUES(count);
I've got a mysql table where each row has its own sequence number in a "sequence" column. However, when a row gets deleted, it leaves a gap. So...
1
2
3
4
...becomes...
1
2
4
Is there a neat way to "reset" the sequencing, so it becomes consecutive again in one SQL query?
Incidentally, I'm sure there is a technical term for this process. Anyone?
UPDATED: The "sequence" column is not a primary key. It is only used for determining the order that records are displayed within the app.
If the field is your primary key...
...then, as stated elsewhere on this question, you shouldn't be changing IDs. The IDs are already unique and you neither need nor want to re-use them.
Now, that said...
Otherwise...
It's quite possible that you have a different field (that is, as well as the PK) for some application-defined ordering. As long as this ordering isn't inherent in some other field (e.g. if it's user-defined), then there is nothing wrong with this.
You could recreate the table using a (temporary) auto_increment field and then remove the auto_increment afterwards.
I'd be tempted to UPDATE in ascending order and apply an incrementing variable.
SET #i = 0;
UPDATE `table`
SET `myOrderCol` = #i:=#i+1
ORDER BY `myOrderCol` ASC;
(Query not tested.)
It does seem quite wasteful to do this every time you delete items, but unfortunately with this manual ordering approach there's not a whole lot you can do about that if you want to maintain the integrity of the column.
You could possibly reduce the load, such that after deleting the entry with myOrderCol equal to, say, 5:
SET #i = 5;
UPDATE `table`
SET `myOrderCol` = #i:=#i+1
WHERE `myOrderCol` > 5
ORDER BY `myOrderCol` ASC;
(Query not tested.)
This will "shuffle" all the following values down by one.
I'd say don't bother. Reassigning sequential values is a relatively expensive operation and if the column value is for ordering purpose only there is no good reason to do that. The only concern you might have is if for example your column is UNSIGNED INT and you suspect that in the lifetime of your application you might have more than 4,294,967,296 rows (including deleted rows) and go out of range, even if that is your concern you can do the reassigning as a one time task 10 years later when that happens.
This is a question that often I read here and in other forums. As already written by zerkms this is a false problem. Moreover if your table is related with other ones you'll lose relations.
Just for learning purpose a simple way is to store your data in a temporary table, truncate the original one (this reset auto_increment) and than repopulate it.
Silly example:
create table seq (
id int not null auto_increment primary key,
col char(1)
) engine = myisam;
insert into seq (col) values ('a'),('b'),('c'),('d');
delete from seq where id = 3;
create temporary table tmp select col from seq order by id;
truncate seq;
insert into seq (col) select * from tmp;
but it's totally useless. ;)
If this is your PK then you shouldn't change it. PKs should be (mostly) unchanging columns. If you were to change them then not only would you need to change it in that table but also in any foreign keys where is exists.
If you do need a sequential sequence then ask yourself why. In a table there is no inherent or guaranteed order (even in the PK, although it may turn out that way because of how most RDBMSs store and retrieve the data). That's why we have the ORDER BY clause in SQL. If you want to be able to generate sequential numbers based on something else (time added into the database, etc.) then consider generating that either in your query or with your front end.
Assuming that this is an ID field, you can do this when you insert:
INSERT INTO yourTable (ID)
SELECT MIN(ID)
FROM yourTable
WHERE ID > 1
As others have mentioned I don't recommend doing this. It will hold a table lock while the next ID is evaluated.