I have a MySQL table that contains millions of entries.
Each entry must be processed at some point by a cron job.
I need to be able to quickly locate unprocessed entries, using an index.
So far, I have used the following approach: I add a nullable, indexed processedOn column that contains the timestamp at which the entry has been processed:
CREATE TABLE Foo (
...
processedOn INT(10) UNSIGNED NULL,
KEY (processedOn)
);
And then retrieve an unprocessed entry using:
SELECT * FROM Foo WHERE processedOn IS NULL LIMIT 1;
Thanks to MySQL's IS NULL optimization, the query is very fast, as long as the number of unprocessed entries is small (which is almost always the case).
This approach is good enough: it does the job, but I feel the index is wasted because it is only ever used for WHERE processedOn IS NULL queries, never for locating a precise value or range of values for this field. So it has an inevitable impact on storage space and INSERT performance: every single timestamp is indexed for nothing.
Is there a better approach? Ideally the index would just contain pointers to the unprocessed rows, and no pointer to any processed row.
I know I could split this table into 2 tables, but I'd like to keep it in a single table.
What comes to my mind is to create an isProcessed column, with default value 'N', which you set to 'Y' when the entry is processed (at the same time you set the processedOn column). Then create an index on the isProcessed field. A query with the clause WHERE isProcessed = 'N' will respond very fast.
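A minimal sketch of what that could look like (the index name and the id column are assumptions; an ENUM('N','Y') would work as well as CHAR(1)):
ALTER TABLE Foo
    ADD COLUMN isProcessed CHAR(1) NOT NULL DEFAULT 'N',
    ADD INDEX idx_isProcessed (isProcessed);

-- mark an entry as processed (assumes an id primary key)
UPDATE Foo SET isProcessed = 'Y', processedOn = UNIX_TIMESTAMP() WHERE id = 123;

-- fetch one unprocessed entry
SELECT * FROM Foo WHERE isProcessed = 'N' LIMIT 1;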
UPDATE: ALTERNATIVE with partitioning:
Create your table with partitions and define a field that will have just two values, 1 or 0. This creates one partition for records with the field = 1 and another for records with the field = 0.
create table test (field1 int, field2 int DEFAULT 0)
PARTITION BY LIST(field2) (
PARTITION p0 VALUES IN (0),
PARTITION p1 VALUES IN (1)
);
This way, if you want to query only the records with the field equal to one of the values, just do this:
select * from test partition (p0);
The query above will show only records with field2 = 0.
And if you need to query all records together, you just query the table normally:
select * from test;
As far as I can tell, this should cover your need.
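If you are on a version without the explicit PARTITION (p0) syntax, a plain WHERE on the partitioning column should let the optimizer prune to the matching partition; you can check with EXPLAIN (a sketch against the test table above, verify the plan on your version):
SELECT * FROM test WHERE field2 = 0;                       -- should be pruned to partition p0
EXPLAIN PARTITIONS SELECT * FROM test WHERE field2 = 0;    -- the partitions column should list only p0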
I have multiple answers and comments on others' answers.
First, let me assume that the PRIMARY KEY for Foo is id INT UNSIGNED AUTO_INCREMENT (4 bytes) and that the table is Engine=InnoDB.
Indexed Extra column
The index for the extra column would be, per row, the width of the extra column and the PRIMARY KEY, plus a bunch of overhead. With your processedOn, you are talking about 8 bytes (2 INTs). With a simple flag, 5 bytes.
Separate table
This table would have only id for the unprocessed items. It would take extra code to populate it. Its size would stay at some "high-water mark". So, if there were a burst of unprocessed items, it would grow, but not shrink back. (Here's a rare case where OPTIMIZE TABLE is useful.) InnoDB requires a PRIMARY KEY, and id would work perfectly. So, one column, no extra index. It is a lot smaller than the extra index discussed above. Finding something to work on:
$id = SELECT id FROM tbl LIMIT 1; -- don't care which one
process it
DELETE FROM tbl where id = $id
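For illustration, a minimal sketch of the DDL this would take (the table name tbl follows the pseudo-code above; the column definition and the insert hook are assumptions):
CREATE TABLE tbl (
    id INT UNSIGNED NOT NULL,   -- id of an unprocessed Foo row
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- the extra code on insert: register every new Foo row as unprocessed
INSERT INTO tbl (id) VALUES (LAST_INSERT_ID());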
2 PARTITIONs, one processed, one not
No. When you change a row from unprocessed to processed, the row must be removed from one partition and inserted into the other. This is done behind the scenes by your UPDATE ... SET flag = 1. Also, both partitions have the "high-water" issue -- they will grow but not shrink. And the space overhead for partitioning may be as much as the other solutions.
SELECT by PARTITION ... requires 5.6. Without that, you would need an INDEX, so you are back to the index issues.
Continual Scanning
This incurs zero extra disk space. (That's better than you had hoped for, correct?) And it is not too inefficient. Here's how it works. Here is some pseudo-code to put into your cron job. But don't make it a cron job. Instead, let it run all the time. (The reason will become clear, I hope.)
SET @a := 0;
Loop:
    # Get a clump
    SELECT @z := id FROM Foo WHERE id > @a ORDER BY id LIMIT 1000,1;
    if no results, set @z := MAX(id)
    # Find something to work on in that clump:
    SELECT @id := id FROM Foo
        WHERE id > @a
          AND id <= @z
          AND not-processed
        LIMIT 1;
    if you found something, process it and set @z := @id
    SET @a := @z;
    if @a >= MAX(id), set @a := 0;   # to start over
    SLEEP 2 seconds   # or some amount that is a compromise
Go Loop
Notes:
It walks through the table with minimal impact.
It works even with gaps in id. (It could be made simpler if there were no gaps.) (If the PK is not AUTO_INCREMENT, it is almost identical.)
The sleep is to be a 'nice guy'.
Selective Index
MariaDB's dynamic columns and MySQL 5.7's JSON can index things, and I think they are "selective". One state would be to have the column empty, the other would be to have the flag set in the dynamic/json column. This will take some research to verify, and may require an upgrade.
Related
I have two tables. Let's call them KEY and VALUE.
KEY is small, somewhere around 1,000,000 records.
VALUE is huge, say 1,000,000,000 records.
Between them there is a connection such that each KEY might have many VALUES. It's not a foreign key but basically the same meaning.
The DDL looks like this
create table KEY (
key_id int,
primary key (key_id)
);
create table VALUE (
key_id int,
value_id int,
primary key (key_id, value_id)
);
Now, my problem. About half of all key_ids in VALUE have been deleted from KEY and I need to delete them in an orderly fashion while both tables are still under high load.
It would be easy to do
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null;
However, as a multi-table DELETE is not allowed to have a LIMIT, I don't like this approach. Such a delete would take hours to run, which makes it impossible to throttle the deletes.
Another approach is to create a cursor to find all missing key_ids and delete them one by one with a limit. That seems very slow and kind of backwards.
Are there any other options? Some nice tricks that could help?
Any solution that tries to delete so much data in one transaction is going to overwhelm the rollback segment and cause a lot of performance problems.
A good tool to help is pt-archiver. It performs incremental operations on moderate-sized batches of rows, as efficiently as possible. pt-archiver can copy, move, or delete rows depending on options.
The documentation includes an example of deleting orphaned rows, which is exactly your scenario:
pt-archiver --source h=host,D=db,t=VALUE --purge \
--where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
--limit 1000 --commit-each
Executing this will take significantly longer to delete the data, but it won't use too many resources and it won't interrupt service on your existing database. I have used it successfully to purge hundreds of millions of rows of outdated data.
pt-archiver is part of the Percona Toolkit for MySQL, a free (GPL) set of scripts that help common tasks with MySQL and compatible databases.
Directly from MySQL documentation
If you are deleting many rows from a large table, you may exceed the lock table size for an InnoDB table. To avoid this problem, or simply to minimize the time that the table remains locked, the following strategy (which does not use DELETE at all) might be helpful:
Select the rows not to be deleted into an empty table that has the same structure as the original table:
INSERT INTO t_copy SELECT * FROM t WHERE ... ;
Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:
RENAME TABLE t TO t_old, t_copy TO t;
Drop the original table:
DROP TABLE t_old;
No other sessions can access the tables involved while RENAME TABLE executes, so the rename operation is not subject to concurrency problems. See Section 12.1.9, "RENAME TABLE Syntax".
So in your case you may do:
INSERT INTO value_copy SELECT * FROM VALUE WHERE key_id IN
(SELECT key_id FROM `KEY`);
RENAME TABLE value TO value_old, value_copy TO value;
DROP TABLE value_old;
And according to what they wrote here, the RENAME operation is quick and the number of records doesn't affect it.
What about this for having a limit?
delete x
from `VALUE` x
join (select key_id, value_id
from `VALUE` v
left join `KEY` k using (key_id)
where k.key_id is null
limit 1000) y
on x.key_id = y.key_id AND x.value_id = y.value_id;
First, examine your data. Find the keys that have too many values to be deleted "fast". Then find out which times during the day put the smallest load on the system, and perform the deletion of those "bad" keys during that time. For the rest, delete them one by one with some downtime between deletes so that you don't put too much pressure on the database while you do it.
Maybe instead of a limit, divide the whole set of rows into small parts by key_id:
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null and v.key_id > 0 and v.key_id < 100000;
then delete rows with key_id in 100000..200000 and so on.
You can try to delete in separate transaction batches.
This is for MSSQL, but should be similar.
declare @i INT
declare @step INT
set @i = 0
set @step = 100000
while (@i < (select max(VALUE.key_id) from VALUE))
BEGIN
    BEGIN TRANSACTION
    delete from VALUE where
        VALUE.key_id between @i and @i + @step and
        not exists(select 1 from KEY where KEY.key_id = VALUE.key_id and KEY.key_id between @i and @i + @step)
    set @i = (@i + @step)
    COMMIT TRANSACTION
END
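If it helps, here is a rough MySQL translation of the same loop as a stored procedure (a sketch; the procedure name and the step size are illustrative, adjust to taste):
DELIMITER //
CREATE PROCEDURE purge_orphan_values()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE step INT DEFAULT 100000;
    DECLARE max_id INT;
    SELECT MAX(key_id) INTO max_id FROM `VALUE`;
    WHILE i < max_id DO
        START TRANSACTION;
        -- delete orphans in the current key_id range only
        DELETE FROM `VALUE`
        WHERE key_id BETWEEN i AND i + step
          AND NOT EXISTS (SELECT 1 FROM `KEY` k WHERE k.key_id = `VALUE`.key_id);
        COMMIT;
        SET i = i + step;
    END WHILE;
END //
DELIMITER ;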
Create a temporary table!
drop table if exists batch_to_delete;
create temporary table batch_to_delete as
select v.* from `VALUE` v
left join `KEY` k on k.key_id = v.key_id
where k.key_id is null
limit 10000; -- tailor batch size to your taste
-- optional but may help for large batch size
create index batch_to_delete_ix_key on batch_to_delete(key_id);
create index batch_to_delete_ix_value on batch_to_delete(value_id);
-- do the actual delete
delete v from `VALUE` v
join batch_to_delete d on d.key_id = v.key_id and d.value_id = v.value_id;
To me this is the kind of task whose progress I would want to see in a log file, and I would avoid solving it in pure SQL: I would use some scripting in Python or another similar language. Another thing that bothers me is that lots of LEFT JOINs with WHERE ... IS NULL between the tables might cause unwanted locks, so I would avoid the JOINs as well.
Here is some pseudo code:
max_key = select_db('SELECT MAX(key_id) FROM VALUE')
while max_key > 0:
    cur_range = range(max_key, max_key - 100, -1)
    good_keys = select_db('SELECT key_id FROM KEY WHERE key_id IN (%s)' % cur_range)
    keys_to_del = set(cur_range) - set(good_keys)
    while 1:
        deleted_count = update_db('DELETE FROM VALUE WHERE key_id IN (%s) LIMIT 1000' % keys_to_del)
        db_commit()
        log_something()
        if not deleted_count:
            break
    max_key -= 100
This should not bother the rest of the system very much, but it may take a long time. Another issue is optimizing the table after you have deleted all those rows, but that is another story.
If the target columns are properly indexed this should go fast:
DELETE FROM `VALUE`
WHERE NOT EXISTS(SELECT 1 FROM `key` k WHERE k.key_id = `VALUE`.key_id)
-- ORDER BY key_id, value_id -- order by PK is good idea, but check the performance first.
LIMIT 1000
Adjust the LIMIT anywhere from 10 to 10000 to get acceptable performance, and rerun the statement several times.
Also bear in mind that such mass deletes take locks and write rollback (undo) information for each row, which multiplies the execution time per row several times over. There are some advanced methods to prevent this, but the easiest workaround is just to put a transaction around this query.
Do you have a slave or a Dev/Test environment with the same data?
The first step is to find out your data distribution, in case you are worried about a particular key having 1 million value_ids:
SELECT v.key_id, COUNT(IFNULL(k.key_id,1)) AS cnt
FROM `value` v LEFT JOIN `key` k USING (key_id)
WHERE k.key_id IS NULL
GROUP BY v.key_id ;
The EXPLAIN plan for the query above is much better than after adding
ORDER BY COUNT(IFNULL(k.key_id,1)) DESC ;
Since you don't have partitioning on key_id (there would be too many partitions in your case) and want to keep the database running during your delete process, the option is to delete in chunks with SLEEP() between deletes for different key_ids, to avoid overwhelming the server. Don't forget to keep an eye on your binary logs to avoid filling the disk.
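For example, working through one orphaned key_id at a time (a sketch; the key_id value, batch size, and sleep length are illustrative):
-- repeat until 0 rows are affected for this key_id, then move on to the next orphaned key_id
DELETE FROM `VALUE` WHERE key_id = 12345 LIMIT 5000;
SELECT SLEEP(1);   -- give the server breathing room between batches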
The quickest way is :
Stop application so data is not changed.
Dump key_id and value_id from VALUE table with only matching key_id in KEY table by using
mysqldump YOUR_DATABASE_NAME value --where="key_id in (select key_id from YOUR_DATABASE_NAME.key)" --lock-all-tables --opt --quick --quote-names --skip-extended-insert > VALUE_DATA.txt
Truncate the VALUE table
Load the data exported in step 2 (see the sketch after these steps)
Start Application
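Steps 3 and 4 might look something like this; the first statement runs in the MySQL client, the second line is a shell command (a sketch, assuming the dump file produced in step 2):
TRUNCATE TABLE `VALUE`;
mysql YOUR_DATABASE_NAME < VALUE_DATA.txt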
As always, try this in Dev/Test environment with Prod data and same infrastructure so you can calculate downtime.
Hope this helps.
I am just curious what the effect would be of adding a non-unique index on key_id in table VALUE. Selectivity is not high at all (~0.001) but I am curious how that would affect the join performance.
Why don't you split your VALUE table into several ones according to some rule like key_id modulo some power of 2 (256, for example)?
I have a very large table 20-30 million rows that is completely overwritten each time it is updated by the system supplying the data over which I have no control.
The table is not sorted in a particular order.
The rows in the table are unique, there is no subset of columns that I can be assured to have unique values.
Is there a way I can run a SELECT query followed by a DELETE query on this table with a fixed limit, without triggering any expensive sorting/indexing/partitioning/comparison, while being certain that I do not delete a row not covered by the previous SELECT?
I think you're asking for:
SELECT * FROM MyTable WHERE x = 1 AND y = 3;
DELETE FROM MyTable WHERE NOT (x = 1 AND y = 3);
In other words, use NOT against the same search expression you used in the first query to get the complement of the set of rows. This should work for most expressions, unless some of your terms return NULL.
If there are no indexes, then both the SELECT and DELETE will incur a table-scan, but no sorting or temp tables.
Re your comment:
Right, unless you use ORDER BY, you aren't guaranteed anything about the order of the rows returned. Technically, the storage engine is free to return the rows in any arbitrary order.
In practice, you will find that InnoDB at least returns rows in a somewhat predictable order: it reads rows in some index order. Even if your table has no keys or indexes defined, every InnoDB table is stored as a clustered index, even if InnoDB has to generate an internal one (named GEN_CLUST_INDEX) behind the scenes. That will be the order in which InnoDB returns rows.
But you shouldn't rely on that. The internal implementation is not a contract, and it could change tomorrow.
Another suggestion I could offer:
CREATE TABLE MyTableBase (
id INT AUTO_INCREMENT PRIMARY KEY,
A INT,
B DATE,
C VARCHAR(10)
);
CREATE VIEW MyTable AS SELECT A, B, C FROM MyTableBase;
With a table and a view like above, your external process can believe it's overwriting the data in MyTable, but it will actually be stored in a base table that has an additional primary key column. This is what you can use to do your SELECT and DELETE statements, and order by the primary key column so you can control it properly.
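With that base table in place, a safe select-then-delete batch might look like this (a sketch; the 1000-row batch size is illustrative):
-- remember the largest hidden id in the next batch of 1000 rows
SELECT @max_id := MAX(id)
FROM (SELECT id FROM MyTableBase ORDER BY id LIMIT 1000) AS batch;

-- the SELECT: exactly the rows of that batch
SELECT A, B, C FROM MyTableBase WHERE id <= @max_id ORDER BY id;

-- the DELETE: exactly the same rows, nothing more
DELETE FROM MyTableBase WHERE id <= @max_id;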
I have a table "data" which holds around 100,000,000 records.
I have added a new column to it "batch_id" (Integer).
On the application layer, I'm updating the batch_id in batches of 10,000 records, working through all 100,000,000 records (the batch_id is the same within each batch of 10k).
I'm doing something like this (application layer pseudo code):
loop {
$batch_id = $batch_id + 1;
mysql.query("UPDATE data SET batch_id='$batch_id' WHERE batch_id IS NULL LIMIT 10000");
}
I have an index on the batch_id column.
In the beginning, this update statement took ~30 seconds. I'm now halfway through the Table and it's getting slower and slower. At the moment the same statement takes around 10 minutes(!). It reached a point where this is no longer feasible as it would take over a month to update the whole table at the current speed.
What could I do to speed it up, and why is MySQL getting slower towards the end of the table?
Could an index on the primary key help?
Is the primary key automatically indexed in MySQL? Yes, it is.
So, rather than an index on the primary key, it is the index on batch_id that will help.
The problem is that without an index the engine does a full table scan. At first it is easy to find 10k rows with NULL values, but as more and more records are updated, the engine has to scan more and more to find the remaining NULLs.
But it would be even easier to create batch_id as an auto-numbered column.
ANOTHER OPTION: create a new table with batch_id computed as you go, then add the index and replace the old table.
CREATE TABLE newTable AS
SELECT IF(@newID := @newID + 1,
          @newID DIV 10000,
          @newID DIV 10000) AS batch_id,
       <other fields>
FROM YourTable
CROSS JOIN (SELECT @newID := 0) AS v;
See also: Insert auto increment primary key to existing table
Do you have a monotonically increasing id in the table? And all rows for a "batch" have 'consecutive' ids? Then don't add batch_id to the table, instead, create another table Batches with one row per batch: (batch_id (PK), id_start, id_end, start_time, end_time, etc).
If you stick to exact chunks of 10K, then don't even materialize batch_id. Instead, compute it from id DIV 10000 whenever you need it.
If you want to discuss this further, please provide SHOW CREATE TABLE for the existing table, and explain what you will be doing with the "batches".
To answer your question about "slow near the end": it is having to scan farther and farther into the table to find the NULLs. You would be better off walking through the table once, fiddling with each 10K chunk as you go. Do this using the PRIMARY KEY, whatever it is. (That is, even if it is not AUTO_INCREMENT.) More Details.
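To make the "walk through the table once" idea concrete, here is a sketch assuming an AUTO_INCREMENT id primary key on the data table (an illustration, not the author's exact code):
SET @a := 0, @batch := 1;

-- repeat the three statements below until no row comes back (then stamp the final partial chunk up to MAX(id))
SELECT @z := id FROM data WHERE id > @a ORDER BY id LIMIT 9999, 1;   -- id closing the next chunk of 10,000 rows

UPDATE data SET batch_id = @batch WHERE id > @a AND id <= @z;        -- stamp the chunk by PK range, no NULL-hunting

SET @a := @z, @batch := @batch + 1;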
I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: columns will be added every month, but the number of rows is fixed.
My understanding is that often its best to store as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing the data as plain rows and columns in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or only a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`); note the order, which makes it inefficient when it comes to selecting a whole column).
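For instance, the non-unique variant might be (a sketch; the index name is made up):
ALTER TABLE `stackoverflow`.`matrix` ADD INDEX `idx_by_col` (`colNum`);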
You'll probably need more than 200 GB to store the 450,000 x 11,000 cells, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add when adding a column).
Edits should be very fast, as the indexes wouldn't change and value is of fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
After years of experience (later edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of caching (it should actually already be in the InnoDB buffer pool).
A better thing would be, if the matrix is full of zeroes, not to store the zero values, and to treat 0 as the "default" in the client code. That way you may lighten up the storage (if needed: MySQL should actually be pretty fast at responding to queries even on such a 5-billion-row table).
Another thing, if storage is an issue, is to use a single ID to identify both row and col: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000*col + row), though it needs BIGINT, so it is maybe not better than 2 columns.
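A quick sketch of that encoding, assuming 0-based row numbers below 450,000 (the column names follow the matrix table above):
SELECT 450000 * 7 + 123               AS cell_id,   -- encode colNum = 7, rowNum = 123
       (450000 * 7 + 123) DIV 450000  AS colNum,    -- decodes back to 7
       (450000 * 7 + 123) MOD 450000  AS rowNum;    -- decodes back to 123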
Don't do what is described below: don't reinvent MySQL's cache
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns an empty or partial result (no data returned, or not enough data to match the expected row/column count), then do the SELECT on the matrix table
Save the result of that SELECT into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from the cache entries because its `lastDate` is old enough, you may break some other entries' cache) or to regularly clear the cache and only leave the recently selected values.
I've got a mysql table where each row has its own sequence number in a "sequence" column. However, when a row gets deleted, it leaves a gap. So...
1
2
3
4
...becomes...
1
2
4
Is there a neat way to "reset" the sequencing, so it becomes consecutive again in one SQL query?
Incidentally, I'm sure there is a technical term for this process. Anyone?
UPDATED: The "sequence" column is not a primary key. It is only used for determining the order that records are displayed within the app.
If the field is your primary key...
...then, as stated elsewhere on this question, you shouldn't be changing IDs. The IDs are already unique and you neither need nor want to re-use them.
Now, that said...
Otherwise...
It's quite possible that you have a different field (that is, as well as the PK) for some application-defined ordering. As long as this ordering isn't inherent in some other field (e.g. if it's user-defined), then there is nothing wrong with this.
You could recreate the table using a (temporary) auto_increment field and then remove the auto_increment afterwards.
I'd be tempted to UPDATE in ascending order and apply an incrementing variable.
SET #i = 0;
UPDATE `table`
SET `myOrderCol` = #i:=#i+1
ORDER BY `myOrderCol` ASC;
(Query not tested.)
It does seem quite wasteful to do this every time you delete items, but unfortunately with this manual ordering approach there's not a whole lot you can do about that if you want to maintain the integrity of the column.
You could possibly reduce the load, such that after deleting the entry with myOrderCol equal to, say, 5:
SET #i = 5;
UPDATE `table`
SET `myOrderCol` = #i:=#i+1
WHERE `myOrderCol` > 5
ORDER BY `myOrderCol` ASC;
(Query not tested.)
This will "shuffle" all the following values down by one.
I'd say don't bother. Reassigning sequential values is a relatively expensive operation, and if the column value is for ordering purposes only there is no good reason to do it. The only concern you might have is if, for example, your column is UNSIGNED INT and you suspect that over the lifetime of your application you might have more than 4,294,967,296 rows (including deleted rows) and go out of range; even then you can do the reassigning as a one-time task ten years later when that happens.
This is a question that I often read here and in other forums. As zerkms already wrote, this is a false problem. Moreover, if your table is related to other ones you'll lose the relations.
Just for learning purposes, a simple way is to store your data in a temporary table, truncate the original one (this resets auto_increment) and then repopulate it.
Silly example:
create table seq (
id int not null auto_increment primary key,
col char(1)
) engine = myisam;
insert into seq (col) values ('a'),('b'),('c'),('d');
delete from seq where id = 3;
create temporary table tmp select col from seq order by id;
truncate seq;
insert into seq (col) select * from tmp;
but it's totally useless. ;)
If this is your PK then you shouldn't change it. PKs should be (mostly) unchanging columns. If you were to change them, you would need to change them not only in that table but also in any foreign keys where they exist.
If you do need a sequential sequence then ask yourself why. In a table there is no inherent or guaranteed order (even in the PK, although it may turn out that way because of how most RDBMSs store and retrieve the data). That's why we have the ORDER BY clause in SQL. If you want to be able to generate sequential numbers based on something else (time added into the database, etc.) then consider generating that either in your query or with your front end.
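For example, in MySQL 8.0+ the display sequence can be derived at read time instead of being stored (a sketch; created_at stands in for whatever column drives your ordering):
SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.created_at) AS seq
FROM `table` t;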
Assuming that this is an ID field, you can do this when you insert:
INSERT INTO yourTable (ID)
SELECT MIN(ID)
FROM yourTable
WHERE ID > 1
As others have mentioned I don't recommend doing this. It will hold a table lock while the next ID is evaluated.