MySQL update of large table based on another large table too slow - mysql

I have one table that looks like this:
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| name        | varchar(255) | NO   | PRI | NULL    |       |
| timestamp1  | int          | NO   |     | NULL    |       |
| timestamp2  | int          | NO   |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
This table has around 250 million rows in it. Once a day I get a CSV that contains around 225 million rows of just one name column. 99% of the names in the daily CSV are already in the database. For those, I want to update their timestamp1 column to UNIX_TIMESTAMP(NOW()); the names that are in the CSV but not in the original table should be inserted. Right now this is how I am doing it:
DROP TEMPORARY TABLE IF EXISTS tmp_import;
CREATE TEMPORARY TABLE tmp_import (name VARCHAR(255), PRIMARY KEY (name));
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE tmp_import LINES TERMINATED BY '\n';
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE og.name IN (SELECT tmp.name FROM tmp_import tmp);
DELETE FROM tmp_import WHERE name in (SELECT og.name FROM og_table og);
INSERT INTO og_table SELECT name, UNIX_TIMESTAMP(NOW()) AS timestamp1, UNIX_TIMESTAMP(NOW()) AS timestamp2 FROM tmp_import;
As someone might guess, the UPDATE line is the problem: it either takes over 6 hours or throws an error. Reading the data in takes upwards of 40 minutes; I know this is mostly because of building the index on name, since without the primary key the load only takes 9 minutes, but I thought having the index would speed up the rest of the operation. I have tried the update several different ways, both what I have above and the following:
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE EXISTS (SELECT tmp.name FROM tmp_import tmp where tmp.name = og.name);
UPDATE og_table og inner join tmp_import tmp on og.name=tmp.name SET og.timestamp1 = UNIX_TIMESTAMP(NOW());
Neither of those attempts worked; they normally run for several hours and then end with:
ERROR 1206 (HY000): The total number of locks exceeds the lock table size
I am using InnoDB for these tables, but there are no foreign keys needed and I don't strictly require InnoDB's features, so I would be open to trying different storage engines.
I have been looking through a lot of posts and have yet to find something to help in my situation. If I missed a post I apologize.

If the name values are rather long, you might greatly improve performance by using a hash function such as MD5 or SHA-1 and storing and indexing only the hash. You probably don't even need all 128 or 160 bits; an 80-bit portion should be good enough, with a very low chance of a collision. See this.
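As a rough sketch of that idea (the name_hash column and index name below are mine, not part of your schema):
ALTER TABLE og_table
  ADD COLUMN name_hash BINARY(10) NOT NULL,        -- first 80 bits of SHA-1
  ADD INDEX idx_name_hash (name_hash);
UPDATE og_table SET name_hash = UNHEX(LEFT(SHA1(name), 20));  -- 20 hex chars = 80 bits
-- Lookups and joins then compare the short fixed-width hash instead of the full
-- VARCHAR(255), e.g. ... ON og.name_hash = UNHEX(LEFT(SHA1(tmp.name), 20))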
Another thing you might want to check is if you have enough RAM. How big is your table and how much RAM do you have? Also, it's not just about how much RAM you have on the machine, but how much of it is available to MySQL/InnoDB's buffer cache.
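For what it's worth, the ERROR 1206 in the question also points at the buffer pool, since InnoDB's lock table is allocated inside it. A quick way to compare the table size against the pool (the 8 GB value below is only a placeholder):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_name = 'og_table';
-- MySQL 5.7.5+ can resize the buffer pool online; older versions need a
-- my.cnf change and a restart. The 8 GB here is just a placeholder.
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;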
What disk are you using? If you are using a spinning disk (HDD), that might be a huge bottleneck if InnoDB needs to constantly make scattered reads.
There are many other things that might help, but I would need more details. For example, if the names in the CSV are not sorted, and your buffer caches are about 10-20% of the table size, you might have a huge performance boost by splitting the work in batches, so that names in each batch are close enough (for example, first process all names that start with 'A', then those starting with 'B', etc.). Why would that help? In a big index (in InnoDB tables are also implemented as indexes) that doesn't fit into the buffer cache, if you make millions of reads all around the index, the DB will need to constantly read from the disk. But if you work on a smaller area, the data blocks (pages) will only be read once and then they will stay in the RAM for subsequent reads until you've finished with that area. That alone might easily improve performance by 1 or 2 orders of magnitude, depending on your case.
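For illustration, a batched version of the UPDATE, split by the first character of name (the split points are arbitrary, and committing between batches also keeps the number of row locks held by any one transaction down):
-- One batch per leading character; repeat for 'B', 'C', ... as needed.
-- Each batch touches a contiguous slice of the PRIMARY KEY (name), so its
-- pages can stay in the buffer pool for the whole batch.
UPDATE og_table og
JOIN tmp_import tmp ON og.name = tmp.name
SET og.timestamp1 = UNIX_TIMESTAMP(NOW())
WHERE og.name LIKE 'A%';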

A big update (as Barmar points out) takes a long time. Let's avoid it by building a new table, then swapping it into place.
First, let me get clarification and provide a minimal example.
You won't be deleting any rows, correct? Just adding or updating rows?
You have (in og_table):
A 88 123
B 99 234
The daily load (tmp_import) says
B
C
You want
A 88 123
B NOW() 234
C NOW() NULL
Is that correct? Now for the code:
load nightly data and build the merge table:
LOAD DATA ... (name) -- into TEMPORARY tmp_import
CREATE TABLE merge LIKE og_table; -- not TEMPORARY
Populate a new table with the data merged together
INSERT INTO merge
-- B and C (from the example):
( SELECT ti.name, UNIX_TIMESTAMP(NOW()), og.timestamp2
FROM tmp_import AS ti
LEFT JOIN og_table AS og USING(name)
) UNION ALL
-- A:
( SELECT og.name, og.timestamp1, og.timestamp2
FROM og_table AS og
LEFT JOIN tmp_import AS ti USING(name)
WHERE ti.name IS NULL -- (that is, missing from csv)
);
Swap it into place
RENAME TABLE og_table TO x,
merge TO og_table;
DROP TABLE x;
Bonus: og_table is "down" only very briefly (during the RENAME).
A possible speed-up: Sort the CSV file by name before loading it. (If that takes an extra step, the cost of that step may be worse than the cost of not having the data sorted. There is not enough information to predict.)

Related

Mysql fastest technique for insert, replace, on duplicate of mass records

I know there are a lot of related questions with many answers, but I have a somewhat more nuanced question. I have been reading up on different insert techniques for mass records, but are there limits on how big an insert query can be? Can the same technique be used for REPLACE and INSERT ... ON DUPLICATE KEY UPDATE ...? Is there a faster method?
Table:
+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| a         | int(11)     | NO   | PRI | NULL    | auto_increment |
| b         | int(11)     | YES  |     | NULL    |                |
| c         | int(11)     | YES  |     | NULL    |                |
+-----------+-------------+------+-----+---------+----------------+
#1
1) "INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3);"
2) "INSERT INTO TABLE COLUMNS (a,b,c) values (5,6,7);"
3) "INSERT INTO TABLE COLUMNS (a,b,c) values (8,9,10);"
...
10,000) "INSERT INTO TABLE COLUMNS (a,b,c) values (30001,30002,30003);"
or
#2 - should be faster, but is there a limit?
"INSERT INTO TABLE COLUMNS (a,b,c) values (1,2,3),(4,5,6),(8,9,10)....(30001,30002,30003)" ;
From a scripting perspective (PHP), using #2, is it better to loop through and queue up 100 entries (1000 times)...or a 1000 entries (100 times), or just all 10,000 at once? Could this be done with 100,000 entries?
Can the same be used with REPLACE:
"REPLACE INTO TABLE (a, b, c) VALUES(1,2,3),(4,5,6)(7,8,9),...(30001,30002,30003);"
Can it also be used with INSERT ON DUPLICATE?
INSERT INTO TABLE (a, b, c) VALUES(1,2,3),(4,5,6),(7,8,9),....(30001,30002,30003) ON DUPLICATE KEY UPDATE (b=2,c=3)(b=5,c=6),(b=8,c=9),....(b=30002,c=30003) ?
For any and all of the above (assuming the replace/on duplicate are valid), are there faster methods to achieve the inserts?
The length of any SQL statement is limited by a MySQL option called max_allowed_packet.
The syntax of INSERT allows you to add an unlimited number of tuples after the VALUES clause, but the total length of the statement, from INSERT to the last tuple, must still be no more than max_allowed_packet bytes.
Regardless of that, I have found that LOAD DATA INFILE is usually significantly faster than any INSERT syntax. It's so much faster, that you might even find it faster to write your tuples to a temporary CSV file and then use LOAD DATA INFILE on that CSV file.
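For illustration only (the table name, file path, and column list below are placeholders):
SHOW VARIABLES LIKE 'max_allowed_packet';   -- current statement-size limit, in bytes
LOAD DATA LOCAL INFILE '/tmp/rows.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(b, c);   -- omit the auto_increment column and let it fill itself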
You might like my presentation comparing different bulk-loading solutions in MySQL: Load Data Fast!
#1 (single-row inserts) -- Slow. A variant is INSERT IGNORE -- beware: it burns AUTO_INCREMENT ids.
#2 (batch insert) -- Faster than #1 by a factor of 10. But do the inserts in batches of no more than 1000. (After that, you are into "diminishing returns" and may conflict with other activities.)
#3 REPLACE -- Bad. It is essentially a DELETE plus an INSERT. Once IODKU was added to MySQL, I don't think there is any use for REPLACE. All the old AUTO_INCREMENT ids will be tossed and new ones created.
#4 IODKU (Upsert) -- [If you need to test before Insert.] It can be batched, but not the way you presented it. (There is no need to repeat the b and c values.)
INSERT INTO TABLE (a, b, c)
VALUES(1,2,3),(4,5,6),(7,8,9),....(30001,30002,30003)
ON DUPLICATE KEY UPDATE
b = VALUES(b),
c = VALUES(c);
Or, in MySQL 8.0.19+, give the inserted rows an alias (... VALUES (...) AS new ...) and write the last 2 lines as:
b = new.b,
c = new.c;
IODKU also burns ids.
MySQL LOAD DATA INFILE with ON DUPLICATE KEY UPDATE discusses a 2-step process of LOAD + IODKU. Depending on how complex the "updates" are, 2+ steps may be your best answer.
#5 LOAD DATA -- as Bill mentions, this is a good way if the data comes from a file. (I am dubious about its speed if you also have to write the data to a file first.) Be aware of the usefulness of @variables to make minor tweaks as you do the load. (E.g., STR_TO_DATE(..) to fix a DATE format; a small sketch follows below.)
#6 INSERT ... SELECT ...; -- If the data is already in some other table(s), you may as well combine the Insert and Select. This works for IODKU, too.
As a side note, if you need to get AUTO_INCREMENT ids of each batched row, I recommend some variant on the following. It is aimed at batch-normalization of id-name pairs that might already exist in the mapping table. Normalization
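Here is the kind of @variables tweak #5 refers to, sketched with made-up table, file, and column names:
LOAD DATA LOCAL INFILE '/tmp/batch.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(b, @raw_date, c)
SET d = STR_TO_DATE(@raw_date, '%d/%m/%Y');   -- fix the DATE format during the load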

How can make this join of two huge MySQL Tables finish?

I have two tables
table1:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20)
table2:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20) <- empty
column1 and column2 both have a separate Fulltext index in table1
both tables hold 20 million rows
I need to fill column3 of table2 by matching column1 & column2 from table2 to column1 & column2 from table1, then take the value in column3 from table1 and put it into column3 of table2. column1 & column2 might not match exactly, so the query I use for this is:
UPDATE table1, table2
SET table2.column3 = table1.column3
WHERE table2.column1 LIKE CONCAT('%', table1.column1, '%') AND
      table2.column2 LIKE CONCAT('%', table1.column2, '%');
This query never finishes. I let it run for two weeks and it still hadn't produced any result. It utilized one CPU core at 100%, had little SSD IO, and apparently needs to be optimized somehow.
I am open to any suggestions regarding query optimization, index optimization or even DBMS optimization (or even migration, if it helps) since I need to do queries like this more often in the future.
EDIT1
There are plenty of optimization guides; please use Google for that. You can increase the threads in the config (InnoDB). For the update itself I recommend first creating a temp table and then copying the result into table2.
I know that but couldn't quite solve my scenario with those guides. I also know that questions of all possible permutations of combinations for this problem (huge databases, performance, bottlenecks, query design) are all around, also on stackoverflow. However, to this day I couldn't figure out what the best way to proceed would be for this specific combination of problems and hoped for getting help here. That being said:
- more threads would require sharding or partitioning in order to utilize more than one CPU core, which I would like to avoid if I can solve the problem with other means
- how would you propose to create such temporary table here?
Why do you use the LIKE operator if you do not use wildcard characters? Replace them with =. Also, do you have a multi-column index on the 3 columns in the WHERE criteria in each of the tables? Please share the output of EXPLAIN as well, along with any existing indexes in the 2 tables.
I left those characters out in the example but want to use them once the basic query works, sorry for the confusion. I am not entirely sure how to put those wildcards into a column comparison though.
I have two separate indexes; should I create a 2-column index instead? (There are only 2 columns in the WHERE criteria.)
Would you rather have the EXPLAIN of the structure I have now, or the EXPLAIN of the structure with a 2-column index?
I guess you say databases but you are talking about tables, right?
Exactly, sorry for the confusion.
The query you wrote will perform 20M x 20M lookups (for each row in table1, look up all rows in table2). You can't write whatever you like and expect it to work just because you have an SSD or a good CPU. If you arrived at this point, it's time to think before you start writing SQL. What is it that you need to do, what are the tools at your disposal, and what's the middle part that you don't know - those are the questions you need to answer every time before you issue a 400-trillion-lookup query.
That is the scenario I am facing though. I don't expect it to work at all like it is at the moment, to be honest, so I am looking for pointers which might make this a solvable scenario. The basic "update this, where that matches" query apparently doesn't apply here. So I am trying to figure out a way to a more advanced solution. Any criticism is very welcome, so thank you for this input. How would you suggest to proceed here?
EDIT2
Give us some sample values and non-exact comparisons.
table1:
+---------+---------+-------------+---------+---------+---------+
| column1 | column2 | column3     | column4 | column5 | columnN |
+---------+---------+-------------+---------+---------+---------+
| John    | Doe_    | employee001 | xyz     | 12345   | ...     |
| Jim     | Doe     | employee002 | abc     | 67890   | ...     |
+---------+---------+-------------+---------+---------+---------+
table2:
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| John    | Doe     |         |
| Jim     | Doe     |         |
+---------+---------+---------+
Here, a LIKE query would fill both rows of table2 if it matched "Doe_" against "Doe". But by writing this down, I just realized that a LIKE query is no option here, because the variations wouldn't be constrained to a suffix of column2 in table1; rather, various possible LIKEs would be required (leading AND trailing variants for both columns in both tables). This in turn would multiply the number of required matches.
So let's forget about the LIKE and concentrate on exact matching only.
FULLTEXT and LIKE have nothing to do with each other.
"Might not match exactly" -- You will need more limitations on this non-restriction. Else, any attempt at a query will continue to take weeks.
t2.c1 LIKE CONCAT('%', t1.c1, '%') requires checking every row of t1 against every row of t2; that's 400 trillion tests. No hardware can do that in a reasonable length of time.
FULLTEXT works with "words". If your c1 and c2 are strings of words, then there is some hope to use FULLTEXT. FULLTEXT is much faster than LIKE because it has an index structure based on words.
However, even FULLTEXT is nowhere near the speed of t2.c1 = t1.c1. Still, that would need a composite INDEX(c1, c2). Then it would be a full table scan (20M rows) of one table, plus 20M probes via a BTree index into the other table. That is about 40M operations -- a lot better than 400T for LIKE.
In order to proceed, please think through your definition of "Might not match exactly" and present the best you can live with.
Ok, since I decided to drop the LIKE requirement, what exactly do you propose to use as an index?
I read your post like this:
ALTER TABLE `table1` ADD FULLTEXT INDEX `indexname1` (`column1`, `column2`);
ALTER TABLE `table2` ADD FULLTEXT INDEX `indexname2` (`column1`, `column2`);
UPDATE `table1`, `table2`
SET `table2`.`column3` = `table1`.`column3`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
Is this correct?
Two followup questions though:
1) Is the update, in your opinion, as fast as, faster than, or slower than creating a new table, i.e.:
CREATE TABLE `merged` AS
SELECT `table1`.`column1`, `table1`.`column2`, `table1`.`column3`
FROM `table1`, `table2`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
2) Would the indexes and/or the matching be case sensitive? If yes, can I adapt the query without having to change column1 & column2 to all upper case (or all lower case)?
Edit
WHERE CONCAT(t1.c1, t1.c2) = CONCAT(t2.c1, t2.c2) is a lot worse than saying WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2. The latter will run fast with INDEX(c1,c2).
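Putting that together (the index name below is mine), the exact-match version would look roughly like this:
ALTER TABLE table1 ADD INDEX idx_c1_c2 (column1, column2);
-- One full scan of table2 plus one BTree probe into table1 per row,
-- instead of comparing every row against every row:
UPDATE table2
JOIN table1 ON table1.column1 = table2.column1
           AND table1.column2 = table2.column2
SET table2.column3 = table1.column3;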
Try this:
1. Add a new column to db1 and db2 containing a character that never appears in column1 and column2, for example #
ALTER TABLE `db1` ADD `column4` VARCHAR(41) NOT NULL;
UPDATE db1 SET column4 = CONCAT(column1, '#', column2);
2. Do the same for db2. Then create an index (BTREE) on column 4 (in db1 and db2).
ALTER TABLE `db1` ADD INDEX ( `column4` ) ;
ALTER TABLE `db2` ADD INDEX ( `column4` ) ;
3. Then run next query:
UPDATE db1, db2 SET db2.column3 = db1.column3 WHERE db1.column4 = db2.column4;
It should run fast enough.
When it's done - just drop column4 and its index.

Optimize Mysql Query (rawdata to processed table)

Hi everyone, so my question is this: I have a file of roughly 3000 rows of data that is read in with the LOAD DATA LOCAL INFILE command. After that, a trigger on the table that was inserted into copies three columns from the updated table and two columns from a table that already exists in the database (if it is unclear what I mean, the structures are below). From there, only combinations that have unique glNumbers are entered into the processed table. This normally takes over a minute and a half, which I find pretty long. I was wondering if this is normal for what I'm doing (I can't believe that's true), or is there a way to optimize the queries so it goes faster?
Tables that are inserted into are labeled with the first three letters of each month. Here is the default structure.
RawData Structure
| idjan | glNumber | journel | invoiceNumber | date | JT | debit | credit | descriptionDetail | totalDebit | totalCredit |
(Sorry for the poor format; there doesn't seem to be a really good way to lay this out.)
After Insert Trigger Query
delete from processedjan;
insert into processedjan(glNumber,debit,credit,bucket1,bucket2)
select a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
from jan a inner join bucketinformation b on a.glNumber = b.glNumber
group by a.glNumber;
Processed Datatable Structure
| glNumber | bucket1| bucket2| credit | debit |
Also, I guess it helps to know that bucket1 and bucket2 come from another table where they are matched against the glNumber. That table is roughly 800 rows, with three columns: the glNumber and the two buckets.
While PostgreSQL has statement-level triggers, MySQL only has row-level triggers. From the MySQL reference:
A trigger is defined to activate when a statement inserts, updates, or
deletes rows in the associated table. These row operations are trigger
events. For example, rows can be inserted by INSERT or LOAD DATA
statements, and an insert trigger activates for each inserted row.
So while you are managing to load 3000 rows in one operation, unfortunately 3000 more queries are executed by the trigger. And given the complex nature of your trigger, you might actually be performing 2-3 queries per row. That's the real reason for the slowdown.
You can speed things up by disabling the trigger and carrying out an INSERT ... SELECT after the LOAD DATA INFILE. You can automate it with a small script.
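A sketch of that approach, reusing the names from the question (the trigger name and file path are guesses):
DROP TRIGGER IF EXISTS jan_after_insert;          -- or simply never create it
LOAD DATA LOCAL INFILE '/path/to/jan.csv' INTO TABLE jan;
-- Rebuild the processed table once, instead of once per inserted row:
DELETE FROM processedjan;
INSERT INTO processedjan (glNumber, debit, credit, bucket1, bucket2)
SELECT a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
FROM jan a
JOIN bucketinformation b ON a.glNumber = b.glNumber
GROUP BY a.glNumber;   -- as in the original trigger query; with ONLY_FULL_GROUP_BY
                       -- enabled the non-grouped columns would need ANY_VALUE()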

Default value for order field in mysql

In a given table I have a field (field_order) that will serve as a way to define a custom order for showing the rows of the table. When inserting a new record,
I would like to set that particular field to the number of rows in the table plus one.
So if the table has 3 rows, at the time of inserting a new one, the default value for field_order should be 4.
What would be the best approach to set that value?
A simple select count inside the insert statement?
Is there a constant that returns that value, like CURRENT_TIMESTAMP for the TIMESTAMP datatype?
EDIT: The reason behind this is to be able to sort the table by that particular field; and that field would be manipulated by a user in client side using jQuery's sortable
Okay, so the solutions surrounding this question actually involve a bit of nuance. Went ahead and decided to answer, but also wanted to address some of the nuance/details that the comments aren't addressing yet.
First off, I would very strongly advise you against using auto_increment on the primary key for this, if for no other reason than that it's very easy for those auto-increment ids to get thrown off (for example, rolled-back transactions will interfere with them, see MySQL AUTO_INCREMENT does not ROLLBACK; so will deletes, as @Sebas mentioned).
Second, you have to consider your storage engine. If you are using MyISAM, you can very quickly obtain a COUNT(*) of the table (because MyISAM always knows how many rows are in each table). If you're using INNODB, that's not the case. Depending on what you need this table for, you may be able to get away with MyISAM. It's not the default engine, but it is certainly possible that you could encounter a requirement for which MyISAM would be a better choice.
The third thing you should ask yourself is, "Why?" Why do you need to store your data that way at all? What does that actually give you? Do you in fact need that information in SQL? In the same table of the same SQL database?
And if the "Why" has an answer that justifies its use, then the last thing I'd ask is "how?" In particular, how are you going to deal with concurrent inserts? How are you going to deal with deletes or rollbacks?
Given the requirement that you have, doing a count star of the table is basically necessary... but even then, there's some nuance involved (deletes, rollbacks, concurrency) and also some decisions to be made (which storage engine do you use; can you get away with using MyISAM, which will be faster for count stars?).
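For reference, the "count inside the insert" the question asks about would look something like this (table and column names are made up), with all of the concurrency caveats above still applying:
-- INSERT ... SELECT may read the target table, so the count and the insert
-- can happen in one statement; concurrent inserts can still collide on the
-- same field_order value unless you lock the table.
INSERT INTO my_table (name, field_order)
SELECT 'new row', COUNT(*) + 1
FROM my_table;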
More than anything, though, I'd question why I needed this in the first place. Maybe you really do... but that's an awfully strange requirement.
IN LIGHT OF YOUR EDIT:
EDIT: The reason behind this is to be able to sort the table by
that particular field; and that field would be manipulated by a user
in client side using jQuery's sortable
Essentially what you are asking for is metadata about your tables. And I would recommend storing that metadata in a separate table, or in a separate service altogether (Elastic Search, Redis, etc). You would need to periodically update that separate table (or key-value store). If you were doing this in SQL, you could use a trigger. Or, if you used something like Elastic Search, you could insert your data into SQL and ES at the same time. Either way, you have some tricky issues to contend with (for example, eventual consistency, concurrency, all the glorious things that can backfire when you are using triggers in MySQL).
If it were me, I'd note two things. One, not even Google delivers an always up-to-date COUNT(*): "Showing rows 1-10 out of approximately XYZ." They do that in part because they have more data than I imagine you do, and in part because it actually is impractical (and very quickly becomes infeasible and prohibitive) to calculate an exact COUNT(*) of a table and keep it up to date at all times.
So, either I'd change my requirement entirely and leverage a statistic I can obtain quickly (if you are using MyISAM for storage, go ahead and use count( * )... it will be very fast) or I would consider maintaining an index of the count stars of my tables that periodically updates via some process (cron job, trigger, whatever) every couple of hours, or every day, or something along those lines.
Inre the bounty on this question... there will never be a single, canonical answer to this question. There are tradeoffs to be made no matter how you decide to manage it. They may be tradeoffs in terms of consistency, latency, scalability, precise vs approximate solutions, losing INNODB in exchange for MyISAM... but there will be tradeoffs. And ultimately the decision comes down to what you are willing to trade in order to get your requirement.
If it were me, I'd probably flex my requirement. And if I did, I'd probably end up indexing it in Elastic Search and make sure it was up to date every couple of hours or so. Is that what you should do? That depends. It certainly isn't a "right answer" as much as it is one answer (out of many) that would work if I could live with my count(*) getting a bit out of date.
Should you use Elastic Search for this? That depends. But you will be dealing with tradeoffs whichever way you go. That does not depend. And you will need to decide what you're willing to give up in order to get what you want. If it's not critical, flex the requirement.
There may be a better approach, but all I can think of right now is to create a second table that holds the value you need, and use triggers to make the appropriate inserts / deletes:
Here's an example:
-- Let's say this is your table
create table tbl_test(
    id int unsigned not null auto_increment primary key,
    text varchar(50)
);
-- Now, here's the table I propose.
-- It will be related to your original table using 'Id'
-- (If you're using InnoDB you can add the appropriate constraint)
create table tbl_incremental_values(
    id int unsigned not null primary key,
    incremental_value int unsigned not null default 0
);
-- The triggers that make this work:
delimiter $$
create trigger trig_add_one after insert on tbl_test for each row
begin
    declare n int unsigned default 0;
    set n = (select count(*) from tbl_test);
    insert into tbl_incremental_values
    values (NEW.id, (n));
end $$
-- If you're using InnoDB tables and you've created a constraint that cascades
-- delete operations, skip this trigger
create trigger trig_remove before delete on tbl_test for each row
begin
    delete from tbl_incremental_values where id = OLD.id;
end $$
delimiter ;
Now, let's test it:
insert into tbl_test(text) values ('a'), ('b');
select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);
-- Result:
-- id | text | incremental_value
-- ---+------+------------------
-- 1 | a | 1
-- 2 | b | 2
delete from tbl_test where text = 'b';
select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);
-- Result:
-- id | text | incremental_value
-- ---+------+------------------
-- 1 | a | 1
insert into tbl_test(text) values ('c'), ('d');
select a.*, b.incremental_value
from tbl_test as a inner join tbl_incremental_values as b using (id);
-- Result:
-- id | text | incremental_value
-- ---+------+------------------
-- 1 | a | 1
-- 3 | c | 2
-- 4 | d | 3
This will work fine for small datasets, but as evanv says in his answer:
Why?" Why do you need to store your data that way at all? What does that actually give you? Do you in fact need that information in SQL? In the same table of the same SQL table?
If all you need is to output that result, there's a much easier way to make this work: user variables.
Let's now say that your table is something like this:
create table tbl_test(
id int unsigned not null auto_increment primary key,
ts timestamp,
text varchar(50)
);
insert into tbl_test(text) values('a');
insert into tbl_test(text) values('b');
insert into tbl_test(text) values('c');
insert into tbl_test(text) values('d');
delete from tbl_test where text = 'b';
insert into tbl_test(text) values('e');
The ts column will take the value of the date and time on which each row was inserted, so if you sort it by that column, you'll get the rows in the order they were inserted. But now: how to add that "incremental value"? Using a little trick with user variables it is possible:
select a.*
     , @n := @n + 1 as incremental_value
    -- ^^^^^^^^^^^^ This will update the value of @n on each row
from (select @n := 0) as init -- <-- you need to initialize @n to zero
     , tbl_test as a
order by a.ts;
-- Result:
-- id | ts | text | incremental_value
-- ---+---------------------+------+----------------------
-- 1 | xxxx-xx-xx xx:xx:xx | a | 1
-- 3 | xxxx-xx-xx xx:xx:xx | c | 2
-- 4 | xxxx-xx-xx xx:xx:xx | d | 3
-- 5 | xxxx-xx-xx xx-xx-xx | e | 4
But now... how to deal with big datasets, where it's likely you'll use LIMIT? Simply by initializing @n to the start value of the LIMIT:
-- A dull example:
prepare stmt from
    "select a.*, @n := @n + 1 as incremental_value
     from (select @n := ?) as init, tbl_test as a
     order by a.ts
     limit ?, ?";
-- The question marks work as "place holders" for values. If you're working
-- directly on the MySQL CLI or MySQL Workbench, you'll need to create user variables
-- to hold the values you want to use.
set @first_row = 2, @nrows = 2;
execute stmt using @first_row, @first_row, @nrows;
--                 ^^^^^^^^^^  ^^^^^^^^^^  ^^^^^^
--                 Initializes The "floor" The number
--                 the @n      of the      of rows
--                 value       LIMIT       you want
--
-- Set @first_row to zero if you want to get the first @nrows rows
--
-- Result:
-- id | ts | text | incremental_value
-- ---+---------------------+------+----------------------
-- 4 | xxxx-xx-xx xx:xx:xx | d | 3
-- 5 | xxxx-xx-xx xx-xx-xx | e | 4
deallocate prepare stmt;
It seems like the original question was asking for an easy way to set a default sort order on a new record. Later on the user may adjust that "order field" value. Seems like DELETES and ROLLBACKS have nothing to do with this.
Here's a simple solution. For the sort order field, set your default value as 0, and use the primary key as your secondary sort. Simply change your sort order in the query to be DESC. If you want the default functionality to be "display most recently added first", then use:
SELECT * from my_table
WHERE user_id = :uid
ORDER BY field_order, primary_id DESC
If you want to "display most recently added last" use:
SELECT * from my_table
WHERE user_id = :uid
ORDER BY field_order DESC, primary_id
What I have done to avoid the SELECT COUNT(*) in the insert query is to have an unsorted state for the field_order column, let's say a default value of 0.
The select-query looks like:
SELECT * FROM my_table ... ORDER BY field_order, id_primary
As long as you don't apply a custom order, your query will result in chronological order.
When you want to apply custom sorting, field_order should be reset by counting from -X to 0 (a sketch of that re-numbering follows at the end of this answer):
id | sort
---+-----
1 | -2
2 | -1
3 | 0
When altering occurs, the custom sort remains, and new rows will always be sorted chronologically at the end of the custom sorting already in place:
id | sort
---+-----
1 | -2
3 | 0
4 | 0
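One way that re-numbering step could be written, as a sketch using a user variable (@pos, the WHERE, and the ORDER BY are illustrative; the idea is just to re-base the existing order onto -(N-1) ... 0 so that new rows, which default to 0, keep sorting after it):
SET @pos := -(SELECT COUNT(*) FROM my_table WHERE user_id = :uid);
UPDATE my_table
SET field_order = (@pos := @pos + 1)   -- first row gets -(N-1), last row gets 0
WHERE user_id = :uid
ORDER BY field_order, id_primary;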

Mysql: Duplicate key error with autoincrement primary key

I have a table 'logging' in which we log visitor history. We have 14 million pageviews in a day, so we insert 14 million records into the table per day, and traffic is highest in the afternoon. For some days now we have been facing duplicate key errors on 'id', which in my opinion should not happen, since id is an auto-incremented field and we are not explicitly passing id in the insert query. Following are the details:
logging (MyISAM)
----------------------------------------
| id | int(20) |
| virtual_user_id | varchar(1000) |
| visited_page | varchar(255) |
| /* More such columns are there */ |
----------------------------------------
Please let me know what the problem is here. Is keeping the table in MyISAM a problem here?
Problem 1: size of your primary key
http://dev.mysql.com/doc/refman/5.0/en/integer-types.html
The max value of an INT, regardless of the display width you give it, is 2147483647, or twice that if unsigned.
At 14 million inserts a day, that means you hit that limit after roughly 153 days.
To prevent that you might want to change the datatype to an unsigned BIGINT.
Or, for even more ridiculously large volumes, even a unix timestamp + microtime as a composite key. Or a different DB solution altogether.
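For example (assuming id is the primary key, as the question implies):
-- BIGINT UNSIGNED removes the practical ceiling. On a table taking 14M
-- inserts a day this ALTER rebuilds and locks the table, so schedule it.
ALTER TABLE logging
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;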
Problem 2: the actual error
It might be concurrency, even though I don't find that very plausible.
You'll have to provide the insert IDs / errors for that. Do you use transactions?
Another possibility is a corrupt table.
I don't know your MySQL version, but this might work:
CHECK TABLE tablename
See if that has any complaints.
REPAIR TABLE tablename
General advice:
Is this a sensible amount of data to be inserting into a database, and doesn't it slow everything down too much anyhow?
I wonder how your DB performs with locking during, for example, a delete or an ALTER TABLE.
The right way to do it totally depends on the goals and requirements of your system which I don't know, but here's an idea:
Log lines to a log file. Import the log files at your own pace. Don't bother your visitors with errors or delays when your DB is having trouble or when you need to do some big operation that locks everything.