Maintaining a large table of unique values in MySQL

This is probably a common situation, but I couldn't find a specific answer on SO or Google.
I have a large table (>10 million rows) of friend relationships in a MySQL database. It is very important and needs to be maintained so that there are no duplicate rows. The table stores the users' uids. The SQL for the table is:
CREATE TABLE possiblefriends(
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
user INT,
possiblefriend INT)
The way the table works is that each user has around 1000 or so "possible friends" that are discovered and need to be stored, but duplicate "possible friends" need to be avoided.
The problem is that, due to the design of the program, I need to add 1 million rows or more to the table over the course of a day, and these rows may or may not be duplicates. The simple answer would seem to be to check each row to see if it is a duplicate and, if not, insert it into the table. But this technique will probably get very slow as the table grows to 100 million rows, 1 billion rows or more (which I expect it to soon).
What is the best (i.e. fastest) way to maintain this unique table?
I don't need to have a table with only unique values always on hand. I just need it once-a-day for batch jobs. In this case, should I create a separate table that just inserts all the possible rows (containing duplicate rows and all), and then at the end of the day, create a second table that calculates all the unique rows in the first table?
If not, what is the best way for this table long-term?
(If indexes are the best long-term solution, please tell me which indexes to use)

Add a unique index on (user, possiblefriend) then use one of:
INSERT ... ON DUPLICATE KEY UPDATE ...
INSERT IGNORE
REPLACE
to ensure that you don't get errors when you try to insert a duplicate row.
You might also want to consider if you can drop your auto-incrementing primary key and use (user, possiblefriend) as the primary key. This will decrease the size of your table and also the primary key will function as the index, saving you from having to create an extra index.
See also:
“INSERT IGNORE” vs “INSERT … ON DUPLICATE KEY UPDATE”
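Here is a minimal sketch of the composite-key approach, written in Python with the standard-library sqlite3 module so it runs without a MySQL server. That substitution is an assumption made for illustration: SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE, and the function name load_possible_friends and the sample batch are made up.

```python
import sqlite3

def load_possible_friends(pairs):
    """Insert (user, possiblefriend) pairs, silently skipping duplicates."""
    conn = sqlite3.connect(":memory:")
    # Composite primary key: the pair itself is the key, so no separate id column.
    conn.execute(
        "CREATE TABLE possiblefriends ("
        " user INTEGER NOT NULL,"
        " possiblefriend INTEGER NOT NULL,"
        " PRIMARY KEY (user, possiblefriend))"
    )
    # SQLite syntax; the MySQL equivalent is INSERT IGNORE INTO ...
    conn.executemany(
        "INSERT OR IGNORE INTO possiblefriends (user, possiblefriend) VALUES (?, ?)",
        pairs,
    )
    conn.commit()
    rows = conn.execute(
        "SELECT user, possiblefriend FROM possiblefriends"
        " ORDER BY user, possiblefriend"
    ).fetchall()
    conn.close()
    return rows

# Hypothetical batch containing two duplicate pairs.
batch = [(1, 2), (1, 3), (1, 2), (2, 1), (1, 3)]
unique_rows = load_possible_friends(batch)
```

The duplicate check happens inside the key lookup that the insert has to do anyway, so there is no separate SELECT per row.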

A unique index guarantees that each (user, possiblefriend) pair is unique. You can add one like so:
CREATE TABLE possiblefriends(
id INT NOT NULL AUTO_INCREMENT,
user INT,
possiblefriend INT,
PRIMARY KEY (id),
UNIQUE INDEX DefUserID_UNIQUE (user ASC, possiblefriend ASC));
This will also speed up lookups on those columns significantly.
Your other issue, the mass insert, is a little trickier. You could use the built-in ON DUPLICATE KEY UPDATE clause. For example, if column a is declared UNIQUE and already contains the value 1, the following two statements have a similar effect:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE t1 SET c=c+1 WHERE a=1;
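To see the insert-or-update behaviour end to end, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption made so the example is self-contained). SQLite writes the clause as ON CONFLICT(a) DO UPDATE and needs version 3.24 or later. The table name t1 and the helper upsert_counter are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column a carries the UNIQUE constraint that triggers the update path.
conn.execute("CREATE TABLE t1 (a INTEGER NOT NULL UNIQUE, b INTEGER, c INTEGER)")

def upsert_counter(a, b, c):
    # SQLite's ON CONFLICT ... DO UPDATE plays the role of
    # MySQL's ON DUPLICATE KEY UPDATE c = c + 1.
    conn.execute(
        "INSERT INTO t1 (a, b, c) VALUES (?, ?, ?) "
        "ON CONFLICT(a) DO UPDATE SET c = c + 1",
        (a, b, c),
    )

upsert_counter(1, 2, 3)   # no row with a=1 yet: plain insert
upsert_counter(1, 2, 3)   # a=1 exists: the existing row's c is incremented
rows = conn.execute("SELECT a, b, c FROM t1").fetchall()
```

After the two calls the table still holds a single row for a=1, with c incremented once.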

Related

What if `auto_increment` gaps caused by MySQL `INSERT...ON DUPLICATE KEY UPDATE` cannot be ignored?

While performing INSERT...ON DUPLICATE KEY UPDATE on InnoDB in MySQL, we are often told to ignore the potential gaps in auto_increment columns. What if such gaps are very likely and cannot be ignored?
As an example, suppose there is a table rating that stores users' ratings of items. The table schema is something like
CREATE TABLE rating (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
item_id INT NOT NULL,
rating INT NOT NULL,
UNIQUE KEY tuple (user_id, item_id),
FOREIGN KEY (user_id) REFERENCES user(id),
FOREIGN KEY (item_id) REFERENCES item(id)
);
There may be many users and many items, and users may frequently change ratings they have already given. Every time a rating is changed, INSERT...ON DUPLICATE KEY UPDATE creates a gap. The alternatives are to query twice (a SELECT first), which hurts performance, or to check the affected-row count, which does not work for multi-row INSERTs.
For a system where 100K users have each rated 10 items and change half of their ratings every day, the auto_increment id will be exhausted within two years. What should we do in practice to prevent this?
Full answer.
Gaps are OK! Just use a bigger id type, for example BIGINT. Don't try to reuse gaps; that is a bad idea. Don't worry about performance or optimization here. It's a waste of time.
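To put rough numbers on "use a bigger id type", here is a back-of-the-envelope sketch in Python. The rate of 500,000 consumed ids per day is an assumption (100K users each changing 5 ratings a day, one id per change); plug in your own figure. The maximum values are those of MySQL's signed INT, unsigned INT and unsigned BIGINT.

```python
# Largest values each MySQL integer type can hold.
INT_SIGNED_MAX = 2**31 - 1
INT_UNSIGNED_MAX = 2**32 - 1
BIGINT_UNSIGNED_MAX = 2**64 - 1

def years_until_exhausted(max_id, ids_per_day):
    """Years until an auto_increment counter reaches max_id at a steady rate."""
    return max_id / ids_per_day / 365

# Assumed consumption rate: 100K users x 5 rating changes per day.
ids_per_day = 100_000 * 5

years_signed_int = years_until_exhausted(INT_SIGNED_MAX, ids_per_day)
years_unsigned_int = years_until_exhausted(INT_UNSIGNED_MAX, ids_per_day)
years_unsigned_bigint = years_until_exhausted(BIGINT_UNSIGNED_MAX, ids_per_day)
```

At that assumed rate a signed INT lasts on the order of a decade, while an unsigned BIGINT is effectively inexhaustible. A higher rate shortens both proportionally.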
Another solution is to make a composite key the primary key. In your case, you can remove the id field and use the pair (user_id, item_id) as the primary key.
For a "rating" table, the most frequent queries are inserts and "delete by user_id", so you don't really need the id primary key for functionality. You do, however, always need some primary key on the table.
The only drawback of this method is that when you want to delete just one row from the table, you will need a query like:
DELETE FROM rating WHERE user_id = 123 AND item_id=1234
instead of old
DELETE FROM rating WHERE id = 123
But in this case it isn't hard to change one line of code in your application. Furthermore, in most cases people don't need such functionality.
We work with large tables; some of ours hold hundreds of millions of records. We repeatedly use INSERT IGNORE or INSERT ... ON DUPLICATE KEY. Making the column an unsigned BIGINT will avoid the id issue.
But I would suggest you think about a long-term solution as well, with some known facts in mind:
SELECT followed by INSERT/UPDATE is quite often faster than INSERT ... ON DUPLICATE KEY, depending on your data size and other factors.
If you have two unique keys (or one primary and one unique key), the outcome of your query may not be predictable, and it can cause replication errors if you use statement-based replication.
The id is not the only issue with large tables. In our experience, once a table grows past roughly 300M records, performance degrades drastically. You will need to think about partitioning, clustering, or sharding your database or tables fairly soon.
Personally, I would suggest not using INSERT ... ON DUPLICATE KEY. Read extensively about its usage and performance impact if you are planning a highly scalable service.

MySQL "Insert ... On Duplicate Key" with more than one unique key

I've been reading up on how to use MySQL insert on duplicate key to see if it will allow me to avoid Selecting a row, checking if it exists, and then either inserting or updating. As I've read the documentation however, there is one area that confuses me. This is what the documentation says:
If you specify ON DUPLICATE KEY UPDATE, and a row is inserted that would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row is performed
The thing is, I don't know if this will work for my problem, because the 'condition' I have for not inserting a new row is the existence of a row that has two columns equal to certain values, not necessarily that the primary key is the same. Right now the syntax I'm imagining is this, but I don't know if it will always insert instead of update:
INSERT INTO attendance (event_id, user_id, status) VALUES(some_event_number, some_user_id, some_status) ON DUPLICATE KEY UPDATE status=1
The thing is, event_id and user_id aren't primary keys, but if a row in the table 'attendance' already has those columns with those values, I just want to update it. Otherwise I would like to insert it. Is this even possible with ON DUPLICATE? If not, what other method might I use?
The quote includes "a duplicate value in a UNIQUE index". So, your values do not need to be the primary key:
create unique index attendance_eventid_userid on attendance(event_id, user_id);
Presumably, you want to update the existing record because you don't want duplicates. If you want duplicates sometimes, but not for this particular insert, then you will need another method.
If I were you, I would make a primary key out of event_id and user_id. That will make this extremely easy with ON DUPLICATE.
SQLFiddle
create table attendance (
event_id int,
user_id int,
status varchar(100),
primary key(event_id, user_id)
);
Then with ease:
insert into attendance (event_id, user_id, status) values(some_event_number, some_user_id, some_status)
on duplicate key
update status = values(status);
Maybe you can try to write a trigger that checks if the pair (event_id, user_id) exists in the table before inserting, and if it exists just update it.
To the broader question of "Will INSERT ... ON DUPLICATE respect a UK even if the PK changes", the answer is yes: SQLFiddle
In this SQLFiddle I insert a new record, with a new PK id, but its values would violate the UK. It performs the ON DUPLICATE and the original PK id is preserved, but the non-UK ON DUPLICATE KEY UPDATE value changes.
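Since the fiddle itself is not reproduced here, the following is a rough stand-in for the same experiment, using Python's sqlite3 module rather than MySQL (an assumption for portability; SQLite's ON CONFLICT ... DO UPDATE, available from version 3.24, is used in place of ON DUPLICATE KEY UPDATE). The helper set_status and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attendance ("
    " id INTEGER PRIMARY KEY,"          # surrogate primary key
    " event_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " status TEXT,"
    " UNIQUE (event_id, user_id))"      # the unique key that drives the upsert
)

def set_status(event_id, user_id, status):
    # excluded.status is the value from the attempted insert,
    # comparable to VALUES(status) in the MySQL statement above.
    conn.execute(
        "INSERT INTO attendance (event_id, user_id, status) VALUES (?, ?, ?) "
        "ON CONFLICT(event_id, user_id) DO UPDATE SET status = excluded.status",
        (event_id, user_id, status),
    )

set_status(10, 7, "invited")     # new pair: inserted
set_status(10, 7, "attending")   # same pair: existing row is updated
set_status(11, 7, "invited")     # different event: inserted
rows = conn.execute(
    "SELECT id, event_id, user_id, status FROM attendance ORDER BY id"
).fetchall()
```

The row for (10, 7) keeps its original id and only its status changes, which matches the behaviour described for the unique key.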

On Duplicate Update does not work for unique index

My SQL Table I am trying to insert/update has this definition:
CREATE TABLE `place`.`a_table` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`some_id` bigint(20) NOT NULL,
`someOther_id` bigint(20) NOT NULL,
`some_value` text,
`re_id` bigint(20) NOT NULL DEFAULT '0',
`up_id` bigint(20) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `some_id_key` (`some_id`),
KEY `some_id_index1` (`some_id`,`someOther_id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4;
As you can see, some_id and someOther_id share an index.
I am trying to perform my insert/update statement like the following:
INSERT INTO `a_table` (`re_id`,`some_id`,`someOther_id`,`up_id`,`some_value`) VALUES
(100,181,7,101,'stuff in the memo wow') On DUPLICATE KEY UPDATE
`up_id`=101,`some_value`='sampleValues'
I expect since I did not specify the id, that it will fall back onto the index key (some_id_index1) as the insert/update rule. However, it is only Inserting.
Obviously this is incorrect. What am I doing wrong here?
OK, firstly to help you ask better questions. The literal answer to your question "What am I doing wrong here?" is nothing.
You have a table with an auto-increment primary key and two non-unique secondary indexes. You are inserting rows to that table without specifying the value of the primary key, so MySQL will use the auto-increment rule to give it a unique key, and therefore the ON DUPLICATE KEY condition will never trigger. In fact one could argue that it is redundant.
So the question you have to ask yourself is what do you think should happen. Now Stack Overflow is not a forum, so don't come back adding comments to this question trying to clarify your original question. Instead formulate a new question that makes it clear to those trying to answer exactly what it is that you are asking.
As I see it, there are a number of possibilities:
You want to modify the secondary indexes to be unique. Once you do that, a duplicate will trigger the ON DUPLICATE KEY rule and the statement will switch to updating.
You actually don't want the auto-increment column at all, and some_id should actually be your primary key.
You don't understand how database indexes work. Specifically, at least one of your secondary indexes is likely unnecessary, as the database can typically combine several indexes or use a prefix of a composite index to optimize your queries. The second secondary index would therefore be more useful as an index on only the someOther_id field, unless there is a specific uniqueness constraint that you are enforcing. For example, if you expect multiple rows with the same some_id but only ever one row with a given someOther_id for any given some_id, then the second secondary index would be required. In that case the first would not be, as the database can use the leading column of the second to achieve the same performance optimizations.
I suggest you sit down with a pen and a piece of paper, away from your computer, and try to write down exactly what it is that you want to do in such a way that [pick one of: your grandmother; an eleven year old] can understand. Scrunch up the piece of paper and throw it away until you can write it down in one go without making any mistakes. Then return to the computer and try to code what you have just written.
99 times out of 100 you will actually find that this helps you solve your problem without the need to ask others questions, because 99 times out of 100 our problems are due to our own lack of understanding of the problem itself. Trying to (virtually) explain your problem to either your grandmother or an eleven year old forces you to throw away some assumptions that are blinding you and get to the core of the problem real fast before you hit the eyes glaze over look when they stop paying attention. [I am not saying you actually pair-program with your grandmother/an eleven year old]
Here is one example of such a problem statement that I have imagined for you. It is likely incorrect as I do not know what your specific problem is:
We need a table that provides cross-reference notes about two other tables.
There are different types of cross-reference (we use the column 're_id' to
identify the type of cross-reference) and there are different types of notes
(we use the columns 'up_id' as well as 'some_value' to store the actual notes).
The cross-reference is indicated with two columns 'some_id' which is the id
of the row in the 'some' table and 'someOther_id' which is the id of the row
in the 'someOther' table. There can be only one cross-reference between any one
row in the 'some' table and any one row in the 'someOther' table, but there can
be multiple cross-references from one specific row in the 'some' table to different
rows in the 'someOther' table.
With the above problem statement I would switch the primary key from an auto-increment to instead be a two column primary key on (some_id,someOther_id) and remove all the secondary keys!
But I hope you realize that your actual solution actually is likely different as your problem statement will be different from my guess.
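The first possibility above (making the secondary index unique) can be checked with a small experiment. This sketch uses Python's sqlite3 module as a stand-in for MySQL, and INSERT OR IGNORE instead of ON DUPLICATE KEY UPDATE, both of which are assumptions made to keep it short and runnable. The point it illustrates is that duplicate handling of any kind only fires when a PRIMARY KEY or UNIQUE index is violated. The helper name is made up.

```python
import sqlite3

def rows_after_two_identical_inserts(unique):
    """Insert the same (some_id, someOther_id) pair twice and count the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE a_table ("
        " id INTEGER PRIMARY KEY,"
        " some_id INTEGER NOT NULL,"
        " someOther_id INTEGER NOT NULL,"
        " some_value TEXT)"
    )
    # Same columns either way; only the uniqueness of the index differs.
    kind = "UNIQUE INDEX" if unique else "INDEX"
    conn.execute(f"CREATE {kind} some_id_index1 ON a_table (some_id, someOther_id)")
    for value in ("first", "second"):
        conn.execute(
            "INSERT OR IGNORE INTO a_table (some_id, someOther_id, some_value)"
            " VALUES (181, 7, ?)",
            (value,),
        )
    count = conn.execute("SELECT COUNT(*) FROM a_table").fetchone()[0]
    conn.close()
    return count

with_plain_index = rows_after_two_identical_inserts(unique=False)
with_unique_index = rows_after_two_identical_inserts(unique=True)
```

With the plain index both inserts succeed, because nothing is violated. With the unique index the second insert is recognised as a duplicate.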
From the MySQL documentation
If a table contains an AUTO_INCREMENT column and INSERT ... UPDATE inserts a row, the LAST_INSERT_ID() function returns the AUTO_INCREMENT value. If the statement updates a row instead, LAST_INSERT_ID() is not meaningful. However, you can work around this by using LAST_INSERT_ID(expr). Suppose that id is the AUTO_INCREMENT column. To make LAST_INSERT_ID() meaningful for updates, insert rows as follows:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id), c=3;
The same query works fine; just change columns_value in the ON DUPLICATE KEY UPDATE clause to the actual column name, some_value.
fiddle demo
The problem is that the table has no UNIQUE constraint. The only UNIQUE constraint on your table should be defined over the column up_id.
I just altered your table to add a unique constraint on the column up_id:
Fiddle_demo

Faster selects when using GUID vs. Select WHERE?

I have a table with thousands of records. I do a lot of selects like this to find if a person exists.
SELECT * from person WHERE personid='U244A902'
Because the person ID is not pure numerical, I didn't use it as the primary key and went with auto-increment. But now I'm rethinking my strategy, because I think SELECTS are getting slower as the table fills up. I'm thinking the reason behind this slowness is because personid is not the primary key.
So my question, if I were to go through the trouble of restructuring the table and use the personid as the primary key instead, without an auto-increment, would that significantly speed up the selects? I'm talking about a table that has 200,000 records now and will fill up to about 5 million when done.
The slowness is due indirectly to the fact that the personid is not a primary key, in that it isn't indexed because it wasn't defined as a key. The quickest fix is to simply index it:
CREATE UNIQUE INDEX `idx_personid` ON `person` (`personid`);
However, if it is a unique value, it should be the table's primary key. There is no real need for a separate auto_increment key.
ALTER TABLE person DROP the_auto_increment_column;
ALTER TABLE person ADD PRIMARY KEY (personid);
Note however, that if you were also using the_auto_increment_column as a FOREIGN KEY in other tables and dropped it in favor of personid, you would need to modify all your other tables to use personid instead. The difficulty of doing so may not be completely worth the gain for you.
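One way to confirm that the index is what makes the difference is to compare query plans before and after creating it. The sketch below does this with Python's sqlite3 module and EXPLAIN QUERY PLAN, which is an assumption made for portability. In MySQL you would run EXPLAIN on the same SELECT instead. The sample data and the helper plan are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, personid TEXT NOT NULL, name TEXT)"
)
# Hypothetical sample rows with non-numeric person IDs.
conn.executemany(
    "INSERT INTO person (personid, name) VALUES (?, ?)",
    [(f"U{n:07d}", f"person {n}") for n in range(1000)],
)

def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM person WHERE personid = ?"
plan_before = plan(query, ("U0000500",))   # no index on personid: full scan
conn.execute("CREATE UNIQUE INDEX idx_personid ON person (personid)")
plan_after = plan(query, ("U0000500",))    # lookup through idx_personid
```

Before the index exists the plan is a scan of the whole table. Afterwards it is a search that names idx_personid.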
You can create an index on personid:
CREATE INDEX id_index ON person(personid);
ALTER TABLE `person` ADD INDEX `index1` (`personid`);
Try to index the columns that you use in WHERE clauses or select on.

How do I mitigate duplicate row inserts based on a non-key column?

I need to import data from one MySQL table into another. The old table has a different, outdated structure (which isn't terribly relevant). That said, I'm appending a field to the new table called "import_id" which saves the original id from the old table in order to prevent duplicate imports of the old records.
My question now is, how do I actually prevent duplicates? Due to the parallel rollout of the new system with the old, the import will unfortunately need to be run more than once. I can't make the "import_id" field PK/UNIQUE because it will have null values for fields that do not come from the old table, thereby throwing an error when adding new fields. Is there a way to use some type of INSERT IGNORE on the fly for an arbitrary column that doesn't natively have constraints?
The more I think about this problem, the more I think I should handle it in the initial SELECT. However, I'd be interested in quality mechanisms by which to handle this in general.
Best.
You should be able to create a unique key on the import_id column and still specify that column as nullable: a MySQL UNIQUE index permits multiple NULL values in a nullable column. It is only primary key columns that must be specified as NOT NULL.
That said, on the new table you could specify a unique key on the nullable import_id column and then handle any duplicate key errors when inserting from the old table into the new table using ON DUPLICATE KEY.
Here's a basic worked example of what I'm driving at:
create table your_table
(id int unsigned primary key auto_increment,
someColumn varchar(50) not null,
import_id int null,
UNIQUE KEY `importIdUidx1` (import_id)
);
insert into your_table (someColumn,import_id) values ('someValue1',1) on duplicate key update someColumn = 'someValue1';
insert into your_table (someColumn) values ('someValue2');
insert into your_table (someColumn) values ('someValue3');
insert into your_table (someColumn,import_id) values ('someValue4',1) on duplicate key update someColumn = 'someValue4';
where the first and last inserts represent inserts from the old table and the 2nd and 3rd represent inserts from elsewhere.
Hope this helps and good luck!
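For anyone who wants to try the nullable-unique-key idea without a MySQL server, here is a rough equivalent of the worked example above using Python's sqlite3 module. This is an assumption for illustration: SQLite, like MySQL, allows many NULLs under a UNIQUE constraint, and its ON CONFLICT ... DO UPDATE (version 3.24 or later) stands in for ON DUPLICATE KEY UPDATE. The helper import_row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table ("
    " id INTEGER PRIMARY KEY,"
    " someColumn TEXT NOT NULL,"
    " import_id INTEGER UNIQUE)"   # nullable, but unique when not NULL
)

def import_row(some_column, import_id=None):
    # Rows with import_id NULL never conflict; a repeated non-NULL
    # import_id updates the existing row instead of inserting a new one.
    conn.execute(
        "INSERT INTO your_table (someColumn, import_id) VALUES (?, ?) "
        "ON CONFLICT(import_id) DO UPDATE SET someColumn = excluded.someColumn",
        (some_column, import_id),
    )

import_row("someValue1", 1)   # from the old table
import_row("someValue2")      # new record, no import_id
import_row("someValue3")      # another new record, also NULL
import_row("someValue4", 1)   # re-import of old id 1: updates, no duplicate
rows = conn.execute(
    "SELECT someColumn, import_id FROM your_table ORDER BY id"
).fetchall()
```

The two rows without an import_id coexist, and the repeated import of id 1 ends up as a single row holding the latest value.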