I have a table that stores messages from one user to another: messages(user_id, friend_id, message, created_date). My primary key is (friend_id, created_date). This prevents duplicate messages (AFAIK) because duplicates will fail to insert.
Right now this is OK because my code generates about 20 of these queries at a time per user and I only have one user. But if there were hundreds or thousands of users, would this create a bottleneck in my database with all the failed transactions? And if so, what kinds of things could I do to improve the situation?
EDIT:
The boiled-down question is: should I use the primary key constraint, check for duplicates outside of MySQL, or use some other MySQL functionality to keep duplicates out of the database?
Should be fine, as MySQL will just do a primary key lookup internally and ignore the record (I'm assuming you're using INSERT IGNORE). If you were checking whether they exist before inserting, MySQL would still check again when you insert. This means that if most inserts are going to succeed, you're saving an extra check. If the vast majority of inserts were failing (not likely), then possibly the savings from not sending unnecessary data would outweigh the occasional repeated check.
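For reference, roughly what that looks like; the column types below are assumptions, since the question only gives the column names and the (friend_id, created_date) primary key:

-- Assumed DDL; only the column names and the (friend_id, created_date) primary key come from the question.
CREATE TABLE messages (
    user_id      INT      NOT NULL,
    friend_id    INT      NOT NULL,
    message      TEXT     NOT NULL,
    created_date DATETIME NOT NULL,
    PRIMARY KEY (friend_id, created_date)
) ENGINE=InnoDB;

-- A duplicate (friend_id, created_date) pair is silently skipped instead of raising an error.
INSERT IGNORE INTO messages (user_id, friend_id, message, created_date)
VALUES (1, 2, 'hello', '2013-01-01 12:00:00');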
I googled and searched on SO, but was not able to find an answer; maybe you could point me to some reference/docs?
This is more about understanding the way MySQL treats table contents while inserting.
I have a table (MyISAM) which has an auto-increment primary key 'autoid'. I am using a simple script to insert thousands of records. What I am trying to do is run multiple instances of this script (you can imagine it as being similar to accessing the script from different machines at the same time).
Is MySQL capable of handing out the auto-increment primary keys correctly without any further action from my side, or do I have to do some sort of table locking for each machine? Maybe I have to choose InnoDB over MyISAM?
What I am trying to achieve is: irrespective of how many machines are simultaneously triggering the script, all inserts should be completed without skipping any auto-increment id or throwing errors like "Duplicate Value for...".
Thanks a lot
The whole point of using a database is that it can handle situations like this transactionally. So yes, this scenario works fine on every commonly used DBMS, including MySQL.
How do you think the average forum would work with 50 users simultaneously posting replies to a topic, all from forked parallel Apache processes possibly only microseconds apart, or from multiple load-balanced web servers?
Internally the server just uses a mutex/semaphore, like any other process accessing and incrementing a shared resource (in this case the auto-increment value of a table), to mitigate the inherent race conditions.
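As a rough sketch (InnoDB assumed, table and column names made up), two sessions inserting at the same time each get their own id, and neither needs to lock anything explicitly:

-- Hypothetical table, just for illustration.
CREATE TABLE items (
    autoid  BIGINT AUTO_INCREMENT PRIMARY KEY,
    payload VARCHAR(255)
) ENGINE=InnoDB;

-- Session A and session B run at the same time; neither locks the table explicitly.
-- Session A:
INSERT INTO items (payload) VALUES ('from machine A');
-- Session B:
INSERT INTO items (payload) VALUES ('from machine B');

-- Each connection can read back the id it was handed; LAST_INSERT_ID() is per-connection.
SELECT LAST_INSERT_ID();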
I was running my Ruby scripts to load data into MySQL. One of them hit this error:
Mysql::Error: Duplicate entry '4444281482' for key 'PRIMARY'
My primary key is an auto-increment ID (BIGINT). I was running the script in multiple terminals (using screen) with different data, all loading into the same table. This problem never happened before, but when it happens, the scripts in all the terminals tend to suffer from it. The datasets are different. It seems to happen randomly.
What is likely to be the cause?
Why would there be duplicates in an auto-increment field?
You mention that you are running the script from different terminals with different data. According to the MySQL manual, and assuming your engine is InnoDB, when each statement inserts a different number of rows into a table with an AUTO_INCREMENT column, the engine may not know in advance how many rows will be inserted. This could possibly explain why you are receiving a duplicate key error. With a table-level AUTO-INC lock held to the end of the statement, only one INSERT statement can execute at a time and the generation of auto-increment numbers won't interleave.
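If you want to check which behaviour your server is using, the relevant (InnoDB-specific) setting is below; note it can only be changed in the server configuration, and its default and exact effect vary by MySQL version:

-- 0 = "traditional" (table-level AUTO-INC lock held to the end of the statement),
-- 1 = "consecutive", 2 = "interleaved"; the default depends on the MySQL version.
SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';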
I'm pretty sure I had this problem - it has nothing to do with the client (I mean it's reproducible in my app, Query Browser, the CLI client, etc.).
If you don't mind gaps in your id numbering, you can try
ALTER TABLE `tableName` AUTO_INCREMENT = 4444281492;
(of course you can bump it by more than 10 ids, say 100000, to be safe ;) and you can always revert the counter to its old value with the same query)
This will change your auto-increment counter to a greater number and skip past the problematic ids - although I have no idea what the cause of this issue is (in my case it persisted across mysqld restarts and even a full machine reboot).
Oh, and I should add - I did this on a dev server; if this is production, I would advise further investigation.
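If you want to see where the counter currently stands before bumping it (so you can revert it later with the same ALTER), something like this should work; 'tableName' is the same placeholder as above:

-- Current AUTO_INCREMENT counter for the table (INFORMATION_SCHEMA can lag slightly,
-- but it is good enough for picking a safe new starting value).
SELECT AUTO_INCREMENT
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'tableName';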
Which one is more reliable and has better performance: setting a MySQL unique key and using INSERT IGNORE, or first checking whether the data exists in the database and acting according to the result?
If the answer is the second one, is there any way to do it in a single SQL query instead of two?
UPDATE: I ask because my colleagues at the company where I work believe that such issues should be dealt with in the application layer, which according to them is more reliable.
Your application won't catch duplicates.
Two concurrent calls can insert the same data, because neither process sees the other while your application checks for uniqueness. Each process thinks it's OK to INSERT.
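Roughly, the race looks like this (hypothetical users table with no unique key on email; the timeline is the point, not the schema):

-- Hypothetical users table with NO unique key on email.
-- Session A:
SELECT id FROM users WHERE email = 'a@example.com';  -- no rows, so A decides to insert
-- Session B, a moment later (A's insert hasn't happened/committed yet):
SELECT id FROM users WHERE email = 'a@example.com';  -- also no rows, so B decides to insert too
-- Session A:
INSERT INTO users (email) VALUES ('a@example.com');
-- Session B:
INSERT INTO users (email) VALUES ('a@example.com');
-- Result: two rows for the same email. A UNIQUE KEY on email would have rejected the second insert.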
You can force some kind of serialisation, but then you have a bottleneck and a performance limit. And you will have other clients writing to the database, even if it is just a release script.
That is why there are such things as unique indexes and constraints generally. Foreign keys, triggers, check constraints, NULL/NOT NULL and datatype constraints are all there to enforce data integrity.
There is also the arrogance of some code monkey thinking they can do better.
See programmers.se: Constraints in a relational databases - Why not remove them completely? and this Enforcing Database Constraints In Application Code (SO)
Setting a unique key is better. It will reduce the number of round-trips to MySQL you'll have to do for a single operation, and item uniqueness is ensured, reducing errors caused by your own logic.
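For completeness, a minimal sketch of adding such a key; the table and column names are placeholders:

-- Placeholder names; the database now rejects duplicates no matter which client inserts them.
ALTER TABLE my_table ADD UNIQUE KEY uniq_item_code (item_code);

-- A duplicate item_code is then skipped in a single round-trip.
INSERT IGNORE INTO my_table (item_code, label) VALUES ('ABC-123', 'example');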
You definitely should set a unique key in your MySQL table, no matter what you decide.
As for the other part of your question, definitely use INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE if that is what you intend for your application.
I.e. if you're going to load a bunch of data and you don't care what the old data was - you just want the new data - that is the way to go.
On the other hand, if there is some sort of decision branch that is based on whether the change is an update or a new value, I think you would have to go with option 2.
I.e. if changes to the table are recorded in some other table (e.g. table: change_log with columns: id, table, column, old_val, new_val), then you couldn't just use INSERT IGNORE, because you would never be able to tell which values were changed vs. which were newly inserted.
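Here's a rough sketch of that second option, using the hypothetical change_log table above (all table names, columns and values are illustrative only):

-- Step 1: look up the current value (if any).
SELECT id, price FROM products WHERE sku = 'ABC-123';

-- Step 2a: no row came back -> it's a brand new value, plain insert.
INSERT INTO products (sku, price) VALUES ('ABC-123', 9.99);

-- Step 2b: a row came back with a different price -> record the change, then update.
INSERT INTO change_log (`table`, `column`, old_val, new_val)
VALUES ('products', 'price', '9.99', '10.99');
UPDATE products SET price = 10.99 WHERE sku = 'ABC-123';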
I have a system which has a complex primary key for interfacing with external systems, and a fast, small opaque primary key for internal use. For example: the external key might be a compound value - something like (given name (varchar), family name (varchar), zip code (char)) and the internal key would be an integer ("customer ID").
When I receive an incoming request with the external key, I need to look up the internal key - and here's the tricky part - allocate a new internal key if I don't already have one for the given external ID.
Obviously if I have only one client talking to the database at a time, this is fine. SELECT customer_id FROM customers WHERE given_name = 'foo' AND ..., then INSERT INTO customers VALUES (...) if I don't find a value. But, if there are potentially many requests coming in from external systems concurrently, and many may arrive for a previously unheard-of customer all at once, there is a race condition where multiple clients may try to INSERT the new row.
If I were modifying an existing row, that would be easy; simply SELECT FOR UPDATE first, to acquire the appropriate row-level lock, before doing an UPDATE. But in this case, I don't have a row that I can lock, because the row doesn't exist yet!
I've come up with several solutions so far, but each of them has some pretty significant issues:
Catch the error on INSERT, re-try the entire transaction from the top. This is a problem if the transaction involves a dozen customers, especially if the incoming data is potentially talking about the same customers in a different order each time. It's possible to get stuck in mutually recursive deadlock loops, where the conflict occurs on a different customer each time. You can mitigate this with an exponential wait time between re-try attempts, but this is a slow and expensive way to deal with conflicts. Also, this complicates the application code quite a bit as everything needs to be restartable.
Use savepoints. Start a savepoint before the SELECT, catch the error on INSERT, then roll back to the savepoint and SELECT again. Savepoints aren't completely portable, and their semantics and capabilities differ slightly and subtly between databases; the biggest difference I've noticed is that sometimes they seem to nest and sometimes they don't, so it would be nice if I could avoid them. This is only a vague impression though - is it inaccurate? Are savepoints standardized, or at least practically consistent? Also, savepoints make it difficult to do things in parallel on the same transaction, because you might not be able to tell exactly how much work you'll be rolling back, although I realize I might just need to live with that.
Acquire some global lock, like a table-level lock using a LOCK statement (Oracle, MySQL, Postgres). This obviously slows down these operations and results in a lot of lock contention, so I'd prefer to avoid it.
Acquire a more fine-grained, but database-specific lock. I'm only familiar with Postgres's way of doing this, which is very definitely not supported in other databases (the functions even start with "pg_") so again it's a portability issue. Also, postgres's way of doing this would require me to convert the key into a pair of integers somehow, which it may not neatly fit into. Is there a nicer way to acquire locks for hypothetical objects?
It seems to me that this has got to be a common concurrency problem with databases but I haven't managed to find a lot of resources on it; possibly just because I don't know the canonical phrasing. Is it possible to do this with some simple extra bit of syntax, in any of the tagged databases?
I'm not clear on why you can't use INSERT IGNORE, which will run without error, and you can check whether an insert occurred (via the affected-row count). If the insert "fails", then you know the key already exists and you can do a SELECT. You could do the INSERT first, then the SELECT.
Alternatively, if you are using MySQL, use InnoDB, which supports transactions. That would make it easier to roll back.
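A sketch of that INSERT-first pattern for the customers example, assuming a UNIQUE (or primary) key on (given_name, family_name, zip_code); the affected-rows check would normally be read from the client driver, but ROW_COUNT() shows the idea:

-- Try to create the customer; a duplicate external key is silently ignored.
INSERT IGNORE INTO customers (given_name, family_name, zip_code)
VALUES ('foo', 'bar', '12345');

-- 1 if the row was just created, 0 if it already existed.
SELECT ROW_COUNT();

-- Either way, the internal key can now be read back.
SELECT customer_id FROM customers
WHERE given_name = 'foo' AND family_name = 'bar' AND zip_code = '12345';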
Perform each customer's "lookup or maybe create" operations in autocommit mode, prior to and outside of the main, multi-customer transaction.
WRT generating an opaque primary key, there are a number of options, e.g. use a GUID or (at least with Oracle) a sequence. WRT ensuring the external key is unique, apply a unique constraint to the column(s). If the insert fails because the key exists, reattempt the fetch. You can also use an INSERT with WHERE NOT EXISTS or WHERE NOT IN. Use a stored procedure to reduce the round trips and improve performance.
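A sketch of the insert-where-not-exists idea (MySQL flavour shown, same hypothetical customers table as in the question); a unique constraint should still back it up, since two sessions can pass the NOT EXISTS check at the same time:

-- Only inserts when no row with this external key exists yet (MySQL syntax shown).
INSERT INTO customers (given_name, family_name, zip_code)
SELECT 'foo', 'bar', '12345' FROM DUAL
WHERE NOT EXISTS (
    SELECT 1 FROM customers
    WHERE given_name = 'foo' AND family_name = 'bar' AND zip_code = '12345'
);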
I don't have a testing environment for this yet. But before I think too much about solutions I'd like to know if people think this would be a problem.
I will have 10-20 Java processes connected to a MySQL db via JDBC. Each will be inserting unique records, all going to the same table. The rate of inserts will be on the order of thousands per second.
My expectation is that some process will attempt an insert and encounter a table lock while another process is inserting, and this will result in a JDBC exception and that insert failing.
Clearly if you increase the insert rate sufficiently there eventually will be a point where some buffer somewhere fills up faster than it can be emptied. When such a buffer hits its maximum capacity and can't contain any more data some of your insert statements will have to fail. This will result in an exception being thrown at the client.
However, assuming you have high-end hardware I don't imagine this should happen with 1000 inserts per second, but it does depend on the specific hardware, how much data there is per insert, how many indexes you have, what other queries are running on the system simultaneously, etc.
Regardless of whether you are doing 10 inserts per second or 1000 you still shouldn't blindly assume that every insert will succeed - there's always a chance that an insert will fail because of some network communication error or some other problem. Your application should be able to correctly handle the situation where an insert fails.
Use InnoDB as it supports reads and writes at the same time. MyISAM will usually lock the table during the insert, but give preference to SELECT statements. This can cause issues if you're doing reporting or visualization of the data while trying to do inserts.
If you have a natural primary key (no auto_increment), using it will help avoid some deadlock issues. (Newer versions have fixed this.)
http://www.mysqlperformanceblog.com/2007/09/26/innodb-auto-inc-scalability-fixed/
You might also want to see if you can queue your writes in memory and send them to the database in batches. Lots of little inserts will be much slower than doing batches in transactions.
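At the SQL level, batching just means something like this (hypothetical table; actual gains depend on hardware, indexes and configuration):

START TRANSACTION;

-- Hypothetical table; one round-trip and one commit cover many rows.
INSERT INTO measurements (sensor_id, reading, recorded_at) VALUES
    (1, 20.5, '2013-01-01 12:00:00'),
    (1, 20.7, '2013-01-01 12:00:01'),
    (2, 19.9, '2013-01-01 12:00:01');

COMMIT;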
Good presentation on getting the most out of the MySQL Connector/J JDBC driver:
http://assets.en.oreilly.com/1/event/21/Connector_J%20Performance%20Gems%20Presentation.pdf
What engine do you use? That can make a difference.
http://dev.mysql.com/doc/refman/5.5/en/concurrent-inserts.html