MySQL: string length and performance of REPLACE

I have, say, 10,000 records to REPLACE into a large table with dates, varchars, integers, etc. (Most will effectively be inserts, but some rows need to be replaced, so I must use REPLACE.) The table also has a primary key and several other indexes. I build an SQL string of rows to replace. If performance is my only concern, what, generally, is the optimum number of rows to insert at once? I assume it is not 1. Is it 10, 500, 1,000, etc. at a time?
Does the length of the string matter for performance?

I recommend 100-1000 at a time; that will run about 10 times as fast as one at a time. Anything beyond 100-1000 gets into "diminishing returns" territory. (That is, you won't get much more improvement.)
But... There are other things of note.
REPLACE is DELETE (which might delete zero rows), plus INSERT. This is slower than INSERT ... ON DUPLICATE KEY UPDATE ..., which either INSERTs or changes whatever you say to change (which could be all the columns).
If you are using AUTO_INCREMENT... With REPLACE you are throwing away any existing ids and creating new ones. In the long run, you could have a problem with running out of ids.
If you don't want to waste any AUTO_INCREMENT ids, let's talk further.
If replication is involved, 100 is better than 1000; let me know if you want to know why.
The length of the string is limited by max_allowed_packet, which defaults to only a few MB on many versions; check your server's setting and raise it if you need bigger batches.
With InnoDB, simply use autocommit=1 to get each IODKU committed as you go.
Summary: For performance, use IODKU and build the string until you approach max_allowed_packet or reach roughly 1000 rows, whichever comes first.
Much of this is mentioned in Rick's Rules of Thumb.
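For illustration, here is a minimal sketch of such a batched IODKU; the table t and the columns id, a, b are hypothetical stand-ins for your own schema:

-- one batch: inserts new ids, updates existing ones, in a single round trip
INSERT INTO t (id, a, b) VALUES
  (1, 'x', 10),
  (2, 'y', 20),
  (3, 'z', 30)        -- ...keep appending rows until ~1000 rows or the packet limit nears
ON DUPLICATE KEY UPDATE
  a = VALUES(a),
  b = VALUES(b);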

How to fix values missed by MySQL auto_increment

I have a MySQL table to which I added another column "Number" that is auto_incremented and has a UNIQUE KEY constraint.
There are 17,000+ records in the table. After adding the "Number" column, one value is missing: there is a value of 14,369 and the next one is 14,371.
I tried removing the column and adding it again, but the missing value is still missing.
What might be the problem, and what is the least painful way to solve this?
There is no problem and there is nothing to fix.
MySQL's auto_increment provides unique values, and it calculates them using a sequential increment algorithm (it just increments a number).
That algorithm is a fast and reliable way of generating unique values.
That's its job. It doesn't "reuse" numbers, and forcing it to do so would come with disastrous performance and stability costs.
Since queries do fail sometimes, these numbers get "lost" and you can't have them back.
If you require sequential numbers for whatever reason, create a procedure or scheduled event and maintain the numbers yourself.
You have to bear in mind that MySQL is a transactional database designed to operate under concurrent access. If it were to reuse these numbers, the performance would be abysmal since it'd have to use locks and force people to wait until it reorganizes the numbers.
The InnoDB engine, the default engine, uses primary key values to organize records on disk. If you were to change any of those values, it would start rewriting the records, incurring a huge I/O wait that depends on the amount of data on the disk; it could bring the whole server to a grinding halt.
TL;DR: there is no problem, there is nothing to fix, don't do it. If you persist, expect abnormal behavior.
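If you do decide to maintain gap-free numbers yourself, as suggested above, one minimal sketch is to keep them in a separate column and renumber it periodically from a procedure or scheduled event; mytable and display_number are hypothetical names, and id is the real auto_increment key:

SET @n := 0;
UPDATE mytable
SET display_number = (@n := @n + 1)   -- renumber 1, 2, 3, ... in id order
ORDER BY id;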

Maintaining a list of unique values in a database

Let's say you have a random number generator spitting out numbers between 1 and 100,000,000 and you want to store them in a database (MySQL) with the timestamp when they were generated. If a number that has previously been seen comes up, it is discarded.
What would be the best algorithm to make this happen? SELECT then INSERT as necessary? Is there something more efficient?
You can go for a SEQUENCE:
Pros:
no relations are being locked, thus best performance;
no race conditions;
portable.
Cons:
it is possible to get "gaps" in the series of numbers.
You can do a SELECT ... then INSERT ...:
Pros:
no gaps, and you can also do some complicated math on your numbers.
Cons:
it's possible for another parallel session to get in between the SELECT and the INSERT and end up with 2 equal numbers;
if there's a UNIQUE constraint, the previous situation will lead to an exception;
to avoid such situations you might go for explicit table locks, but this will cause an immediate performance impact.
You can choose INSERT ... ON DUPLICATE KEY UPDATE, which by now seems to be the best option (take a look at "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"), at least in my view, with the only drawback that it is not portable to other RDBMSes.
P.S. This article is not related to MySQL, but it is worth reading to get an overview of all the catches that can happen along the way.
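As a concrete sketch of the UNIQUE-key / IODKU option in MySQL (the table name extractions and column number match the PHP example further down; created_at is an assumed name for the timestamp column):

CREATE TABLE extractions (
  number     INT UNSIGNED NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (number)
);

-- duplicates are silently discarded:
INSERT IGNORE INTO extractions (number) VALUES (12345);

-- or, keeping the original timestamp while swallowing the duplicate:
INSERT INTO extractions (number) VALUES (12345)
ON DUPLICATE KEY UPDATE number = number;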
If you don't need to insert a new random value every time you can use INSERT IGNORE or REPLACE INTO. Otherwise you should SELECT to check and then INSERT.
This would normally be solved by creating a unique index on the random number column in the table. You could experiment to see if a b-tree versus a hash has better performance.
If you have lots of memory, you could pre-populate a table with 100,000,000 rows -- all possible values. Then, when you look to see if something has already been created, you only need to see if the timestamp is non-null. However, this would require over a gigabyte of RAM to store the table in memory, and would only be the optimal solution if you are trying to maximize transactions per second.
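A sketch of that pre-populated approach, assuming a hypothetical table numbers(n, seen_at) already filled with every value from 1 to 100,000,000 and seen_at initially NULL:

UPDATE numbers
SET seen_at = NOW()
WHERE n = 12345
  AND seen_at IS NULL;

-- ROW_COUNT() = 1 means the number was fresh; 0 means it had already been seen.
SELECT ROW_COUNT();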
If you put a UNIQUE index on the column with the extracted numbers, any INSERT attempting to duplicate a UNIQUE key will fail.
Therefore the easiest and most portable version will be (PHP code, but you get the idea):
function extraction() {
    do {
        $random = generate_random_number();
        // @ suppresses the duplicate-key error; the loop retries until the INSERT succeeds
        $result = @mysql_query("INSERT INTO extractions(number) VALUES ($random)");
    } while (!$result);
    return $random;
}

MySQL: improving the speed of ORDER BY statements

I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, ordering by any single field, with any number of qualifiers. There's a big slowdown in the sort, which generally takes several seconds, sometimes as much as 7-10 seconds.
An example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing what fields are going to be sorted on, I'd have to index all 200 columns - what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to each column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes, most of them would probably never be used while still generating quite a large overhead (see chapter 7.3 in the link below for details). What you could do, however, is try to identify which columns appear most often in WHERE clauses, and index those.
In the long run, however, you will probably need to find a way to rework your data structure into something more manageable, because as it is now it has the smell of a 'spreadsheet turned into a database', which is not a nice smell.
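For example, if b and c from the sample query above turn out to be the most common filters, an index like this might help (your_table and the index name are hypothetical):

ALTER TABLE your_table ADD INDEX idx_b_c (b, c);

Whether the optimizer can actually use it for a given query (especially one with OR conditions like the example) is worth checking with EXPLAIN.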
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
In general the answer is yes. However, the actual details depend on your hardware, OS and what storage engine you use. See chapter 7.11 (especially 7.11.4) in the link below.
Also, the primary identifier is a crazy pattern they came up with in the nineties. [...] I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, into which you could store numeric values that map the PK order (precalculated from the stored procedure you're using).
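A rough sketch of that idea; your_table, legacy_id and primarySortOrder are hypothetical names, and the CAST/SUBSTRING expression is only a placeholder for whatever the existing stored procedure actually computes:

ALTER TABLE your_table
  ADD COLUMN primarySortOrder BIGINT,
  ADD INDEX idx_primary_sort (primarySortOrder);

-- populate it once (and keep it updated on insert/update):
UPDATE your_table
SET primarySortOrder = CAST(SUBSTRING(legacy_id, 5) AS UNSIGNED);

-- the default listing then becomes a cheap indexed sort:
SELECT legacy_id FROM your_table ORDER BY primarySortOrder DESC LIMIT 25;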
And the link you've been waiting for: Chapter 7 from the MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.

UPDATE vs INSERT performance

Am I correct to assume that an UPDATE query takes more resources than an INSERT query?
I am not a database guru, but here are my two cents:
Personally I don't think there is much you can do in this regard: even if INSERT were faster (which remains to be proven), can you convert an update into an insert?! Frankly, I don't think you can do it every time.
During an INSERT you don't have to use a WHERE clause to identify which row to change, but depending on the indexes on that table the operation can still have some cost.
During an UPDATE, if you do not change any column included in any index, execution can be quick, provided the WHERE clause is easy and fast enough.
Nothing is written in stone, and really I would imagine it depends on the whole database setup, indexes and so on.
Anyway, found this one as a reference:
Top 84 MySQL Performance Tips
If you plan to perform large-scale processing (such as rating or billing for a cellular company), this question has a huge impact on system performance.
Performing large-scale updates vs. creating new tables and indexes reduced my company's billing process from 26 hours to 1 hour!
I tried it on 2 million records for 100,000 customers.
I first created the billing table, and then for every customer's call summary I updated the billing table with the duration, price, discount... a total of 10 fields.
In the second option I created 4 phases.
Each phase reads the previous table(s), creates indexes (after the insert into the table completes) and, using "insert into ... select ...", creates the next table for the next phase.
Summary
Although the second alternative requires much more disk space (all views and temporary tables are deleted at the end), there are 3 main advantages to this option:
It was 4 times faster than option 1.
If there was a problem in the middle of the process, I could restart from the point where it failed, as all the tables from the beginning of the phase were ready. If the process fails with the first option, you need to start the whole process all over again.
This made the development and QA work much faster, as they could work in parallel.
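A condensed sketch of one such phase, with entirely hypothetical table and column names, just to show the shape of the approach:

-- Phase N: build the next table from the previous one
CREATE TABLE billing_phase2 (
  customer_id    INT NOT NULL,
  total_duration INT NOT NULL,
  price          DECIMAL(10,2) NOT NULL
);

INSERT INTO billing_phase2 (customer_id, total_duration, price)
SELECT customer_id, SUM(duration), SUM(duration) * 0.02
FROM billing_phase1
GROUP BY customer_id;

-- index only after the bulk insert has completed:
ALTER TABLE billing_phase2 ADD INDEX idx_customer (customer_id);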
The key resource here is disk access (IOPS, to be precise), and we should evaluate which one results in the minimum of that.
I agree with others that it is impossible to give a generic answer, but here are some thoughts to lead you in the right direction. Assume a simple key-value store where the key is indexed. Insertion means inserting a new key, and update means updating the value of an existing key.
If that is the case (a very common case), an update would be faster than an insertion, because the update involves an indexed lookup and changing an existing value without touching the index. You can assume that is one disk read to get the data and possibly one disk write. On the other hand, an insertion would involve two disk writes, one for the index and one for the data. But another hidden cost is the B-tree node splitting and new node creation, which happen in the background during insertion, leading to more disk access on average.
You cannot compare an INSERT and an UPDATE in general. Give us an example (with schema definition) and we will explain which one costs more and why. Also, you can compare a concrete INSERT and a concrete UPDATE by checking their plans and execution times.
Some rules of thumb, though:
if you update only one field, which is not indexed, and you update only one record, and you use the rowid/primary key to find that record, then this UPDATE will cost less than
an INSERT which also affects only one row but where that row has many NOT NULL constrained, indexed fields, and all those indexes have to be maintained (e.g. adding a new leaf).
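As mentioned above, the only reliable way is to look at a concrete pair of statements. A minimal sketch, assuming MySQL 5.6+ (where EXPLAIN also works for DML) and a hypothetical table t1(a, b, c) with a unique key on a:

EXPLAIN UPDATE t1 SET c = c + 1 WHERE a = 1;
EXPLAIN INSERT INTO t1 (a, b, c) VALUES (1, 2, 3);

-- or simply time both against a realistic copy of the data:
SET profiling = 1;
UPDATE t1 SET c = c + 1 WHERE a = 1;
INSERT INTO t1 (a, b, c) VALUES (1, 2, 3);
SHOW PROFILES;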
It depends. A simple UPDATE that uses a primary key in the WHERE clause and updates only a single non-indexed field would likely be less costly than an INSERT on the same table. But even that depends on the database engine involved. An UPDATE that involved modifying many indexed fields, however, might be more costly than the INSERT on that table because more index key modifications would be required. An UPDATE with a poorly constructed WHERE clause that required a table scan of millions of records would certainly be more expensive than an INSERT on that table.
These statements can take many forms, but if you limit the discussion to their "basic" forms that involve a single record, then the larger portion of the cost will usually be dedicated to modifying the indexes. Each indexed field that is modified during an UPDATE would typically involve two basic operations (delete the old key and add the new key) whereas the INSERT would require one (add the new key). Of course, a clustered index would then add some other dynamics as would locking issues, transaction isolation, etc. So, ultimately, the comparison between these statements in a general sense is not really possible and would probably require benchmarking of specific statements if it actually mattered.
Typically, though, it makes sense to just use the correct statement and not worry about it since it is usually not an option to choose between an UPDATE and an INSERT.
It depends. If the update doesn't require changes to the key, it will most likely only cost about as much as a search, and then it will probably cost less than an insert, unless the database is organized like a heap.
This is the only thing I can state, because performance greatly depends on the database organization used.
If you for example use MyISAM, which I suppose is organized like ISAM, an insert should generally cost about the same in terms of read accesses, but it will require some additional write operations.
On Sybase / SQL Server an update which impacts a column with a read-only index is internally replaced by a delete and then an insert, so this is obviously slower than insert. I do not know the implementation for other engines but I think this is a common strategy at least when indices are involved.
Now, for tables without indexes (or for update requests not involving any index) I suppose there are cases where the update can be faster, depending on the structure of the table.
In MySQL you can change your UPDATE into an INSERT with ON DUPLICATE KEY UPDATE:
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
-- if a row with a=1 already exists (and a is a UNIQUE or PRIMARY key), the statement above is equivalent to:
UPDATE t1 SET c=c+1 WHERE a=1;
A lot of people here are commenting that you can't compare an insert vs an update, but I disagree. People should understand that an update can take a lot more resources than an insert, or even than a delete plus an insert.
Now, as to how you can even compare the two when one doesn't directly replace the other: in certain cases you make an insert and then update the table with data from another table.
For instance, I get a feed from an API which contains id1, but this table relates to another table and I would like to add table2_id. Instead of doing an update statement that takes a lot more resources, I can handle this in the backend, which is faster, and just do an insert statement instead of an insert and then an update (a sketch follows below). The update statement also locks the table, causing a traffic jam, so to speak.
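A sketch of that idea: resolve table2_id at insert time with a joined SELECT instead of inserting first and updating afterwards. feed_staging, target_table, payload and the join column are hypothetical; id1 and table2_id come from the description above:

INSERT INTO target_table (id1, table2_id, payload)
SELECT f.id1, t2.id, f.payload
FROM feed_staging AS f
JOIN table2 AS t2 ON t2.external_id = f.id1;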

Is there any harm in resetting the auto-increment?

I have 100 million rows, and it's getting too big.
I see a lot of gaps. (since I delete, add, delete, add.)
I want to fill these gaps with auto-increment.
If I do reset it... is there any harm?
If I do this, will it fill the gaps?:
mysql> ALTER TABLE tbl AUTO_INCREMENT = 1;
Potentially very dangerous, because you can get a number again that is already in use.
What you propose is resetting the sequence to 1 again. It will just produce 1,2,3,4,5,6,7,.. and so on, regardless of these numbers being in a gap or not.
Update: According to Martin's answer, because of the dangers involved, MySQL will not even let you do that. It will reset the counter to at least the current value + 1.
Think again what real problem the existence of gaps causes. Usually it is only an aesthetic issue.
If the number gets too big, switch to a larger data type (bigint should be plenty).
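A sketch of that switch, assuming the auto_increment column is simply called id (note that changing the column type rebuilds the table, which can take a while on 100 million rows):

ALTER TABLE tbl MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;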
FWIW... According to the MySQL docs applying
ALTER TABLE tbl AUTO_INCREMENT = 1
where tbl contains existing data should have no effect:
To change the value of the AUTO_INCREMENT counter to be used for new rows, do this:
ALTER TABLE t2 AUTO_INCREMENT = value;
You cannot reset the counter to a value less than or equal to any that have already been used. For MyISAM, if the value is less than or equal to the maximum value currently in the AUTO_INCREMENT column, the value is reset to the current maximum plus one. For InnoDB, if the value is less than the current maximum value in the column, no error occurs and the current sequence value is not changed.
I ran a small test that confirmed this for a MyISAM table.
So the answers to your questions are: no harm, and no, it won't fill the gaps. As other responders have said: a change of data type looks like the least painful choice.
Chances are you wouldn't gain anything from doing this, and you could easily screw up your application by overwriting rows, since you're going to reset the count for the IDs. (In other words, the next time you insert a row, it'll overwrite the row with ID 1, and then 2, etc.) What will you gain from filling the gaps? If the number gets too big, just change it to a larger number (such as BIGINT).
Edit: I stand corrected. It won't do anything at all, which supports my point that you should just change the type of the column to a larger integer type. The maximum possible value for an unsigned BIGINT is 2^64 - 1, which is over 18 quintillion. If you only have 100 million rows at the moment, that should be plenty for the foreseeable future.
I agree with musicfreak... The maximum for an integer (int(10)) is 4,294,967,295 (unsigned, of course). If you need to go even higher, switching to BIGINT brings you up to 18,446,744,073,709,551,615.
Since you can't lower the next auto-increment value, you have other options. The data type switch could be done, but it seems a little unsettling to me since you don't actually have that many rows. You'd have to make sure your code can handle IDs that large, which may or may not be tough for you.
Are you able to do much downtime? If you are, there are two options I can think of:
Dump/reload the data. You can do this so it won't keep the ID numbers. For example, you could use a SELECT ... INTO to copy the data, sans IDs, to a new table with identical DDL. Then you drop the old table and rename the new table to the old name. Depending on how much data there is, this could take a noticeable amount of time (and temporary disk space).
You could make a little program to issue UPDATE statements to change the IDs (a minimal sketch follows after this list). If you let that run slowly, it would "defragment" your IDs over time. Then you could temporarily stop the inserts (just a minute or two), update the last IDs, then restart them. After updating the last IDs you can change the AUTO_INCREMENT value to be the next number and your hole will be gone. This shouldn't cause any real downtime (at least on InnoDB), but it could take quite a while depending on how aggressive your program is.
Of course, both of these ignore referential integrity. I'm assuming that's not a problem (log statements that aren't used as foreign keys, or some such).
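A minimal sketch of one step of that slow renumbering loop (option 2), with tbl and id as hypothetical names and no foreign keys assumed:

-- find the lowest unused id above the smallest existing one:
SELECT MIN(t1.id) + 1 INTO @gap
FROM tbl AS t1
LEFT JOIN tbl AS t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL;

-- and the current highest id:
SELECT MAX(id) INTO @top FROM tbl;

-- move the highest row down into the gap; repeat until @gap >= @top:
UPDATE tbl SET id = @gap WHERE id = @top;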
Does it really matter if there are gaps?
If you really want to go back and fill them, you can always turn off auto increment, and manually scan for the next available id every time you want to insert a row -- remembering to lock the table to avoid race conditions, of course. But it's a lot of work to do for not much gain.
Do you really need a surrogate key anyway? Depending on the data (you haven't mentioned a schema) you can probably find a natural key.