Let's say I have 300 million users in my MySQL database (InnoDB). Some of them have a username set, while for others the username is NULL; let's say 60% of them are not null (have an actual varchar value).
If I wanted to set all 300 million users' usernames to null, would
UPDATE users SET username = null WHERE username IS NOT NULL
perform better than
UPDATE users SET username = null - without a WHERE clause, just blanket null them all?
I know that WHERE always performs faster when setting actual values, but somehow null fields made me think about this.
Both will take terribly long. I suggest you do it in 'chunks' as described in my blog here:
http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks
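The core idea, as a rough sketch (the chunk size is arbitrary, and UPDATE ... LIMIT is just the simplest looping form; the blog walks the primary key in ranges instead, which avoids rescanning the same rows):
-- repeat until it reports 0 rows affected:
UPDATE users
    SET username = NULL
    WHERE username IS NOT NULL
    LIMIT 10000;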
Here is another solution:
ALTER TABLE t DROP COLUMN c;
ALTER TABLE t ADD COLUMN c VARCHAR(...) DEFAULT NULL;
Each ALTER will copy the table over once without writing to the ROLLBACK log (etc), thereby being significantly faster. (I doubt if you can combine the two into a single statement.)
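(For what it's worth, ALTER TABLE does accept a comma-separated list of alterations, so a combined statement at least parses; whether it still does only one table copy is worth testing. The VARCHAR length below is only a placeholder:)
ALTER TABLE t
    DROP COLUMN c,
    ADD COLUMN c VARCHAR(64) DEFAULT NULL;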
But first, let's back up and discuss why you need to do this unusual task. It is likely to indicate a poor schema design. And rethinking the design may be a better approach.
Related
In MySQL, I am using an InnoDB table that contains unique names, and IDs for those names. Clients need to atomically check for an existing name, insert a new one if it does not exist, and get the ID. The ID is an AUTO_INCREMENT value, and it must not increment out-of-control when checking for existing values regardless of the setting of "innodb_autoinc_lock_mode"; this is because very often the same name will be checked (e.g. "Alice"), and every now and then some new name will come along (e.g. "Bob").
The "INSERT...ON DUPLICATE KEY UPDATE" statement causes an AUTO_INCREMENT increase even in the duplicate-key case, depending on "innodb_autoinc_lock_mode", and is thus unacceptable. The ID will be used as the target of a Foreign-Key Constraint (in another table), and thus it is not okay to change existing IDs. Clients must not deadlock when they do this action concurrently, regardless of how the operations might be interleaved.
I would like the processing during the atomic operation (e.g. checking for the existing ID and deciding whether or not to do the insert) to be done on the server-side rather than the client-side, so that the delay for other sessions attempting to do the same thing simultaneously is minimal and does not need to wait for client-side processing.
My test table to demonstrate this is named FirstNames:
CREATE TABLE `FirstNames` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`FirstName` varchar(45) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `FirstName_UNIQUE` (`FirstName`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
The best solution that I have come up with thus far is as follows:
COMMIT;
SET @myName='Alice';
SET @curId=NULL;
SET autocommit=0;
LOCK TABLES FirstNames WRITE;
SELECT Id INTO @curId FROM FirstNames WHERE FirstName = @myName;
INSERT INTO `FirstNames` (`FirstName`) SELECT @myName FROM DUAL WHERE @curId IS NULL;
COMMIT;
UNLOCK TABLES;
SET @curId=IF(@curId IS NULL, LAST_INSERT_ID(), @curId);
SELECT @curId;
This uses "LOCK TABLES...WRITE" following the instructions given in the MySQL "Interaction of Table Locking and Transactions" documentation for the correct way to lock InnoDB tables. This solution requires the user to have the "LOCK TABLES" privilege.
If I run the above query with @myName="Alice", I obtain a new ID and then continue to obtain the same ID no matter how many times I run it. If I then run with @myName="Bob", I get another ID with the next AUTO_INCREMENT value, and so on. Checking for a name that already exists does not increase the table's AUTO_INCREMENT value.
I am wondering if there is a better solution to accomplish this, perhaps one that does not require the "LOCK TABLES"/"UNLOCK TABLES" commands and combines more "rudimentary" commands (e.g. "INSERT" and "SELECT") in a more clever way? Or is this the best methodology that MySQL currently has to offer?
Edit
This is not a duplicate of "How to 'insert if not exists' in MySQL?". That question does not address all of the criteria that I stated. The issue of keeping the AUTO_INCREMENT value stable is not resolved there (it is only mentioned in passing).
Many of the answers do not address getting the ID of the existing/inserted record, some of the answers do not provide an atomic operation, and some of the answers have the logic being done on the client-side rather than the server-side. A number of the answers change an existing record, which is not what I'm looking for. I am asking for either a better method to meet all of the criteria stated, or confirmation that my solution is the optimal one with existing MySQL support.
The question is really about how to normalize data when you expect there to be duplicates. And then avoid "burning" ids.
http://mysql.rjweb.org/doc.php/staging_table#normalization discusses a 2-step process and is aimed at mass updates due to high-speed ingestion of rows. It degenerates to a single row, but still requires the 2 steps.
Step 1 INSERTs any new rows, creating new auto_inc ids.
Step 2 pulls back the ids en masse.
Note that the work is best done with autocommit=ON and outside the main transaction that is loading the data. This avoids an extra cause for burning ids, namely potential rollbacks.
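Applied to the FirstNames table above, the two steps might look like the following sketch (the Staging table holding the incoming names is an assumption for illustration):
-- Step 1: insert only the names that do not exist yet, so no ids are burned.
INSERT INTO FirstNames (FirstName)
    SELECT DISTINCT s.FirstName
    FROM Staging s
    LEFT JOIN FirstNames f USING (FirstName)
    WHERE f.id IS NULL;
-- Step 2: pull the ids back en masse.
SELECT f.id, s.FirstName
    FROM Staging s
    JOIN FirstNames f USING (FirstName);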
You can use a conditional INSERT in a single statement:
INSERT INTO FirstNames (FirstName)
SELECT i.firstName
FROM (SELECT 'Alice' AS firstName) i
WHERE NOT EXISTS (SELECT * FROM FirstNames t WHERE t.FirstName = i.firstName);
The next AUTO_INCREMENT value stays untouched if the name already exists. But I can't promise that will be the case in every (future) version or for every configuration. However, it is not much different from what you did, just in a single statement and without locking the table.
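You can verify the behaviour on your own version and configuration by comparing the table's AUTO_INCREMENT counter before and after a duplicate insert:
-- on MySQL 8.0 you may first need: SET information_schema_stats_expiry = 0;
SELECT AUTO_INCREMENT
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'FirstNames';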
At this point you can be sure that the name exists and just select the corresponding Id:
SELECT Id FROM FirstNames WHERE FirstName = 'Alice';
I've got a bit of a stupid question. My program needs to be able to delete data from my database. That's not really the problem. But how can I delete data without the danger that others can see that something has been deleted?
User Table:
U_ID  U_NAME
1     Chris
2     Peter

OTHER TABLE:
ID  TIMESTAMP   FK_U_ID
1   2012-12-01  1
2   2012-12-02  1
So the IDs are AUTO_INCREMENT, which means that if I delete one of them there's a gap. Furthermore, each timestamp is bigger than the one in the row before, so they are ascending.
I want to let the data with ID 1 disappear from the user's profile (U_ID 1).
If I delete it, there is a gap. If I just change the FK_U_ID to 2 (Peter), it's obvious, because when I insert data there are 20 or 30 rows with the same U_ID, so it's clear that there has been a modification.
If I set the FK_U_ID to NULL, it's the same problem as changing it to another U_ID.
Is there any solution to make this work? I know that if nobody but me has access to the database, it's no problem. But just in case somebody inspects my program's data, it should not be obvious that there have been modifications.
So here we go.
For the ID gaps issue you can use GUIDs as @SLaks suggests, but then you can't use the native RDBMS auto_increment, which means you have to create the GUID and insert it along with the rest of the record data upon creation. Of course, you don't really need the ID to be globally unique; you could just store a random string of 20 characters or so, but then you have to do a DB read to see if that ID is taken and repeat that process until you find an unused ID... which could be quite taxing.
It's not at all clear why you would want to "hide" evidence that a delete was performed. That sounds like a really bad idea. I'm not a fan of promulgating misinformation.
Two of the characteristics of an ideal primary key are:
- anonymous (void of any useful information; it doesn't matter what it's set to)
- immutable (once assigned, it will never be changed)
But, if we set that whole discussion aside...
I can answer a slightly different question (an answer you might find helpful to your particular situation)
The only way to eliminate a "gap" in the values in a column with an AUTO_INCREMENT would be to change the column values from their current values to a contiguous sequence of new values. If there are any foreign keys that reference that column, the values in those columns would need to be updated as well, to preserve the relationship. That will likely leave the current auto_increment value of the table higher than the largest value of the id column, so I'd want to reset that as well, to avoid a "gap" on the next insert.
(I have done re-sequencing of auto_increment values in development and test environments, to "clean up" lookup tables, and to move the id values of some tables to ranges that are distinct from the ranges in other tables... that lets me test SQL to make sure the join predicates aren't inadvertently referencing the wrong table and returning rows that look correct by accident... those are some reasons I've done reassignment of auto_increment values.)
Note that the database can "automagically" update foreign key values (for InnoDB tables) when you change the primary key value, as long as the foreign key constraint is defined with ON UPDATE CASCADE, and FOREIGN_KEY_CHECKS is not disabled.
If there are no foreign keys to deal with, and assuming that all of the current values of id are positive integers, then I've been able to do something like this: (with appropriate backups in place, so I can recover if things don't work right)
UPDATE mytable t
JOIN (
SELECT s.id AS old_id
, @i := @i + 1 AS new_id
FROM mytable s
CROSS
JOIN (SELECT @i := 0) i
ORDER BY s.id
) c
ON t.id = c.old_id
SET t.id = c.new_id
WHERE t.id <> c.new_id
To reset the table AUTO_INCREMENT back down to the largest id value in the table:
ALTER TABLE mytable AUTO_INCREMENT = 1;
Typically, I will create a table and populate it from that query in the inline view (aliased as c) above. I can then use that table to update both foreign key columns and the primary key column, first disabling the FOREIGN_KEY_CHECKS and then re-enabling it. (In a concurrent environment, where other processes might be inserting/updating/deleting rows from one of the tables, I would of course first obtain an exclusive lock on all of the tables to be updated.)
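That approach might look like the following sketch (id_map, child_table, and mytable_id are placeholder names):
CREATE TABLE id_map AS
    SELECT s.id AS old_id
         , @i := @i + 1 AS new_id
    FROM mytable s
    CROSS JOIN (SELECT @i := 0) i
    ORDER BY s.id;
SET FOREIGN_KEY_CHECKS = 0;
UPDATE child_table c JOIN id_map m ON c.mytable_id = m.old_id
    SET c.mytable_id = m.new_id;
UPDATE mytable t JOIN id_map m ON t.id = m.old_id
    SET t.id = m.new_id;
SET FOREIGN_KEY_CHECKS = 1;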
Taking up again, the discussion I set aside earlier... this type of "administrative" function can be useful in a test environment, when setting up test cases. But it is NOT a function that is ever performed in a production environment, with live data.
Is it possible to update the value of a field but cap it at the same time?:
UPDATE users SET num_apples=num_apples-1 WHERE xxx = ?
I don't want the field "num_apples" to fall below zero. Can I do that in one operation?
Thanks
Update
UPDATE users SET num_apples=num_apples-1 WHERE user_id = 123 AND num_apples > 0;
If I only have an index on "user_id", and not "num_apples", is that going to be bad for performance? I'm not sure how mysql implements this operation. I'm hoping that the WHERE on the user_id part makes it fast. I have to perform this operation somewhat frequently.
Thanks
Just add a WHERE condition specifying only rows > 0, so it won't update any rows into the negative.
UPDATE users SET num_apples=num_apples-1 WHERE num_apples > 0;
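If the application also needs to know whether the decrement actually happened (as opposed to the user already being at zero), ROW_COUNT() reports the number of rows changed:
UPDATE users SET num_apples=num_apples-1 WHERE user_id = 123 AND num_apples > 0;
SELECT ROW_COUNT();  -- 1 if decremented, 0 if already at zero (or no such user)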
Update
Following your subquestion on indexing: as always, the way to test performance is to benchmark it for yourself. Examine the EXPLAIN output for the query and make sure it is using the index on user_id (it should be). And finally, don't worry too much about the performance of this simple operation until it becomes a problem. You don't have an index on num_apples now, but could you not add one if performance isn't scaling to your needs?
You don't need to create two indexes, as only one will be used. You should index both fields in a single composite index on the pair (user_id, num_apples):
alter table users add index yourNewIndex (user_id, num_apples);
You can actually remove the previous index, as the new one also covers it:
alter table users drop index yourOldIndex;
Before dropping it you can get information on what index is being used by running:
EXPLAIN UPDATE users SET num_apples=num_apples-1
WHERE user_id = 123 AND num_apples > 0;
If the index used is yourNewIndex, then MySQL has determined that it is faster to use that than the previous one.
Edit:
do I even need any checks? Will mysql prevent the value from going < 0 by default in that case?
Yes, it will. You'll get a data truncation error when running the update if you do not guard against it:
Data truncation: BIGINT UNSIGNED value is out of range
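For illustration, a minimal reproduction with a made-up unsigned column:
CREATE TABLE apples (n BIGINT UNSIGNED NOT NULL DEFAULT 0);
INSERT INTO apples VALUES (0);
UPDATE apples SET n = n - 1;  -- fails: BIGINT UNSIGNED value is out of range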
I have a column that is a datetime, converted_at.
I plan on making calls that check WHERE converted_at IS NOT NULL very often. As such, I'm considering having a boolean field converted. Is there a significant performance difference between checking if a field is not null vs if it is false?
Thanks.
If something can be answered by a single field, favour that over splitting the same thing into two fields. Splitting creates more infrastructure, which, in your case, is avoidable.
As to the nub of the question, I believe most database implementations, MySQL included, keep an internal flag, which is a boolean anyway, representing the NULLability of a field.
You can rely on this being done for you correctly.
As to performance, the bigger question is profiling the typical queries that you run on your database, creating appropriate indexes, and running ANALYZE TABLE to improve execution plans and the indexes used during queries. That will have a far bigger impact on performance.
Using WHERE converted_at is not null or WHERE converted = FALSE will probably be the same in matters of query performance.
But if you have this additional bit field that stores whether the converted_at field is null or not, you'll have to somehow maintain its integrity (via triggers?) whenever a new row is added and every time the column is updated. So, this is a de-normalization, and it also means more complicated code. Moreover, you'll have at least one more index on the table (which means slightly slower INSERT/UPDATE/DELETE operations).
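For illustration, keeping such a flag in sync would take something like the following sketch (the converted column and trigger names are assumptions):
CREATE TRIGGER users_converted_ins BEFORE INSERT ON users
FOR EACH ROW SET NEW.converted = (NEW.converted_at IS NOT NULL);
CREATE TRIGGER users_converted_upd BEFORE UPDATE ON users
FOR EACH ROW SET NEW.converted = (NEW.converted_at IS NOT NULL);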
Therefore, I don't think it's good to add this bit field.
If you can change the column in question from NULL to NOT NULL (possibly by normalizing the table), you may get some performance gain, at the cost of having more tables.
I had the same question for my own usage. So I decided to put it to the test.
So I created all the fields required for the 3 possibilities I imagined:
# option 1
ALTER TABLE mytable ADD deleted_at DATETIME NULL;
ALTER TABLE mytable ADD archived_at DATETIME NULL;
# option 2
ALTER TABLE mytable ADD deleted boolean NOT NULL DEFAULT 0;
ALTER TABLE mytable ADD archived boolean NOT NULL DEFAULT 0;
# option 3
ALTER TABLE mytable ADD invisibility TINYINT(1) UNSIGNED NOT NULL DEFAULT 0
COMMENT '4 values possible' ;
The last is a bitfield where 1=archived, 2=deleted, 3=deleted + archived
First difference: you have to create indexes for options 2 and 3.
CREATE INDEX mytable_deleted_IDX USING BTREE ON mytable (deleted) ;
CREATE INDEX mytable_archived_IDX USING BTREE ON mytable (archived) ;
CREATE INDEX mytable_invisibility_IDX USING BTREE ON mytable (invisibility) ;
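Note in passing: with option 3, testing a single flag requires a bitwise check, which cannot use the BTREE index (only an exact match such as invisibility = 0 can):
SELECT * FROM mytable WHERE invisibility & 1;  -- archived, alone or combined
SELECT * FROM mytable WHERE invisibility & 2;  -- deleted, alone or combined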
Then I tried all of the options using a real-life SQL request, on 13k records in the main table. Here is how it looks:
SELECT *
FROM mytable
LEFT JOIN table1 ON mytable.id_qcm = table1.id_qcm
LEFT JOIN table2 ON table2.id_class = mytable.id_class
INNER JOIN user ON mytable.id_user = user.id_user
where mytable.id_user=1
and mytable.deleted_at is null and mytable.archived_at is null
# and deleted=0
# and invisibility=0
order BY id_mytable
I ran the query with each of the commented-out filter options above in turn.
Tested with MySQL 5.7.21-1 on Debian 9.
My conclusion:
The "is null" solution (option 1) is a bit faster, or at least same performance.
The 2 others ("deleted=0" and "invisibility=0") seems in average a bit slower.
But the nullable fields option have decisive advantages: No index to create, easier to update, easier to query. And less storage space used.
(additionnaly inserts & updates virtually should be faster as well, since mysql do not need to update indexes, but you never would be able to notice that).
So you should use the nullable datatime fields option.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table, I always need information about a particular user, so I use, for example, WHERE user_name='Albert Gates'. Every time, the MySQL server then needs to scan 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column, and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows; it just needs to find the appropriate entry in the index, which is stored in a B-tree, making it possible to find a record in a very small amount of time.
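For example (assuming the table is called big_table; the index name is illustrative):
ALTER TABLE big_table ADD INDEX idx_user_property (user_name, user_property);
-- the composite index also serves plain WHERE user_name = '...' lookups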
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and increase performance. The same may apply to user_property.
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
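A typical lookup against this design then only touches a handful of index entries, e.g.:
select u.username, p.name, upv.value
from users u
join user_property_values upv on upv.user_id = u.user_id
join properties p on p.property_id = upv.property_id
where u.username = 'f00';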
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database is the number one way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion; also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many to many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure; it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This technique simply aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer behaving itself, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join and further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, consider moving some into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns, where null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.