Randomize timestamp column in large MySQL table - mysql

I have a test database table with ~100m rows which were generated by cloning the original 3k rows multiple times. Let's say this table describes events which have timestamps. Due to the cloning we now have ~10m events per day, which is far from realistic. So I'd like to randomize the date column and scatter the records over several days.
Here is the procedure I've come up with:
DROP PROCEDURE IF EXISTS `randomizedates`;
DELIMITER //
CREATE PROCEDURE `randomizedates`(IN `daterange` INT)
BEGIN
DECLARE id INT UNSIGNED;
DECLARE buf TIMESTAMP;
DECLARE done INT DEFAULT FALSE;
DECLARE cur1 CURSOR FOR SELECT event_id FROM events;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur1;
the_loop: LOOP
FETCH cur1 INTO id;
IF done THEN
LEAVE the_loop;
END IF;
SET buf = (SELECT NOW() - INTERVAL FLOOR(RAND() * daterange) DAY);
UPDATE events SET starttime = buf WHERE event_id = id;
END LOOP the_loop;
CLOSE cur1;
END //
DELIMITER ;
On the 3k table it executes in ~6 seconds, so assuming linear complexity it would take ~50 hours to run against the 100m table. Is there a way to speed it up? Or is my procedure simply wrong?

Just do:
set @datarange = 7;
update `events`
set starttime = NOW() - INTERVAL FLOOR(RAND() * @datarange) DAY;
Databases are not good at fetching and processing single rows in a loop, as we are used to doing in procedural languages (iterators, for-each loops, arrays, etc.). They are best at, and optimized for, processing SQL, which is essentially a declarative language - you declare what you want to get without specifying how to do it, in contrast to procedural languages, where you spell out the steps the program must perform.
Remember - row by row = slow by slow.
Look at a simple example that simulates your table and compares your procedure to a single UPDATE:
drop table `events`;
create table `events` as
select * from information_schema.tables
where 1=0;
alter table `events` add column event_id int primary key auto_increment first;
alter table `events` change column create_time starttime timestamp;
insert into `events`
select null, t.*
from information_schema.tables t
cross join (
select 1 from information_schema.tables
limit 100
) xx;
mysql> select count(*) from `events`;
+----------+
| count(*) |
+----------+
|    17200 |
+----------+
We created a table with 17 thousand rows. Now we call the procedure:
mysql> call `randomizedates`(7);
Query OK, 0 rows affected (34.26 sec)
and the update command:
mysql> set @datarange = 7;
Query OK, 0 rows affected (0.00 sec)
mysql> update `events`
-> set starttime = NOW() - INTERVAL FLOOR(RAND() * @datarange) DAY;
Query OK, 17200 rows affected (0.23 sec)
Rows matched: 17200 Changed: 17200 Warnings: 0
As you can see, 34.26 seconds vs 0.23 seconds is roughly 150 times (about 14,800%) faster - a huge difference!
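On a table the size of the original one (~100m rows) a single UPDATE is still one very large transaction, so if undo-log growth or lock time becomes a problem, the same statement can be applied in primary-key ranges. A hedged sketch, assuming event_id is a dense integer key (the chunk size is arbitrary):
set @datarange = 7;
-- update one range of ids at a time to keep each transaction small
update `events`
set starttime = NOW() - INTERVAL FLOOR(RAND() * @datarange) DAY
where event_id between 1 and 1000000;
-- then repeat for the next ranges (1000001-2000000, and so on),
-- e.g. from a small client-side loop or a stored procedure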

Related

Mysql and execute stored procedure in atomic way or select update atomically

In MySQL I have two concurrent processes that need to read some rows and update a flag based on a condition.
I have to write a stored procedure with a transaction, but the problem is that sometimes the two processes update the same rows.
I have a table Status and I want to read 15 rows where the flag Reserved is false, then update those rows setting the flag Reserved to true.
The updated rows must be returned to the client.
My stored procedure is:
CREATE DEFINER=`user`@`%` PROCEDURE `get_reserved`()
BEGIN
DECLARE tmpProfilePageId bigint;
DECLARE finished INTEGER DEFAULT 0;
DECLARE curProfilePage CURSOR FOR
SELECT ProfilePageId
FROM Status
WHERE Reserved is false and ((timestampdiff(HOUR, UpdatedTime, NOW()) >= 23) or UpdatedTime is NULL)
ORDER BY UpdatedTime ASC
LIMIT 15;
DECLARE CONTINUE HANDLER
FOR NOT FOUND SET finished = 1;
DECLARE EXIT HANDLER FOR SQLEXCEPTION ROLLBACK;
DECLARE EXIT HANDLER FOR SQLWARNING ROLLBACK;
START TRANSACTION;
DROP TEMPORARY TABLE IF EXISTS TmpAdsProfile;
CREATE TEMPORARY TABLE TmpAdsProfile(Id INT PRIMARY KEY AUTO_INCREMENT, ProfilePageId BIGINT);
OPEN curProfilePage;
getProfilePage: LOOP
FETCH curProfilePage INTO tmpProfilePageId;
IF finished = 1 THEN LEAVE getProfilePage;
END IF;
UPDATE Status SET Reserved = true WHERE ProfilePageId = tmpProfilePageId;
INSERT INTO TmpAdsProfile (ProfilePageId) VALUES (tmpProfilePageId);
END LOOP getProfilePage;
CLOSE curProfilePage;
SELECT ProfilePageId FROM TmpAdsProfile;
COMMIT;
END
Anyway, if I execute two concurrent processes that call this stored procedure, sometimes they update the same rows.
How can I execute the stored procedure in an atomic way?
Simplify this a bit and use FOR UPDATE. That will lock the rows you want to change until you commit the transaction. You can get rid of the cursor entirely. Something like this, not debugged!
START TRANSACTION;
CREATE OR REPLACE TEMPORARY TABLE TmpAdsProfile AS
SELECT ProfilePageId
FROM Status
WHERE Reserved IS false
AND ((timestampdiff(HOUR, UpdatedTime, NOW()) >= 23) OR UpdatedTime IS NULL)
ORDER BY UpdatedTime ASC
LIMIT 15
FOR UPDATE;
UPDATE Status SET Reserved = true
WHERE ProfilePageId IN (SELECT ProfilePageId FROM TmpAdsProfile);
COMMIT;
SELECT ProfilePageId FROM TmpAdsProfile;
That temporary table will only ever have fifteen rows in it. So indexes and PKs and all that are not necessary. Therefore you can use CREATE ... AS SELECT ... to create and populate the table in one go.
And, consider recasting your UpdatedTime filter so it can use an index.
AND (UpdatedTime <= NOW() - INTERVAL 23 HOUR OR UpdatedTime IS NULL)
The appropriate index for the SELECT query is
CREATE INDEX status_update ON Status (Reserved, UpdatedTime, ProfilePageId);
The faster your SELECT operation can be, the less time your transaction will take, so the better your overall performance will be.
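If the server is MySQL 8.0 or later, adding SKIP LOCKED to the locking read lets two concurrent callers skip each other's locked rows instead of blocking on them - a hedged variant of the SELECT above:
SELECT ProfilePageId
FROM Status
WHERE Reserved IS false
  AND (UpdatedTime <= NOW() - INTERVAL 23 HOUR OR UpdatedTime IS NULL)
ORDER BY UpdatedTime ASC
LIMIT 15
FOR UPDATE SKIP LOCKED;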

Keep every user last 100 transaction and delete the rest

I have a table named tbl_tmp_trans.
It contains every user transaction ever made (and it's up to 6 million rows right now!).
We have decided to keep only the last 100 transactions per user in our database so we can keep the db clean.
Here is the query I have come up with:
delete from tbl_tmp_trans
where trans_id in
(
select trans_id
from
(
select trans_id
from tbl_faucets_transactions
order by date
group by user_id
limit 100
) foo
)
What am I doing wrong?
After running this my CPU reaches 100% and MySQL crashes.
Thanks in advance.
P.S.: our DB is MySQL and the table engine is InnoDB.
P.S.2: We have about 120k users and the transaction table has nearly 6 million records.
I have a proposal... Hopefully, it might help you.
Alter your table:
alter table tbl_tmp_trans add column todel tinyint(1);
Implement a stored procedure to iterate through the table with a cursor and mark (set todel to 1) records that should be deleted. Example procedure to do that:
delimiter //
drop procedure if exists mark_old_transactions //
create procedure mark_old_transactions()
begin
declare done int default false;
declare tid int;
declare uid int;
declare last_uid int default 0;
declare count int default 0;
declare cur cursor for select trans_id, user_id from tbl_tmp_trans order by user_id, date desc;
declare continue handler for not found set done = true;
open cur;
repeat
fetch cur into tid, uid;
if (!done) then
if (uid!=last_uid) then
set count = 0;
end if;
set last_uid = uid;
set count = count + 1;
if (count > 100) then
update tbl_tmp_trans set todel=1 where trans_id=tid;
end if;
end if;
until done
end repeat;
close cur;
end //
delimiter ;
Invoke the procedure, maybe do some simple checks (how many transactions you delete from the table, etc.), and delete the marked records.
call mark_old_transactions;
-- select count(*) from tbl_tmp_trans where todel=1;
-- select count(*) from tbl_tmp_trans;
delete from tbl_tmp_trans where todel=1;
Finally, remove the column that we just added.
alter table tbl_tmp_trans drop column todel;
Some notes:
You probably have to iterate through all the records of the table anyway, so you don't lose performance with the cursor.
If you have ~120K users and ~6M transactions, you have ~50 transactions per user on average, which means you probably don't have many users with more than 100 transactions, so the number of updates (hopefully) won't be too large and the procedure runs relatively fast.
The delete should be fast as well thanks to the new column.
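If the server happens to be MySQL 8.0 or later, the same cleanup can be done in a single statement with a window function, without the helper column or the cursor - a hedged sketch, assuming trans_id is the primary key and date is the transaction timestamp:
delete t
from tbl_tmp_trans t
join (
  select trans_id,
         row_number() over (partition by user_id order by `date` desc) as rn
  from tbl_tmp_trans
) ranked on ranked.trans_id = t.trans_id
where ranked.rn > 100;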

selecting top 3 rows in an ordered select

I have table data like this:
id,time,otherdata
a,1,fsdfas
a,2,fasdfag
a,3,fasdfas
a,7,asfdsaf
b,8,fasdf
a,8,asdfasd
a,9,afsadfa
b,10,fasdf
...
So essentially, I can select all the data in the order I want by saying something like:
select * from mytable order by id, time;
That gets me all the records in the order I want, sorted by id first and then by time. But instead of getting all the records, I need the latest 3 times for each id.
Answer:
Well, I figured out how to do it. I'm surprised at how quick it was: I'm operating on a couple million rows of data and it took about 11 seconds. I wrote a procedure in an SQL script to do it, and here's what it looks like. Note that instead of getting the last 3, it gets the last "n" rows of data.
use my_database;
drop procedure if exists getLastN;
drop table if exists lastN;
-- Create a procedure that gets the last three records for each id
delimiter //
create procedure getLastN(n int)
begin
# Declare cursor for data iterations, and variables for storage
declare idData varchar(32);
declare done int default 0;
declare curs cursor for select distinct id from my_table;
declare continue handler for not found set done = 1;
open curs;
# Create a temporary table to contain our results
create temporary table lastN like my_table;
# Iterate through each id
DATA_LOOP: loop
fetch curs into idData;
if done then leave DATA_LOOP; end if;
insert into lastN select * from my_table where id = idData order by time desc limit n;
end loop;
close curs;
end//
delimiter ;
call getLastN(3);
select * from lastN;
Sorry if this doesn't exactly work; I've had to change variable names and such to obfuscate my employer's work, but I ran this exact piece of code and got what I needed!
I think it's as simple as:
SELECT * FROM `mytable`
GROUP BY `id`
ORDER BY `time` DESC
LIMIT 3
Two approaches that I'm aware of are (1) to use a set of unions, each one containing a "limit 3", or (2) to use a temporary variable. These approaches, along with other useful links and discussion can be found here.
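For completeness, a hedged sketch of the second approach with user variables (column names taken from the question; note that relying on user-variable evaluation order inside a SELECT is not guaranteed and is deprecated in MySQL 8.0, where ROW_NUMBER() is the cleaner choice):
select id, `time`, otherdata
from (
  select m.*,
         @rn := if(@prev_id = m.id, @rn + 1, 1) as rn,
         @prev_id := m.id
  from mytable m
  cross join (select @rn := 0, @prev_id := null) vars
  order by m.id, `time` desc
) ranked
where rn <= 3;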
Try this:
select *
from mytable as m1
where (
select count(*) from mytable as m2
where m1.id = m2.id and m2.time >= m1.time
) <= 3 ORDER BY id, time

Fill database tables with a large amount of test data

I need to load a table with a large amount of test data. This is to be used for testing performance and scaling.
How can I easily create 100,000 rows of random/junk data for my database table?
You could also use a stored procedure. Consider the following table as an example:
CREATE TABLE your_table (id int NOT NULL PRIMARY KEY AUTO_INCREMENT, val int);
Then you could add a stored procedure like this:
DELIMITER $$
CREATE PROCEDURE prepare_data()
BEGIN
DECLARE i INT DEFAULT 1;
WHILE i <= 100000 DO
INSERT INTO your_table (val) VALUES (i);
SET i = i + 1;
END WHILE;
END$$
DELIMITER ;
When you call it, you'll have 100k records:
CALL prepare_data();
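If those 100k single-row inserts are slow on InnoDB, it is usually because each one commits separately; wrapping the call in a single transaction is a simple, hedged speed-up:
SET autocommit = 0;
CALL prepare_data();
COMMIT;
SET autocommit = 1;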
For multiple row cloning (data duplication) you could use
DELIMITER $$
CREATE PROCEDURE insert_test_data()
BEGIN
DECLARE i INT DEFAULT 1;
WHILE i < 100000 DO
INSERT INTO `table` (`user_id`, `page_id`, `name`, `description`, `created`)
SELECT `user_id`, `page_id`, `name`, `description`, `created`
FROM `table`
WHERE id = 1;
SET i = i + 1;
END WHILE;
END$$
DELIMITER ;
CALL insert_test_data();
DROP PROCEDURE insert_test_data;
Here is a solution with pure math and SQL:
create table t1(x int primary key auto_increment);
insert into t1 () values (),(),();
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 1265 rows affected (0.01 sec)
Records: 1265 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 2530 rows affected (0.02 sec)
Records: 2530 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 5060 rows affected (0.03 sec)
Records: 5060 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 10120 rows affected (0.05 sec)
Records: 10120 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 20240 rows affected (0.12 sec)
Records: 20240 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 40480 rows affected (0.17 sec)
Records: 40480 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 80960 rows affected (0.31 sec)
Records: 80960 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 161920 rows affected (0.57 sec)
Records: 161920 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 323840 rows affected (1.13 sec)
Records: 323840 Duplicates: 0 Warnings: 0
mysql> insert into t1 (x) select x + (select count(*) from t1) from t1;
Query OK, 647680 rows affected (2.33 sec)
Records: 647680 Duplicates: 0 Warnings: 0
If you want more control over the data, try something like this (in PHP):
<?php
$conn = mysql_connect(...);
$num = 100000;
$sql = 'INSERT INTO `table` (`col1`, `col2`, ...) VALUES ';
for ($i = 0; $i < $num; $i++) {
mysql_query($sql . generate_test_values($i));
}
?>
where function generate_test_values would return a string formatted like "('val1', 'val2', ...)". If this takes a long time, you can batch them so you're not making so many db calls, e.g.:
for ($i = 0; $i < $num; $i += 10) {
$values = array();
for ($j = 0; $j < 10; $j++) {
$values[] = generate_test_data($i + $j);
}
mysql_query($sql . join(", ", $values));
}
would only run 10000 queries, each adding 10 rows.
Try filldb.
You can either post your schema or use an existing schema, generate dummy data, export it from that site, and import it into your database.
I really like the mysql_random_data_loader utility from Percona, you can find more details about it here.
mysql_random_data_loader is a utility that connects to the mysql database and fills the specified table with random data. If foreign keys are present in the table, they will also be correctly filled.
This utility has a cool feature, the speed of data generation can be limited.
For example, to generate 30,000 records in the sakila.film_actor table at a speed of 500 records per second, you need the following command:
mysql_random_data_load sakila film_actor 30000 --host=127.0.0.1 --port=3306 --user=my_user --password=my_password --qps=500 --bulk-size=1
I have successfully used this tool to simulate a workload in a test environment by running this utility on multiple threads at different speeds for different tables.
create table mydata as select * from information_schema.columns;
insert into mydata select * from mydata;
-- repeating the insert 11 times will give you at least 6 mln rows in the table.
I am terribly sorry if this is out of place, but I wanted to offer some explanation of this code, as I know just enough to explain it and why the answer above is rather useful if you only understand what it does.
The first line creates a table called mydata, and it generates the layout of the columns from information_schema, which stores information about your MySQL server. In this case it pulls from information_schema.columns, so the created table automatically gets all the columns it needs - very handy.
The second line is an INSERT statement that targets the new mydata table and inserts the information_schema data into it. The last line is just a comment suggesting you run the insert a few more times if you want to generate more data.
Lastly, in my testing one execution of this script generated 6,956 rows of data. If you need a quick way to generate some records, this isn't a bad method. However, for more advanced testing you might want to ALTER the table to include an auto-incrementing primary key so that you have a unique index; a database without a primary key is a sad database, and it also tends to produce unpredictable results since there can be duplicate entries. I'm not offering this as "the answer", but rather as another source of information to provide some logistical support to the answer above.
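If you do want that auto-increment primary key, a minimal sketch of the ALTER (the column name id is my own choice):
alter table mydata
  add column id int unsigned not null auto_increment primary key first;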
This is a more performant modification of @michalzuber's answer. The only difference is removing the WHERE id = 1, so that the inserts accumulate on each run.
Since the table doubles on every iteration, the number of records produced is 2^n:
for 10 iterations, 2^10 = 1024 records;
for 20 iterations, 2^20 = 1048576 records, and so on.
DELIMITER $$
CREATE PROCEDURE insert_test_data()
BEGIN
DECLARE i INT DEFAULT 1;
WHILE i <= 10 DO
INSERT INTO `table` (`user_id`, `page_id`, `name`, `description`, `created`)
SELECT `user_id`, `page_id`, `name`, `description`, `created`
FROM `table`;
SET i = i + 1;
END WHILE;
END$$
DELIMITER ;
CALL insert_test_data();
DROP PROCEDURE insert_test_data;

MySQL Non-Negative INT Columns

I want to do the following query:
UPDATE `users` SET balance = (balance - 10) WHERE id=1
But if the balance will become a negative number I want an error to be returned. Any ideas on if this is possible?
If you do
UPDATE `users` SET balance = (balance - 10) WHERE id=1 and balance >=10
You should be able to detect that a row was not modified.
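One way to make that detection explicit on the SQL side is ROW_COUNT(), which returns how many rows the last statement changed - a small sketch:
UPDATE `users` SET balance = (balance - 10) WHERE id=1 and balance >= 10;
SELECT ROW_COUNT();  -- 1 if the balance was deducted, 0 if it would have gone negative (or id 1 does not exist)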
Note that while another answer suggests using an unsigned int column, this may not work:
Create a test table
create table foo(val int unsigned default '0');
insert into foo(val) values(5);
Now we attempt to subtract 10 from our test row:
update foo set val=val-10;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql> select * from foo;
+------------+
| val        |
+------------+
| 4294967295 |
+------------+
This was on MySQL 5.0.38.
You can make the balance field of the users table an unsigned int:
ALTER TABLE `users` CHANGE `balance` `balance` INT UNSIGNED;
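Note that whether an out-of-range subtraction then raises an error or silently wraps around (as demonstrated in the answer above) depends on the server's SQL mode; under strict mode the UPDATE fails, which is what the question asks for. A hedged check:
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
UPDATE `users` SET balance = (balance - 10) WHERE id=1;
-- with balance < 10 this now fails with something like:
-- ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`db`.`users`.`balance` - 10)'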
This sort of thing is done with triggers. MySQL has had support for triggers only since 5.0.2.
DELIMITER $$
CREATE TRIGGER balance_check BEFORE INSERT ON user FOR EACH ROW
BEGIN
IF new.balance < @limit_value THEN
-- do something that causes an error.
-- mysql doesn't have a mechanism to block the action by itself
END IF;
END $$
DELIMITER ;
Triggers in MySQL are quite rudimentary. You have to hack around their limitations to do some things (e.g. cause an error).
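That said, on MySQL 5.5 and later a trigger can raise an error directly with SIGNAL, so the hack is no longer needed - a hedged sketch tied to the question's UPDATE case (assumes a signed balance column):
DELIMITER $$
CREATE TRIGGER balance_check BEFORE UPDATE ON `users` FOR EACH ROW
BEGIN
  IF NEW.balance < 0 THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'balance cannot go negative';
  END IF;
END $$
DELIMITER ;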
I don't think you can do this with a simple query. You should use a MySQL user-defined function that manages that before updating the row, or a trigger.
Just a tip that wouldn't fit as a comment. I was trying to subtract 32000 from 32047 (not a negative result) and was getting errors. Also confusing: I was getting BIGINT errors, but my subtraction was on a SMALLINT column! (Which still makes no sense.)
If you're getting "out of range" errors even when your "balance" is positive, try adding "limit 1" to the end of your query. Maybe this is a bug in MySQL?
mysql> update posts set cat_id=cat_id-32000 where timestamp=1360870280;
ERROR 1690 (22003): BIGINT UNSIGNED value is out of range in '(`xxxxx`.`posts`.`cat_id` - 32000)'
mysql> update posts set cat_id=cat_id-32000 where timestamp=1360870280 limit 1;
Query OK, 1 row affected (6.45 sec)
Rows matched: 1 Changed: 1 Warnings: 0
In my case the timestamp is unique (I just checked to be sure) but not explicitly defined as unique when I created the table. So why is the "limit 1" here necessary? But who cares, it works!