MariaDB: Optimum Way to Update Billions of Records

I am looking for an optimal way to update billions of records in one table (Table 3 in the example below). Each entry is associated with a timestamp with millisecond resolution. In this example Table 3 is out of date, while Tables 1 and 2 are up to date with the real entries of their respective data. I do not have anything that links Tables 1 and 2 to Table 3. If such a link is needed, please let me know, as I am not a database expert.
Table 1 has 4 columns:
Timestamp T_0 PRIMARY KEY (ex: '2014-07-04 16:17:16.800000')
X1_T1 VARCHAR
X2_T1 VARCHAR
X3_T1 VARCHAR
Table 2 has 4 columns:
Timestamp T_0 PRIMARY KEY (ex: '2014-07-04 16:17:16.800000')
X1_T2 VARCHAR
X2_T2 VARCHAR
X3_T2 VARCHAR
Table 3 has 7 columns:
Timestamp T_0 PRIMARY KEY (ex: '2014-07-04 16:17:16.800000')
X1_T1 VARCHAR
X2_T1 VARCHAR
X3_T1 VARCHAR
X1_T2 VARCHAR
X2_T2 VARCHAR
X3_T2 VARCHAR
I was successful at updating table 3 using a procedure that loops through timestamps and updates each row using the command:
SET tmp_T_0=(SELECT '2014-01-05 17:00:00.000000'); -- set to the start of the table's timestamp
label1: LOOP
UPDATE TABLE3 SET
X1_T1=(select X1_T1 FROM TABLE1 where T_0 = tmp_T_0),
X2_T1=(select X2_T1 FROM TABLE1 where T_0 = tmp_T_0),
X3_T1=(select X3_T1 FROM TABLE1 where T_0 = tmp_T_0),
X1_T2=(select X1_T2 FROM TABLE2 where T_0 = tmp_T_0),
X2_T2=(select X2_T2 FROM TABLE2 where T_0 = tmp_T_0),
X3_T2=(select X3_T2 FROM TABLE2 where T_0 = tmp_T_0)
WHERE T_0 = tmp_T_0;
SET tmp_T_0=(SELECT TIMESTAMP(tmp_T_0,'00:00:00.001')); -- add one millisecond and continue
SET LoopInt=(SELECT(LoopInt + 1));
IF LoopInt < LoopEnd THEN
ITERATE label1;
END IF;
LEAVE label1;
END LOOP label1;
The above method takes around 53 seconds for 100,000 entries. That is not acceptable because it would require around 100 days to complete the rest of the entries.
It should be noted that Table 3 does not have to contain data from Tables 1 and/or 2 for each of its timestamp entries (i.e., a timestamp in Table 3 may contain data for X1_T1, X2_T1 and X3_T1 while the other values X1_T2, X2_T2 and X3_T2 are NULL).
Any suggestions would help.
Thank you

What about trying this query to pull one hour's worth of info from TABLE1 to TABLE3?
UPDATE TABLE3 AS t3
JOIN TABLE1 AS t1 ON t3.T_0 = t1.T_0
SET t3.X1_T1 = IFNULL(t1.X1_T1,t3.X1_T1),
t3.X2_T1 = IFNULL(t1.X2_T1,t3.X2_T1),
t3.X3_T1 = IFNULL(t1.X3_T1,t3.X3_T1)
WHERE t3.T_0 >= '2014-01-05' + INTERVAL 0 HOUR
AND t3.T_0 < '2014-01-05' + INTERVAL 1 HOUR
What's going on? First, the WHERE clause limits the query's scope to one hour. That's handy because you can test stuff. Also, you're going to want to loop this job hour by hour so your queries don't run for too long. If you're using InnoDB or Aria as a storage engine and you don't limit the scope of your queries, you'll blow out the transaction rollback space too.
You can run this query many times, each time with a change to the HOUR interval, like so.
WHERE t3.T_0 >= '2014-01-05' + INTERVAL 1 HOUR
AND t3.T_0 < '2014-01-05' + INTERVAL 2 HOUR
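One way to drive that hour-by-hour repetition from a client script is sketched below in Python. It only generates the half-open time windows; `hour_windows` and `UPDATE_SQL` are hypothetical names, and the `%s` placeholders assume a DB-API driver that uses the "format" parameter style, so adjust them for whichever connector you actually use.

```python
from datetime import datetime, timedelta

def hour_windows(start, end, step=timedelta(hours=1)):
    """Yield half-open [lo, hi) windows that cover [start, end) without overlap."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # clamp the last window to the overall end
        yield lo, hi
        lo = hi

# Hypothetical statement template based on the query above; each window's
# (lo, hi) pair would be passed as the two parameters.
UPDATE_SQL = """
UPDATE TABLE3 AS t3
JOIN TABLE1 AS t1 ON t3.T_0 = t1.T_0
SET t3.X1_T1 = IFNULL(t1.X1_T1, t3.X1_T1),
    t3.X2_T1 = IFNULL(t1.X2_T1, t3.X2_T1),
    t3.X3_T1 = IFNULL(t1.X3_T1, t3.X3_T1)
WHERE t3.T_0 >= %s AND t3.T_0 < %s
"""

# Three one-hour windows starting at midnight on 2014-01-05.
windows = list(hour_windows(datetime(2014, 1, 5), datetime(2014, 1, 5, 3)))
```

Using `>=` on the lower bound and `<` on the upper bound means no timestamp is processed twice and none falls between two windows.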
You're JOINing TABLE1 to TABLE3. That's valid because you've said TABLE3 contains every possible timestamp, and TABLE1 doesn't. The way I have written this query, it won't touch the rows of TABLE3 that don't have corresponding rows in TABLE1. I think that's what you want.
Finally, the IFNULL() function arranges only to change the TABLE3 data when there's non-NULL TABLE1 data.
Look, if your TABLE1 data is sparse (that is, it has lots of randomly scattered valid values in a table that is mostly NULL) you probably want to use three queries like this instead, so you don't actually change rows in TABLE3 unless you have new data. Changing values in rows is relatively expensive.
UPDATE TABLE3 AS t3
JOIN TABLE1 AS t1 ON t3.T_0 = t1.T_0
SET t3.X1_T1 = t1.X1_T1
WHERE t3.T_0 >= '2014-01-05' + INTERVAL 0 HOUR
AND t3.T_0 < '2014-01-05' + INTERVAL 1 HOUR
AND t1.X1_T1 IS NOT NULL;
UPDATE TABLE3 AS t3
JOIN TABLE1 AS t1 ON t3.T_0 = t1.T_0
SET t3.X2_T1 = t1.X2_T1
WHERE t3.T_0 >= '2014-01-05' + INTERVAL 0 HOUR
AND t3.T_0 < '2014-01-05' + INTERVAL 1 HOUR
AND t1.X2_T1 IS NOT NULL;
UPDATE TABLE3 AS t3
JOIN TABLE1 AS t1 ON t3.T_0 = t1.T_0
SET t3.X3_T1 = t1.X3_T1
WHERE t3.T_0 >= '2014-01-05' + INTERVAL 0 HOUR
AND t3.T_0 < '2014-01-05' + INTERVAL 1 HOUR
AND t1.X3_T1 IS NOT NULL;
You will need to repeat all this for your TABLE2 data.
You may want to run this whole thing in a single query. Don't do that! This is the kind of job you need to be able to do an hour at a time and restart when needed. I suggest an hour at a time, but that is 3.6 megarows. You might want to do even smaller chunks at a time, like 6 minutes (360 kilorows).
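To make the "restart when needed" part concrete, here is a minimal Python sketch of a chunk loop that records how far it got. Everything in it is an assumption for illustration: `run_chunks`, the `checkpoint` dict (which in practice you would persist to a file or a small table), and `flaky_chunk`, which stands in for executing one chunked UPDATE and fails once on purpose.

```python
from datetime import datetime, timedelta

def run_chunks(start, end, step, do_chunk, checkpoint):
    """Process [start, end) in half-open chunks, resuming from the checkpoint."""
    lo = checkpoint.get("next_start", start)
    while lo < end:
        hi = min(lo + step, end)
        do_chunk(lo, hi)                # if this raises, the checkpoint stays at lo
        checkpoint["next_start"] = hi   # advance only after the chunk succeeded
        lo = hi

done = []
calls = {"n": 0}

def flaky_chunk(lo, hi):
    """Stand-in for one chunked UPDATE; the third call fails once."""
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated failure")
    done.append((lo, hi))

checkpoint = {}
start, end = datetime(2014, 1, 5), datetime(2014, 1, 5, 1)
step = timedelta(minutes=6)  # the 6-minute chunk size suggested above

try:
    run_chunks(start, end, step, flaky_chunk, checkpoint)
except RuntimeError:
    pass  # a real job would log the error and stop here

# Restart: the loop picks up at the chunk that failed.
run_chunks(start, end, step, flaky_chunk, checkpoint)
```

Because the checkpoint moves only after a chunk completes, a restart repeats at most the one chunk that was in flight, which is harmless here since the UPDATE writes the same values again.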
If I were you I'd definitely debug this whole deal on a copy of a couple of days' worth of your TABLE3.

Related

need to copy data from one table to other table with small chunks

I want to copy the data from table T1 to table T2.
There are 170 billion records in table T1.
We tried a standard SQL statement like "insert into T2 select * from T1 where date>='01-01-2022' and date<'01-02-2022';", but it creates severe performance problems or a DB outage.
I don't mind if it takes many hours, as long as the data is replicated slowly from T1 to T2 in small chunks; even a limit of 1 record at a time is fine.
How can I use stored procedures to automate this operation?
Is there any tool that can assist us in copying the data a few rows at a time?
The stored procedure may look like:
CREATE PROCEDURE copy_by_chunks (
IN date_start DATETIME,
IN date_end DATETIME,
IN chunk_in_hours INT
)
BEGIN
REPEAT
INSERT INTO table2
SELECT *
FROM table1
WHERE created_at >= date_start
AND created_at < LEAST(date_start + INTERVAL chunk_in_hours HOUR, date_end);
SET date_start = date_start + INTERVAL chunk_in_hours HOUR;
-- if you want to make a pause between chunks then use (for 1-second pause)
-- SET date_start = date_start + INTERVAL chunk_in_hours + SLEEP(1) HOUR;
UNTIL date_start >= date_end END REPEAT;
END
The recommended chunk size is one that copies 5k-10k rows per chunk.
Of course, you must check that the provided parameter values are correct (no parameter is NULL, date_start is less than date_end, and so on) and abort the procedure if they are not. You may also add progress reporting.
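The parameter checks and the clamping done by LEAST() can be sketched outside the database as follows. This is a Python illustration of the same logic, not part of the stored procedure; `chunk_bounds` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def chunk_bounds(date_start, date_end, chunk_in_hours):
    """Return the half-open [lo, hi) ranges the procedure would copy."""
    # Reject the inputs the answer warns about before doing any work.
    if date_start is None or date_end is None or chunk_in_hours is None:
        raise ValueError("no parameter may be NULL/None")
    if chunk_in_hours <= 0:
        raise ValueError("chunk_in_hours must be positive")
    if date_start >= date_end:
        raise ValueError("date_start must be earlier than date_end")

    step = timedelta(hours=chunk_in_hours)
    bounds = []
    lo = date_start
    while lo < date_end:
        # min() plays the role of LEAST(): the last chunk stops at date_end.
        bounds.append((lo, min(lo + step, date_end)))
        lo += step
    return bounds

example = chunk_bounds(datetime(2022, 1, 1), datetime(2022, 1, 1, 10), 4)
```

With a 10-hour range and 4-hour chunks, the last range is shortened to 2 hours rather than running past date_end.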

how to update a sql database table based on the condition from another table in the same database

I have two tables in a database.
table_1(device_ID, date,voltage)
table_2(device_ID,device_status)
I am trying to create an event that executes every 5 minutes.
What I am trying to achieve is: select the device_IDs from table_1 that have no new data over the last 10 minutes, and update table_2 for them, that is, set device_status to 0.
How do I pass conditions between two tables?
BEGIN
select device_ID from table_1 where date = DATE_SUB(NOW(), INTERVAL 10 MINUTE);
-- here I will get device_IDs if there was data within the last 10 minutes,
-- but I need the device_IDs for which there was no data.
-- how do I update table_2 based on the above condition?
END
You can use the results of your first query as a subquery to de-select rows (by using NOT IN) for the UPDATE:
UPDATE table_2
SET device_status = 0
WHERE device_ID NOT IN (select device_ID
from table_1
where date > DATE_SUB(NOW(), INTERVAL 10 MINUTE))
Note I think you probably want >, not = in your where condition in the subquery.
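A small runnable check of the NOT IN approach is sketched below. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so NOW() and DATE_SUB() are replaced by a cutoff computed in Python; the sample rows and the fixed "current time" are made up for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1 (device_ID INTEGER, date TEXT, voltage REAL);
CREATE TABLE table_2 (device_ID INTEGER PRIMARY KEY, device_status INTEGER);
-- device 1 reported 5 minutes ago, device 2 30 minutes ago, device 3 never
INSERT INTO table_1 VALUES (1, '2024-01-01 11:55:00', 3.3),
                           (2, '2024-01-01 11:30:00', 3.1);
INSERT INTO table_2 VALUES (1, 1), (2, 1), (3, 1);
""")

# Stand-in for DATE_SUB(NOW(), INTERVAL 10 MINUTE) with a fixed "now".
now = datetime(2024, 1, 1, 12, 0, 0)
cutoff = (now - timedelta(minutes=10)).strftime("%Y-%m-%d %H:%M:%S")

conn.execute(
    """
    UPDATE table_2
    SET device_status = 0
    WHERE device_ID NOT IN (SELECT device_ID FROM table_1 WHERE date > ?)
    """,
    (cutoff,),
)

statuses = dict(conn.execute("SELECT device_ID, device_status FROM table_2"))
```

Only device 1 has a reading newer than the cutoff, so it keeps its status while devices 2 and 3 are set to 0.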

MySQL Query - Include dates without records

I have a report that displays a graph. The X axis uses the date from the below query. Where the query returns no date, I am getting gaps and would prefer to return a value. Is there any way to force a date where there are no records?
SELECT
DATE(instime),
CASE
WHEN direction = 1 AND duration > 0 THEN 'Incoming'
WHEN direction = 2 THEN 'Outgoing'
WHEN direction = 1 AND duration = 0 THEN 'Missed'
END AS type,
COUNT(*)
FROM taxticketitem
GROUP BY
DATE(instime),
CASE
WHEN direction = 1 AND duration > 0 THEN 'Incoming'
WHEN direction = 2 THEN 'Outgoing'
WHEN direction = 1 AND duration = 0 THEN 'Missed'
END
ORDER BY DATE(instime)
One possible way is to create a table of dates and LEFT JOIN your table with them. The table could look something like this:
CREATE TABLE `datelist` (
`date` DATE NOT NULL,
PRIMARY KEY (`date`)
);
and filled with all dates between, say Jan-01-2000 through Dec-31-2050 (here is my Date Generator script).
Next, write your query like this:
SELECT datelist.date, COUNT(taxticketitem.id) AS c
FROM datelist
LEFT JOIN taxticketitem ON datelist.date = DATE(taxticketitem.instime)
WHERE datelist.date BETWEEN '2012-01-01' AND '2012-12-31'
GROUP BY datelist.date
ORDER BY datelist.date
Using a LEFT JOIN and counting the non-NULL values from the right table ensures that the count is correct (0 if no row exists for a given date).
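The effect of counting a right-table column can be checked with a tiny example. The sketch below runs the same query shape in Python's sqlite3 module as a stand-in for MySQL, with three made-up dates and three made-up call records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE datelist (date TEXT PRIMARY KEY);
CREATE TABLE taxticketitem (id INTEGER PRIMARY KEY, instime TEXT);
INSERT INTO datelist VALUES ('2012-01-01'), ('2012-01-02'), ('2012-01-03');
-- no records at all on 2012-01-02
INSERT INTO taxticketitem (instime) VALUES
    ('2012-01-01 09:00:00'),
    ('2012-01-01 17:30:00'),
    ('2012-01-03 08:15:00');
""")

rows = conn.execute("""
SELECT datelist.date, COUNT(taxticketitem.id) AS c
FROM datelist
LEFT JOIN taxticketitem ON datelist.date = DATE(taxticketitem.instime)
WHERE datelist.date BETWEEN '2012-01-01' AND '2012-01-03'
GROUP BY datelist.date
ORDER BY datelist.date
""").fetchall()
```

The date with no records still appears in `rows`, with a count of 0, because COUNT(taxticketitem.id) ignores the NULL that the LEFT JOIN produces for it.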
You would need to have a set of dates to LEFT JOIN your table to it. Unfortunately, MySQL lacks a way to generate it on the fly.
You would need to prepare a table with, say, 100000 consecutive integers from 0 to 99999 (or how long you think your maximum report range would be):
CREATE TABLE series (number INT NOT NULL PRIMARY KEY);
and use it like this:
SELECT '2013-01-01' + INTERVAL s.number DAY AS r_date, CASE ... END AS type, COUNT(ti.instime)
FROM series s
LEFT JOIN
taxticketitem ti
ON ti.instime >= '2013-01-01' + INTERVAL s.number DAY
AND ti.instime < '2013-01-01' + INTERVAL s.number + 1 DAY
WHERE s.number <= DATEDIFF('2013-02-01', '2013-01-01')
GROUP BY
r_date, type
Had to do something similar before.
You need to have a subselect to generate a range of dates. All the dates you want. Easiest with a start date added to a number:-
SELECT DATE_ADD(SomeStartDate, INTERVAL (a.i + b.i * 10) DAY)
FROM integers a, integers b
Given a table called integers with a single column called i and 10 rows containing 0 to 9, that SQL will give you a range of 100 days starting at SomeStartDate.
You can then left join your actual data against that to get the full range.
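As a quick check that the cross join really yields 100 consecutive days, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. SQLite has no DATE_ADD(), so the offset is applied with its `DATE(value, '+N days')` modifier instead; the start date is an arbitrary example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE integers (i INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO integers VALUES (?)", [(n,) for n in range(10)])

# a.i supplies the units digit and b.i the tens digit, giving offsets 0..99.
days = [
    row[0]
    for row in conn.execute(
        """
        SELECT DATE(?, '+' || (a.i + b.i * 10) || ' days') AS d
        FROM integers a, integers b
        ORDER BY d
        """,
        ("2013-01-01",),
    )
]
```

Adding a third alias (`integers c`, with `c.i * 100`) would extend the range to 1000 days in the same way.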

MySQL: Find Missing Dates Between a Date Range

I need some help with a mysql query. I've got db table that has data from Jan 1, 2011 thru April 30, 2011. There should be a record for each date. I need to find out whether any date is missing from the table.
So for example, let's say that Feb 2, 2011 has no data. How do I find that date?
I've got the dates stored in a column called reportdatetime. The dates are stored in the format: 2011-05-10 0:00:00, which is May 10, 2011 12:00:00 am.
Any suggestions?
This is a second answer, I'll post it separately.
SELECT DATE(r1.reportdatetime) + INTERVAL 1 DAY AS missing_date
FROM Reports r1
LEFT OUTER JOIN Reports r2 ON DATE(r1.reportdatetime) = DATE(r2.reportdatetime) - INTERVAL 1 DAY
WHERE r1.reportdatetime BETWEEN '2011-01-01' AND '2011-04-30' AND r2.reportdatetime IS NULL;
This is a self-join that reports a date such that no row exists with the date following.
This will find the first day in a gap, but if there are runs of multiple days missing it won't report all the dates in the gap.
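The difference between "first day of each gap" and "every missing day" is easy to see on a small set of dates. The Python sketch below mimics the self-join's logic on an in-memory set rather than querying a database; both function names are hypothetical, and the first function adds a range check so the day after the end of the range is not reported.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def first_day_of_each_gap(present, start, end):
    """Like the self-join: report day+1 for each present day with no successor."""
    return sorted(
        d + ONE_DAY
        for d in present
        if start <= d <= end
        and d + ONE_DAY not in present
        and d + ONE_DAY <= end  # extra filter, not in the SQL above
    )

def all_missing_days(present, start, end):
    """Walk the whole range and report every day that has no row."""
    missing = []
    d = start
    while d <= end:
        if d not in present:
            missing.append(d)
        d += ONE_DAY
    return missing

# Jan 3, 4 and 5 are missing: a single gap three days long.
present = {date(2011, 1, 1), date(2011, 1, 2), date(2011, 1, 6), date(2011, 1, 7)}
gap_starts = first_day_of_each_gap(present, date(2011, 1, 1), date(2011, 1, 7))
missing = all_missing_days(present, date(2011, 1, 1), date(2011, 1, 7))
```

Here `gap_starts` contains only Jan 3, while `missing` lists Jan 3, 4 and 5, which is why a full calendar of days to join against is needed to see the complete gap.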
CREATE TABLE Days (day DATE PRIMARY KEY);
Fill Days with all the days you're looking for.
mysql> INSERT INTO Days VALUES ('2011-01-01');
mysql> SET @offset := 1;
mysql> INSERT INTO Days SELECT day + INTERVAL @offset DAY FROM Days; SET @offset := @offset * 2;
Then up-arrow and repeat the INSERT as many times as needed. It doubles the number of rows each time, so you can get four months' worth of rows in seven INSERTs.
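The doubling can be simulated to confirm the "seven INSERTs" figure. This is a Python sketch of the same idea on an in-memory set, not something to run against the database; `fill_days` is a hypothetical name.

```python
from datetime import date, timedelta

def fill_days(first_day, rounds):
    """Simulate repeating the INSERT ... SELECT while the offset doubles."""
    days = {first_day}
    offset = 1
    for _ in range(rounds):
        # Each round adds a shifted copy of every existing day.
        days |= {d + timedelta(days=offset) for d in days}
        offset *= 2
    return sorted(days)

days = fill_days(date(2011, 1, 1), 7)
```

Seven rounds give 2^7 = 128 consecutive days, from 2011-01-01 through 2011-05-08, which covers the January 1 to April 30 range in the question.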
Do an exclusion join to find the dates for which there is no match in your reports table:
SELECT d.day FROM Days d
LEFT OUTER JOIN Reports r ON d.day = DATE(r.reportdatetime)
WHERE d.day BETWEEN '2011-01-01' AND '2011-04-30'
AND r.reportdatetime IS NULL;
It could be done with a more complicated single query, but I'll show a pseudo code with temp table just for illustration:
Get all dates for which we have records:
CREATE TEMP TABLE AllUsedDates
SELECT DISTINCT reportdatetime
INTO AllUsedDates;
Now add May 1st so we can track 04-30:
INSERT INTO AllUsedDates VALUES ('2011-05-01')
If there's no "next day", we found a gap:
SELECT A.NEXT_DAY
FROM
(SELECT reportdatetime AS TODAY, DATE_ADD(reportdatetime, INTERVAL 1 DAY) AS NEXT_DAY FROM AllUsedDates) AS A
WHERE
(A.NEXT_DAY NOT IN (SELECT reportdatetime FROM AllUsedDates)
AND
A.TODAY <> '2011-05-01') --exclude the last day
If you mean that reportdatetime has an entry for "Feb 2, 2011" but the other fields associated with that date are missing, as in the table snapshot below,
reportdate col1 col2
5/10/2011 abc xyz
2/2/2011
1/1/2011 bnv oda
then this query works fine
select reportdate from dtdiff where reportdate not in (select df1.reportdate from dtdiff df1, dtdiff df2 where df1.col1 = df2.col1)
Try this
SELECT DATE(t1.datefield) + INTERVAL 1 DAY AS missing_date FROM table t1 LEFT OUTER JOIN table t2 ON DATE(t1.datefield) = DATE(t2.datefield) - INTERVAL 1 DAY WHERE DATE(t1.datefield) BETWEEN '2020-01-01' AND '2020-01-31' AND DATE(t2.datefield) IS NULL;
If you want to get missing dates in a datetime field use this.
SELECT CAST(t1.datetime_field as DATE) + INTERVAL 1 DAY AS missing_date FROM table t1 LEFT OUTER JOIN table t2 ON CAST(t1.datetime_field as DATE) = CAST(t2.datetime_field as DATE) - INTERVAL 1 DAY WHERE CAST(t1.datetime_field as DATE) BETWEEN '2020-01-01' AND '2020-07-31' AND CAST(t2.datetime_field as DATE) IS NULL;
The solutions above seem to work, but they seem EXTREMELY slow (taking possibly hours, I waited for 30 min only) at least in my database.
This query takes less than a second on the same database (of course, you need to repeat it manually a dozen times, and possibly change the function names, to find the actual dates). pvm = my datetime column, WEATHER = my table.
mysql> select year(pvm) as _year,count(distinct(date(pvm))) as _days from WEATHER where year(pvm)>=2000 and month(pvm)=1 group by _year order by _year asc;
--ako

Finding entire fluctuation in a dataset

I have a table of historic data for a set of tanks in a MySQL database. I want to find fluctuations in the volume of tank contents of greater than 200 gallons/hour. My SQL statement thus far is:
SELECT t1.tankhistid as start, t2.tankhistid as end
FROM
(SELECT * from tankhistory WHERE tankid = ? AND curtime BETWEEN ? AND ?) AS t1
INNER JOIN
(SELECT * from tankhistory WHERE tankid = ? AND curtime BETWEEN ? AND ?) AS t2
ON t1.tankid = t2.tankid AND t1.curtime < t2.curtime
WHERE TIMESTAMPDIFF(HOUR, t1.curtime, t2.curtime) < 1 AND ABS(t1.vol - t2.vol) > 200
ORDER BY t1.tankhistid, t2.tankhistid
In the code above, curtime is a timestamp at the time of inserting the record, tankhistid is the table integer primary key, tankid is the individual tank id, and vol is the volume reading.
This returns too many results since data is collected every 5 minutes and fluctuations could take hours (multiple rows with the same id in an end and then start column) , or just over 10 minutes (multiple rows with the same start or end id). Example output:
7514576,7515478
7515232,7515478
7515314,7515478
7515396,7515478
7515478,7515560
7515478,7515642
7515478,7515724
Note that all of these rows should just be one: 7514576,7515724. The query takes 4 minutes for just one day of a tank's data, so any speed up would be great as well. I am guessing there is a way to take the current query and use it as a subquery, but I am not sure how to do the filtering.