I am saving tables from Spark SQL using MySQL as my storage engine. My table looks like
+-------------+----------+
| count| date|
+-------------+----------+
| 72|2017-09-08|
| 84|2017-09-08|
+-------------+----------+
I want to UPDATE the table by adding the count using GROUP BY and dropping the individual rows. So my output should be like
+-------------+----------+
| count| date|
+-------------+----------+
| 156|2017-09-08|
+-------------+----------+
Is this a reasonable expectation, and if so, how can it be achieved using Spark SQL?
Before you write the table to MySQL, apply the following aggregation to your Spark DataFrame/Dataset:
import org.apache.spark.sql.functions._
df.groupBy("date").agg(sum("count").as("count"))
Then write the transformed DataFrame to MySQL.
Soln 1
In MySQL, you can make use of TEMPORARY TABLE to store the results after grouping.
Then truncate the original table.
Now insert data from temporary table to original table.
CREATE TEMPORARY TABLE temp_table
AS
(SELECT SUM(`count`) AS `count`, `date` FROM table_name GROUP BY `date`);
TRUNCATE TABLE table_name;
INSERT INTO table_name (`count`, `date`)
SELECT `count`, `date` FROM temp_table;
DROP TEMPORARY TABLE temp_table;
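The three steps above (group into a temporary table, empty the original, copy back) can be tried out in a minimal sketch. It assumes Python's sqlite3 module as a stand-in for MySQL, so TRUNCATE is replaced by DELETE; the table and column names follow the answer, and the two sample rows come from the question.

```python
import sqlite3

# SQLite stands in for MySQL here; this is an illustration, not MySQL syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (count INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO table_name (count, date) VALUES (?, ?)",
    [(72, "2017-09-08"), (84, "2017-09-08")],
)

with conn:  # commit on success, roll back if any step fails
    # 1. Store the grouped totals in a temporary table.
    conn.execute(
        "CREATE TEMPORARY TABLE temp_table AS "
        "SELECT SUM(count) AS count, date FROM table_name GROUP BY date"
    )
    # 2. Empty the original table (SQLite has no TRUNCATE, so DELETE is used).
    conn.execute("DELETE FROM table_name")
    # 3. Copy the totals back and drop the temporary table.
    conn.execute(
        "INSERT INTO table_name (count, date) SELECT count, date FROM temp_table"
    )
    conn.execute("DROP TABLE temp_table")

rows = conn.execute("SELECT count, date FROM table_name").fetchall()
print(rows)  # [(156, '2017-09-08')]
```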
Soln 2
First update every row with the total for its date, using the following query.
UPDATE table_name t
INNER JOIN
(SELECT SUM(`count`) AS `count`, `date` FROM table_name GROUP BY `date`) t1
ON t.`date` = t1.`date`
SET t.`count` = t1.`count`;
Then, assuming that the table has a unique column named uid, delete all but the first row for each date:
DELETE t1 FROM table_name t1, table_name t2
WHERE t1.uid > t2.uid AND t1.`date` = t2.`date`;
Please refer to this SO question for more about deleting duplicate rows.
Consider MySQL tables: table1, table2
table1:
+------+------+
| col1 | col2 |
+------+------+
| 1 | a |
| 2 | b |
| 3 | c |
+------+------+
table2:
+------+------+
| col1 | col2 |
+------+------+
| 1 | a |
| 2 | b |
+------+------+
What is the most efficient way to delete the rows in table1 based on the rows in table2 such that the desired output looks like this:
+------+------+
| col1 | col2 |
+------+------+
| 3 | c |
+------+------+
Please note that this is a minimal example of a problem I am having with two very large tables.
Here is code to create table1 and table2:
DROP TABLE IF EXISTS table1;
CREATE TABLE table1 (
col1 BIGINT,
col2 TEXT
);
INSERT INTO table1 VALUES (1, 'a');
INSERT INTO table1 VALUES (2, 'b');
INSERT INTO table1 VALUES (3, 'c');
DROP TABLE IF EXISTS table2;
CREATE TABLE table2 (
col1 BIGINT,
col2 TEXT
);
INSERT INTO table2 VALUES (1, 'a');
INSERT INTO table2 VALUES (2, 'b');
MySQL = 5.7.12
Question:
From reading this site and others I notice that there are several ways to do this operation in MySQL. I am wondering which is the fastest way for large tables (30M+ rows)? Here are some ways I have discovered:
1. method using DELETE
DELETE t1
FROM table1 t1
INNER JOIN table2 t2
ON t1.col1=t2.col1;
2. method using DELETE FROM
DELETE FROM t1
USING table1 t1
INNER JOIN table2 t2
ON ( t1.col1 = t2.col1 );
3. method using DELETE FROM
DELETE FROM table1 WHERE col1 in (SELECT col1 FROM table2);
Is there a faster way to do this that I have not listed here?
I will suggest another method. It is not as convenient as the methods you listed, but it may be much faster for large tables.
It is described in the MySQL documentation (https://dev.mysql.com/doc/refman/8.0/en/delete.html):
InnoDB Tables
If you are deleting many rows from a large table, you may exceed the lock table size for an InnoDB table. To avoid this problem, or simply to minimize the time that the table remains locked, the following strategy (which does not use DELETE at all) might be helpful:
Select the rows not to be deleted into an empty table that has the same structure as the original table:
INSERT INTO t_copy SELECT * FROM t WHERE ... ;
Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:
RENAME TABLE t TO t_old, t_copy TO t;
Drop the original table:
DROP TABLE t_old;
--Follow below steps:
--Rename the table:
RENAME TABLE table1 TO table1_old;
--Create new table with primary key and all necessary indexes:
CREATE TABLE table1 LIKE table1_old;
-- USE THIS FOR MyISAM TABLES:
SET UNIQUE_CHECKS=0;
-- While LOCK TABLES is in effect, every table a statement uses must be locked, including the target table1
LOCK TABLES table1 WRITE, table1_old WRITE, table2 WRITE;
ALTER TABLE table1 DISABLE KEYS;
INSERT INTO table1 (select * from table1_old where col1 not in (select col1 from table2 ));
ALTER TABLE table1 ENABLE KEYS;
SET UNIQUE_CHECKS=1;
UNLOCK TABLES;
-- USE THIS FOR InnoDB TABLES:
SET AUTOCOMMIT = 0;
SET UNIQUE_CHECKS=0;
SET FOREIGN_KEY_CHECKS=0;
LOCK TABLES table1 WRITE, table1_old WRITE, table2 WRITE;
INSERT INTO table1 (select * from table1_old where col1 not in (select col1 from table2 ));
SET FOREIGN_KEY_CHECKS=1;
SET UNIQUE_CHECKS=1;
COMMIT; SET AUTOCOMMIT = 1;
UNLOCK TABLES;
CREATE TABLE t_new LIKE t;
INSERT INTO t_new
SELECT t.*
FROM t
LEFT JOIN exclude ON ...
WHERE exclude.id IS NULL;
RENAME TABLE t TO t_old,
t_new TO t;
DROP TABLE t_old;
DELETE (and UPDATE) choke on handling a huge number of rows; SELECT does not.
A possible optimization on this would be to drop all indexes except the PRIMARY KEY and re-add them after finishing.
(FOREIGN KEYs can be a big nuisance; do you have any?)
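The copy-and-rename strategy can be sketched end to end on the table1/table2 example from the question. This is only an illustration under stated assumptions: Python's sqlite3 module stands in for MySQL, the name table1_new is made up for the sketch, and SQLite needs two ALTER TABLE ... RENAME statements where MySQL can swap both names in a single atomic RENAME TABLE.

```python
import sqlite3

# SQLite stands in for MySQL; table1_new is a hypothetical name for the copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE table2 (col1 INTEGER, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1, 'a'), (2, 'b');

    -- Build the replacement table from the rows that should survive
    -- (an anti-join: rows of table1 with no match in table2).
    CREATE TABLE table1_new (col1 INTEGER, col2 TEXT);
    INSERT INTO table1_new
    SELECT t1.*
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t1.col1 = t2.col1
    WHERE t2.col1 IS NULL;

    -- Swap the tables, then drop the original.
    ALTER TABLE table1 RENAME TO table1_old;
    ALTER TABLE table1_new RENAME TO table1;
    DROP TABLE table1_old;
""")

remaining = conn.execute("SELECT col1, col2 FROM table1").fetchall()
print(remaining)  # [(3, 'c')]
```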
This question stumps me. I have a database with a table that has a primary key that consists of two fields. In the end I require that the primary key only be one field, but I need to delete the duplicate entries from the table.
In other words the table has:
PRIMARY KEY (`field1`, `field2`)
There are entries that have duplicate field1 and different field2. So I have entries like this:
field1 | field2
1 | 1
1 | 2
2 | 1
2 | 2
3 | 1
4 | 1
I want to delete 1 of each of those entries that have duplicates on field1.
How can I do this with MySQL / SQL?
I think this will work in your case:
DELETE t1 FROM `table` t1
INNER JOIN `table` t2
WHERE t1.field2 > t2.field2
AND t1.field1 = t2.field1;
This query joins the table to itself, picks the rows that share a field1 value with another row but have a larger field2, and removes them, so only the row with the smallest field2 is kept for each field1.
Hope this works!!
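To check the "keep one row per field1" idea on the sample data from the question, here is a small sketch. It assumes Python's sqlite3 module as a stand-in for MySQL; SQLite has no multi-table DELETE ... JOIN, so the same condition is written as a correlated EXISTS, and the table name t is made up for the example.

```python
import sqlite3

# SQLite stands in for MySQL; t is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (field1 INTEGER, field2 INTEGER, PRIMARY KEY (field1, field2))"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (4, 1)],
)

# Delete every row for which another row has the same field1 and a smaller
# field2, so only the smallest field2 survives for each field1.
conn.execute(
    """
    DELETE FROM t
    WHERE EXISTS (
        SELECT 1 FROM t AS keep
        WHERE keep.field1 = t.field1 AND keep.field2 < t.field2
    )
    """
)

rows = conn.execute("SELECT field1, field2 FROM t ORDER BY field1").fetchall()
print(rows)  # [(1, 1), (2, 1), (3, 1), (4, 1)]
```

After this, field1 is unique and could serve as a single-column primary key.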
I don't know exactly how the delete needs to be written in MySQL syntax, but essentially you are trying to remove the second entry for each field1 value that occurs twice. If you can select those records, you can pass that select statement to a delete from clause.
For instance, here is a query that selects the second row for each value of field1 that is repeated:
select field1, field2
from
(
select *, count(*) over (partition by field1) as ct
, rank() over (partition by field1 order by field2 desc) as rn
from temp
) where rn = 1 and ct = 2
In your case it would return below records
field1 field2
1 2
2 2
So then all you need to do is have a delete from table clause at the top of that select statement.
NOTE - I have tried a solution without a join and hence I maintain these 2 analytical functions.
For instance this works in something like BigQuery -
delete from TABLE where concat(field1, field2) in
(
select concat(field1, field2)
from
(
select *, count(*) over (partition by field1) as ct
, rank() over (partition by field1 order by field2 desc) as rn
from TABLE
) where rn = 1 and ct = 2
)
Hello – I have a DB table (MySQL ver 5.6.41-84.1-log) that has about 92,000 entries, with columns for:
id (incremental unique ID)
post_type (not important)
post_id (not important, but shows relation to another table)
user_id (not important)
vote (not important)
ip (IP Address, ie. 123.123.123.123)
voted (Datestamp in GMT, ie. 2018-12-03 04:50:05)
I recently ran a contest and we had a rule that no single IP could vote more than 60 times per day. So now I need to run a custom SQL formula that applies the following rule:
For each IP address, for each day, if there are > 60 rows, delete those additional rows.
Thank you for your help!
This is a complicated one, and I think it is hard to provide a 100% sure answer without actual table and data to play with.
However, let me try to describe the logic and build the query step by step, so you can play around with it and fix any lurking errors.
1) We start by selecting all IP addresses that posted more than 60 votes on a given day. For this we use a group by on the voting day and on the IP address, combined with a having clause:
select date(voted), ip_adress
from table
group by date(voted), ip_adress
having count(*) > 60
2) From there, we go back to the table and select the first 60 ids corresponding to each voting day / IP address pair. id is an auto-incremented field, so we just sort on this field and then use the MySQL limit clause:
select id, ip_adress, date(voted) as day_voted
from table
where (ip_adress, date(voted)) in (
select ip_adress, date(voted)
from table
group by date(voted), ip_adress
having count(*) > 60
)
order by id
limit 60
3) Finally, we go back once again to the table and search for all the ids whose IP address and day of vote belong to the above list, but whose id is greater than the max id of the list. This is achieved with a join and requires a group by clause.
select t1.id
from
table t1
join (
select id, ip_adress, date(voted) as day_voted
from table
where (ip_adress, date(voted)) in (
select ip_adress, date(voted)
from table
group by date(voted), ip_adress
having count(*) > 60
)
order by id
limit 60
) t2
on t1.ip_adress = t2.ip_adress
and date(t1.voted) = t2.day_voted and t1.id > max(t2.id)
group by t1.id
That should return the list of all ids that we need to delete. Test it before you go further.
4) The very last step is to delete those ids. There are limitations in MySQL that make a delete with a subquery condition on the same table awkward to write. See the following SO question for more information on the technical background. You can either use a temporary table to store the selected ids, or try to outsmart MySQL by wrapping the subquery and aliasing it. Let us try the second option:
delete t.* from table t where id in ( select id from (
select t1.id
from
table t1
join (
select id, ip_adress, date(voted) as day_voted
from table
where (ip_adress, date(voted)) in (
select ip_adress, date(voted)
from table
group by date(voted), ip_adress
having count(*) > 60
)
order by id
limit 60
) t2
on t1.ip_adress = t2.ip_adress
and date(t1.voted) = t2.day_voted
and t1.id > max(t2.id)
group by t1.id
) x );
Hope this helps !
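The "select the ids first, test, then delete" workflow can be tried on a toy table. The sketch below is an assumption-laden illustration, not the query above: Python's sqlite3 module stands in for MySQL, the table name votes and the sample rows are made up, the cap is 2 instead of 60 to keep the data small, and a correlated count is used so that the limit applies to each IP/day pair separately.

```python
import sqlite3

MAX_PER_DAY = 2  # the contest rule was 60; 2 keeps the demo data small

# SQLite stands in for MySQL; `votes` is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (id INTEGER PRIMARY KEY, ip TEXT, voted TEXT)")
conn.executemany(
    "INSERT INTO votes (ip, voted) VALUES (?, ?)",
    [
        ("1.1.1.1", "2018-12-03 04:50:05"),  # id 1
        ("1.1.1.1", "2018-12-03 05:00:00"),  # id 2
        ("1.1.1.1", "2018-12-03 06:00:00"),  # id 3, third vote that day
        ("1.1.1.1", "2018-12-04 01:00:00"),  # id 4, a different day
        ("2.2.2.2", "2018-12-03 07:00:00"),  # id 5, a different IP
        ("1.1.1.1", "2018-12-03 08:00:00"),  # id 6, fourth vote that day
    ],
)

# Step 1: list the ids to delete. A row is in excess when more than
# MAX_PER_DAY rows with the same ip and day have an id <= its own id.
excess_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT v.id
        FROM votes AS v
        WHERE (SELECT COUNT(*)
               FROM votes AS earlier
               WHERE earlier.ip = v.ip
                 AND date(earlier.voted) = date(v.voted)
                 AND earlier.id <= v.id) > ?
        ORDER BY v.id
        """,
        (MAX_PER_DAY,),
    )
]
print(excess_ids)  # [3, 6]

# Step 2: once the list looks right, delete exactly those ids.
conn.executemany("DELETE FROM votes WHERE id = ?", [(i,) for i in excess_ids])
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM votes ORDER BY id")]
print(remaining_ids)  # [1, 2, 4, 5]
```

The correlated count is quadratic within each IP/day group, which should be acceptable for a one-off cleanup of a table this size but is worth testing on a copy first.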
You could approach this by vastly simplifying your sample data and using row-number simulation for MySQL versions prior to 8.0, or window functions for version 8.0 and above. I assume you are not on version 8 or above in the following example.
drop table if exists t;
create table t(id int auto_increment primary key,ip varchar(2));
insert into t (ip) values
(1),(1),(3),(3),
(2),
(3),(3),(1),(2);
delete t1 from t t1 join
(
select id,rownumber from
(
select t.*,
if(ip <> @p,@r:=1,@r:=@r+1) rownumber,
@p:=ip p
from t
cross join (select @r:=0,@p:=0) r
order by ip,id
)s
where rownumber > 2
) a on a.id = t1.id;
Working from the inside out: subquery s allocates a row number per ip, subquery a then picks the rows with a row number > 2, and the outer multi-table delete removes those rows from t by joining to a, which gives
+----+------+
| id | ip |
+----+------+
| 1 | 1 |
| 2 | 1 |
| 3 | 3 |
| 4 | 3 |
| 5 | 2 |
| 9 | 2 |
+----+------+
6 rows in set (0.00 sec)
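For the window-function variant mentioned above (MySQL 8.0 or later), a rough equivalent can be sketched with ROW_NUMBER(). The sketch assumes Python's sqlite3 module as a stand-in for MySQL and needs an SQLite build of 3.25 or newer for window functions; the data is the same nine rows as in the example.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for MySQL 8.0+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, ip TEXT)")
# Same ip sequence as the example: 1,1,3,3,2,3,3,1,2 (ids 1 to 9).
conn.executemany("INSERT INTO t (ip) VALUES (?)", [(ip,) for ip in "113323312"])

# Number the rows of each ip by id and delete everything past the second.
conn.execute(
    """
    DELETE FROM t
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY ip ORDER BY id) AS rownumber
            FROM t
        ) AS s
        WHERE rownumber > 2
    )
    """
)

remaining = conn.execute("SELECT id, ip FROM t ORDER BY id").fetchall()
print(remaining)
# [(1, '1'), (2, '1'), (3, '3'), (4, '3'), (5, '2'), (9, '2')]
```

To apply a per-day cap, the partition would presumably become PARTITION BY ip, date(voted).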
I had someone help me write the following query, which addressed my question.
SET SQL_SAFE_UPDATES = 0;
create table temp( SELECT id, ip, voted
FROM
(SELECT id, ip, voted,
@ip_rank := IF(@current_ip = ip, @ip_rank + 1, 1) AS ip_rank,
@current_ip := ip
FROM `table_name` where ip in (SELECT ip from `table_name` group by date(voted),ip having count(*) >60)
ORDER BY ip, voted desc
) ranked
WHERE ip_rank <= 2);
DELETE FROM `table_name`
WHERE id not in (select id from temp) and ip in (select ip from temp);
drop table temp;
I am trying to use the IN operator to get the count of certain fields in the table.
This is my query:
SELECT order_id, COUNT(*)
FROM remake_error_type
WHERE order_id IN (1, 2, 100)
GROUP BY order_id;
My current output:
| order_id | COUNT(*) |
+----------+----------+
| 1 | 8 |
| 2 | 8 |
My expected output:
| order_id | COUNT(*) |
+----------+----------+
| 1 | 8 |
| 2 | 8 |
| 100 | 0 |
You can write your query this way:
SELECT t.id, COUNT(remake_error_type.order_id)
FROM
(SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 100) as t
LEFT JOIN remake_error_type
ON t.id = remake_error_type.order_id
GROUP BY
t.id
A LEFT JOIN returns all rows from the subquery on the left, and COUNT(remake_error_type.order_id) counts only the rows where the join succeeds, so an order_id with no matches gets a count of 0.
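The difference between counting the joined column and counting rows is easy to see in a small sketch. It assumes Python's sqlite3 module as a stand-in for MySQL and uses sample data matching the question (8 rows each for orders 1 and 2, none for 100).

```python
import sqlite3

# SQLite stands in for MySQL; the data mirrors the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE remake_error_type (order_id INTEGER)")
conn.executemany(
    "INSERT INTO remake_error_type (order_id) VALUES (?)",
    [(1,)] * 8 + [(2,)] * 8,
)

counts = conn.execute(
    """
    SELECT t.id,
           COUNT(r.order_id) AS matched,   -- joined rows only: 0 for order 100
           COUNT(*)          AS all_rows   -- includes the NULL-padded row: 1 for order 100
    FROM (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 100) AS t
    LEFT JOIN remake_error_type AS r ON t.id = r.order_id
    GROUP BY t.id
    ORDER BY t.id
    """
).fetchall()
print(counts)  # [(1, 8, 8), (2, 8, 8), (100, 0, 1)]
```

COUNT(*) reports 1 for order 100 because the LEFT JOIN still produces one row for it, with NULLs on the right side.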
You can create a temporary table, insert as many order_ids as required, and perform the left join to remake_error_type. At a small number of orders the other answers are sufficient, but if you were doing this for a lot of orders, UNION ALL and sub-queries are inefficient, both to type it up and to execute on the server.
Additionally, this is a very dynamic approach, because you can control easily the values in your temp table by modifying the insert statement.
However, this will only work if the database user has sufficient privileges: at least select, create temporary and drop table.
DROP TABLE IF EXISTS myTempOrders;
CREATE TEMPORARY TABLE myTempOrders (order_id INTEGER, PRIMARY KEY(order_id));
INSERT INTO myTempOrders (order_id) VALUES (1), (2), (100);
SELECT temp.order_id, count(remake_error_type.order_id)
FROM myTempOrders temp
LEFT JOIN remake_error_type ON temp.order_id = remake_error_type.order_id
GROUP BY 1
If the order_id values exist in some table, then it is possible to extract the desired result without creating a temporary table and inserting values into it.
To qualify, the table must
have an auto increment primary key with # rows greater than the maximum sought order_id value
have a starting increment value less than the minimum sought order_id value
have no missing values in the primary key (i.e. no records have been deleted)
If a qualified table exists, you can run the following query, replacing my_qualified_table with the qualified table's name and surrogate_id with its auto-incrementing primary key:
SELECT surrogate.surrogate_id, count(remake_error_type.order_id)
FROM my_qualified_table surrogate
LEFT JOIN remake_error_type ON surrogate.surrogate_id = remake_error_type.order_id
WHERE surrogate.surrogate_id IN (1, 2, 100)
GROUP BY 1
You could use a union for this. No, this does not use the IN operator, but it is an alternative that will give you your expected results. One option is to hardcode the order_id and use conditional aggregation to get the SUM() of rows with that id:
SELECT 1 AS order_id, SUM(order_id = 1) AS numOrders FROM myTable
UNION ALL
SELECT 2 AS order_id, SUM(order_id = 2) AS numOrders FROM myTable
UNION ALL
SELECT 100 AS order_id, SUM(order_id = 100) AS numOrders FROM myTable;
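The conditional-aggregation trick relies on a comparison evaluating to 1 or 0, which SUM() then adds up. A minimal sketch, assuming Python's sqlite3 module as a stand-in for MySQL (SQLite treats comparisons the same way) and sample data with 8 rows each for orders 1 and 2:

```python
import sqlite3

# SQLite stands in for MySQL; myTable holds 8 rows each for orders 1 and 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (order_id INTEGER)")
conn.executemany(
    "INSERT INTO myTable (order_id) VALUES (?)",
    [(1,)] * 8 + [(2,)] * 8,
)

# Each branch scans the table once; (m.order_id = N) is 1 for a match, else 0.
rows = conn.execute(
    """
    SELECT 1 AS order_id, SUM(m.order_id = 1) AS numOrders FROM myTable AS m
    UNION ALL
    SELECT 2 AS order_id, SUM(m.order_id = 2) AS numOrders FROM myTable AS m
    UNION ALL
    SELECT 100 AS order_id, SUM(m.order_id = 100) AS numOrders FROM myTable AS m
    ORDER BY order_id
    """
).fetchall()
print(rows)  # [(1, 8), (2, 8), (100, 0)]
```

Note that SUM() over an empty table yields NULL rather than 0, so this form depends on the table having at least one row.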
Here is an SQL Fiddle example.
I have a table with columns like id, length, time, and some rows are duplicates: length and time are the same in several rows. I want to delete all copies except the first row submitted.
id | length | time
01 | 255232 | 1242
02 | 255232 | 1242 <- Delete that one
I have this to show all duplicates in table.
SELECT idgarmin_track, length , time
FROM `80dage_garmin_track`
WHERE length in
( SELECT length
FROM `80dage_garmin_track`
GROUP
BY length
HAVING count(*) > 1 )
ORDER BY idgarmin_track, length, time LIMIT 0,500
DELETE FROM `80dage_garmin_track` t1
WHERE EXISTS (SELECT 1 from `80dage_garmin_track` t2
WHERE t1.Length = t2.Length
AND t1.Time = t2.Time
AND t1.idgarmin_track > t2.idgarmin_track)
If you can take your table offline for a period, then the simplest way is to build a new table containing the data you want and then drop the original table:
create table `80dage_garmin_track_un` like `80dage_garmin_track`;
insert into `80dage_garmin_track_un`
select min(idgarmin_track), length, time
from `80dage_garmin_track`
group by length, time;
rename table `80dage_garmin_track` to old, `80dage_garmin_track_un` to `80dage_garmin_track`;
drop table old;
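The rebuild approach can be sketched on a few rows to confirm that only the lowest id per (length, time) pair survives. This is an illustration under assumptions: Python's sqlite3 module stands in for MySQL, the short table names track, track_un and track_old are made up, and the two renames are separate statements because SQLite has no multi-table RENAME TABLE.

```python
import sqlite3

# SQLite stands in for MySQL; track, track_un and track_old are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (idgarmin_track INTEGER PRIMARY KEY, length INTEGER, time INTEGER);
    INSERT INTO track VALUES (1, 255232, 1242), (2, 255232, 1242), (3, 100, 50);

    -- New table with the same structure, filled with one row per (length, time).
    CREATE TABLE track_un (idgarmin_track INTEGER PRIMARY KEY, length INTEGER, time INTEGER);
    INSERT INTO track_un
    SELECT MIN(idgarmin_track), length, time
    FROM track
    GROUP BY length, time;

    -- Swap the tables and drop the original.
    ALTER TABLE track RENAME TO track_old;
    ALTER TABLE track_un RENAME TO track;
    DROP TABLE track_old;
""")

rows = conn.execute(
    "SELECT idgarmin_track, length, time FROM track ORDER BY idgarmin_track"
).fetchall()
print(rows)  # [(1, 255232, 1242), (3, 100, 50)]
```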
I had the same problem, Holsteinkaa. I just use it like this:
delete from table where id in ( select * from (
SELECT id FROM table t1
WHERE EXISTS (SELECT 1 from table t2
WHERE t1.field = t2.field
AND t1.id > t2.id
)
) as tmp )
I was trying to post this as a comment on Michael Pakhantsov's answer, but I can't, sorry.