MySQL: delete rows between repeating specific values

SQL rookie here. I have a broken punch-in/punch-out table with millions of records, fed by a legacy app that did not check for a previous login/logout before merrily inserting another duplicate record. The app is fixed, but I need to sanitize the table while retaining the historical data so it can be fed into future reports.
In a nutshell, I'm trying to keep each earliest login row followed by the next earliest logout row, and discard everything else in between. The bad app allowed both duplicate logins AND logouts... grrrr.
Every "duplicate row" question I've searched for here doesn't seem to apply to this kind of grouping. As a long-time SO lurker I know you'd like to see what I've already tried, but my dozens of goofy query attempts aren't coming close. Any guidance would be greatly appreciated.
Here's the table, what I'm trying to do, and the fiddle with the schema:
+---------------------+-------+-------------+---------------+
| calldate            | agent | etype       | uniqueid      |
+---------------------+-------+-------------+---------------+
| 2018-02-02 19:26:47 | 501   | agentlogin  | 1517599607.71 |
| 2018-02-02 19:26:55 | 501   | agentlogin  | 1517599615.72 | <-- delete
| 2018-02-02 19:27:32 | 501   | agentlogoff | 1517599652.73 |
| 2018-02-02 19:27:43 | 501   | agentlogin  | 1517599663.74 |
| 2018-02-02 19:28:24 | 501   | agentlogoff | 1517599704.75 |
| 2018-02-02 19:29:02 | 501   | agentlogoff | 1517599742.76 | <-- delete
| 2018-02-02 19:29:39 | 501   | agentlogoff | 1517599778.77 | <-- delete
| 2018-02-02 19:34:54 | 501   | agentlogin  | 1517600094.80 |
| 2018-02-02 19:35:23 | 501   | agentlogin  | 1517600122.81 | <-- delete
| 2018-02-02 19:35:49 | 501   | agentlogin  | 1517600149.82 | <-- delete
| 2018-02-02 19:36:04 | 501   | agentlogoff | 1517600164.83 |
| 2018-02-02 19:36:08 | 501   | agentlogoff | 1517600168.84 | <-- delete
+---------------------+-------+-------------+---------------+

I would create a copy of the table with an auto_increment column. That way you can compare two neighboring rows more easily and more efficiently.
Then find the rows in the new table that have the same agent and etype as the previous row, and join that result back to the original table on the unique column in a DELETE statement.
create table tmp (
  `id` int unsigned auto_increment primary key,
  `calldate` datetime,
  `uniqueid` varchar(32),
  `agent` varchar(80),
  `etype` varchar(80)
) as
  select null as id, calldate, uniqueid, agent, etype
  from test
  order by agent, calldate, uniqueid;
delete t
from tmp t1
join tmp t2
  on t2.id = t1.id + 1
 and t2.agent = t1.agent
 and t2.etype = t1.etype
join test t on t.uniqueid = t2.uniqueid;
drop table tmp;
Demo: http://sqlfiddle.com/#!9/3e96b/2
You should, however, first have an index on uniqueid.
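If the server is on MySQL 8.0 or later (newer than this answer assumes), a window function can spot the same "repeats the previous etype" rows without the temporary table. A minimal sketch, assuming the same test table as in the fiddle; the derived table is also what lets the DELETE reference test in a subquery:
delete t
from test as t
join (
    -- LAG() looks at the previous row for the same agent in calldate order
    select uniqueid,
           etype,
           lag(etype) over (partition by agent
                            order by calldate, uniqueid) as prev_etype
    from test
) as d on d.uniqueid = t.uniqueid
where d.etype = d.prev_etype;  -- same etype as the previous row => duplicate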

Here you go. This keeps only the first row of each consecutive run of the same etype per agent: a row is discarded when an earlier row with the same agent and etype exists with no row of the opposite etype in between.
select calldate, agent, etype, uniqueid
from test as t1
where not exists
      (select *
       from test as t2
       where t2.agent = t1.agent
         and t2.etype = t1.etype
         and t2.uniqueid < t1.uniqueid
         and t2.uniqueid > ifnull((select max(uniqueid)
                                   from test t3
                                   where t3.agent = t1.agent
                                     and t3.etype <> t1.etype
                                     and t3.uniqueid < t1.uniqueid), 0))
order by uniqueid;
http://sqlfiddle.com/#!9/149802/16
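That query only selects the rows to keep. To actually purge the rest, one option (a sketch, not part of the original answer) is to anti-join on uniqueid; the extra derived-table wrapper is the usual trick to let a DELETE on test reference test in a subquery:
delete from test
where uniqueid not in (
    select uniqueid
    from (
        -- the keeper query from above, reduced to its key column
        select t1.uniqueid
        from test as t1
        where not exists
              (select *
               from test as t2
               where t2.agent = t1.agent
                 and t2.etype = t1.etype
                 and t2.uniqueid < t1.uniqueid
                 and t2.uniqueid > ifnull((select max(uniqueid)
                                           from test t3
                                           where t3.agent = t1.agent
                                             and t3.etype <> t1.etype
                                             and t3.uniqueid < t1.uniqueid), 0))
    ) as keepers
);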

Related

Best Approach to correct column with different spellings in mysql

I have a table with a column whose data contains spelling errors.
Like:
apple, appl, aple
bana, banana, banna
cat, cot, cta
I would like to correct all the misspellings to a single correct value each. There are thousands of rows.
What would be the best approach to correcting this, so that I don't have to fix each spelling error manually?
I have added a status column iscorrect = 'Y' for the correct ones.
Here's a thought, using SOUNDEX. SOUNDEX is really a lousy function, and certainly no panacea, but it might reduce a data set comprising thousands of errors to a data set comprising hundreds of errors.
For the rest, we can look at things like Levenshtein distance, but ultimately, you're going to need a manual approach to some extent...
DROP TABLE IF EXISTS bad_data;
CREATE TABLE bad_data
(id SERIAL PRIMARY KEY
,string VARCHAR(12) NOT NULL
);
INSERT INTO bad_data (string) VALUES
('apple'),
('appl'),
('aple'),
('bana'),
('banana'),
('banna'),
('cat'),
('cot'),
('cta');
DROP TABLE IF EXISTS good_data;
CREATE TABLE good_data
(id SERIAL PRIMARY KEY
,string VARCHAR(12) NOT NULL UNIQUE
);
INSERT INTO good_data(string) VALUES
('apple'),
('banana'),
('cat');
SELECT *
FROM bad_data x
JOIN good_data y ON SOUNDEX(x.string) = SOUNDEX(y.string);
+----+--------+----+--------+
| id | string | id | string |
+----+--------+----+--------+
|  1 | apple  |  1 | apple  |
|  2 | appl   |  1 | apple  |
|  3 | aple   |  1 | apple  |
|  4 | bana   |  2 | banana |
|  5 | banana |  2 | banana |
|  6 | banna  |  2 | banana |
|  7 | cat    |  3 | cat    |
|  8 | cot    |  3 | cat    |
|  9 | cta    |  3 | cat    |
+----+--------+----+--------+
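If the matches look right, the correction itself can be a multi-table UPDATE over that same join. A sketch, assuming you simply want to overwrite the bad strings in place (review the join output first; on real data SOUNDEX will throw up some false matches):
UPDATE bad_data x
JOIN good_data y
  ON SOUNDEX(x.string) = SOUNDEX(y.string)
SET x.string = y.string
WHERE x.string <> y.string;  -- only touch rows that actually differ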

Update table data using foreign key

I have two tables, structured like below.
Table1
+--------+----------------+-----------------------+
| Serial | Src            | Albumid (primary key) |
+--------+----------------+-----------------------+
| 1      | /root/wewe.jpg | 20                    |
| 2      | /root/wewe.jpg | 21                    |
| 3      | /root/wewe.jpg | 21                    |
| 4      | /root/wewe.jpg | 23                    |
| 5      | /root/wewe.jpg | 18                    |
+--------+----------------+-----------------------+
Table2 (Albumid is a foreign key referencing the first table)
+---------+-----------+------------+
| Albumid | Albumname | AlbumCover |
+---------+-----------+------------+
| 20      | AAA       | null       |
| 21      | bbb       | null       |
| 31      | vcc       | null       |
| 42      | ddd       | null       |
| 18      | eee       | null       |
+---------+-----------+------------+
I followed this post to update my AlbumCover in Table2 using the Serial no. of the first table:
create proc AddCover #Serial int
as
begin
    update Table1 set albumcover = 'something' where Table1.serial = #Serial
end
Can I do it like this using a foreign key constraint?
You'll need to do the update on Table2. To give it a condition based on values from Table1, check this post for examples:
MySQL - UPDATE query based on SELECT Query
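In MySQL that kind of cross-table update is a single UPDATE ... JOIN. A sketch, assuming the goal is to copy Src from Table1 into Table2.AlbumCover for a given Serial (the value 1 here is just an example parameter):
UPDATE Table2 t2
JOIN Table1 t1 ON t1.Albumid = t2.Albumid  -- follow the foreign key
SET t2.AlbumCover = t1.Src
WHERE t1.Serial = 1;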

Populate values from one table

I am trying to populate an empty table (t) from another table (t2) based on a flag field being set. Here is my attempt below, along with the table data.
UPDATE 2014PriceSheetIssues AS t
JOIN TransSalesAvebyType2013Combined AS t2
SET t.`Tran_Type`=t2.`Tran_Type` WHERE t.`rflag`='1';
When I run the script, I get zero (0) records affected. Why?
+-----------+----------------+-------------------+-------+-------+
| Tran_Type | RetailAvePrice | WholesaleAvePrice | Rflag | Wflag |
+-----------+----------------+-------------------+-------+-------+
| 125C      | 992            | 650               | 1     | NULL  |
| 2004R     | 1500           | NULL              | 1     | NULL  |
| 4EAT      | 1480           | 1999              | 1     | 1     |
+-----------+----------------+-------------------+-------+-------+
Since 2014PriceSheetIssues is empty, an UPDATE ... JOIN finds no rows to update, which is why zero records are affected. Insert the rows instead:
INSERT INTO `2014PriceSheetIssues` (`Tran_Type`)
SELECT `Tran_Type`
FROM TransSalesAvebyType2013Combined
WHERE `Rflag` = '1';
The SELECT reads the flagged values and the INSERT puts them into the (empty) other table. Note there is no VALUES (...) wrapper around the SELECT, the Rflag filter belongs on the source table (the target has nothing to filter yet), and a table name that starts with a digit must be quoted in backticks.

SQL algorithm to as near to linear time as possible and tweaking of select statement

I am using MySQL version 5.5 on Ubuntu.
My database tables are setup as follows:
DDLs:
CREATE TABLE `asx` (
  `code` char(3) NOT NULL,
  `high` decimal(9,3),
  `low` decimal(9,3),
  `close` decimal(9,3),
  `histID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`histID`),
  UNIQUE KEY `code` (`code`)
)
CREATE TABLE `asxhist` (
  `date` date NOT NULL,
  `average` decimal(9,3),
  `histID` int(11) NOT NULL,
  PRIMARY KEY (`date`,`histID`),
  KEY `histID` (`histID`),
  CONSTRAINT `asxhist_ibfk_1` FOREIGN KEY (`histID`) REFERENCES `asx` (`histID`)
    ON UPDATE CASCADE
)
t1:
| code | high   | low    | close  | histID (primary key) |
| asx  | 10.000 | 9.500  | 9.800  | 1                    |
| nab  | 42.000 | 41.250 | 41.350 | 2                    |
t2:
| date       | average | histID (foreign key) |
| 2013-01-01 | 10.000  | 1                    |
| 2013-01-01 | 39.000  | 2                    |
| 2013-01-02 | 9.000   | 1                    |
| 2013-01-02 | 38.000  | 2                    |
| 2013-01-03 | 9.500   | 1                    |
| 2013-01-03 | 39.500  | 2                    |
| 2013-01-04 | 11.000  | 1                    |
| 2013-01-04 | 38.500  | 2                    |
I am attempting to complete a select query that produces this as a result:
| code | high   | low    | close  | asxhist.average |
| asx  | 10.000 | 9.500  | 9.800  | 11.000,9.500    |
| nab  | 42.000 | 41.250 | 41.350 | 38.500,39.500   |
where the most recent values from table 2 are returned alongside table 1 in CSV format.
I have managed to get this far:
SELECT code, high, low, close,
(SELECT GROUP_CONCAT(DISTINCT t2.average ORDER BY date DESC SEPARATOR ',') FROM t2
WHERE t2.histID = t1.histID)
FROM t1;
Unfortunately this returns all values associated with the histID. I'm taking a look at xaprb.com's firstleastmax-row-per-group-in-sql solution, but I have been banging my head all day and the slight wooziness seems to be dimming my ability to comprehend how to use it to my benefit. How can I limit the results to the 5 most recent values and, considering the tables will eventually be megabytes in size, try to remain at O(n²) or less? (Or can I?)
A temporary workaround using SUBSTRING_INDEX, though not a feasible solution for huge data. Note the ORDER BY has to sit inside GROUP_CONCAT; an ORDER BY on the subquery itself does not control the concatenation order:
SELECT code, high, low, close,
       (SELECT SUBSTRING_INDEX(GROUP_CONCAT(asxhist.average ORDER BY `date` DESC), ',', 3)
        FROM asxhist
        WHERE asxhist.histID = asx.histID)
FROM asx;
From what I gather, a LIMIT option in GROUP_CONCAT is still a feature request.
See also the GROUP_CONCAT hack discussed on Stack Overflow.
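If an upgrade beyond 5.5 is ever possible, window functions make the "top N per group" part direct. A sketch, assuming MySQL 8.0+ and the asx/asxhist tables above:
SELECT a.code, a.high, a.low, a.close,
       GROUP_CONCAT(r.average ORDER BY r.`date` DESC) AS recent_averages
FROM asx AS a
JOIN (
    -- number each history row per histID, newest first
    SELECT histID, `date`, average,
           ROW_NUMBER() OVER (PARTITION BY histID ORDER BY `date` DESC) AS rn
    FROM asxhist
) AS r ON r.histID = a.histID AND r.rn <= 5   -- keep the 5 most recent
GROUP BY a.histID, a.code, a.high, a.low, a.close;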

MySQL insert new row on value change

For a personal project I'm working on right now I want to make a line graph of game prices on Steam, Impulse, EA Origins, and several other sites over time. At the moment I've modified a script used by SteamCalculator.com to record the current price (sale price if applicable) for every game, in every country code possible, for each of these sites. I also have a column for the date on which the price was stored. My current tables look something like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+-----------+
| steam_id | us   | at   | au   | de   | no   | uk   | date      |
+----------+------+------+------+------+------+------+-----------+
| 112233   | 999  | 899  | 999  | NULL | 899  | 699  | 2011-8-21 |
| 123456   | 1999 | 999  | 1999 | 999  | 999  | 999  | 2011-8-20 |
| ...      | ...  | ...  | ...  | ...  | ...  | ...  | ...       |
+----------+------+------+------+------+------+------+-----------+
At the moment each country is updated separately (there's a for loop going through the countries), although if it would simplify it then this could be modified to temporarily store new prices to an array then update an entire row at a time. I'll likely be doing this eventually, anyway, for performance reasons.
Now my issue is determining how to best update this table if one of the prices changes. For instance, let's suppose that on 8/22/2011 the game 112233 goes on sale in America for $4.99, Austria for 3.99€, and the other prices remain the same. I would need the table to look like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+-----------+
| steam_id | us   | at   | au   | de   | no   | uk   | date      |
+----------+------+------+------+------+------+------+-----------+
| 112233   | 999  | 899  | 999  | NULL | 899  | 699  | 2011-8-21 |
| 123456   | 1999 | 999  | 1999 | 999  | 999  | 999  | 2011-8-20 |
| ...      | ...  | ...  | ...  | ...  | ...  | ...  | ...       |
| 112233   | 499  | 399  | 999  | NULL | 899  | 699  | 2011-8-22 |
+----------+------+------+------+------+------+------+-----------+
I don't want to create a new row EVERY time the price is checked, otherwise I'll end up having millions of rows of repeated prices day after day. I also don't want to create a new row per changed price like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+-----------+
| steam_id | us   | at   | au   | de   | no   | uk   | date      |
+----------+------+------+------+------+------+------+-----------+
| 112233   | 999  | 899  | 999  | NULL | 899  | 699  | 2011-8-21 |
| 123456   | 1999 | 999  | 1999 | 999  | 999  | 999  | 2011-8-20 |
| ...      | ...  | ...  | ...  | ...  | ...  | ...  | ...       |
| 112233   | 499  | 899  | 999  | NULL | 899  | 699  | 2011-8-22 |
| 112233   | 499  | 399  | 999  | NULL | 899  | 699  | 2011-8-22 |
+----------+------+------+------+------+------+------+-----------+
I can prevent the first problem but not the second by making each (steam_id, <country>) a unique index and adding ON DUPLICATE KEY UPDATE to every database query. This only adds a row if the price is different; however, it adds a new row for each country that changes. It also does not allow the same price for a single game on two different days (for instance, suppose game 112233 goes off sale later and returns to $9.99), so this is clearly an awful option.
I can prevent the second problem but not the first by making (steam_id, date) a unique index and adding ON DUPLICATE KEY UPDATE to every query. Every single day the script runs, the date has changed, so it creates a new row. This method ends up with hundreds of rows of the same prices day after day.
How can I tell MySQL to create a new row if (and only if) any of the prices has changed since the latest date?
UPDATE -
At the recommendation of people in this thread I have changed the schema of my database to facilitate adding new country codes in the future and avoid the issue of needing to update entire rows at a time. The new schema looks something like:
+----------+----+-------+-----------+
| steam_id | cc | price | date      |
+----------+----+-------+-----------+
| 112233   | us | 999   | 2011-8-21 |
| 123456   | uk | 699   | 2011-8-20 |
| ...      | .. | ...   | ...       |
+----------+----+-------+-----------+
On top of this new schema I have discovered that I can use the following SQL query to grab the price from the most recent update:
SELECT `price` FROM `steam_prices` WHERE `steam_id` = 112233 AND `cc` = 'us' ORDER BY `date` DESC LIMIT 1
At this point my question boils down to this:
Is it possible to (using only SQL rather than application logic) insert a row only if a condition is true? For instance:
INSERT INTO `steam_prices` (...) VALUES (...) IF price <> (SELECT `price` FROM `steam_prices` WHERE `steam_id` = 112233 AND `cc` = 'us' ORDER BY `date` DESC LIMIT 1)
From the MySQL manual I can not find any way to do this. I have only found that you can ignore or update if a unique index is the same. However if I made the price a unique index (allowing me to update the date if it was the same) then I would not be able to recognize when a game went on sale and then returned to its original price. For instance:
+----------+----+-------+-----------+
| steam_id | cc | price | date      |
+----------+----+-------+-----------+
| 112233   | us | 999   | 2011-8-20 |
| 112233   | us | 499   | 2011-8-21 |
| 112233   | us | 999   | 2011-8-22 |
| ...      | .. | ...   | ...       |
+----------+----+-------+-----------+
Also, after just finding and reading MySQL Conditional INSERT, I created and tried the following query:
INSERT INTO `steam_prices` (
    `steam_id`,
    `cc`,
    `update`,
    `price`
)
SELECT '7870', 'us', NOW(), 999
FROM `steam_prices`
WHERE `price` <> 999
  AND `update` IN (
      SELECT `update`
      FROM `steam_prices`
      ORDER BY `update` DESC
      LIMIT 1
  )
The idea was to insert the row '7870', 'us', NOW(), 999 if (and only if) the price of the most recent update wasn't 999. When I ran this I got the following error:
1235 - This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any ideas?
You will probably find this easier if you simply change your schema to something like:
steam_id integer
country varchar(2)
date date
price float
primary key (steam_id,country,date)
(with other appropriate indexes) and then only worrying about each country in turn.
In other words, your for loop has a unique ID/country combo so it can simply query the latest-date record for that combo and add a new row if it's different.
That will make your selections a little more complicated but I believe it's a better solution, especially if there's any chance at all that more countries may be added in future (it won't break the schema in that case).
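A sketch of what that per-combo check can look like in a single statement, using the suggested schema and hypothetical values (game 112233, country 'us', new price 4.99):
INSERT INTO steam_prices (steam_id, country, `date`, price)
SELECT 112233, 'us', CURDATE(), 4.99
FROM DUAL
WHERE 4.99 <> IFNULL(
          (SELECT price
           FROM steam_prices
           WHERE steam_id = 112233 AND country = 'us'
           ORDER BY `date` DESC
           LIMIT 1),
          -1);  -- -1 never matches a real price, so the first observation is always inserted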
First, I suggest you store your data in a form that is less hard-coded per country:
+----------+--------------+------------+-------+
| steam_id | country_code | date       | price |
+----------+--------------+------------+-------+
| 112233   | us           | 2011-08-20 | 12.45 |
| 112233   | uk           | 2011-08-20 | 12.46 |
| 112233   | de           | 2011-08-20 | 12.47 |
| 112233   | at           | 2011-08-20 | 12.48 |
| 112233   | us           | 2011-08-21 | 12.49 |
| ......   | ..           | .......... | ..... |
+----------+--------------+------------+-------+
From here, you place a primary key on the first three columns...
Now for your question about not creating extra rows... That is what a simple transaction + application logic is great at.
Start a transaction
Run a select to see if the record in question is there
If not, insert one
Was there a problem with that approach?
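A minimal sketch of that flow, assuming the schema above and hypothetical values (the price comparison itself happens in application code between the two statements):
START TRANSACTION;
-- lock and read the latest observation for this game/country combo
SELECT price
FROM steam_prices
WHERE steam_id = 112233 AND country_code = 'us'
ORDER BY `date` DESC
LIMIT 1
FOR UPDATE;
-- only if no row came back, or the price differs from the newly scraped one:
INSERT INTO steam_prices (steam_id, country_code, `date`, price)
VALUES (112233, 'us', CURDATE(), 12.49);
COMMIT;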
Hope this helps.
After experimentation, and with some help from MySQL Conditional INSERT and http://www.artfulsoftware.com/infotree/queries.php#101, I found a query that worked:
INSERT INTO `steam_prices` (
    `steam_id`,
    `cc`,
    `price`,
    `update`
)
SELECT 7870, 'us', 999, NOW()
FROM `steam_prices` AS p1
LEFT JOIN `steam_prices` AS p2
       ON p1.`steam_id` = p2.`steam_id`
      AND p1.`update` < p2.`update`
WHERE p2.`steam_id` IS NULL
  AND p1.`steam_id` = 7870
  AND p1.`cc` = 'us'
  AND p1.`price` <> 999
The answer is to first return all rows for which no later timestamp exists; this is a within-group aggregate. You join the table with itself, matching each row to every row with a later timestamp. If a row fails to join (no later timestamp was found), you know it contains the latest timestamp. Those rows show a NULL id from the joined table (failed to join).
After you have selected all rows with the latest timestamp, grab only those rows where the steam_id is the steam_id you're looking for and where the price is different from the new price that you're entering. If there are no rows with a different price for that game at this point then the price has not changed since the last update, so an empty set is returned. When an empty set is returned the SELECT statement fails and nothing is inserted. If the SELECT statement succeeds (a different price was found) then it returns the row 7870, 'us', 999, NOW() which is inserted into our table.
EDIT - I actually found a mistake in the above query a little while later and have since revised it. The query above will insert a new row if the price has changed since the last update, but it will not insert a row if there are currently no prices in the database for that item.
To resolve this I had to take advantage of the DUAL table (which always contains exactly one row) and use an OR in the WHERE clause to test for either a different price or a missing row:
INSERT INTO `steam_prices` (
    `steam_id`,
    `cc`,
    `price`,
    `update`
)
SELECT 12345, 'us', 999, NOW()
FROM DUAL
WHERE
    NOT EXISTS (
        SELECT `steam_id`
        FROM `steam_prices`
        WHERE `steam_id` = 12345
    )
    OR EXISTS (
        SELECT p1.`steam_id`
        FROM `steam_prices` AS p1
        LEFT JOIN `steam_prices` AS p2
               ON p1.`steam_id` = p2.`steam_id`
              AND p1.`update` < p2.`update`
        WHERE
            p2.`steam_id` IS NULL
            AND p1.`steam_id` = 12345
            AND p1.`cc` = 'us'
            AND p1.`price` <> 999
    )
It's very long, it's very ugly, and it's very complicated. But it works exactly as advertised. If there is no price in the database for a certain steam_id then it inserts a new row. If there is already a price then it checks the price with the most recent update and, if different, inserts a new row.