Clean duplicate rows on custom conditions - mysql

I have this problem managing notes. I started with the strategy to always INSERT new notes and SELECT the last one. Please don't laugh, I must have thought it was a good idea, but right now, the system is not even in all-out production and there's been 300k rows inserted in about a month. In two years, my system will fail. I need to merge duplicate lines. Here is the structure of my notes table:
CREATE TABLE IF NOT EXISTS `ps_notes` (
`CodeNTE` int(11) NOT NULL AUTO_INCREMENT,
`CodePRS` int(11) NOT NULL,
`CodeXYZ` int(11) NOT NULL,
`Type` char(3) NOT NULL,
`Focus` char(3) NOT NULL,
`Texte` tinytext NOT NULL,
`Date` datetime NOT NULL,
PRIMARY KEY (`CodeNTE`),
KEY `CodeXYZ` (`CodeXYZ`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=335068 ;
Notes can be related to a person (CodePRS) and are necessarily related to a Type, a Focus and a CodeXYZ. They have a Texte entry, and sometimes I want to know the Date.
CodeXYZ is the identifier of the entity to which the note is attached. This identifier can come from any table and is therefore not globally unique, hence the Type field, which specifies the table the parent row comes from. The Focus field distinguishes notes that refer to the same CodeXYZ and Type.
Here are some sample lines:
+---------+------+-------+-------------+------------+
| CodeXYZ | Type | Focus | Texte | Date |
+---------+------+-------+-------------+------------+
| 30008 | ctr | adm | Whatever | 2013-05-09 |
| 30008 | ctr | adm | Whatever | 2013-06-10 |
| 30008 | ctr | adm | Lorem ipsum | 2013-06-11 |
| 30008 | ctr | clt | He's cool | 0000-00-00 |
| 2546 | ctr | sup | Another | 2013-02-11 |
| 2546 | ctr | sup | Another | 2013-02-11 |
| 2546 | ctr | sup | Another | 2013-02-19 |
+---------+------+-------+-------------+------------+
This is the output I'd like to have:
+---------+------+-------+-------------+-----------------------------------------+
| CodeXYZ | Type | Focus | Texte | Date |
+---------+------+-------+-------------+-----------------------------------------+
| 30008 | ctr | adm | Lorem ipsum | 2013-06-11 (I want the most recent one) |
| 30008 | ctr | clt | He's cool | 0000-00-00 |
| 2546 | ctr | sup | Another | 2013-02-11 |
| 2546 | ctr | sup | Another | 2013-02-19 |
+---------+------+-------+-------------+-----------------------------------------+
Conditions for merging
I want to merge lines that have the same CodeXYZ, Type and Focus when Focus is not 'sup'.
When Focus is 'sup', I want to merge the lines that have the same CodeXYZ, Type, Focus and Date.
In every case I want to keep the most recent one.
So I ran this query to merge rows into a temporary table:
INSERT INTO notes_tmp (CodePRS,CodeXYZ,Type,Focus,Texte,Date)
SELECT CodePRS,CodeXYZ,Type,Focus,Texte,Date
FROM notes
GROUP BY CodeXYZ,Type,Focus
But that way, all lines will be merged even the last ones.
So I thought of this:
INSERT INTO notes_tmp (CodePRS,CodeXYZ,Type,Focus,Texte,Date)
SELECT CodePRS,CodeXYZ,Type,Focus,Texte,Date
FROM notes
WHERE Focus<>'sup'
GROUP BY CodeXYZ,Type,Focus
ORDER BY Date DESC
UNION
SELECT CodePRS,CodeXYZ,Type,Focus,Texte,Date
FROM notes
WHERE Focus='sup'
GROUP BY CodeXYZ,Type,Focus,Date
ORDER BY Date DESC
But the UNION is not in the right place, and I don't think I can use it in the INSERT INTO ... SELECT syntax.
Is there a way to copy those lines over in a single MySQL call, with multiple subqueries all ending up in the same table according to separate conditions?

You can use GROUP_CONCAT to merge the text field and make the other columns unique with GROUP BY. Try this:
INSERT INTO notes_temp
SELECT CodeXYZ,Type, Focus,GROUP_CONCAT(Texte),Date
FROM notes WHERE Focus = 'sup'
GROUP BY CodeXYZ,Type, Focus,Date;
INSERT INTO notes_temp
SELECT CodeXYZ,Type, Focus,GROUP_CONCAT(Texte),MAX(Date)
FROM notes WHERE Focus <> 'sup'
GROUP BY CodeXYZ,Type, Focus;
check sqlfiddle

So, using part of @Volkan's answer, I came up with this somewhat strange but working SQL to get the correct note out of my GROUP_CONCAT().
The CASE extracts the last entry of the GROUP_CONCAT. I used another separator (,,,) because commas occur often in the text, while three in a row are much less likely.
INSERT INTO notes_temp
SELECT CodeXYZ,Type, Focus,Texte,Date
FROM notes WHERE Focus = 'sup'
GROUP BY CodeXYZ,Type, Focus,Date;
INSERT INTO notes_temp
SELECT
CodeXYZ,
Type,
Focus,
CASE
WHEN COUNT(Texte) > 1
THEN SUBSTR(GROUP_CONCAT(Texte SEPARATOR ",,,"),((LENGTH(GROUP_CONCAT(Texte SEPARATOR ",,,"))+2) - INSTR(REVERSE(GROUP_CONCAT(Texte SEPARATOR ",,,")),",,,")))
ELSE
Texte
END
AS Texte,
MAX(Date)
FROM notes WHERE Focus <> 'sup'
GROUP BY CodeXYZ,Type, Focus;
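The merge rules from the question can also be expressed without GROUP_CONCAT by keeping only rows for which no newer row exists in the same merge group. The following is a minimal sketch, not the posters' method: it uses Python's built-in sqlite3 module as a stand-in for MySQL, loads the sample lines from the question, and assumes that CodeNTE can serve as a tie-breaker when two rows share the same Date. The names SAMPLE_NOTES, KEEP_LATEST_SQL and merged_notes are made up for illustration.

```python
import sqlite3

# Sample lines from the question (CodePRS omitted for brevity).
SAMPLE_NOTES = [
    (30008, "ctr", "adm", "Whatever", "2013-05-09"),
    (30008, "ctr", "adm", "Whatever", "2013-06-10"),
    (30008, "ctr", "adm", "Lorem ipsum", "2013-06-11"),
    (30008, "ctr", "clt", "He's cool", "0000-00-00"),
    (2546, "ctr", "sup", "Another", "2013-02-11"),
    (2546, "ctr", "sup", "Another", "2013-02-11"),
    (2546, "ctr", "sup", "Another", "2013-02-19"),
]

# Keep a row only if no "better" row exists in its merge group.
# Group = (CodeXYZ, Type, Focus), plus Date when Focus = 'sup'.
# "Better" = later Date, or same Date with a higher CodeNTE
# (the CodeNTE tie-break is an assumption, not stated in the question).
KEEP_LATEST_SQL = """
SELECT n.CodeXYZ, n.Type, n.Focus, n.Texte, n.Date
FROM notes AS n
WHERE NOT EXISTS (
    SELECT 1
    FROM notes AS newer
    WHERE newer.CodeXYZ = n.CodeXYZ
      AND newer.Type = n.Type
      AND newer.Focus = n.Focus
      AND (n.Focus <> 'sup' OR newer.Date = n.Date)
      AND (newer.Date > n.Date
           OR (newer.Date = n.Date AND newer.CodeNTE > n.CodeNTE))
)
ORDER BY n.CodeNTE
"""


def merged_notes(rows):
    """Load rows into an in-memory table and return the merged result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notes ("
        " CodeNTE INTEGER PRIMARY KEY AUTOINCREMENT,"
        " CodeXYZ INTEGER NOT NULL, Type TEXT NOT NULL, Focus TEXT NOT NULL,"
        " Texte TEXT NOT NULL, Date TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO notes (CodeXYZ, Type, Focus, Texte, Date)"
        " VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    result = conn.execute(KEEP_LATEST_SQL).fetchall()
    conn.close()
    return result


for row in merged_notes(SAMPLE_NOTES):
    print(row)
```

The same SELECT could feed an INSERT INTO notes_temp ... SELECT in a single statement; on a large table it would probably need an index covering (CodeXYZ, Type, Focus, Date) to stay fast.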

Related

MySQL - Select everything from one table, but only first matching value in second table

I'm feeling a little rusty with creating queries in MySQL. I thought I could solve this, but I'm having no luck and searching around doesn't result in anything similar...
Basically, I have two tables. I want to select everything from one table and the matching row from the second table. However, I only want to have the first result from the second table. I hope that makes sense.
The rows in the daily_entries table are unique. There will be one row for each day, but maybe not every day. The second table, notes, contains many rows, each of which is associated with ONE row from daily_entries.
Below are examples of my tables;
Table One
mysql> desc daily_entries;
+----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+----------------+
| eid | int(11) | NO | PRI | NULL | auto_increment |
| date | date | NO | | NULL | |
| location | varchar(100) | NO | | NULL | |
+----------+--------------+------+-----+---------+----------------+
Table Two
mysql> desc notes;
+---------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+----------------+
| task_id | int(11) | NO | PRI | NULL | auto_increment |
| eid | int(11) | NO | MUL | NULL | |
| notes | text | YES | | NULL | |
+---------+---------+------+-----+---------+----------------+
What I need to do, is select all entries from notes, with only one result from daily_entries.
Below is an example of how I want it to look:
+----------------------------------------------+---------+------------+----------+-----+
| notes | task_id | date | location | eid |
+----------------------------------------------+---------+------------+----------+-----+
| Another note | 3 | 2014-01-02 | Home | 2 |
| Enter a note. | 1 | 2014-01-01 | Away | 1 |
| This is a test note. To see what happens. | 2 | | Away | 1 |
| Testing another note | 4 | | Away | 1 |
+----------------------------------------------+---------+------------+----------+-----+
4 rows in set (0.00 sec)
Below is the query that I currently have:
SELECT notes.notes, notes.task_id, daily_entries.date, daily_entries.location, daily_entries.eid
FROM daily_entries
LEFT JOIN notes ON daily_entries.eid=notes.eid
ORDER BY daily_entries.date DESC
Below is an example of how it looks with my query:
+----------------------------------------------+---------+------------+----------+-----+
| notes | task_id | date | location | eid |
+----------------------------------------------+---------+------------+----------+-----+
| Another note | 3 | 2014-01-02 | Home | 2 |
| Enter a note. | 1 | 2014-01-01 | Away | 1 |
| This is a test note. To see what happens. | 2 | 2014-01-01 | Away | 1 |
| Testing another note | 4 | 2014-01-01 | Away | 1 |
+----------------------------------------------+---------+------------+----------+-----+
4 rows in set (0.00 sec)
At first I thought I could simply GROUP BY daily_entries.date, however that returned only the first row of each matching set. Can this even be done? I would greatly appreciate any help someone can offer. Using Limit at the end of my query obviously limited it to the value that I specified, but applied it to everything which was to be expected.
Basically, there's nothing wrong with your query. I believe it is exactly what you need, because it returns the data you want. Don't look at it as duplicating your daily_entries rows; look at it as returning every note together with its associated daily_entry.
Of course, you can achieve what you described in your question (there's already an answer that solves this), but think twice before you do it, because such nested queries add noticeable performance overhead to your database server.
I'd recommend keeping your query as simple as possible, with a single LEFT JOIN (which is all you need), and letting consuming applications manipulate the data and present it the way they need to.
Use mysql's non-standard group by functionality:
SELECT n.notes, n.task_id, de.date, de.location, de.eid
FROM notes n
LEFT JOIN (select * from
(select * from daily_entries ORDER BY date DESC) x
group by eid) de ON de.eid = n.eid
You need to do these queries with explicit filtering for the last row. This example uses a join to do this:
SELECT n.notes, n.task_id, de.date, de.location, de.eid
FROM daily_entries de LEFT JOIN
notes n
ON de.eid = n.eid LEFT JOIN
(select n.eid, min(task_id) as min_task_id
from notes n
group by n.eid
) nmin
on n.task_id = nmin.min_task_id
ORDER BY de.date DESC;
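To produce the layout asked for in the question (the date shown only on the first note of each entry), the join against the per-entry minimum task_id can drive a CASE expression. This is a hedged sketch using Python's sqlite3 module in place of MySQL; the sample rows are reconstructed from the question's output, and the names FIRST_NOTE_DATE_SQL, notes_with_single_date and shown_date are invented for the example.

```python
import sqlite3

# Show de.date only on the note with the lowest task_id per entry;
# entries without any note keep their date.
FIRST_NOTE_DATE_SQL = """
SELECT n.notes,
       n.task_id,
       CASE WHEN n.task_id IS NULL OR n.task_id = nmin.min_task_id
            THEN de.date ELSE '' END AS shown_date,
       de.location,
       de.eid
FROM daily_entries AS de
LEFT JOIN notes AS n ON n.eid = de.eid
LEFT JOIN (SELECT eid, MIN(task_id) AS min_task_id
           FROM notes
           GROUP BY eid) AS nmin ON nmin.eid = de.eid
ORDER BY de.date DESC, n.task_id
"""


def notes_with_single_date():
    """Build the two sample tables in memory and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_entries ("
        " eid INTEGER PRIMARY KEY, date TEXT NOT NULL, location TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE notes ("
        " task_id INTEGER PRIMARY KEY, eid INTEGER NOT NULL, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO daily_entries (eid, date, location) VALUES (?, ?, ?)",
        [(1, "2014-01-01", "Away"), (2, "2014-01-02", "Home")],
    )
    conn.executemany(
        "INSERT INTO notes (task_id, eid, notes) VALUES (?, ?, ?)",
        [
            (1, 1, "Enter a note."),
            (2, 1, "This is a test note. To see what happens."),
            (3, 2, "Another note"),
            (4, 1, "Testing another note"),
        ],
    )
    rows = conn.execute(FIRST_NOTE_DATE_SQL).fetchall()
    conn.close()
    return rows


for row in notes_with_single_date():
    print(row)
```

Blanking repeated values is arguably a presentation concern, so doing it in the consuming application instead of SQL remains a reasonable alternative.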

SQL algorithm to as near to linear time as possible and tweaking of select statement

I am using MySQL version 5.5 on Ubuntu.
My database tables are setup as follows:
DDLs:
CREATE TABLE `asx` (
`code` char(3) NOT NULL,
`high` decimal(9,3),
`low` decimal(9,3),
`close` decimal(9,3),
`histID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`histID`),
UNIQUE KEY `code` (`code`)
);
CREATE TABLE `asxhist` (
`date` date NOT NULL,
`average` decimal(9,3),
`histID` int(11) NOT NULL,
PRIMARY KEY (`date`,`histID`),
KEY `histID` (`histID`),
CONSTRAINT `asxhist_ibfk_1` FOREIGN KEY (`histID`) REFERENCES `asx` (`histID`)
ON UPDATE CASCADE
);
t1:
| code | high | low | close | histID (primary key) |
| asx | 10.000 | 9.500 | 9.800 | 1 |
| nab | 42.000 | 41.250 | 41.350 | 2 |
t2:
| date | average | histID (foreign key) |
| 2013-01-01| 10.000 | 1 |
| 2013-01-01| 39.000 | 2 |
| 2013-01-02| 9.000 | 1 |
| 2013-01-02| 38.000 | 2 |
| 2013-01-03| 9.500 | 1 |
| 2013-01-03| 39.500 | 2 |
| 2013-01-04| 11.000 | 1 |
| 2013-01-04| 38.500 | 2 |
I am attempting to complete a select query that produces this as a result:
| code | high | low | close | asxhist.average |
| asx | 10.000 | 9.500 | 9.800 | 11.000,9.500 |
| nab | 42.000 | 41.250 | 41.350 | 38.500,39.500 |
That is, the most recent values from table 2 are returned alongside table 1 as a comma-separated list.
I have managed to get this far:
SELECT code, high, low, close,
(SELECT GROUP_CONCAT(DISTINCT t2.average ORDER BY date DESC SEPARATOR ',') FROM t2
WHERE t2.histID = t1.histID)
FROM t1;
Unfortunately this returns all values associated with histID. I'm taking a look at xaprb.com's firstleastmax-row-per-group-in-sql solution, but I have been banging my head all day and the slight wooziness seems to be dimming my ability to comprehend how I should use it to my benefit. How can I limit the results to the 5 most recent values and, considering the tables will eventually be megabytes in size, try to remain in O(n^2) or less? (Or can I?)
Temporary workaround using SUBSTRING_INDEX (not a feasible solution for huge data). Note that the ORDER BY has to go inside GROUP_CONCAT to control the order of the concatenated values:
SELECT code, high, low, close,
(SELECT SUBSTRING_INDEX(GROUP_CONCAT(asxhist.average ORDER BY asxhist.date DESC), ',', 3)
FROM asxhist
WHERE asxhist.histID = asx.histID)
FROM asx;
From what I gather, a LIMIT option in GROUP_CONCAT is still only a feature request.
See also on Stack Overflow: hack MySQL GROUP_CONCAT
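One way to limit the list to the N most recent values without a LIMIT inside GROUP_CONCAT is to filter rows with a correlated count ("fewer than N later rows exist for this histID") and build the comma-separated list afterwards. The sketch below is illustrative only: it uses Python's sqlite3 module instead of MySQL 5.5, does the concatenation in Python, and the names TOP_N_SQL, build_db and recent_averages are assumptions of this example. With the primary key on (date, histID) plus the histID key, the inner count should be an index lookup, but that has not been measured here.

```python
import sqlite3
from itertools import groupby

# A history row qualifies when fewer than N rows with the same histID
# have a later date, i.e. it is among the N most recent for that code.
TOP_N_SQL = """
SELECT a.code, h.average
FROM asx AS a
JOIN asxhist AS h ON h.histID = a.histID
WHERE (SELECT COUNT(*)
       FROM asxhist AS later
       WHERE later.histID = h.histID
         AND later.date > h.date) < ?
ORDER BY a.code, h.date DESC
"""


def build_db():
    """Create the two tables with the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asx (code TEXT NOT NULL UNIQUE, high REAL, low REAL,"
        " close REAL, histID INTEGER PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE asxhist (date TEXT NOT NULL, average REAL,"
        " histID INTEGER NOT NULL REFERENCES asx (histID),"
        " PRIMARY KEY (date, histID))"
    )
    conn.executemany(
        "INSERT INTO asx (code, high, low, close, histID) VALUES (?, ?, ?, ?, ?)",
        [("asx", 10.0, 9.5, 9.8, 1), ("nab", 42.0, 41.25, 41.35, 2)],
    )
    conn.executemany(
        "INSERT INTO asxhist (date, average, histID) VALUES (?, ?, ?)",
        [
            ("2013-01-01", 10.0, 1), ("2013-01-01", 39.0, 2),
            ("2013-01-02", 9.0, 1), ("2013-01-02", 38.0, 2),
            ("2013-01-03", 9.5, 1), ("2013-01-03", 39.5, 2),
            ("2013-01-04", 11.0, 1), ("2013-01-04", 38.5, 2),
        ],
    )
    return conn


def recent_averages(conn, n):
    """Return {code: 'newest,second newest,...'} with at most n values."""
    rows = conn.execute(TOP_N_SQL, (n,)).fetchall()
    # Rows arrive sorted by code, so groupby sees each code exactly once.
    return {
        code: ",".join(f"{avg:.3f}" for _, avg in group)
        for code, group in groupby(rows, key=lambda r: r[0])
    }


print(recent_averages(build_db(), 2))
```

Codes with no history rows are simply absent from the result because of the inner join; a LEFT JOIN would be needed to list them with an empty value.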

Three Queries Faster than One -- What's Wrong with my Joins?

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join is performing sorely suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial (prefix) index:
ALTER TABLE Ticket
ADD INDEX status_idx
(Status(16)) ;
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
Try giving the first SUBSTRING expression an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) AS blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
You can also avoid the join altogether, since you don't need it:
SELECT DISTINCT SUBSTRING(i.TStampString, 1, 16) FROM Inventory i WHERE i.ID IN
( SELECT j.InventoryID FROM JoinTable j WHERE j.TicketID IN
(SELECT t.ID FROM Ticket t WHERE t.Status = 'Regroup'));
Would that work?
By the way, do you have an index on the Status field?

Can I SELECT this in a single statement?

I am a total SQL noob; sorry.
I have a table
mysql> describe activity;
+--------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+---------+------+-----+---------+-------+
| user | text | NO | | NULL | |
| time_stamp | int(11) | NO | | NULL | |
| activity | text | NO | | NULL | |
| item | int | NO | | NULL | |
+--------------+---------+------+-----+---------+-------+
Normally activity is a two-step process: 1) "check out" and 2) "use".
An item cannot be checked out a second time unless it has been used.
Now I want to find any cases where an item was checked out but not used.
Being dumb, I would use two SELECTs, one for check out and one for use, on the same item, then compare the timestamps.
Is there a SELECT statement that will help me select the items which were checked out but not used?
It is tricky because of the possibility of multiple checkouts. Or should I just code:
loop over activity, from oldest to newest
if I find a checkout and there is no newer use time, then I have a hit
You could get the last timestamp of each check out and each use, and then compare them per item:
SELECT item,
MAX(IF(activity='check out', time_stamp, NULL)) AS last_co,
MAX(IF(activity='use', time_stamp, NULL)) AS last_use
FROM activity
GROUP BY item
HAVING last_co IS NOT NULL AND (last_use IS NULL OR last_use < last_co);
The explicit IS NULL test is there because of how NULL comparisons behave: last_use < last_co evaluates to NULL, not true, when last_use is NULL, so an item that was never used would be dropped. Writing the condition as NOT(last_use >= last_co) does not help, because NOT(NULL) is still NULL.
Without proper indexing, this query will not perform very well though. Plus you might want to bound the query using a WHERE condition.
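The NULL behaviour is easy to check on a small data set. The following sketch uses Python's sqlite3 module as a stand-in for MySQL (so CASE replaces IF), with made-up activity rows; OPEN_CHECKOUTS_SQL, NAIVE_SQL and run are names invented for this example. It compares an explicit IS NULL test against the NOT(last_use >= last_co) form.

```python
import sqlite3

# Hypothetical activity log: item 1 was checked out and used,
# item 2 was checked out and never used,
# item 3 was used and then checked out again without a later use.
ACTIVITY = [
    ("ann", 100, "check out", 1),
    ("ann", 110, "use", 1),
    ("bob", 200, "check out", 2),
    ("cat", 300, "check out", 3),
    ("cat", 310, "use", 3),
    ("dan", 320, "check out", 3),
]

LAST_EVENTS = """
    SELECT item,
           MAX(CASE WHEN activity = 'check out' THEN time_stamp END) AS last_co,
           MAX(CASE WHEN activity = 'use' THEN time_stamp END) AS last_use
    FROM activity
    GROUP BY item
"""

# Explicit NULL handling: a never-used item has last_use IS NULL.
OPEN_CHECKOUTS_SQL = f"""
SELECT item, last_co, last_use
FROM ({LAST_EVENTS}) AS t
WHERE last_co IS NOT NULL AND (last_use IS NULL OR last_use < last_co)
ORDER BY item
"""

# NOT(NULL >= x) is NULL, so never-used items are filtered out here.
NAIVE_SQL = f"""
SELECT item, last_co, last_use
FROM ({LAST_EVENTS}) AS t
WHERE NOT (last_use >= last_co)
ORDER BY item
"""


def run(sql, rows=ACTIVITY):
    """Load the rows into an in-memory activity table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity (user TEXT NOT NULL, time_stamp INTEGER NOT NULL,"
        " activity TEXT NOT NULL, item INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO activity VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


print("explicit NULL test:", run(OPEN_CHECKOUTS_SQL))
print("NOT(>=) form:      ", run(NAIVE_SQL))
```

On this data the explicit version reports items 2 and 3, while the NOT(...) form reports only item 3.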

MySQL insert new row on value change

For a personal project I'm working on right now I want to make a line graph of game prices on Steam, Impulse, EA Origins, and several other sites over time. At the moment I've modified a script used by SteamCalculator.com to record the current price (sale price if applicable) for every game in every country code possible for each of these sites. I also have a column for the date on which the price was stored. My current tables look something like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+------------+
| steam_id | us | at | au | de | no | uk | date |
+----------+------+------+------+------+------+------+------------+
| 112233 | 999 | 899 | 999 | NULL | 899 | 699 | 2011-8-21 |
| 123456 | 1999 | 999 | 1999 | 999 | 999 | 999 | 2011-8-20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
+----------+------+------+------+------+------+------+------------+
At the moment each country is updated separately (there's a for loop going through the countries), although if it would simplify it then this could be modified to temporarily store new prices to an array then update an entire row at a time. I'll likely be doing this eventually, anyway, for performance reasons.
Now my issue is determining how to best update this table if one of the prices changes. For instance, let's suppose that on 8/22/2011 the game 112233 goes on sale in America for $4.99, Austria for 3.99€, and the other prices remain the same. I would need the table to look like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+------------+
| steam_id | us | at | au | de | no | uk | date |
+----------+------+------+------+------+------+------+------------+
| 112233 | 999 | 899 | 999 | NULL | 899 | 699 | 2011-8-21 |
| 123456 | 1999 | 999 | 1999 | 999 | 999 | 999 | 2011-8-20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 112233 | 499 | 399 | 999 | NULL | 899 | 699 | 2011-8-22 |
+----------+------+------+------+------+------+------+------------+
I don't want to create a new row EVERY time the price is checked, otherwise I'll end up having millions of rows of repeated prices day after day. I also don't want to create a new row per changed price like so:
THIS STRUCTURE IS NO LONGER VALID. SEE BELOW
+----------+------+------+------+------+------+------+------------+
| steam_id | us | at | au | de | no | uk | date |
+----------+------+------+------+------+------+------+------------+
| 112233 | 999 | 899 | 999 | NULL | 899 | 699 | 2011-8-21 |
| 123456 | 1999 | 999 | 1999 | 999 | 999 | 999 | 2011-8-20 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 112233 | 499 | 899 | 999 | NULL | 899 | 699 | 2011-8-22 |
| 112233 | 499 | 399 | 999 | NULL | 899 | 699 | 2011-8-22 |
+----------+------+------+------+------+------+------+------------+
I can prevent the first problem but not the second by making each (steam_id, <country>) a unique index then adding ON DUPLICATE KEY UPDATE to every database query. This will only add a row if the price is different, however it will add a new row for each country which changes. It also does not allow the same price for a single game for two different days (for instance, suppose game 112233 goes off sale later and returns to $9.99) so this is clearly an awful option.
I can prevent the second problem but not the first by making (steam_id, date) a unique index then adding ON DUPLICATE KEY UPDATE to every query. Every single day when the script is run the date has changed, so it will create a new row. This method ends up with hundreds of lines of the same prices from day to day.
How can I tell MySQL to create a new row if (and only if) any of the prices has changed since the latest date?
UPDATE -
At the recommendation of people in this thread I have changed the schema of my database to facilitate adding new country codes in the future and avoid the issue of needing to update entire rows at a time. The new schema looks something like:
+----------+------+---------+------------+
| steam_id | cc | price | date |
+----------+------+---------+------------+
| 112233 | us | 999 | 2011-8-21 |
| 123456 | uk | 699 | 2011-8-20 |
| ... | ... | ... | ... |
+----------+------+---------+------------+
On top of this new schema I have discovered that I can use the following SQL query to grab the price from the most recent update:
SELECT `price` FROM `steam_prices` WHERE `steam_id` = 112233 AND `cc`='us' ORDER BY `date` DESC LIMIT 1
At this point my question boils down to this:
Is it possible to (using only SQL rather than application logic) insert a row only if a condition is true? For instance:
INSERT INTO `steam_prices` (...) VALUES (...) IF price<>(SELECT `price` FROM `steam_prices` WHERE `steam_id` = 112233 AND `cc`='us' ORDER BY `date` DESC LIMIT 1)
From the MySQL manual I can not find any way to do this. I have only found that you can ignore or update if a unique index is the same. However if I made the price a unique index (allowing me to update the date if it was the same) then I would not be able to recognize when a game went on sale and then returned to its original price. For instance:
+----------+------+---------+------------+
| steam_id | cc | price | date |
+----------+------+---------+------------+
| 112233 | us | 999 | 2011-8-20 |
| 112233 | us | 499 | 2011-8-21 |
| 112233 | us | 999 | 2011-8-22 |
| ... | ... | ... | ... |
+----------+------+---------+------------+
Also, after just finding and reading MySQL Conditional INSERT, I created and tried the following query:
INSERT INTO `steam_prices`(
`steam_id`,
`cc`,
`update`,
`price`
)
SELECT '7870', 'us', NOW(), 999
FROM `steam_prices`
WHERE
`price`<>999
AND `update` IN (
SELECT `update`
FROM `steam_prices`
ORDER BY `update`
ASC LIMIT 1
)
The idea was to insert the row '7870', 'us', NOW(), 999 if (and only if) the price of the most recent update wasn't 999. When I ran this I got the following error:
1235 - This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'
Any ideas?
You will probably find this easier if you simply change your schema to something like:
steam_id integer
country varchar(2)
date date
price float
primary key (steam_id,country,date)
(with other appropriate indexes) and then only worrying about each country in turn.
In other words, your for loop has a unique ID/country combo so it can simply query the latest-date record for that combo and add a new row if it's different.
That will make your selections a little more complicated but I believe it's a better solution, especially if there's any chance at all that more countries may be added in future (it won't break the schema in that case).
First, I suggest you store your data in a form that is less hard-coded per country:
+----------+--------------+------------+-------+
| steam_id | country_code | date | price |
+----------+--------------+------------+-------+
| 112233 | us | 2011-08-20 | 12.45 |
| 112233 | uk | 2011-08-20 | 12.46 |
| 112233 | de | 2011-08-20 | 12.47 |
| 112233 | at | 2011-08-20 | 12.48 |
| 112233 | us | 2011-08-21 | 12.49 |
| ...... | .. | .......... | ..... |
+----------+--------------+------------+-------+
From here, you place a primary key on the first three columns...
Now for your question about not creating extra rows... That is what a simple transaction + application logic is great at.
Start a transaction
Run a select to see if the record in question is there
If not, insert one
Was there a problem with that approach?
Hope this helps.
After experimentation, and with some help from MySQL Conditional INSERT and http://www.artfulsoftware.com/infotree/queries.php#101, I found a query that worked:
INSERT INTO `steam_prices`(
`steam_id`,
`cc`,
`price`,
`update`
)
SELECT 7870, 'us', 999, NOW()
FROM `steam_prices` AS p1
LEFT JOIN `steam_prices` AS p2 ON p1.`steam_id`=p2.`steam_id` AND p1.`cc`=p2.`cc` AND p1.`update` < p2.`update`
WHERE
p2.`steam_id` IS NULL
AND p1.`steam_id`=7870
AND p1.`cc`='us'
AND (
p1.`price`<>999
)
The answer is to first return all rows for which there is no later timestamp, which is the usual self-join approach to a within-group maximum. You join the table with itself, matching each row only to rows with a later timestamp. If a row fails to join (no later timestamp exists), then you know that row contains the latest timestamp. These rows will have a NULL steam_id in the joined table (failed to join).
After you have selected all rows with the latest timestamp, grab only those rows where the steam_id is the steam_id you're looking for and where the price is different from the new price that you're entering. If there are no rows with a different price for that game at this point, then the price has not changed since the last update, so an empty set is returned. When the SELECT returns an empty set, nothing is inserted. If the SELECT returns a row (a different price was found), that row is 7870, 'us', 999, NOW(), which is inserted into our table.
EDIT - I actually found a mistake with the above query a little while later and I have since revised it. The query above will insert a new row if the price has changed since the last update, but it will not insert a row if there are currently no prices in the database for that item.
To resolve this I had to take advantage of the DUAL table (which always contains one row), then use an OR in the where clause to test for a different price OR an empty set
INSERT INTO `steam_prices`(
`steam_id`,
`cc`,
`price`,
`update`
)
SELECT 12345, 'us', 999, NOW()
FROM DUAL
WHERE
NOT EXISTS (
SELECT `steam_id`
FROM `steam_prices`
WHERE `steam_id`=12345 AND `cc`='us'
)
OR
EXISTS (
SELECT p1.`steam_id`
FROM `steam_prices` AS p1
LEFT JOIN `steam_prices` AS p2 ON p1.`steam_id`=p2.`steam_id` AND p1.`cc`=p2.`cc` AND p1.`update` < p2.`update`
WHERE
p2.`steam_id` IS NULL
AND p1.`steam_id`=12345
AND p1.`cc`='us'
AND (
p1.`price`<>999
)
)
It's very long, it's very ugly, and it's very complicated. But it works exactly as advertised. If there is no price in the database for a certain steam_id then it inserts a new row. If there is already a price then it checks the price with the most recent update and, if different, inserts a new row.
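The same "insert only when the latest price differs" rule can be written more compactly by comparing the new price against a scalar subquery that fetches the most recent price for that game and country. This is a sketch under assumptions, not the poster's query: it runs on Python's sqlite3 module (where a SELECT needs no FROM DUAL), the timestamp column is called updated here to avoid the reserved word update, and open_db and record_price are hypothetical helper names.

```python
import sqlite3

# Insert the new row when there is no price yet for (steam_id, cc),
# or when the most recent stored price differs from the new one.
CONDITIONAL_INSERT_SQL = """
INSERT INTO steam_prices (steam_id, cc, price, updated)
SELECT :steam_id, :cc, :price, :updated
WHERE NOT EXISTS (
          SELECT 1 FROM steam_prices
          WHERE steam_id = :steam_id AND cc = :cc)
   OR (SELECT price FROM steam_prices
       WHERE steam_id = :steam_id AND cc = :cc
       ORDER BY updated DESC
       LIMIT 1) <> :price
"""


def open_db():
    """Create an empty in-memory price history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE steam_prices ("
        " steam_id INTEGER NOT NULL, cc TEXT NOT NULL,"
        " price INTEGER NOT NULL, updated TEXT NOT NULL)"
    )
    return conn


def record_price(conn, steam_id, cc, price, updated):
    """Store the price if it changed; return True when a row was inserted."""
    cur = conn.execute(
        CONDITIONAL_INSERT_SQL,
        {"steam_id": steam_id, "cc": cc, "price": price, "updated": updated},
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_db()
print(record_price(conn, 112233, "us", 999, "2011-08-20"))  # first price
print(record_price(conn, 112233, "us", 999, "2011-08-21"))  # unchanged
print(record_price(conn, 112233, "us", 499, "2011-08-22"))  # on sale
print(record_price(conn, 112233, "us", 999, "2011-08-23"))  # back to 999
```

Because the check and the insert happen in one statement, a game that goes on sale and later returns to its original price still produces a new row, which a unique index on price could not do.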