First, I'll explain what I need to do, then how I think I can achieve it. My current plan seems very inefficient in theory, so my question is whether there is a better way of accomplishing it.
I have 2 Tables - lets call them 'Products' and 'Products_Temp', both are identical. I need to download a large number of files (XML or XLS) which contain product details (stock, pricing etc) from suppliers. These are then parsed into the Products_Temp table. Right now, I plan to use CF Scheduled Tasks to handle the downloading, and Navicat to do the actual parsing - I'm happy enough this is adequate and efficient enough.
The next step is where I'm struggling - once the file has been downloaded and parsed, I need to look for any changes in the data. This will be compared against the Products table. If a change is found, then that row should be added or updated (if it should be removed, then I'll need to flag it rather than just delete it). Once all the data has been compared, the products_temp table should be emptied.
I'm aware of methods to compare tables and sync them accordingly, however the issue I have is the fact I'll be handling multiple files from different sources. I had considered using only the products table and append/update, but I'm unsure how I could manage the 'flag deleted' requirement.
Right now, the only way I know I can make it work is to loop through the products_temp table, do various cfquerys and delete the row once complete. However, that seems incredibly inefficient, and given the fact we're likely to be dealing with hundreds of thousands of rows, unlikely to be effective if we update everything daily.
Any pointers or advice on a better route would be appreciated!
Both responses have possibilities. Just to expand on your options a little ..
Option #1
IF mySQL supports some sort of hashing, on a per row basis, you could use a variation of comodoro's suggestion to avoid hard deletes.
Identify Changed
To identify changes, do an inner join on the primary key and check the hash values. If they are different, the product was changed and should be updated:
UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.ProductName = tmp.ProductName
, p.Stock = tmp.Stock
, ...
, p.DateLastChanged = now()
, p.IsDiscontinued = 0
WHERE tmp.TheRowHash <> p.TheRowHash
Identify Deleted
Use a simple outer join to identify records that do not exist in the temp table, and flag them as "deleted"
UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.DateLastChanged = now()
, p.IsDiscontinued = 1
WHERE tmp.ProductID IS NULL
Identify New
Finally, use a similar outer join to insert any "new" products.
INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
FROM Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
WHERE p.ProductID IS NULL
Option #2
If per row hashing is not feasible, an alternate approach is a variation of Sharondio's suggestion.
Add a "status" column to the temp table and flag all imported records as "new", "changed" or "unchanged" through a series of joins. (The default should be "changed").
Identify UN-Changed
First use an inner join, on all fields, to identify products that have NOT changed. (Note, if your table contains any nullable fields, remember to use something like coalesce Otherwise, the results may be skewed because null values are not equal to anything.
UPDATE Products_Temp tmp INNER JOIN Products p ON tmp.ProductID = p.ProductID
SET tmp.Status = 'Unchanged'
WHERE p.ProductName = tmp.ProductName
AND p.Stock = tmp.Stock
...
Identify New
Like before, use an outer join to identify "new" records.
UPDATE Products_Temp tmp LEFT JOIN Products p ON tmp.ProductID = p.ProductID
SET tmp.Status = 'New'
WHERE p.ProductID IS NULL
By process of elimination, all other records in the temp table are "changed". Once you have calculated the statuses, you can update the Products table:
/* update changed products */
UPDATE Products p INNER JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.ProductName = tmp.ProductName
, p.Stock = tmp.Stock
, ...
, p.DateLastChanged = now()
, p.IsDiscontinued = 0
WHERE tmp.status = 'Changed'
/* insert new products */
INSERT INTO Products ( ProductName, Stock, DateLastChanged, IsDiscontinued, .. )
SELECT tmp.ProductName, tmp.Stock, now() AS DateLastChanged, 0 AS IsDiscontinued, ...
FROM Products_Temp tmp
WHERE tmp.Status = 'New'
/* flag deleted records */
UPDATE Products p LEFT JOIN Products_Temp tmp ON tmp.ProductID = p.ProductID
SET p.DateLastChanged = now()
, p.IsDiscontinued = 1
WHERE tmp.ProductID IS NULL
For finding the changes, I'd look at joins based on the fields you want to match on. This can be slow, depending on the number of fields and whether or not they're indexed, but I'd still say it was faster than loops. Something along the lines of:
SELECT product_id
FROM Products
WHERE product_id NOT IN (
SELECT T.product_id
FROM Products_Temp T
INNER JOIN PRODUCTS P
ON (
P.field1 = T.field1
AND P.field2 = T.field2
...
)
)
For the missing products to find the non-matches:
SELECT P.product_id
FROM Products P
LEFT OUTER JOIN Products_Temp T
ON (P.field1 = T.field1
AND P.field2 = T.field2
...)
WHERE T.product_id IS NULL
I had to solve a similar problem once, maybe the solution is applicable in your case (I do not know Coldfusion much). Why not (for each source) just delete everything from table Products corresponding to that source and replacing it with Products_Temp from the same source? It assumes you can make a unique field for each source. The SQL code would look something like:
DELETE FROM Products WHERE source_id = x;
INSERT INTO Products (field1, field2, ..., source_id)
SELECT field1, field2, ..., x FROM Products_Temp;
Also if the source doesn't change much, you can consider making a hash after its downloading and skipping the update if it did not change to save some database access.
Related
I have 3 tables; products, variants, and stock {simplified}
PRODUCTS
id
name
discontinued (ENUM: 0 or 1)
etc
VARIANTS
id
product_id
colour
size
STOCK
id
variant_id
branch_id
When the user selects to discontinue the PRODUCT I set the discontinued flag to 1. That's fine, but I want to delete the PRODUCT and VARIANTS records altogether if there is no STOCK record of the product. Obviously I can do this using a SELECT query first in PHP but I would like to do it in one mySQL query. This is what I have so far, but it is returning an error from mySQL:
$query = "DELETE FROM prod_lines,
JOIN variants ON variants.lineid = prod_lines.id
WHERE prod_lines.id = '$lineid' AND
(SELECT COUNT(id) FROM stock WHERE stock.variantid = variants.id) = 0";
Can anybody out there help me come up with the right solution? Maybe there is a better way that doesn't even involve a subquery?
Thanks in advance
I assume that the product_line table in your query is the PRODUCTS table you describe in the beginning of your post.
DELETE prod_lines.* -- you must specify which table you are deleting from, because there are several tables in the FROM clause
FROM prod_lines
JOIN variants ON variants.lineid = prod_lines.id
LEFT JOIN stock ON stock.variantid = variants.id
WHERE prod_lines.id = #lineid -- your "$lineid" variable here
AND stock.id IS NULL; -- selects only items with no match in the "stock" table
http://sqlfiddle.com/#!2/40a21
Try something like this,the downside is though you have to have two php variables here and also an nested query that you may not be ok with
DELETE FROM (select COUNT(id) countstock FROM stock s,variants v WHERE s.variantid = v.id and v.productid='$lineid') prod
where prod.countstock>1 and prod.productid='$lineid'
I have a table in my database to store user data. I found a defect in the code that adds data to this table database where if a network timeout occurs, the code updated the next user's data with the previous user's data. I've addressed this defect but I need to clean the database. I've added a flag to indicate the rows that need to be ignored and my goal is to mark these flags accordingly for duplicates. In some cases, though, duplicate values may actually be legitimate so I am more interested in finding several user's with the same data (i.e, u> 2).
Here's an example (tablename = Data):
id---- user_id----data1----data2----data3----datetime-----------flag
1-----usr1--------3---------- 2---------2---------2012-02-16..-----0
2-----usr2--------3---------- 2---------2---------2012-02-16..-----0
3-----usr3--------3---------- 2---------2---------2012-02-16..-----0
In this case, I'd like to mark the 1 and 2 id flags as 1 (to indicate ignore). Since we know usr1 was the original datapoint (assuming the oldest dates are earlier in the list).
At this point there are so many entries in the table that I'm not sure the best way to identify the users that have duplicate entries.
I'm looking for a mysql command to identify the problem data first and then I'll be able to mark the entries. Could someone guide me in the right direction?
Well, first select duplicate data with their min user id:
CREATE TEMPORARY TABLE duplicates
SELECT MIN(user_id), data1,data2,data3
FROM data
GROUP BY data1,data2,data3
HAVING COUNT(*) > 1 -- at least two rows
AND COUNT(*) = COUNT(DISTINCT user_id) -- all user_ids must be different
AND TIMESTAMPDIFF( MINUTE, MIN(`datetime`), MAX(`datetime`)) <= 45;
(I'm not sure, if I used TIMESTAMPDIFF properly.)
Now we can update the flag in those rows where user_id is different:
UPDATE duplicate
INNER JOIN data ON data.data1 = duplicate.data1
AND data.data2 = duplicate.data2
AND data.data3 = duplicate.data3
AND data.user_id != duplicate.user_id
SET data.flag = 1;
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
If there are duplicate times involved, maybe try this
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,,`datetime`,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3,`datetime`
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
I have two tables - one called customer_records and another called customer_actions.
customer_records has the following schema:
CustomerID (auto increment, primary key)
CustomerName
...etc...
customer_actions has the following schema:
ActionID (auto increment, primary key)
CustomerID (relates to customer_records)
ActionType
ActionTime (UNIX time stamp that the entry was made)
Note (TEXT type)
Every time a user carries out an action on a customer record, an entry is made in customer_actions, and the user is given the opportunity to enter a note. ActionType can be one of a few values (like 'designatory update' or 'added case info' - can only be one of a list of options).
What I want to be able to do is display a list of records from customer_records where the last ActionType was a certain value.
So far, I've searched the net/SO and come up with this monster:
SELECT * FROM (
SELECT * FROM (
SELECT * FROM `customer_actions` ORDER BY `EntryID` DESC
) list1 GROUP BY `CustomerID`
) list2 WHERE `ActionType`='whatever' LIMIT 0,30
Which is great - it lists each customer ID and their last action. But the query is extremely slow on occasions (note: there are nearly 20,000 records in customer_records). Can anyone offer any tips on how I can sort this monster of a query out or adjust my table to give faster results? I'm using MySQL. Any help is really appreciated, thanks.
Edit: To be clear, I need to see a list of customers who's last action was 'whatever'.
To filter customers by their last action, you could use a correlated sub-query...
SELECT
*
FROM
customer_records
INNER JOIN
customer_actions
ON customer_actions.CustomerID = customer_records.CustomerID
AND customer_actions.ActionDate = (
SELECT
MAX(ActionDate)
FROM
customer_actions AS lookup
WHERE
CustomerID = customer_records.CustomerID
)
WHERE
customer_actions.ActionType = 'Whatever'
You may find it more efficient to avoid the correlated sub-query as follows...
SELECT
*
FROM
customer_records
INNER JOIN
(SELECT CustomerID, MAX(ActionDate) AS ActionDate FROM customer_actions GROUP BY CustomerID) AS last_action
ON customer_records.CustomerID = last_action.CustomerID
INNER JOIN
customer_actions
ON customer_actions.CustomerID = last_action.CustomerID
AND customer_actions.ActionDate = last_action.ActionDate
WHERE
customer_actions.ActionType = 'Whatever'
I'm not sure if I understand the requirements but it looks to me like a JOIN would be enough for that.
SELECT cr.CustomerID, cr.CustomerName, ...
FROM customer_records cr
INNER JOIN customer_actions ca ON ca.CustomerID = cr.CustomerID
WHERE `ActionType` = 'whatever'
ORDER BY
ca.EntryID
Note that 20.000 records should not pose a performance problem
Please note that I've adapted Lieven's answer (I made a separate post as this was too long for a comment). Any credit for the solution itself goes to him, I'm just trying to show you some key points for improving performance.
If speed is a concern then the following should give you some suggestions for improving it:
select top 100 -- Change as required
cr.CustomerID ,
cr.CustomerName,
cr.MoreDetail1,
cr.Etc
from customer_records cr
inner join customer_actions ca
on ca.CustomerID = cr.CustomerID
where ca.ActionType = 'x'
order by cr.CustomerID
A few notes:
In some cases I find left outer joins to be faster then inner joins - It would be worth measuring performance for both for this query
Avoid returning * wherever possible
You don't have to reference 'cr.x' in the initial select but it's a good habit to get into for when you start working on large queries that can have multiple joins in them (this will make a lot of sense once you start doing this
When using joins always join on a primary key
Maybe I'm missing something but what's wrong with a simple join and a where clause?
Select ActionType, ActionTime, Note
FROM Customer_Records CR
INNER JOIN customer_Actions CA
ON CR.CustomerID = CA.CustomerID
Where ActionType = 'added case info'
I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but i'll admit, I am no mysql veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that you can't use subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you since a relative newbie in SQL. Use "aliases" for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It can help your life go easier, and anyone AFTER you to better know which columns come from what table, especially if same column name in different tables. Prevents ambiguity in the query. Your left join, I think, from the sample, may be ambigous, but confirm the join of B.ID to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry could have the "Team_ID" that such posting was from, in addition to its OWN "ID" for the blabbing table's unique key indicator.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, not logical "||" or, but SQL "OR"
EDIT -- per your comment
This could be done in a variety of ways, such as dynamic SQL building and executing, calling multiple times, once for each ID and merging the results, or additionally, by doing a join to yet another temp table that gets cleaned out say... daily.
If you have another table such as "TeamJoins", and it has say... 3 columns: a date, a sessionid and team_id, you could daily purge anything from a day old of queries, and/or keep clearing each time a new query by the same session ID (as it appears coming from PHP). Have two indexes, one on the date (to simplify any daily purging), and second on (sessionID, team_id) for the join.
Then, loop through to do inserts into the "TempJoins" table with the simple elements identified.
THEN, instead of a hard-coded list IN, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TB.Team_ID IN NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is;
I added an extra column to my blabbing table called team_id and set it to null as well as another field in my team_blabbing table called mem_id
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.
I have a MySQL table as follow:
id, user, culprit, reason, status, ts_register, ts_update
I was thinking of using the reason as an int field and store just the id of the reason that could be selected by the user and the reason itself could be increased by the admin.
What I meant by increased is that the admin could register new reason, for example currently we have:
Flood, Racism, Hacks, Other
But the admin could add a new reason for instance:
Refund
Now my problem is that I would like to allow my users to select multiple reasons, for example:
The report 01 have the reasons Flood and Hack.
How should I store the reason field so that I could select multiple reasons while maintaining a good table format?
Should I just go ahead and store it as a string and cast it as an INT when I am searching thru it or there are better forms to store it?
UPDATE Based on Jonathan's reply:
SELECT mt.*, group_concat(r.reason separator ', ') AS reason
FROM MainTable AS mt
JOIN MainReason AS mr ON mt.id = mr.maintable_ID
JOIN Reasons AS r ON mr.reason = r.reason_id
GROUP BY mt.id
The normalized solution is to have a second table containing one row for each reason:
CREATE TABLE MainReasons
(
MainTable_ID INTEGER NOT NULL REFERENCES MainTable(ID),
Reason INTEGER NOT NULL REFERENCES Reasons(ID),
PRIMARY KEY(MainTable_ID, Reason)
);
(Assuming your main table is called MainTable and you have a table defining valid reason codes called Reasons.)
From a comment:
[W]ould you be [so] kind [as] to show me an example of selecting something to retrieve a report's reason? I mean if I simple select it SELECT * FROM MainTABLE I would never get any reasons since MainTable doesnt know it right? Because it is only linked to the MainReasons and Reasons table so I would need to do something like SELECT * FROM MainTable LEFT JOIN MainReasons USING (MainTable_ID) or something alike but how would I go about getting all the reasons if multiples?
SELECT mt.*, r.reason
FROM MainTable AS mt
JOIN MainReason AS mr ON mt.id = mr.maintable_ID
JOIN Reasons AS r ON mr.reason = r.reason_id
This will return one row per reason - so it would return multiple rows for a single report (recorded in what I called MainTable). I omitted the reason ID number from the results - you can include it if you wish.
You can add criteria to the query, adding terms to a WHERE clause. If you want to see the reports where a specific reason is specified:
SELECT mt.*
FROM MainTable AS mt
JOIN MainReason AS mr ON mt.id = mr.maintable_ID
JOIN Reasons AS r ON mr.reason = r.reason_id
WHERE r.reason = 'Flood'
(You don't need the reason in the results - you know what it is.)
If you want to see the reports where 'Floods' and 'Hacks' were the reasons given, then you can write:
SELECT mt.*
FROM MainTable AS mt
JOIN (SELECT f.MainTable_ID
FROM (SELECT MainTable_ID
FROM MainReason AS mr1
JOIN Reasons AS r1 ON mr1.reason = r1.reason_ID
WHERE r1.reason = 'Floods'
) AS f ON f.MainTable_ID = mt.MainTable_ID
JOIN (SELECT f.MainTable_ID
FROM (SELECT MainTable_ID
FROM MainReason AS mr2
JOIN Reasons AS r2 ON mr2.reason = r2.reason_ID
WHERE r1.reason = 'Hacks'
) AS h ON h.MainTable_ID = mt.MainTable_ID
To do a one-to-many relationship, I would spin reason off into it's own table, like so:
id, parent_id, reason
parent_id would refer back into your current table's id.
You could store it as an INT, but it would be a continual pain to have to parse it every time you wanted to read the data. This way would just take one more join.