Repair database based on two tables - mysql

I'm back with another question related to my previous ones. A while ago I created a simple web product parser app that saves prices from different websites so I can compare them, but after a while I found a relatively big problem. I will explain everything below.
I have a lot of MySQL tables with the following format:
products with id, name, link
products-prices with id, id_prod, price, availability and date
As you can see, the products-prices table has a column id_prod which links to the id in the products table. When I parsed the link for every product I thought they were unique, but in reality I have 3-4 links for every product. For example, consider www.example.com/smth: instead of storing it normalized like that (without http/s and without the trailing /), I stored the whole link, and now I have 4 different products (basically the same one) with http://www.example.com/smth, https://www.example.com/smth, http://www.example.com/smth/, https://www.example.com/smth/. Now I want to write a query to repair my database: delete 1 to 3 entries so that only one product remains in products, and also update the id_prod of every affected entry in products-prices.
I don't want a direct answer; if you can point me to a tutorial or to the concept/syntax I need, I will be more than thankful. Have a good day!
Edit, real world example
https://images2.imgbox.com/f5/a5/0bdvqXcu_o.png
https://images2.imgbox.com/22/e8/BTbPLCzE_o.png
In the first picture, you can see that the only difference between those 3 products is the link: one of them is http and the other two are https, and of the two https links one has a trailing slash. In the second picture I have a lot of entries (yeah, I know, very inefficient) which, in this example, I want to point to the product with id 2 from the first picture.

Try a simple grouping to ascertain the scale of the problem:
SELECT COUNT(PRODID) AS C, PRODID
FROM YOURTABLE
GROUP BY PRODID
HAVING COUNT(PRODID) >1
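Once you have identified the scale of the issue, you could create a staging table that holds one record per PRODID, using a sequence number that restarts for each PRODID, as below:
-- MySQL has no SELECT ... INTO <table>; CREATE TABLE ... AS SELECT does the same job
CREATE TABLE TmpTable AS
SELECT *
FROM
(SELECT
-- @row_number and @PRODID are MySQL user variables
@row_number:=CASE
WHEN @PRODID = PRODID THEN @row_number + 1
ELSE 1
END AS SEQ,
@PRODID:=PRODID AS PRODID
FROM
YOURTABLE
ORDER BY PRODID) dups
WHERE dups.SEQ = 1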
You could then delete all the affected rows in your source table:
DELETE FROM YOURTABLE
WHERE PRODID IN (SELECT PRODID FROM TmpTable)
And then finally write the rows back from your temp table:
INSERT INTO YOURTABLE
SELECT field1, field2 etc. FROM TmpTable
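For the situation in the question (several products rows that differ only by scheme or trailing slash, with products-prices rows pointing at each of them), the repair can be sketched end to end. This is only a rough sketch under stated assumptions: it uses Python's built-in sqlite3 in place of MySQL so that it is runnable, the prices table is named products_prices because of the hyphen, the helper names normalize_link and merge_duplicate_products are made up for illustration, and keeping the lowest id per link is an arbitrary choice (the example in the question keeps id 2 instead).

```python
import sqlite3
from urllib.parse import urlsplit


def normalize_link(link: str) -> str:
    """Drop the scheme, query string and trailing slash of a link."""
    parts = urlsplit(link)
    return (parts.netloc + parts.path).rstrip("/")


def merge_duplicate_products(conn: sqlite3.Connection) -> None:
    """Keep the lowest id per normalized link, repoint prices, delete the rest."""
    keep = {}   # normalized link -> id of the product that survives
    remap = {}  # id of a duplicate -> id of the product that survives
    for prod_id, link in conn.execute("SELECT id, link FROM products ORDER BY id").fetchall():
        key = normalize_link(link)
        if key in keep:
            remap[prod_id] = keep[key]
        else:
            keep[key] = prod_id
    with conn:  # one transaction: everything is repaired or nothing is
        for old_id, new_id in remap.items():
            conn.execute("UPDATE products_prices SET id_prod = ? WHERE id_prod = ?", (new_id, old_id))
            conn.execute("DELETE FROM products WHERE id = ?", (old_id,))
        for key, prod_id in keep.items():
            conn.execute("UPDATE products SET link = ? WHERE id = ?", (key, prod_id))


# Made-up sample data mirroring the example links from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, link TEXT);
    CREATE TABLE products_prices (id INTEGER PRIMARY KEY, id_prod INTEGER, price REAL);
    INSERT INTO products VALUES
        (1, 'smth', 'http://www.example.com/smth'),
        (2, 'smth', 'https://www.example.com/smth'),
        (3, 'smth', 'https://www.example.com/smth/'),
        (4, 'other', 'https://www.example.com/other');
    INSERT INTO products_prices VALUES (1, 1, 9.5), (2, 2, 9.0), (3, 3, 8.5), (4, 4, 20.0);
""")
merge_duplicate_products(conn)
```

After the call, products holds only ids 1 and 4 with normalized links, and every price row that pointed at id 2 or 3 now points at id 1. Storing the normalized link and adding a unique index on it afterwards would stop the duplicates from coming back.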

Related

MYSQL checking if a record exists with the specified child-records

Okay, let's say I have a table called rooms:
It only has one column: ID
I also have another table called items_in_rooms with columns:
roomId, itemName, itemColor
Whenever a room-record is inserted, a bunch of records is also inserted into items_in_rooms linked to the room-record, specifying what items are in that room.
The problem is that when a room-record is inserted along with its items, I first need to verify that a room with those exact items doesn't already exist.
How can this be done?
One way of course would be to first fetch all room-records along with all their items then look through them until it has been verified there isn't already an exact copy in the database and then do insertion if it's unique.
But this sounds a bit inefficient to me, especially as the tables grow very large, so I was hoping there's a way to have MySQL do the checking.
One way I came up with was to do something like this:
SELECT roomId FROM(
SELECT rooms.id roomId, GROUP_CONCAT(
CONCAT_WS(',',itemName,itemColor) ORDER BY itemName,itemColor SEPARATOR '/'
) roomContents
FROM items_in_rooms
JOIN rooms ON roomId=rooms.id
WHERE snapshotDate='$dateString'
GROUP BY roomId
) concatenatedRoomContents
WHERE roomContents='bed,white/carpet,red/chair,brown'
Essentially this makes MySQL concatenate each room into a string, then compares it to the "input-string" in the WHERE clause. Obviously the input-string has to be ordered the same way MySQL orders the rows before concatenating (itemName, itemColor).
While this worked for me, it felt very dirty. It also initially caused some problems when I added a decimal field, as MySQL always includes every decimal digit when stringifying, so 1 for instance could be "1.000", while PHP, which I'm using, by default stringifies it to "1". I solved this using number_format(), making it include the right number of decimal digits.
Now I've noticed I've got some duplicates in the table again, so there's some other gotcha I need to find, but I was just wondering if there's maybe a more clever way?
This is how it can be done. The following query returns the id of the room if such a room exists (it has exactly those items, no more, no less), assuming the same item is never stored twice for one room.
SELECT roomId
FROM items_in_rooms
GROUP BY roomId
-- the room has exactly 3 items, and all 3 of them are in the wanted list
HAVING COUNT(*) = 3
AND SUM((itemName,itemColor)
IN (('bed','white'),('carpet','red'),('chair','brown'))) = 3
Thanks, CBroe.
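The "no more, no less" check can be tried out on a small data set. The following is a hedged sketch that uses Python's built-in sqlite3 instead of MySQL; the function name find_room_with_exact_items, the temporary wanted table and the sample rooms are all made up for illustration, and it assumes neither a room nor the input list contains the same item twice.

```python
import sqlite3


def find_room_with_exact_items(conn, items):
    """Return ids of rooms whose items are exactly `items`, a list of distinct (name, color) pairs."""
    conn.execute("CREATE TEMP TABLE wanted (itemName TEXT, itemColor TEXT)")
    conn.executemany("INSERT INTO wanted VALUES (?, ?)", items)
    rows = conn.execute(
        """
        SELECT r.roomId
        FROM items_in_rooms AS r
        LEFT JOIN wanted AS w
               ON w.itemName = r.itemName AND w.itemColor = r.itemColor
        GROUP BY r.roomId
        HAVING COUNT(*) = ?            -- the room holds as many items as we want
           AND COUNT(w.itemName) = ?   -- and each of them is a wanted item
        ORDER BY r.roomId
        """,
        (len(items), len(items)),
    ).fetchall()
    conn.execute("DROP TABLE wanted")
    return [room_id for (room_id,) in rows]


# Made-up sample rooms: only room 1 has exactly the three wanted items.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_in_rooms (roomId INTEGER, itemName TEXT, itemColor TEXT)")
conn.executemany("INSERT INTO items_in_rooms VALUES (?, ?, ?)", [
    (1, "bed", "white"), (1, "carpet", "red"), (1, "chair", "brown"),
    (2, "bed", "white"), (2, "carpet", "red"), (2, "chair", "brown"), (2, "lamp", "black"),
    (3, "bed", "white"), (3, "carpet", "red"),
    (4, "bed", "white"), (4, "carpet", "red"), (4, "chair", "blue"),
])
matches = find_room_with_exact_items(conn, [("bed", "white"), ("carpet", "red"), ("chair", "brown")])
```

Here matches is [1]: room 2 has an extra item, room 3 is missing one, and room 4 has a chair of the wrong color.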

MySQL Union (or similar) query

I have some booking data from a pair of views in MySQL. They match columns perfectly, and the main difference is a booking code that is placed in one of these rows.
The context is as follows: this is for calculating numbers for a sports camp. People are booked in, but can do extra activities.
View 1: All specialist bookings (say: a football class).
View 2: A general group.
Due to the old software, the booking process results in many people booking for the general group and then are upgraded to the old class. This is further complicated by some things elsewhere in the business.
To be clear - View 1 actually contains some (but are not exclusively all) people from within View 2. There's an intersection of the two groups. Obviously people can't be in two groups at once (there's only one of them!).
Finding all people who are in View 2 is of course easy... as is View 1. BUT, I need to produce a report which is basically:
"View 1" overwriting "View 2"... or put another way:
"View 1" [sort of] UNION "View 2"
However: I'm not sure the best way of doing this as there are added complications:
Each row is approximately as follows (with other stuff omitted):
User ID  Timeslot  Activity
1        A         Football
1        A         General
2        A         General
3        A         Football
As you can see, these rows all concern timeslot A:
- User 2 does general activities.
- User 3 does football.
- User 1 does football AND general.
The above is what a UNION (distinct) returns: since no two rows are identical in every column, nothing is removed.
The output I need is as follows:
User ID  Timeslot  Activity
1        A         Football
2        A         General
3        A         Football
Here, Football has taken "precedence" over "general", and thus I get the picture of where people are at any time.
This UNION has a distinct clause on a number of fields, but ignores others.
So: does anyone know how to do what amounts to:
"add two tables together and overwrite one of them if it's the same timeslot"
Or something like a:
"selective distinct on UNION DISTINCT".
Cheers
Rick
Try this:
SELECT *
FROM
(SELECT *,
IF(Activity='General',1,0) AS order_column
FROM `Table1`
ORDER BY order_column) AS tmp
GROUP BY UserId
This adds an order_column to your original table that has value 1 if the Activity is General and 0 otherwise. Ordering the temporary table by this column (ascending) puts the General records after all others, and grouping the result by user id then keeps one record per user. Note that this relies on MySQL picking the first matching record for a GROUP BY without aggregate functions, which is not guaranteed: the server is free to choose any row from each group, and with ONLY_FULL_GROUP_BY enabled (the default since MySQL 5.7.5) the query is rejected.
EDIT:
If you don't want to use GROUP BY without an aggregate function, this is an 'ugly' alternative:
SELECT UserId,
Timeslot,
SUBSTRING(MAX(CASE Activity WHEN 'General' THEN '00General' WHEN 'Football' THEN '01Football' ELSE Activity END), 3) AS Activity
FROM `Table1`
GROUP BY UserId,
Timeslot
LIMIT 0, 30
Here we need to define each possible value for Activity.
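Another way to read "View 1 overwriting View 2" is: take every View 1 row, then add only those View 2 rows that have no View 1 row for the same user and timeslot. The sketch below shows that idea with a NOT EXISTS filter; it is an alternative technique, not the one from the answer above, it runs on Python's built-in sqlite3 rather than MySQL, and the table names specialist and general are placeholders standing in for the two views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE specialist (UserId INTEGER, Timeslot TEXT, Activity TEXT);  -- stands in for View 1
    CREATE TABLE general    (UserId INTEGER, Timeslot TEXT, Activity TEXT);  -- stands in for View 2
    INSERT INTO specialist VALUES (1, 'A', 'Football'), (3, 'A', 'Football');
    INSERT INTO general    VALUES (1, 'A', 'General'),  (2, 'A', 'General');
""")

# All specialist rows, plus the general rows that no specialist row overrides.
OVERLAY_QUERY = """
    SELECT UserId, Timeslot, Activity FROM specialist
    UNION ALL
    SELECT g.UserId, g.Timeslot, g.Activity
    FROM general AS g
    WHERE NOT EXISTS (
        SELECT 1 FROM specialist AS s
        WHERE s.UserId = g.UserId AND s.Timeslot = g.Timeslot
    )
    ORDER BY UserId, Timeslot
"""
report = conn.execute(OVERLAY_QUERY).fetchall()
```

With the sample rows from the question, report is [(1, 'A', 'Football'), (2, 'A', 'General'), (3, 'A', 'Football')], and no list of possible Activity values has to be maintained.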

SQL deduping help?

I'm sure there are a ton of ways to do this, but right now I'm struggling to find the way that will work properly given the data.
I basically have a table containing duplicates which have additional fields tied to them and source details that take priority over others. So basically I added a "priority" field to my table which I then updated based on source priority. I now need to select the distinct records to populate my "unique" records table (which I'll then apply unique key constraint to prevent this from happening again on the field required!)....
So I have basically, something like this:
Select phone, carrier, src, priority
from dbo.mytable
So basically I need to pull distinct on phone in order of priority (1,2,3,4, etc), and basically pull the rest of the other data along with it and still keep UNIQUE on phone.
I've tried a few things using sub-select from the same table with min(priority) value, but outcome still doesn't seem to make sense. Any help would be greatly appreciated. Thanks!
EDIT I need to dedupe from the same table, but I can populate a new table with the uniques if needed based on my select statement to pull the uniques. This is in MSSQL, but figured anyone with SQL knowledge could answer.
For example, let's say I have the following rows:
5556667777, ATT, source1, 1
5556667777, ATT, source2, 2
5556667777, ATT, source3, 3
I need to pull uniques based on priority 1 first... the problem is, I need to remove any and all other dupes from the table based on the priority order, without ending up with the same phone number twice again. Make sense?
So you're saying the combination (phone, priority) is unique in the existing table, and you want to select the rows for which the priority is smallest?
SELECT mytable.phone, mytable.carrier, mytable.src
FROM mytable
INNER JOIN (
SELECT phone, MIN(priority) AS minpriority
FROM mytable
GROUP BY phone
) AS minphone
ON mytable.phone = minphone.phone
AND mytable.priority = minphone.minpriority
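To populate the separate "unique" table mentioned in the question, the same join can feed an INSERT. The following is a minimal sketch run against Python's built-in sqlite3 rather than MSSQL; the unique_phones table name and the second phone number are made up, and it assumes (phone, priority) is unique in the source table, otherwise the primary key would reject the tied rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (phone TEXT, carrier TEXT, src TEXT, priority INTEGER);
    -- the primary key enforces one row per phone from now on
    CREATE TABLE unique_phones (phone TEXT PRIMARY KEY, carrier TEXT, src TEXT);
    INSERT INTO mytable VALUES
        ('5556667777', 'ATT', 'source2', 2),
        ('5556667777', 'ATT', 'source1', 1),
        ('5556667777', 'ATT', 'source3', 3),
        ('5558889999', 'VZW', 'source3', 3);
""")

# Copy only the row with the smallest priority for each phone.
conn.execute("""
    INSERT INTO unique_phones (phone, carrier, src)
    SELECT m.phone, m.carrier, m.src
    FROM mytable AS m
    JOIN (SELECT phone, MIN(priority) AS minpriority
          FROM mytable
          GROUP BY phone) AS best
      ON m.phone = best.phone AND m.priority = best.minpriority
""")
uniques = conn.execute("SELECT phone, carrier, src FROM unique_phones ORDER BY phone").fetchall()
```

Here uniques contains one row per phone: source1 for 5556667777 (priority 1 wins) and source3 for the phone that only has a priority 3 entry.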

Need to delete random tuples from database in SQL

We're hiring some third-party test engineers and programmers to help us with some bugs on our website. They will be working on a beta installation of our web application. We need to give them a copy of our database, but we don't want to give them the entire thing; it's a huge database of companies. So we want to give them a watered-down version that has only a small fraction of the actual data, just enough for proper testing.
We have data in the following Schema:
COMPANIES
ID|NAME|CATEGORY|COUNTRY_ID.....
We also have a set number of categories and countries.
The thing is that we don't want the deletion to be too random, basically out of the hundreds of thousands of entries we need to give them a version that has a few hundred entries but such that, you have at least 2-3 companies for each country and category.
I'm a bit perplexed as to how to do a select query with the above restriction, much less a delete.
It's a MySQL database we would be using here. Can this be even done in SQL or do we need to make a script in php or so?
The following select statement picks the companies with the 3 lowest ids for each category, country_id combination. MySQL rejects LIMIT inside an IN subquery, so the rank is computed by counting the rows with a smaller id in the same group:
select id, name, category, country_id
from companies c1
where (
select count(*)
from companies c2
where c2.category=c1.category and c2.country_id=c1.country_id
and c2.id<c1.id
) < 3;
Not sure my answer will fit your needs since I am doing some assumptions that may be wrong, but you could try the following approach:
select category, country_id, min(id) id1, max(id) id2
from companies
group by country_id, category
order by country_id, category
This query only gives you 2 company ids instead of 3, and they will be the first and last ids that match each category and country.
Please also note that I wrote this off the top of my head and have no MySQL engine to test it.
Hope that helps or at least gives you a hint on how to do it.
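Rather than deleting from the live table, one option is to copy the sampled rows into a new table and export only that. The sketch below is an illustration under assumptions: it uses Python's built-in sqlite3 instead of MySQL, the companies_sample table name and all the data are made up, and "at most 3 per group" is implemented by counting the rows with a smaller id in the same category and country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, category TEXT, country_id INTEGER)"
)
# Made-up data: 5 companies for each of 2 categories x 2 countries (20 rows).
rows = [(f"company {n}", cat, country)
        for cat in ("retail", "tech") for country in (1, 2) for n in range(5)]
conn.executemany("INSERT INTO companies (name, category, country_id) VALUES (?, ?, ?)", rows)

# Keep a row only if fewer than 3 rows of its group have a smaller id.
conn.execute("""
    CREATE TABLE companies_sample AS
    SELECT c1.*
    FROM companies AS c1
    WHERE (SELECT COUNT(*)
           FROM companies AS c2
           WHERE c2.category = c1.category
             AND c2.country_id = c1.country_id
             AND c2.id < c1.id) < 3
""")
sample_counts = conn.execute("""
    SELECT category, country_id, COUNT(*)
    FROM companies_sample
    GROUP BY category, country_id
    ORDER BY category, country_id
""").fetchall()
```

Each of the four category/country groups ends up with exactly 3 rows in companies_sample, so 12 of the 20 made-up companies are kept. On a table with hundreds of thousands of rows the correlated count benefits from an index on (category, country_id, id).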

Showing all duplicates, side by side, in MySQL

I have a table like so:
Table eventlog
user | user_group | event_date | event_dur
---- | ---------- | ---------- | ---------
xyz  | 1          | 2009-1-1   | 3.5
xyz  | 2          | 2009-1-1   | 4.5
abc  | 2          | 2009-1-2   | 5
abc  | 1          | 2009-1-2   | 5
Notice that in the above sample data, the only reliable things are the date and the user. Through an oversight that is 90% my fault, I have managed to allow users to duplicate their daily entries. In some instances the duplicates were intended to be updates to their duration, in others they were an attempt to change the user_group they were working with that day, and in other cases both.
Fortunately, I have a fairly strong idea (since this is an update to an older system) of which records are correct. (Basically, this all happened as an attempt to seamlessly merge the old DB with the new DB).
Unfortunately, I have to more or less do this by hand, or risk losing data that only exists on one side and not the other....
Long story short, I'm trying to figure out the right MySQL query to return all records that have more than one entry for a user on any given date. I have been struggling with GROUP BY and HAVING, but the best I can get is a list of one of the two duplicates, per duplicate, which would be great if I knew for sure it was the wrong one.
Here is the closest I've come:
SELECT *
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(user) > 1
ORDER BY event_date, user
Any help with this would be extremely useful. If need be, I have the list of users/date for each set of duplicates, so I can go by hand and remove all 400 of them, but I'd much rather see them all at once.
Thanks!
Would this work?
SELECT event_date, user
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(*) > 1
ORDER BY event_date, user
What's throwing me off is the COUNT(user) clause you have.
You can list all the field values of the duplicates with GROUP_CONCAT function, but you still get one row for each set.
I think this would work (untested)
SELECT *
FROM eventlog e1
WHERE 1 <
(
SELECT COUNT(*)
FROM eventlog e2
WHERE e1.event_date = e2.event_date
AND e1.user = e2.user
)
-- AND [maybe an additional constraint to find the bad duplicate]
ORDER BY event_date, user;
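The same "show every row of every duplicate set" result can also be obtained by joining the table back to the grouped list of duplicated (event_date, user) pairs. This is a hedged sketch using Python's built-in sqlite3 in place of MySQL; the sample rows are the ones from the question plus one made-up row that has no duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE eventlog (user TEXT, user_group INTEGER, event_date TEXT, event_dur REAL);
    INSERT INTO eventlog VALUES
        ('xyz', 1, '2009-01-01', 3.5),
        ('xyz', 2, '2009-01-01', 4.5),
        ('abc', 2, '2009-01-02', 5),
        ('abc', 1, '2009-01-02', 5),
        ('abc', 1, '2009-01-03', 2);  -- made-up row with no duplicate
""")

# Inner query: the (event_date, user) pairs that occur more than once.
# Outer query: every full row belonging to one of those pairs.
duplicates = conn.execute("""
    SELECT e.user, e.user_group, e.event_date, e.event_dur
    FROM eventlog AS e
    JOIN (SELECT event_date, user
          FROM eventlog
          GROUP BY event_date, user
          HAVING COUNT(*) > 1) AS d
      ON d.event_date = e.event_date AND d.user = e.user
    ORDER BY e.event_date, e.user, e.user_group
""").fetchall()
```

The result lists both xyz rows for 2009-01-01 and both abc rows for 2009-01-02 next to each other, while the single abc entry for 2009-01-03 is left out.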