MySQL take duplicate data and combine unique data - mysql

With my MySQL database, I want to take data from my temporary table and insert it into my main table, while removing any duplicate data but also taking into consideration the data I already have. This seems to require an update and/or an insert depending on what exists in "data_table" so I really have no idea how to write it or if it is even possible. If this isn't possible, I'd like to know how to accomplish this while not considering what is already in "data_table" which I would think is possible. Thank you for your help!
Existing data_table before running query:
data_table
+----+-----+--------+-----------------+
| id | age | gender | color           |
+----+-----+--------+-----------------+
| 1  | 5   | m      | pink,red,purple |
+----+-----+--------+-----------------+
data_table_temp
+----+-----+--------+--------+
| id | age | gender | color  |
+----+-----+--------+--------+
| 1  | 5   | m      | red    |
| 2  | 5   | m      | blue   |
| 3  | 5   | m      | red    |
| 4  | 5   | m      | orange |
| 5  | 6   | m      | red    |
| 6  | 6   | m      | green  |
| 7  | 6   | m      | blue   |
+----+-----+--------+--------+
After query:
data_table
+----+-----+--------+-----------------------------+
| id | age | gender | color                       |
+----+-----+--------+-----------------------------+
| 1  | 5   | m      | pink,red,purple,blue,orange |
| 2  | 6   | m      | red,green,blue              |
+----+-----+--------+-----------------------------+

Here is an approach to this problem which turned out to be harder than I expected.
The idea is to concat the colors that don't match and put them together. There is a bit of a problem assigning ids. Getting the "2" for the second row is a problem, so this just assigned the id sequentially:
select @id := @id + 1 as id,
       coalesce(dt.age, dtt.age) as age,
       coalesce(dt.gender, dtt.gender) as gender,
       -- keep the existing list, then append only the colors it does not already contain
       concat_ws(',', dt.color,
                 group_concat(distinct case when coalesce(find_in_set(dtt.color, dt.color), 0) = 0
                                            then dtt.color
                                       end)
                ) as color
from data_table_temp dtt left outer join
     data_table dt
     on dt.age = dtt.age and
        dt.gender = dtt.gender cross join
     (select @id := 0) var
group by coalesce(dt.age, dtt.age), coalesce(dt.gender, dtt.gender);

MySQL doesn't have any string functions to (easily) split a delimited string (like data_table.color).
However, if you have all of the data in data_table_temp's format (one color per row), you can generate the desired results like this:
SELECT DISTINCT age, GROUP_CONCAT(DISTINCT color)
FROM table WHERE [condition]
GROUP BY age;
Optionally adding in gender, as necessary.
Apologies for the half-answer
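If the splitting has to happen outside MySQL, one option is to do the merge in application code. The following is a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for MySQL. The table contents come from the question; the helper name add_color and the delete-and-reinsert step are assumptions made for illustration, not part of either answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_table (id INTEGER PRIMARY KEY, age INTEGER, gender TEXT, color TEXT);
CREATE TABLE data_table_temp (id INTEGER PRIMARY KEY, age INTEGER, gender TEXT, color TEXT);
INSERT INTO data_table VALUES (1, 5, 'm', 'pink,red,purple');
INSERT INTO data_table_temp VALUES
  (1, 5, 'm', 'red'), (2, 5, 'm', 'blue'), (3, 5, 'm', 'red'), (4, 5, 'm', 'orange'),
  (5, 6, 'm', 'red'), (6, 6, 'm', 'green'), (7, 6, 'm', 'blue');
""")

merged = {}  # (age, gender) -> colors in first-seen order, without duplicates


def add_color(age, gender, color):
    """Hypothetical helper: record a color for a group unless it is already there."""
    bucket = merged.setdefault((age, gender), [])
    if color not in bucket:
        bucket.append(color)


# Existing rows hold comma-separated lists, so split them in Python.
for age, gender, colors in conn.execute(
        "SELECT age, gender, color FROM data_table ORDER BY id"):
    for color in colors.split(","):
        add_color(age, gender, color)

# Temp rows hold one color each.
for age, gender, color in conn.execute(
        "SELECT age, gender, color FROM data_table_temp ORDER BY id"):
    add_color(age, gender, color)

# Rewrite the main table from the merged groups (ids are reassigned sequentially).
conn.execute("DELETE FROM data_table")
conn.executemany(
    "INSERT INTO data_table (age, gender, color) VALUES (?, ?, ?)",
    [(age, gender, ",".join(colors)) for (age, gender), colors in merged.items()],
)
result = conn.execute(
    "SELECT id, age, gender, color FROM data_table ORDER BY id").fetchall()
for row in result:
    print(row)
```

With the sample data this leaves one row per age/gender group, with the colors in the order they were first seen.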

Related

SQL unwanted results in NOT query

This looks like it should be really easy question, but I've been looking for an answer for the past two days and can't find it. Please help!
I have two tables along the lines of
texts.text_id, texts.other_stuff...
text_pairs.pair_id, text_pairs.textA, text_pairs.textB
The second table defines pairs of entries from the first table.
What I need is the reverse of an ordinary LEFT JOIN query like:
SELECT texts.text_id
FROM texts
LEFT JOIN text_pairs
ON texts.text_id = text_pairs.textA
WHERE text_pairs.textB = 123
ORDER BY texts.text_id
How do I get exclusively the texts that are not paired with a given textB? I've tried
WHERE text_pairs.textB != 123 OR WHERE text_pairs.textB IS NULL
However, this returns all the pairs where textB is not 123. So, in a situation like
textA TextB
1 3
1 4
2 4
if I ask for textB != 3, the query returns 1 and 2. I need something that will just give me 1.
The comparison on the second table goes in the ON clause. Then you add a condition to see if there is no match:
SELECT t.text_id
FROM texts t LEFT JOIN
text_pairs tp
ON t.text_id = tp.textA AND tp.textB = 123
WHERE tp.textB IS NULL
ORDER BY t.text_id ;
This logic is often expressed using NOT EXISTS or NOT IN:
select t.*
from texts t
where not exists (select 1
from text_pairs tp
where t.text_id = tp.textA AND tp.textB = 123
);
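To see the two forms side by side, here is a small sketch that runs both against an in-memory SQLite database through Python's sqlite3 module. The sample rows are made up for illustration (they are not the asker's data), and SQLite stands in for MySQL here; both queries are expected to return the same ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE texts (text_id INTEGER PRIMARY KEY);
CREATE TABLE text_pairs (pair_id INTEGER PRIMARY KEY, textA INTEGER, textB INTEGER);
INSERT INTO texts VALUES (1), (2), (3);
-- hypothetical pairs: only text 1 is paired with textB = 123
INSERT INTO text_pairs (textA, textB) VALUES (1, 123), (1, 200), (2, 200);
""")

# Anti-join: the textB filter lives in ON, and WHERE keeps only unmatched rows.
left_join_sql = """
SELECT t.text_id
FROM texts t
LEFT JOIN text_pairs tp
  ON t.text_id = tp.textA AND tp.textB = 123
WHERE tp.textB IS NULL
ORDER BY t.text_id
"""

# Same logic expressed with NOT EXISTS.
not_exists_sql = """
SELECT t.text_id
FROM texts t
WHERE NOT EXISTS (SELECT 1
                  FROM text_pairs tp
                  WHERE t.text_id = tp.textA AND tp.textB = 123)
ORDER BY t.text_id
"""

via_left_join = [row[0] for row in conn.execute(left_join_sql)]
via_not_exists = [row[0] for row in conn.execute(not_exists_sql)]
print(via_left_join, via_not_exists)
```

Text 1 drops out because it has a pair with textB = 123; text 2 (paired only with 200) and text 3 (not paired at all) remain.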

need to query 2 MySQL tables with COUNT(*) condition

I have 2 tables (cycles and merged_cycles). "cycles" has 2 fields I need to target (userid and cycleid) and "merged_cycles" also has 2 targeted fields (cycleid1 and cycleid2). I need to know all cycles.userid that have more than one record in "cycles", so long as the corresponding cycles.cycleid for any matching record does not appear in any record in "merged_cycles" in either merged_cycles.cycleid1 OR merged_cycles.cycleid2. I currently have it working using 2 different queries, but I was curious if it could be done in one. Here's what I have tried so far:
SELECT cycles.cycleid, cycles.userid, cycles.COUNT(*),
merged_cycles.cycleid1, merged_cycles.cycleid2
FROM cycles,merged_cycles
WHERE merged_cycles.cycleid1 != cycles.cycleid && merged_cycles.cycleid2 != cycles.cycleid
GROUP BY cycles.userid
HAVING cycles.count(*) > 1
Thanks for any suggestions!
I think this does what you want:
SELECT c.userid
FROM cycles c
WHERE NOT EXISTS (SELECT 1
FROM merged_cycles mc
WHERE c.cycleid IN (mc.cycleid1, mc.cycleid2)
)
GROUP BY c.userid
HAVING count(*) > 1;
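A quick way to check this kind of query is to run it on a handful of rows. The sketch below does that with Python's sqlite3 module standing in for MySQL; the sample rows are invented for illustration, and the query selects userid together with the count, since the question asks for user ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cycles (cycleid INTEGER PRIMARY KEY, userid INTEGER);
CREATE TABLE merged_cycles (cycleid1 INTEGER, cycleid2 INTEGER);
-- hypothetical data: users 10, 20, 30 and 40
INSERT INTO cycles VALUES (1, 10), (2, 10), (3, 20), (4, 20), (5, 30),
                          (6, 40), (7, 40), (8, 40);
-- cycle 3 appears as cycleid1, cycle 7 appears as cycleid2
INSERT INTO merged_cycles VALUES (3, 99), (50, 7);
""")

sql = """
SELECT c.userid, COUNT(*) AS unmerged_cycles
FROM cycles c
WHERE NOT EXISTS (SELECT 1
                  FROM merged_cycles mc
                  WHERE c.cycleid IN (mc.cycleid1, mc.cycleid2))
GROUP BY c.userid
HAVING COUNT(*) > 1
ORDER BY c.userid
"""
users = conn.execute(sql).fetchall()
print(users)
```

User 20 is filtered out because only one of its cycles is unmerged, and user 30 has a single cycle to begin with.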

Select row if multiple present values are present in another table

I'm doing a search function on a movie database, I want to give the option to search a film with two genres (ie: crime id:6 and adventure id:7)
I basically want to get a row from title if it has genre_id 6 AND 7 present in the title_genre value. Obviously, this query below isn't working (I understand why it's not but I don't know how to fix it).
Any help please?
SELECT * FROM (`title`, `title_genre`)
WHERE `title`.`active` = 1
AND `title`.`media_id` = title_genre.media_id
AND title_genre.genre_id = 6 AND title_genre.genre_id = 7
You can use EXISTS to check whether the other genre_id (7) is also present in title_genre. Using an explicit JOIN also makes the query clearer:
select
t.*,
tg.*
from title t
join title_genre tg on tg.media_id = t.media_id
where
tg.genre_id = 6
and exists(
select 1 from title_genre tg1
where tg1.media_id = t.media_id
and tg1.genre_id = 7
)
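The EXISTS approach can be tried on a few rows as follows. This is a sketch using Python's sqlite3 module in place of MySQL, with invented titles; the active = 1 filter is carried over from the question. The second query is an alternative technique (GROUP BY with HAVING COUNT(DISTINCT ...)) that was not part of the answer above; it is included only for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title (media_id INTEGER PRIMARY KEY, name TEXT, active INTEGER);
CREATE TABLE title_genre (media_id INTEGER, genre_id INTEGER);
-- hypothetical titles; 6 = crime, 7 = adventure as in the question
INSERT INTO title VALUES (1, 'Heist Trail', 1), (2, 'City Crime', 1),
                         (3, 'Lost Island', 1), (4, 'Old Caper', 0);
INSERT INTO title_genre VALUES (1, 6), (1, 7), (2, 6), (3, 7), (3, 8), (4, 6), (4, 7);
""")

# Join on genre 6, then require that a genre 7 row exists for the same title.
exists_sql = """
SELECT t.media_id
FROM title t
JOIN title_genre tg ON tg.media_id = t.media_id
WHERE t.active = 1
  AND tg.genre_id = 6
  AND EXISTS (SELECT 1
              FROM title_genre tg1
              WHERE tg1.media_id = t.media_id
                AND tg1.genre_id = 7)
ORDER BY t.media_id
"""

# Alternative (not from the answer): count how many of the wanted genres each title has.
having_sql = """
SELECT t.media_id
FROM title t
JOIN title_genre tg ON tg.media_id = t.media_id
WHERE t.active = 1
  AND tg.genre_id IN (6, 7)
GROUP BY t.media_id
HAVING COUNT(DISTINCT tg.genre_id) = 2
ORDER BY t.media_id
"""

via_exists = [row[0] for row in conn.execute(exists_sql)]
via_having = [row[0] for row in conn.execute(having_sql)]
print(via_exists, via_having)
```

Only title 1 is active and carries both genres, so both queries return it alone.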

how to search for a given sequence of rows within a table in SQL Server 2008

The problem:
We have a number of entries within a table but we are only interested in the ones that appear in a given sequence. For example we are looking for three specific "GFTitle" entries ('Pearson Grafton','Woolworths (P and O)','QRX - Brisbane'), however they have to appear in a particular order to be considered a valid route. (See image below)
RowNum GFTitle
------------------------------
1 Pearson Grafton
2 Woolworths (P and O)
3 QRX - Brisbane
4 Pearson Grafton
5 Woolworths (P and O)
6 Pearson Grafton
7 QRX - Brisbane
8 Pearson Grafton
9 Pearson Grafton
So rows (1,2,3) satisfy this rule but rows (4,5,6) don't even though the first two entries (4,5) do.
I am sure there is a way to do this via CTE's but some help would be great.
Cheers
This is very simple even with good old tools :-) Try this quick-and-dirty solution, assuming your table is named GFTitles and the RowNum values are sequential:
SELECT a.[RowNum]
,a.[GFTitle]
,b.[GFTitle]
,c.[GFTitle]
FROM [dbo].[GFTitles] as a
join [dbo].[GFTitles] as b on b.RowNum = a.RowNum + 1
join [dbo].[GFTitles] as c on c.RowNum = a.RowNum + 2
WHERE a.[GFTitle] = 'Pearson Grafton' and
b.[GFTitle] = 'Woolworths (P and O)' and
c.[GFTitle] = 'QRX - Brisbane'
Assuming RowNum has neither duplicates nor gaps, you could try the following method.
Assign row numbers to the sought sequence's items and join the row set to your table on GFTitle.
For every match, calculate the difference between your table's row number and that of the sequence. If there's a matching sequence in your table, the corresponding rows' RowNum differences will be identical.
Count the rows per difference and return only those where the count matches the number of sequence items.
Here's a query that implements the above logic:
WITH SoughtSequence AS (
SELECT *
FROM (
VALUES
(1, 'Pearson Grafton'),
(2, 'Woolworths (P and O)'),
(3, 'QRX - Brisbane')
) x (RowNum, GFTitle)
)
, joined AS (
SELECT
t.*,
SequenceLength = COUNT(*) OVER (PARTITION BY t.RowNum - ss.RowNum)
FROM atable t
INNER JOIN SoughtSequence ss
ON t.GFTitle = ss.GFTitle
)
SELECT
RowNum,
GFTitle
FROM joined
WHERE SequenceLength = (SELECT COUNT(*) FROM SoughtSequence)
;
You can try it at SQL Fiddle too.
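The row-number-difference idea can also be checked without SQL Server. The sketch below loads the nine sample rows from the question into an in-memory SQLite database via Python's sqlite3 module. It swaps the windowed COUNT(*) OVER for a plain GROUP BY ... HAVING subquery, which is an assumption made to keep the example portable, and the table name sought is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (RowNum INTEGER PRIMARY KEY, GFTitle TEXT)")
titles = [
    "Pearson Grafton", "Woolworths (P and O)", "QRX - Brisbane",
    "Pearson Grafton", "Woolworths (P and O)", "Pearson Grafton",
    "QRX - Brisbane", "Pearson Grafton", "Pearson Grafton",
]
conn.executemany("INSERT INTO atable VALUES (?, ?)", list(enumerate(titles, start=1)))

# The sequence we are looking for, numbered from 1 (hypothetical helper table).
conn.execute("CREATE TABLE sought (RowNum INTEGER PRIMARY KEY, GFTitle TEXT)")
sought_titles = ["Pearson Grafton", "Woolworths (P and O)", "QRX - Brisbane"]
conn.executemany("INSERT INTO sought VALUES (?, ?)",
                 list(enumerate(sought_titles, start=1)))

# Rows of a complete match share the same (table RowNum - sequence RowNum) difference,
# so keep only the differences that occur once per sequence item.
sql = """
SELECT t.RowNum, t.GFTitle
FROM atable t
JOIN sought s ON t.GFTitle = s.GFTitle
WHERE t.RowNum - s.RowNum IN (
    SELECT t2.RowNum - s2.RowNum
    FROM atable t2
    JOIN sought s2 ON t2.GFTitle = s2.GFTitle
    GROUP BY t2.RowNum - s2.RowNum
    HAVING COUNT(*) = (SELECT COUNT(*) FROM sought)
)
ORDER BY t.RowNum
"""
matched_rows = conn.execute(sql).fetchall()
for row in matched_rows:
    print(row)
```

On the sample data only the difference 0 is shared by all three sequence items, so rows 1 to 3 come back and the partial match at rows 4 and 5 is rejected.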

Need Help streamlining a SQL query to avoid redundant math operations in the WHERE and SELECT

*Hey everyone, I am working on a query and am unsure how to make it process as quickly as possible and with as little redundancy as possible. I am really hoping someone there can help me come up with a good way of doing this.
Thanks in advance for the help!*
Okay, so here is what I have as best I can explain it. I have simplified the tables and math to just get across what I am trying to understand.
Basically I have a smallish table that never changes and will always only have 50k records like this:
Values_Table
ID Value1 Value2
1 2 7
2 2 7.2
3 3 7.5
4 33 10
….50000 44 17.2
And a couple tables that constantly change and are rather large, eg a potential of up to 5 million records:
Flags_Table
Index Flag1 Type
1 0 0
2 0 1
3 1 0
4 1 1
….5,000,000 1 1
Users_Table
Index Name ASSOCIATED_ID
1 John 1
2 John 1
3 Paul 3
4 Paul 3
….5,000,000 Richard 2
I need to tie all 3 tables together. The most results that are likely to ever be returned from the small table is somewhere in the neighborhood of 100 results. The large tables are joined on the index and these are then joined to the Values_Table ON Values_Table.ID = Users_Table.ASSOCIATED_ID …. That part is easy enough.
Where it gets tricky for me is that I need to return, as quickly as possible, a list limited to 10 results where Value1 and Value2 are mathematically operated on to return a new_value, where that new_value is less than 10, the result is sorted by that new_value, and any other WHERE conditions I need can be applied to the flags. I do need to be able to move along the limit, e.g. LIMIT 0,10 / 11,10 / 21,10 etc...
In a subsequent (or the same if possible) query I need to get the top 10 count of all types that matched that criteria before the limit was applied.
So for example I want to join all of these and return anything where Value1 + Value2 < 10 AND I also need the count.
So what I want is:
Index Name Flag1 New_Value
1 John 0 9
2 John 0 9
5000000 Richard 1 9.2
The second response would be:
ID (not index) Count
1 2
2 1
I tried this a few ways and ultimately came up with the following somewhat ugly query:
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
ORDER BY New_Value
LIMIT 0,10
And then for the count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
Being able to filter on the different flags and such in my WHERE clause is important. That may sound stupid to comment on, but I mention it because from what I could see a quicker method would have been to use the HAVING clause, and I don't believe that will work in certain instances, depending on what I want my WHERE clause to filter against.
And when filtering using the flags table :
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10 AND Flag1 = 0
ORDER BY New_Value
LIMIT 0,10
...filtered count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10 AND Flag1 = 0
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
That works fine but has to run the math multiple times for each row, and I get the nagging feeling that it is also running the math multiple times on the same row in the Values_table table. My thought was that I should just get only the valid responses from the Values_table first and then join those to the other tables to cut down on the processing; with how SQL optimizes things though I wasn't sure if it might not already be doing that. I know I could use a HAVING clause to only run the math once if I did it that way but I am uncertain how I would then best join things.
My questions are:
Can I avoid running that math twice and still make the query work (or I suppose if there is a good way to make the first one work as well that would be great)
What is the fastest way to do this as this is something that will be running very often.
It seems like this should be painfully simple but I am just missing something stupid.
I contemplated pulling into a temp table then joining that table to itself but that seems like I would trade math for iterations against the table and still end up slow.
Thank you all for your help in this and please let me know if I need to clarify anything here!
** To clarify on a question, I can't use a 3rd column with the values pre-calculated because in reality the math is much more complex than addition, I just simplified it for illustration's sake.
Do you have a benchmark query to compare against? Usually it doesn't work to try to outsmart the optimizer. If you have acceptable performance from a starting query, then you can see where extra work is being expended (indicated by disk reads, cache consumption, etc.) and focus on that.
Avoid the temptation to break it into pieces and solve those. That's an antipattern. That includes temp tables especially.
Redundant math is usually ok - what hurts is disk activity. I've never seen a query that needed CPU work reduction on pure calculations.
Gather your results and put them in a temp table:
CREATE TEMPORARY TABLE TempTable AS
SELECT u.`Index`, u.Name, f.Type, v.ID, f.Flag1, (v.Value1 + v.Value2) AS New_Value
FROM Values_Table v
JOIN Users_Table u ON u.ASSOCIATED_ID = v.ID
JOIN Flags_Table f ON f.`Index` = u.`Index`
-- a column alias cannot be referenced in WHERE, so the expression is repeated here
WHERE (v.Value1 + v.Value2) < 10;
Return the result for the first query (the paging is applied here, so that the count still sees every match):
SELECT `Index`, Name, Flag1, New_Value
FROM TempTable
ORDER BY New_Value
LIMIT 0,10;
Return the counts per ID:
SELECT ID, COUNT(Type)
FROM TempTable
GROUP BY ID;
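As a rough check of the temp-table idea, the sketch below runs it on the sample rows from the question, using Python's sqlite3 module as a stand-in for MySQL. A few things here are assumptions for illustration: the Index columns are renamed idx (INDEX is a reserved word), the table name matched is hypothetical, index 5 stands in for the 5,000,000th row, and the math is written once in a derived table so the filter can refer to New_Value by name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Values_Table (ID INTEGER PRIMARY KEY, Value1 REAL, Value2 REAL);
CREATE TABLE Users_Table (idx INTEGER PRIMARY KEY, Name TEXT, ASSOCIATED_ID INTEGER);
CREATE TABLE Flags_Table (idx INTEGER PRIMARY KEY, Flag1 INTEGER, Type INTEGER);
INSERT INTO Values_Table VALUES (1, 2, 7), (2, 2, 7.2), (3, 3, 7.5), (4, 33, 10);
INSERT INTO Users_Table VALUES (1, 'John', 1), (2, 'John', 1), (3, 'Paul', 3),
                               (4, 'Paul', 3), (5, 'Richard', 2);
INSERT INTO Flags_Table VALUES (1, 0, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1), (5, 1, 1);

-- The expression appears once, inside the derived table v.
CREATE TEMP TABLE matched AS
SELECT u.idx AS idx, u.Name AS Name, f.Flag1 AS Flag1, f.Type AS Type,
       v.ID AS ID, v.New_Value AS New_Value
FROM (SELECT ID, Value1 + Value2 AS New_Value FROM Values_Table) AS v
JOIN Users_Table u ON u.ASSOCIATED_ID = v.ID
JOIN Flags_Table f ON f.idx = u.idx
WHERE v.New_Value < 10;
""")

# First page of results; change OFFSET to move along the limit.
page = conn.execute("""
    SELECT idx, Name, Flag1, New_Value
    FROM matched
    ORDER BY New_Value, idx
    LIMIT 10 OFFSET 0
""").fetchall()

# Counts over every match, independent of the paging above.
counts = conn.execute("""
    SELECT ID, COUNT(*)
    FROM matched
    GROUP BY ID
    ORDER BY ID
""").fetchall()

print(page)
print(counts)
```

Whether this beats the single-query version depends on the data and the engine, so it is worth comparing both against a benchmark before settling on either.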
Is there any chance that you can add a third column to the values_table with the pre-calculated value? Even if the result of your calculation is dependent on other variables, you could run the calculation for the whole table but only when those variables change.