Use string matching to de-dupe results of query - mysql

I have a table with the format:
Id | Loc |
-------|-----|
789-A | 4 |
123 | 1 |
123-BZ | 1 |
123-CG | 2 |
456 | 2 |
456 | 3 |
789 | 4 |
I want to exclude certain rows from the result of a query based on whether a duplicate exists. In this case, though, the definition of a duplicate row is pretty complex:
If any row returned by the query (let's refer to this hypothetical row as ThisRow) has a counterpart row also contained within the query results where Loc is identical to ThisRow.Loc AND Id is of the form <ThisRow.Id>-<an alphanumeric suffix> then ThisRow should be considered a duplicate and excluded from the query results.
For example, using the table above, SELECT * FROM table should return the result set below:
Id | Loc |
-------|-----|
789-A | 4 |
123-BZ | 1 |
123-CG | 2 |
456 | 2 |
456 | 3 |
I understand how to write the string matching conditional:
ThisRow.Id REGEXP '^PossibleDuplicateRow.Id-[A-Za-z0-9]*'
and the straight comparison of Loc:
ThisRow.Loc = PossibleDuplicateRow.Loc
What I don't understand is how to form these conditionals into a (self-referential?) query.

Here's one way:
SELECT *
FROM myTable t1
WHERE NOT EXISTS
(
SELECT 1
FROM myTable t2
WHERE t2.Loc = t1.Loc
AND t2.Id LIKE CONCAT(t1.Id, '-%')
)
Or, the same query using an anti-join (which should be a little faster):
SELECT t1.*
FROM myTable t1
LEFT OUTER JOIN myTable t2
ON t2.Loc = t1.Loc
AND t2.Id LIKE CONCAT(t1.Id, '-%')
WHERE t2.Id IS NULL
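
To check both forms against the sample data from the question, here is a minimal sketch that runs them through Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here: the `||` operator replaces CONCAT, and the helper name `run_dedupe` is made up for this example.

```python
import sqlite3

# Sample rows from the question: (Id, Loc)
ROWS = [("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
        ("456", 2), ("456", 3), ("789", 4)]

# NOT EXISTS form: keep a row only if no suffixed counterpart shares its Loc.
NOT_EXISTS_SQL = """
SELECT t1.Id, t1.Loc
FROM myTable t1
WHERE NOT EXISTS (
    SELECT 1
    FROM myTable t2
    WHERE t2.Loc = t1.Loc
      AND t2.Id LIKE t1.Id || '-%'
)
"""

# Anti-join form: the LEFT JOIN finds counterparts, IS NULL keeps rows without one.
ANTI_JOIN_SQL = """
SELECT t1.Id, t1.Loc
FROM myTable t1
LEFT OUTER JOIN myTable t2
  ON t2.Loc = t1.Loc
 AND t2.Id LIKE t1.Id || '-%'
WHERE t2.Id IS NULL
"""


def run_dedupe(sql):
    """Load the sample rows into an in-memory table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE myTable (Id TEXT, Loc INTEGER)")
    conn.executemany("INSERT INTO myTable VALUES (?, ?)", ROWS)
    result = sorted(conn.execute(sql).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    print(run_dedupe(NOT_EXISTS_SQL))
    print(run_dedupe(ANTI_JOIN_SQL))
```

Both queries select only t1's columns, so the anti-join does not drag along the all-NULL t2 columns.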

In the example data you give, there are no examples of duplicate locs not being on duplicate rows. For example, you don't have a row "123-AZ, 1", where the prefix row "123, 1" would conflict with two rows.
If this is a real characteristic of the data, then you can eliminate dups without a self join, by using aggregation:
select max(id), loc
from t
group by (case when locate('-', id) = 0 then id
else left(id, locate('-', id) - 1)
end), loc
I offer this because an aggregation should be much faster than a non-equijoin.
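
A rough way to try the aggregation idea, and to see the caveat about a second suffixed row, is the sketch below. It uses Python's sqlite3 as a stand-in for MySQL, so SQLite's instr() and substr() take the place of LOCATE and LEFT; the function name `dedupe_by_aggregation` is invented for the example.

```python
import sqlite3

# Sample rows from the question: (id, loc)
SAMPLE = [("789-A", 4), ("123", 1), ("123-BZ", 1), ("123-CG", 2),
          ("456", 2), ("456", 3), ("789", 4)]

# Group on the part of id before the first '-' (or the whole id if there is none).
# Within a group, MAX(id) picks the suffixed id over the bare prefix.
AGG_SQL = """
SELECT MAX(id) AS id, loc
FROM t
GROUP BY CASE WHEN instr(id, '-') = 0 THEN id
              ELSE substr(id, 1, instr(id, '-') - 1)
         END, loc
"""


def dedupe_by_aggregation(rows):
    """Run the aggregation query over the given (id, loc) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id TEXT, loc INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = sorted(conn.execute(AGG_SQL).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    print(dedupe_by_aggregation(SAMPLE))
    # With a second suffixed row at the same loc, one of the two is lost,
    # which is why this only works if the data never contains such rows.
    print(dedupe_by_aggregation(SAMPLE + [("123-AZ", 1)]))
```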

Related

MySQL Conditionally Inner Join X # of columns

I have two tables, Table1 and Table2.
Based on a value in a column of Table1, can I inner join a certain number of columns from Table2, joining on id?
Table1:
id | col_number |
1 | 2
2 | 3
Table2:
id | col1 | col2 | col3
1 | BRK | GOOG | APPL
2 | AMZN | INTC | TSLA
Expected Outcome, If the query was run for ID1:
id | col_number | col1 | col2
1 | 2 | BRK | GOOG
I haven’t been able to find many examples of conditional inner joins easy enough for me to attempt to understand them. Those I have found are conditional on different tables, not columns.
Fiddle: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=4efaf735c1f28fa8a9e55d77ca30fa71

The select-list of an SQL query must be fixed before the query is parsed and prepared, and that happens before the query begins reading any rows of data. This means you can't make a query that returns a different number of columns depending on the data values in some of the rows it reads.
Also, any query result must have the same number of columns in every row, not a dynamic number of columns.
You could, however, make some of the expressions return NULL in some columns depending on a data value.
SELECT table1.id, table1.col_number,
CASE WHEN table1.col_number >= 1 THEN table2.col1 ELSE NULL END AS col1,
CASE WHEN table1.col_number >= 2 THEN table2.col2 ELSE NULL END AS col2,
CASE WHEN table1.col_number >= 3 THEN table2.col3 ELSE NULL END AS col3
FROM table1 JOIN table2 USING (id);
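
To see what that returns for the sample rows, here is a small sketch using Python's sqlite3 as a stand-in for MySQL (the CASE and USING syntax is the same in both). The function name `masked_columns` is made up for the example, and ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# Same shape as the query above: the column count is fixed at three,
# and columns beyond col_number come back as NULL.
MASK_SQL = """
SELECT table1.id, table1.col_number,
       CASE WHEN table1.col_number >= 1 THEN table2.col1 ELSE NULL END AS col1,
       CASE WHEN table1.col_number >= 2 THEN table2.col2 ELSE NULL END AS col2,
       CASE WHEN table1.col_number >= 3 THEN table2.col3 ELSE NULL END AS col3
FROM table1 JOIN table2 USING (id)
ORDER BY table1.id
"""


def masked_columns():
    """Build the two sample tables from the question and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, col_number INTEGER)")
    conn.execute("CREATE TABLE table2 (id INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 2), (2, 3)])
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)",
                     [(1, "BRK", "GOOG", "APPL"), (2, "AMZN", "INTC", "TSLA")])
    result = conn.execute(MASK_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in masked_columns():
        print(row)  # NULL shows up as None in Python
```

The row for id 1 still has a col3 slot; it is just NULL, so the application has to drop or ignore it.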

Count all records that does not exist to other table - SQL Query

I have two (2) tables and I'm trying to count all records from Table1 and Table1_delta where the pagename from Table1_delta is not yet listed in Table1. In case a pagename from Table1_delta is listed in Table1, its status must be 1 for it to be included in the count.
Sample table structure:
Table1
+-----------+--------+
| pagename | status |
+-----------+--------+
| pagename1 | 2 |
| pagename2 | 1 |
+-----------+--------+
Table1_delta
+-----------+
| pagename |
+-----------+
| pagename1 |
| pagename2 |
| pagename3 |
| pagename4 |
+-----------+
The sample tables should return "3".
pagename3 and pagename4 are not listed in Table1 (that gives 2), and pagename2 in Table1 has status = 1 (that gives 1). In total there are 3 pagenames from Table1_delta that are either not listed in Table1 or listed there with status = 1. How would I write this query? I'm using MySQL v5.6.17. Thanks!

Here is an alternative solution using joins:
SELECT COUNT(*)
FROM Table1_delta t1 LEFT JOIN Table1 t2
ON t1.pagename = t2.pagename
WHERE t2.status IS NULL OR t2.status = 1
Here is what the joined rows look like before the WHERE clause is applied:
+-----------+--------+
| pagename | status |
+-----------+--------+
| pagename1 | 2 | # this row is NOT counted
| pagename2 | 1 | # +1 this row has status = 1 and is counted
| pagename3 | null | # +1 this row has status = null and is counted
| pagename4 | null | # +1 this row is also null and is counted
+-----------+--------+
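
For anyone who wants to reproduce the count locally, here is a minimal sketch using Python's sqlite3 as a stand-in for MySQL (the LEFT JOIN behaves the same way). The function name `count_unlisted_or_active` is invented for this example.

```python
import sqlite3

# Same join and filter as the query above; COUNT(*) is swapped for the
# pagename column so the matching rows can be inspected as well.
MATCH_SQL = """
SELECT t1.pagename
FROM Table1_delta t1 LEFT JOIN Table1 t2
  ON t1.pagename = t2.pagename
WHERE t2.status IS NULL OR t2.status = 1
ORDER BY t1.pagename
"""


def count_unlisted_or_active():
    """Return (count, names) for delta rows that are unlisted or have status 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (pagename TEXT, status INTEGER)")
    conn.execute("CREATE TABLE Table1_delta (pagename TEXT)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?)",
                     [("pagename1", 2), ("pagename2", 1)])
    conn.executemany("INSERT INTO Table1_delta VALUES (?)",
                     [("pagename1",), ("pagename2",), ("pagename3",), ("pagename4",)])
    names = [row[0] for row in conn.execute(MATCH_SQL)]
    conn.close()
    return len(names), names


if __name__ == "__main__":
    print(count_unlisted_or_active())
```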

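Try using joins. Note that this only returns the expected 3 for the sample data because Table1 contains exactly one row with status != 1; with more such rows, each Table1_delta row is counted once per non-matching Table1 row.
Select count(Table1_delta.pagename) from Table1_delta
INNER JOIN Table1 ON
Table1_delta.pagename != Table1.pagename
AND Table1.status != 1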

If I've understood correctly:
SELECT COUNT(*) FROM Table1_delta
WHERE pagename NOT IN
(SELECT pagename FROM Table1 WHERE status <> 1)
Update
As requested in the comments, here's what this query does:
First, the subquery, SELECT pagename FROM Table1 WHERE status <> 1, retrieves the pagename field from those Table1 records whose status is not 1.
So in the example case, it returns a single row, containing pagename1.
Then the main query counts all the records in Table1_delta (SELECT COUNT(*) FROM Table1_delta) whose pagename is not among (WHERE pagename NOT IN (<subquery>)) the values returned by the subquery.
So this matches 3 entries (pagename2, pagename3, pagename4), and that's the count you get.
Historically, sub-queries have been considered slower than joins, but frankly, RDBMSs have come a long way in optimizing queries, and for simple cases like this one it would probably (I haven't measured) be faster. It really depends on the actual case and database, but the SQL is much more self-explanatory than the join version, IMO. Your mileage may vary.

SELECTING SUM based on conditional if from another table

Database: mysql > ver 5.0
table 1: type_id (int), type
table 2: name_id, name, is_same_as = table2.name_id or NULL
table 3: id, table2.name_id, table1.type_id, value (float)
I want to sum and count the values in table 3 where table2.name_id is the same, and also include the values for rows where is_same_as = name_id. I want to select all data in table3 for all values in table2.
Apologies if my question is not very clear, or if it has already been answered; I was unable to find a relevant answer, or I don't know exactly what to look for.
[data]. table1
id | type
=========
1 | test1
2 | test2
[data].table2
name_id | name | is_same_as
==============================
1 | tb_1 | NULL
2 | tb_2 | 1
3 | tb_3 | NULL
4 | tb_4 | 1
[data].table3
id | name_id | type_id | value
======================================
1 | 1 | 1 | 1.5
2 | 2 | 1 | 0.5
3 | 2 | 2 | 1.0
output:
name_id| type_id|SUM(value)
=======================================================
1 | 1 |2.0 < because in table2, is_same_as = 1
2 | 2 |1.0

I think the following does what you want:
select coalesce(t2.is_same_as, t2.name_id) as name_id, t3.type_id, sum(t3.value)
from table3 t3 join
table2 t2
on t3.name_id = t2.name_id
group by coalesce(t2.is_same_as, t2.name_id), t3.type_id
order by 1, 2
It joins the table on name_id. However, it then uses the is_same_as column, if present, or the name_id if not, for summarizing the data.
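
Here is a small sketch of that COALESCE grouping run against the sample data, using Python's sqlite3 as a stand-in for MySQL. The table names table2 and table3 follow the question, and the function name `sum_by_canonical_name` is made up for the example.

```python
import sqlite3

# Rows whose is_same_as is set are folded into the name_id they point to.
SUM_SQL = """
SELECT COALESCE(t2.is_same_as, t2.name_id) AS name_id, t3.type_id, SUM(t3.value)
FROM table3 t3
JOIN table2 t2 ON t3.name_id = t2.name_id
GROUP BY COALESCE(t2.is_same_as, t2.name_id), t3.type_id
ORDER BY 1, 2
"""


def sum_by_canonical_name():
    """Load the sample table2/table3 rows from the question and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table2 (name_id INTEGER, name TEXT, is_same_as INTEGER)")
    conn.execute("CREATE TABLE table3 (id INTEGER, name_id INTEGER, type_id INTEGER, value REAL)")
    conn.executemany("INSERT INTO table2 VALUES (?, ?, ?)",
                     [(1, "tb_1", None), (2, "tb_2", 1), (3, "tb_3", None), (4, "tb_4", 1)])
    conn.executemany("INSERT INTO table3 VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 1.5), (2, 2, 1, 0.5), (3, 2, 2, 1.0)])
    result = conn.execute(SUM_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(sum_by_canonical_name())
```

Note that the type 2 row is reported under name_id 1, not 2, because tb_2 is folded into tb_1.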

This might be what you are looking for (I haven't tested it in MySQL, so there may be a typo; note that the WITH clause requires MySQL 8.0 or later):
with combined_names_tab (name_id, name_id_ref) as
(
select name_id, name_id from table2
union select t2a.name_id, t2b.name_id
from table2 t2a
join table2 t2b
on (t2a.name_id = t2b.is_same_as)
)
select cnt.name_id, t3.type_id, sum(t3.value) sum_val
from combined_names_tab cnt
join table3 t3
on ( cnt.name_id_ref = t3.name_id )
group by cnt.name_id, t3.type_id
having sum(t3.value) / count(t3.value) >= 3
Here's what the query does:
First, it creates 'combined_names_tab', which pairs each table2 row with itself and with every row that points to it through the "is_same_as" column; these are the rows you want to GROUP BY together. I make sure to include the "parent" row by doing a UNION.
Second, once you have those rows, it's simply a join to table3 with a GROUP BY and a SUM.
Note: table1 was unnecessary (I believe).
Let me know if this works!
john...

How to `SELECT` and manufacture missing rows from previous values?

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3)
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence, (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.

What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that ticks+refid is unique in the table.
Select Ticks.tick
, Refs.refid
, Case
When Table.value Is Null
Then (
Select T2.value
From Table As T2
Where T2.refid = Refs.refId
And T2.tick = (
Select Max(T1.tick)
From Table As T1
Where T1.tick < Ticks.tick
And T1.refid = T2.refid
)
)
Else Table.value
End As value
From (
Select Distinct refid
From Table
) As Refs
Cross Join (
Select Distinct tick
From Table
) As Ticks
Left Join Table
On Table.tick = Ticks.tick
And Table.refid = Refs.refid
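
As a quick check of this approach on the question's sample data, here is a sketch using Python's sqlite3 as a stand-in for MySQL. The table name `readings` and the function name `fill_missing` are placeholders made up for the example (TABLE itself is a reserved word), and an ORDER BY is added so the output matches the ordering in the question.

```python
import sqlite3

# Sample rows from the question: (tick, refid, value)
SAMPLE = [(1, 1, 11), (1, 2, 22), (1, 3, 33),
          (2, 1, 1111), (2, 3, 3333), (3, 3, 333333)]

# Every (tick, refid) pair comes from the cross join; when the pair has no
# row, the value is taken from the latest earlier tick for the same refid.
FILL_SQL = """
SELECT Ticks.tick, Refs.refid,
       CASE WHEN d.value IS NULL THEN (
                SELECT T2.value
                FROM readings AS T2
                WHERE T2.refid = Refs.refid
                  AND T2.tick = (SELECT MAX(T1.tick)
                                 FROM readings AS T1
                                 WHERE T1.tick < Ticks.tick
                                   AND T1.refid = T2.refid)
            )
            ELSE d.value
       END AS value
FROM (SELECT DISTINCT refid FROM readings) AS Refs
CROSS JOIN (SELECT DISTINCT tick FROM readings) AS Ticks
LEFT JOIN readings AS d
  ON d.tick = Ticks.tick AND d.refid = Refs.refid
ORDER BY Ticks.tick, Refs.refid
"""


def fill_missing(rows):
    """Return one row per (tick, refid), carrying earlier values forward."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (tick INTEGER, refid INTEGER, value INTEGER)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    result = conn.execute(FILL_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in fill_missing(SAMPLE):
        print(row)
```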

If you know in advance what your 'tick' and 'refid' values are,
Make a helper table that contains all possible tick and refid values.
Then left join from the helper table on tick and refid to your data table.
If you don't know exactly what your 'tick' and 'refid' values are, you maybe could still use this method, but instead of a static helper table, it would have to be dynamically generated.

The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.

To get the list of (tick, refid) pairs to insert, start with the full list of pairs:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
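Now subtract the existing pairs from that query (MINUS is Oracle syntax, which MySQL does not support; there, a LEFT JOIN ... IS NULL or NOT EXISTS filter does the same job):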
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use an inner join plus a left join to pick the most recent previous value, but you could adapt this):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL

How to select all from SQL query including subquery

I want to perform a MySQL query that returns all the rows in a table but has an additional column that marks rows as "selected" if they correspond to a separate query. For example:
| name |
|------|
| a |
| b |
| c |
I want to be able to return the following when I do a query "select * from table where name = 'a';":
| name | selected |
|------|----------|
| a | yes |
| b | |
| c | |
That is, I also want to know which ones were left out.

select *
, case
when exists (
select *
from table2 t2
where t2.field = t1.field
and ...etc.
) then 1
else 0
end as selected
from table t1

The WHERE clause restricts your row set, so you get only the rows with name 'a'.
To get all of them, either join the limited table back to itself or use IF without a WHERE clause:
SELECT *, IF( name='a', 'yes', '') AS selected
FROM table

The one that worked for me was:
select a.*
, name = case
when
a.name = 'a'
then 0
else 1
end
as selected
from points as a
