how to use a like with a join in sql? - mysql

I have 2 tables, say table A and table B and I want to perform a join, but the matching condition has to be where a column from A 'is like' a column from B meaning that anything can come before or after the column in B:
for example: if the column in A is 'foo'. Then the join would match if column in B is either: 'fooblah', 'somethingfooblah', or just 'foo'. I know how to use the wildcards in a standard like statement, but am confused when doing a join. Does this make sense? Thanks.

Using INSTR:
SELECT *
FROM TABLE a
JOIN TABLE b ON INSTR(b.column, a.column) > 0
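Using LIKE (the + concatenation below is SQL Server syntax; in MySQL, + is arithmetic, so use the CONCAT form instead):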
SELECT *
FROM TABLE a
JOIN TABLE b ON b.column LIKE '%'+ a.column +'%'
Using LIKE, with CONCAT:
SELECT *
FROM TABLE a
JOIN TABLE b ON b.column LIKE CONCAT('%', a.column ,'%')
Mind that in all options, you'll probably want to convert the column values to uppercase BEFORE comparing, so that matches don't depend on case:
SELECT *
FROM (SELECT UPPER(a.column) AS ua
FROM TABLE a) a
JOIN (SELECT UPPER(b.column) AS ub
FROM TABLE b) b ON INSTR(b.ub, a.ua) > 0
Which option is most efficient ultimately depends on the EXPLAIN plan output.
For an inner join, putting a condition in the JOIN ... ON clause is equivalent to putting it in the WHERE clause. The JOIN syntax is also referred to as ANSI JOIN syntax because it was standardized. Non-ANSI JOINs look like:
SELECT *
FROM TABLE a,
TABLE b
WHERE INSTR(b.column, a.column) > 0
I'm not going to bother with a Non-ANSI LEFT JOIN example. The benefit of the ANSI JOIN syntax is that it separates what is joining tables together from what is actually happening in the WHERE clause.
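As a rough, runnable check of the INSTR and LIKE variants above, here is a minimal sketch that uses Python's built-in sqlite3 module rather than MySQL. The table names a and b, the column name col, and the sample values are assumptions for illustration, and SQLite builds the pattern with || where MySQL uses CONCAT.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col TEXT);
    CREATE TABLE b (col TEXT);
    INSERT INTO a (col) VALUES ('foo'), ('bar');
    INSERT INTO b (col) VALUES ('fooblah'), ('somethingfooblah'), ('foo'), ('unrelated');
""")

# Substring join with INSTR: matches when a.col occurs anywhere in b.col.
instr_rows = conn.execute(
    "SELECT a.col, b.col FROM a JOIN b ON INSTR(b.col, a.col) > 0 ORDER BY b.col"
).fetchall()

# Same join with LIKE; SQLite concatenates with || instead of CONCAT.
like_rows = conn.execute(
    "SELECT a.col, b.col FROM a JOIN b ON b.col LIKE '%' || a.col || '%' ORDER BY b.col"
).fetchall()

print(instr_rows)
print(like_rows)
```

Both queries should return the same pairs on this data. Case handling and LIKE wildcards inside the joined values (% or _) can make the two differ on real data.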

In MySQL you could try:
SELECT * FROM A INNER JOIN B ON B.MYCOL LIKE CONCAT('%', A.MYCOL, '%');
Of course this would be a massively inefficient query because it would do a full table scan.
Update: Here's a proof
create table A (MYCOL varchar(255));
create table B (MYCOL varchar(255));
insert into A (MYCOL) values ('foo'), ('bar'), ('baz');
insert into B (MYCOL) values ('fooblah'), ('somethingfooblah'), ('foo');
insert into B (MYCOL) values ('barblah'), ('somethingbarblah'), ('bar');
SELECT * FROM A INNER JOIN B ON B.MYCOL LIKE CONCAT('%', A.MYCOL, '%');
+-------+------------------+
| MYCOL | MYCOL            |
+-------+------------------+
| foo   | fooblah          |
| foo   | somethingfooblah |
| foo   | foo              |
| bar   | barblah          |
| bar   | somethingbarblah |
| bar   | bar              |
+-------+------------------+
6 rows in set (0.38 sec)

If this is something you'll need to do often, you may want to denormalize the relationship between tables A and B.
For example, on insert to table B, you could write zero or more entries to a junction table mapping B to A based on the partial match. Similarly, changes to either table could update this association.
This all depends on how frequently tables A and B are modified. If they are fairly static, then taking a hit on INSERT is less painful than repeated hits on SELECT.
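One way the junction-table idea could look is sketched below, using Python's built-in sqlite3 module rather than MySQL. The junction table a_b, the trigger name, and the id columns are hypothetical names chosen for this example, and only inserts into b are handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, mycol TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, mycol TEXT);
    -- Hypothetical junction table holding the precomputed partial matches.
    CREATE TABLE a_b (a_id INTEGER, b_id INTEGER);

    -- On each insert into b, record every a row whose value occurs in the new b value.
    CREATE TRIGGER b_after_insert AFTER INSERT ON b
    BEGIN
        INSERT INTO a_b (a_id, b_id)
        SELECT a.id, NEW.id FROM a WHERE INSTR(NEW.mycol, a.mycol) > 0;
    END;

    INSERT INTO a (mycol) VALUES ('foo'), ('bar');
    INSERT INTO b (mycol) VALUES ('fooblah'), ('barblah'), ('nothing');
""")

# Reads become plain equality joins through the junction table.
pairs = conn.execute("""
    SELECT a.mycol, b.mycol
    FROM a_b
    JOIN a ON a.id = a_b.a_id
    JOIN b ON b.id = a_b.b_id
    ORDER BY b.id
""").fetchall()
print(pairs)
```

A complete version would also need to refresh a_b when rows in a are inserted, or when rows in either table are updated or deleted.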

Using conditional criteria in a join is definitely different than the Where clause. The cardinality between the tables can create differences between Joins and Where clauses.
For example, using a Like condition in an Outer Join will keep all records in the first table listed in the join. Using the same condition in the Where clause will implicitly change the join to an Inner join, because unmatched rows have NULL in the second table's columns and fail the comparison. The record generally has to be present in both tables to satisfy the conditional comparison in the Where clause.
I generally use the style given in one of the prior answers.
SELECT *
FROM tbl_A AS ta
LEFT OUTER JOIN tbl_B AS tb
ON ta.[Desc] LIKE '%' + tb.[Desc] + '%'
This way I can control the join type.
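The difference between putting the LIKE condition in ON and in WHERE can be seen in a small sketch. It uses Python's built-in sqlite3 module instead of SQL Server, so concatenation is || rather than +, and the table contents and the column name descr are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_a (descr TEXT);
    CREATE TABLE tbl_b (descr TEXT);
    INSERT INTO tbl_a VALUES ('red apple'), ('green pear');
    INSERT INTO tbl_b VALUES ('apple');
""")

# Condition in ON: every tbl_a row survives; unmatched rows get NULL for tbl_b.
on_rows = conn.execute("""
    SELECT ta.descr, tb.descr
    FROM tbl_a AS ta
    LEFT OUTER JOIN tbl_b AS tb ON ta.descr LIKE '%' || tb.descr || '%'
    ORDER BY ta.descr
""").fetchall()

# Same condition in WHERE: only rows that satisfy it remain,
# so the result behaves like an inner join.
where_rows = conn.execute("""
    SELECT ta.descr, tb.descr
    FROM tbl_a AS ta
    LEFT OUTER JOIN tbl_b AS tb ON 1 = 1
    WHERE ta.descr LIKE '%' || tb.descr || '%'
    ORDER BY ta.descr
""").fetchall()

print(on_rows)
print(where_rows)
```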

On our server, queries that join with LIKE or INSTR (or CHARINDEX in T-SQL) take too long, so we use LEFT, as in the following structure:
select *
from little
left join big
on left( big.key, len(little.key) ) = little.key
I understand this only works when the variation comes at the end of the value (a prefix match), unlike the other suggestions with '%' + b + '%', but it is enough, and much faster, if you only need b + '%'.
Another way to optimize it for speed (but not memory) is to create a column in "little" that stores len(little.key) as "lenkey" and use that instead in the query above.
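A minimal sketch of the prefix-match join, run with Python's built-in sqlite3 module rather than T-SQL. SQLite has no LEFT() or LEN(), so SUBSTR(s, 1, n) and LENGTH() stand in for them here; the column name k and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE little (k TEXT);
    CREATE TABLE big (k TEXT);
    INSERT INTO little VALUES ('foo'), ('zzz');
    INSERT INTO big VALUES ('fooblah'), ('foo'), ('somethingfoo');
""")

# Compare only the first LENGTH(little.k) characters of big.k with little.k.
prefix_rows = conn.execute("""
    SELECT little.k, big.k
    FROM little
    LEFT JOIN big ON SUBSTR(big.k, 1, LENGTH(little.k)) = little.k
    ORDER BY little.k, big.k
""").fetchall()
print(prefix_rows)
```

Note that 'somethingfoo' is not matched, since the value only appears at the end of it, and 'zzz' is kept with no partner because of the LEFT JOIN.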

Related

Is it possible to inverse the select statement in SQL?

When I want to select all columns except foo and bar, what I normally do is explicitly list all the other columns in the select statement.
select a, b, c, d, ... from ...
But if the table has dozens of columns, this is a tedious process for a simple goal. What I would like to do instead is something like the following pseudo statement:
select * except(foo, bar) from ...
I would also like to know if there is a function to filter out rows from a multi-column result when multiple rows have the same content in all corresponding columns. In other words, duplicate rows would be filtered out.
------------------------
A | B | C
------------------------   ====>   ------------------------
A | B | C                          A | B | C
------------------------           ------------------------
You can query INFORMATION_SCHEMA db and get the list of columns (except two) for that table, e.g.:
SELECT REPLACE(GROUP_CONCAT(COLUMN_NAME), '<foo,bar>,', '')
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = '<your_table>' AND TABLE_SCHEMA = '<database>';
Once you get the list of columns, you can use that in your select query.
You can create a view based on this table with all columns except these two, and then use this view every time with
select * from view
A simple GROUP BY on all columns will remove such duplicates. There are other options as well: DISTINCT and ROW_NUMBER.
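Putting the two ideas together (look up the column list from the catalog, then select it with duplicates removed), a rough sketch could look like this. It uses Python's built-in sqlite3 module, where PRAGMA table_info plays the role that INFORMATION_SCHEMA.COLUMNS plays in MySQL; the table t and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a TEXT, b TEXT, foo TEXT, bar TEXT);
    INSERT INTO t VALUES ('A', 'B', 'x', 'y'), ('A', 'B', 'p', 'q');
""")

excluded = {"foo", "bar"}
# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(t)") if row[1] not in excluded]

# The names come from the database catalog, not from user input,
# so quoting and interpolating them is acceptable in this sketch.
column_list = ", ".join(f'"{name}"' for name in columns)
rows = conn.execute(f"SELECT DISTINCT {column_list} FROM t").fetchall()

print(columns)
print(rows)
```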
select * except(foo, bar) from
This is a frequently requested feature on SO. However, it has not made it to the SQL Standard and I don't know of any SQL products that support it. I guess when the product managers ask their developers, MVPs, usergroups, etc to measure enthusiasm for this prospective feature, they mostly hear, "SELECT * FROM is considered dangerous, we need to protect new users who don't know what they are doing, etc."
You may find it useful to use NATURAL JOIN rather than INNER JOIN etc which removes what would be duplicated columns from the resulting table expression e.g.
SELECT *
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.foo = t2.foo
AND t1.bar = t2.bar;
will result in two columns named foo and two named bar (and possibly other repeated names), probably de-duplicated in some way e.g. by suffixing the range variable names t1 and t2 that INNER JOIN forced you into using.
Whereas:
SELECT *
FROM Table1 NATURAL JOIN Table2;
doesn't require the use of range variables (a good thing) because there will only be one column named foo and one named bar in the result.
And to remove duplicated rows as well as columns, change the implied SELECT ALL * into the explicit SELECT DISTINCT *, e.g.
SELECT DISTINCT *
FROM Table1 NATURAL JOIN Table2;
Doing this may reduce your need for the SELECT ALL BUT { these columns } feature you desire.
Of course, if you do that you will be told, "NATURAL JOIN is considered dangerous, we need to protect you from yourself in case you don't know what you are doing, etc." :)
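To see the column difference concretely, here is a small sketch using Python's built-in sqlite3 module (an assumption; the answer above is not tied to a specific product). The extra columns x and y and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (foo INTEGER, bar INTEGER, x TEXT);
    CREATE TABLE Table2 (foo INTEGER, bar INTEGER, y TEXT);
    INSERT INTO Table1 VALUES (1, 2, 'left');
    INSERT INTO Table2 VALUES (1, 2, 'right'), (9, 9, 'other');
""")

# INNER JOIN with SELECT *: the join columns come back from both tables.
inner = conn.execute(
    "SELECT * FROM Table1 t1 INNER JOIN Table2 t2 ON t1.foo = t2.foo AND t1.bar = t2.bar"
)
inner_columns = [d[0] for d in inner.description]

# NATURAL JOIN with SELECT DISTINCT *: each common column appears once.
natural = conn.execute("SELECT DISTINCT * FROM Table1 NATURAL JOIN Table2")
natural_columns = [d[0] for d in natural.description]
natural_rows = natural.fetchall()

print(len(inner_columns), inner_columns)
print(natural_columns, natural_rows)
```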

SQL: Remove records from table A that don't exist in table B based on two fields

this is probably something simple but I can't wrap my head around it. I've tried IN, NOT EXISTS, EXCEPT, etc... and still can't seem to get this right.
I have two tables.
Table A
-----------
BK
NUM
Table B
------------
BK
NUM
How do I write a query to remove all records from table A, that are not in table B based on the two fields. So if Table A has a record where BK = 1 and NUM = 2, then it should look in table B. If table B also has a record where BK = 1 and NUM = 2 then do nothing, but if not, delete that record from table A. Does that make sense?
Any help is much appreciated.
You can do so
delete from tablea
where (BK,NUM) not in
(select BK,NUM from tableb)
Or, using NOT EXISTS:
delete from tablea a
where not exists
(select 1 from tableb where BK=a.BK and NUM = a.NUM)
Another alternative is to use an anti-join pattern, a LEFT [OUTER] JOIN and then a predicate in the WHERE clause that filters out all matches.
It's easiest to write this as a SELECT first, test it, and then convert to a DELETE.
SELECT t.*
FROM tablea t
LEFT
JOIN tableb s
ON s.BK = t.BK
AND s.NUM = t.NUM
WHERE s.BK IS NULL
The LEFT JOIN returns all rows from t along with matching rows from s. The "trick" is the predicate in the WHERE clause... we know that s.BK will be non-NULL on all matching rows (because the value had to satisfy an equality comparison, in a predicate in the ON clause). So s.BK will be NULL only for rows in t that didn't have a matching row in s.
For MySQL, changing that into a DELETE statement is easy: just replace the SELECT keyword with DELETE. (We could write either DELETE t or DELETE t.*; either of those will work.)
(This is an illustration of only one (of several) possible approaches.)
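The "SELECT first, then DELETE" workflow can be tried out with a small sketch. It runs on Python's built-in sqlite3 module, which has no multi-table DELETE, so the delete step here falls back to NOT EXISTS instead of the MySQL DELETE t ... LEFT JOIN form; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (BK INTEGER, NUM INTEGER);
    CREATE TABLE tableb (BK INTEGER, NUM INTEGER);
    INSERT INTO tablea VALUES (1, 2), (1, 3), (2, 2);
    INSERT INTO tableb VALUES (1, 2);
""")

# Step 1: preview the rows that would be deleted, using the anti-join.
to_delete = conn.execute("""
    SELECT t.BK, t.NUM
    FROM tablea t
    LEFT JOIN tableb s ON s.BK = t.BK AND s.NUM = t.NUM
    WHERE s.BK IS NULL
    ORDER BY t.BK, t.NUM
""").fetchall()

# Step 2: delete the same rows (NOT EXISTS, since SQLite cannot DELETE through a join).
conn.execute("""
    DELETE FROM tablea
    WHERE NOT EXISTS (
        SELECT 1 FROM tableb s WHERE s.BK = tablea.BK AND s.NUM = tablea.NUM
    )
""")
remaining = conn.execute("SELECT BK, NUM FROM tablea").fetchall()

print(to_delete)
print(remaining)
```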

MySQL JOIN tables with WHERE clause

I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but i'll admit, I am no mysql veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that you can't use subqueries like this in a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you, since you are relatively new to SQL. Use aliases for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It makes your life easier, and helps anyone AFTER you know which columns come from which table, especially when the same column name appears in different tables; it prevents ambiguity in the query. Your left join, judging from the sample, may be ambiguous; can you confirm the join of B.ID to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry could carry the "Team_ID" the posting came from, in addition to its OWN "ID" as the blabbing table's unique key.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying the $team_array, you would have to build out the full list as expected, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, use the SQL keyword OR rather than the logical operator "||".
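Building out the IN list does not have to mean pasting values into the SQL string. A hedged sketch of the placeholder approach is below, in Python with the built-in sqlite3 module rather than PHP and MySQL; team_ids is a hypothetical stand-in for $team_array, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team_blabbing (id INTEGER, team_id INTEGER);
    INSERT INTO team_blabbing VALUES (1, 1), (2, 4), (3, 7), (4, 18);
""")

team_ids = [1, 4, 18]  # hypothetical stand-in for $team_array

# One "?" per value; the values themselves are passed separately as parameters.
placeholders = ", ".join("?" for _ in team_ids)
sql = f"SELECT id FROM team_blabbing WHERE team_id IN ({placeholders}) ORDER BY id"
matching_ids = [row[0] for row in conn.execute(sql, team_ids)]

print(sql)
print(matching_ids)
```

An empty list would produce IN (), which MySQL rejects, so a real implementation should skip that condition when there are no team IDs.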
EDIT -- per your comment
This could be done in a variety of ways, such as building and executing dynamic SQL, calling the query multiple times (once for each ID) and merging the results, or joining to yet another temp table that gets cleaned out, say, daily.
If you have another table such as "TeamJoins" with, say, 3 columns (a date, a sessionid and a team_id), you could purge anything more than a day old daily, and/or clear the rows for a session ID each time that session (as it appears coming from PHP) runs a new query. Have two indexes: one on the date (to simplify any daily purging), and a second on (sessionID, team_id) for the join.
Then, loop through to do inserts into the "TeamJoins" table with the simple elements identified.
THEN, instead of a hard-coded list IN, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TJ.Team_ID IS NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is this:
I added an extra column called team_id to my blabbing table and set it to null, and added another field called mem_id to my team_blabbing table.
Then I changed the insert script to also insert a value to the mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.

MySQL Select based on count of substring in a column?

Let's say I have two tables (I'm trying to remove everything irrelevant to the question from the tables and make some sample ones, so bear with me :)
___________________           ________________________
|File             |           |Content               |
|_________________|           |______________________|
|ID Primary Key   | 1       * |ID Primary Key        |
|URL Varchar(255) |-----------|FileID Foreign Key    |
|_________________|           |  ref File(ID)        |
                              |FileContent Text      |
                              |______________________|
A File has a url. There may be many Content items corresponding to each File.
I need to create a query using these tables that I'm having some trouble with. I essentially want the query, in simple terms, to say:
"Select the file URL and the sum of the times substring "X" appears in all content entries associated with that file."
I'm pretty good with SQL selects, but I'm not so good with aggregate functions and it's letting me down. Any help is greatly appreciated :)
The query won't be efficient but might give you a hint:
SELECT url, cnt
FROM (
SELECT
f.id,
IFNULL(
SUM(
(LENGTH(c.filecontent) - LENGTH(REPLACE(c.filecontent, f.url, '')))/LENGTH(f.url)
),
0
) as cnt
FROM file f
JOIN content c ON f.id = c.fileid
GROUP BY f.id
) cnts JOIN file USING(id);
To include files that do not have a match in the content table, you can UNION ALL the rest or use a LEFT JOIN in the cnts subquery.
This solution uses REGEXP to match the substring. REGEXP returns 1 if a row matches and 0 if not, so SUM() adds up the number of matching Content rows (not the number of occurrences within each row). REGEXP might seem like overkill, but it allows for more complicated matching than a simple substring. COALESCE turns the NULL produced for files without any Content rows into 0.
SELECT
File.ID,
File.URL,
COALESCE(SUM(Content.FileContent REGEXP 'substring'), 0) AS numSubStrs
FROM File LEFT JOIN Content ON File.ID = Content.FileID
GROUP BY File.ID, File.URL;
The easier method, if a more complex match pattern won't ever be needed, uses LIKE and COUNT() instead of SUM(). The LIKE condition goes in the ON clause, because putting it in WHERE would discard the files that have no matching content:
SELECT
File.ID,
File.URL,
COUNT(Content.ID) AS numSubStrs
FROM File LEFT JOIN Content
ON File.ID = Content.FileID
AND Content.FileContent LIKE '%substring%'
GROUP BY File.ID, File.URL;
Note the use of LEFT JOIN, which should produce 0 when there are not actually any matching entries in Content.
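If the goal is the total number of occurrences rather than the number of matching rows, the LENGTH/REPLACE arithmetic can be combined with a LEFT JOIN. Below is a minimal sketch on Python's built-in sqlite3 module (not MySQL); the search string in needle and the sample rows are assumptions, and overlapping occurrences are not counted separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE File (ID INTEGER PRIMARY KEY, URL TEXT);
    CREATE TABLE Content (ID INTEGER PRIMARY KEY, FileID INTEGER, FileContent TEXT);
    INSERT INTO File VALUES (1, 'a.html'), (2, 'b.html');
    INSERT INTO Content (FileID, FileContent)
    VALUES (1, 'X marks X'), (1, 'no match'), (1, 'X');
""")

needle = "X"  # hypothetical search string

# Each occurrence removed by REPLACE shortens the text by LENGTH(needle) characters,
# so the length difference divided by LENGTH(needle) is the occurrence count.
counts = conn.execute("""
    SELECT f.URL,
           COALESCE(SUM((LENGTH(c.FileContent)
                         - LENGTH(REPLACE(c.FileContent, ?, ''))) / LENGTH(?)), 0)
    FROM File f
    LEFT JOIN Content c ON c.FileID = f.ID
    GROUP BY f.ID, f.URL
    ORDER BY f.ID
""", (needle, needle)).fetchall()

print(counts)
```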

merging tables which consist of 17 million records

I have 3 tables: 2 tables with 200 000 records each and another table with 1 800 000 records. I merge these 3 tables using 2 constraints, OCN and TIMESTAMP (month, year). The first two tables have a single column, Monthx, that holds the month, date and year together; the other table has separate columns for month and year. I gave the query as:
mysql--> insert into trail
select * from A,B,C
where A.OCN=B.OCN
and B.OCN=C.OCN
and C.OCN=A.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(A.Monthx)=year(B.Monthx)
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
I started this query 4 days ago and it is still running. Could you tell me whether this query is correct or wrong, and provide me an exact query? (I used '%b' because my C table has a column which has months in the form JAN, MAR.)
Please don't use implicit where joins, bury it in 1989, where it belongs. Use explicit joins instead
select * from a inner join b on (a.ocn = b.ocn and
date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b') ....
This select part of the query (had to rewrite it because I refuse to deal with '89 syntax)
select * from A
inner join B on (
A.OCN=B.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and year(A.Monthx)=year(B.Monthx)
)
inner join C on (
C.OCN=A.OCN
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
)
has a lot of problems:
Using a function on a field kills any opportunity to use an index on that field.
You are doing a lot of duplicate tests: if (A = B) and (B = C), then it logically follows that (A = C).
The translations of the date fields take a lot of time.
I would suggest you rewrite your tables to use fields that don't need translating (using functions), but can be compared directly.
A field like yearmonth : char(6) e.g. 201006 can be indexed and compared much faster.
If tables A, B and C each have a field called ym for short, then your query can be:
INSERT INTO TRAIL
SELECT a.*, b.*, c.* FROM a
INNER JOIN b ON (
a.ocn = b.ocn
AND a.ym = b.ym
)
INNER JOIN c ON (
a.ocn = c.ocn
AND a.ym = c.ym
);
If you put indexes on ocn (primary index probably) and ym the query should run about a million rows a second (or more).
To test if your query is ok, import a small subset of records from A, B and C to a temporary database and test it there.
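A small-scale sketch of the ym idea, run on Python's built-in sqlite3 module rather than MySQL: the key is computed once per row and indexed, so the join compares stored values directly. The index names, the ISO date strings, and the use of only two tables are simplifications for this example; in MySQL the key could be derived with DATE_FORMAT(Monthx, '%Y%m').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (ocn INTEGER, monthx TEXT, ym TEXT);
    CREATE TABLE b (ocn INTEGER, monthx TEXT, ym TEXT);
    INSERT INTO a (ocn, monthx) VALUES (1, '2010-06-15'), (2, '2010-07-01');
    INSERT INTO b (ocn, monthx) VALUES (1, '2010-06-30'), (2, '2010-08-01');

    -- Compute the comparison key once, instead of calling a function inside the join.
    UPDATE a SET ym = strftime('%Y%m', monthx);
    UPDATE b SET ym = strftime('%Y%m', monthx);

    CREATE INDEX idx_a_ocn_ym ON a (ocn, ym);
    CREATE INDEX idx_b_ocn_ym ON b (ocn, ym);
""")

# The join now compares plain stored columns, which the indexes can serve.
joined = conn.execute("""
    SELECT a.ocn, a.ym
    FROM a
    INNER JOIN b ON a.ocn = b.ocn AND a.ym = b.ym
""").fetchall()
print(joined)
```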
You have redundancies in your implicit JOIN because you are joining A.OCN with B.OCN, B.OCN with C.OCN and then C.OCN with A.OCN; one of those can be deleted. If A.OCN = B.OCN and B.OCN = C.OCN, then A.OCN = C.OCN is implied. Further, I guess you have redundancies in your date comparisons.