SQL operator IN returns only DISTINCT - mysql

I have the following query:
SELECT class, subclass ,weight
FROM classes
WHERE classes.term in ('this','paper','present','this','and','this','this')
The above query returns only distinct values. For example I have the following table:
+-----------------------------------+
|class | subclass | term | weight |
+-----------------------------------+
| a | b | this | 3 |
| c | d | paper | 2 |
| e | f | sth | 1 |
+-----------------------------------+
the result I will get is
+-----------------------------------+
|class | subclass | term | weight |
+-----------------------------------+
| a | b | this | 3 |
| c | d | paper | 2 |
+-----------------------------------+
what I actually wanted is the following
+-----------------------------------+
|class | subclass | term | weight |
+-----------------------------------+
| a | b | this | 3 |
| a | b | this | 3 |
| a | b | this | 3 |
| a | b | this | 3 |
| c | d | paper | 2 |
+-----------------------------------+
Is there any other way to get all the results without IN "cutting" the list down to distinct values?
The problem is that I cannot change that part: ('this','paper','present','this','and','this','this')
because it is not created by a query. It is a string of words I want to search.
Edit:
- In the original scenario the table contains more than 3000 different words and the actual string is generated by a function I do not have
rights to access and contains 300+ words with many duplicates.
- In the original scenario I want to add the weight of the word every
time it appears in the string
Edit2:
The result I expect is to sum the weights every time a term appears in string.
Expecting results like the following:
+-----------------------------------+
|class | subclass | term | weight |
+-----------------------------------+
| a | b | this | 12 |
| c | d | paper | 2 |
+-----------------------------------+
Is there any other solution?

Use a join:
select c.*
from (select 'this' as term union all
select 'paper' as term union all
select 'present' as term union all
select 'this' as term union all
select 'and' as term union all
select 'this' as term union all
select 'this' as term
) terms join
classes c
on c.term = terms.term;
This will work in both MySQL and SQLite.
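A minimal sketch of the same join, extended with SUM() and GROUP BY to produce the totals asked for in Edit2. It assumes an in-memory SQLite database driven from Python's standard sqlite3 module as a stand-in for the MySQL table, and it builds the UNION ALL list from the word list instead of typing it by hand. The names words, terms_sql and totals are illustrative only.

```python
import sqlite3

# In-memory SQLite database standing in for the asker's MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE classes (class TEXT, subclass TEXT, term TEXT, weight INTEGER)"
)
conn.executemany(
    "INSERT INTO classes VALUES (?, ?, ?, ?)",
    [("a", "b", "this", 3), ("c", "d", "paper", 2), ("e", "f", "sth", 1)],
)

# The search words, duplicates included, exactly as they arrive.
words = ["this", "paper", "present", "this", "and", "this", "this"]

# One "SELECT ? AS term" per word, glued with UNION ALL so that
# duplicates survive (a plain UNION would remove them).
terms_sql = " UNION ALL ".join("SELECT ? AS term" for _ in words)

query = f"""
    SELECT c.class, c.subclass, c.term, SUM(c.weight) AS weight
    FROM ({terms_sql}) AS terms
    JOIN classes AS c ON c.term = terms.term
    GROUP BY c.class, c.subclass, c.term
    ORDER BY c.term DESC
"""
totals = conn.execute(query, words).fetchall()
print(totals)  # [('a', 'b', 'this', 12), ('c', 'd', 'paper', 2)]
```

Each occurrence of a word in the list becomes one row of the derived table, so the join repeats the matching classes row once per occurrence and SUM() adds the weight each time.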

For reference, see this question on how to count the number of occurrences in a substring:
SELECT m.*, (LENGTH('this paper present this and this this') - LENGTH(REPLACE('this paper present this and this this', term, ''))) / LENGTH(term) AS count
FROM myTable m;
Once you have the number of occurrences for each string, you can multiply that value by the weight to get the total, like this:
SELECT term, weight * (LENGTH('this paper present this and this this') - LENGTH(REPLACE('this paper present this and this this', term, ''))) / LENGTH(term) AS totalWeight
FROM myTable m;
Note that this solution does not take a separated list of words, but concatenates that list into one string.
Here is an SQL Fiddle example for you.
EDIT
If you want the sum of weights for all terms in the string, without regard to the terms themselves, you can just adjust the query to use the SUM() function, and don't use GROUP BY because you want to sum for the whole table:
SELECT SUM(weight * (LENGTH('this paper present this and this this') - LENGTH(REPLACE('this paper present this and this this', term, ''))) / LENGTH(term)) AS totalWeight
FROM myTable m;
EDIT 2
A little more explanation for the query based on lengths. You can break it up into multiple parts:
LENGTH('this paper present this and this this') returns the number of characters in the string you are searching
LENGTH(REPLACE(myString, term, '')) is the length of the string above with your term removed. For 'this', that is the total length of 37 minus 16 (4 characters for each of the 4 occurrences), which gives 21.
By subtracting the second value from the first, you get the number of characters in the overall string that belong to your term (37 - 21 = 16).
Then it divides that by the length of term to get the number of occurrences: 16 characters divided by 4 characters per occurrence means the substring occurred 4 times (16 / 4 = 4). Try these steps again with 'paper' and you will see.
The above procedure is illustrated step by step in this SQL Fiddle.
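The length arithmetic can also be checked locally. The sketch below assumes SQLite (via Python's sqlite3) as a stand-in for MySQL, and the column aliases full_len, len_without_term, occurrences and total_weight are made up for illustration. One caveat worth keeping in mind: REPLACE() works on substrings, so a short term such as 'is' would also be counted inside 'this'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (term TEXT, weight INTEGER)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [("this", 3), ("paper", 2), ("sth", 1)],
)

text = "this paper present this and this this"

# Each column is one step of the explanation: full length, length with
# the term removed, occurrences, and occurrences multiplied by weight.
rows = conn.execute(
    """
    SELECT term,
           LENGTH(:s) AS full_len,
           LENGTH(REPLACE(:s, term, '')) AS len_without_term,
           (LENGTH(:s) - LENGTH(REPLACE(:s, term, ''))) / LENGTH(term)
               AS occurrences,
           weight * ((LENGTH(:s) - LENGTH(REPLACE(:s, term, ''))) / LENGTH(term))
               AS total_weight
    FROM myTable
    ORDER BY rowid
    """,
    {"s": text},
).fetchall()

for row in rows:
    print(row)
# ('this', 37, 21, 4, 12)
# ('paper', 37, 32, 1, 2)
# ('sth', 37, 37, 0, 0)
```

SQLite performs integer division here, whereas MySQL's / operator returns a decimal such as 4.0000; the counts themselves are the same.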

Related

How to find data based on comma separated parameter in comma separated data in my SQL query

We have below data,
plant table
----------------------------
| name | classification |
| A | 1,4,7 |
| B | 2,3,7 |
| C | 3,4,9,8 |
| D | 1,5,6,9 |
Now the front end will send multiple parameters, like "4,9",
and the objective output should be like this
plant table
---------------------------
| name | classification |
| A | 1,4,7 |
| C | 3,4,9,8 |
| D | 1,5,6,9 |
We already tried FIND_IN_SET, but it only lets us match one parameter at a time:
select * from plant o where find_in_set('4',classification ) <> 0
Another solution is to run multiple queries: if the parameter is "4,9", we loop and run the query twice, once with 4 and once with 9. But that consumes a lot of resources, since the table has 10000+ rows and there can be more than 5 parameters.
We know the table design is bad practice, but we cannot change it because the table belongs to a third party.
Any solution or any insight will be appreciated,
Thank you
Schema (MySQL v8.0)
CREATE TABLE broken_table (name CHAR(12) PRIMARY KEY,classification VARCHAR(12));
INSERT INTO broken_table VALUES
('A','1,4,7'),
('B','2,3,7'),
('C','3,4,9,8'),
('D','1,5,6,9');
Query #1
WITH RECURSIVE cte (n) AS
(
SELECT 1
UNION ALL
SELECT n + 1 FROM cte WHERE n < 5
)
SELECT DISTINCT x.name, x.classification FROM broken_table x JOIN cte
WHERE SUBSTRING_INDEX(SUBSTRING_INDEX(classification,',',n),',',-1) IN (4,9);
| name | classification |
| ---- | -------------- |
| A    | 1,4,7          |
| C    | 3,4,9,8        |
| D    | 1,5,6,9        |
EDIT:
or, for older versions...
SELECT DISTINCT x.name, x.classification FROM broken_table x JOIN
(
SELECT 1 n UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5
) cte
WHERE SUBSTRING_INDEX(SUBSTRING_INDEX(classification,',',n),',',-1) IN (4,9)
Let's just avoid the CSV altogether and fix your table design:
plant table
----------------------------
| name | classification |
| A | 1 |
| A | 4 |
| A | 7 |
| B | 2 |
| B | 3 |
| B | 7 |
| ... | ... |
Now with this design, you may use the following statement:
SELECT *
FROM plant
WHERE classification IN (?);
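Bind the values you want to match (e.g. 4 and 9) to the placeholder. A single ? normally binds a single value, so unless your driver or framework expands collections for you, generate one placeholder per value: IN (?, ?).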
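As a rough illustration of the normalized design, the sketch below loads the four plants as one row per classification and generates one placeholder per requested value. It assumes SQLite through Python's sqlite3 module in place of MySQL, and the variable names (plants, wanted, placeholders, matches) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plant (name TEXT, classification INTEGER)")

# One row per (plant, classification) pair instead of a CSV string.
plants = {"A": [1, 4, 7], "B": [2, 3, 7], "C": [3, 4, 9, 8], "D": [1, 5, 6, 9]}
conn.executemany(
    "INSERT INTO plant VALUES (?, ?)",
    [(name, c) for name, cs in plants.items() for c in cs],
)

# The front end sends "4,9"; split it and bind each value separately.
wanted = [int(v) for v in "4,9".split(",")]
placeholders = ", ".join("?" for _ in wanted)  # "?, ?"

# DISTINCT because a plant matching both 4 and 9 would otherwise appear twice.
matches = [
    row[0]
    for row in conn.execute(
        f"SELECT DISTINCT name FROM plant "
        f"WHERE classification IN ({placeholders}) ORDER BY name",
        wanted,
    )
]
print(matches)  # ['A', 'C', 'D']
```

Only the placeholder list is interpolated into the SQL text; the values themselves are still bound as parameters.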
You want OR logic, so you can use regular expressions. If every value were a single digit:
where classification regexp replace('4,9', ',', '|')
However, this would match 42 and 19, which I'm guessing you do not want. So make it a little more complicated: group the alternatives and require a comma (or the start or end of the string) on each side:
where classification regexp concat('(^|,)(', replace('4,9', ',', '|'), ')(,|$)')
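To sanity-check the two patterns outside the database, here is a small sketch using Python's re module as an approximation of MySQL's REGEXP (for patterns this simple the two should agree, but that is an assumption). The delimited pattern wraps all alternatives in one group, (^|,)(4|9)(,|$), so the delimiters apply to every value. Row E is a made-up row added to show the false positive.

```python
import re

rows = {
    "A": "1,4,7",
    "B": "2,3,7",
    "C": "3,4,9,8",
    "D": "1,5,6,9",
    "E": "42,19",  # hypothetical row: contains the digits 4 and 9, not the values
}

param = "4,9"
alternatives = param.replace(",", "|")  # "4|9"

naive = re.compile(alternatives)
delimited = re.compile("(^|,)(" + alternatives + ")(,|$)")

naive_hits = [name for name, c in rows.items() if naive.search(c)]
delimited_hits = [name for name, c in rows.items() if delimited.search(c)]

print(naive_hits)      # ['A', 'C', 'D', 'E']
print(delimited_hits)  # ['A', 'C', 'D']
```

Without the inner group, the alternation would bind more loosely than the delimiters, and a value at the very start or end of the list could be missed.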

MySql: Select Distinct for words in different order

I have a problem creating a query that returns no duplicate values from my table. Unfortunately, the Full Name column holds the name and surname in varying order.
For example:
+----+----------------------+
| ID | Full Name |
+----+----------------------+
| 1 | Marshall Wilson |
| 2 | Wilson Marshall |
| 3 | Lori Hill |
| 4 | Hill Lori |
| 5 | Casey Dean Davidson |
| 6 | Davidson Casey Dean |
+----+----------------------+
I would like to get that result:
+----+-----------------------+
| ID | Full Name |
+----+-----------------------+
| 1 | Marshall Wilson |
| 3 | Lori Hill |
| 5 | Casey Dean Davidson |
+----+-----------------------+
My goal is a query that treats names made of the same words as duplicates, for example a SELECT DISTINCT that behaves as if name and surname were always in the same order.
Any thoughts?
It requires lots of String operations, and usage of multiple Derived Tables. It may not be efficient.
We first tokenize the FullName into multiple words it is made out of. For that we use a number generator table gen. In this case, I have assumed that maximum number of substrings is 3. You can easily extend it further by adding more Selects, like, SELECT 4 UNION ALL .. and so on.
We use Substring_Index() with Replace() function to get a substring out, using a single space character (' ') as Delimiter. Trim() is used to remove any leading/trailing spaces left.
Now, the trick is to use this result set as a derived table and do a Group_Concat() on the words so that they are sorted in ascending order. This way, even duplicate names (with their substrings in a different order) get the same words_sorted value. Finally, we simply Group By words_sorted to weed out the duplicates.
Query #1
SELECT
  MIN(dt2.ID) AS ID,
  MIN(dt2.FullName) AS FullName
FROM
(
  SELECT
    dt1.ID,
    dt1.FullName,
    GROUP_CONCAT(IF(word = '', NULL, word) ORDER BY word ASC) words_sorted
  FROM
  (
    SELECT e.ID,
           e.FullName,
           TRIM(REPLACE(
                  SUBSTRING_INDEX(e.FullName, ' ', gen.idx),
                  SUBSTRING_INDEX(e.FullName, ' ', gen.idx-1),
                  '')) AS word
    FROM employees AS e
    CROSS JOIN (SELECT 1 AS idx UNION ALL
                SELECT 2 UNION ALL
                SELECT 3) AS gen -- You can add more numbers if more than 3 substrings
  ) AS dt1
  GROUP BY dt1.ID, dt1.FullName
) AS dt2
GROUP BY dt2.words_sorted
ORDER BY ID;
| ID | FullName |
| --- | ------------------- |
| 1 | Marshall Wilson |
| 3 | Hill Lori |
| 5 | Casey Dean Davidson |
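Note that the query takes MIN(ID) and MIN(FullName) independently, which is why ID 3 is paired with 'Hill Lori' rather than 'Lori Hill' in the output above. The underlying idea (sort the words of each name to get a canonical key, then keep one row per key) can be sketched outside SQL as follows. This is plain Python for illustration, not part of the MySQL solution, and it keeps the whole row with the lowest ID; the names first_seen and unique are hypothetical.

```python
employees = [
    (1, "Marshall Wilson"),
    (2, "Wilson Marshall"),
    (3, "Lori Hill"),
    (4, "Hill Lori"),
    (5, "Casey Dean Davidson"),
    (6, "Davidson Casey Dean"),
]

first_seen = {}
for emp_id, full_name in sorted(employees):
    # Canonical key: the words of the name in ascending order,
    # the same role words_sorted plays in the query.
    key = ",".join(sorted(full_name.split()))
    # Keep only the first (lowest-ID) row for each key.
    first_seen.setdefault(key, (emp_id, full_name))

unique = sorted(first_seen.values())
for row in unique:
    print(row)
# (1, 'Marshall Wilson')
# (3, 'Lori Hill')
# (5, 'Casey Dean Davidson')
```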

Generated columns using aggregate functions

The starting point
Suppose I have a table devTest that looks like this:
+----+------+
| id | j |
+----+------+
| 1 | 5 |
| 2 | 9 |
| 3 | 4 |
| 4 | 7 |
+----+------+
The goal
I want a column specifying the row's deviation from the mean in the j column (expressed in standard deviations). That is, the table would look like this:
+----+------+------------+
| id | j | jDev |
+----+------+------------+
| 1 | 5 | -0.5637345 |
| 2 | 9 | 1.2402159 |
| 3 | 4 | -1.0147221 |
| 4 | 7 | 0.3382407 |
+----+------+------------+
What I've tried
>alter table devTest add column jDev decimal as ((j - avg(j)) / std(j));
To which I receive an error indicating that aggregate functions can't be used in the definition of a generated column:
ERROR 1901 (HY000): Function or expression 'avg()' cannot be used in the
GENERATED ALWAYS AS clause of `jDev`
Making this kind of column must be pretty common, so I'd love to know the best solution!
In standard SQL you'd do:
select id, j, (j - avg(j) over ()) / std(j) over () as jdev
from devtest;
But MySQL before 8.0 (and MariaDB before 10.2) doesn't support window functions such as AVG() OVER (). On those versions, you must select the aggregation values separately:
select d.id, d.j, (d.j - agg.javg) / agg.jstd as jdev
from devtest d
cross join (select avg(j) as javg, std(j) as jstd from devtest) agg;
Then create a view as Benjamin Crouzier suggests in his answer:
CREATE VIEW v_devtest AS
select d.id, d.j, (d.j - agg.javg) / agg.jstd as jdev
from devtest d
cross join (select avg(j) as javg, std(j) as jstd from devtest) agg;
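A sketch for checking the numbers, assuming SQLite via Python's sqlite3 module in place of MySQL. SQLite has no built-in standard deviation, so the two aggregates below are registered by hand and are assumptions of this example: std mirrors MySQL's STD(), which is the population standard deviation, and stddev_samp mirrors STDDEV_SAMP(). The expected values listed in the question correspond to the sample version, so STDDEV_SAMP(j) may be what is wanted there.

```python
import math
import sqlite3


class _Std:
    """Standard deviation aggregate; ddof=0 is population, ddof=1 is sample."""
    ddof = 0

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        n = len(self.values)
        if n <= self.ddof:
            return None
        mean = sum(self.values) / n
        return math.sqrt(sum((v - mean) ** 2 for v in self.values) / (n - self.ddof))


class StdPop(_Std):
    ddof = 0


class StdSamp(_Std):
    ddof = 1


conn = sqlite3.connect(":memory:")
conn.create_aggregate("std", 1, StdPop)
conn.create_aggregate("stddev_samp", 1, StdSamp)
conn.execute("CREATE TABLE devTest (id INTEGER PRIMARY KEY, j INTEGER)")
conn.executemany("INSERT INTO devTest VALUES (?, ?)", [(1, 5), (2, 9), (3, 4), (4, 7)])

# Same shape as the answer's query: aggregates computed once, cross joined to every row.
rows = conn.execute(
    """
    SELECT d.id, d.j,
           (d.j - agg.javg) / agg.jstd  AS jdev_population,
           (d.j - agg.javg) / agg.jsamp AS jdev_sample
    FROM devTest d
    CROSS JOIN (SELECT avg(j) AS javg, std(j) AS jstd, stddev_samp(j) AS jsamp
                FROM devTest) agg
    ORDER BY d.id
    """
).fetchall()

for row in rows:
    print(row)
```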
A computed column can only calculate its value from other values in the same record. So what you are trying to do cannot be done with a calculated column.
This error makes sense because any change in your table (say you add a j with value 0) would update your average, and this in turn would change all your generated columns. So this would be quite a bit of work for the query engine.
Another solution would be to define a view instead:
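CREATE VIEW j_dev (id, j, j_dev) AS
SELECT id, j,
-- scalar subqueries: a bare avg(j) here would collapse the result to a single row
(j - (SELECT avg(j) FROM devTest)) / (SELECT std(j) FROM devTest) AS j_dev
FROM devTest;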
(not sure about the create view syntax, correct me if I'm wrong)

A better way to search for tags in mysql table

Say I have a table and one of the columns is titled tags with data that is comma separated like this.
"tag1,tag2,new york,tag4"
As you can see, some of tags will have spaces.
What's the best or most accurate way of querying the table for any tags that are equal to "new york"?
In the past I've used:
SELECT id WHERE find_in_set('new york',tags) <> 0
But find_in_set does not work when the value has a space.
I'm currently using this:
SELECT id WHERE concat(',',tags,',') LIKE concat(',%new york%,')
But I'm not sure if this is the best approach.
How would you do it?
When item A can be associated with many of item B, and item B can be associated with many of item A, this is called a many-to-many relationship.
Data with this kind of relationship should be stored in separate tables and joined together only at query time.
Example
Table 1
| product_uid | price | amount |
| 1 | 12000 | 3000 |
| 2 | 30000 | 600 |
Table 2
| tag_uid | tag_value |
| 1 | tag_01 |
| 2 | tag_02 |
| 3 | tag_03 |
| 4 | tag_04 |
Then we use a join table to relate them
Table 3
| entry_uid | product_uid | tag_uid |
| 1 | 1 | 3 |
| 2 | 1 | 4 |
| 3 | 2 | 1 |
| 4 | 2 | 2 |
| 5 | 4 | 2 |
The query will be (if you want to select item one and its tags):
SELECT t1.*, t2.tag_value
FROM Table1 as t1
JOIN Table3 as join_table ON t1.product_uid = join_table.product_uid
JOIN Table2 as t2 ON t2.tag_uid = join_table.tag_uid
WHERE t1.product_uid = 1
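Applied to the original question, the same three-table layout makes the "new york" lookup a plain equality test. The sketch below assumes SQLite through Python's sqlite3 module as a stand-in for MySQL; the table names product, tag and product_tag and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (product_uid INTEGER PRIMARY KEY, price INTEGER, amount INTEGER);
    CREATE TABLE tag (tag_uid INTEGER PRIMARY KEY, tag_value TEXT UNIQUE);
    -- Junction table: one row per (product, tag) association.
    CREATE TABLE product_tag (
        product_uid INTEGER REFERENCES product (product_uid),
        tag_uid     INTEGER REFERENCES tag (tag_uid),
        PRIMARY KEY (product_uid, tag_uid)
    );
    INSERT INTO product VALUES (1, 12000, 3000), (2, 30000, 600);
    INSERT INTO tag VALUES (1, 'tag1'), (2, 'tag2'), (3, 'new york'), (4, 'tag4');
    INSERT INTO product_tag VALUES (1, 3), (1, 4), (2, 1), (2, 2);
    """
)

# Exact match on the tag value: spaces inside the tag need no special handling.
product_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT p.product_uid
        FROM product p
        JOIN product_tag pt ON pt.product_uid = p.product_uid
        JOIN tag t ON t.tag_uid = pt.tag_uid
        WHERE t.tag_value = ?
        ORDER BY p.product_uid
        """,
        ("new york",),
    )
]
print(product_ids)  # [1]
```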
Suppose I needed to ignore the spaces before and after the commas in tags.
For example, if tags had a value of:
'atlanta,boston , chicago, los angeles , new york '
then, assuming spaces are the only characters I want to ignore and the tag I'm searching for has no leading or trailing spaces, I'd likely use a regular expression. Something like this:
SELECT ...
FROM t
WHERE t.tags REGEXP CONCAT('(^|,) *', 'new york', ' *(,|$)')
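Grouping matters in a pattern like this: alternation binds more loosely than concatenation, so the start/end anchors have to sit inside parentheses together with the comma. A quick check, using Python's re module as an approximation of MySQL's REGEXP (an assumption; the variable names are illustrative):

```python
import re

tags = "atlanta,boston , chicago, los angeles , new york "

# Anchors grouped with the comma: (start or comma), spaces, tag, spaces, (comma or end).
grouped = re.compile("(^|,) *" + re.escape("new york") + " *(,|$)")

# Without the groups this reads as: "^"  OR  ", *new york *,"  OR  "$",
# and the bare anchors match any string at all.
ungrouped = re.compile("^|, *new york *,|$")

print(bool(grouped.search(tags)))                  # True
print(bool(grouped.search("boston,new yorker")))   # False
print(bool(grouped.search("new york,boston")))     # True
print(bool(ungrouped.search("boston,chicago")))    # True, a false positive
```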
I recommend Bill Karwin's excellent book "SQL Antipatterns: Avoiding the Pitfalls of Database Programming"
https://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557
Chapter 2 Jaywalking covers the antipattern of comma separated lists.

need explanation for this MySQL query

I just came across this database query and wonder what exactly this query does..Please clarify ..
select * from tablename order by priority='High' DESC, priority='Medium' DESC, priority='Low' DESC;
Looks like it'll order the priority by High, Medium then Low.
Because if the order by clause was just priority DESC then it would do it alphabetical, which would give
Medium
Low
High
It basically lists all fields from the table "tablename", ordered by priority: High, Medium, Low.
So High appears first in the list, then Medium, and then finally Low
i.e.
* High
* High
* High
* Medium
* Medium
* Low
Where * is the rest of the fields in the table
Others have already explained what it does (High comes first, then Medium, then Low). I'll just add a few words about WHY that is so.
The reason is that the result of a comparison in MySQL is an integer - 1 if it's true, 0 if it's false. And you can sort by integers, so this construct works. I'm not sure this would fly on other RDBMS though.
Added: OK, a more detailed explanation. First of all, let's start with how ORDER BY works.
ORDER BY takes a comma-separated list of arguments which it evaluates for every row. Then it sorts by these arguments. So, for example, let's take the classical example:
SELECT * from MyTable ORDER BY a, b, c desc
What ORDER BY does in this case, is that it gets the full result set in memory somewhere, and for every row it evaluates the values of a, b and c. Then it sorts it all using some standard sorting algorithm (such as quicksort). When it needs to compare two rows to find out which one comes first, it first compares the values of a for both rows; if those are equal, it compares the values of b; and, if those are equal too, it finally compares the values of c. Pretty simple, right? It's what you would do too.
OK, now let's consider something trickier. Take this:
SELECT * from MyTable ORDER BY a+b, c-d
This is basically the same thing, except that before all the sorting, ORDER BY takes every row and calculates a+b and c-d and stores the results in invisible columns that it creates just for sorting. Then it just compares those values like in the previous case. In essence, ORDER BY creates a table like this:
+-------------------+-----+-----+-----+-----+-------+-------+
| Some columns here | A | B | C | D | A+B | C-D |
+-------------------+-----+-----+-----+-----+-------+-------+
| | 1 | 2 | 3 | 4 | 3 | -1 |
| | 8 | 7 | 6 | 5 | 15 | 1 |
| | ... | ... | ... | ... | ... | ... |
+-------------------+-----+-----+-----+-----+-------+-------+
And then sorts the whole thing by the last two columns, which it discards afterwards. You don't even see them in your result set.
OK, something even weirder:
SELECT * from MyTable ORDER BY CASE WHEN a=b THEN c ELSE D END
Again - before sorting is performed, ORDER BY will go through each row, calculate the value of the expression CASE WHEN a=b THEN c ELSE D END and store it in an invisible column. This expression will always evaluate to some value, or you get an exception. Then it just sorts by that column, which contains simple values rather than a fancy formula.
+-------------------+-----+-----+-----+-----+-----------------------------------+
| Some columns here | A | B | C | D | CASE WHEN a=b THEN c ELSE D END |
+-------------------+-----+-----+-----+-----+-----------------------------------+
| | 1 | 2 | 3 | 4 | 4 |
| | 3 | 3 | 6 | 5 | 6 |
| | ... | ... | ... | ... | ... |
+-------------------+-----+-----+-----+-----+-----------------------------------+
Hopefully you are now comfortable with this part. If not, re-read it or ask for more examples.
Next thing is the boolean expressions. Or rather the boolean type, which for MySQL happens to be an integer. In other words SELECT 2>3 will return 0 and SELECT 2<3 will return 1. That's just it. The boolean type is an integer. And you can do integer stuff with it too. Like SELECT (2<3)+5 will return 6.
OK, now let's put all this together. Let's take your query:
select * from tablename order by priority='High' DESC, priority='Medium' DESC, priority='Low' DESC;
What happens is that ORDER BY sees a table like this:
+-------------------+----------+-----------------+-------------------+----------------+
| Some columns here | priority | priority='High' | priority='Medium' | priority='Low' |
+-------------------+----------+-----------------+-------------------+----------------+
| | Low | 0 | 0 | 1 |
| | High | 1 | 0 | 0 |
| | Medium | 0 | 1 | 0 |
| | Low | 0 | 0 | 1 |
| | High | 1 | 0 | 0 |
| | Low | 0 | 0 | 1 |
| | Medium | 0 | 1 | 0 |
| | High | 1 | 0 | 0 |
| | Medium | 0 | 1 | 0 |
| | Low | 0 | 0 | 1 |
+-------------------+----------+-----------------+-------------------+----------------+
And it then sorts by the last three invisible columns, which are discarded later.
Does it make sense now?
(P.S. In reality, of course, there are no invisible columns and the whole thing is made much trickier to get good speed, using indexes if possible and other stuff. However it is much easier to understand the process like this. It's not wrong either.)
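The behaviour is easy to try out. The sketch below assumes SQLite (through Python's sqlite3 module) rather than MySQL; SQLite also evaluates comparisons to the integers 0 and 1, so the same ORDER BY works there. The sample rows are made up, and the CASE variant at the end is a more portable spelling of the same ordering for databases that do not treat booleans as integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, priority TEXT)")
conn.executemany(
    "INSERT INTO tablename (priority) VALUES (?)",
    [(p,) for p in ["Low", "High", "Medium", "Low", "High"]],
)

# Comparisons are just integers: false is 0, true is 1.
bool_values = conn.execute("SELECT 2>3, 2<3, (2<3)+5").fetchone()
print(bool_values)  # (0, 1, 6)

# Sorting by those 0/1 values, descending, puts the matching rows first.
ordered = [
    row[0]
    for row in conn.execute(
        "SELECT priority FROM tablename "
        "ORDER BY priority='High' DESC, priority='Medium' DESC, priority='Low' DESC"
    )
]
print(ordered)  # ['High', 'High', 'Medium', 'Low', 'Low']

# Portable alternative: map each priority to an explicit rank.
portable = [
    row[0]
    for row in conn.execute(
        "SELECT priority FROM tablename "
        "ORDER BY CASE priority WHEN 'High' THEN 1 WHEN 'Medium' THEN 2 ELSE 3 END"
    )
]
print(portable)  # ['High', 'High', 'Medium', 'Low', 'Low']
```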