MySql Duplicated rows string comparation performance - mysql

I have a table with more then 2 million records,
I need to find duplication records in column with string type additionaly I have index for this field.
I have next query:
select m.* from member as m
where lower(m.username) in
(select lower(b.username) from member as b
where b.Username like 'a%'
group by b.username
having count(b.username) >= 2);
sub-query return only 4 records less then 0.2 seconds, but if I use them in where conditions section, this query working very long time and never return results....
I have tried to run next query, that theoretically the same logic:
select * from member as m where lower(Username) in (lower('a1'),
lower('a2'),lower('a3'),lower('a4'));
and it works fine and fast.
what is the issues ?
additionally I would like to run query with out where b.Username like 'a%' part?

In common case MySQL can not use index for IN subqueries
This is sad, but, actually, MySQL can not recognize "constant subqueries". What does it mean? It means that if you have a subquery that returns static list of values - and you use that in IN within another query, MySQL will not use index (by range).
Why it is so?
Actually, the most correct point is - because MySQL treats following queries:
.. WHERE `field` IN ('foo', 'bar', 'baz')
and
.. WHERE `field` IN (SELECT `col` FROM t)
-as different queries (I'm assuming that column col in table t in second query have same values, i.e. 'foo', 'bar', 'baz'). First query is equivalent for it's "expected" case, i.e. for range of values. But second query is equal for = ANY subquery - and so MySQL will not use index for that.
What to do
Actually, your case and cases similar to it - are cases when it's better to split your query into two parts. First part will be retrieve static list of values from your table. Second part will substitute result of your first part into IN clause and then you'll get index using.
Alternative - you can use JOIN syntax for table to itself. That may seems useful if you want to resolve an issue with one query (or if your list is too long)

Related

Make mysql query faster

I have this query which takes around 29 second to perform and need to make it faster. I have created index on aggregate_date column and still no real improvement. Each aggregate_date has almost 26k rows within the table.
One more thing the query will run starting from 1/1/2018 till yesterday date
select MAX(os.aggregate_date) as lastMonthDay,
os.totalYTD
from (
SELECT aggregate_date,
Sum(YTD) AS totalYTD
FROM tbl_aggregated_tables
WHERE subscription_type = 'Subcription Income'
GROUP BY aggregate_date
) as os
GROUP by MONTH(os.aggregate_date), YEAR(os.aggregate_date);
I used Explain Select ... and received the following
update
The most of query time is consumed by the inner Query, so as scaisEdge suggested bellow i have tested the query and the time is reduced to almost 8s.
the Inner Query will look like:
select agt.aggregate_date,SUM(YTD)
from tbl_aggregated_tables as agt
FORCE INDEX(idx_aggregatedate_subtype_YTD)
WHERE agt.subscription_type = 'Subcription Income'
GROUP by agt.aggregate_date
I have noticed that this comparison "WHERE agt.subscription_type = 'Subcription Income'" takes the most of time. So is there any way to change that and to be mentioned the column of subscription_type only have 2 values which is 'Subcription Income' and 'Subcription Unit'
The index on aggregate_date column is not useful for performance because in not involved in where condition
looking to your code an useful index should be on column subscription_type
you could try using a redundant index adding also the column involved in select clause (for try to obtain all the data in query from index avoiding access to table)so you index could be
create idx1 on tbl_aggregated_tables (subscription_type, aggregate_date, YTD )
the meaning of the last group by seems not coherent with the select clause

The claim that subqueries can only return single columns

I am going through Ben Fortas "Teach yourself SQL in 10 minutes" book and it has grey box warning: "Subquery SELECT statements can only retrieve a single column. Attempting to return multiple columns will return an error."
Is this, in fact, commonly true for an RDMS? (Note that if this answer is correct then it is not true for all databases).
And why in the world would it ever be true? It seems like such a weird language restriction. Queries are expensive to compute, and the work to retrieve 3 columns is not particularly computationally different than the work to retrieve 1 (unless your RDMS stores your tables grouped by columns instead of grouped by rows).
In the answer that you link to, I would classify that as an "inline view" or "inline query", rather than a subquery.
That raises the question of what exactly a subquery is, of course.
Here's an example where you can indeed only return a single column.
select (select name from table where id = main_query.id),
id
from table main_query
Here's an example where you can return multiple columns. This seems to be to be unequivocally a subquery, of the "correlated" type.
select id
from table main_query
where (col1, col2) in (select a,b
from c
where c.x = main_query.y);
Here's an example where it doesn't matter how many columns are returned, and in fact any values are ignored:
select id
from table main_query
where exists (select a,b
from c
where c.x = main_query.y);
I think on balance I'd say that it is not true, but it depends on what your definition of a subquery is.
Scalar subqueries can return only one column. These are subqueries that are used wherever a single value is expected. This can be in almost any clause. For instance, as this non-sensical query uses them:
select (select count(*) from information_schema.tables), table_name
from information_schema.columns
where table_name = (select table_name from information_schema.columns order by rand() limit 1);
However, many subqueries are not scalar subqueries. These include:
Subqueries in the FROM clause.
Subqueries for a EXISTS and NOT EXISTS.
Subqueries that operator on a tuple when used in a comparison such as = and IN.

Evaluating a correlated subquery in SQL

I'm having trouble getting my head around evaluating correlated subqueries. An example is using a correlated subquery in SELECT so that GROUP BY isn't needed:
Consider the relations:
Movies : Title, Director Length
Schedule : Theatre, Title
I have the following query
SELECT S.Theater, MAX(M.Length)
FROM Movies M JOIN Schedule S ON M.Title=S.Title
GROUP BY S.Theater
Which gets the longest film that every theatre is playing. This is the same query without using GROUP BY:
SELECT DISTINCT S.theater,
(SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title=S.Title)
FROM Schedule S
but I don't understand how it quite works.
I'd appreciate if anybody could give me an example of how correlated subqueries are evaluated.
Thanks :)
Conceptually...
To understand this, first ignore the bit about correlated subquery.
Consider the order of operations for a statement like this:
SELECT t.foo FROM mytable t
MySQL prepares an empty resultset. Rows in the resultset will consist of one column, because there is one expression in the SELECT list. A row is retrieved from mytable. MySQL puts a row into the resultset, using the value from the foo column from the mytable row, assigning it to the foo column in the resultset. Fetch the next row, repeat that same process, until there are no more rows to fetch from the table.
Pretty easy stuff. But bear with me.
Consider this statement:
SELECT t.foo AS fi, 'bar' AS fo FROM mytable t
MySQL process that the same way. Prepare an empty resultset. Rows in the resultset are going to have two columns this time. The first column is given the name fi (because we assigned the name fi with an alias). The second column in rows of the resultset will be named fo, because (again) we assigned an alias.
Now we etch a row from mytable, and insert a row into the resultset. The value of the foo column goes into the column name fi, and the literal string 'bar' goes into the column named fo. Continue fetching rows and inserting rows into the resultset, until no more rows to fetch.
Not too hard.
Next, consider this statement, which looks a little more tricky:
SELECT t.foo AS fi, (SELECT 'bar') AS fo FROM mytable t
Same thing happens again. Empty resultset. Rows have two columns, name fi and fo.
Fetch a row from mytable, and insert a row into the resultset. The value of foo goes into column fi (just like before.) This is where it gets tricky... for the second column in the resultset, MySQL executes the query inside the parens. In this case it's a pretty simple query, we can test that pretty easily to see what it returns. Take the result from that query and assign that to the fo column, and insert the row into the resultset.
Still with me?
SELECT t.foo AS fi, (SELECT q.tip FROM bartab q LIMIT 1) AS fo FROM mytable
This is starting to look more complicated. But it's not really that much different. The same things happen again. Prepare the empty resultset. Rows will have two columns, one name fi, the other named fo. Fetch a row from mytable. Get the value from foo column, and assign it to the fi column in the result row. For the fo column, execute the query, and assign the result from the query to the fo column. Insert the result row into the resultset. Fetch another row from mytable, a repeat the process.
Here we should stop and notice something. MySQL is picky about that query in the SELECT list. Really really picky. MySQL has restrictions on that. The query must return exactly one column. And it cannot return more than one row.
In that last example, for the row being inserted into the resultset, MySQL is looking for a single value to assign to the fo column. When we think about it that way, it makes sense that the query can't return more than one column... what would MySQL do with the value from the second column? And it makes sense that we don't want to return more than one row... what would MySQL do with multiple rows?
MySQL will allow the query to return zero rows. When that happens, MySQL assigns a NULL to the fo column.
If you have an understanding of that, your 95% of the way there to understanding the correlated subquery.
Let's look at another example. Our single line of SQL is getting a little unweildy, so we'll just add some line breaks and spaces to make it easier for us to work with. The extra spaces and linebreaks don't change the meaning of our statement.
SELECT t.foo AS fi
, ( SELECT q.tip
FROM bartab q
WHERE q.col = t.foo
ORDER BY q.tip DESC
LIMIT 1
) AS fo
FROM mytable t
Okay, that looks a lot more complicated. But is it really? It's the same thing again. Prepare an empty resultset. Rows will have two columns, fi and fo. Fetch a row from mytable, and get a row ready to insert into the resultset. Copy the value from the foo column, assign it to the fi column. And for the fo column, execute the query, take the single value returned by the query to the fo column, and push the row into the resultset. Fetch the next row from mytable, and repeat.
To explain (finall!) the part about "correlated".
That query we are going to run to get the result for the fo column. That contains a reference to a column from the outer table. t.foo. In this example that appears in the WHERE clause; it doesn't have to, it could appear anywhere in the statement.
What MySQL does with that, when it runs that subquery, it passes in the value of the foo column, into the query. If the row we just fetched from mytable has a value of 42 in the foo column... that subquery is equivalent to
SELECT q.tip
FROM bartab q
WHERE q.col = 42
ORDER BY q.tip DESC
LIMIT 1
But since we're not passing in the literal value of 42, what we're passing in is values from the row in the outer query, the result returned by our subquery is "related" to the row we're processing in the outer query.
We could be a lot more complicated in our subquery, as long as we remember the rule about the subquery in the SELECT list... it has to return exactly one column, and at most one row. It returns at most one value.
Correlated subqueries can appear in parts of the statement other than the SELECT list, such as the WHERE clause. The same general concept applies. For each row processed by the outer query, the values of the column(s) from that row are passed in to the subquery. The result returned from the subquery is related to the row being processed in the outer query.
The discussion omits all the steps before the actual execution... parsing the statament into tokens, performing the syntax check (keywords and identifiers in the right place). Then performing the semantics check (does mytable exist, does the user have select privilege on it, does the column foo exist in mytable). Then determining the access plan. And in the execution, obtaining the required locks, and so on. All that happens with every statement we execute.)
And we're going to not discuss the kinds of horrendous performance issues we can create with correlated subqueries. Though the previous discussion should give a clue. Since the subquery is executed for every row we're putting into the resultset (if it's in the SELECT list of our outer query), or is being executed for every row that is accessed by the outer query... if the outer query is returning 40,000 rows, that means our correlated subquery is going to be executed 40,000 times. So we better well make sure that subquery executes fast. Even when it executes fast, we're still going to execute it 40,000 times.
From a conceptual standpoint, imagine that the database is going through each row of the result without the subquery:
SELECT DISTINCT S.Theater, S.Title
FROM Schedule S
And then, for each one of those, running the subquery for you:
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = (whatever S.Title was)
And placing that in as the value. Really, it's not (conceptually) that different from using a function:
SELECT DISTINCT S.Theater, SUBSTRING(S.Title, 1, 5)
FROM Schedule S
It's just that this function performs a query against another table, instead.
I do say conceptually, though. The database may be optimizing the correlated query into something more like a join. Whatever it does internally matters for performance, but doesn't matter as much for understanding the concept.
But, it may not return the results you're expecting. Consider the following data (sorry sqlfiddle seems to be erroring atm):
CREATE TABLE Movies (
Title varchar(255),
Length int(10) unsigned,
PRIMARY KEY (Title)
);
CREATE TABLE Schedule (
Title varchar(255),
Theater varchar(255),
PRIMARY KEY (Theater, Title)
);
INSERT INTO Movies
VALUES ('Star Wars', 121);
INSERT INTO Movies
VALUES ('Minions', 91);
INSERT INTO Movies
VALUES ('Up', 96);
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Minions', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Up', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 6');
And then this query:
SELECT DISTINCT
S.Theater,
(
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = S.Title
) AS MaxLength
FROM Schedule S;
You'll get this result:
+----------+-----------+
| Theater | MaxLength |
+----------+-----------+
| Cinema 6 | 121 |
| Cinema 8 | 91 |
| Cinema 8 | 121 |
| Cinema 8 | 96 |
+----------+-----------+
As you can see, it's not a replacement for GROUP BY (and you can still use GROUP BY), it's just running the subquery for each row. DISTINCT will only remove duplicates from the result. It's not giving the "greatest length" per theater anymore, it's just giving each unique movie length associated with the theater name.
PS: You might likely use an ID column of some sort to identify movies, rather than using the Title in the join. This way, if by chance the name of the movie has to be amended, it only needs to change in one place, not all over Schedule too. Plus, it's faster to join on an ID number than a string.

Where clause with one column and multiple criteria returning one row instead of13

I have a simple query with a few rows and multiple criteria in the where clause but it is only returning one row instead of 13. No joins and the syntax was triple checked and appears to be free of errors.
Query:
select column1, column2, column3
from mydb
where onecolumn in (number1, number2....number13)
Results:
returns one row of data associated with a random number in the where clause
spent a big part of the day trying to figure this one out and am now out of ideas. Please help...
Absent a more detailed test case, and the actual SQL statement that is actually running, this question cannot be answered. Here are some "ideas"...
Our first guess is that the rows you think are going to satisfy the predicates aren't actually satisfying all of the conditions.
Our second guess is that you've got an aggregate expression (COUNT(), MAX(), SUM()) in the SELECT list that's causing an implicit GROUP BY. This is a common "gotcha"... the non-standard MySQL extension to GROUP BY which allows non-aggregates to appear in the SELECT list, which are not also included as expressions in the GROUP BY clause. This same gotcha appears when the GROUP BY clause is omitted entirely, and an aggregate is included in the SELECT list.
But the question doesn't make any mention of an aggregate expression in the SELECT list.
Our third guess is another issue that beginners frequently overlook: the order of precedence of operations, especially AND and OR. For example, consider the expressions:
a AND b OR c
a AND ( b OR c )
( a AND b ) OR c
consider those while we sing-along, Sesame Street style,...: "One of these things is not like the others, one of these things just doesn't belong..."
A fourth guess... if it wasn't for the row being returned having a value of onecolumn as a random number in the IN list... if it was instead the first number in the IN list, we'd be very suspicious that the IN list actually contains a single string value that looks like a list a values, but is actually not.
The two expression in the SELECT list look very similar, but they are very different:
SELECT t.n IN (2,3,5,7) AS n_in_list
, t.n IN ('2,3,5,7') AS n_in_string
FROM ( SELECT 2 AS n
UNION ALL SELECT 3
UNION ALL SELECT 5
) t
The first expression is comparing n to each value in a list of four values.
The second expression is equivalent to t.n IN (2).
This is a frequent trip up when neophytes are dynamically creating SQL text, thinking that they can pass in a string value and that MySQL will see the commas within the string as part of the SQL statement.
(But this doesn't explain how a some the random one in the list.)
Those are all just guesses. Those are some of the most frequent trip ups we see, but we're just guessing. It could be something else entirely. In it's current form, there is no definitive "answer" to the question.

Union query to combine results of 3 tables

I am relatively new to coding so please have patience.
I am trying to combine data from 3 tables. I have managed to get some data back but it isn't what i need. Please see my example below.
select oid, rrnhs, idnam, idfnam, dte1, ta
as 'access type' from person
left join
(select fk_oid, min(dte), dte1, ta
from
((Select fk_oid,min(accessdate) as dte, accessdate1 as dte1, accesstype as ta
from vascularpdaccess
where isnull(accesstype)=false group by fk_oid)
union
(Select fk_oid, min(hpdate) as dte, hpdate as dte1, HPACCE as ta
from hdtreatment
where isnull(hptype)=false group by fk_oid)) as bla
group by fk_oid) as access
on person.oid=access.fk_oid
where person.rrnhs in (1000010000, 2000020000, 3000030000)
My understanding with a union is that the columns have to be of the same data type but i have two problems. The first is that accesstype and hpacce combine in to a the same column as expected, but i dont want to actually see the hpacce data (dont know if this is even possible).
Secondly, the idea of the query is to pull back a patients 'accesstype' date at the first date of hpdate.
I dont know if this even makes sens to you guys but hoping someone can help..y'all are usually pretty nifty!
Thanks in advance!
Mikey
All queries need to have the same number of columns in the SELECT statement. It looks like you first query has the max number of columns, so you will need to "pad" the other to have the same number of columns. You can use NULL as col to create the column with all null values.
To answer the question (I think) you were asking... for a UNION or UNION ALL set operation, you are correct: the number of columns and the datatypes of the columns returned must match.
But it is possible to return a literal as an expression in the SELECT list. For example, if you don't want to return the value of HPACCE column, you can replace that with a literal or a NULL. (If that column is character datatype (we can't tell from the information provided in the question), you could use (for example) a literal empty string '' AS ta in place of HPACCE AS ta.
SELECT fk_oid
, MIN(HPDATE) AS dte
, hpdate AS dte1
, NULL AS ta
-- -------------------- ^^^^
FROM hdtreatment
Some other notes:
The predicate ISNULL(foo)=FALSE can be more simply expressed as foo IS NOT NULL.
The UNION set operator will remove duplicate rows. If that's not necessary, you could use a UNION ALL set operator.
The subsequent GROUP BY fk_oid operation on the inline view bla is going to collapse rows; but it's indeterminate which row the values from dte1 and ta will be from. (i.e. there is no guarantee those values will be from the row that had the "minimum" value of dte.) Other databases will throw an exception/error with this statement, along the lines of "non-aggregate in SELECT list not in GROUP BY". But this is allowed (without error or warning) by a MySQL specific extension to GROUP BY behavior. (We can get MySQL to behave like other databases and throw an error of we specify a value for sql_mode that includes ONLY_FULL_GROUP_BY (?).)
The predicate on the outer query doesn't get pushed down into the inline view bla. The view bla is going to materialized for every fk_oid, and that could be a performance issue on large sets.
Also, qualifying all column references would make the statement easier to read. And, that will also insulate the statement from throwing an "ambiguous column" error in the future, when a column named (e.g.) ta or dte1 is added to the person table.