Slow stored MySQL function gets progressively slower with repeated runs

I need to filter a list by whether the person has an appointment. This runs in 0.09 seconds.
select personid from persons p
where EXISTS (SELECT 1 FROM appointments a
WHERE a.personid = p.personid);
Since I use this in more than one query and it actually contains another condition, it seemed convenient to put the filter into a function, so I have
CREATE FUNCTION `has_appt`(pid INT) RETURNS tinyint(1)
BEGIN
RETURN
EXISTS (SELECT 1 FROM appointments WHERE personid = pid);
END
Then I can use
select personid from persons where has_appt(personid)
However, two unexpected things happen. First, the statement using the has_appt() function now takes 2.5 seconds to run. I know there is overhead to a function call, but this seems extreme. Second, if I run the statement repeatedly, it takes about 5 seconds longer each time, so by the 4th time, it is taking over 20 seconds. This happens regardless of how long I wait between tries, but storing the function again resets the time to 2.5 seconds. What can account for the progressive slowness? What state can be affected by simply running it multiple times?
I know the solution is to forget the function and just embed this into my queries, but I want to understand the principle so I can avoid making the same mistake again. Thanks in advance for your help.
I'm using MySQL 8 and Workbench.

Your original query can be replaced by, and sped up by,
SELECT personid FROM appointments;
But the query seems dumb -- why would you want a list of all the ids of people with appointments, but no info about them? Perhaps you over-simplified the query?
If a person might have multiple appointments, then this would be needed, and might not be as fast:
SELECT DISTINCT personid FROM appointments;
As for why the function is so slow: the optimizer does not see what is inside the function. So select personid from persons where has_appt(personid) walks through the entire persons table, calling the function once per row, and each call runs its own subquery against appointments.
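Since the function reportedly wraps an extra condition, the fix is to inline that condition so the optimizer can turn the EXISTS into a semijoin against an index on appointments. A minimal sketch, where apptdate is a hypothetical stand-in for whatever the extra condition actually tests:
SELECT p.personid
FROM persons p
WHERE EXISTS (SELECT 1
              FROM appointments a
              WHERE a.personid = p.personid
                AND a.apptdate >= CURDATE());  -- hypothetical extra condition
With an index on appointments (personid) or (personid, apptdate), this should run close to the original 0.09 seconds.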

Related

MySQL how can I speed up UDF in my query

I have one table that has userID, department, er. I've created one simple query to gather all this information.
SELECT table.userID, table.department, table.er FROM table;
Now, I want to group all er's that belong to the same department and perform this calculation
select sum(table.er)/3 as department_er from table group by table.department;
Then add this result as a new column in my first query. To do this I've created a UDF that looks like this
-- Header reconstructed from the call dptER(table.department) below;
-- the parameter type is an assumption.
CREATE FUNCTION dptER(dpt VARCHAR(64)) RETURNS FLOAT
BEGIN
DECLARE department_er FLOAT;
SET department_er = (SELECT SUM(er) FROM table WHERE table.department = dpt);
RETURN department_er;
END
Then I used that UDF in this query
SELECT table.userID, table.department, (select dptER(table.department)/3) as department_er FROM table
I've indexed my tables and more complex queries were dropped from 4+ minutes to less than 1 second. This seems to be pretty simple but is going on 10 minutes to run. Is there a better way to do this or a way to optimize my UDF?
Forgive my n00b-ness :)
Try a query without a dependent aggregated subquery in the SELECT clause:
select table.userID,
table.department as dpt,
x.department_er
from table
join (
select department,
(sum(table.er)/3) As department_er
from table
group by department
) x
ON x.department = table.department
This UDF cannot be optimized. It may seem to work in simple queries, but in general it can hurt your database's performance.
Imagine that we have a query like this one:
SELECT ....., UDF( some parameters )
FROM table
....
MySQL must call this function for each record retrieved from the table in this query.
If the table contains 1000 records, the function is fired 1000 times, and the query within the function is also fired 1000 times.
If 10,000 records, then the function is called 10,000 times.
Even if you optimize the function so that it is 2 times faster, the above query will still fire it 1000 times.
If 500 users have the same department, it is still called 500 times, once per user, and calculates the same value for each of them: 499 redundant calls, because only 1 call is required to compute this value.
The only way to optimize such queries is to take the "inner" query out of the UDF function and combine it with the main query using joins etc.
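For completeness, on MySQL 8 (which the asker of the main question above is using) a window function achieves the same thing without either a UDF or a derived table. A sketch, keeping the question's placeholder table name (backticked here because TABLE is a reserved word):
-- The per-department aggregate is computed once per department
-- and attached to every row, with no function call per row.
SELECT userID,
       department,
       SUM(er) OVER (PARTITION BY department) / 3 AS department_er
FROM `table`;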

mysql query using where clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically table TBOM HAS :
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in those columns so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
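For illustration, those suggestions as DDL (the index names are mine); note that plain, non-unique indexes work fine even when a column contains duplicates, so the duplicates in TBOM are not an obstacle:
CREATE INDEX idx_tbom_component      ON TBOM (Component);
CREATE INDEX idx_tbom_product        ON TBOM (Product);
CREATE INDEX idx_tcs_component       ON TComponent_Status (component);
CREATE INDEX idx_stock_productnumber ON Stock (ProductNumber);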
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't need to go to the full data row of each stock record; both parts are in the index, so it can use that alone. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCT should also help.
Now, the other issue: time. Since you are joining from stock to product to product status, you are asking for all 24 million TBOM items (assuming "bill of materials"), and because each BOM component could have multiple status records, you are getting every BOM for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like ( date_created, component, certificate, status ). This way, the WHERE clause would be optimized, and the DISTINCT would be too, since the pieces are already part of the index.
But, how you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Spread that across 24 million rows, and it's definitely not going to look good.
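As a sketch, the covering index suggested above (the name is illustrative; status is backticked defensively):
CREATE INDEX idx_tcs_date_covering
    ON TComponent_Status (date_created, component, certificate, `status`);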

VB.NET MySqlDataAdapter.Fill takes almost 6-7 seconds

I run a simple SELECT (noted below) in a stored procedure of a table that's around 1,500 rows.
CREATE PROCEDURE `LoadCollectionItemProperty`(IN sId int(10))
BEGIN
SELECT *
FROM itemproperty
WHERE itemid IN
(SELECT itemid
FROM collectionitem
WHERE collectionid = sId AND removed ='0000-00-00 00:00:00');
END
This operation takes around 7 seconds. I inserted breakpoints and used F11 to determine that MySqlDataAdapter.Fill is where the lag starts. Neither my computer nor the server hosting the MySQL database is challenged spec-wise. I'm guessing it's the query itself.
collectionitem holds the 2 foreign keys linking an itemproperty to a collection. We feed the sproc sId (the PK of collection) so that the subquery returns all the itemids from a specific collection, and then we use the itemid (PK) in itemproperty.
Is there any way to speed up the process?
UPDATE
My issue was entirely due to improper indexing. Once I learned which columns to index, everything is extremely smooth! Thank you for your help.
You can try this, but it may not help much if your tables are missing indexes.
BEGIN
SELECT *
FROM itemproperty i
WHERE exists
(SELECT 1
FROM collectionitem c
WHERE collectionid = sId AND i.itemid = c.itemid AND removed ='0000-00-00 00:00:00');
END
Well, given it's the query (you should prove that by just running it at the prompt on the server):
Cut the query out of the sp and prefix it with EXPLAIN to see the query execution plan to confirm, but some things stand out straight off. Try
SELECT *
FROM itemproperty
INNER JOIN collectionitem ON collectionitem.itemid = itemproperty.itemid
    AND collectionitem.collectionid = sId
    AND collectionitem.removed = '0000-00-00 00:00:00'
to get rid of the subquery.
Is removed a datetime? Is it indexed?
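Given the asker's update that proper indexing resolved the issue, a hedged guess at the kind of index that helps here (the name is illustrative):
-- Covers the lookup by collection, the removed filter, and the itemid join
CREATE INDEX idx_collectionitem_lookup
    ON collectionitem (collectionid, removed, itemid);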

SQL ORDER BY performance

I have a table with more than 1 million records. The problem is that the query takes too much time, like 5 minutes. The ORDER BY is my problem, but I need the expression in the ORDER BY to rank the most popular videos, and because of the expression I can't create an index on it.
How can i resolve this problem?
Thx.
SELECT DISTINCT
`v`.`id`,`v`.`url`, `v`.`title`, `v`.`hits`, `v`.`created`, ROUND((r.likes*100)/(r.likes+r.dislikes),0) AS `vote`
FROM
`videos` AS `v`
INNER JOIN
`votes` AS `r` ON v.id = r.id_video
ORDER BY
(v.hits+((r.likes-r.dislikes)*(r.likes-r.dislikes))/2*v.hits)/DATEDIFF(NOW(),v.created) DESC
Does the most-popular list have to be calculated every time? I doubt the answer is yes. Some operations will take a long time to run no matter how efficient your query is.
Also bear in mind that you have 1 million rows now but might have 10 million in the next few months, so a query that works today might not in a month; the solution needs to be scalable.
I would make a job run every couple of hours to calculate and store this information in a different table. This might not be the answer you are looking for, but I just had to say it.
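A minimal sketch of such a job, assuming a MySQL scheduled event and a pre-computed score table (video_popularity and refresh_video_popularity are illustrative names, and the event scheduler must be enabled):
CREATE TABLE video_popularity (
    video_id INT PRIMARY KEY,
    score    DOUBLE,
    INDEX idx_score (score)   -- ORDER BY score DESC LIMIT 20 can use this
);

-- Requires: SET GLOBAL event_scheduler = ON;
CREATE EVENT refresh_video_popularity
ON SCHEDULE EVERY 2 HOUR
DO
    REPLACE INTO video_popularity (video_id, score)
    SELECT v.id,
           (v.hits + ((r.likes - r.dislikes) * (r.likes - r.dislikes)) / 2 * v.hits)
               / DATEDIFF(NOW(), v.created)
    FROM videos v
    JOIN votes r ON v.id = r.id_video;

-- The hot query then becomes a cheap indexed read:
-- SELECT video_id, score FROM video_popularity ORDER BY score DESC LIMIT 20;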
What I have done in the past is to create a voting system based on Integers.
Nothing will outperform integers.
The voting system table has 2 Columns:
ProductID
VoteCount (INT)
The VoteCount column stores the running total of all the votes that are submitted:
Like = +1
Unlike = -1
Create an Index in the vote table based on ID.
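A minimal sketch of that counter table, adapted to this question's videos (the table and column names are illustrative, and ? marks the application-supplied id):
CREATE TABLE video_votes (
    video_id   INT PRIMARY KEY,      -- the PK doubles as the index
    vote_count INT NOT NULL DEFAULT 0
);

-- Like:
UPDATE video_votes SET vote_count = vote_count + 1 WHERE video_id = ?;
-- Unlike:
UPDATE video_votes SET vote_count = vote_count - 1 WHERE video_id = ?;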
You have two alternatives to improve this:
1) Create a new column with the needed value pre-calculated.
2) Create a second table that holds the videos' primary key and the result of the calculation.
In the first case this could be a calculated column; otherwise, modify your app or add triggers to keep the value in sync (you'd need to load it manually the first time, and then let your program keep it updated).
If you use the second option, your key could be composed of the finalRating plus the primary key of the videos table. This way your searches would be hugely improved.
Have you tried moving the arithmetic of the ORDER BY into your SELECT, and then ordering by the resulting column, such as:
SELECT (col1+col2) AS a
FROM TABLE
ORDER BY a
Arithmetic on sort is expensive.

MySQL Query Optimization Tips

I have a game application in which users answer questions, and their rating is based on the time elapsed answering those questions.
I am trying to build a query that returns the rating for the top 20 players. The game has some stages, and I need to retrieve the players who have played all stages (assume the number of stages is 5).
This is what have I wrote:
SELECT `usersname` , `time`
FROM `users`
WHERE `users`.`id`
IN (
SELECT `steps`.`user_id`
FROM `steps`
GROUP BY `steps`.`user_id`
HAVING COUNT( `steps`.`id` ) = 5
)
ORDER BY `time` ASC
LIMIT 20
In the inner SELECT I am selecting all the user_ids who have played 5 stages (steps). The query works correctly, but it's horribly slow: it takes about a minute and a half to execute. Can you provide some tips on optimizing it? The inner SELECT returns about 2000 rows.
Feel free to ask me if you need additional information.
Try with JOIN, instead of IN (SELECT ...):
SELECT usersname , `time`
FROM users
JOIN
( SELECT steps.user_id
FROM steps
GROUP BY steps.user_id
HAVING COUNT(*) = 5
) grp
ON grp.user_id = users.id
ORDER BY `time` ASC
LIMIT 20
Assuming that you have an index on users.time, which is the first obvious optimization, replacing HAVING with WHERE in the inner query may be worth a try.
The query optimizer might do this already if you are lucky, but you cannot rely on it; strictly by the specification, HAVING runs after fetching every record, whereas WHERE prunes them before.
If that does not help, simply having a counter in the users table that increments for every stage completed might speed things up, eliminating the sub-query. This will make completing a stage minimally slower (but that won't happen a million times per second!), while querying only the users who have completed all 5 stages will be very fast (especially if you have an index on that field).
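A minimal sketch of that counter (the column and index names are mine; ? marks the application-supplied id):
ALTER TABLE users ADD COLUMN stages_completed INT NOT NULL DEFAULT 0;
CREATE INDEX idx_stages_time ON users (stages_completed, `time`);

-- On each completed stage (from application code or a trigger):
UPDATE users SET stages_completed = stages_completed + 1 WHERE id = ?;

-- The leaderboard then needs no sub-query at all, and the composite
-- index serves both the WHERE and the ORDER BY:
SELECT usersname, `time`
FROM users
WHERE stages_completed = 5
ORDER BY `time` ASC
LIMIT 20;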
Also, using memcached or some similar caching technology may be worthwhile for something like a highscore, which is typically the kind of data that is not necessarily 100% accurate to the second, changes slowly, and is queried billions of times.
If memcached is not an option, even writing the result to a temp file and re-using that for 1-2 seconds (or even longer) would be an option. Nobody will notice. Even if you cache highscores for as long as 1-2 minutes, still nobody will take offense because that is just "how long it takes".
I think you should use WHERE instead of HAVING. Also, in my opinion, the best way is to do this in a stored procedure: run the inner query, store the results, and run the outer query based on the results of your inner query.
This use case may benefit from de-normalization. There is no need to search through all 2000 user records to determine whether a user belongs among the top 20.
Create a Top_20_Users table.
After the 5th stage, check whether the user's time is less than any in the Top_20_Users table. If so, replace the slowest/worst record.
Things you can do with this.
Since the Top_20_Users table will be so small, add a field for stage and include Top 20 times for each stage as well as for all five stages completed.
Let the Top_20_Users table grow: a history of all top-20 users ever, their times, and the date when that time was good enough to be in the top 20. Show trends as users learn the game and the top 20 times get better and better.
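A minimal sketch of such a table (the column names and types are assumptions; no primary key is declared so the history variant can grow):
CREATE TABLE Top_20_Users (
    user_id     INT NOT NULL,
    stage       TINYINT NOT NULL,       -- 0 = all five stages combined
    `time`      INT NOT NULL,           -- completion time for that stage
    achieved_on DATE NOT NULL,          -- when this time entered the top 20
    KEY idx_stage_time (stage, `time`)  -- fast "worst of the current top 20" lookup
);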