I have a database of 100,000 names in cemeteries. The cemeteries number around 6,000. I wish to return the number of names in each cemetery.
If I do an individual query, it takes a millisecond:
SELECT COUNT(*) FROM tblnames
WHERE tblcemetery_ID = 2
My actual query, below, goes on and on, and I end up killing it so I don't kill our database. Can someone point me at a more efficient method?
SELECT tblcemetery.id,
       (SELECT COUNT(*)
        FROM tblnames
        WHERE tblcemetery_ID = tblcemetery.id) AS casualtyCount
FROM tblcemetery
ORDER BY fldcemetery
You can rephrase your query to use a join instead of a correlated subquery:
SELECT
t1.id,
COUNT(t2.tblcemetery_ID) AS casualtyCount
FROM tblcemetery t1
LEFT JOIN tblnames t2
ON t1.id = t2.tblcemetery_ID
GROUP BY
t1.id
ORDER BY
t1.id
I have heard that in certain databases, such as Oracle, the optimizer is smart enough to recognize a correlated subquery like yours and refactor it into a join under the hood. But the MySQL optimizer might not be smart enough to do this.
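If you want to see what your version of MySQL actually does with either form, you can prepend EXPLAIN to the statement (a quick sketch; the exact output columns vary by version):
EXPLAIN SELECT t1.id, COUNT(t2.tblcemetery_ID) AS casualtyCount
FROM tblcemetery t1
LEFT JOIN tblnames t2 ON t1.id = t2.tblcemetery_ID
GROUP BY t1.id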
One nice side effect of this refactor is that we now see an opportunity to improve performance even more, by adding indices to the join columns. I am assuming that id is the primary key of tblcemetery, in which case it is already indexed. But you could add an index to tblcemetery_ID in the tblnames table for a possible performance boost:
CREATE INDEX cmtry_idx ON tblnames (tblcemetery_ID)
It could even be done without a JOIN, by using an EXISTS clause to keep only cemeteries that have at least one name, with a correlated subquery supplying the count:
SELECT id,
       (SELECT COUNT(*) FROM tblnames WHERE tblcemetery_ID = tblcemetery.id) AS casualtyCount
FROM tblcemetery
WHERE EXISTS (SELECT 1 FROM tblnames WHERE tblcemetery_ID = tblcemetery.id)
ORDER BY id
Or you could read up on GROUP BY and do something like:
SELECT tblcemetery_ID, SUM(1) FROM tblnames GROUP BY tblcemetery_ID
You essentially sum up 1 for each name entry that belongs to a cemetery; since you are not interested in the names at all, there is no need to join to the cemetery detail table.
Not sure if SUM(1) or COUNT(*) is better; both should work.
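For comparison, the COUNT(*) form of the same aggregate:
SELECT tblcemetery_ID, COUNT(*) AS casualtyCount
FROM tblnames
GROUP BY tblcemetery_ID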
You'll only get cemeteries that have people in them, though.
I have an issue with creating a table using a SELECT statement (it runs very slowly). The query is meant to take only the details of the animal with the latest entry date; that result will then be used in an inner join with another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
But because of the slow performance, someone suggested steps to improve it: 1) use a temporary table, 2) store the result and join it to the other statement.
use myzoo
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I only run the SELECT statement, the result is shown in 1-2 seconds. But if I add the CREATE TABLE, the query takes ages (approx. 25 minutes).
Any good approach to improve the query time?
Edit: the bigRegistrations table is around 3.5 million rows.
Can you please try the query below? It takes only the details of the animal with the latest entry date, for use in your inner join. The query you are using is not fetching records as per your requirement, and this one should be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, MAX(c.dateOfEntry) dateOfEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateOfEntry = b.dateOfEntry
I suggest indexing on cageID and dateOfEntry.
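A minimal sketch of that index (the name is my own; a single compound index covers both the GROUP BY cageID and the MAX(dateOfEntry)):
CREATE INDEX idx_bigreg_cage_date ON bigRegistrations (cageID, dateOfEntry)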
This is a multipart question.
Use Temporary Table
Don't use DISTINCT - GROUP BY all the columns to make them distinct (don't forget to check for an index)
Check the SQL Execution plans
The statement you ran does not create a temporary table. Try the following:
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Have you tried doing an explain to see how the plan is different from one execution to the next?
Also, I have found that there can be locking issues in some databases when doing INSERT ... SELECT and table creation using SELECT. I ran the following in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
The reason the query runs so slowly is probably that it is creating the temp table based on all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations that match your join to amusementPart. The single SELECT statement is faster because the optimizer is smart enough to know it only needs to compute the bigRegistrations where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query in the database tuning advisor. My guess is you need to create an index similar to the one below. Notice I index by cageID first, since that is what you join to amusementPart, so that would help SQL narrow the results down the quickest. But I'm guessing a bit - view the query plan or tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)
I am running the below query to retrieve the unique latest result based on a date field within the same table. But this query takes too much time as the table grows. Any suggestion to improve this is welcome.
select t2.*
from (
    select (
        select id
        from ctc_pre_assets ti
        where ti.ctcassettag = t1.ctcassettag
        order by ti.createddate desc
        limit 1
    ) lid
    from (
        select distinct ctcassettag
        from ctc_pre_assets
    ) t1
) ro, ctc_pre_assets t2
where t2.id = ro.lid
order by id
Our table may contain the same row multiple times, but each copy with a different timestamp. My objective: based on a single column, for example the asset tag, I want to retrieve a single row for each asset tag with the latest timestamp.
It's simpler, and probably faster, to find the newest date for each ctcassettag and then join back to find the whole row that matches.
This does assume that no ctcassettag has multiple rows with the same createddate; if one does, you can get back more than one row per ctcassettag.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
INNER JOIN
(
SELECT
ctcassettag,
MAX(createddate) AS createddate
FROM
ctc_pre_assets
GROUP BY
ctcassettag
)
newest
ON newest.ctcassettag = ctc_pre_assets.ctcassettag
AND newest.createddate = ctc_pre_assets.createddate
ORDER BY
ctc_pre_assets.id
EDIT: To deal with multiple rows with the same date.
You haven't actually said how to pick which row you want in the event that multiple rows are for the same ctcassettag on the same createddate. So, this solution just chooses the row with the lowest id from amongst those duplicates.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
WHERE
ctc_pre_assets.id
=
(
SELECT
lookup.id
FROM
ctc_pre_assets lookup
WHERE
lookup.ctcassettag = ctc_pre_assets.ctcassettag
ORDER BY
lookup.createddate DESC,
lookup.id ASC
LIMIT
1
)
This does still use a correlated sub-query, which is slower than a simple nested-sub-query (such as my first answer), but it does deal with the "duplicates".
You can change the rules on which row to pick by changing the ORDER BY in the correlated sub-query.
It's also very similar to your own query, but with one less join.
Nested queries generally take longer than a conventional query. Can you prepend EXPLAIN to the query and put the results here? That will help us analyse the exact query/table which is taking longer to respond.
Check if the table has indexes. Unindexed tables are not advisable (unless obviously required to be unindexed) and are alarmingly slow in executing queries.
On the contrary, I think the best approach is to avoid writing nested queries altogether. Better: run each of the queries separately and then use the results (in array or list format) in the second query, as in the sketch below.
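A sketch of that two-step idea in plain SQL, using a temporary table to hold the first result (table and column names taken from the question; this is essentially the join from the earlier answer, executed in two explicit steps):
CREATE TEMPORARY TABLE newest AS
    SELECT ctcassettag, MAX(createddate) AS createddate
    FROM ctc_pre_assets
    GROUP BY ctcassettag;

SELECT t2.*
FROM ctc_pre_assets t2
JOIN newest ON newest.ctcassettag = t2.ctcassettag
           AND newest.createddate = t2.createddate
ORDER BY t2.id;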
First some questions that you should at least ask yourself, but maybe also give us an answer to improve the accuracy of our responses:
Is your data normalized? If yes, maybe you should make an exception to avoid this brutal subquery problem
Are you using indexes? If yes, which ones, and are you using them to the fullest?
Some suggestions to improve the readability and maybe performance of the query:
- Use joins
- Use group by
- Use aggregators
Example (untested, so might not work, but should give an impression):
SELECT t2.*
FROM (
SELECT id
FROM ctc_pre_assets
GROUP BY ctcassettag
HAVING createddate = max(createddate)
ORDER BY ctcassettag DESC
) ro
INNER JOIN ctc_pre_assets t2 ON t2.id = ro.id
ORDER BY id
Using normalization is great, but there are a few caveats where normalization causes more harm than good. This seems like one of those situations, but without your tables in front of me, I can't tell for sure.
Using distinct the way you are doing, I can't help but get the feeling you might not get all relevant results - maybe someone else can confirm or deny this?
It's not that subqueries are all bad, but they tend to create massive scalability issues if written incorrectly. Make sure you use them the right way (google it?)
Indexes can potentially save you a bunch of time - if you actually use them. It's not enough to set them up; you have to write queries that actually use your indexes. Google this as well.
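For the query in this question, a hedged example of an index the groupwise-max lookups could use (the name is my own):
CREATE INDEX idx_ctc_tag_date ON ctc_pre_assets (ctcassettag, createddate)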
I have a big DB, about 1 million rows. I need to do something like this:
select * from t1 WHERE id1 NOT IN (SELECT id2 FROM t2)
But it works very slowly. I know that I can do it using JOIN syntax, but I can't understand how.
Try this way:
select *
from t1
left join t2 on t1.id1 = t2.id2
where t2.id2 is null
First of all, you should optimize the indexes in both tables; after that, use a join.
There are different ways a dbms can deal with this task:
It can select id2 from t2 and then select all t1 where id1 is not in that set. You suggest this using the IN clause.
It can select record by record from t1 and look for each record if it finds a match in t2. You would suggest this using the EXISTS clause.
You can outer join the tables, then throw away all matches and keep only the non-matching entries. This may look like a bad approach, especially when there are many matches, because you build a big intermediate result and then throw most of it away. However, depending on how the DBMS works, it can be rather fast, for example when it applies hash join techniques.
It all depends on table sizes, number of matches, indexes, etc., and on what the DBMS makes of your query. There are DBMSs that are able to completely rewrite your query to find the best execution plan.
Having said all this, you can just try different things:
the IN clause with (SELECT DISTINCT id2 FROM t2). DISTINCT can reduce the intermediate result significantly and really speed up your query. (But maybe your dbms does that anyhow to get a good execution plan.)
use an EXISTS clause and see if that is faster (see the sketch after this list)
the outer join suggested by Parado
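For the EXISTS option, a minimal sketch of the anti-join written as NOT EXISTS (same tables and columns as in the question):
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.id2 = t1.id1)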
Apologies if this has been asked before, but is there any way at all I can optimize this query to run faster? At the minute it takes about 2 seconds, which, while not a huge amount, makes it the slowest query on my site; all other queries take less than 0.5 seconds.
Here is my query:
SELECT SQL_CALC_FOUND_ROWS MAX(images.id) AS maxID, celebrity.* FROM images
JOIN celebrity ON images.celeb_id = celebrity.id
GROUP BY images.celeb_id
ORDER BY maxID DESC
LIMIT 0,20
Here is the EXPLAIN output:
id  select_type  table      type  possible_keys  key       key_len  ref                             rows  Extra
1   SIMPLE       celebrity  ALL   PRIMARY        NULL      NULL     NULL                            536   Using temporary; Using filesort
1   SIMPLE       images     ref   celeb_id       celeb_id  4        celeborama_ignite.celebrity.id  191
I'm at a loss as to how to improve the performance of this query further. I'm not super familiar with MySQL, but I do know that it is slow because I am sorting on the value computed by MAX(), which has no index. I can't drop the sort, as it gives me the results I need, but is there something else I can do to prevent it from slowing down the query?
Thanks.
If you really need a fast solution, then don't perform such queries at runtime.
Just create an additional field, last_image_id, in the celebrity table and update it whenever a new image is uploaded (by a trigger or your application logic, it doesn't matter which).
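A minimal sketch of that denormalization (the column, index, and trigger names are my own):
ALTER TABLE celebrity ADD COLUMN last_image_id INT, ADD INDEX idx_last_image (last_image_id);

CREATE TRIGGER trg_images_last_id
AFTER INSERT ON images
FOR EACH ROW
UPDATE celebrity SET last_image_id = NEW.id WHERE id = NEW.celeb_id;

The top 20 then becomes a simple indexed read:
SELECT * FROM celebrity ORDER BY last_image_id DESC LIMIT 0,20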
I would get the latest image this way:
SElECT c.*, i.id AS image_id
FROM celebrity c
JOIN images i ON i.celeb_id = c.id
LEFT OUTER JOIN images i2 ON i2.celeb_id = c.id AND i2.id > i.id
WHERE i2.id IS NULL
ORDER BY image_id DESC
LIMIT 0,20;
In other words, try to find a row i2 for the same celebrity with a higher id than i.id. If the outer join fails to find that match, then i.id must be the max image id for the given celebrity.
SQL_CALC_FOUND_ROWS can cause queries to run extremely slowly. I've found some cases where just removing the SQL_CALC_FOUND_ROWS made the query run 200x faster (but it could also make only a small difference in other cases, it depends on the table, so you should test both ways).
If you need the equivalent of SQL_CALC_FOUND_ROWS, just run a separate query:
SELECT COUNT(*) FROM celebrity;
I think you need a compound index on (celeb_id, id) in the images table (supposing it's a MyISAM table), so the GROUP BY celeb_id and MAX(id) can use this index.
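A sketch of that index (the name is my own):
CREATE INDEX idx_images_celeb_id ON images (celeb_id, id)
With it, MySQL can often resolve each celebrity's MAX(id) from the index alone, without touching the table rows.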
But with big tables, you'll probably have to follow #zerkms' advice and add a new column to the celebrity table.
MySQL doesn't always perform well with joins. I would recommend dividing your query in two: first select the celebrities, then select their images. Simply avoid joins.
Check out this link - http://phpadvent.org/2011/a-stitch-in-time-saves-nine-by-paul-jones
SELECT STRAIGHT_JOIN *
FROM (
SELECT MAX(id) as maxID, celeb_id as id
FROM images
GROUP BY celeb_id
ORDER by maxID DESC
LIMIT 0, 20) as ids
JOIN celebrity USING (id);
The query does not allow the row count to be precalculated, but an additional:
SELECT COUNT(DISTINCT celeb_id)
FROM images;
or even (if each celebrity has an image):
SELECT COUNT(*) FROM celebrity;
will not cost much, because it can easily be cached by the query cache (if it is not switched off).
The following query gets the info that I need. However, I noticed that as the tables grow, my code gets slower and slower. I'm guessing it is this query. Can this be written a different way to make it more efficient? I've heard a lot about using joins instead of subqueries; however, I don't "get" how to do it.
SELECT * FROM
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T
LEFT OUTER JOIN subtransactions AS S
ON T.id = S.transid
GROUP BY T.id
ORDER BY T.position) AS result2
WHERE MAXid = id
Your code:
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T [<--- here ]
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T [<--- here ]
Notice the [<--- here ] marks I added to your code.
The first T is not in any way related to the second T. They have the same correlation alias, they refer to the same table, but they're entirely independent selects and results.
So what you're doing in the first, uncorrelated, subquery is getting the max id for all positions in transactions.
And then you're joining all transaction.position.max(id)s to result2 (which happens to be a join of all transaction positions to subtransactions). (And the internal ORDER BY is pointless and costly, too, but that's not the main problem.)
You're joining every transaction.position.max(id) to every row result2 selects.
So you're getting a Cartesian join - every result1 joined to every result2, unconditionally (nothing tells the database, for example, that they ought to be joined by (max) id or by position).
So if you have ten unique position.max(id)s in transactions, you're getting 100 rows. A thousand unique positions, a million rows. Etc.
On edit, after getting home: OK, you're not actually getting a Cartesian product; the WHERE MAXid = id does join result1 to result2. But you're still rolling up all rows of transactions in both subqueries.
When you want to write a complicated query like this, it's a lot easier if you compose it out of simpler views. In particular, you can test each view on its own to make sure you're getting reasonable results, and then just join the views.
I would split the query into smaller chunks, probably using a stored proc. For example, get the max ids from transactions and put them in a table variable, then join that with subtransactions, as in the sketch below. This will make it easier for you and the compiler to work out what is going on.
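A sketch of that idea in MySQL, which has no table variables outside stored programs, so a temporary table stands in for one (the owned_shares column is omitted for brevity, and T.id is assumed to be the primary key):
CREATE TEMPORARY TABLE max_ids AS
    SELECT position, MAX(id) AS maxid    -- one row per position, holding its newest transaction id
    FROM transactions
    GROUP BY position;

SELECT T.*, IFNULL(SUM(S.shares), 0) AS subtrans_shares
FROM max_ids m
JOIN transactions T ON T.id = m.maxid
LEFT JOIN subtransactions S ON S.transid = T.id
GROUP BY T.id;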
Also, without knowing what indexes are on your tables, it is hard to offer more advice.
Put a benchmark function in the code. Then time each section of the code to determine where the slowdown is happening. Oftentimes the slowdown happens in a different query than you first guess. Determine the correct query that needs to be optimized before posting to Stack Overflow.
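If you want to time an individual statement at the SQL level, a minimal sketch (MySQL 5.6+ for NOW(3)):
SET @t0 = NOW(3);
-- run the query under test here
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, NOW(3)) / 1000 AS elapsed_ms;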