speed up SELECT DISTINCT using keys - mysql

If I use a SELECT DISTINCT query on a table with 100 rows, where 98 entries are identical to each other and the remaining 2 are identical to each other (so there are 2 distinct values), would it still go through all 100 rows just to return the 2 distinct results?
Is there a way to use indexing/keys etc so that instead of going through all 100 rows, it would instead go through 2 rows?
#### EDIT ####
So I added this index:
KEY `column` (`column`(1)),
but then when I do
EXPLAIN SELECT DISTINCT column FROM tablename
it still says that it is going through all rows rather than just the distinct values.

Creating an index on the column or set of columns being queried with DISTINCT will speed up the query. Rather than looking through every row it will use the two entries in the index. With only 100 rows though, the difference may not even be detectable.
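For example, a minimal sketch (the index and column names here are placeholders, not taken from the question):
ALTER TABLE tablename ADD INDEX idx_mycolumn (mycolumn);  -- index the full column, not a 1-character prefix
EXPLAIN SELECT DISTINCT mycolumn FROM tablename;
If MySQL can skip between distinct values via the index (a "loose index scan"), EXPLAIN reports "Using index for group-by" in its Extra column; a plain "Using index" still means the whole index is scanned.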

I am working on something very similar: I am trying to get the distinct values from a table with 400 million rows.
I even have a key on that attribute, and it is still doing a full scan; the only difference is that it is a full index scan rather than a disk scan.
I have only 10 distinct values, but the query had not returned even after 5 minutes, so I killed it.

Related

Optimize LIMIT the number of rows to be SELECT in SQL

Consider a table Test having 1000 rows:
Test Table
id      name    desc
1       Adi     test1
2       Sam     test2
3       Kal     test3
...
1000    Jil     test1000
If I need to fetch only, say, 100 rows (i.e. a small subset), I use the LIMIT clause in my query:
SELECT * FROM test LIMIT 100;
My assumption is that this query first fetches all 1000 rows and then returns 100 of them.
Can this be optimised, so that the DB engine queries only 100 rows and returns them
(instead of fetching all 1000 rows first and then returning 100)?
The reason for the above supposition is that the logical order of processing is:
FROM
WHERE
SELECT
ORDER BY
LIMIT
You can combine LIMIT row_count with an ORDER BY. This causes MySQL to stop sorting as soon as it has found the first row_count rows of the sorted result.
Hope this helps; if you need any clarification, just drop a comment.
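A minimal sketch against the Test table above (the index is an assumption, not from the question):
ALTER TABLE test ADD INDEX idx_name (name);
-- MySQL can walk idx_name in order and stop after 100 rows,
-- instead of sorting all 1000 first:
SELECT name FROM test ORDER BY name LIMIT 100;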
The query you wrote will fetch only 100 rows, not 1000. But, if you change that query in any way, my statement may be wrong.
GROUP BY and ORDER BY are likely to incur a sort, which is arguably even slower than a full table scan. And that sort must be done before seeing the LIMIT.
Well, not always...
SELECT ... FROM t ORDER BY x LIMIT 100;
together with INDEX(x) -- This may use the index and fetch only 100 rows from the index. BUT... then it has to reach into the data 100 times to find the other columns that you ask for. UNLESS you only ask for x.
Etc, etc.
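To make the "UNLESS you only ask for x" point concrete (a sketch, assuming INDEX(x) exists):
-- covered by the index: MySQL reads 100 index entries and never touches the table rows
SELECT x FROM t ORDER BY x LIMIT 100;
-- not covered: 100 index entries, plus 100 lookups into the table for the other columns
SELECT * FROM t ORDER BY x LIMIT 100;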
And here's another wrinkle. A lot of questions on this forum are "Why isn't MySQL using my index?" Back to your query: if there are "only" 1000 rows in your table, my example with the ORDER BY x won't use the index, because it is faster to simply read through the table, tossing 90% of the rows. On the other hand, if there were 9999 rows, it would use the index. (The transition is somewhere around 20%, but that figure is imprecise.)
Confused? Fine. Let's discuss one query at a time. I can [probably] discuss the what and why of each one you throw at me. Be sure to include SHOW CREATE TABLE, the full query, and EXPLAIN SELECT... That way, I can explain what EXPLAIN tells you (or does not).
Did you know that having both a GROUP BY and ORDER BY may cause the use of two sorts? EXPLAIN won't point that out. And sometimes there is a simple trick to get rid of one of the sorts.
There are a lot of tricks up MySQL's sleeve.
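One such trick, sketched with assumed table and column names (the behavior also depends on MySQL version; before 8.0, GROUP BY itself sorted by the grouping columns):
-- may need two sorts: one to group by code, one to order by the aggregate
SELECT code, COUNT(*) AS cnt FROM t GROUP BY code ORDER BY cnt;
-- at most one sort: the ORDER BY matches the GROUP BY columns
SELECT code, COUNT(*) AS cnt FROM t GROUP BY code ORDER BY code;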

mysql query speed at table which has 1.5million rows

It takes around 5 seconds to get the result of a query from a table containing 1.5 million rows. The query is "select * from table where code=x".
Is there a setting to increase speed? Or should I jump to another database apart from MySQL?
You could index the code column. Note that the trade-off is that inserting new rows, or updating the code column on existing rows, will be slowed down a bit, since the index also needs to be updated. In any event, you should benchmark the improvement to make sure it's worth it.
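For example (a sketch; substitute your real table name for t):
ALTER TABLE t ADD INDEX idx_code (code);
-- then verify the index is actually used:
EXPLAIN SELECT * FROM t WHERE code = 'x';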
WHERE code=x -- needs INDEX(code)
SELECT * when many of the columns are bulky: Large columns are stored "off-record". Hence they take longer to fetch. So, explicitly list the columns you really need, hoping to leave out some of the bulky columns.
When a GROUP BY or LIMIT is involved, it is sometimes best to do
SELECT y.*   -- or, better, an explicit list of columns
FROM ( SELECT id FROM t WHERE ... /* group by or limit here */ ) AS x
JOIN t AS y USING(id)
etc.
That is, start by finding just the ids as simply as possible, then JOIN back to the original table and other table(s). (This is not the case you presented, but I worry that you over-simplified it.)
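A concrete sketch of that pattern (the column names are assumptions):
SELECT y.id, y.code, y.bulky_description
FROM ( SELECT id FROM t WHERE code = 'x' ORDER BY id LIMIT 100 ) AS x
JOIN t AS y USING(id);
The derived table can be resolved from a slim index such as INDEX(code, id); only the 100 surviving ids reach back into the wide rows.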

I always have a "WHERE date" in all my SQL queries. Speed up?

I have a large table with hundreds of thousands of rows. However, only about 50,000 rows are actually "active" and part of my queries, because I only select the rows that have been updated in the last 14 days, with WHERE crdate > "2014-08-10". So to speed up queries against this table, I am considering the following options (or maybe you have another suggestion?):
I can delete all old entries and insert them into a "history" table with a cronjob running every day/week. However, this will still make the history table slow if I want to query that one.
I can make an index on my "crdate" column. However, my dates are in the format "2014-08-10 06:32:59", so I guess that because it stores so many different values, the index will be quite large(?) and potentially slow(?).
Do you have any other suggestions for how I can speed up queries against this table? Is it a bad idea to put an index on a date column that has so many different values?
1st rule of databases. Always have indexes on columns you are filtering on.
So yes, put an index on crdate.
You can also go with a history table in parallel, but make sure you put the index on the crdate column in the history table too. Having the history table will allow you to have a smaller index in the main table.
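For example (a sketch; the table name is an assumption):
ALTER TABLE mytable ADD INDEX idx_crdate (crdate);
-- the range predicate can now use the index:
SELECT * FROM mytable WHERE crdate > '2014-08-10';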
I wanted to add to this for future Googlers: if you are querying a datetime column, a more specific predicate results in a more efficient query. For example,
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015 00:00:00'
will be faster than:
SELECT * FROM MyTable WHERE MyDateTime = '01/01/2015'
I tested this repeatedly on a view of 5 million rows indexed by datetime; the more specific query consistently responded about 1 second faster.

select count(*) taking considerably longer than select * for same "where" clause?

I am finding that a select count(*) takes considerably longer than a select * for queries with the same where clause.
The table in question has about 2.2 million records (call it detailtable). It has a foreign key field linking to another table (maintable).
This query takes about 10-15 seconds:
select count(*) from detailtable where maintableid = 999
But this takes a second or less:
select * from detailtable where maintableid = 999
UPDATE - I was asked to specify the number of records involved. It is 150.
UPDATE 2 - Here is the information when the EXPLAIN keyword is used.
For the SELECT COUNT(*), the Extra column reports:
Using where; Using index
key and possible_keys both have the foreign key as their value.
For the SELECT * query, everything is the same, except Extra just says:
Using where
UPDATE 3 Tried OPTIMIZE TABLE and it still does not make a difference.
For sure
select count(*)
should be faster than
select *
count(*), count(1), and count(primary key) are all the same; count(field) is the same too, unless the field can be NULL (NULLs are not counted).
Your EXPLAIN clearly states that the optimizer somehow uses the index for count(*) and not for the other query, making the foreign key the main reason for the delay.
Eliminate the foreign key.
Try
select count(PRIKEYFIELD) from detailtable where maintableid = 999
count(*) will get all the data from the table and then count the rows, meaning it has more work to do.
Using the primary key field means it uses that key's index, and it should run faster.
Thread Necro!
Crazy idea... In some cases, depending on the query planner and the table size, etc, etc., it is possible for using an index to actually be slower than not using one. So if you get your count without using an index, in some cases, it could actually be faster.
Try this:
SELECT count(*)
FROM detailtable USE INDEX ()  -- an empty USE INDEX () tells the optimizer to use no indexes
WHERE maintableid = 999
SELECT count(*)
with that syntax alone is no problem; you can run it against any table.
The main issue in your scenario is the proper use of an INDEX and of the WHERE clause in your search.
Try to reconfigure your index if you have the chance.
If the table is too big, yes, it may take time. Also try checking the MyISAM locking article.
As the table has 2.2 million records, counting can take time: technically, MySQL must find the matching records and then count them, an extra operation that becomes significant with millions of records. The only way to make it faster is to cache the result in another table and update it behind the scenes.
Or simply try:
SELECT count(1) FROM table_name WHERE _condition;
SELECT count('x') FROM table_name WHERE _condition;
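To illustrate the "cache the result in another table" idea above, here is a minimal sketch of a counter table maintained by a trigger (all names are assumptions, and matching UPDATE/DELETE triggers would be needed in practice):
CREATE TABLE detailcounts (
  maintableid INT PRIMARY KEY,
  cnt INT NOT NULL DEFAULT 0
);
CREATE TRIGGER trg_detail_ins AFTER INSERT ON detailtable
FOR EACH ROW
  INSERT INTO detailcounts (maintableid, cnt) VALUES (NEW.maintableid, 1)
  ON DUPLICATE KEY UPDATE cnt = cnt + 1;
-- the count then becomes a single-row lookup:
SELECT cnt FROM detailcounts WHERE maintableid = 999;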

mysql Subquery with JOIN bad performance

My problem is this:
select * from
(
  select * from barcodesA
  UNION ALL
  select * from barcodesB
) as barcodesTOTAL, boxes
where barcodesTOTAL.code = boxes.code;
Table barcodesA has 4000 entries
Table barcodesB has 4000 entries
Table boxes has about 180,000 entries.
It takes 30 seconds to process the query.
Another problematic query:
select * from
viewBarcodesTotal, boxes
where viewBarcodesTotal.code = boxes.code;
viewBarcodesTotal contains the UNION ALL from both barcodes tables. It also takes forever.
Meanwhile,
select * from barcodesA , boxes where barcodesA.code=boxes.code
UNION ALL
select * from barcodesB , boxes where barcodesB.code=boxes.code
This one takes <1 second.
The question is obviously WHY? Is my code buggy? Is MySQL buggy?
I have to migrate from Access to MySQL, and I would have to rewrite all my code if the first option is buggy.
Add an index on boxes.code if you don't already have one. Joining 8000 records (4K+4K) to the 180,000 will benefit from an index on the 180K side of the equation.
Also, be explicit and specify the fields you need in your SELECT statements. Using * in a production query is bad form, as it encourages not thinking about which fields you need (and how big they might be), not to mention that you have two different tables in your example, barcodesA and barcodesB, with potentially different data types and column orders that you're UNIONing.
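For example (a sketch; the index name is arbitrary):
CREATE INDEX idx_boxes_code ON boxes (code);
-- confirm the join now uses it:
EXPLAIN SELECT * FROM barcodesA, boxes WHERE barcodesA.code = boxes.code;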
The REASON for the performance difference:
The first query says: first, build a complete union of EVERY record in A with EVERY record in B, THEN join it to boxes on code. That union is a derived table, with no index to optimize the join against.
Your SECOND query form joins each table individually, so each join IS optimized (apparently there IS an index, given the sub-second performance, but I would ensure both barcode tables also have an index on the "code" column; see the sketch below).
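A sketch of those index additions (the index names are arbitrary):
CREATE INDEX idx_barcodesA_code ON barcodesA (code);
CREATE INDEX idx_barcodesB_code ON barcodesB (code);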