SQL performance on one column - mysql

I have a table with 510,085 rows, which is now pushing me to seek higher performance. One of the fields in this table is called 'photoStatus'.
In 'photoStatus', 510,045 rows contain the word 'Active' and the remaining 40 contain the word 'Suspended'.
Which of these two queries would be faster to search for 'Active' photos or doesn't it matter?
WHERE photoStatus = 'Active'
Or
WHERE photoStatus <> 'Suspended'
Obviously this is part of a massive query, it's not just one WHERE condition.
Database is MySQL (MyISAM)

Why not convert the column to a boolean or a numeric value? That would be much faster than a string comparison, and then you could just do:
....
WHERE isActive;
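If you go that route, here is a minimal sketch (the photos table name and the isActive column are assumptions, not from the original schema):

```sql
-- Add a flag column and backfill it from the existing string column.
ALTER TABLE photos ADD COLUMN isActive TINYINT(1) NOT NULL DEFAULT 1;
UPDATE photos SET isActive = (photoStatus = 'Active');
-- The filter then becomes a cheap integer test:
-- ... WHERE isActive = 1
```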

If you have an index on that column, WHERE photoStatus = 'Active' will be faster, since the server can just scan the range of the index matching 'Active'.

The second will be "a little" faster, because it does not have to compare the whole string: according to the database's comparison algorithm, comparing the first character is enough to include or exclude a row.

Be sure to use an index on that field, and EXPLAIN the query to see how efficient it is.
Other than that: your condition would filter out just 40 rows, so the rest of the query has to be efficient.
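For example (the photos table name and the index name are assumptions), you can create the index and then check what the optimizer does with it:

```sql
CREATE INDEX idx_photoStatus ON photos (photoStatus);
EXPLAIN SELECT * FROM photos WHERE photoStatus = 'Active';
-- In the EXPLAIN output, look at the key and rows columns: if key is NULL,
-- the optimizer has chosen a full table scan despite the index.
```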

It's going to have to scan the index or the table either way.
Personally, I would always match with equals:
WHERE photoStatus = 'Active'
And I would always use an int or boolean column; that is better than matching a string.

A normal index won't help in this scenario, since the percentage of returned rows is too large.
So the database will have to look at each row. There might be some difference, depending on how fast an equal vs. not-equal comparison is, but that should be negligible.
So I expect both to run at pretty much the same speed.

You have posted too few details to find a shortcut for your query.
As it appears, you need a full scan. In that case you can try to read the table in parallel.
I don't know what DBMS you use, but in Oracle you can use a hint: select /*+ parallel(yourtable 8) */ * from yourtable
What are you trying to do with this data? What types of queries are slow? Can you give an example? There are many tricks, and many possible mistakes. And not all queries have to be fast: a query behind a UI must respond in under a second, but an admin task may take a minute :)

WHERE photoStatus = 'Active' is better if you have an index on that column, based on a small test similar to your example.
I compared the execution plans in SQL Server; the shorter plan belongs to the equality comparison, and it reports better performance. If you don't have an index, the query cost is similar.

Firstly, .5M rows is not a large table - by ANY means.
A column with values like "Active" / "Inactive" is likely to be pretty useless as an index by itself, because it doesn't have enough selectivity to make an index scan beneficial (in fact, if a value covers 50% of the rows in the table, a table scan would probably be better).
I suspect that in fact, "Active" has nothing to do with your problem - after all, you're not trying to return .5M rows to the client are you?
A query which returns .5M rows is not going to be fast, because just returning the rows takes a (relatively) long time.
Anyway, my answer: it makes no difference; you need to check the other parts of your query. Post a question with the full query, the table structure, and the EXPLAIN output.

Related

Can I use index in MySQL in this way? [duplicate]

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't vary so much with the type of query as with the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the indexed rows (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
#Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId) Where EmployeeTypeId IN (1,2,3)
This query will search using the index you have created (MySQL's hint syntax is USE INDEX (index_name), placed after the table name). It works for me. Please give it a try.

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays a set of data (a part of it, actually) to the user. It also allows the user to order it by a custom field. So in the end it all comes down to a query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine as long as the table was relatively small (it was used only locally in our department). But now we have to scale this application, and let's assume the table has about a million records (we expect that to happen soon). What will happen with the ordering? Do I understand correctly that, in order to run this query, MySQL will have to sort a million records each time and return a part of the result? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and don't let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order, I believe the indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY,
although it still uses indexes to find the rows that match the WHERE
clause. These cases include the following:
....
The key used to
fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like MySQL will have a problem with such a query. So, what do I do: not use an ORDER BY part at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
Keep in mind though that there is no such a thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower, usually the effect is negligible but only testing will show
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index only holds the fields required plus the PK of the originating table, which is usually a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than when not having the index, but keep this in mind when doing something like SELECT * FROM ... : do you really need all the fields?
In the example you use active and name but from the text I get that these might be 'dynamic' in which case you'd have to foresee all kinds of combinations. From a practical point this might not be feasible as each index will come with the downsides of above and each time you add an index you'll add supra to that list again (cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
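Sticking with the example query from the question, a sketch of the combined index and the check that it removes the sort (index name is an assumption):

```sql
CREATE INDEX idx_active_name ON mytable (active, name);

EXPLAIN SELECT name, info, description
FROM mytable
WHERE active = 1
ORDER BY name
LIMIT 0, 50;
-- With the composite index, the Extra column should no longer show
-- "Using filesort": rows with active = 1 are read in name order directly.
```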
EXPLAIN your query and check whether it resorts to filesort.
If ORDER BY cannot use any index, or if the MySQL optimizer prefers to avoid the existing index(es) for sorting, it falls back to filesort.
Now, if you're getting filesort, you should preferably either avoid ORDER BY or create appropriate index(es).
If the data is small enough, filesort operates in memory; otherwise it goes to disk.
So you may also try changing the sort_buffer_size variable.
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to increase the sort buffer size before running it, e.g.:
set sort_buffer_size=100000;
If this size is increased too far, though, performance will start decreasing again.
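Note that sort_buffer_size can be set per session, so you can raise it for one connection without affecting the whole server (the value and table below are just illustrations):

```sql
SET SESSION sort_buffer_size = 2097152;  -- 2 MB, for this connection only
SELECT name FROM mytable ORDER BY name LIMIT 50;
```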

MYSQL IN vs <> performance

I have a table where I have a status field which can have values like 1,2,3,4,5. I need to select all the rows from the table with status != 1. I have the following 2 options:
NOTE that the table has INDEX over status field.
SELECT ... FROM my_tbl WHERE status <> 1;
or
SELECT ... FROM my_tbl WHERE status IN(2,3,4,5);
Which of the above is a better choice? (my_tbl is expected to grow very big).
You can run your own tests to find out, because it will vary depending on the underlying tables.
More than that, please don't worry about "fastest" without having first done some sort of measurement that it matters.
Rather than worrying about fastest, think about which way is clearest.
In databases especially, think about which way is going to protect you from data errors.
It doesn't matter how fast your program is if it's buggy or gives incorrect answers.
How many rows have the value "1"? If less than ~20%, you will get a table scan regardless of how you formulate the WHERE (IN, <>, BETWEEN). That's assuming you have INDEX(status).
But indexing ENUMs, flags, and other things with poor cardinality is rarely useful.
An IN clause with 50K items causes memory problems (or at least used to), but not performance problems. They are sorted, and a binary search is used.
Rule of Thumb: The cost of evaluation of expressions (IN, <>, functions, etc) is mostly irrelevant in performance. The main cost is fetching the rows, especially if they need to be fetched from disk.
An INDEX may assist in minimizing the number of rows fetched.
you can use BENCHMARK() to test it yourself.
http://sqlfiddle.com/#!2/d41d8/29606/2
The first one is faster, which makes sense, since it only has to compare 1 number instead of 4.
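A minimal BENCHMARK() comparison along those lines (the iteration count is arbitrary; the interesting number is the elapsed time the client reports, since the function itself always returns 0):

```sql
SELECT BENCHMARK(10000000, 2 <> 1);
SELECT BENCHMARK(10000000, 2 IN (2,3,4,5));
```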

Is it OK to index all the fields in this mysql query?

I have this MySQL query, and I am not sure what the implications of indexing all the fields in the query are. I mean, is it OK to index all the fields in the CASE statement, the JOIN, and the WHERE clause? Are there any performance implications of indexing fields?
SELECT roots.id as root_id, root_words.*,
CASE
WHEN root_words.title LIKE '%text%' THEN 1
WHEN root_words.unsigned_title LIKE '%normalised_text%' THEN 2
WHEN unsigned_source LIKE '%normalised_text%' THEN 3
WHEN roots.root LIKE '%text%' THEN 4
END as priorities
FROM roots INNER JOIN root_words ON roots.id=root_words.root_id
WHERE (root_words.unsigned_title LIKE '%normalised_text%') OR (root_words.title LIKE '%text%')
OR (unsigned_source LIKE '%normalised_text%') OR (roots.root LIKE '%text%') ORDER BY priorities
Also, How can I further improve the speed of the query above?
Thanks!
You index columns in tables, not queries.
None of the search criteria you've specified will be able to make use of indexes (since the search terms begin with a wild card).
You should make sure that the id column is indexed, to speed the JOIN. (Presumably, it's already indexed as a PRIMARY KEY in one table and a FOREIGN KEY in the other).
To speed up this query you will need to use full text search. Adding indexes will not speed up this particular query and will cost you time on INSERTs, UPDATEs, and DELETEs.
Caveat: Indexes speed up retrieval time but cause inserts and updates to run slower.
To answer the implications of indexing every field, there is a performance hit when using indexes whenever the data that is indexed is modified, either through inserts, updates, or deletes. This is because SQL needs to maintain the index. It's a balance between how often the data is read versus how often it is modified.
In this specific query, the only index that could possibly help would be in your JOIN clause, on the fields roots.id and root_words.root_id.
None of the checks in your WHERE clause could be indexed, because of the leading '%'. This causes SQL to scan every row in these tables for a matching value.
If you are able to remove the leading '%', you would then benefit from indexes on these fields... if not, you should look into implementing full-text search; but be warned, this isn't trivial.
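A sketch of what the full-text route could look like on the tables from the question (MyISAM, or InnoDB in MySQL 5.6+, supports FULLTEXT indexes; note that MATCH ... AGAINST does word-based matching, so it behaves differently from LIKE '%text%' substring matching):

```sql
ALTER TABLE root_words ADD FULLTEXT (title, unsigned_title);

SELECT * FROM root_words
WHERE MATCH (title, unsigned_title) AGAINST ('text');
```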
Indexing won't help when used in conjunction with LIKE '%something%'.
It's like looking for words in a dictionary that have ae in them somewhere. The dictionary (or Index in this case) is organised based on the first letter of the word, then the second letter, etc. It has no mechanism to put all the words with ae in them close together. You still end up reading the whole dictionary from beginning to end.
Indexing the fields used in the CASE clause will likely not help you. Indexing helps by making it easy to find records in a table. The CASE clause is about processing the records you have found, not finding them in the first place.
Optimisers can also struggle with optimising multiple unrelated OR conditions such as yours. The optimiser is trying to narrow down the amount of effort to complete your query, but that's hard to do when unrelated conditions could make a record acceptable.
All in all, your query would benefit from indexes on root_words(root_id) and/or roots(id), but not much else.
If you were to index additional fields though, the two main costs are:
- Increased write time (insert, update or delete) due to additional indexes to write to
- Increased space taken up on the disk

mysql where condition

I'm interested in the WHERE condition. If I write:
Select * from table_name
where insert_date > '2010-01-03'
and text like '%friend%';
is it different from:
Select * from table_name
where text like '%friend%'
and insert_date > '2010-01-03';
I mean: if the table is very big, with a lot of rows, and MySQL first takes the records matching the condition "where insert_date > '2010-01-03'" and then searches those records for the word "friend", it could be much faster than first searching for "friend" rows and then looking at the date field.
Is it important to write the WHERE condition smartly, or does MySQL analyze the condition and rewrite it in the best way?
thanks
No, the two where clauses should be equivalent. The optimizer should pick the same index whichever you use.
The order of columns in an index does matter though.
If you think the optimizer is using the wrong index, you could give it a hint. More often than not though, there's a good reason for using the index it has chosen to use, so unless you know exactly what you are doing, giving the optimizer hints will often make things worse not better.
I don't know about MySQL in particular, but typically this kind of optimization is left to the database engine, as which order is faster depends on indexes, cardinality of data, and quantity of data among other things.
I think it's true that both WHERE clauses are equivalent at the database abstraction level.
By definition, a logical conjunction (the AND operator) is commutative. This means that WHERE A AND B is equal to WHERE B AND A.
It makes no difference in which order you write your conditions.
However, what makes a difference is what indexes you have in place on your table. The query analyzer takes these into account. It is also smart enough to find the part of the condition that is easiest to check and apply that one first.
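A quick way to convince yourself is to EXPLAIN both orderings; they should produce identical plans (same key, same rows estimate):

```sql
EXPLAIN SELECT * FROM table_name
WHERE insert_date > '2010-01-03' AND text LIKE '%friend%';

EXPLAIN SELECT * FROM table_name
WHERE text LIKE '%friend%' AND insert_date > '2010-01-03';
```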