I am wondering how MySQL finds the rows in a table when searching like so:
select * from table where field = 'text';
Does it use a particular search algorithm? Is it practically the fastest way to look up information in a table? Or would building a search macro using another algorithm (like Boyer-Moore) work faster?
If there is an index on field, then databases often use a B-tree for the lookup. If there is no index, then the entire table is scanned. This page describes some of the techniques MySQL uses:
http://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html
Many hours of work have gone into optimizing MySQL. Take advantage of the work already done, and resist trying to redo it.
For that query, it can do nothing other than scan every row of the table and compare its field column against that string.
Boyer-Moore isn't needed because the query asks for exact equality, not whether the field contains that string.
If you are interested in how it found those records, try executing the query with the EXPLAIN keyword:
EXPLAIN select * from table where field = 'text';
I would recommend looking at this article to get a better understanding of what is happening in the background.
I would be very surprised if you would be able to write something on your own that is faster. You could look at creating indexes on the table in question to speed up selects.
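For example, adding an index on the column in the WHERE clause usually turns the full scan into an index lookup. A minimal sketch using the names from the question (back-quoted because table is a reserved word):
CREATE INDEX idx_field ON `table` (`field`);
EXPLAIN SELECT * FROM `table` WHERE `field` = 'text';
Running EXPLAIN before and after the index is created should show the access type change from a full scan to an index lookup.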
Related
If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each value in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
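As an illustration of that last point, here is a hypothetical IN-subquery and the same requirement restated as a join; the EmployeeType table and its IsActive column are invented for the example:
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId FROM EmployeeType t WHERE t.IsActive = 1);

-- the same requirement restated as a join
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.IsActive = 1;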
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if the query won't access more than a certain percentage of the indexed rows (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows but relatively few unique values in the column, it may also be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
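A small illustration of the datatype point, using an invented measurements table with an integer reading column that has an index on it (whether the planner actually skips the index varies by DBMS and version, so treat this as a sketch):
-- hypothetical table: measurements(reading INT), with an index on reading
SELECT * FROM measurements WHERE reading = 42.0;  -- float literal: the index may be ignored
SELECT * FROM measurements WHERE reading = 42;    -- matching type: the index can be used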
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
@Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created. It works for me; please give it a try.
The webpage in question is https://www.christart.com/poetry/
I have a MySQL table with a little over 7,000 poem entries. I'm getting requests from my users to be able to run queries against the body of the poems, but the bodies are saved in a 'text' column.
I know how to write the SQL statement; that's easy enough. My concern is the load on the database. I always index columns that are queried or joined on, but I can't index a 'text' column.
There must be a way. How should I approach this?
You could use a full text index:
CREATE FULLTEXT INDEX poem_contents ON poems(body);
And then search using match:
SELECT *
FROM poems
WHERE MATCH(body) AGAINST ('some phrase' IN BOOLEAN MODE)
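In boolean mode you also get operators such as + (must contain), - (must not contain), and double quotes for exact phrases, for example:
SELECT *
FROM poems
WHERE MATCH(body) AGAINST ('+love -war "still waters"' IN BOOLEAN MODE)
Note that InnoDB tables only support FULLTEXT indexes as of MySQL 5.6; on older versions the table has to be MyISAM.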
There's no reason that you can't index a text field. That being said, there's probably very little value in indexing a text field that contains entire poems.
If your database only has 7,000 rows, you probably won't see a massive performance hit unless it grows much larger than it currently is. At a larger scale, a better solution would probably be to extract keywords from the body and search on those, as sketched below.
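One possible shape for that keyword approach (the poem_keywords table and the poems.id column are assumptions, not your actual schema):
CREATE TABLE poem_keywords (
  poem_id INT NOT NULL,
  keyword VARCHAR(64) NOT NULL,
  PRIMARY KEY (poem_id, keyword),
  KEY idx_keyword (keyword)
);

SELECT p.*
FROM poems p
JOIN poem_keywords k ON k.poem_id = p.id
WHERE k.keyword = 'love';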
I think you should explore Apache Lucene or a similar project that provides full-text search. Alternatively, you can look at MongoDB instead of MySQL; it has a number of index types. There are also Solr and Elasticsearch, which use Lucene under the hood.
The poem body, I assume, will be stored as a varchar or text type. I don't know whether indexing is possible on varchar or not, and I don't think it is wise to index the entire poem body anyway. Something like Lucene/Solr provides a better option.
Please note, I am not affiliated with any of the products mentioned above.
I have a question:
Suppose I have one big table with a relationship to a smaller table of users.
The idea is to search that really big table for dates later than a given date, order by a score (a bigint, for example), and obtain the related user info at the same time.
The result of this query can change every 10 minutes or so.
So, there is no text search, but I have a really big table. Should I use Sphinx (or another search engine), or should I just use some MySQL indexes?
If I use Sphinx, I'm sure I can obtain really fast results; but maybe keeping the index refreshed, even with delta indexing, doesn't end up much faster than plain MySQL indexing. At the same time, the changes in the table are not necessarily new inserts but updates, and I have read that real-time indexing and delta indexes can cause problems there.
Maybe it would be better to use MySQL indexes and add some kind of caching to avoid unnecessary queries.
Just use MySQL, you definitely don't need Sphinx for what you are doing.
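A rough sketch of what that could look like with plain MySQL; the names (big_table, created_at, score, user_id, users) are placeholders for whatever your schema actually uses:
CREATE INDEX idx_big_created ON big_table (created_at);
CREATE INDEX idx_big_user ON big_table (user_id);

SELECT b.*, u.name
FROM big_table b
JOIN users u ON u.id = b.user_id
WHERE b.created_at > '2013-01-01'
ORDER BY b.score DESC
LIMIT 50;
Since the result only changes every 10 minutes or so, caching that result set (in memcached, for example) and refreshing it on a timer covers the rest.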
Okay, MySQL indexing. Is indexing nothing more than having a unique ID for each row that will be used in the WHERE clause?
When indexing a table, does the process add any information to the table? For instance, another column or value somewhere?
Does indexing happen on the fly when retrieving values, or are values placed into the table much like an insert or update?
Any more information to clearly explain MySQL indexing would be appreciated. And please don't just post a link to the MySQL documentation; it is confusing, and it is always better to get a personal response from a professional.
Lastly, how is indexing different from telling MySQL to look for values between two values? For example: WHERE create_time >= 'AweekAgo'
I'm asking because one of my tables is 220,000+ rows and it takes more than a minute to return values with a very simple mysql select statement and I'm hoping indexing will speed this up.
Thanks in advance.
You were downvoted because you didn't make an effort to read or search for what you are asking. A simple Google search could have shown you the benefits and drawbacks of a database index. Here is a related question on Stack Overflow; I am sure there are numerous questions like it.
To simplify the jargon: it is easier to locate books in a library if you arrange them on shelves numbered according to their area of specialization. You can easily tell somebody to go to a specific location and pick up the book - that is what an index does.
Another example: imagine an alphabetically ordered admission list. If your name starts with Z, you just skip A to Y and go straight to Z - faster, right? Otherwise, you would have to search and search, and might not even find it if you didn't look carefully.
A database index is a data structure that improves the speed of operations in a table. Indexes can be created using one or more columns, providing the basis for both rapid random lookups and efficient ordering of access to records.
You can create an index like this:
CREATE INDEX index_name
ON table_name ( column1, column2,...);
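For example, for the create_time filter mentioned in the question (the table name events is an assumption; substitute your actual table name):
CREATE INDEX idx_create_time ON events (create_time);

-- the range filter can now use the index instead of scanning all 220,000+ rows
SELECT * FROM events WHERE create_time >= '2013-05-01';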
You might be working on a more complex database, so it's good to remember a few simple rules.
Indexes slow down inserts and updates, so you want to use them carefully on columns that are FREQUENTLY updated.
Indexes speed up where clauses and order by.
For further detail, you can read:
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
There are a lot of index types, for example a hash, a trie, or a spatial index. It depends on the data. Most likely it's a hash or a B-tree. Nothing really fancy, because the fancy options are usually expensive.
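In MySQL you can see this choice directly: B-tree is the default for InnoDB and MyISAM indexes, while hash indexes are available for the MEMORY engine. The lookup_cache table below is just an illustration:
CREATE TABLE lookup_cache (
  k VARCHAR(32) NOT NULL,
  v VARCHAR(255),
  KEY idx_k (k) USING HASH
) ENGINE=MEMORY;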
I am currently using MySQL and have a few tables on which I need to perform boolean search. Given that my tables are InnoDB, I found that one of the better ways to do this is to use Sphinx or Lucene. I have a question about using these; my queries are of the following format:
Select count(*) as cnt, DATE_FORMAT(CONVERT_TZ(wrdTrk.createdOnGMTDate,'+00:00',:zone),'%Y-%m-%d') as dat from t_twitter_tracking wrdTrk where wrdTrk.word like (:word) and wrdTrk.createdOnGMTDate between :stDate and :endDate group by dat;
The queries have a date field which needs to be converted to the timezone of the logged-in user, and that field is then used to do a group by.
Now, if I migrate to Sphinx/Lucene, will I be able to get a result similar to the query above? I am a beginner in Sphinx; which of these two should I use, or is there anything better?
Actually, the group by and the search 'wrdTrk.word like (:word)' are a major part of my query, and I need to move to boolean search to enhance the user experience. My database has approximately 23,652,826 rows, the table is InnoDB-based, and MySQL full-text search doesn't work.
Regards
Roh
Yes, Sphinx can do this. I don't know exactly what LIKE (:word) matches, but you can do a query like @word "exactword" in Sphinx's extended search syntax.
You only need to index the data properly and you will get the result.
Since you only need the counts, I believe it would be better for you to keep using MySQL.
If you have a performance problem, I suggest you use EXPLAIN and possibly better indexing to improve your queries.
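A sketch of what better indexing could mean here, reusing the table and columns from your query, with placeholder literals substituted for the named parameters. A composite index on (word, createdOnGMTDate) covers both columns the query touches, but it only helps the LIKE if :word does not start with a leading % wildcard (and if word is a TEXT column the index needs a prefix length):
ALTER TABLE t_twitter_tracking
  ADD INDEX idx_word_created (word, createdOnGMTDate);

EXPLAIN
Select count(*) as cnt,
       DATE_FORMAT(CONVERT_TZ(wrdTrk.createdOnGMTDate,'+00:00','+05:30'),'%Y-%m-%d') as dat
from t_twitter_tracking wrdTrk
where wrdTrk.word like 'someword%'
  and wrdTrk.createdOnGMTDate between '2013-01-01' and '2013-02-01'
group by dat;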
Only if full-text search is a major part of your use case should you move to using Sphinx/Solr.
Read Full Text Search Engine versus DBMS for a more comprehensive answer.
Save your count in a meta table and keep it updated. Or use MyISAM, which maintains its own row count. MongoDB also maintains its own count. Cache the count in memcached. Counting each time you need to know the count is a silly use of resources.