When to use MySQL FULL TEXT SEARCH

I am working with a table that has 1,500 records. I query the comments column using LIKE '%word%', which works well for now, but comments are increasing day by day, so I am thinking of switching to full-text search or a tool like http://sphinxsearch.com.
At roughly what number of records does full-text search or a search engine (Sphinx) start to do the job more effectively? I also think field size matters: I am working with short comments, so it's fine for now, but with long articles I might really see the power of full-text search or Sphinx.
Thanks.

Once the dataset (the whole table, in your case) no longer fits in memory, indexing (hence the full-text index / Sphinx / Lucene) makes a big difference.
It also makes a big difference under a highly concurrent, mixed write/read load, because your query has to scan the whole table, and this involves locking, reading old row versions from the undo log, etc. (depending on the transaction isolation level).
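As a minimal sketch of the built-in alternative (assuming a MyISAM table named comments with a text column body; InnoDB did not support FULLTEXT indexes before MySQL 5.6):
ALTER TABLE comments ADD FULLTEXT INDEX ft_body (body);
SELECT id, body
FROM comments
WHERE MATCH(body) AGAINST('word');
Unlike LIKE '%word%', MATCH ... AGAINST answers the query from the full-text index instead of scanning every row.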

LIKE does pattern matching against every row to find the right ones; this can become very slow very fast. Switch to Sphinx, it's very good.

Related

Using Sphinx vs. MySQL for a query with no text search

I have this doubt:
Suppose I have one big table with a relationship to a smaller table of users.
The idea is to search that really big table for dates later than a given date, order by a score (a BIGINT, for example), and obtain the related user info at the same time.
The result of this query can change every 10 minutes or so.
So, there is no text search, but I have a really big table. Should I use sphinx (or other search engine) or should I just use some MySQL indexes?
If I use Sphinx, I can certainly obtain really fast results; but maybe keeping the index refreshed, even with delta indexing, doesn't make a big difference compared to MySQL indexing. At the same time, the changes to the table are not necessarily new inserts but updates, and I have read that real-time indexing and delta indexes can cause problems there.
Maybe it would be better to use MySQL indexes and help them with some kind of caching to avoid unnecessary queries.
Just use MySQL, you definitely don't need Sphinx for what you are doing.
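A minimal sketch of the plain-MySQL approach (table and column names invented: big_table, created_at, score, user_id):
CREATE INDEX idx_created ON big_table (created_at);
SELECT b.*, u.name
FROM big_table b
JOIN users u ON u.id = b.user_id
WHERE b.created_at > '2012-01-01'
ORDER BY b.score DESC
LIMIT 20;
Note that a single B-tree index cannot serve both the range predicate on created_at and the ORDER BY on score, so check EXPLAIN and index whichever is more selective.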

User search: use MySQL LIKE or integrate a search engine (such as Solr, Sphinx, Lucene, etc.)?

In my MySQL db I have a user table consisting of 37,000 (or thereabouts) users.
When a user searches for another user on the site, I perform a simple wildcard LIKE (i.e. LIKE '{name}%') to return the users found.
Would it be more efficient and quicker to use a search engine such as Solr to do my LIKE searches? Furthermore, I believe in Solr I can use wildcard queries (http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/).
To be honest, it's not that slow at the moment using a LIKE query however as the number of users grows it'll become slower. Any tips or advice is greatly appreciated.
We had a similar situation about a month ago; our database is roughly 33k rows, and because our engine was InnoDB we could not use the MySQL full-text search feature (that, and it being quite blunt).
We decided to implement Sphinx (http://www.sphinxsearch.com) and we're really impressed with the results (I'm becoming quite a 'fanboy' of it).
When we do a large index search across many columns (loads of LEFT JOINs) over all our rows, we actually halved the query response time compared with the MySQL LIKE counterpart.
Although we haven't used it for long, if you're going to build for future scalability I'd recommend Sphinx.
You can also speed things up by requiring the search word to have a minimum of 3 characters before the search starts, and by indexing your search column with a prefix index of 3 characters.
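A sketch of that prefix-index idea (assuming the column is called name):
ALTER TABLE users ADD INDEX idx_name_prefix (name(3));
SELECT * FROM users WHERE name LIKE 'joh%';
A trailing-wildcard LIKE such as this can use the B-tree index; a leading wildcard ('%joh%') cannot.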
It's actually already built into MySQL: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
We're using Solr for this purpose, since you can search in 1-2 ms even with millions of documents indexed. We mirror our MySQL instance with the Data Import Handler and then search on Solr.
As Neville pointed out, full-text search is built into MySQL, but Solr's performance is way better, since it was born as a full-text search engine.

How to optimize full text searches?

I know there are a lot of question already on this subject, but I needed more specific information. So here goes:
Ideally what should be the maximum length of characters upon which a full text search can be performed using minimal resources (CPU, memory)?
When should I decide between using the LIKE %$str% and full-text search?
Is it important to have both versions LIKE %$str% and full-text search implemented and use the optimal one dynamically?
As far as I know it depends on the number of words, not characters. The fewer, the faster MySQL will be. But don't let that get in your way.
Never use LIKE if you can use a full-text search. Except maybe for queries that you would manually run once in a while and you don't want to slow down the INSERTs on that table.
You know the speed of select vs speed of insert tradeoff in indexes, right?
Always use FT (full-text) search in queries that you don't run manually. LIKE is slow and gets even slower as the number of rows increases, because the MySQL engine has to look at EVERY row to answer your query, while FT keeps an index and knows exactly where to look.
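One way to see the difference yourself (table and column names assumed, with a FULLTEXT index on body):
EXPLAIN SELECT * FROM articles WHERE body LIKE '%word%';
-- typically reports type = ALL, i.e. a full table scan
EXPLAIN SELECT * FROM articles WHERE MATCH(body) AGAINST('word');
-- typically reports type = fulltext, i.e. an index lookup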

Are indexes good or bad for a large database?

I read on the MySQL Performance Blog that when tables are large, it is better to scan the full table instead of using indexes.
I have a table with tens of millions of rows. When running queries, if I use no indexes, queries are 24 times slower than with indexes. I know a lot of things may cause this (e.g., whether rows are stored sequentially), but can you please give me some hints about what might be happening, or how I should start examining this issue? I want to understand when the use of indexes is preferred and when it's not.
Thanks
The article says that when dealing with very large data sets, where the amount of rows you need to work with are approaching the number of rows that is in the table, using an index might hurt performance.
In this case, going through the index will indeed hurt performance, as long as you need more data than is present in the index.
To go through the index, the database engine first has to read large parts of the index table (it is a type of table), then for each row (or set of rows) from this result, go to the real table and start cherrypicking pages to read.
If, on the other hand, you only need to retrieve columns that are already part of the index table, then the database engine only has to read from that, and does not need to continue on to the full table for more data.
If you end up reading most or close to most of the actual table in question, all the work required to deal with the index might be more overhead than just doing a full table-scan to begin with.
Now, this is all the article is saying. For most work dealing with a database, using indexes is the exact right thing to do.
For instance, if you need to extract a small set of rows, going through an index instead of a full table scan will be many orders of magnitude faster.
In any case, if you're in doubt, you should do some performance profiling to find out how your application behaves under different types of load, and then start tweaking; don't take a single article as a silver bullet for anything.
For instance, one way to speed up the example queries in the article that do a count on the pad column would be to create a single index that covers both val and pad. That way the count would simply be an index scan, not an index scan plus a table lookup, and it would run faster than the full table scan.
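A sketch of that covering-index idea, using the article's column names (the table name is assumed):
CREATE INDEX idx_val_pad ON t (val, pad);
SELECT COUNT(pad) FROM t WHERE val = 5;
-- EXPLAIN should show 'Using index': the query is satisfied from the index alone, with no table lookup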
Your best option is to know your data, and to experiment, and to know how the tools you use work, so indeed, learn more about indexes, but in the end, it is you who decides what is best for your program.
As always, it depends. I've never yet run into a scenario like the one described in that blog post. Using indexes on my queries against large tables (50+ million rows) has been on the order of 100 to 10,000 times faster than doing a full table scan on those big tables.
There's probably no silver bullet here, you have to test for your particular data and your particular queries.
It is good practice to put an index on each column that you use in a WHERE clause.

Big tables and analysis in MySQL

For my startup, I track everything myself rather than rely on Google Analytics. This is nice because I can actually have IPs and user IDs and everything.
This worked well until my tracking table grew to about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not a pro at MySQL. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table (and therefore faster performance is typically possible).
As you found out, leading wildcards are particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I want to stress that while indexes typically improve DBMS performance for SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on write queries can be very significant.
My reason for stressing the ambivalent nature of indexes is that your application appears to do a fair amount of data collection as a normal part of its operation, so you will need to watch for possible degradation as the INSERT queries slow down. A possible alternative is to perform the data collection into a relatively small table/database with no or very few indexes, and to regularly import the data from this input database into the database where the actual data mining takes place. (After they are imported, the rows can be deleted from the "input database", keeping it small and fast for its INSERT role.)
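A sketch of that collect-then-import pattern (database names invented; assumes acts has an auto-increment id column; run periodically, e.g. from cron):
SELECT MAX(id) INTO @max_id FROM input.acts;
INSERT INTO mining.acts SELECT * FROM input.acts WHERE id <= @max_id;
DELETE FROM input.acts WHERE id <= @max_id;
-- bounding both statements by the captured id avoids losing rows inserted in between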
Another concern/question is the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to rows being too wide, resulting in too few rows per page in the leaf nodes of the table and hence a deeper-than-needed tree structure.
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made up of at least these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for one...), but the general process is to list the most common use cases and see which database indexes could help with them. You can gather insight into how particular queries are handled, initially or after indexes are added, with the MySQL EXPLAIN command.
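For example (using the answer's suggested key order; depending on what EXPLAIN shows, (note, ip) may serve the equality predicate on note better):
ALTER TABLE acts ADD INDEX idx_ip_note (ip, note);
EXPLAIN SELECT COUNT(DISTINCT ip)
FROM acts
JOIN users ON (users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';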
Normalization OR denormalization (or indeed a combination of both!) is often a viable way to improve performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
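A sketch of that two-step variant:
SELECT COUNT(*)
FROM (SELECT DISTINCT ip FROM acts WHERE acts.note = 'some_marketing_page') AS d
JOIN users ON (users.ip = d.ip);
-- assumes ip is unique in users; otherwise count DISTINCT d.ip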
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
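A sketch of the hash-column idea (the note_hash column is invented):
ALTER TABLE acts ADD COLUMN note_hash INT UNSIGNED,
                 ADD INDEX idx_note_hash (note_hash);
UPDATE acts SET note_hash = CRC32(note);
SELECT COUNT(DISTINCT ip)
FROM acts
WHERE note_hash = CRC32('some_marketing_page')
  AND note = 'some_marketing_page';
-- the repeated note = ... predicate guards against CRC32 collisions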
Try running EXPLAIN on your query and look to see whether there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.