Indexing text - MySQL vs. MS SQL - mysql

Imagine you have an application like this: 1 DB table, a few int fields, a few small varchar fields, and about 10 TEXT fields (contents vary - some entries are about 50 chars long, most about 100-200, some about 1000, very few more than 1000). Row count is in the tens to hundreds of thousands.
Now, I need an efficient way to run queries like this (meta-language):
SELECT (1 IF textfield1 LIKE %param1% ELSE 0) AS r1, (1 IF textfield2 LIKE %param2% ELSE 0) AS r2, ... and so on, typically for most of the text fields in one query (it is dynamic - sometimes 2 of them are included, sometimes all of them).
Now the question: which is better for me, MySQL or MSSQL (probably Express while possible, upgrading to the full edition if really needed)?
I know that MySQL has nice prefix indexes on text columns, which you can set to a custom number of leading characters, so I can balance them for the typical scenario (like this: http://fernandoipar.com/2009/08/12/indexing-text-columns-in-mysql/)
MSSQL has only full-text indexing, which I have no experience with. Note that I do NOT need features like word proximity or similar words (run = ran; some stemming would be nice, but because the data is multilingual it is impossible anyway). I just need the common LIKE %word% behaviour, that's all. And I also have to be able to find short substrings (2 chars).
Essentially, the goal is to run as many of these queries per hour/day as possible (there will never be enough results, because they should be refreshed as often as possible), so treat this kind of efficiency as a requirement :)
Thanx!
UPDATE: well, apparently there is no way to use an index to optimize LIKE %foo% queries. So the new question is: is there any other way to speed up this type of query? (please omit things like "buy more RAM or an SSD" :)

A LIKE '%foo%' expression cannot be optimized with an index in any RDBMS.
You need full-text indexes, in MySQL or in SQL Server.
I need just common LIKE %word% system
Then choose any DBMS you want, because they will all struggle with such a clause ;-)
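If you do go the full-text route, here is a minimal sketch of what it looks like on each side (the table, column and index names are made up, and note that standard full-text search matches whole words, so this is only useful if the 2-character substring requirement can be relaxed):

-- MySQL: add a FULLTEXT index and query it with MATCH ... AGAINST
ALTER TABLE articles ADD FULLTEXT INDEX ft_text1 (textfield1);
SELECT id FROM articles
WHERE MATCH(textfield1) AGAINST ('param1' IN BOOLEAN MODE);

-- SQL Server: create a full-text catalog and index, then query with CONTAINS
CREATE FULLTEXT CATALOG ft_catalog;
CREATE FULLTEXT INDEX ON articles (textfield1) KEY INDEX pk_articles ON ft_catalog;
SELECT id FROM articles
WHERE CONTAINS(textfield1, 'param1');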

Today many applications use an external index and search engine.
Have a look at http://lucene.apache.org/

Related

Is wildcard LIKE more performant than a multiple boolean search in MySQL?

I have been building an API for a website, and the objects I am searching for have a LOT of true/false fields. Instead of creating a huge db structure to manage the options, I thought about serializing them into a string similar to '001001000010101001' where 1 is true and 0 is false (I am talking about 100 different options). The other reason I am doing this is to have a clean database, so that all of those fields get grouped into a single field (I already have a serializer/deserializer).
Now in the search function, since not all of the options get searched at the same time, I was thinking about using a LIKE statement with wildcards.
For example I would do something like this:
WHERE options LIKE '1_1__1__1___1%' (the final wildcard is there to reduce the number of _ wildcards, so that only the beginning of the pattern gets matched; I would stop at the last option that needs to be checked and % wildcard the rest).
Would this (on average, because sometimes there might be 2 or 3 parameters selected and many times there might be all of them) be more performant than a series of AND xxx AND xxx AND ...?
Or is there a far more efficient (and cleaner to maintain) way of doing this that I am completely missing?
Let me discuss some of the items you bring up.
I understand the 0/1 approach. I find it interesting that you chose to go with strings instead of numbers. For up to 64 true/false values, a BIGINT (8 bytes) or something smaller would work.
Did you also want to test for false for some flags? Your string mechanism makes that rather obvious. (Your example should include a 0.)
Searching with LIKE will be efficient only if there is no leading wildcard (_ or %). So, I would expect most of your queries to involve a full table scan. Your serializer, plus MySQL's b'...' bit-literal prefix, would work for setting the values.
The integer approach that I am countering with would involve & and other bitwise operations to test. This would probably be noticeably faster, but it would still necessitate a full table scan.
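As a hedged sketch of that integer/bitmask approach (the table, column and bit assignments here are invented): each option becomes one bit in a BIGINT UNSIGNED column, and a search checks the relevant bits with &.

ALTER TABLE items ADD COLUMN option_bits BIGINT UNSIGNED NOT NULL DEFAULT 0;

-- Find rows where the flags at bit positions 0, 2 and 5 are all set;
-- this still scans the table, but the per-row test is a cheap bitwise AND.
SELECT id
FROM items
WHERE option_bits & (1 | 4 | 32) = (1 | 4 | 32);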
If there are other columns being tested in the WHERE clause, let's see them. They will be important to performance and indexing.
Using numbers instead of strings would be slightly faster because of smaller size and faster testing. But not a lot.
You can get some of the benefits of numeric by using the SET datatype. (Again limited to 64 independent flags per column.) The advantage is that you talk to the database via strings, not cryptic bit positions.
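A small sketch of what the SET alternative could look like (the member names are hypothetical); FIND_IN_SET tests one member at a time, so multiple flags are combined with AND:

CREATE TABLE items (
    id      INT PRIMARY KEY,
    options SET('has_garden', 'has_garage', 'has_pool') NOT NULL DEFAULT ''
);

SELECT id
FROM items
WHERE FIND_IN_SET('has_garden', options)
  AND FIND_IN_SET('has_pool', options);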
If this is a real estate app, consider a few columns like: kitchen, bedrooms, unusual_items (second dishwasher, jacuzzi), etc. No matter how it is implemented (string, integer, SET), this suggestion won't impact performance much.
Another performance note, again with a real estate model: since the number of bedrooms is almost always a criterion, make it a column by itself. This may allow for some use of it in a composite index.

MySQL - Querying against a non-indexable field

The webpage in question is https://www.christart.com/poetry/
I have a MySQL table with a little over 7,000 poem entries. I'm getting requests from my users to be able to run queries against the body of the poems, but the bodies are saved in a 'text' column.
I know how to write the SQL statement. That's easy enough. My concern is the load on the database. I always index columns that are queried or joined on, but I can't index a 'text' column.
There must be a way. How should I approach this?
You could use a full text index:
CREATE FULLTEXT INDEX poem_contents ON poems(body);
And then search using match:
SELECT *
FROM poems
WHERE MATCH(body) AGAINST ('some phrase' IN BOOLEAN MODE)
There's no reason that you can't index a text field. That being said, there's probably very little value in indexing a text field that contains entire poems.
If your database only has 7,000 rows, you probably won't see a massive performance hit unless you scale much larger than it currently is. For a larger scale, a better solution would probably be to extract keywords from the body and search on those.
I think you should explore Apache Lucene or a similar project that provides full-text search. Alternatively, you can look at MongoDB instead of MySQL; it has a number of index types. There are also Solr/ElasticSearch, which use Lucene under the hood.
The poem body, I assume, will be stored in a varchar/text type. I don't know whether indexing is possible on varchar or not, and I don't think it is wise to index the entire poem body. Something like Lucene/Solr provides a better option.
Please note, I am not related to any of the product mentioned above.

Sphinx vs. MySql - Search through list of friends (efficiency/speed)

I'm porting my application searches over to Sphinx from MySQL and am having a hard time figuring this one out, or if it even needs to be ported at all (I really want to know if it's worth using sphinx for this specific case for efficiency/speed):
users
uid | uname
1   | alex
2   | barry
3   | david
friends
uid | fid
1   | 2
2   | 1
1   | 3
3   | 1
Details are:
- InnoDB
- users: index on uid, index on uname
- friends: combined index on uid,fid
Normally, to search all of alex's friends with mysql:
$uid = 1;
$searchstr = "%$friendSearch%";
$query = "SELECT f.fid, u.uname
          FROM friends f
          JOIN users u ON f.fid = u.uid
          WHERE f.uid = :uid AND u.uname LIKE :friendSearch";
$friends = $dbh->prepare($query);
$friends->bindParam(':uid', $uid, PDO::PARAM_INT);
$friends->bindParam(':friendSearch', $searchstr, PDO::PARAM_STR);
$friends->execute();
Is it any more efficient to find alex's friends with Sphinx vs MySQL, or would that be overkill? If Sphinx would be faster for this as the list grows to thousands of people,
what would the indexing query look like? How would I delete a friendship that no longer exists with Sphinx as well - can I have a detailed example for this case? Should I change this query to use Sphinx?
Ok this is how I see this working.
I have the exact same problem with MongoDB. MongoDB "offers" searching capabilities, but just like MySQL you should never use them unless you want to be choked with IO, CPU and memory problems and forced to use a lot more servers to cope with your index than you normally would.
The whole idea of using Sphinx (or another search technology) is to lower the cost per server by having a performant index searcher.
Sphinx, however, is not a storage engine. It is not as simple to query exact relationships across tables; they have remedied this a little with SphinxQL, but due to the nature of the full-text index it still doesn't do a proper join like you would get in MySQL.
Instead I would store the relationships within MySQL but have an index of "users" within Sphinx.
In my website I personally have 2 indexes:
main (houses users, videos, channels and playlists)
help (help system search)
These are delta-updated once every minute. Since realtime indexes are still a bit experimental at times, and I personally have seen problems with high insertion/deletion rates, I stick to delta updates. So I would use a delta index to update the main searchable objects of my site, since this is less resource-intensive and more performant than realtime indexes (from my own tests).
Do note that in order to process deletions and the like in your Sphinx collection through the delta index, you will need a kill-list and certain filters for your delta index. Here is an example from my index:
source main_delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query_pre =
    sql_query = \
        SELECT id, deleted, _id, uid, listing, title, description, category, tags, author_name, duration, rating, views, type, adult, videos, UNIX_TIMESTAMP(date_uploaded) AS date_uploaded \
        FROM documents \
        WHERE id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id = 1 ) OR update_time > ( SELECT last_index_time FROM sph_counter WHERE counter_id = 1 )
    sql_query_killlist = SELECT id FROM documents WHERE update_time >= ( SELECT last_index_time FROM sph_counter WHERE counter_id = 1 ) OR deleted = 1
}
This processes deletions and additions once every minute, which is pretty much realtime for a real web app.
So now we know how to store our indexes; next I need to talk about the relationships. Sphinx (even though it has SphinxQL) won't do proper joins across data, so I would personally recommend handling the relationship outside of Sphinx. Not only that, but as I said, this relationship table will get high load, so it is something that could impact the Sphinx index.
I would do a query to pick out all the ids, and using that set of ids, use the "filter" method on the Sphinx API to filter the main index down to specific document ids. Once this is done you can search in Sphinx as normal. This is the most performant method of dealing with this that I have found to date.
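As a rough illustration of that two-step idea, here is a hedged sketch using SphinxQL rather than the native API's filter call (the index name main and the returned ids are assumptions):

-- Step 1, in MySQL: collect the candidate document ids
SELECT fid FROM friends WHERE uid = 1;    -- suppose this returns 2 and 3

-- Step 2, in SphinxQL (against searchd): full-text search only within those ids
SELECT id FROM main WHERE MATCH('barry') AND id IN (2, 3);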
The key thing to remember at all times is that Sphinx is a search tech while MySQL is a storage tech. Keep that in mind and you should be ok.
Edit
As @N.B said (which I overlooked in my answer), Sphinx does have SphinxSE. Although primitive and still in a sort of testing stage of its development (same as realtime indexes), it does provide an actual MyISAM/InnoDB-style storage engine interface to Sphinx. This is awesome. However, there are caveats (as with anything):
The language is primitive
The joins are primitive
However, it can/could do the job you're looking for, so be sure to look into it.
So I'm going to go ahead and outline what I feel the best use cases for Sphinx are, and you can decide whether it's more or less in line with what you're looking to do.
If all you're looking to do is a string search on one field, then with MySQL you can do wildcard searches without much trouble, and honestly, with an index on it, unless you're expecting millions of rows you are going to be fine.
Now take Facebook, which is indexing not only names, but pages etc., or even advanced search fields. Sphinx can take in x columns from MySQL, PostgreSQL, MongoDB (insert the DB you want here) and create a searchable full-text index across all of those.
Example:
You have 5 fields (house number, street, city, state, zipcode) and you want to do a full-text search across all of those. Now with MySQL you could do searches on every single one; however, with Sphinx you can glob them all together, and Sphinx does some awesome statistical ranking based on the string you've passed in and the matches that result from it.
This Link: PHP Sphinx Searching does a great job at walking you through what it would look like and how things work together.
So you aren't really replacing a database; you're just adding a special daemon to it (sphinx) which allows you to create specialized indexes and run your full text searches against it.
No index can help you with this query, since you're looking for the string as an infix, not a prefix (you're looking for '%friendname%', not 'friendname%').
Moreover, the LIKE solution will get you into corners: suppose you were looking for a friend called Ann. The LIKE expression will also match Marianne, Danny etc. There's no "complete word" notion in a LIKE expression.
A real solution is to use a text index. A FULLTEXT index is only available on MyISAM, and MySQL 5.6 (not GA at this time) will introduce FULLTEXT on InnoDB.
Otherwise you can indeed use Sphinx to search the text.
With just hundreds or thousands, you will probably not see a big difference, unless you're really going to do many searches per second. With larger numbers, you will eventually realize that a full table scan is inferior to Sphinx search.
I'm using Sphinx a lot, on dozens and sometimes hundreds of millions of large texts, and can testify it works like a charm.
The problem with Sphinx is, of course, that it's an external tool. With Sphinx you have to tell it to read data from your database. You can do so (using crontab, for example) every 5 minutes, every hour, etc. So if rows are DELETEd, they will only be removed from Sphinx the next time it reads the data from the table. If you can live with that - that's the simplest solution.
If you can't, there are real-time indexes in Sphinx, so you may directly instruct it to remove certain rows. I am unable to explain everything in this post, so here are a couple of links for you:
Index updates
Real time indexes
As a final conclusion, you have three options:
Risk it and use a full table scan, assuming you won't have high load.
Wait for MySQL 5.6 and use FULLTEXT with InnoDB.
Use Sphinx.
At this point in time, I would certainly use option #3: use Sphinx.
Take a look at the solution I propose here:
https://stackoverflow.com/a/22531268/543814
Your friend names are probably short, and your query looks simple enough. You can probably afford to store all suffixes, perhaps in a separate table, pointing back to the original table to get the full name.
This would give you fast infix search at the cost of a little bit more storage space.
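A minimal sketch of that idea against the users table from the question (the suffix table name is made up): for 'barry' you would store the suffixes 'barry', 'arry', 'rry', 'ry' and 'y', so an infix search becomes a left-anchored prefix search that can use an ordinary index. Populating the suffix rows would be the application's job (or a trigger's) on insert/update.

CREATE TABLE uname_suffixes (
    suffix VARCHAR(64) NOT NULL,
    uid    INT NOT NULL,
    PRIMARY KEY (suffix, uid)
);

-- LIKE '%arr%' on users.uname becomes a prefix match here:
SELECT DISTINCT u.uid, u.uname
FROM uname_suffixes s
JOIN users u ON u.uid = s.uid
WHERE s.suffix LIKE 'arr%';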
Furthermore, to avoid finding 'Marianne' when searching for 'Ann', consider:
Using case-sensitive search. (Fragile; may break if your users enter their names or their search queries with incorrect capitalization.)
After the query, filtering your search results further, requiring word boundaries around the search term (e.g. regex \bAnn\b).
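For the word-boundary filter, a hedged example of doing it directly in MySQL 8.0's ICU regex (earlier versions would use the [[:<:]] / [[:>:]] boundary markers instead of \b):

SELECT uid, uname
FROM users
WHERE uname LIKE '%Ann%'            -- coarse infix match first
  AND uname REGEXP '\\bAnn\\b';     -- then require word boundaries around "Ann"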

Sphinx / Solr for keyword / frequency queries

We need to be able to perform two types of queries efficiently against a table containing several million records:
1) Return the "x" most recent records which contain keyword "y".
2) Return the "x" most frequent keywords for a group of records.
We have been thinking about using some external search server such as Sphinx or Solr, but we are not sure if any of those will be able to support both types of queries.
So, which is the most efficient way to be able to perform those types of queries?
Solr can definitely do both of those things, assuming you've set up your schema.xml file properly. Your queries might look something like this:
1 - http://localhost:8983/solr/solr-index/select?q=y&rows=x&sort=date+desc
2 - http://localhost:8983/solr/solr-index/select?q=*:*&rows=0&facet=true&facet.field=description
In fact your main problem with Solr might be getting the data into the index. But even indexing and optimization are fast.
Sphinx can do 1) without even breaking a sweat. No problem there.
2) is trickier. It's not supported out of the box, but it can be done; you need to do a fair amount of extra work. Basically you need to tokenize the text yourself and store the ids as a multi-value attribute (MVA). You can then run a GROUP BY query on this MVA column.
If the above sounds in any way scary, you are probably best off using another solution - from the other reply it sounds like Solr will do it.

Using MySQL to search through large data sets?

Now, I'm a really advanced PHP developer and heavily knowledgeable about small-scale MySQL data sets; however, I'm now building a large infrastructure for a startup I've recently joined, and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know the best way to search through large data sets (currently 84.9 million rows, with a database size of 394.4 gigabytes). It is hosted using Amazon RDS, so it does not have any downtime or slowness; I just want to know the best way to access large data sets internally.
For example, if I search through the database of 84 million rows it takes 6 minutes, but if I make a direct request for a specific id or title it is served instantly. So how should I search through a large data set?
Just to reiterate: it's fast to find information in the database by looking up a single indexed value, but searching performs VERY slowly.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query on 84m rows is significantly slow - specifically, 47.41 seconds to perform this query standalone. Any ideas, guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't as good for large data sets as something like Oracle or MSSQL; however, I've been told to rebuild it on MySQL rather than other database solutions at this moment.
LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is algorithmically complex business compared to regular comparison operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, so that an index on that field can actually be used.
Consider making an index table (in the literature/library sense of the word "index", not a database index) if you search for whole words, or a substring lookup table if searching for random, often-repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time the data is inserted or updated, so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
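As a hedged sketch of that trigger idea, assuming user_name is stored as 'FirstName LastName' (the column, index and trigger names are invented):

ALTER TABLE users
    ADD COLUMN first_name VARCHAR(50),
    ADD COLUMN last_name  VARCHAR(50),
    ADD INDEX idx_first_name (first_name),
    ADD INDEX idx_last_name  (last_name);

CREATE TRIGGER users_name_split BEFORE INSERT ON users
FOR EACH ROW
    SET NEW.first_name = SUBSTRING_INDEX(NEW.user_name, ' ', 1),
        NEW.last_name  = SUBSTRING_INDEX(NEW.user_name, ' ', -1);

-- A matching BEFORE UPDATE trigger would keep the columns in sync when names change.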
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
main_table:
    id    integer  primary key
    blah  blah blah
    text  varchar(60)
appl_index:
    id    integer
    word  varchar(20)
    primary key (id, word)
    index (word)
Then you can query appl_index to find the ids that have both james and smith in them, far faster than the abominable LIKE '%...'. You could also break the actual words out to a separate table and use word IDs, but that's a matter of taste - its effect on performance would be questionable.
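For instance, finding the ids that contain both words could look like this (a sketch against the appl_index layout above):

SELECT id
FROM appl_index
WHERE word IN ('james', 'smith')
GROUP BY id
HAVING COUNT(DISTINCT word) = 2;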
You may well have similar problems with f.friend_from||f.friend_to, but I've not seen that syntax before (if, as it seems, the intent is that u.user_id can be either one).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!