Using MySQL to search through large data sets?

Using MySQL to search through large data sets? - mysql

Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.
For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.
Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.

LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is complex (algorythmically) business compared to regular operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, if you have an index on that field you might be able to use.
Consider making an index table (in the literature/library context of the word "index", not a database index context) if you search for whole words. Or a substring lookup table if searching for random often repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.

LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
main_table:
id integer primary key
blah blah blah
text varchar(60)
appl_index:
id index
word varchar(20)
primary key (id,word)
index (word)
Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable like '%...'. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.
You may well have a similar problems with f.friend_from||f.friend_to but I've not seen that syntax before (if, as it seems to be, the context is u.user_id can be one or the other).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!

Related

SQL Index on Strings Helpful?

So I have used MySQL a lot in small projects, for school; however, I'm not taking over a enterprise-ish scale project, and now speed matters, not just getting the right information back. I have Googled around a lot trying to learn how indexes might make my website faster, and I am hoping to further understand how they work, not just when to use them.
So, I find myself doing a lot of SELECT DISTINCTS in order to get all the distinct values, so i can populate my dropdowns. I have heard that this would be faster if this column was indexed; however, I don't completely understand why. If the values in this columns were ints, I would totally understand; basically a data structure like a BST would be created, and search times could be Log(n); however, if my column is strings, how can it put a string in a BST? This doesn't seem possible, since there is no metric to compare a string against another string (like there are with numbers). It seems like an index would just create a list of all the possible values for that column, but it seems as if the search would still require the database to go through every single row, making this search linear, just like if the database just scanned a regular tables.
My second question is what does the database do once it finds the right value in the index data structure. For example, let's say I'm doing a where age = 42. So, the database goes through the data structure until it finds 42, but how does it map that lookup to the whole row? Does the index have some sort of row number associated with it?
Lastly, if I am doing these frequent SELECT DISTINCT statements, is adding an index going to help? I feel like this must be a common task for websites, as many sites have dropdowns where you can filter results, I'm just trying to figure out if I'm approaching it the right way.
Thanks in advance.

You logic is good, however, your assumption that there is no metric to compare string to other strings is incorrect. Strings can simply be compared in alphabetical order, giving them a perfectly usable comparison metric that can be used to build the index.
It takes a tiny bit longer to compare strings then it does ints, however, having an index still speeds things up, regardless of the comparison cost.
I would like to mention however that if you are using SELECT DISTINCT as much as you say, there are probably problems with your database schema.
You should learn about normalizing your database. I recommend starting with this link: http://databases.about.com/od/specificproducts/a/normalization.htm
Normalization will provide you with querying mechanism that can vastly outweigh benefits received from indexing.

if your strings are something small like categories, then an index will help. If you have large chunks of random text, then you will likely want a full text index. If you are having to use select distinct a lot, your database may not be properly normalized for what you are doing. You could also put the distinct values in a separate table (that only has the distinct values), but this only helps if the content does not change a lot. Indexing strategies are particular to your application's access patterns, the data itself, and how the tables are normalized (or not).
HTH

Sphinx vs. MySql - Search through list of friends (efficiency/speed)

I'm porting my application searches over to Sphinx from MySQL and am having a hard time figuring this one out, or if it even needs to be ported at all (I really want to know if it's worth using sphinx for this specific case for efficiency/speed):
users
uid uname
1 alex
2 barry
3 david
friends
uid | fid
1 2
2 1
1 3
3 1
Details are:
- InnoDB
- users: index on uid, index on uname
- friends: combined index on uid,fid
Normally, to search all of alex's friends with mysql:
$uid = 1
$searchstr = "%$friendSearch%";
$query = "SELECT f.fid, u.uname FROM friends f
JOIN users u ON f.fid=u.uid
WHERE f.uid=:uid AND u.uname LIKE :friendSearch";
$friends = $dbh->prepare($query);
$friends->bindParam(':uid', $uid, PDO::PARAM_INT);
$friends->bindParam(':friendSearch', $searchstr, PDO::PARAM_STR);
$friends->execute();
Is it any more efficient to find alex's friends with sphinx vs mysql or would that be an overkill? If sphinx would be faster for this as the list hits thousands of people,
what would the indexing query look like? How would I delete a friendship that no longer exists with sphinx as well, can I have a detailed example in this case? Should I change this query to use Sphinx?

Ok this is how I see this working.
I have the exact same problem with MongoDB. MongoDB "offers" searching capabilities but just like MySQL you should never use them unless you wanna be choked with IO, CPU and memory problems and be forced to use a lot more servers to cope with your index than you normally would.
The whole idea if using Sphinx (or another search tech) is to lower cost per server by having a performant index searcher.
Sphinx however is not a storage engine. It is not as simple to query exact relationships across tables, they have remmedied this a little with SphinxQL but due to the nature of the full text index it still doesn't do an integral join like you would get in MySQL.
Instead I would store the relationships within MySQL but have an index of "users" within Sphinx.
In my website I personally have 2 indexes:
main (houses users,videos,channels and playlists)
help (help system search)
These are delta updated once every minute. Since realtime indexes are still bit experimental at times and I personally have seen problems with high insertion/deletion rates I keep to delta updates. So I would use a delta index to update the main searchable objects of my site since this is less resource intensive and more performant than realtime indexes (from my own tests).
Do note inorder to process deletions and what not your Sphinx collection through delta you will need a killlist and certain filters for your delta index. Here is an example from my index:
source main_delta : main
{
sql_query_pre = SET NAMES utf8
sql_query_pre =
sql_query = \
SELECT id, deleted, _id, uid, listing, title, description, category, tags, author_name, duration, rating, views, type, adult, videos, UNIX_TIMESTAMP(date_uploaded) AS date_uploaded \
FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) OR update_time >( SELECT last_index_time FROM sph_counter WHERE counter_id=1 )
sql_query_killlist = SELECT id FROM documents WHERE update_time>=( SELECT last_index_time FROM sph_counter WHERE counter_id=1 ) OR deleted = 1
}
This processes deletions and additions once every minute which is pretty much realtime for a real web app.
So now we know how to store our indexes. I need to talk about the relationships. Sphinx (even though it has SphinxQL) won't do integral joins across data so I would personally recommend doing the relationship outside of Sphinx, not only that but as I said this relationship table will get high load so this is something that could impact the Sphinx index.
I would do a query to pick out all ids and using that set of ids use the "filter" method on the sphinx API to filter the main index down to specific document ids. Once this is done you can search in Sphinx as normal. This is the most performant method I have found to date of dealing with this.
The key thing to remember at all times is that Sphinx is a search tech while MySQL is a storage tech. Keep that in mind and you should be ok.
Edit
As #N.B said (which I overlooked in my answer) Sphinx does have SphinxSE. Although primative and still in sort of testing stage of its development (same as realtime indexes) it does provide an actual MyISAM/InnoDB type storage to Sphinx. This is awesome. However there are caveats (as with anything):
The language is primative
The joins are primative
However it can/could do the job your looking for so be sure to look into it.

so I'm going to go ahead and kinda outline what -I- feel the best use cases for sphinx are and you can kinda decide if it's more or less in line for what you're looking to do.
If all you're looking to do is a string search one one field; then with MySQL you can do wild card searches without much trouble and honstly with an index on it unless you're expecting millions of rows you are going to be fine.
Now take facebook, that is not only indexing names, but pages ect or even any advanced search fields. Sphinx can take in x columns from MySQL, PostGRES, MongoDB, (insert your db you want here) and create a searchable full-text index across all of those.
Example:
You have 5 fields (house number, street, city, state, zipcode) and you want to do a full text search across all of those. Now with MySQL you could do searches on every single one, however with sphinx you can glob them all together then sphinx does some awesome statistical findings based on the string you've passed in and the matches which are resulting from it.
This Link: PHP Sphinx Searching does a great job at walking you through what it would look like and how things work together.
So you aren't really replacing a database; you're just adding a special daemon to it (sphinx) which allows you to create specialized indexes and run your full text searches against it.

No index can help you with this query, since you're looking for the string as an infix, not a prefix (you're looking for '%friendname%', not 'friendname%'.
Moreover, the LIKE solution will get you into corners: suppose you were looking for a friend called Ann. The LIKE expression will also match Marianne, Danny etc. There's no "complete word" notion in a LIKE expression.
A real solution is to use a text index. A FULLTEXT index is only available on MyISAM, and MySQL 5.6 (not GA at this time) will introduce FULLTEXT on InnoDB.
Otherwise you can indeed use Sphinx to search the text.
With just hundreds or thousands, you will probably not see a big difference, unless you're really going to do many searches per second. With larger numbers, you will eventually realize that a full table scan is inferior to Sphinx search.
I'm using Sphinx a lot, on dozens and sometimes hundreds of millions large texts, and can testify it works like a charm.
The problem with Sphinx is, of course, that it's an external tool. With Sphinx you have to tell it to read data from your database. You can do so (using crontab for example) every 5 minutes, every hour, etc. So if rows are DELETEd, they will only be removed from sphinx the next time it reads the data from table. If you can live with that - that's the simplest solution.
If you can't, there are real time indexes in sphinx, so you may directly instruct it to remove certain rows. I am unable to explain everything in this port, so here are a couple links for you:
Index updates
Real time indexes
As final conclusion, you have three options:
Risk it and use a full table scan, assuming you won't have high load.
Wait for MySQL 5.6 and use FULLTEXT with InnoDB.
Use sphinx
At this point in time, I would certainly use option #3: use sphinx.

Take a look at the solution I propose here:
https://stackoverflow.com/a/22531268/543814
Your friend names are probably short, and your query looks simple enough. You can probably afford to store all suffixes, perhaps in a separate table, pointing back to the original table to get the full name.
This would give you fast infix search at the cost of a little bit more storage space.
Furthermore, to avoid finding 'Marianne' when searching for 'Ann', consider:
Using case-sensitive search. (Fragile; may break if your users enter their names or their search queries with incorrect capitalization.)
After the query, filtering your search results further, requiring word boundaries around the search term (e.g. regex \bAnn\b).

Fast mysql query to randomly select N usernames

In my jsp application I have a search box that lets user to search for user names in the database. I send an ajax call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach which takes less time and let me achieve the same..? I need random values.

How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.

does table has indexing on name?
if not apply it
2.MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
3.http://jan.kneschke.de/projects/mysql/order-by-rand/

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?

While getting towards 'respectable', 2 Millions rows is still a relatively small size for a table. (And therefore a faster performance is typically possible)
As you found out, the front-ended wildcard are particularly inefficient and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, it systematically has a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read to it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the cast table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
in view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema cannot be done here (not enough info for one...) but the general process for doing so is to make the list of the most common use case and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with mySQL command EXPLAIN.
Normalization OR demormalization (or indeed a combination of both!), is often a viable idea for improving performance during mining operations as well.

Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.

Try running 'EXPLAIN PLAN' on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Lets say, we want to search 001, how to do the fast searching process using hash table?
Isn't it the same as we use the "SELECT * from .. " in mysql? I read alot, they say, the "SELECT *" searching from beginning to end, but hash table is not? Why and how?
By using hash table, are we reducing the records we are searching? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc. In mysql i could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?

A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item. And lets suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key % 17)].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key % 17)].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).

Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.

http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.

I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since you know that the row will always have the same id because of the input you build the hash value form and you can always recreate this id through the hash function.
However this isn't always true depending on the size of the table and the maximum number of hashvalues (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you should have a deterministic strategy you use each time you get two values with the same id. You should check Wikipedia for more info on this strategy, its called collision handling and should be mentioned in the same article as hash-tables.
MySQL probably uses hashtables somewhere because of the O(1) feature norheim.se (up) mentioned.

Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the persons first name. (A=0, B=1, C=2 etc), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approx one comparison, with the disclaimer of the special handling I just mentioned.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008