MySQL search Optimization for Search Fields - mysql

I have an MySQL database with 500,000 rows.
I have a list of 500,000 combination Strings such as First_Name and Last_Name.
I am trying to search the 500,000 rows with a similar query
select count(*) FROM data WHERE first='wadaw' AND last='wdvv';
The problem is that it takes too much time, I am using multiple threads and it doesn't seem to be very efficient considering the communication overhead between MySQL and the running time of the queries. I thought to start improve by changing the settings of my database to better fit my data and optimize MySQL database for it.
From my experience with search algorithms, an unsorted list would take n*log(n) with the most widely used methods and N with radix sort etc. which makes it n^2 or n^2log(n) for my case, which is not that good if you have 1,000,000 fields.
But with Binary search it would take Log(n) and thus n*log(n) for my case.
I am looking for a way to make the best out of my database.
Any suggestions?

Try using an index for both fields you are using. In your example:
create index idx_data_name_last on data (first, last);
That will use just one index and so the time will be log(n) and not n*log(n).

Related

How to store data for fast access in a MySQL table?

Suppose I have this fields:
first name, last name, city, state, zip, lat, lng
I'm trying to find a way to store the data so I can do queries like:
search by name or name & city
search by radius
Suppose I have millions of records, not sure if the best option is to put them all in one single table.
I could split them by state, but then I'll have to do a bunch of joins when searching.
Any ideas?
The more joins you have the slower things will get. If the data needs to be structured in a relational way for long term maintainability and extensibility than do so.
That MySQL table does not have to be where your fast search queries against.
You can MapReduce with Hadoop + Hive for high speed querying of lots of data, or, probably even easier to implement, you can build a Solr index that you query against.
In general, you will not achieve a very fast searchable large data set in MySQL or any relational database, MySQL is a great relational database store and will be fast for a while. But, depending on your searches and number of records, you will need the data to be grouped and aggregated differently and that is where something like a Solr or a NoSQL persistence/view of the data can get you the speeds you need.
A single table, with
INDEX(last_name, first_name, city)
INDEX(lat)
The first index handles your first pair of searches. The last one partially helps the search by radius; it might be good enough for only a million rows.
For a lot more rows, "nearest" requires a much more complex approach .
Some of the other approaches will simply brute-force search the entire list. That is not good for "nearest".

Use sphinx vs MySQL on no text search query

I have this doubt:
Suppose I have a one big table with a relationship to to a smaller table of users.
The idea is to search in that really big table for dates bigger than a given date and order by a score (big int, for example), and obtain related user info at the same time.
The result of this query can change every 10 minutes or so.
So, there is no text search, but I have a really big table. Should I use sphinx (or other search engine) or should I just use some MySQL indexes?
If I use sphinx, it's sure that I can obtain really fast results; but maybe having the index refreshed, even with delta indexing, doesn't make a big difference with MySQL indexing. At the same time, the changes in the table are not necessary new inserts, but updates; and I have read that real time indexing and delta index can give problems.
Maybe it would be better to use MySQL indexes, and help with some kind of caching to avoid unnecessary queries .
Just use MySQL, you definitely don't need Sphinx for what you are doing.

sorting 1 billion rows by one varchar column in MYSQL quickly

I have 1 billion rows stored in MYSQL, I need to output them alphabetically by the a varchar column, what's the most efficient way of go about it. using other linux utilites like sort awk are allowed.
MySQL can deal with a billion rows. Efficiency depends on 3 main factors: Buffers, Indexes and Joins.
Some suggestions:
Try to fit data set you’re working with in memory
Processing in memory is so much faster and you have whole bunch of problems solved just doing so. Use multiple servers to host portions of data set. Store portion of data you’re going to work with in temporary table etc.
Prefer full table scans to index accesses
For large data sets full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% or rows or less full table scan may be faster.
Avoid joins to large tables
Joining of large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables is OK but you might want to preload them to memory before join so there is no random IO needed to populate the caches.
Be aware of MySQL limitations which requires you to be extra careful working with large data sets. In MySQL, a query runs as a single thread (with exeption of MySQL Cluster) and MySQL issues IO requests one by one for query execution, which means if single query execution time is your concern many hard drives and large number of CPUs will not help.
Sometimes it is good idea to manually split query into several, run in parallel and aggregate result sets.
You did not give much info on your setup or your dataset, but this should give you a couple of clues on what to watch out for. In my opinion having the (properly tuned) database sort this for you would be faster than doing it programmatically unless you have very specific needs not mentioned in your post.
Have you just tried indexing the column and dumping them out? I'd try that first to see if the performance was inadequate before going exotic.
It depends on how you define efficient. CPU/Memory/IO/Time/Coding Effort. What is important in this case?
"select * from big_table order by the_varchar_column" That is probably the most efficient use of developer resources. Adding an index might make it run a lot faster.

Using MySQL to search through large data sets?

Now I'm a really advanced PHP developer and heavily knowledged on small-scale MySQL sets, however I'm now building a large infrastructure for a startup I've recently joined and their servers push around 1 million rows of data every day using their massive server power and previous architecture.
I need to know what is the best way to search through large data sets (it currently resides at 84.9 million) rows with a database size of 394.4 gigabytes. It is hosted using Amazon RDS so it does not have any downtime or slowness, it's just that I want to know what's the best way to access large data sets internally.
For example, if I wanted to search through a database of 84 million rows it takes me 6 minutes. Now, if I made a direct request to a specific id or title it would serve it instantly. So how would I search through a large data set.
Just to remind you, it's fast to find information through database by passing in one variable but when searching it performs VERY slow.
MySQL query example:
SELECT u.*, COUNT(*) AS user_count, f.* FROM users u LEFT JOIN friends f ON u.user_id=(f.friend_from||f.friend_to) WHERE u.user_name LIKE ('%james%smith%') GROUP BY u.signed_up LIMIT 0, 100
That query under 84m rows is sigificantly slow. Specifically 47.41 seconds to perform this query standalone, any ideas guys?
All I need is that challenge sorted and I'll be able to get the drift. Also, I know MySQL isn't very good for large data sets and something like Oracle or MSSQL however I've been told to rebuild it on MySQL rather than other database solutions at this moment.
LIKE is VERY slow for a variety of reasons:
Unless your LIKE expression starts with a constant, no index will be used.
E.g. LIKE ('james%smith%') is good, LIKE ('%james%smith%') is bad for indexing. Your example will NOT use any indexes on "user_name" field.
String matching is complex (algorythmically) business compared to regular operators.
To resolve:
Make sure your LIKE expression starts with a constant, not a wildcard, if you have an index on that field you might be able to use.
Consider making an index table (in the literature/library context of the word "index", not a database index context) if you search for whole words. Or a substring lookup table if searching for random often repeating substrings.
E.g. if all user names are of the form "FN LN" or "LN, FN" - split them up and store first names and/or last names in a dictionary table, joining to that table (and doing straight equality) in your query.
LIKE ('%james%smith%')
Avoid these things like the plague. They are impossible for a general DBMS to optimise.
The right way is to calculate things like this (first and last names) at the time where the data is inserted or updated so that the cost is amortised across all reads. This can be done by adding two new columns (indexed) and using insert/update triggers.
Or, if you want all words in the column, have the trigger break the data into words then have an application-level index table to find relevant records, something like:
main_table:
id integer primary key
blah blah blah
text varchar(60)
appl_index:
id index
word varchar(20)
primary key (id,word)
index (word)
Then you can query appl_index to find those ids that have both james and smith in them, far faster than the abominable like '%...'. You could also break the actual words out to a separate table and use word IDs but that's a matter of taste - it's effect on performance would be questionable.
You may well have a similar problems with f.friend_from||f.friend_to but I've not seen that syntax before (if, as it seems to be, the context is u.user_id can be one or the other).
Basically, if you want your databases to scale, don't do anything that even looks like a per-row function in your selects. Take that from someone who works with mainframe databases where 84 million rows is about the size of our config tables :-)
And, as with all optimisation questions, measure, don't guess!

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is, Is there anyway to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Paritioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from ver. 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
3 suggestions:
index
index
index
p.s. for timestamps you may run into performance issues -- depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers. (# secs since 1970 or whatever)
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MYSQL 5.1 you can use the new features.
but be warned they contain lot of bugs.
first you should use indexes.
if this is not enough you can try to split the tables by using partitioning.
if this also wont work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data, for instance pre-compute totals by hour, or by user, or by week, whatever, you get the idea, and store that in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, then, good for you !
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you got a few tens of millions of rows to delete, partitioning is great for that (just drop that old partition). It also groups all records from the same time period close together on disk so it's a lot more cache-efficient.
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there anyway to optimize the table or query so you can perform these queries
in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash-aggregates, or anything else useful really, basically the only thing it can do is nested-loop indexscan which is good on a cached table, and absolutely atrocious on other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL Postgres
58 s 100s INSERT
75s 51s CREATE INDEX on (category,id) (useless)
9.3s 5s SELECT category, sum(counter) FROM t GROUP BY category;
1.7s 0.5s SELECT category, sum(counter) FROM t WHERE id>15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' // wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is, MySQL will have to perform RAND() operation (which takes processing power) for every single row in the table before sorting it and giving you just 1 row.
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use the INET_ATON() to convert and IP to an integer, and INET_NTOA() for vice versa. There are also similar functions in PHP called ip2long() and long2ip().