MySQL optimization for a large database

I'm creating an ecommerce web application using PHP and MySQL (MyISAM). I want to know how to speed up my queries.
I have a products table with over a million records, with the following columns: id (int, primary key), catid (int), usrid (int), title (int), description (int), status (enum), date (datetime).
Recently I split this one table into multiple tables based on the product categories (catid), thinking that it might reduce the load on the server.
Now I need to fetch results from these tables combined, with the following sets of conditions:
1. results matching a usrid and status (to fetch a user's products)
2. results matching status and title or description (e.g. for a product search)
Currently I have to use UNION to fetch results from all these tables combined, which is slowing down performance, and I also can't apply a LIMIT to the combined result set. I thought of creating an index on all these columns to speed up searching, but that might slow down the INSERTs and UPDATEs. I'm also beginning to think that splitting the table was not a good idea in the first place.
I would like to know the best approach to optimize data retrieval in such a situation. I'm open to new database schema proposals as well.

To start: load test and turn on the MySQL slow query log.
Some other suggestions:
If you stay with separate tables per category, use UNION ALL instead of UNION. UNION implies distinctness, which makes the database engine do extra work to dedupe the rows unnecessarily.
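For example, something along these lines, assuming two of the per-category tables are named products_1 and products_2 (the names and the status value are only illustrative); note that in MySQL a trailing ORDER BY / LIMIT applies to the combined UNION result:

-- UNION ALL just concatenates the per-table results; UNION would also sort and dedupe them.
SELECT id, title, date FROM products_1 WHERE usrid = 42 AND status = 'active'
UNION ALL
SELECT id, title, date FROM products_2 WHERE usrid = 42 AND status = 'active'
ORDER BY date DESC
LIMIT 20;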
Indices do add a write penalty, but what you describe probably has a read-write ratio of at least 10 to 1, and probably more like 1000 to 1 or higher. So index. For the two queries you describe, I would probably create three indices (you'll need to study the explain plans to determine which column order is better):
usrid and status
status and title
status and description (is this an indexable field?)
Another note on indices: creating a covering index, that is, one that contains all the columns your query needs, can also be a useful solution if one of your frequent access patterns is retrieval by primary key.
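A rough sketch of those three indices, assuming a single combined products table (the index names, column order, and the prefix length on description are guesses to be validated with EXPLAIN):

ALTER TABLE products
  ADD INDEX idx_usr_status   (usrid, status),
  ADD INDEX idx_status_title (status, title),
  ADD INDEX idx_status_desc  (status, description(100));  -- prefix index; a FULLTEXT index may suit description searches better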

Have you considered using memcached? It caches the result sets from database queries on the server and returns them if they are requested again by other users. If it doesn't find a cached result set, only then will it query the database. It should alleviate the load on the database significantly.
http://memcached.org/


Optimised way to store large key value kind of data

I am working on a database that has a table user having columns user_id and user_service_id. My application needs to fetch all the users whose user_service_id is a particular value. Normally I would add an index to the user_service_id column and run a query like this :
select user_id from user where user_service_id = 2;
Since the cardinality of the user_service_id column is very low (around 3-4 distinct values) and the table has around 10M entries, the query will end up scanning almost the entire table.
I was wondering what the recommendation is for such use cases. Also, would it make more sense to move the data to a NoSQL datastore, as this doesn't seem to be an efficient use case for MySQL or any SQL datastore? I tried to search for this but couldn't find any recommendations. Can someone please help or provide the necessary references?
Thanks in advance.
That query needs this index, which is both "composite" and "covering":
INDEX(user_service_id, user_id) -- in this order
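For example (the index name is arbitrary):

ALTER TABLE user ADD INDEX idx_service_user (user_service_id, user_id);
-- "Covering" means the query below can be answered from the index alone, without touching the table rows:
SELECT user_id FROM user WHERE user_service_id = 2;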
But what will you do with the millions of rows that you get? Sounds like it will choke the client, whether it comes fast or slow.
See my Index Cookbook
"very dynamic" -- Not a problem.
"cache" -- the dynamic nature defeats caching.
"cardinality" -- not important, except to point out that there will be millions of rows.
"millions of rows" -- that takes time to deliver to the client. The number of rows delivered is the biggest factor in cost.
"select entire table, then filter in client" -- That will be even slower! (See "millions of rows".)

Whether an SQL query (SELECT) continues or stops reading data from the table when it finds the value

Greetings,
My question: does an SQL query (SELECT) continue or stop reading data (records) from the table when it finds the value I was looking for?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: in general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a LIMIT 1 clause to make sure that your result set has only one row. But then again, if you have an ORDER BY clause, the database engine cannot stop after the first hit; there might be later rows that should come first given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
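A quick way to see which case you are in is to run EXPLAIN against the query; a sketch using the example table above (the index name is assumed):

CREATE INDEX idx_a ON mytable (a);
EXPLAIN SELECT a, b, c FROM mytable WHERE a = 3 AND b = 5;
-- type = ALL                -> full table scan, every row examined
-- type = ref, key = idx_a   -> the engine pre-selects rows with a = 3 via the index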
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors can lead to very different storage access patterns while running the same query several times a couple of minutes apart.
Yes it scans the entire file. Unless you put something like
select * from user where id=100 limit 1
Of course this will still scan all the rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and searching would be optimized
I'm sorry... I thought the table.
I will change the question and explain it with the following image:
I understand that in CASE 1 all columns must be read with each iteration.
My question is: is it the same in CASE 2, or are the columns that are not selected in the query excluded from reading in each iteration?
Also, are both queries the same from a performance perspective?
Clarify:
CASE 1: the first SELECT returns all the data.
CASE 2: the second SELECT returns only the columns first_name and last_name.
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire row to get that data (first_name, last_name)?
What interests me is how the server reads a table row in CASE 1 and in CASE 2.

How 'and' and 'or' work in SQL

Imagine I have a database for a large website which has a table called 'users' that has a large number of records. When I execute a query such as SELECT * FROM users WHERE username='John' my understanding is that (ignoring caching etc.) the database would navigate the index and find the user(s) named John. Imagine this query returns 1 million results and I am only interested in users called John who are 25 years old, so I perform another query: SELECT * FROM users WHERE username='John' AND age=25
How does this work? Does it loop through all the users named John and find only those whose age is 25, or is there a better way of doing it? I assume this is database and storage engine specific, so we can assume I am using MySQL with InnoDB.
The answer is -- you're not supposed to ask this question. In a declarative language like SQL you describe the result desired and the processing engine determines the optimal way to produce the result. It may take different paths to get to the result depending on seemingly minor differences in the request, or the method used may change from version to version of the product, or even based on some factor completely unrelated to the product (available memory or disk space, for instance).
That said, the following is true of most SQL databases in most cases:
The database will use only one index in evaluating a WHERE clause.
If more than one index could be used to evaluate the WHERE clause the database will use statistics about the cardinality (distribution of values) in each index to select the "best" one.
If there is an index built from more than one column, and the head column(s) of that index are present in the filter conditions of the WHERE clause, that index can possibly be used to filter by multiple columns in a single index.
So, in your example, most databases would use indexes on either age or name to do the first-level filtering, then scan the resulting records to do the second level of filtering. The only exception would be if you had a compound index on (name, age) or (age, name) in which case only an index scan would be needed to find the records.
Assuming you have indexes on both columns, it generally examines the statistics of the data itself to choose an option that reduces the cardinality of the result set as quickly as possible.
For example, if 20% of people are aged 25 but only 3% are called John, it will get the Johns first then strip out those who are not aged 25.
If you have a composite key made up of both columns, then that should be even faster, since there's no "stripping" involved at all.
Bottom line, it comes down to the DB engine understanding the makeup of the data and choosing the best execution plan based on that. That's why it's often good to re-calculate statistics periodically, as the data may change.
If you have a query like this:
SELECT *
FROM users
WHERE username = 'John' AND age = 25;
Then the optimal index is users(username, age) or users(age, username). With this index, the matching records can be found just by looking them up in the index.
As for what happens if you only have an index on username: it would typically look up the rows with "John" in the username column, then fetch the records from the data pages and continue filtering based on the data on the pages.
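To see which plan you actually get, compare the EXPLAIN output before and after adding the compound index (the index name here is arbitrary):

ALTER TABLE users ADD INDEX idx_username_age (username, age);
EXPLAIN SELECT * FROM users WHERE username = 'John' AND age = 25;
-- With only an index on username: key = username, and age is checked row by row after the fetch.
-- With the compound index: key = idx_username_age, so far fewer rows are examined.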

MySQL query slow because of separate indexes?

Here is my situation. I have a MySQL MyISAM table containing about 4 million records, with a total of 13.3 GB of data. The table contains messages received from an external system. Two of the columns in the table keep track of a timestamp and a boolean indicating whether the message has been handled or not.
When using this query:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message
The result shows up almost instantly.
However, I need to find the earliest timestamp of unhandled messages, like this:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message WHERE handled = 0
The results of this query show up after about 3 minutes, which is way too slow for the script I'm writing.
Both columns are individually indexed, not together. However, adding an index to the table would take incredibly long considering the amount of data that is in there already.
Does my problem originate from the fact that both columns are indexed separately, and if so, does anyone have a solution to my issue other than adding another index?
It is commonly recommended that if the selectivity of an index is over 20%, a full table scan is preferable to an index access. This means it is likely that your index on handled alone won't actually be used; a full table scan will be done instead, given the selectivity.
A composite index on (handled, timestampCB) may actually improve performance. Because it is a composite index, MySQL would most likely still use it even if the selectivity isn't great, and even if it didn't, you could force its use.
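Something along these lines (the index name is arbitrary, and building the index on 4 million MyISAM rows will lock the table while it runs):

ALTER TABLE webshop_cb_onx_message ADD INDEX idx_handled_ts (handled, timestampCB);
-- With (handled, timestampCB) MySQL can read the first index entry for handled = 0
-- and return MIN(timestampCB) without scanning the table:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message WHERE handled = 0;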

What are some optimization techniques for MySQL table with 300+ million records?

I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so that these queries can be performed in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this natively from version 5.1; a rough sketch follows this list.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
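As a sketch of the partitioning idea, a table partitioned by month on its timestamp column might look roughly like this (the table and column names are invented; in MySQL 5.1 the partitioning key must be part of every unique key):

CREATE TABLE jmx_stats (
  ts        DATETIME NOT NULL,
  server_id INT NOT NULL,
  heap_used BIGINT,
  threads   INT,
  PRIMARY KEY (ts, server_id)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p2012_01 VALUES LESS THAN (TO_DAYS('2012-02-01')),
  PARTITION p2012_02 VALUES LESS THAN (TO_DAYS('2012-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);
-- Expiring a month of old data is then a near-instant metadata operation:
ALTER TABLE jmx_stats DROP PARTITION p2012_01;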
3 suggestions:
index
index
index
P.S. For timestamps you may run into performance issues; depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers (seconds since 1970 or whatever).
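For what it's worth, UNIX_TIMESTAMP() and FROM_UNIXTIME() convert between the two representations (the jmx_stats_raw table and its columns are made up for illustration):

-- ts_epoch is an INT UNSIGNED holding seconds since 1970:
INSERT INTO jmx_stats_raw (ts_epoch, heap_used) VALUES (UNIX_TIMESTAMP(), 123456789);
-- Convert back to a readable datetime when reporting:
SELECT FROM_UNIXTIME(ts_epoch) AS ts, heap_used
FROM jmx_stats_raw
WHERE ts_epoch > UNIX_TIMESTAMP() - 86400;  -- last 24 hours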
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features, but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables using partitioning.
If this also won't work, you can also try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data, for instance pre-compute totals by hour, or by user, or by week, whatever, you get the idea, and store that in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, then, good for you !
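A rough sketch of such a cache table, with invented table and column names (hourly averages per server, loaded for one completed hour at a time):

CREATE TABLE stats_hourly (
  hour        DATETIME NOT NULL,
  server_id   INT NOT NULL,
  avg_heap    BIGINT,
  max_threads INT,
  PRIMARY KEY (hour, server_id)
);

INSERT INTO stats_hourly (hour, server_id, avg_heap, max_threads)
SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:00:00') AS hr, server_id, AVG(heap_used), MAX(threads)
FROM raw_stats
WHERE ts >= '2012-01-01 10:00:00' AND ts < '2012-01-01 11:00:00'
GROUP BY hr, server_id;
-- Reporting queries then read stats_hourly instead of the 388-million-row raw table.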
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you got a few tens of millions of rows to delete, partitioning is great for that (just drop that old partition). It also groups all records from the same time period close together on disk so it's a lot more cache-efficient.
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash aggregates, or anything else useful really; basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases where random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example:
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL    Postgres
58 s     100 s      INSERT
75 s     51 s       CREATE INDEX on (category, id) (useless)
9.3 s    5 s        SELECT category, sum(counter) FROM t GROUP BY category;
1.7 s    0.5 s      SELECT category, sum(counter) FROM t WHERE id > 15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama'          -- wrong: fetches every matching row
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1  -- right: stops after the first match
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
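For instance, if an orders table joins to a customer table, the join columns should share the exact same type and both be indexed (these table and column names are just an example):

-- customer.id is INT UNSIGNED PRIMARY KEY; make orders.customer_id match it and index it:
ALTER TABLE orders
  MODIFY customer_id INT UNSIGNED NOT NULL,
  ADD INDEX idx_customer (customer_id);
SELECT c.id, COUNT(*)
FROM customer c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;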
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted, it takes additional code, but you will prevent a bottleneck that gets exponentially worse as your data grows. The problem is that MySQL has to evaluate RAND() (which takes processing power) for every single row in the table before sorting them and giving you just one row.
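One common workaround is to pick a random id first and then fetch a single row, rather than sorting the whole table; a sketch assuming an auto-increment id column without large gaps:

-- Instead of: SELECT * FROM user ORDER BY RAND() LIMIT 1;
SELECT @rand_id := FLOOR(1 + RAND() * (SELECT MAX(id) FROM user));
SELECT * FROM user WHERE id >= @rand_id ORDER BY id LIMIT 1;
-- Note: rows that follow gaps in the id sequence are slightly more likely to be picked.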
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for the reverse. There are also similar functions in PHP called ip2long() and long2ip().
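A sketch of what that looks like (the log_visits table and its columns are only illustrative):

-- An IPv4 address fits in 4 bytes as INT UNSIGNED instead of up to 15 characters as VARCHAR:
ALTER TABLE log_visits ADD COLUMN ip INT UNSIGNED;
UPDATE log_visits SET ip = INET_ATON('192.168.0.1') WHERE id = 1;
SELECT id, INET_NTOA(ip) AS ip_address FROM log_visits WHERE ip = INET_ATON('192.168.0.1');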