I have a large data set, over 40 GB, that I loaded into a MySQL table. I am trying to perform simple queries like select * from tablename, but it takes forever to run and eventually times out. If I set a limit, the execution is fairly fast, e.g. select * from tablename limit 1000.
The table has over 200 million records.
I tried creating indexes on some columns, and that failed too after 3 hours of execution.
Any tips on working with these types of datasets?
First thing you need to do is completely ignore all answers and comments advising some other, awesome, mumbo jumbo technology. It's absolute bullshit. Those things can't work in a different way because they're all constrained with the same problem - hardware.
Now, let's get back to MySQL. The problem with LIMIT is the OFFSET part: MySQL cannot jump straight to row N, so for LIMIT 1000 OFFSET 1000000 it reads the first 1,001,000 rows, throws away the first million and returns the rest. A plain SELECT * FROM my_table LIMIT 1000 with no ORDER BY stops after 1000 rows, which is why it is fast for you, but add a big offset or an ORDER BY on a non-indexed column and it has to go through far more of the table.
Yes, it takes time. Yes, it looks dumb. However, MySQL doesn't know where "start" or "end" are, so it can't skip ahead until you tell it with a condition it can use an index for.
To improve your search, you can use something like this (assuming you have numeric primary key):
SELECT * FROM tablename WHERE id < 10000 LIMIT 1000;
In this case, instead of 200 million rows, MySQL will work only with the rows whose PK is below 10,000. Much easier, much quicker, also readable. The numbers can be tweaked at any point, and if you do pagination of some sort in a scripting language, you can always pass along the last numeric id that was present so MySQL can start from that id onwards in its search.
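Here is a rough sketch of that "carry the last id forward" pagination. It's only an illustration: SQLite (from Python's standard library) stands in for MySQL so it runs anywhere, and the payload column, fetch_page and iterate_all are names I made up.

```python
import sqlite3

# SQLite stands in for MySQL; table contents are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO tablename (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 2501)],
)

def fetch_page(conn, last_id, page_size):
    """Return the next page of rows whose id is greater than last_id."""
    return conn.execute(
        "SELECT id, payload FROM tablename WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

def iterate_all(conn, page_size=1000):
    """Walk the table page by page, carrying the last seen id forward."""
    last_id = 0
    while True:
        page = fetch_page(conn, last_id, page_size)
        if not page:
            break
        yield page
        last_id = page[-1][0]  # the primary key of the last row on this page

page_sizes = [len(page) for page in iterate_all(conn)]
print(page_sizes)
```

Each query only touches the rows right after last_id, so page 200,000 costs about the same as page 1, unlike LIMIT with a growing OFFSET.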
Also, you should be using InnoDB engine, and tweak it using innodb_buffer_pool_size which is the magic sauce that makes MySQL fly.
For large databases, one should consider an alternative solution such as Apache Spark. MySQL reads the data from disk, which is a slow operation, and for whole-data-set work it is hard to beat a technology based on MapReduce. Take a look at this answer. It is true that with large databases, queries get very challenging.
Anyway assuming you want to stick with MySQL, first of all if you are using MyISAM, make sure to convert your database storage to InnoDB. This is especially important if you have lots of read/write operations.
It is also important to partition, which breaks the table into smaller, more manageable pieces. It can also improve index performance.
Do not be too generous with adding indexes. Define indexes wisely. If an index does not need to be UNIQUE do not define it as one. If an index does not need to include multiple fields do not include multiple fields.
Most importantly, start monitoring your MySQL instance. Use SHOW ENGINE INNODB STATUS to investigate its performance.
I am working on a database, and it's a pretty big one with 1.3 billion rows and around 35 columns. Here is what I get after checking the status of the table:
Name:Table Name
Engine:InnoDB
Version:10
Row_format:Compact
Rows:12853961
Avg_row_length:572
Data_length:7353663488
Max_data_length:0
Index_length:5877268480
Data_free:0
Auto_increment:12933138
Create_time:41271.0312615741
Update_time:NULL
Check_time:NULL
Collation:utf8_general_ci
Checksum:NULL
Create_options:
Comment:InnoDB free: 11489280 kB
The problem I am facing is that even a single select query takes too much time to process. For example, the query Select * from Table_Name limit 0,50000 takes around 2.48 minutes.
Is that expected?
I have to make a report in which I have to use the whole historical data, that is whole 1.3 bil rows. I could do this batch by batch but then I would have to run queries which are taking too much time many times again and again.
When the simple query is taking so much time I am not able to do any other complex query which needs joins and case statements.
A common practice is, if you have huge amount of data, you ...
should not SELECT * : You should only select the columns you want
should Limit your fetch range to a smaller number: I bet you won't handle 50000 records at the same time. Try to fetch it batch by batch.
A common problem many database administrators face. The solution: Caching.
Break the queries into simpler, smaller queries. Use Memcached or other caching techniques and tools. Memcached saves key-value pairs: check for the data in Memcached, and if it is available, use it. If not, fetch it from the database, use it, and cache it. Next time the data will be available from the cache.
You will have to develop own logic and change some queries. Memcached is available here:
http://memcached.org/
Many tutorials are available on the Web
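To make the check-cache-then-database flow concrete, here is a minimal sketch. It is not real Memcached client code: a plain dict plays the role of Memcached and SQLite the role of MySQL so it runs without any server, and the report table, key format and get_total are all hypothetical.

```python
import sqlite3

# Stand-ins: a dict instead of Memcached, SQLite instead of MySQL.
cache = {}
stats = {"db_hits": 0}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (day TEXT PRIMARY KEY, total INTEGER)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?)",
    [("2013-01-01", 120), ("2013-01-02", 95)],  # invented sample data
)

def get_total(day):
    """Try the cache first; on a miss, query the database and cache the result."""
    key = f"report:{day}"
    if key in cache:
        return cache[key]
    stats["db_hits"] += 1
    row = conn.execute("SELECT total FROM report WHERE day = ?", (day,)).fetchone()
    value = row[0] if row else None
    cache[key] = value
    return value

first = get_total("2013-01-01")   # miss: goes to the database
second = get_total("2013-01-01")  # hit: served from the cache
```

With a real Memcached you would also pick an expiry time and decide how to invalidate entries when the underlying rows change; that part is the "own logic" you have to develop.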
Enable the slow query log in your my.cnf for queries slower than N seconds, then execute some queries and watch this log. This gives you some clues, and maybe you could add some indexes to this table.
Or run some queries with EXPLAIN. http://hackmysql.com/case1
A quick note that is usually an easy win ...
If you have any columns that are large text blobs, try selecting everything except for those fields. I've seen varchar(max) fields absolutely kill query efficiency.
You have a very wide average row size and 35 columns. You could try vertically partitioning the table, that is, split the table up into smaller tables that are related to each other 1:1 with a subset of columns from the table. InnoDB stores rows in pages and is not efficient for very wide rows.
If the data is append-only consider looking at ICE.
You might also look at TokuDB because it supports good compression.
You can consider using partitioning and Shard-Query (http://code.google.com/p/shard-query) to access data in parallel. You can also split data over more than one server for parallelism using Shard-Query.
Try adding WHERE clause: WHERE 1=1
If it doesn't give any effect then you should change your engine type to MyISAM.
We got this data model. Knowing the limited tree depth, our current tables are 1:1 to the model, with foreign keys to the parent node. Channel to Station, Measurement to Channel and Station. 90% of the queries is:
select value from measurements where
fk_station=X and fk_channel=Y and timestamp>=A and timestamp<=B
order by timestamp asc
The rest 10% is similar on the other timestamped tables, only simpler due to missing fk_channel.
Problem we are facing: there are hundreds of millions of unique [station,channel,timestamp] rows in the Measurement table, and growing. The timestamp index was already so huge and the ordering clause so slow that we had to start splitting it per Station Id; so we have tables Measurement_<Station Id> and the Station foreign key is left out. It helped significantly, but some tables still have tens of millions of rows. In load peaks we get around 80000 queries/minute, and queries on these bigger tables are noticeably slower. We still run from one MySQL/ISAM instance without any fancy optimization hacks. About 150GB on the filesystem.
is there any significantly different/better way to store such data model?
with the current structure, is it normal that we got this kind of performance hiccups with this size/load? The machine is today's average hw, no embedded atom neither 8+ core beast
was the splitting of Measurement table the right thing to do? We are no SQL gurus, but the query and the required index seemed so obvious that we didn't even consider "optimizing" it. Splitting helped a lot, but something else might too
is there any other way of speeding up the index? It's kinda stupid that we must do the same index walking over and over, getting subsets of the same result. We won't ever use any other indexing, not even change to desc. It's very specialized appliance. Would be nice if the index is somehow "native order" :-)
would it help to distribute/shard the split Measurement tables? As I said, some tables are still huge and the problem seems to be the index size, which distribution won't help, so perhaps it would just lower the query load...
Simple rules to think about in relational dbs like mysql:
Fetching too much data is never fast. Aggregating it can be. - your sample query is not aggregating anything. Makes me wonder if you crunch and aggregate those values in your application. Hint: Aggregate using a column store engine, e.g. infinidb; it supports parallelism in query execution too, innodb doesn't.
Sorting huge amounts of data is never fast - ask yourself, if the query returns 100K records, how much can your crunching job/frontend grid etc. consume? Can a web user consume 100K records on screen? Not really, so LIMIT it. Moreover, sort by auto increment ID instead of timestamp. Relational db engines are not good at sorting huge chunks of data; you will hit the ceiling soon.
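On the sorting point, one thing worth checking for the query in the question is whether a single composite index on (fk_station, fk_channel, timestamp) lets the engine read rows already in timestamp order, so no separate sort step is needed. The sketch below only demonstrates the idea on SQLite (used here so it runs anywhere); the index name and data are made up, and MySQL's optimizer has to be checked separately with its own EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " fk_station INTEGER, fk_channel INTEGER, timestamp INTEGER, value REAL)"
)
# Equality columns first, then the range/ordering column.
conn.execute(
    "CREATE INDEX idx_station_channel_ts"
    " ON measurements (fk_station, fk_channel, timestamp)"
)
rows = [(s, c, t, float(t)) for s in range(3) for c in range(3) for t in range(100)]
conn.executemany("INSERT INTO measurements VALUES (?, ?, ?, ?)", rows)

query = (
    "SELECT value FROM measurements"
    " WHERE fk_station = ? AND fk_channel = ?"
    " AND timestamp >= ? AND timestamp <= ?"
    " ORDER BY timestamp ASC"
)
params = (1, 2, 10, 19)

# The last column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params)]
values = [v for (v,) in conn.execute(query, params)]
for step in plan:
    print(step)
```

If the plan mentions the index and no temporary sort structure, the ORDER BY is being served straight from the index walk.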
Is there a possibility that splitting up Measurement data over more than one table can reduce the size? If 90% of the queries are over the last 24 hours of timestamps, then you might want to finetune that data, and store the rest in a separate table or even database. I believe the Measurement should have a FK only to Channel, which has only its ID as PK, and an FK to Station.
I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count, you need to put an index on the group by field instead.
If you have a field named count
Then putting an index in this query may speed up the query, it may also make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the where clause. If it's more than roughly 30%, MySQL (or any SQL database for that matter) will usually decline to use the index. The cutoff is not a fixed number; the optimizer estimates the cost.
It will just stubbornly insist on doing a full table scan (i.e. read all rows).
This is because using an index means reading in 2 places (first the index, then the table itself for the actual data).
If you select a large percentage of rows, the extra index reads are not worth it and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed up this query a lot.
Know your data
Using explain select will tell you what indexes MySQL has available and which one it picked and (kind of/sort of in a complicated kind of way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
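A quick way to "know your data" before adding the index is to measure what fraction of rows the WHERE clause actually selects. This is just a sketch on SQLite with invented data; the selectivity helper is my own name, not a MySQL feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE table1 (id INTEGER PRIMARY KEY, "count" INTEGER)')
# Invented distribution: values 0..49, each appearing 20 times.
conn.executemany(
    'INSERT INTO table1 ("count") VALUES (?)',
    [(n % 50,) for n in range(1000)],
)

def selectivity(conn, threshold):
    """Fraction of rows where count > threshold (a comparison sums as 0 or 1)."""
    matching, total = conn.execute(
        'SELECT SUM("count" > ?), COUNT(*) FROM table1', (threshold,)
    ).fetchone()
    return matching / total

print(selectivity(conn, 10))  # most rows match: an index is unlikely to help
print(selectivity(conn, 45))  # few rows match: an index should pay off
```

The same SUM(condition) / COUNT(*) query works in MySQL, but on a very large table it is itself a full scan, so run it once or on a sample.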
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough - if more than for example 10% of your users have count > 10 the fastest query plan might be to not use the index and just scan the entire table.
I have 1 billion rows stored in MySQL. I need to output them alphabetically by a varchar column. What's the most efficient way to go about it? Using other Linux utilities like sort and awk is allowed.
MySQL can deal with a billion rows. Efficiency depends on 3 main factors: Buffers, Indexes and Joins.
Some suggestions:
Try to fit data set you’re working with in memory
Processing in memory is so much faster and you have whole bunch of problems solved just doing so. Use multiple servers to host portions of data set. Store portion of data you’re going to work with in temporary table etc.
Prefer full table scans to index accesses
For large data sets full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% of rows or less, a full table scan may be faster.
Avoid joins to large tables
Joining of large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables is OK but you might want to preload them to memory before join so there is no random IO needed to populate the caches.
Be aware of MySQL limitations which require you to be extra careful working with large data sets. In MySQL, a query runs as a single thread (with the exception of MySQL Cluster) and MySQL issues IO requests one by one for query execution, which means that if single query execution time is your concern, many hard drives and a large number of CPUs will not help.
Sometimes it is good idea to manually split query into several, run in parallel and aggregate result sets.
You did not give much info on your setup or your dataset, but this should give you a couple of clues on what to watch out for. In my opinion having the (properly tuned) database sort this for you would be faster than doing it programmatically unless you have very specific needs not mentioned in your post.
Have you just tried indexing the column and dumping them out? I'd try that first to see if the performance was inadequate before going exotic.
It depends on how you define efficient. CPU/Memory/IO/Time/Coding Effort. What is important in this case?
"select * from big_table order by the_varchar_column" That is probably the most efficient use of developer resources. Adding an index might make it run a lot faster.
I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL supports this internally from version 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
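As an illustration of the summary-table idea, here is a sketch that rolls per-minute samples up into hourly rows and then queries only the small table. Everything in it is assumed for the example (the jmx_stats layout, server names, integer epoch timestamps), and it uses SQLite so it can run as is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jmx_stats (server TEXT, ts INTEGER, heap_mb INTEGER)")
# One invented sample per minute for two hypothetical servers over three hours.
samples = [
    (srv, minute * 60, 100 + minute % 60)
    for srv in ("app1", "app2")
    for minute in range(180)
]
conn.executemany("INSERT INTO jmx_stats VALUES (?, ?, ?)", samples)

# Build the summary table once (or refresh it periodically from new rows).
# Note: ts / 3600 is integer division in SQLite; in MySQL use ts DIV 3600.
conn.execute(
    "CREATE TABLE jmx_stats_hourly AS"
    " SELECT server, ts / 3600 AS hour, AVG(heap_mb) AS avg_heap_mb,"
    "        MAX(heap_mb) AS max_heap_mb, COUNT(*) AS samples"
    " FROM jmx_stats GROUP BY server, ts / 3600"
)

# Graph queries now read a handful of summary rows instead of the raw samples.
hourly = conn.execute(
    "SELECT hour, avg_heap_mb, max_heap_mb FROM jmx_stats_hourly"
    " WHERE server = ? ORDER BY hour",
    ("app1",),
).fetchall()
summary_rows = conn.execute("SELECT COUNT(*) FROM jmx_stats_hourly").fetchone()[0]
```

Here 360 raw rows collapse into 6 summary rows; the ratio on real data depends on your sampling interval and the granularity the graphs need.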
3 suggestions:
index
index
index
p.s. for timestamps you may run into performance issues -- depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers. (# secs since 1970 or whatever)
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features,
but be warned, they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables using partitioning.
If this also won't work, you can try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate the data, for instance pre-compute totals by hour, or by user, or by week, whatever, you get the idea, and store that in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, then, good for you !
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days ?
Deleting old data from tables can be horribly slow if you got a few tens of millions of rows to delete, partitioning is great for that (just drop that old partition). It also groups all records from the same time period close together on disk so it's a lot more cache-efficient.
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there anyway to optimize the table or query so you can perform these queries
in a reasonable amount of time?
That depends on the table and the queries ; can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash-aggregates, or anything else useful really, basically the only thing it can do is nested-loop indexscan which is good on a cached table, and absolutely atrocious on other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example :
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL   Postgres   Statement
58s     100s       INSERT
75s     51s        CREATE INDEX on (category,id) (useless)
9.3s    5s         SELECT category, sum(counter) FROM t GROUP BY category;
1.7s    0.5s       SELECT category, sum(counter) FROM t WHERE id>15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' -- wrong: fetches every matching row
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1 -- better: stops at the first match
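From application code, the LIMIT 1 form turns into a cheap yes/no check. A small sketch, with SQLite standing in for MySQL and has_user_from being a made-up helper name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("CREATE INDEX idx_user_state ON user (state)")
conn.executemany(
    "INSERT INTO user (state) VALUES (?)",
    [("Alabama",), ("Texas",), ("Alabama",)],
)

def has_user_from(conn, state):
    """Return True if at least one user row has the given state."""
    row = conn.execute(
        "SELECT 1 FROM user WHERE state = ? LIMIT 1", (state,)
    ).fetchone()
    return row is not None

print(has_user_from(conn, "Alabama"))
print(has_user_from(conn, "Ohio"))
```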
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted it takes additional code, but you will prevent a bottleneck that gets steadily worse as your data grows. The problem is, MySQL will have to perform the RAND() operation (which takes processing power) for every single row in the table before sorting it and giving you just 1 row.
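One of those "additional code" approaches, sketched under the assumption that the table has a numeric primary key: pick a random id between MIN(id) and MAX(id) in the application, then fetch the first row at or after it. The items table and random_row are invented for the example, and SQLite is used so it runs as is.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
# Leave gaps in the ids (every 10th is missing) to mimic deleted rows.
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item {i}") for i in range(1, 1001) if i % 10 != 0],
)

def random_row(conn, rng=random):
    """Pick a random id in [MIN(id), MAX(id)] and return the first row at or after it."""
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM items").fetchone()
    if low is None:
        return None  # empty table
    target = rng.randint(low, high)
    return conn.execute(
        "SELECT id, name FROM items WHERE id >= ? ORDER BY id LIMIT 1",
        (target,),
    ).fetchone()

print(random_row(conn))
```

Both queries are primary key lookups, so the cost does not grow with the table. The trade-off is that rows sitting right after a gap in the ids get picked a bit more often, so it is not perfectly uniform.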
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored like TINYINT, yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for vice versa. There are also similar functions in PHP called ip2long() and long2ip().
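If the conversion happens in application code rather than in SQL, the same dotted-quad to integer mapping is easy to do there. A sketch in Python using the standard ipaddress module; ip_to_int and int_to_ip are just names chosen for the example, and it covers IPv4 only.

```python
import ipaddress

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its 32-bit unsigned integer value."""
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(number):
    """Convert a 32-bit unsigned integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(number))

as_int = ip_to_int("192.168.0.1")
print(as_int, int_to_ip(as_int))
```

The largest value is 4294967295 (255.255.255.255), which is why the column has to be UNSIGNED INT and not a signed INT.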