Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I have a very large table, around 130 GB in size, and new data is dumped into it every day.
I'd like to optimize the table. Can anyone suggest how I should go about it?
Any input will be a great help.
It depends how you are trying to optimize it.
For query speed, appropriate indexes, including multi-column indexes, are a very good place to start. Run EXPLAIN on all your queries to see what is taking so much time. Optimize the code that reads the data so that it stores results instead of requerying.
If old data is less important or you're getting too much data to handle, you can rotate tables by year, month, week, or day. That way writes always go to a fairly small table. The older tables are all dated (e.g. tablefoo_2011_04) so that you have a backlog.
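As a rough illustration of the rotation scheme above, the application can derive the table to write to from the current date. This is only a minimal sketch: the `tablefoo` base name follows the example above, while the helper name and the monthly granularity are assumptions.

```python
from datetime import date

def rotated_table_name(base: str, day: date) -> str:
    """Return the monthly table for a given date, e.g. tablefoo_2011_04."""
    return f"{base}_{day.year:04d}_{day.month:02d}"

# Every write made during April 2011 goes to the same, comparatively small table.
current = rotated_table_name("tablefoo", date(2011, 4, 17))
print(current)
```

Older months then become read-only tables that can be archived, summarized or dropped independently.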
If you are trying to optimize size in the same table, make sure you are using appropriate types. If you get variable length strings, use a varchar instead of statically sized data. Don't use strings for status indicators, use an enum or int with a secondary lookup table.
The server should have a lot of ram so that it's not going to disk all the time.
You can also look at using a caching layer such as memcached.
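To make the caching idea concrete, here is a minimal read-through cache sketch. It uses an in-process dictionary as a stand-in for memcached, and `slow_query` is a hypothetical placeholder for an expensive SELECT; neither is a real client API.

```python
import time

class ReadThroughCache:
    """Tiny in-process stand-in for a cache such as memcached (illustrative only)."""

    def __init__(self, loader, ttl_seconds=60.0):
        self._loader = loader      # function that runs the real query
        self._ttl = ttl_seconds
        self._entries = {}         # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no database round trip
        value = self._loader(key)  # cache miss: query once, then remember
        self._entries[key] = (now + self._ttl, value)
        return value

calls = []

def slow_query(key):
    # Hypothetical placeholder for an expensive SELECT against the big table.
    calls.append(key)
    return f"result for {key}"

cache = ReadThroughCache(slow_query)
first = cache.get("daily_totals")
second = cache.get("daily_totals")  # served from the cache, no second query
```

The time-to-live bounds how stale a cached result can get, which matters when new data arrives every day.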
More information about what the actual problem is, your situation, and what you are trying to optimize for would be helpful.
If your table is a sort of logging table, there are several strategies for optimizing it.
(1) Store essential data only.
If the table has non-essential (nullable) columns that are not used for aggregation or analytics, move them to another table. Keep the main table small.
Ex) Don't store the raw HTTP_USER_AGENT string. Preprocess the agent string and store only the smaller data you actually want to review.
(2) Make the table as fixed format.
Use CHAR rather than VARCHAR for almost-fixed-length strings. This can help speed up SELECT queries.
Ex) ip VARCHAR(15) => ip CHAR(15)
(3) Summarize old data and move it into other tables periodically.
If you don't have to review the whole dataset every day, divide it into per-period tables (year/month/day) and store summarized data for the old ones.
Ex) Table_2011_11 / Table_2011_11_28
(4) Don't use too many indexes on a big table.
Too many indexes put a heavy load on INSERT queries.
(5) Use the ARCHIVE engine.
MySQL supports the ARCHIVE storage engine, which compresses data with zlib.
http://dev.mysql.com/doc/refman/5.0/en/archive-storage-engine.html
It generally suits logging (AFAIK): the lack of REPLACE, DELETE and UPDATE is not a big problem for logs.
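As an illustration of strategy (1), the raw agent string can be reduced to a short label before it is stored. The sketch below is a deliberately crude heuristic, and the function name and label set are assumptions; a real deployment would decide which details it actually wants to review.

```python
def browser_family(user_agent: str) -> str:
    """Reduce a raw User-Agent string to a short label (very rough heuristic)."""
    ua = user_agent.lower()
    # Order matters: Chrome's agent string also contains "safari".
    for token, label in (("firefox", "firefox"), ("chrome", "chrome"),
                         ("safari", "safari"), ("msie", "ie")):
        if token in ua:
            return label
    return "other"

raw = "Mozilla/5.0 (X11; Linux x86_64; rv:8.0) Gecko/20100101 Firefox/8.0"
label = browser_family(raw)  # a few bytes instead of the full string
```

Storing the label (or a small integer code for it) keeps the main table narrow while preserving what is needed for aggregation.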
You should show us what your SHOW CREATE TABLE tablename outputs so we can see the columns, indexes and so on.
At first glance, it seems MySQL's partitioning is what you need to implement in order to increase performance further.
A few possible strategies.
If the dataset is that large, it may be useful to store certain information redundantly: keep cache tables if certain records are accessed much more frequently than others, denormalize information (either to limit the number of joins or to create tables with fewer columns, so you have a lean table to keep in memory at all times), or keep summaries for fast lookup of totals.
The summary table(s) can be kept in sync either by regenerating them periodically or by using triggers. You can even combine both: keep a cache table for the latest day, on which you calculate actual totals, and summaries for the historical data. That gives you full precision without having to read the full index. Test to see what delivers the best performance in your situation.
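The combination of historical summaries and exact counts for the latest day can be sketched as follows. This uses Python's built-in SQLite as a stand-in for MySQL so that it runs anywhere; the table and column names (`hits`, `daily_summary`) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hits (day TEXT, page TEXT);
    CREATE TABLE daily_summary (day TEXT, page TEXT, total INTEGER);
""")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [
    ("2011-11-27", "/a"), ("2011-11-27", "/a"), ("2011-11-27", "/b"),
    ("2011-11-28", "/a"),  # the latest, still incomplete day
])
today = "2011-11-28"

# Periodic job: roll every completed day up into the summary table.
conn.execute("""
    INSERT INTO daily_summary
    SELECT day, page, COUNT(*) FROM hits WHERE day < ? GROUP BY day, page
""", (today,))

# Total for a page = historical summaries + exact count over today's raw rows.
total_a = conn.execute("""
    SELECT (SELECT COALESCE(SUM(total), 0) FROM daily_summary WHERE page = :p)
         + (SELECT COUNT(*) FROM hits WHERE page = :p AND day = :d)
""", {"p": "/a", "d": today}).fetchone()[0]
```

Once a day has been summarized, its raw rows no longer need to be scanned for totals, so they can be moved or archived.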
Splitting your table by periods is certainly an option. It's like partitioning, but Mayflower Blog advises to do it yourself as the MySQL implementation seems to have certain limitations.
In addition: if the data in those historical tables never changes and you want to reduce space, you could use myisampack. Indexes are supported (you have to rebuild them) and a performance gain is reported, but I suspect you would gain speed when reading individual rows and lose performance on large reads (as lots of rows need unpacking).
And last: think about what you need from the historical data. Does it need exactly the same information you keep for more recent entries, or are there things that just aren't important anymore? I could imagine, if you have an access log for example, that it stores all sorts of information like IP, referral URL, requested URL, user agent... Perhaps in 5 years' time the user agent isn't interesting at all, it's fine to combine all requests from one IP for one page + CSS + JavaScript + images into one entry (perhaps with a separate many-to-one table for the precise files), and the referral URLs only need a number of occurrences and can be decoupled from the exact time or IP.
Don't forget to consider the speed of the medium on which the data is stored. I think you can use raid disks to speed up access or maybe store the table in RAM but at 130GB that might be a challenge! Then consider the processor too. I realise this isn't a direct answer to your question but it may help achieve your aims.
You can still try to do partitioning using tablespaces or "table-per-period" structure as #Evan advised.
If your fulltext searching is failing, maybe you should move to Sphinx/Lucene/Solr. External search engines can definitely help you get faster.
If we are talking about table structure, then you should use the smallest datatype possible.
If OPTIMIZE TABLE is too slow, which is true for really big tables, you can back up the table and restore it. Of course, in that case you will need some downtime.
As a bottom line:
If your issue concerns fulltext searching, then try an external search engine before applying any table changes.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
These days I'm reading about different ways to manage a huge dataset in the MySQL database.
To be honest, at the moment, I'm confused. I read some concepts about the mentioned issue but I don't know how they are related to each other?
Please take a look at these:
Partitioning - Which is a plugin
Clustering - Named NDB I guess
Sharding - Which is a concept I think and nothing implementable
The scenario is storing/maintaining/searching a huge set of data (assume a table with 5 billion rows) in MySQL. So we have to take apart the dataset, but how?
I've a few questions:
How much overlap is there between those three items above?
In partitioning, will all parts be stored on the same machine (server), or can they be kept on different machines?
How do I detect which partition the data is stored in? (in order to look up the data accordingly)
I know partitioning is for "tables", is clustering for "databases"?
With sharding, do we replicate the data to different servers, or do we keep different data on different servers? Also, does it happen at the "table" layer or the "database" layer?
How will different parts (clusters/partitions) see each other when needed? For example, when we need a join clause on the whole table, assuming the data is split across different partitions/machines.
To use clustering, do I need to install a different edition (version) of MySQL? Isn't it supported by the normal edition?
Anyway, I've read about them over 3 days, and the main concept is still ambiguous for me.
A quick comparison:
| description | nr of servers | redundant? | a goal |
|---|---|---|---|
| partitioning | 1 | No | time series |
| clustering | >= 3 | Yes | recovery |
| sharding | >1 | No | write scaling |
Sharding is divvying up the data across multiple servers.
How much overlap is there between those three items above?
A: Very little. Each divvies the data up in different ways for different goals.
In partitioning, will all parts be stored on the same machine (server), or can they be kept on different machines?
A: In partitioning, all parts will be stored on the same instance on the same machine (server).
How do I detect which partition the data is stored in?
A: When practical, provide a WHERE clause that pinpoints which partition(s) are needed. (See "partition pruning")
I know partitioning is for "tables", is clustering for "databases"?
A: I think you could describe it that way. Clustering (also) has the advantage of having a second copy on a different piece of hardware.
With sharding, do we replicate the data to different servers, or do we keep different data on different servers? Also, does it happen at the "table" layer or the "database" layer?
A: No, the data is not replicated; each server holds different data. Typically the largest table is split up in some arbitrary way -- some rows are put on each shard. Then clients must know how that split-up was done to know which server to talk to. (There is no canned code for this vital task.) Smaller tables are either copied onto all shards or put onto other machine(s).
How will different parts (clusters/partitions) see each other when needed? For example, when we need a join clause on the whole table, assuming the data is split across different partitions/machines.
A: A JOIN works on only one server. (MariaDB has "FEDERATEDX", but that is a costly workaround.) For Partitioning, the query sees the many partitions as one big table, so JOIN is not a problem. For Clustering, everything is on each server, so no problem. For Sharding, a JOIN is fine within the constraint that each server has only part of the big table.
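Since the sharding answer notes that clients must know how the split was done, here is one possible routing rule as a sketch: hash the shard key and take it modulo the number of servers. The host names and the choice of hash are assumptions for illustration, not something MySQL provides.

```python
import zlib

# Hypothetical shard servers; in practice this list comes from configuration.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Pick the server for a row by hashing its shard key."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]

server = shard_for("customer:1234")
```

Every client must apply the same rule, and changing the number of shards moves most keys, which is part of why the split needs careful planning.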
BTW: read this: How to handle a question that asks many things
At the moment I have a MySQL database, and the data I am collecting amounts to 5 terabytes a year. I will keep my data indefinitely; I don't think I want to delete anything very early.
I am asking myself whether I should use a distributed database, because my data will grow every year. After 5 years I will have 25 terabytes without indexes. (I just calculated the raw data I save every day.)
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly connected.
I know it depends on the queries and on the database table design, and I can also have a distributed MySQL database.
I just want to know when I should think about a distributed database.
Would this be a use case? Or could MySQL handle this large dataset?
EDIT:
On average I will have 1500 clients writing data per second, and they affect all tables.
I only need the old dataset for analytics, like machine learning and pattern matching.
Also, a client should be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
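The difference between sequential and random index keys can be illustrated with a toy simulation. This is not a model of InnoDB's buffer pool: it simply treats the index as a set of blocks, uses an LRU cache, and counts how often the block to be modified is already cached. All sizes are assumptions chosen for the example.

```python
import random
from collections import OrderedDict

def hit_rate(block_ids, cache_blocks):
    """Fraction of block accesses served from an LRU cache of the given size."""
    cache = OrderedDict()
    hits = 0
    for block in block_ids:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(block_ids)

ROWS = 100_000
KEYS_PER_BLOCK = 100
TOTAL_BLOCKS = 10_000  # hypothetical index size, far larger than the cache
CACHE_BLOCKS = 100

# Sequential keys (AUTO_INCREMENT-like): each new row lands in the current last block.
sequential = [i // KEYS_PER_BLOCK for i in range(ROWS)]

# Random keys (UUID-like): each new row lands in an arbitrary block of the index.
rng = random.Random(42)
scattered = [rng.randrange(TOTAL_BLOCKS) for _ in range(ROWS)]

seq_rate = hit_rate(sequential, CACHE_BLOCKS)
rand_rate = hit_rate(scattered, CACHE_BLOCKS)
print(f"sequential: {seq_rate:.2%}, random: {rand_rate:.2%}")
```

With these numbers the sequential case misses only on the first touch of each block, while the random case misses almost every time, and each miss stands for disk I/O.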
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
Is the 5TB estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such an action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
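The storage estimate above can be written out as plain arithmetic. The factors are the ones quoted in the text (2 to 3 for InnoDB overhead, 2 for a table copy during a schema change); the variable names are only illustrative.

```python
raw_tb_per_year = 5
years = 5
raw_tb = raw_tb_per_year * years  # 25 TB of raw data after 5 years

innodb_low, innodb_high = 2, 3    # InnoDB footprint vs. raw size
alter_copy_factor = 2             # room for a full table copy

low_tb = raw_tb * innodb_low * alter_copy_factor
high_tb = raw_tb * innodb_high * alter_copy_factor
print(f"plan for roughly {low_tb} to {high_tb} TB")
```

The "more like 100TB" figure corresponds to the low end of this range.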
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down long before you get to 5TB. If it's 5TB / year of highly indexable data, it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-of, kind-of out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.
A question concerning MySQL and optimization of a large table. The MySQL server is running on a machine with limited capacity and we need to optimize it as much as possible.
We are sampling data at a rate of one measurement per second and we use that to draw graphs on a web application.
Currently all that data is inside a single table, and we end up with hundreds of millions of data points.
We have several data sources, each of which has two ids: one for the position and one for the source itself. We use both ids together as a unique id, and we don't use a MySQL-generated id, to reduce the size of the data. We use the POSIX timestamp plus both ids together as the table's primary key, and we use them to query the DB. Those ids are not generated by SQL.
Usually we plot graphs using about 400 points per time segment, for several sources.
Question:
What would be the best optimization for such a design?
First question: is it better to keep all the data inside a single table or to split it into several tables? The latter has the disadvantage of complicating the code, as we would have dynamic tables.
If it's better to keep it in a single table, is it a correct approach to use a primary key based on the ids and the POSIX timestamp?
Are there specific MySQL optimizations for this purpose?
Thanks
If I understood correctly, the best optimization for this situation would be a distributed database. More specifically, I would apply horizontal partitioning to the table you mention.
Roughly speaking, this is a method of dividing your table into fragments according to some specific criteria, so that your queries don't have to process a huge amount of data all at once. You can use this to process only the data relevant to a specific query, or even to process all the data in parallel.
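One way to picture "process only relevant data" is to fragment by time and work out which fragments a query range touches. The sketch below assumes monthly fragments and a hypothetical `samples_YYYY_MM` naming scheme; neither comes from the question.

```python
from datetime import datetime, timezone

def fragments_for_range(start_ts, end_ts):
    """List the monthly fragments a query over [start_ts, end_ts] must read."""
    start = datetime.fromtimestamp(start_ts, tz=timezone.utc)
    end = datetime.fromtimestamp(end_ts, tz=timezone.utc)
    year, month = start.year, start.month
    names = []
    while (year, month) <= (end.year, end.month):
        names.append(f"samples_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

# A two-day window across a month boundary touches only two fragments.
start_ts = int(datetime(2011, 11, 30, tzinfo=timezone.utc).timestamp())
end_ts = int(datetime(2011, 12, 2, tzinfo=timezone.utc).timestamp())
touched = fragments_for_range(start_ts, end_ts)
```

A graph covering a short time segment then reads one or two fragments instead of hundreds of millions of rows.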
Allow me not to explain any further, since I'm not sure that is exactly what you want and need, and also because you would probably do better reading about this matter at your own pace. I hope this helps by giving you a starting point, though.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
I have many text files, their total size is about 300GB ~ 400GB. They are all in this format
key1 value_a
key1 value_b
key1 value_c
key2 value_d
key3 value_e
....
Each line consists of a key and a value. I want to create a database that lets me query all the values of a key. For example, when I query key1, value_a, value_b and value_c are returned.
First of all, inserting all these files into the database is a big problem. I tried inserting a chunk a few GB in size into a MySQL MyISAM table with the LOAD DATA INFILE syntax, but it appears MySQL can't use multiple cores for inserting data. It's as slow as hell. So I think MySQL is not a good choice here for so many records.
Also, I need to update or recreate the database periodically: weekly, or even daily if possible. Insertion speed is therefore important to me.
A single node cannot do the computing and insertion efficiently; to be efficient, I think it's better to perform the insertion on different nodes in parallel.
For example,
node1 -> compute and store 0-99999.txt
node2 -> compute and store 100000-199999.txt
node3 -> compute and store 200000-299999.txt
....
So, here comes the first criterion.
Criterion 1. Fast insertion speed in a distributed, batch manner.
Then, as you can see in the text file example, the same key must be able to map to multiple different values, just like key1 maps to value_a/value_b/value_c in the example.
Criterion 2. Duplicate keys are allowed.
Then, I will need to query keys in the database. No relational or complex join queries are required; all I need is simple key/value querying. The important part is that one key can map to multiple values.
Criterion 3. Simple and fast key/value querying.
I know there are HBase/Cassandra/MongoDB/Redis.... and so on, but I'm not familiar with all of them and not sure which one fits my needs. So, the question is - what database to use? If none of them fits my needs, I even plan to build my own, but that takes effort :/
Thanks.
There are probably a lot of systems that would fit your needs. Your requirements make things pleasantly easy in a couple ways:
Because you don't need any cross-key operations, you could use multiple databases, dividing keys between them via hash or range sharding. This is an easy way to solve the lack of parallelism that you observed with MySQL and probably would observe with a lot of other database systems.
Because you never do any online updates, you can just build an immutable database in bulk and then query it for the rest of the day/week. I'd expect you'd get a lot better performance this way.
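The two simplifications above (shard by key, build once, then only read) can be sketched with plain in-memory maps. This illustrates the idea only and is not a storage engine; the shard count and function names are assumptions.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; in practice sized to the available cores/disks

def shard_of(key: str) -> int:
    # crc32 gives the same answer in every process, unlike built-in hash().
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

def build_shards(lines):
    """Bulk-build one key -> [values] map per shard from 'key value' lines."""
    shards = [defaultdict(list) for _ in range(NUM_SHARDS)]
    for line in lines:
        key, value = line.split(maxsplit=1)
        shards[shard_of(key)][key].append(value)
    # Freeze: after the bulk build, the shards are only ever read.
    return [dict(shard) for shard in shards]

def lookup(shards, key):
    return shards[shard_of(key)].get(key, [])

data = ["key1 value_a", "key1 value_b", "key1 value_c",
        "key2 value_d", "key3 value_e"]
shards = build_shards(data)
values = lookup(shards, "key1")
```

Because no lookup ever spans two shards, each shard can be built and served by a different process or machine.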
I'd be inclined to build a set of hash-sharded LevelDB tables. That is, I wouldn't use an actual leveldb::DB which supports a more complex data structure (a stack of tables and a log) so that you can do online updates; instead, I'd directly use leveldb::Table and leveldb::TableBuilder objects (no log, only one table for a given key). This is a very efficient format for querying. And if your input files are already sorted like in your example, the table building will be extremely efficient as well. You can achieve whatever parallelism you desire by increasing the number of shards - if you're using a 16-core, 16-disk machine to build the database, then use at least 16 shards, all generated in parallel. If you're using 16 16-core, 16-disk machines, at least 256 shards. If you have a lot fewer disks than cores as many people do these days, try both, but you may find fewer shards are better to avoid seeks. If you're careful, I think you can basically max out the disk throughput while building tables, and that's saying a lot as I'd expect the tables to be noticeably smaller than your input files due to the key prefix compression (and optionally Snappy block compression). You'll mostly avoid seeks because aside from a relatively small index that you can typically buffer in RAM, the keys in the leveldb tables are stored in the same order as you read them from the input files, assuming again that your input files are already sorted. If they're not, you may want enough shards that you can sort a shard in RAM then write it out, perhaps processing shards more sequentially.
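To show why already-sorted input makes table building cheap, the sketch below groups values per key in a single streaming pass, holding only one key's values in memory at a time. It uses `itertools.groupby` and is not the LevelDB API; it only illustrates the access pattern a table builder would see.

```python
from itertools import groupby

def grouped(sorted_lines):
    """Yield (key, [values]) from 'key value' lines already sorted by key."""
    pairs = (line.split(maxsplit=1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, [value for _, value in group]

lines = ["key1 value_a", "key1 value_b", "key1 value_c",
         "key2 value_d", "key3 value_e"]
result = list(grouped(lines))
```

If the input were not sorted, the same key could reappear later in the stream, which is why unsorted shards need a sort step first.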
I would suggest using SSDB (https://github.com/ideawu/ssdb), a LevelDB server that is suitable for storing collections of data.
You can store the data in maps:
ssdb->hset(key1, value1)
ssdb->hset(key1, value2)
...
list = ssdb->hscan(key1, 1000);
// now list = [value1, value2, ...]
SSDB is fast (half the speed of Redis, 30000 insertions per second). It is a network wrapper around LevelDB, with one-line installation and startup. Its clients include PHP, C++, Python, Java, Lua, ...
The traditional answer would be to use Oracle if you have the big bucks, or PostgreSQL if you don't. However, I'd suggest you also look at solutions like MongoDB, which I found to be blazing fast and which will also accommodate a scenario where your schema is not fixed and can change across your data.
Since you are already familiar with MySQL, I suggest trying all MySQL options before moving to a new system.
Many bigdata systems are tuned for very specific problems but don't fare well in areas that are taken for granted from a RDBMS. Also, most applications need regular RDBMS features alongside bigdata features. So moving to a new system may create new problems.
Also consider the software ecosystem, community support and knowledge base available around the system of your choice.
Coming back to the solution, how many rows would be there in the database? This is an important metric. I am assuming more than 100 million.
Try partitioning. It can help a lot. The fact that your select criteria are simple and you don't require joins only makes things better.
Postgres has a nice way of handling partitions. It requires more code to get up and running but gives an amazing control. Unlike MySQL, Postgres does not have a hard limit on number of partitions. Partitions in Postgres are regular tables. This gives you much more control over indexing, searching, backup, restore, parallel data access etc.
Take a look at HBase. You can store multiple values against a key, by using columns. Unlike RDBMS, you don't need to have fixed set of columns in each row, but can have arbitrary number of columns for a row. Since you query data by a key (row-key in HBase parlance), you can retrieve all the values for a given key by reading values of all the columns in that row.
HBase also has a concept of a retention period, so you can decide which columns live for how long. Hence the data can get cleaned up on its own, as needed. There are some interesting techniques people have employed to utilize the retention periods.
HBase is quite scalable, and supports very fast reads and writes.
Infobright may be a good choice.
I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites, each averaging tens of thousands to hundreds of thousands of page views a day, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with same columns, so from a database normalization standpoint they make sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
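The remark about queries being satisfied entirely from one index can be checked with a query plan. The sketch below uses Python's built-in SQLite as a runnable stand-in (MySQL's EXPLAIN reports the equivalent as "Using index"); the `pageviews` table and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pageviews (website_id INTEGER, viewed_at INTEGER, url TEXT);
    CREATE INDEX idx_site_time ON pageviews (website_id, viewed_at);
""")

# Both columns the query touches are in the index, so table rows are never read.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT viewed_at FROM pageviews WHERE website_id = ?",
    (7,),
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
```

Adding `url` to the SELECT list would force a lookup into the table itself, since that column is not part of the index.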
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain. However, if you write enough scripting you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the hard limit on the number of open files. On Linux you may have to use ulimit to raise the limit; on Windows there's no way to do it).
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT. I did this so that I could perform off-line creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE...ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine. (Described here: http://blog.coldlogic.com/categories/coldstore/)
Use the single data structure. Once you start encountering performance problems there are many solutions: you can partition your tables by website id, also known as horizontal partitioning, or you can use replication. This all depends upon the ratio of reads to writes.
But to start, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not. You can also take advantage of the various MySQL storage engines, like MyISAM or NDB (in-memory clustering), to boost performance. Caching also plays a very good role in offloading the database. Data that is mostly read-only and can be computed easily is usually put in the cache, and the cache serves the request instead of going to the database, so only the necessary queries reach the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions; you should run performance tests yourself to find out whether having one big table is sufficient.