I think I'd better ask this question instead of guessing around without any experiment.
We are planning to add a new column as code
The code needs to have the following features:
It has to be unique.
Better to be a string, it's much easier for us to migrate data
Has to be random with enough space to avoid collision.
I am planning to just use UUID.
create table code(
id char(36),
unique index index1 (id)
) type=innodb;
Our operation behavior:
insert new code (at most 20K every day)
get row by code (very heavily, we may need to get every row in the database in limited time like 10 minutes).
Now I am worry about performance a little bit. We already have 400K row in our database. In the future it could grow to 10M or 30M.
Do you have any suggestion or see any problem?
BTW: I am not able to use auto incremented int because it's not randomized.
Go ahead. You won't get any problems neither with mysql nor with UUID.
UUIDs generated randomly have enough states that there will be for certain no collision. (Indeed it's still a chance but its 1 over 10^31 in case of 30M entries.)
On the other Hand: Why bother with uuid (which divides your random in 5 groups with no sense at all) when you can as easily use SecureRandom to just generate 16 byte values and use them?
According to this answer: How to store uuid as number? it is faster to store them in binary in mysq.
You want to read this, too: UUID performance in MySQL?
When using mysql you should think about concepts for backup. Mysql will handle big databases quiet easily, but exporting/importing 30M rows can take some time.
For mysql row limits see this question: How many rows in a database are TOO MANY?
I would suggest not to use MySQL for these cases (at least not the main storage that you actively search from).
There are number of different other technologies such as:
Apache Lucene
Apache Solr
Elastic Search
and more others...
These tools are build for fast searches on big data sets. It the above look like overkill you might simply use one of the NoSQL dbs, it will give you much more performance in your case.
There are huge number of articles comparing performance and limitation between all of these.
Related
At the moment i do have a mysql database, and the data iam collecting is 5 Terrabyte a year. I will save my data all the time, i dont think i want to delete something very early.
I ask myself if i should use a distributed database because my data will grow every year. And after 5 years i will have 25 Terrabyte without index. (just calculated the raw data i save every day)
i have 5 tables and the most queries are joins over multiple tables.
And i need to access mostly 1-2 columns over many rows at a specific timestamp.
Would a distributed database be a prefered database than only a single mysql database?
Paritioning will be difficult, because all my tables are really high connected.
I know it depends on the queries and on the database table design and i can also have a distributed mysql database.
i just want to know when i should think about a distributed database.
Would this be a use case? or could mysql handle this large dataset?
EDIT:
in average i will have 1500 clients writing data per second, they affect all tables.
i just need the old dataset for analytics. Like machine learning and
pattern matching.
also a client should be able to see the historical data
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiple by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-a-kind-a out of the box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.
I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points(~4000) a couple times every minute and stores them in my database. I need to be able to down sample based on the time stamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd be like to do would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is this seems super inefficient with respect to memory. Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I thought of that reduce the memory footprint of my database requires me to either use separate tables (which to my understanding is super bad practice) or storing the data in CSV files which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and probably search substantially slower). Is there a database structure that I could implement that doesn't require me to store so much duplicate data?
A database on with your data structure is going to be less efficient than custom code. Guess what. That is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here). So, a record would have, say 4+4+8=16 bytes (assuming a double floating point representation for value). By removing the timestamp you would get 12 bytes -- savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 Mbytes/day and 138 Mbytes/day, or 67 Gbytes/year and 50 Gbytes. You know, you are going to have to deal with biggish data issues regardless of how you store the timestamp.
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me.
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
...use separate tables (which to my understanding is super bad practice)
Who told you that!!! Normalising data (splitting into separate, linked data structures) is actually a good practise - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would perfectly fine to create a "time" table and link to the data in the other table. It would use a little more memory, but that really shouldn't concern you unless you are working in a very limited memory environment.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a big big table the size of table is in GB's around 130 GB. Every day data is dumped in the table.
I'd like to optimize the table... Can anyone suggest me how I should go about it?
Any input will be a great help.
It depends how you are trying to optimize it.
For querying speed, appropriate indexes including multi-column indexes would be a very good place to start. Do explains on all your queries to see what is taking up so much time. Optimize the code that's reading the data to store it instead of requerying.
If old data is less important or you're getting too much data to handle, you can rotate tables by year, month, week, or day. That way the data writing is always to a pretty minimal table. The older tables are all dated (ie tablefoo_2011_04) so that you have a backlog.
If you are trying to optimize size in the same table, make sure you are using appropriate types. If you get variable length strings, use a varchar instead of statically sized data. Don't use strings for status indicators, use an enum or int with a secondary lookup table.
The server should have a lot of ram so that it's not going to disk all the time.
You can also look at using a caching layer such as memcached.
More information about what the actual problem is, your situation, and what you are trying to optimize for would be helpful.
If your table is a sort of logging table, there can be several strategy for optimizing.
(1) Store essential data only.
If there are not necessary - nullable - columns in it and they does not be used for aggregation or analytics, store them into other table. Keep the main table smaller.
Ex) Don't store raw HTTP_USER_AGENT string. Preprocessing the agent string and store smaller data what you exactly want to review.
(2) Make the table as fixed format.
Use CHAR then VARCHAR for almost-fixed-length strings. This will be helpful for sped up SELECT queries.
Ex) ip VARCHAR(15) => ip CHAR(15)
(3) Summarize old data and dump them into other table periodically.
If you don't have to review the whole data everyday, divide it into periodically table (year/month/day) and store summarize data for old ones.
Ex) Table_2011_11 / Table_2011_11_28
(4) Don't use too many indexes for big table.
Too many indexes cause heavy load for inserting queries.
(5) Use ARCHIVE engine.
MySQL supports ARCHIVE ENGINE. This engine supports zlib for data compression.
http://dev.mysql.com/doc/refman/5.0/en/archive-storage-engine.html
It fits for logging generally(AFAIK), lack of ORDER BY, REPLACE, DELETE and UPDATE are not a big problem for logging.
You should show us what your SHOW CREATE TABLE tablename outputs so we can see the columns, indexes and so on.
From the glimpse of everything, it seems MySQL's partitioning is what you need to implement in order to increase performance further.
A few possible strategies.
If the dataset is so large, it may be of use to store certain information redundantly: keeping cache tables if certain records are accessed much more frequently than others, denormalize information (either to limit the number of joins or creating tables with less columns so you have a lean table to keep in memory at all times), or keeping summaries for the fast lookup of totals.
The summaries-table(s) can be kept in synch by either periodically generating them or by the use of triggers, or even combining both by having a cache table for the latest day on which you can calculate actual totals, and summaries for the historical data... will give you full precision while not requiring to read the full index. Test to see what delivers best performance in your situation.
Splitting your table by periods is certainly an option. It's like partitioning, but Mayflower Blog advises to do it yourself as the MySQL implementation seems to have certain limitations.
Additionally to this: if the data in those historical tables is never changed and you want to reduce space, you could use myisampack. Indexes are supported (you have to rebuild) and performance gain is reported, but I suspect you would gain speed on reading individual rows but face decreasing performance on large reads (as lots of rows need unpacking).
And last: you could think about what you need from the historical data. Does it need the exact same information you have for more recent entries, or are there things that just aren't important anymore? I could imagine if you have an access log, for example, that it stores all sorts of information like ip, referal url, requested url, user agent... Perhaps in 5 years time the user agent isn't interesting at all to know, it's fine to combine all requests from one ip for one page + css + javascript + images into one entry (perhaps have a different many-to-one table for the precise files), and the referal urls only need number of occurances and can be decoupled from exact time or ip.
Don't forget to consider the speed of the medium on which the data is stored. I think you can use raid disks to speed up access or maybe store the table in RAM but at 130GB that might be a challenge! Then consider the processor too. I realise this isn't a direct answer to your question but it may help achieve your aims.
You can still try to do partitioning using tablespaces or "table-per-period" structure as #Evan advised.
If your fulltext searching is failing may be should go to Sphinx/Lucene/Solr. External search engines can definitely help you to get faster.
If we are talking about table structure than you should use the smallest datatype if it possible.
If optimize table is too slow and it's true for the really big tables you can backup this table and restore it. Off course in this case you will need to get some downtime.
As bottom line:
if your issue concerning fulltext searching than before applying any table changes try to use external search engines.
I'm interested to build a huge database (100s millions of records) using MySQL, to contain stock data in 1-min intervals. The database will contain data for 5000 stocks say for 10 years.
Two concerns:
(1) In the past, I had a problem of "slow insertions" -- meaning, at the beginning the rate of insertions was good, but as the table was filling up with millions of records, the insertion became slow (too slow!). At that time I used Windows and now I use Linux -- should it make a difference?
(2) I am aware of indexing techniques that will help queries (data retrievals) be faster. The thing is, is there a way to speed-up insertions? I know one can turn off indexing while inserting, but then 'building' the indexes post insertion (for 10s of millions of records!) also takes tons of time. any advice on that?
Any other Do's / Don'ts? Thanks in advance for any help.
It depends on what type of index you need and how you generate data. If you are satisfied with single index on time, just stick to that and when you generate data, keep on inserting in ascending order (with respect to the insert time for which you have the index). That way, the reordering required is minimal during insertion. Also, consider partitioning to optimize your queries. It can give you dramatic improvements in performance. Using auto-increment column can help for fast indexing, but then you won't have the index on time if auto-increment column is the only index. Make sure you use innodb storage engine for good performance. If you properly tune your database engine on Linux and keep the design simple, it will smoothly scale without much issues. I think the huge data requirements you talk about is not as difficult to build as it might seem first. However, if you are planning to run aggregate queries (with joins of tables), then that is more challenging.
You could always keep your data in a table with no indexes and then use Lucene (or similar) to index the data. This will keep inserts fast and allow you to query Lucene for fast data retrieval.
Consider using an SSD drive (or array) to store your data, especially if you can't afford to create a box with gigs of memory. Everything about it should be faster.
Daily 20-25 million rows that will be removed at midnight for next days data. Can mySQL handle 25 million indexed rows? What would be another good solution?
You give very little information on the context but sometimes not using a database and instead a binary/plain text file is just fine and can -- depending on your requirements -- be much more efficient and maintainable. e.g if it's sensor data storing it in a binary file with each record at a known offset could be a good solution. You saying that the data would be deleted every 24h seems to indicate that you might not need some the properties of a relational database solution such as ACID, replication, integrated backup and so on, so perhaps a flat file approach is just fine?
Our MySQL database has over 300 million rows indexed and we only ever experience problems with complex joins running a little slow - most can be optimized though.
Handling the rows was no problem - the key to our performance was good indexes.
Considering you are dropping the information at midnight, i would also look at MySQL partitioning which would allow you to drop that part of the table whilst allowing the next day to continue inserting if need be.
The issue is not the number of rows itself -- it's what you do with the database. Are you doing only inserts during the day followed by some batch report? Or, are you doing thousands of queries per second on the data? Inserts/Updates/Deletes? If you slam enough load at any database platform, you can max it out with a single table and a single row (taking it to the most extreme). I used MySQL 4.1 w/ MyISAM (hardly the most modern of anything) on a site with a 40M row user table. It did < 5ms queries, I think. We were rendering pages in less than 200ms. However, we had lots and lots of caching set up, so the number of queries wasn't too high. And, we were doing simple statements like SELECT * FROM USER WHERE USER_NAME = 'SMITH'
Can you comment more on your use case?
If you are using Windows, you could do worse than use SqlExpress 2008, which should easily handle that load, depending on how many indexes you are creating on it. So long as you keep < 4GB total db size, it shouldn't be a problem.
From my experience, mySQL tends to not scale well at all. If you must have a free solution for this I would highly recommend postgreSQL.
Also (this may or may not be an issue for you), but keep in mind that if you're dealing with that much data, the maximum size of a mySQL database is 4 terabytes, if I remember correctly.
I don't think there is a practical limit on the max number of rows in mySQL, so if you MUST use mySQL, I think it would work for what you want to do, but personally for a production system I wouldn't recommend it.
As a general solution I'd recommend PostgreSQL too, but depending on your specific needs, other solutions might be better/faster. For example, if you do not need to query your data while it is being written, TokyoCabinet (the table based API / TDB) might be faster and more lightweight/robust.
I haven't looked into them in mysql, but this sounds like a perfect application for table partitions
use only as an index database and store it in the form of file approach would be more effective because you will remove within 24 hours and the process will be faster also not burden your server