How to structure an extremely large table - mysql

This is more of a conceptual question. It's inspired by working with an extremely large table where even a simple, properly indexed query takes a long time. I was wondering if there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records that grow by something like 10,000/day. A table like that would add another 10,000,000 records roughly every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then, when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year, then again use a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.

The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit: the data is now split across multiple conceptual tables, so any query that includes the partition key can automatically ignore any partition that the key could not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level: backup, restore, indexing, etc. This helps reduce downtime and allows far faster archiving, since an entire partition can simply be removed at a time.
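For MySQL specifically, a minimal sketch of that idea, assuming a hypothetical measurements table partitioned by year on a recorded_at column, could look like this:

    CREATE TABLE measurements (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        recorded_at DATETIME NOT NULL,
        payload     VARCHAR(255),
        -- The partitioning column must be part of every unique key, including the PK.
        PRIMARY KEY (id, recorded_at)
    )
    PARTITION BY RANGE (TO_DAYS(recorded_at)) (
        PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
        PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01'))
    );

    -- Archiving a whole year is then a quick metadata operation rather than a huge DELETE:
    ALTER TABLE measurements DROP PARTITION p2021;

A query whose WHERE clause constrains recorded_at to one year only touches that year's partition.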
There are also non-relational storage mechanisms such as NoSQL stores, map-reduce, etc., but ultimately how the data is used, loaded, and archived becomes the driving factor in deciding which structure to use.
10 million rows is not that large in the scale of large systems, partitioned systems can and will hold billions of rows.

Your second idea looks like partitioning.
I don't know how well it works, but there is support for partitioning in MySQL -- see, in its manual, Chapter 17. Partitioning
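One way to check how well it works for a particular query is EXPLAIN: in recent MySQL versions its partitions column lists which partitions the query will actually touch. The table and dates below are hypothetical, assuming yearly range partitions on a recorded_at column:

    EXPLAIN SELECT *
    FROM   measurements
    WHERE  recorded_at >= '2023-01-01' AND recorded_at < '2023-04-01';
    -- The "partitions" column should show only the 2023 partition,
    -- confirming the other years are pruned and never read.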

There is a good scalability approach for such tables. Union is a workable way, but there is a better one.
If your database engine supports "semantical partitioning", then you can split one table into partitions. Each partition covers some subrange (say, one partition per year). It does not affect anything in SQL syntax except DDL, and the engine transparently runs the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is the limit for a union. But you will never need the keyword "union" in queries.

Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view over the archived and current data combined, keeping only the active data in the table most functions reference. You will need a good (automated) archiving strategy, though, or you can lose data or move it inefficiently. This is typically more difficult to maintain.
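A minimal sketch of that setup, assuming hypothetical events and events_archive tables with identical structure:

    -- The archive table mirrors the active table's structure.
    CREATE TABLE events_archive LIKE events;

    -- A view for the occasional query that needs archived and current data combined.
    CREATE VIEW events_all AS
        SELECT * FROM events
        UNION ALL
        SELECT * FROM events_archive;

Most application queries keep hitting the small active table; only the queries that genuinely need history go through the view.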

What you're talking about is horizontal partitioning or sharding.

Related

MySQL - comparing one table with n partitions versus n tables with the same structure

I am a student and I have a question that came up while researching MySQL partitioning.
For example, I have a table "Label" with 10 partitions by hash(TaskId):
resourceId (PK)
TaskId (PK)
...
And I have 10 tables, each named "label" plus the taskId:
tables:
task1(resourceId,...)
task2(resourceId,...)
...
Could you please tell me the advantages and disadvantages of each?
Thanks
Welcome to Stack Overflow. I wish you had offered a third alternative in your question: "just one table with no partitions." That is by far, in almost all cases in the real world, the best way to handle your data. It only requires maintaining and querying one copy of each index, for example. If your data approaches billions of rows in size, it's time to consider stuff like partitions.
But never mind that. Your question was to compare ten tables against one table with ten partitions. Your ten-table approach is often called sharding your data.
First, here's what the two have in common: they both are represented by ten different tables on your storage device (ssd or disk). A query for a row of data that might be anywhere in the ten involves searching all ten, using whatever indexes or other techniques are available. Each of these ten tables consumes resources on your server: open file descriptors, RAM caches, etc.
Here are some differences:
When INSERTing a row into a partitioned table, MySQL figures out which partition to use. When you are using shards, your application must figure out which table to use and write the INSERT query for that particular table.
When querying a partitioned table for a few rows, MySQL automatically figures out from your query's WHERE conditions which partitions it must search. When you search your sharded data, on the other hand, your application must figure out which table or tables to search.
In the case you presented -- partitioning by hash on the primary key -- the only way to get MySQL to search just one partition is to search for particular values of the PK. In your case this would be WHERE resourceId = foo AND TaskId = bar. If you search based on some other criterion -- WHERE customerId = something -- MySQL must search all the partitions. That takes time. In the sharding case, your application can use its own logic to figure out which tables to search.
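To make that concrete, here is a sketch using the hash-partitioned table from the question (the payload column and the literal values are made up):

    CREATE TABLE Label (
        resourceId INT NOT NULL,
        TaskId     INT NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (resourceId, TaskId)
    )
    PARTITION BY HASH (TaskId)
    PARTITIONS 10;

    -- TaskId is given, so MySQL can hash it and search a single partition.
    SELECT * FROM Label WHERE resourceId = 42 AND TaskId = 7;

    -- TaskId is absent, so all 10 partitions must be searched.
    SELECT * FROM Label WHERE payload = 'something';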
If your system grows very large, you'll be able to move each shard to its own MySQL server running on its own hardware. Then, of course, your application will need to choose the correct server as well as the correct shard table for each access. This won't work with partitions.
With a partitioned table with an autoincrementing id value on each row inserted, each of your rows will have its own unique id no matter which partition it is in. In the sharding case, each table has its own sequence of autoincrementing ids. Rows from different tables will have duplicate ids.
The Data Definition Language (DDL: CREATE TABLE and the like) for partitioning is slightly simpler than for sharding. It's easier and less repetitive to write the DDL to add a column or an index to a partitioned table than it is to a bunch of shard tables. With the volume of data that justifies sharding or partitioning, you will need to add and modify indexes to match the needs of your application in the future.
Those are some practical differences. Pro tip: don't partition and don't shard your data unless you have really good reasons to do so.
Keep in mind that server hardware, disk hardware, and the MySQL software are under active development. If it takes several years for your data to grow very large, new hardware and new software releases may improve fast enough in the meantime that you don't have to worry too much about partitioning / sharding.

Choosing the right MySQL structure for a very large time-based dataset

I have been using MySQL for the past few months and I have a good handle on smaller database structures. Now, however, I need to decide how to create a database that can store a large set of time-oriented data in either multiple tables or a single table.
Using a single table, I have tried partitioning it into yearly segments; however, the load times and insert times are still quite long, and searching is especially slow. The data consists of roughly 8000 reporting stations with about 300-500 reports per day (several per hour). The reports go back all the way to 1980, so easily over 120 million data points and growing.
I am not sure what would provide the best results for searching such a vast amount of data, or whether it would be better to separate the data into several tables. Each report has only a couple of columns of information (time, temperature and wind).
I am sure this question has been asked many times, but any help would be appreciated.
Thank you!
120M rows is big enough to consider PARTITIONing. And that is good for time-based data if you need to delete "old" data, because DROP PARTITION is a lot faster and less invasive than DELETE.
I discuss this at length here.
Loading into a partitioned table should be only slightly slower (or faster in rare cases) than for a non-partitioned table.
Searching problems -- sounds like you did not index the table properly. Some tips:
(Usually) Put the "partition key" last in any index, if it is needed at all.
Use PARTITION BY RANGE(TO_DAYS(...)) only.
40 years? 40 partitions is reasonable.
Do not partition by station, but probably use that column at the start of some indexes.
Please show me the CREATE TABLE so I can be more specific in my tips.
If you won't be deleting 'old' rows, then partitioning is probably a waste. Let's see some of the queries.
On the other hand, if you often use a date range and several stations, then you have the "2D index problem". Partition by year; start the PRIMARY KEY with station.
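Putting those tips together, one possible sketch, with column names assumed from the question (station, dt, temperature, wind):

    CREATE TABLE reports (
        station     SMALLINT UNSIGNED NOT NULL,
        dt          DATETIME NOT NULL,
        temperature DECIMAL(4,1),
        wind        DECIMAL(4,1),
        -- Station first for station + date-range queries; the partition key (dt) last.
        PRIMARY KEY (station, dt)
    )
    PARTITION BY RANGE (TO_DAYS(dt)) (
        PARTITION p1980 VALUES LESS THAN (TO_DAYS('1981-01-01')),
        PARTITION p1981 VALUES LESS THAN (TO_DAYS('1982-01-01')),
        -- ... one partition per year up to the current year ...
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );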
Do not use multiple tables. This is a common Question on this forum, and the answer is always the same.
Quite possibly you need some sort of "summary table". It might include the high, low, and average temperature, etc., for each week. For, say, a multi-year temperature graph, this is roughly 7 times as fast. More here.
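A minimal sketch of such a summary table, with assumed names, refreshed periodically from the raw data:

    CREATE TABLE weekly_summary (
        station    SMALLINT UNSIGNED NOT NULL,
        week_start DATE NOT NULL,
        temp_min   DECIMAL(4,1),
        temp_max   DECIMAL(4,1),
        temp_avg   DECIMAL(4,1),
        PRIMARY KEY (station, week_start)
    );

    -- Example refresh for the last week's worth of raw reports.
    REPLACE INTO weekly_summary
    SELECT station,
           DATE_SUB(DATE(dt), INTERVAL WEEKDAY(dt) DAY) AS week_start,
           MIN(temperature), MAX(temperature), AVG(temperature)
    FROM   reports
    WHERE  dt >= CURDATE() - INTERVAL 7 DAY
    GROUP  BY station, week_start;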
Inserting only 37 rows/second should not be a problem, even on a slow HDD. If they come in batches, then batch the INSERTs via multiple rows per INSERT statement or via LOAD DATA.
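For the batching, either of these patterns keeps insert overhead low (the file path and values are placeholders):

    -- Many rows per INSERT statement:
    INSERT INTO reports (station, dt, temperature, wind) VALUES
        (101, '2024-06-01 00:00:00', 18.2, 3.4),
        (102, '2024-06-01 00:00:00', 17.9, 1.1),
        (103, '2024-06-01 00:00:00', 19.5, 0.0);

    -- Or bulk-load a file:
    LOAD DATA INFILE '/tmp/reports.csv'
    INTO TABLE reports
    FIELDS TERMINATED BY ','
    (station, dt, temperature, wind);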

MySQL: what if there is too much data in a table?

Data in one table is increasing every day, and it might lower performance. I was thinking I could create a trigger that moves table A into A1 and creates a new table A every so often, so that inserts or updates stay fast in table A. Is this the right way to preserve performance? If not, what should I do?
(For example, with 1000 rows inserted or updated per second in table A, what is the performance after 3 years?)
We are designing software for a factory. There are product lines on which PCB boards are made. We need to insert almost 60 PCB records per second for years. (1000 rows was an exaggeration.)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it altogether (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
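A hedged sketch of that monthly maintenance, assuming a hypothetical pcb_records table that is already PARTITION BY RANGE (TO_DAYS(made_at)) with one partition per month:

    -- Purge the oldest month almost instantly.
    ALTER TABLE pcb_records DROP PARTITION p2020_06;

    -- Add a partition for the upcoming month (possible because it is beyond the current highest range).
    ALTER TABLE pcb_records ADD PARTITION
        (PARTITION p2023_08 VALUES LESS THAN (TO_DAYS('2023-09-01')));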
Very large data volumes may decrease server performance, so here is one way to handle it:
1) Create another table to store archive data (old data) using the ARCHIVE storage engine. ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html )
2) Create a MySQL scheduled job to move older records to the archive table; schedule it for a time slot when the server is mostly idle.
3) After moving the older records to the archive table, re-index the original table.
This should take care of the performance problem.
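A rough sketch of steps 1) and 2), with the table layout, names, and the one-year cutoff all assumed:

    -- 1) Archive table using the ARCHIVE engine (compressed; supports INSERT and SELECT only).
    CREATE TABLE orders_archive (
        id         BIGINT UNSIGNED NOT NULL,
        created_at DATETIME NOT NULL,
        payload    VARCHAR(255)
    ) ENGINE=ARCHIVE;

    -- 2) The move itself, to be run from a scheduled event or cron job during idle hours.
    INSERT INTO orders_archive
        SELECT id, created_at, payload FROM orders
        WHERE  created_at < NOW() - INTERVAL 1 YEAR;
    DELETE FROM orders
        WHERE  created_at < NOW() - INTERVAL 1 YEAR;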
It is unlikely that 1000-row tables perform so poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not that would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does add MySQL overhead, especially at row insert time. Assuming there are more reads than inserts, this is an advantage, because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, and delete old ones if and when you decide they are no longer relevant or you need the disk space. It's even safer to do it that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

Django Best way to store price history of millions of products?

I am running a web scraping spider that scrapes nearly 1 million products on a daily basis.
I am considering 2 approaches:
1) Store all products' price history in one table:
product_id, date, price
But this would yield many millions of records in this table.
2) Store the data in multiple tables and make a separate table for each product:
Table1: product_id, current_price
Table_product_id: date, price
Table_product_id: date, price
Table_product_id: date, price
But I will have nearly 1 million tables!
From the theoretical point of view, you should use the same schema to store instances of the same entity (e.g., your Product type). According to that, solution 1 should be preferred.
In the real world, high data cardinality can be an issue. MongoDB, for example, uses sharding to manage very large datasets. PostgreSQL allows partitioning. From the PostgreSQL docs:
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
As they mention, it depends on your specific use case. The last sentence could be the criterion for making your choice.
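If you go with solution 1, a minimal sketch of the single-table schema (names and types assumed), with a primary key that keeps each product's history together:

    CREATE TABLE price_history (
        product_id INT UNSIGNED NOT NULL,
        price_date DATE NOT NULL,
        price      DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (product_id, price_date)
    );

    -- Typical query: the recent history of one product, served by the primary key alone.
    SELECT price_date, price
    FROM   price_history
    WHERE  product_id = 12345
    ORDER  BY price_date DESC
    LIMIT  30;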

Is this MySql table a good candidate for partitioning?

I have a table with ~1.9 million rows and growing consistently. I run some fairly complicated queries against this data. The active data is generally clustered toward the end of the table -- that is, only the most recent n% of the records tend to be accessed on a regular basis, although the rest of the data needs to be available in the same table for the less usual cases that people look back at the older records.
For those with partitioning experience in MySQL, does this table seem like it would be a good candidate for partitioning? Or is it just too small to get much gain?
Thanks,
Jared
p.s. I looked for a question on stackoverflow to answer this question, but didn't find anything that quite fit.
Check out this article... He shows significant gains on a table with only 3 columns and 800K records. As long as you're partitioning on a column that produces either an integer or NULL, you should see some great performance improvements. I have loved the speed gains from date-based partitioning, even with significantly fewer records but more columns.
Improving Database Performance with Partitioning
Logically, yes, if you typically run queries that need only the most recent 2% of the table, this would be a great candidate for partitioning.
The biggest barrier to using MySQL partitioning is that the column you use for the partitioning key must be part of the primary key and any other unique keys. This practically rules out partitioning for some tables.
If this blocks you from partitioning the table, the fallback plan is to partition "manually." That is, make two real tables with identical structure. Every week (or whatever schedule you want), run a batch job to migrate the older data to the second table. You can always make a VIEW which is a UNION of the two tables, in case you need to run occasional table-scans.
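A sketch of that weekly batch job, assuming hypothetical records / records_archive tables and a one-year cutoff:

    -- The second table mirrors the first's structure and holds the older rows.
    CREATE TABLE records_archive LIKE records;

    -- Weekly job: move everything older than the cutoff, inside one transaction.
    START TRANSACTION;
    INSERT INTO records_archive
        SELECT * FROM records WHERE created_at < CURDATE() - INTERVAL 1 YEAR;
    DELETE FROM records
        WHERE created_at < CURDATE() - INTERVAL 1 YEAR;
    COMMIT;

The occasional full-history query can then go through a UNION ALL view over the two tables, as mentioned above.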
Your table size should be greater than 5 GB before partitioning is worthwhile.
You should go for RANGE partitioning (monthly or yearly).