Django Best way to store price history of millions of products? - mysql

I am running a web scraping spider that scrapes nearly 1 million products on a daily basis.
I am considering 2 approaches:
1) store all products prices history in one table
product_id, date, price
but this would yield a multi million records in this table.
2) store data in multiple tables & make separate table for each product.
Table1: product_id, current_price
Table_product_id: date, price
Table_product_id: date, price
Table_product_id: date, price
But I will have nearly 1 million tables!

From the theoretical point of view, you should use the same schema to store instances of the same entity (e.g., your Product type). According to that, solution 1 should be preferred.
In the real world, high data cardinalities could be an issue. MongoDB, for example, use sharding for managing very large datasets. PostgreSQL allows partitioning. From the PostgreSQL's doc:
Partitioning refers to splitting what is logically one large table
into smaller physical pieces. Partitioning can provide several
benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of
the table are in a single partition or a small number of
partitions. The partitioning substitutes for leading columns of
indexes, reducing index size and making it more likely that the
heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of
sequential scan of that partition instead of using an index and
random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning
design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster
than a bulk operation. These commands also entirely avoid the VACUUM
overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
The benefits will normally be worthwhile only when a table would
otherwise be very large. The exact point at which a table will
benefit from partitioning depends on the application, although a rule
of thumb is that the size of the table should exceed the physical
memory of the database server.
As they mentioned, it depends on you specific use case. The last sentence could be the criterion to make your choice.

Related

MYSQL - compare between 1 table have n partition and n table have same struct

I am a student and I have a question when I research about mysql partition.
Example I have a table "Label" with 10 partitions by hash(TaskId)
resourceId (PK)
TaskId (PK)
...
And I have 10 table with name table is "label": + taskId:
tables:
task1(resourceId,...)
task2(resourceId,...)
...
Could you please tell me about advantages and disadvantages between them?
Thanks
Welcome to Stack Overflow. I wish you had offered a third alternative in your question: "just one table with no partitions." That is by far, in almost all cases in the real world, the best way to handle your data. It only requires maintaining and querying one copy of each index, for example. If your data approaches billions of rows in size, it's time to consider stuff like partitions.
But never mind that. Your question was to compare ten tables against one table with ten partitions. Your ten-table approach is often called sharding your data.
First, here's what the two have in common: they both are represented by ten different tables on your storage device (ssd or disk). A query for a row of data that might be anywhere in the ten involves searching all ten, using whatever indexes or other techniques are available. Each of these ten tables consumes resources on your server: open file descriptors, RAM caches, etc.
Here are some differences:
When INSERTing a row into a partitioned table, MySQL figures out which partition to use. When you are using shards, your application must figure out which table to use and write the INSERT query for that particular table.
When querying a partitioned table for a few rows, MySQL automatically figures out from your query's WHERE conditions which partitions it must search. When you search your sharded data, on the other hand, your application much figure out which table or tables to search.
In the case you presented --partitioning by hash on the primary key -- the only way to get MySQL to search just one partition is to search for particular values of the PK. In your case this would be WHERE resourceId = foo AND TaskId = bar. If you search based on some other criterion -- WHERE customerId = something -- MySQL must search all the partitions. That takes time. In the sharding case, your application can use its own logic to figure out which tables to search.
If your system grows very large, you'll be able to move each shard to its own MySQL server running on its own hardware. Then, of course, your application will need to choose the correct server as well as the correct shard table for each access. This won't work with partitions.
With a partitioned table with an autoincrementing id value on each row inserted, each of your rows will have its own unique id no matter which partition it is in. In the sharding case, each table has its own sequence of autoincrementing ids. Rows from different tables will have duplicate ids.
The Data Definition Language (DDL: CREATE TABLE and the like) for partitioning is slightly simpler than for sharding. It's easier and less repetitive to write the DDL add a column or an index to a partitioned table than it is to a bunch of shard tables. With the volume of data that justifies sharding or partitioning, you will need to add and modify indexes to match the needs of your application in future.
Those are some practical differences. Pro tip don't partition and don't shard your data unless you have really good reasons to do so.
Keep in mind that server hardware, disk hardware, and the MySQL software are under active development. If it takes several years for your data to grow very large, new hardware and new software releases may improve fast enough in the meantime that you don't have to worry too much about partitioning / sharding.

Mysql what if too much data in a table

Data is increasing in one table everyday, it might lower the performance . I was thinking if I can create a trigger which move table A into A1 and create a new table A every a period of time, so that insert or update could be faster in table A. Is this the right way to save performance ? If not, what should I do ?
(for example, insert or update 1000 rows per second in table A, how is the performance after 3 years ?)
We are designing softwares for a factory. There are product lines which pcb boards are made on. We need to insert almost 60 pcb records per second for years. (1000 rows seem to be exaggerated)
First, you are talking about several terabytes for a single table. Is your disk that big? Yes, MySQL can handle that big a table.
Will it slow down? It depends on
The indexes. If you have 'random' indexes, the INSERTs will slow down to about 1 insert per disk hit. On a spinning HDD, that is only about 100 per second. SSD might be able to handle 1000/sec. Please provide SHOW CREATE TABLE.
Does the table have an AUTO_INCREMENT? If so, it needs to be BIGINT, not INT. But, if possible, get rid of it all together (to save space). Again, let's see the SHOW.
"Point" queries (load one row via an index) are mostly unaffected by the size of the table. They will be about twice as slow in a trillion-row table as in a million-row table. A point query will take milliseconds or tens of milliseconds; no big deal.
A table scan will take hours or days; hopefully you are not doing that.
A billion-row scan of part of the table will take days or weeks unless you are using the PRIMARY KEY or have a "covering" index. Let's see the queries and the SHOW.
The best technique is not to store the data. Summarize it as it arrives, save the summaries, then toss the raw data. (OK, you might store the raw in a csv file just in case you need to build a new summary table or fix a bug in an existing one.)
Having a few summary tables instead of the raw data would shrink the data to under 1TB and allow the relevant queries to run 10 times as fast. (OK, point queries would be only slightly faster.)
PARTITIONing (or otherwise splitting up the table)? It depends. Let's see the queries and the SHOW. In many situations, PARTITIONing does not speed up anything.
Will you be deleting or modifying existing rows? I hope not. That adds more dimensions of problems. If, on the other hand, you need to purge 'old' data, then that is an excellent use for PARTITIONing. For 3 years' worth of data, I would PARTITION BY RANGE(TO_DAYS(..)) and have monthly partitions. Then a monthly DROP PARTITION would be very fast.
Very Huge data may decrease the performance of server, So there is a way to handle this :
1) you have to create another table to store archive data ( old data ) using Archive storage mechanism . ( https://dev.mysql.com/doc/refman/8.0/en/archive-storage-engine.html )
2) create MySQL job/scheduler to move older records to archive table. schedule in timeslot
when server is maximum idle.
3) after moving older records to archive table, re-index the original table.
this will serve the purpose of performance.
It is unlikely that 1000 row tables perform sufficiently poorly that doing a table copy every once in a while is an overall net gain. And anyway, what would the new table have that the old one did not which would improve performance?
The key to having tables perform efficiently is intelligent table design and management of indexes. That is how zillion row tables are effective in geospatial work, library catalogs, astronomy, and how internet search engines find useful data, etc.
Each index defined does cause more mysql impact especially at row insert time. Assuming there are more reads than inserts, this is an advantage because most queries are rapidly completed thanks to a suitable index.
Indexes are best defined with a thorough understanding of the queries made against the table—both in quality and quantity. And, if there is any tendency for the nature of the queries to trend over months or years, then the indexes would need additions, modifications, or—yes—even deletions.
It seems to me there is something inherently wrong with the way you are using MySQL to begin with.
A database system is supposed to manage data that is required by your application in order for it to work. If you think flushing the table every so often is something acceptable, then that doesn't seem to be the case.
Perhaps you are better off just using log files. Split them by date, delete old ones if and when you decide they are no longer relevant or need the disk space. It's even safer to do that way from a recovery perspective.
If you need a better suggestion, then improve your question to include exactly what you are trying to accomplish so we can help you with it.

MySQL Performance of one vs. many tables

I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heart of some projects where tables of that size would become less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split rows. Main question: However, would this not solve the issue of having reduced performance due to such a large bunch of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself is not diminishable.
I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes.
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.

Should frequently accessed tables containing large blobs with one-to-one relationships be normalized and columns split into two tables?

I have a frequently accessed table containing 3 columns of blobs, and 4 columns of extra data that is not used in the query, but just sent as result to PHP. There are 6 small columns (big int, small int, tiny int, medium int, medium int, medium int) that are used in the queries in the WHERE/ORDER BY/GROUP BY.
The server has very low memory, around 1GBs, and so the cache is not enough to improve the performance one on the large table. I've indexed the last 6 small columns, but it doesn't seem to be helping.
Would it be a good solution to split this large table into two?
One table containing the last 6 columns, and the other containing the blobs and extra data, and link it to the previous table with a foreign key that has a one to one relationship?
I'll then run the queries on the small table, and join the little number of rows remaining after filtering to the table with the blobs and extra data to return them to PHP.
Please note, I've already done this, and I managed to decrease the query time from 1.2-1.4 seconds to 0.1-0.2 seconds. However I'm not sure if the solution I've tried is considered good practice, or is even advisable at all?
What you have implemented is sometimes called "vertical partitioning". If you take it to the extreme, then it is the basis for columnar databases, such as Vertica.
As you have observed, such partitioning can dramatically increase query performance. One reason is that less data needs to be read for processing a row of data.
The downside is for updates, inserts, and deletes. With all the data in a single row, these operations are basically atomic -- that is, the operation only affects one row in a data page. (This is not strictly true with blobs, because these are split among multiple pages.)
When you split the data among multiple tables, then you need to coordinate these operations among the tables, so you don't end up with "partial" rows of data.
For a database being used with bulk inserts and lots of querying, this is not a particularly important consideration. Your splitting of separate columns of the data into separate tables is a reasonable approach for improving performance.

How to structure an extremely large table

This is more a conceptual question. It's inspired from using some extremely large table where even a simple query takes a long time (properly indexed). I was wondering is there is a better structure then just letting the table grow, continually.
By large I mean 10,000,000+ records that grows every day by something like 10,000/day. A table like that would hit 10,000,000 additional records every 2.7 years. Lets say that more recent records are accesses the most but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and lets say the query is expected to pull only a few records from a three year span, I could use a union to combine the three views and select from those.
2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)
There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.
From a RDBMS management perspective, having the data divided into seperate partitions allows operations to be performed at a partition level, backup / restore / indexing etc. This helps reduce downtimes as well as allow for far faster archiving by just removing an entire partition at a time.
There are also non relational storage mechanisms such as nosql, map reduce etc, but ultimately how it is used, loaded and data is archived become a driving factor in the decision of the structure to use.
10 million rows is not that large in the scale of large systems, partitioned systems can and will hold billions of rows.
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partition in MySQL -- see, in its manual : Chapter 17. Partitioning
There is good scalability approach for this tables. Union is right way, but there is better way.
If your database engine supports "semantical partitioning", then you can split one table into partitions. Each partition will cover some subrange (say 1 partition per year). It will not affect anything in SQL syntax, except DDL. And engine will transparently run hidden union logic and partitioned index scans with all parallel hardware it has (CPU, I/O, storage).
For example Sybase allows up to 255 partitions, as it is limit of union. But you will never need keyword "union" in queries.
Often the best plan is to have one table and then use database partioning.
Or you can archive data and create a view for the archived and combined data and keep only the active data in the table most functions are referencing. You will have to have a good archiving stategy though (which is automated) or you can lose data or not get things done efficiently in moving them. This is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.