Should frequently accessed tables containing large blobs with one-to-one relationships be normalized and columns split into two tables? - mysql

I have a frequently accessed table containing 3 blob columns and 4 columns of extra data that are not used in the queries but are just returned to PHP as part of the result. There are 6 small columns (bigint, smallint, tinyint, mediumint, mediumint, mediumint) that are used in the WHERE/ORDER BY/GROUP BY clauses.
The server has very low memory, around 1 GB, so the cache is not enough to improve performance on the large table. I've indexed the 6 small columns, but it doesn't seem to be helping.
Would it be a good solution to split this large table into two?
One table containing the 6 small columns, and the other containing the blobs and extra data, linked to the first table with a foreign key in a one-to-one relationship?
I'd then run the queries on the small table, and join the small number of rows remaining after filtering to the table with the blobs and extra data in order to return them to PHP.
Please note, I've already done this, and I managed to bring the query time down from 1.2-1.4 seconds to 0.1-0.2 seconds. However, I'm not sure whether the solution I've tried is considered good practice, or is even advisable at all.
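For reference, here is a rough sketch of the split I tried (table and column names here are just placeholders, not my real schema):

    -- Narrow "hot" table: only the small columns used in WHERE/ORDER BY/GROUP BY
    CREATE TABLE items_hot (
        id     BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        col_a  SMALLINT NOT NULL,
        col_b  TINYINT NOT NULL,
        col_c  MEDIUMINT NOT NULL,
        col_d  MEDIUMINT NOT NULL,
        col_e  MEDIUMINT NOT NULL,
        KEY idx_filter (col_a, col_b, col_c)
    ) ENGINE=InnoDB;

    -- Wide "payload" table: blobs and extra data, 1:1 with items_hot
    CREATE TABLE items_payload (
        id      BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        blob_1  MEDIUMBLOB,
        blob_2  MEDIUMBLOB,
        blob_3  MEDIUMBLOB,
        extra_1 VARCHAR(255),
        extra_2 VARCHAR(255),
        CONSTRAINT fk_payload_hot FOREIGN KEY (id) REFERENCES items_hot (id)
    ) ENGINE=InnoDB;

    -- Filter/sort on the narrow table first, then join the few surviving rows
    SELECT h.id, p.blob_1, p.blob_2, p.blob_3, p.extra_1, p.extra_2
    FROM items_hot AS h
    JOIN items_payload AS p ON p.id = h.id
    WHERE h.col_a = 42 AND h.col_b > 1
    ORDER BY h.col_c
    LIMIT 50;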

What you have implemented is sometimes called "vertical partitioning". If you take it to the extreme, then it is the basis for columnar databases, such as Vertica.
As you have observed, such partitioning can dramatically increase query performance. One reason is that less data needs to be read for processing a row of data.
The downside is for updates, inserts, and deletes. With all the data in a single row, these operations are basically atomic -- that is, the operation only affects one row in a data page. (This is not strictly true with blobs, because these are split among multiple pages.)
When you split the data among multiple tables, then you need to coordinate these operations among the tables, so you don't end up with "partial" rows of data.
For a database used with bulk inserts and lots of querying, this is not a particularly important consideration. Splitting the columns into separate tables the way you did is a reasonable approach for improving performance.
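If you keep the two tables, you can avoid "partial" rows on the write side by wrapping the paired INSERTs (and likewise DELETEs) in a transaction. A minimal sketch, reusing the placeholder names from the sketch in your question:

    -- Either both halves of the logical row are committed, or neither is
    START TRANSACTION;

    INSERT INTO items_hot (id, col_a, col_b, col_c, col_d, col_e)
    VALUES (123, 42, 2, 3, 4, 5);

    INSERT INTO items_payload (id, blob_1, blob_2, blob_3, extra_1, extra_2)
    VALUES (123, NULL, NULL, NULL, 'extra', 'data');

    COMMIT;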

Related

Best database design for efficient analysis on some millions of records

I have a basic question about database design.
I have a lot of files that I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserting to increase the insert speed (in the end I have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and its performance is important to me.
Now that I have each line's fields parsed and ready to insert, I see 3 approaches:
1- One big table:
In this case I can create one big table with 30 columns and store all of the files' fields in it. So there is one huge table that I want to run a lot of analysis on.
2- A fairly large table (A) and some little tables (B):
In this case I can create some little tables consisting of the columns whose records are largely identical once separated from the other columns. These little tables then have only a few hundred or thousand records instead of 30 million. In the fairly large table (A), I omit the columns that I moved to the other tables and use a foreign key instead. In the end I have a table (A) with 20 columns and 30 million records, plus some tables (B) with 2-3 columns and 100-50,000 records each. To analyse table A, I then have to use joins, for example in SELECTs and so on (a rough sketch is at the end of this question).
3- Just a fairly large table:
In this case I can create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I use a mapping between source values and destination values (something like a foreign key, but with a small difference). For example, I have 3 columns c1, c2, c3 that in case 2 I would put in another table B and reach through a foreign key; now I instead assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the mapping between each record and its assigned value in the application code. So this table is just like table A in case 2, but there is no need to use a join in the SELECTs and so on.
While the insert time is important, the analysis time matters even more to me, so I'd like to know your opinion on which of these cases is better, and I'd also be glad to see other solutions.
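To make approach 2 concrete, here is a rough sketch of what I have in mind (all table and column names are placeholders):

    -- Little table (B): only a few hundred to 50,000 distinct combinations
    CREATE TABLE dim_combo (
        dim_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        c1      SMALLINT NOT NULL,
        c2      INT NOT NULL,
        c3      VARCHAR(64) NOT NULL,
        UNIQUE KEY uq_combo (c1, c2, c3)
    );

    -- Fairly large table (A): ~30-40 million rows, pointing at the little table
    CREATE TABLE facts (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        dim_id  INT UNSIGNED NOT NULL,
        -- ... roughly 20 other columns ...
        FOREIGN KEY (dim_id) REFERENCES dim_combo (dim_id)
    );

    -- Analysis queries then need a join to get c1/c2/c3 back
    SELECT d.c1, COUNT(*)
    FROM facts AS f
    JOIN dim_combo AS d ON d.dim_id = f.dim_id
    GROUP BY d.c1;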
From a design perspective, 30 to 40 million is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table on a separate database filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire dataset in one shot, you could consider horizontally partitioning the data. You could use a partition scheme based on your needs; one option is to split the data into yearly partitions, which will help if your analysis is limited to a year's worth of data (just an example).
The major thing would be denormalization/normalization based on your needs, and of course nonclustered/clustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For grouping, summing, etc., the question is about SELECTs. For "analytics", the question is not one of table structure: have a single table for the raw data, plus one or more "Summary tables" to make the SELECTs really fast for your typical queries.
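As an illustration only (the table and column names below are invented, since you haven't shown your schema), a summary table might be maintained like this:

    -- One summary row per (day, group) instead of thousands of raw rows
    CREATE TABLE daily_summary (
        the_date   DATE NOT NULL,
        group_key  INT NOT NULL,
        row_count  INT UNSIGNED NOT NULL,
        total_val  DECIMAL(18,2) NOT NULL,
        PRIMARY KEY (the_date, group_key)
    );

    -- Refresh incrementally after each bulk load of raw_data
    INSERT INTO daily_summary (the_date, group_key, row_count, total_val)
    SELECT DATE(created_at), group_key, COUNT(*), SUM(val)
    FROM raw_data
    WHERE created_at >= CURDATE() - INTERVAL 1 DAY
    GROUP BY DATE(created_at), group_key
    ON DUPLICATE KEY UPDATE
        row_count = VALUES(row_count),
        total_val = VALUES(total_val);

    -- Typical analytic queries hit the small summary table, not the 40M-row table
    SELECT the_date, SUM(total_val)
    FROM daily_summary
    GROUP BY the_date;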
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.

Django: Best way to store the price history of millions of products?

I am running a web scraping spider that scrapes nearly 1 million products on a daily basis.
I am considering 2 approaches:
1) Store all products' price history in one table:
product_id, date, price
But this would put many millions of records in this table.
2) Store the data in multiple tables and make a separate table for each product:
Table1: product_id, current_price
Table_product_id: date, price
Table_product_id: date, price
Table_product_id: date, price
But I will have nearly 1 million tables!
From the theoretical point of view, you should use the same schema to store instances of the same entity (e.g., your Product type). According to that, solution 1 should be preferred.
In the real world, high data cardinality can be an issue. MongoDB, for example, uses sharding to manage very large datasets. PostgreSQL allows partitioning. From the PostgreSQL docs:
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
- Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
- When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
- Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
- Seldom-used data can be migrated to cheaper and slower storage media.
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
As the docs mention, it depends on your specific use case. The last sentence could be the criterion for making your choice.
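For example, if you end up on PostgreSQL 11 or later, a declaratively partitioned version of option 1 could look roughly like this (a sketch; all names are placeholders):

    -- One logical table for every product's price history, split by date range
    CREATE TABLE price_history (
        product_id  bigint        NOT NULL,
        price_date  date          NOT NULL,
        price       numeric(12,2) NOT NULL
    ) PARTITION BY RANGE (price_date);

    CREATE TABLE price_history_2024 PARTITION OF price_history
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
    CREATE TABLE price_history_2025 PARTITION OF price_history
        FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

    -- Partitioned index; lookups for one product touch only the relevant partitions
    CREATE INDEX ON price_history (product_id, price_date);

    -- A whole year of history can later be detached or dropped cheaply
    -- ALTER TABLE price_history DETACH PARTITION price_history_2024;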

MySQL Performance of one vs. many tables

I know that MySQL usually handles tables with many rows well. However, I currently face a setting where one table will be read and written by multiple users (around 10) at the same time and it is quite possible that the table will contain 10 billion rows.
My setting is a MySQL database with an InnoDB storage engine.
I have heard of some projects where tables of that size became less efficient and slower, also concerning indexes.
I do not like the idea of having multiple tables with exactly the same structure just to split the rows. Main question: however, wouldn't that solve the issue of reduced performance caused by such a large number of rows?
Additional question: What else could I do to work with such a large table? The number of rows itself cannot be reduced.
"I have heard of some projects where tables of that size would become less efficient and slower, also concerning indexes."
This is not typical. So long as your tables are appropriately indexed for the way you're using them, performance should remain reasonable even for extremely large tables.
(There is a very slight drop in index performance as the depth of a BTREE index increases, but this effect is practically negligible. Also, it can be mitigated by using smaller keys in your indexes, as this minimizes the depth of the tree.)
In some situations, a more appropriate solution may be partitioning your table. This internally divides your data into multiple tables, but exposes them as a single table which can be queried normally. However, partitioning places some specific requirements on how your table is indexed, and does not inherently improve query performance. It's mainly useful to allow large quantities of older data to be deleted from a table at once, by dropping older partitions from a table that's partitioned by date.
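For instance, in MySQL that pattern looks roughly like this (a sketch with made-up table and column names):

    CREATE TABLE events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATE NOT NULL,
        payload    VARCHAR(255),
        -- the partitioning column must be part of every unique key, including the PK
        PRIMARY KEY (id, created_at)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- Dropping a year of old data is near-instant, unlike a multi-million-row DELETE
    ALTER TABLE events DROP PARTITION p2023;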

Database Optimisation through denormalization and smaller rows

Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same, and I will update/select the same number of columns in both cases.)
Example: I have a database that stores user details and their last-active timestamp. On my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id, f_name, l_name, email, mobile, verified_status). Is it a good idea to store the last-active time in the same table? Or is it better to make a separate table (say, user_active) to store the last-activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication, as the user_active table's columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables -- that is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL utilizes a binary search so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).
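As a concrete (hypothetical) sketch of the single-table approach with an index on the timestamp (the last_active column name and the "active within 15 minutes" cutoff are just assumptions):

    CREATE TABLE userinfo (
        id              INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        f_name          VARCHAR(100) NOT NULL,
        l_name          VARCHAR(100) NOT NULL,
        email           VARCHAR(255) NOT NULL,
        mobile          VARCHAR(20),
        verified_status TINYINT(1) NOT NULL DEFAULT 0,
        last_active     TIMESTAMP NULL,
        KEY idx_last_active (last_active)
    ) ENGINE=InnoDB;

    -- Touch the timestamp on activity
    UPDATE userinfo SET last_active = NOW() WHERE id = 123;

    -- "Active users and their names" comes off the index; the extra columns cost little here
    SELECT id, f_name, l_name
    FROM userinfo
    WHERE last_active >= NOW() - INTERVAL 15 MINUTE;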

MySQL performance: many rows and columns (MyISAM)

Since I'm still in the beginning of my site design I figured now's a good time to ask this.
I know that one of the ways to optimize MySQL queries is to split your rows into separate tables; however, that does have a few comfort issues.
What I'm considering is this: would querying a table of around 1'000'000 rows and 150 columns, using excellently designed indexes and fetching only the needed columns in each query, result in a much higher server load than splitting the table into multiple ones with fewer columns?
Big blob tables are an anti-pattern; never use them.
Normalized tables will run much much faster than a single blob.
InnoDB is optimized for many small tables that need to be joined.
Using a normalized table will save you many headaches besides:
Your data will be smaller, so more of it fits in memory.
You only store data in one place, so it cannot end up with inconsistent data.
MySQL only allows you to use one index per select per table, multiple tables means you get to use more indexes and get more speed.
Triggers on tables execute much faster.
Normalized tables are easier to maintain.
You have less indexes per table, so inserts are faster.
Indexes are smaller (fewer rows) and narrower (fewer columns) and will run much faster as a result.
If the data is static, you can pack the tables for greater efficiency; see the relevant page in the reference manual.