Best design for a database containing blobs - MySQL

I was wondering what the best database design is for an application in which I have to store lots of records, each with a blob associated (one to one) with it.
Is it better to use a separate table for blobs?
My application relies on MySQL and Hibernate.

Using a separate table would be better in the long run, especially if you have lots of blobs. When the blobs sit in the same table as the other fields, that table takes much longer to rebuild or alter. If the main table only holds a reference to the blob, those operations stay quick.
I googled for some support of this statement, and found this lengthy but interesting read: http://mysqldatabaseadministration.blogspot.com/2008/01/i-will-not-blob.html
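For illustration, a minimal sketch of that split; the table and column names here (record, record_blob, payload) are invented, not from the question:

    -- Main table stays small, so scans and ALTERs stay fast
    CREATE TABLE record (
        id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
        -- ... other scalar columns ...
    ) ENGINE=InnoDB;

    -- Blobs live in their own table, joined 1:1 on the primary key
    CREATE TABLE record_blob (
        record_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        payload   LONGBLOB NOT NULL,
        FOREIGN KEY (record_id) REFERENCES record (id)
    ) ENGINE=InnoDB;

Since the question mentions Hibernate: a 1:1 table like this maps naturally to a one-to-one association, and fetching it lazily keeps the blob out of memory until it is actually accessed (lazy loading of one-to-one associations has some caveats, so check the Hibernate docs for your version).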

Your choice should depend on the amount of data and the transaction volume. If the amount of BLOB data is small (say, fewer than 10,000 files), keeping the blobs inline is workable; otherwise it may become a bottleneck, as the article linked above explains.
Is it better to use a separate table for blobs?
Did you mean one table with all the columns of BLOB type? I don't think that is a good idea.
What to do then?
BLOB is one of the many data types available in SQL. Your database design should not depend on the data types you use. Say you want to store user details, including the user's image. There should simply be a usrImage column of type BLOB in the User table to store the image. Whether or not I use a BLOB, I would still have a User table.
BLOBs are like any other data type, so attach them wherever they fit in your DB design.
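As a sketch of the inline approach this answer describes, using the usrImage column mentioned above (the other names, and the choice of MEDIUMBLOB, are invented for illustration):

    -- The image is just another column on the User table
    CREATE TABLE User (
        usrId    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        usrName  VARCHAR(100) NOT NULL,
        usrImage MEDIUMBLOB   -- the user's picture, stored inline with the row
    ) ENGINE=InnoDB;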

Related

How should I organize user data with several rows in MySQL?

I am currently developing a quiz app that keeps track of user data, such as:
the sets they've studied, by the ID of the specific set ([1,2,6,12])
the friends they have, by the ID of the user ([1,2,3,4])
their schedule ({"2022-07-03 00:00:00":{"551":{"type":"Flashcards","setid":"1"},"552":{"type":"Flashcards","setid":"1"},"553":{"type":"Flashcards","setid":"1"},"554":{"type":"Flashcards","setid":"1"},"555":{"type":"Flashcards","setid":"1"},"556":{"type":"Flashcards","setid":"1"},"557":{"type":"Flashcards","setid":"6"},"558":{"type":"Flashcards","setid":"6"},"559":{"type":"Flashcards","setid":"6"},"560":{"type":"Flashcards","setid":"6"}}})
every individual day they've logged in (["05/15/2022","05/17/2022","05/18/2022","05/19/2022","05/22/2022","05/23/2022","05/24/2022","05/25/2022","05/28/2022","05/29/2022","05/30/2022","05/31/2022","06/02/2022","06/05/2022","06/07/2022","06/08/2022","06/10/2022","06/11/2022","06/13/2022","06/14/2022","06/15/2022","06/17/2022","06/18/2022","06/19/2022","06/20/2022","06/22/2022","06/24/2022","06/25/2022","06/26/2022","06/28/2022","06/29/2022","06/30/2022","07/01/2022","07/02/2022"])
Note: there are quite a few other types of information stored inside the table.
All of this aforementioned information is collected in a MySQL table called "users", which has a row for each user with the accompanying data (as mentioned above).
It has recently come to my attention that MySQL limits the amount of data that can be stored in a given row (around 65K bytes). If I continue to represent data this way, I believe that at scale (assume a user uses the app for 5 years, and imagine the amount of data inside the "every individual day they've logged in" field), I will hit MySQL's row limit, and it may cause problems in the future.
How would I better represent this type of table? Should I use multiple tables inside an SQL database? How should I format it? Do I not have to worry about the data limit, and should I continue saving data in this way?
Thanks.
If I understand this correctly, you are packing way too much information in each row. The structure of your data is not being represented in a way that allows MySQL to do what it is good at. You are just creating big buckets for each user and stashing them in MySQL.
To make this work better, you can either create tables to store each relationship (this is, after all, a relational database), like user_login, user_friend_requests, and so on; see the sketch below. The direct answer to your question is that almost every cell in your current table should become a table of its own.
OR, you can embrace the blob and use something like MongoDB, which is much more suited to storing and retrieving the data in a way that fits your mindset. Since you don't do any real queries on the data, a NoSQL solution would probably fit you better.
So the "right" answer to your question is "modify your schema to store this data better, or switch your database to match the way you want to store the data."
However, having said all that, since it seems you are storing JSON in those cells, you can use the JSON data type (max size 1GB, though it's better not to use anywhere near that much; see https://dev.mysql.com/blog-archive/how-large-can-json-documents-be), or LONGTEXT, which holds up to 4GB (assuming you are running in a 64-bit environment; see "Maximum length for MySQL type text").
The JSON data type actually has some pretty cool features.
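For instance, you can query inside the document directly in SQL. A sketch, assuming a hypothetical schedule column of type JSON on the users table:

    -- Pull one scheduled item's type out of the JSON document
    SELECT schedule->>'$."2022-07-03 00:00:00"."551".type' AS item_type
    FROM users
    WHERE JSON_CONTAINS_PATH(schedule, 'one', '$."2022-07-03 00:00:00"');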

Data compression in RDBMS like Oracle, MySQL etc

I was reading about an in-memory database that incorporates a data compression feature. Instead of storing values such as first name, last name, father's name, etc. as-is in the column (which leads to a lot of data duplication and wasted disk storage), it creates a dictionary and an attribute vector for each column, so that only unique values are stored in the dictionary and the corresponding attribute-vector entries are stored in the original table.
The clear advantage of this approach is that it saves a lot of space by removing the overhead of data duplication.
I want to know:
Do RDBMSs like Oracle, MySQL, etc. implicitly follow this approach when they store data on disk? Or, when we use these RDBMSs, do we have to implement it ourselves if we want the same advantage?
As we know, there is no free lunch, so I would like to understand the trade-offs if a developer implements the data compression approach explained above. One I can think of is that in order to fetch data from the database, I will have to join my dictionary table with the main table. Is that right?
Please share your thoughts and inputs.
This answer is based on my understanding of your query. It appears that you are mixing up two concepts: data normalisation and data storage optimisation.
Data normalisation: This is a process that needs to be undertaken by the application developer. Pieces of data that would otherwise be stored repeatedly are stored only once and referenced via their identifiers, which are typically integers. This way the database consumes only as much space as is needed to store the repeating data once. This is common practice when storing strings and variable-length data in database tables. To retrieve the data, the application has to perform joins between the related tables, and this contributes directly to application performance, depending on how the related tables are designed.
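A minimal sketch of that idea, using a hypothetical first-name dictionary:

    -- Dictionary table: each distinct first name is stored exactly once
    CREATE TABLE first_name (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(50) NOT NULL UNIQUE
    );

    -- Main table references the dictionary by a small integer id
    CREATE TABLE person (
        id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        first_name_id INT UNSIGNED NOT NULL,
        FOREIGN KEY (first_name_id) REFERENCES first_name (id)
    );

    -- Retrieval requires exactly the join the question anticipates
    SELECT p.id, fn.name
    FROM person p
    JOIN first_name fn ON fn.id = p.first_name_id;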
Data storage optimisation: This is handled by the RDBMS itself. It involves various steps, like maintaining the B-tree structures that hold the data, compressing data before storage, managing the free space within the data files, etc. Different RDBMS systems handle these in different ways (some of them patented and proprietary, while others are more general); however, when we are speaking of RDBMSs like Oracle and MySQL, you can be assured that they follow best-in-class storage algorithms to store data efficiently.
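As one concrete example on the storage side, InnoDB can compress a table's pages transparently, with no change to queries or application code (this sketch assumes innodb_file_per_table is enabled, which is the default in recent MySQL versions; the table is invented):

    -- Pages of this table are compressed on disk; SQL against it is unchanged
    CREATE TABLE event_log (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        payload TEXT
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;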
Hope this helps.

Having data stored across tables representing individual data types - Why is it wrong?

Say I have lots of time to waste and decide to make a database where information is not stored as entities but in separate inter-related tables representing INT, VARCHAR, DATE, TEXT, etc. types.
It would be such a revolution to never have to design a database structure ever again, except that the fact that no one else has done it probably indicates it's not a good idea :p
So why is this a bad design? What principles is this going against? What issues could it cause from a practical point of view with a relational database?
P.S.: This is just a learning exercise.
Why shouldn't you separate out the fields from your tables based on their data types? Well, there are two reasons, one philosophical, and one practical.
Philosophically, you're breaking normalization
A properly normalized database will have different tables for different THINGS, with each table having all fields necessary and unique for that specific "thing." If the only way to find the make, model, color, mileage, manufacture date, and purchase date of a given car in my CarCollectionDatabase is to join meaningless keys across three tables demarcated by data type, then my database has almost zero discoverability and no real cohesion.
If you designed a database like that, you'd find writing queries and debugging statements would be obnoxiously tiresome. Which is kind of the reason you'd use a relational database in the first place.
(And, really, that will make writing queries WAY harder.)
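To see why, here is roughly what fetching a single car could look like under such a design; every table and column name here is hypothetical:

    -- Reassembling one "car" from per-datatype tables
    SELECT v1.value AS make,
           v2.value AS model,
           i1.value AS mileage,
           d1.value AS manufacture_date
    FROM entity e
    JOIN varchar_values v1 ON v1.entity_id = e.id AND v1.attribute = 'make'
    JOIN varchar_values v2 ON v2.entity_id = e.id AND v2.attribute = 'model'
    JOIN int_values     i1 ON i1.entity_id = e.id AND i1.attribute = 'mileage'
    JOIN date_values    d1 ON d1.entity_id = e.id AND d1.attribute = 'manufacture_date'
    WHERE e.id = 42;
    -- versus, in a conventional design:
    -- SELECT make, model, mileage, manufacture_date FROM car WHERE id = 42;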
Practically, databases don't work that way.
Every database engine or data-storage mechanism I've ever seen is simply not meant to be used at that level of abstraction. Whatever engine you had, I don't know how you'd get around essentially doubling your data design just to track which field each value belongs to. And with a five-fold increase in row count, you'd have a massive increase in index size, to the point that once you get a few million rows your indexes wouldn't actually help.
If you tried to design a database like that, you'd find that even if you didn't mind the headache, you'd wind up with slower performance. Instead of 1,000,000 rows with 20 fields, you'd have that one table with just as many fields, and some 5-6 extra tables with 1,000,000+ entries each. And even if you optimized that away, your indexes would be larger, and larger indexes run slower.
Of course, those two ONLY apply if you're actually talking about databases. There's no reason, for example, that an application can't serialize to a text file of some sort (JSON, XML, etc.) and never write to a database.
And just because your application needs to store SQL data doesn't mean that you need to store everything that way, or that you can't use homogeneous and generic tables. An Access-like application that lets users define their own "tables" might very well keep each field in a distinct row... although in that case your database's THINGS would be those tables and their fields. (And it wouldn't run as fast as a natively written database.)

How to decide between row based and column based table structures?

I have a data set with hundreds of parameters (and more coming in).
If I dump them into one table, it'll probably end up having hundreds of columns (and I am not even sure how many, at this point).
I could go row-based, with a bunch of meta tables, but somehow a row-based structure feels unintuitive.
One more way would be to keep it column-based but split it into multiple tables (divided logically), which seems like a good solution.
Is there any other way to do it? If yes, could you point me to some tutorial? (I am using MySQL)
EDIT:
Based on the answers, I should clarify one thing: updates and deletes are going to be much less frequent than inserts and selects. As it is, selects will be the bulk of the operations, so they have to be fast.
I ran across several designs where a #4 was possible:
Split your columns into searchable and auxiliary
Define a table with only searchable columns, and an extra BLOB column
Put everything in one table: searchable columns go as-is, auxiliary go as a BLOB
We used this approach with BLOBs of XML data or even binary data representing the entire serialized object. The downside is that your auxiliary columns remain non-searchable for all practical purposes. The upside is that you can add new auxiliary columns at will without changing the schema. You can also promote a previously auxiliary column to a searchable one with a single schema change and a very simple migration program.
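A sketch of that layout, with invented names:

    -- Searchable columns are first-class; everything else rides along as a blob
    CREATE TABLE measurement (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        device_id   INT UNSIGNED NOT NULL,
        recorded_at DATETIME NOT NULL,
        aux         BLOB,  -- serialized object holding the remaining parameters
        KEY idx_device_time (device_id, recorded_at)
    );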
It all depends on the kind of data you need to store.
If it's not "relational" at all - for instance, a collection of web pages, documents, etc - it's usually not a good fit for a relational database.
If it's relational, but highly variable in schema - e.g. a product catalogue - you have a number of options:
single table with every possible column (your option 1)
"common" table with the attributes that each type shares, and joined tables for attributes for subtypes
table per subtype
If the data is highly variable and you don't want to make schema changes to accommodate the variations, you can use "entity-attribute-value" or EAV, though this has some significant drawbacks in the context of a relational database. I think this is what you have in mind with option 2.
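For reference, a bare-bones EAV sketch (names invented; note how every value degenerates to a string with no type checking, which is one of the drawbacks mentioned):

    -- Entity-attribute-value: one row per (entity, attribute) pair
    CREATE TABLE product_attribute (
        product_id INT UNSIGNED NOT NULL,
        attribute  VARCHAR(64)  NOT NULL,  -- e.g. 'color', 'weight_kg'
        value      VARCHAR(255) NOT NULL,  -- everything is stored as a string
        PRIMARY KEY (product_id, attribute)
    );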
If the data is indeed relational, and there is at least the core of a stable model in the data, you could of course use traditional database design techniques to come up with a schema. That seems to correspond with option 3.
Does every item in the data set have all those properties? If yes, then one big table might well be fine (although scary-looking).
On the other hand, perhaps you can group the properties. The idea being that if an item has one of the properties in the group, then it has all the properties in that group. If you can create such groupings, then these could be separate tables.
So should they be separate? Yes, unless you can prove that the cost of performing joins is unacceptable. Perform all SELECTs via stored procedures and you can denormalise later on without much trouble.

Is there a better place to store large amounts of unused data than the database?

So the application we've got calls the APIs of all the major carriers (UPS, FedEx, etc.) for tracking data.
We save the most recent version of the XML feed we get from them in a TEXT field in a table in our database.
We really hardly ever (read, never so far) access that data, but have it "just in case."
It adds quite a bit of additional weight to the database. Right now a 200,000-row table comes in at around 500MB, the large majority of which is composed of all that XML data.
So is there a more efficient way to store all that XML data? I had thought about saving it as actual text/XML files, but we update the data every couple of hours, so I wasn't sure it would make sense to do that.
Assuming it's data, there's no particular reason not to keep it in your database (unless it's impeding your backups). But it would be a good idea to keep it in a separate table from the actual data that you do need to read on a regular basis: just the XML, an FK back to the original table, and possibly an auto-numbered PK column.
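A sketch of that split, assuming the original table is called shipment with primary key id (all names invented for illustration):

    -- Cold storage: the rarely-read XML, joined 1:1 to the hot table
    CREATE TABLE shipment_feed_xml (
        shipment_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        feed_xml    MEDIUMTEXT NOT NULL,
        FOREIGN KEY (shipment_id) REFERENCES shipment (id)
    ) ENGINE=InnoDB;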
It has been my experience that the biggest trouble with TEXT/BLOB columns that are consistently large, is that people are not careful to prevent reading them when scanning many rows. On MyISAM, this will waste your VFS cache, and on InnoDB, it will waste your InnoDB buffer pool.
A secondary problem is that as tables get bigger, they become harder to maintain: adding a column or index can rebuild the whole table, and a 500MB table rebuilds a lot slower than a 5MB table.
I've had good success moving things like this off into offline key/value storage such as MogileFS and/or Tokyo Tyrant.
If you don't need to be crazy scalable, or you value transactional consistency over performance, then simply moving this column into another table with a 1:1 relationship to the original table will at least require a join before anything can blow out the buffer pool, and it will let you maintain the original table without having to tip-toe around the 500MB gorilla.
If it's really unused, try:
/dev/null
I don't know what kind of data these XML streams contain, but maybe you can parse them and store only the pertinent info in a table or set of tables; that way you can eliminate some of the XML's bloat.
Learn about OLAP techniques and data warehouses. They are probably what you are looking for.
As a DATAbase is designed to store DATA, this seems to be the logical place for it. A couple of suggestions:
Rather than storing it in a separate table, use a separate database if the information isn't critical.
Have a look at the COMPRESS and UNCOMPRESS functions, as these could reduce the size of the verbose XML.
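For example (COMPRESS returns a binary string, so the column should be a BLOB type rather than TEXT; tracking_feed and its columns are invented for illustration):

    -- Write the XML compressed, read it back with UNCOMPRESS
    INSERT INTO tracking_feed (shipment_id, feed_xml_gz)
    VALUES (42, COMPRESS('<TrackResponse>...</TrackResponse>'));

    SELECT UNCOMPRESS(feed_xml_gz) AS feed_xml
    FROM tracking_feed
    WHERE shipment_id = 42;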
I worked on one project where we split data between the database and file system. After this experience I vowed never again. Backups and maintenance of various production/test/dev environments turned into a nightmare.
Why not store them as text files, and then keep a simple path (or relative path) in the database?
We used to do something similar in the seismic industry, where the bulk of the data was big arrays of floating-point numbers. It was much more efficient to store these as files on disk (or tape) and then keep only the trace metadata (position, etc.) in an RDBMS-like database (this was at about the time they were porting to Oracle!). Even with the old system, the field data was always on disk and easily accessible; it was used more frequently than the array data (although, unlike in your case, it was most definitely essential!).