How best to store web application images in the filesystem - mysql

What is the best practice for storing a large number of files that are referenced in a database in the file system?
We're currently moving from a system that stores around 14,000 files (around 6GB of images and documents) in a MySQL database. This is quickly becoming unmanageable.
We currently plan to save the files by their database primary key in the file system. I'm concerned about the possible performance issues of having that many files in the same folder. Also, these files will be inserted by several different applications on the same server.
Specifically I'd like to know:
Is this a good solution given these parameters?
Will it leave room to scale further in the future?
Are there any concerns about storage of many files in the same location?
Is there a better way to name/distribute the files?

I like to name the files as follows:
/* create a dated directory such as 2024/05/17 (relative to the current working directory) */
$dir = date('Y/m/d');
if (!is_dir($dir)) { mkdir($dir, 0755, true); }

Hash the contents with MD5, then add a suffix (the PK will suffice for this) to get the file's new filename. Create 16 folders corresponding to the first character of the hash. Create 16 folders under each of those for the second character. Store the image in the appropriate path based on the first 2 hex characters of the hash, then add the hash to the appropriate record in the database.
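The hash-based layout described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the poster's own code; the function names `sharded_path` and `store`, the `root` directory, and the `_` separator between hash and primary key are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sharded_path(root, content, pk):
    """Return (path, digest) for a file stored under root/<h0>/<h1>/<md5>_<pk>.

    h0 and h1 are the first two hex characters of the MD5 digest, which
    gives 16 x 16 = 256 leaf directories.
    """
    digest = hashlib.md5(content).hexdigest()
    path = Path(root) / digest[0] / digest[1] / f"{digest}_{pk}"
    return path, digest


def store(root, content, pk):
    """Write the content to its sharded location and return (path, digest).

    The caller is expected to save the digest in the matching database row.
    """
    path, digest = sharded_path(root, content, pk)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return path, digest
```

With roughly 14,000 files spread over 256 leaf directories, each directory would hold on the order of 55 files, assuming the hashes are evenly distributed.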

Related

MYSQL data on multiple drives

I have a MySQL database on my SDA. It's mostly all one schema with "popular" tables in it. I want to store the less "popular" tables of the schema (which take up another 1TB or so) on my SDB partition.
What is the right way to do this? Do I need another MySQL server running on that drive? Or can I simply set something like DATA_DIRECTORY=? This is Ubuntu and MySQL 5.7.38. Thank you for any help, it's much appreciated.
As of MySQL 8.0.21, the ability to specify the data directory per table has finally improved.
CREATE TABLE t1 (c1 INT PRIMARY KEY) DATA DIRECTORY = '/external/directory';
Read https://dev.mysql.com/doc/refman/8.0/en/innodb-create-table-external.html#innodb-create-table-external-data-directory for details.
In earlier versions of MySQL, you could use symbolic links. That is, the link still has to reside under the default data directory, but the link can point to a file on another physical device.
It was unreliable to use symbolic links for individual tables in this way, because OPTIMIZE TABLE or many forms of ALTER TABLE would recreate the file without the symbolic link, effectively moving it back to the primary storage device. To solve this, it was recommended to use a symbolic link for the schema subdirectory instead of individual tables.
To be honest, I've never found a case where I needed to use either of these techniques. Just keep it simple: one data directory on one filesystem, and don't put the data directory on the same device as the root filesystem. Make sure the data storage volume is large enough for all your data. Use software RAID if you need to use multiple devices to make one larger filesystem.

Storing HTML files

We have about 60 million webpages in a compressed format. We would like to de-compress and work with these files individually.
Here are my questions!
First, if I decompress them into the file system, would the FS cope with such a number of files? My file system is ext4. (I have 4 different file systems, so I can divide the data between them, e.g. 15 M pages for each file system.)
Secondly, would storing these files in a relational database be a better option, assuming that all the hassle of cleaning the HTML text is done before inserting them into the database?
Thanks,
If you extract them into a single directory, you may exceed the maximum number of entries the directory index can hold. If you extract them into multiple directories, you will fare better.
60 million is definitely a fair amount. If you plan on doing any indexing or searching, a database would be your best option, although you can also index files directly using something like Lucene. It all depends on what you want to do with the files after they have been extracted.
I currently have a similar issue with images on a large user site. I got around it by giving each image a GUID and using each byte of the GUID as a directory level: the first byte names a directory, the next byte a subdirectory under it, and so on (down to 8 bytes). If my fill ratio goes up, I'll create more subdirectory levels to compensate. It also means I can spread the tree across different network storage boxes.
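One way to read the GUID scheme above is sketched below in Python. The helper name `guid_path`, the `depth` parameter, and the choice to keep the full GUID as the file name are assumptions for illustration, not details given in the answer.

```python
import uuid
from pathlib import Path


def guid_path(root, guid, depth=2):
    """Map a GUID to root/<byte0>/<byte1>/.../<full hex guid>.

    Each directory level is named after one byte (two hex characters) of
    the GUID, so every level fans out into at most 256 subdirectories.
    """
    hex_digits = guid.hex  # 32 hex characters, no dashes
    levels = [hex_digits[2 * i:2 * i + 2] for i in range(depth)]
    return Path(root).joinpath(*levels, hex_digits)


# Example: a new image gets a random GUID and a three-level directory path.
new_image_path = guid_path("/data/images", uuid.uuid4(), depth=3)
```

Raising `depth` later only affects where new files are placed, so existing files would need to be moved or looked up with their original depth.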

Store image files or URLs in MySQL database? Which is better? [duplicate]

Possible Duplicate:
Storing Images in DB - Yea or Nay?
Images in database vs file system
I've been developing a web application using RIA technologies (Flex + PHP + MySQL + Ajax) and now I'm in a dilemma about image files.
I use some images in my Flex app, so I thought "it could be awesome if I store them in the database and then retrieve them from it; consequently, maintenance should be easier". But here is my dilemma:
Should I store the physical URL of my images, or would it be better to store the image directly?
For example, should my Cars table look like this:
ID (autonumeric) | Source (text)
or like this?
ID (autonumeric) | Image (longblob or blob)
I know that there are cool people here who can answer this question and explain which is better and why :)
I personally recommend storing images in the database. Of course, it has both advantages and disadvantages.
Advantages of storing BLOB data in the database:
It is easier to keep the BLOB data synchronized with the remaining items in the row.
BLOB data is backed up with the database. Having a single storage system can ease administration.
BLOB data can be accessed through XML support in MySQL, which can return a base 64–encoded representation of the data in the XML stream.
MySQL full-text search (FULLTEXT indexes) can be performed against character columns (CHAR, VARCHAR, and TEXT, including Unicode data). Note that, unlike SQL Server, MySQL cannot build a FULLTEXT index on a BLOB column, so text inside stored Microsoft Word or Microsoft Excel documents is not searchable this way.
Disadvantages of Storing BLOB Data in the Database:
Carefully consider what resources might be better stored on the file system rather than in a database. Good examples are images that are typically referenced via HTTP HREF. This is because:
Retrieving an image from a database incurs significant overhead compared to using the file system.
Disk storage on database SANs is typically more expensive than storage on disks used in Web server farms.
As a general rule you want to keep your databases small, so they perform better (and back up better too). So if you can store only a filesystem reference (path + filename) or URL in the DB, that would be better.
It's probably a question of personal preference.
As a general rule it's better to keep the database small. However, enterprise applications regularly add the images directly to the database. If you place them on the file system, the DB and your file system can get out of sync.
Larger CMSs will regularly place those files in the DB. However, be aware that this requires a larger DB as everything grows.
When you are saving the url and name only, be sure that these won't change in the future.
With files stored in the database you can implement security more easily, and you don't have to worry about duplicate filenames.
I used to store the path into the URL, but then adding an additional web server to the mix proved less than ideal. For one thing, you'll have to share the path to where the images are stored. We were using NFS and it became slow after a while. We tried syncing the files from one web server to another but the process became cumbersome.
Having said that, I would store them in the DB. I've since moved all my image/file storage over to MongoDB. I know this doesn't satisfy your needs, but we've tried it all (even S3) and we weren't happy with the other solutions. If we had to, I would definitely throw them inside MySQL.
Personally, I've always stored the URL.
There's no real reason not to store the image directly in the database, but there are benefits to not storing it in the database.
You get more flexibility when you don't store the image in the database. You can easily move it around and just update the URL in the file. So, if you wanted to move the image from your webserver to a service such as Flickr or Amazon Web Services, it would just be as easy as updating the link to the new files. That also gives you easy access to content delivery networks so that the images are delivered to end users quicker.
I'd store the url, it's less data and that means a smaller database and faster data fetching from it ;)

Storing image in database vs file system (is this a valid use case?)

I have an application where every user gets their own database and runs from the same file system folder. (The database is determined by subdomain.)
Storing in the filesystem could lead to conflicts. I'd imagine the uploaded images would be small. (I would scale them down before storing.)
Is it ok in this case to store in database?
(I know this has been asked a lot.
I also want to make my application easy to install, and creating a writable folder is hard for some people.)
To take the contrary view from Nathanial -- I find it easier to use the data base to store opaque data like images. When you back up the data base, you automatically get a backup of the images. Also, you can retrieve, update, or delete the image along with all the other data in integrated SQL queries; keeping the files separately means writing much more complex code that has to go out to the file system to maintain data integrity every time you issue certain SQL queries. Locking can be a big problem, and transaction processing (especially rollback) even bigger.
Seems like you've already sort of talked yourself into it, but in my experience it's better to store files in a filesystem and data in a database. Use GUIDs for the file names if you are worried about a conflict.
Pasting my answer from a similar post: I have implemented both solutions (file system and database-persisted images) in previous projects. In my opinion, you should store images in your database. Here's why:
File system storage is more complicated when your app servers are clustered. You have to have shared storage. Even if your current environment is not clustered, this makes it more difficult to scale up when you need to.
You should be using a CDN for your static content anyways, and set your app up as the origin. This means that your app will only be hit once for a given image, then it will be cached on the CDN. CloudFront is dirt cheap and simple to set up...there's no reason not to use it. Save your bandwidth for your dynamic content.
It's much quicker (and thus cheaper) to develop database persisted images.
You get referential integrity with database persisted images. If you're storing images on the file system, you will inevitably have orphan files with no matching database records, or you'll have database records with broken file links. This WILL happen...it's just a matter of time. You'll have to write something to clean these up.
Anyways, my two cents.
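For anyone who does keep images on the file system, the cleanup job mentioned in the last point might look roughly like this. It is only a sketch in Python: `find_mismatches` is a hypothetical helper, and `db_paths` stands for the relative paths read from whatever column the application stores them in.

```python
from pathlib import Path


def find_mismatches(root, db_paths):
    """Compare files under root with the relative paths recorded in the database.

    Returns (orphans, broken):
      orphans - files on disk that no database record points to
      broken  - database paths for which no file exists on disk
    """
    root = Path(root)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    referenced = set(db_paths)
    return sorted(on_disk - referenced), sorted(referenced - on_disk)
```

A job like this only reports the differences; whether orphans are deleted or broken records are repaired is an application decision, and files uploaded while the scan runs may show up as false orphans.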

Should I use MySQL blob field type?

I am struggling to decide if I should be using the MySQL blob field type in an upcoming project I have.
My basic requirements are, there will be certain database records that can be viewed and have multiple files uploaded and "attached" to those records. Seeing said records can be limited to certain people on a case by case basis. Any type of file can be uploaded with virtually no restriction.
So looking at it one way, if I go the MySQL route, I don't have to worry about viruses creeping in or random PHP files getting uploaded and somehow executed. I also have a much easier path for permissioning and keeping data tied close to a record.
The other obvious route is storing the data in a specific folder structure outside of the webroot. In this case I'd have to come up with a special naming convention for folders/files to keep track of what they reference inside the database.
Is there a performance hit with using MySQL blob field type? I'm concerned about choosing a solution that will hinder future growth of the website as well as choosing a solution that won't be easy to maintain.
Is there a performance hit with using MySQL blob field type?
Not inherently, but if you have big BLOBs clogging up your tables and memory cache that will certainly result in a performance hit.
The other obvious route is storing the data in a specific folder structure outside of the webroot. in this case I'd have to come up with a special naming convention for folders/files to keep track of what they reference inside the database.
Yes, this is a common approach. You'd usually do something like have folders named after each table they're associated with, containing filenames based only on the primary key (ideally an integer; certainly never anything user-submitted).
Is this a better idea? It depends. There are deployment-simplicity advantages to having only a single data store, and not having to worry about giving the web user write access to anything. Also if there might be multiple copies of the app running (eg active-active load balancing) then you need to synchronise the storage, which is much easier with a database than it is with a filesystem.
If you do use the filesystem rather than a blob, the question is then, do you get the web server to serve it by pointing an Alias at the folder?
+ is super fast
+ caches well
- extra server config: virtual directory; needs appropriate file extension to return desired Content-Type
- extra server config: need to add Content-Disposition: attachment/X-Content-Type-Options headers to stop IE sniffing for HTML as part of anti-XSS measures
or do you serve the file manually by having a server-side script spit it out, as you would have to when serving from a MySQL blob?
- is potentially slow
- needs a fair bit of manual If-Modified-Since and ETag handling to cache properly
+ can use application's own access control methods
+ easy to add correct Content-Type and Content-Disposition headers from the serving script
This is a trade-off there's not one globally-accepted answer for.
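To give a feel for the "manual If-Modified-Since and ETag handling" a serving script needs, here is a simplified sketch in Python. The function name `conditional_response`, the content-hash ETag, and the plain dict of request headers are assumptions for the example; weak validators and `If-None-Match: *` are deliberately not handled.

```python
import hashlib
from email.utils import formatdate, parsedate_to_datetime


def conditional_response(body, mtime, request_headers):
    """Return (status, headers, payload) for a GET with cache validators.

    body is the file content, mtime its modification time as a Unix
    timestamp, and request_headers a dict of incoming header values.
    """
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "ETag": etag,
        "Last-Modified": formatdate(mtime, usegmt=True),
    }

    if_none_match = request_headers.get("If-None-Match")
    if_modified_since = request_headers.get("If-Modified-Since")

    not_modified = False
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        not_modified = etag in candidates
    elif if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
            not_modified = int(mtime) <= since
        except (TypeError, ValueError):
            not_modified = False  # unparseable date: send the full body

    if not_modified:
        return 304, headers, b""
    return 200, headers, body
```

A web server serving static files does the equivalent of this on its own, which is the caching advantage listed for the Alias approach.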
If your web server will be serving these uploaded files over the web, the performance will almost certainly be better if they are stored on the filesystem. The web server will then be able to apply HTTP caching hints such as Last-Modified and ETag which will help performance for users accessing the same file multiple times. Additionally, the web server will automatically set the correct Content-Type for the file when serving. If you store blobs in the database, you'll end up implementing the above mentioned features and more when you should be getting them for free from your web server.
Additionally, pulling large blob data out of your database may end up being a performance bottleneck on your database. Also, your database backups will probably be slower because they'll be backing up more data. If you're doing ad-hoc queries during development, it'll be inconvenient seeing large blobs in result sets for select statements. If you want to simply inspect an uploaded file, it will be inconvenient and roundabout to do so because it'll be awkwardly stored in a database column.
I would stick with the common practice of storing the files on the filesystem and the path to the file in the database.
In my experience storing a BLOB in MySQL is OK, as long as you store only the blob in one table, while the other fields are in another (joined) table. Conversely, searching in the fields of a table with a few standard fields and one blob field with 100 MB of data can slow queries dramatically.
I had to change the data layer of a mailing app for this issue where emails were stored with content in the same table as date sent, email addresses, etc. It was taking 9 secs to search 10000 emails. Now it takes what it should take ;-)
Data should be stored in one consistent place: the database.
This performance and Content-Type thing is not an issue at all, because there is nothing stopping you from caching those BLOB fields to the local web server and serving it from there as it is requested for the first time. You do not need to access that table on every page view.
This file system cache can be emptied out at any moment, which will only impact performance temporarily as it is being refilled automagically. It will also enable you to use one database and many web servers as your application grows, they will simply all have a local cache on the file system.
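The cache-on-first-request idea described above could be sketched like this in Python. `cached_file` and the `fetch_blob` callable (which would run the actual SELECT against the BLOB table) are hypothetical names introduced for the example.

```python
import os
import tempfile
from pathlib import Path


def cached_file(cache_dir, file_id, fetch_blob):
    """Return the local cache path for file_id, filling the cache on a miss.

    fetch_blob(file_id) is only called when the file is not yet cached.
    The data is written to a temporary file first and then renamed, so a
    concurrent reader never sees a half-written file.
    """
    path = Path(cache_dir) / str(int(file_id))
    if path.exists():
        return path

    data = fetch_blob(file_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.replace(tmp_name, path)
    return path
```

Deleting the cache directory is safe in this scheme: the next request for each file simply triggers another database read. The sketch does not cover invalidation when a BLOB is updated in the database.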
Many people recommend against storing file attachments (usually this applies to images) in blobs in the database. Instead they prefer to store a pathname as a string in the database, and store the file somewhere safe on the filesystem. There are some merits to this:
Database and database backups are smaller.
It's easier to edit files on the filesystem if you need to work with them ad hoc.
Filesystems are good at storing files. Databases are good at storing tuples. Let each one do what it's good at.
There are counter-arguments too, that support putting attachments in a blob:
Deleting a row in a database automatically deletes the associated attachment.
Rollback and transaction isolation work as expected when data is in a row, but not when some part of the data is on the filesystem.
Backups are simpler if all data is in the database. No need to worry about making consistent backups of data that's changing concurrently during the backup procedure.
So the best solution depends on how you're going to be using the data in your application. There's no one-size-fits-all answer.
I know you tagged your question with MySQL, but if folks reading this question use other brands of RDBMS, they might want to look into BFILE when using Oracle, or FILESTREAM when using Microsoft SQL Server 2008. These give you the ability to store files outside the database but access them like they're part of a row in a database table (more or less).
Large volumes of data will eventually take their toll on performance. MS SQL 2008 has a specialized way of storing binary data in the file system:
http://msdn.microsoft.com/en-us/library/cc949109.aspx
I would employ a similar approach for your project.
You can create a FILES table that keeps information about the files, such as their original names. To store files safely on disk, rename them using, for example, GUIDs. Store the new file names in your FILES table, and when a user needs to download a file you can easily locate it on disk and stream it to the user.
In my opinion, storing files in the database is a bad idea. What you can store there is the id, name, type, possibly an MD5 hash of the file, and the date inserted. Files can be uploaded into a folder outside the public location. Also be aware that it is not advised to keep more than 1000 files in one folder, so you have to create a new folder each time the file id increases by 1000.
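The "new folder every 1000 ids" rule can be expressed with integer division. The sketch below is in Python; `bucket_path`, the `root` argument, and naming each folder after the bucket number are assumptions for illustration.

```python
from pathlib import Path


def bucket_path(root, file_id, per_folder=1000):
    """Place file_id in a folder numbered file_id // per_folder.

    With per_folder=1000, ids 0-999 go to folder 0, ids 1000-1999 to
    folder 1, and so on, so no folder holds more than 1000 files.
    """
    bucket = file_id // per_folder
    return Path(root) / str(bucket) / str(file_id)
```

For the 14,000 files mentioned in the first question, this would produce about 15 folders.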