I want MySQL to store its data on Amazon S3, so I mounted an S3 bucket to my server and changed the data dir path in my.cnf to the mounted directory.
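For reference, the change was along these lines (the mount point path is a placeholder, since the actual one isn't shown here):

    # my.cnf
    [mysqld]
    datadir = /mnt/s3bucket   # the s3fs-mounted bucket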
After doing this, I restarted the server and created a database without any problem, but when I try to create a table (say test), I get the following error:

    ERROR 1033 (HY000): Incorrect information in file: './test/t.frm'
Can anyone please tell me whether what I am trying to do is actually possible?
If yes, where am I going wrong?
If no, why not?
There is no viable solution for storing MySQL databases on S3. None.
There's nothing wrong with using s3fs in limited applications where it is appropriate, but it's not appropriate here.
S3 is not a filesystem. It is an object store. Modifying a single byte of a multi-gigabyte "file" in S3 requires that the entire object be copied over itself.
Now... there are tools like s3nbd and s3backer that take a different approach to using S3 for storage. These use S3 to emulate a block device on which you can create a filesystem, and they come closer than s3fs to being an appropriate bridge between what S3 is and what MySQL needs, but even this approach cannot be used reliably, for one reason.
Consistency.
When MySQL writes data to a file, it needs absolute assurance that if it reads that same data back, it will get exactly what it wrote. S3 does not guarantee this.
Q: What data consistency model does Amazon S3 employ?
Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.
https://aws.amazon.com/s3/faqs/
When an object in S3 is "modified" (that's done with an overwrite PUT), there is no guarantee that a read of that file won't return a previous version for a short time after the write occurred.
In short, you are pursuing an essentially impossible objective trying to use S3 for something it isn't designed to do.
There is, however, a built-in mechanism in MySQL that can save on storage costs: InnoDB natively supports on-the-fly table compression.
Or if you have large, read-only MyISAM tables, those can also be compressed with myisampack.
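As a sketch, both look like this (database and table names are placeholders; InnoDB compression also requires innodb_file_per_table to be enabled):

    -- InnoDB on-the-fly compression
    ALTER TABLE mydb.big_table
        ROW_FORMAT=COMPRESSED
        KEY_BLOCK_SIZE=8;

and, from the shell, for a read-only MyISAM table:

    # compress the table, then rebuild its indexes
    myisampack /var/lib/mysql/mydb/big_table.MYI
    myisamchk -rq /var/lib/mysql/mydb/big_table.MYI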
Some EC2 instances include ephemeral disks (the instance store): zero-cost but volatile drives that should never be used for critical data. They might be a good option to consider, though, if the database in question is a secondary database that can easily be rebuilt from authoritative sources in the event of data loss. They can be quite nice for "disposable" databases, like QA databases or log analytics, where the database is not the authoritative store for the data.
Actually, S3 is not really a filesystem, so it will not work as the data directory in a normal scenario.
You might be able to use it as the data directory by mounting it over a path like /var/lib/mysql, but it will still perform slowly, so I don't think it is a good idea.
An S3 bucket is object storage where you can store your images, files, backups, etc.
If you still want to use it as the data directory, you can find help here:
http://centosfaq.org/centos/s3-as-mysql-directory/
Files cannot be appended to or modified in place in AWS S3 once created, so it might not be possible to store a MySQL DB on S3.
MySQL with the RocksDB engine (MyRocks) can possibly do this:
Run MyRocks on s3fs, or
Use Rockset's RocksDB-Cloud and modify MyRocks to support RocksDB-Cloud.
Both solutions would likely require some modification of MyRocks.
See the source code:
MyRocks
RocksDB-cloud
Suppose a user of my website uploads a PDF to my website, which is live on the internet. Is there a way for those files, after being uploaded, to be stored directly in the MySQL database on my own system (laptop)?
To refine the question: does it matter whether one uses a MySQL database on a local system (localhost) or on a live website to store data? Will the database fail to store data if the website is hosted online?
If any part of the question is unclear, please say so.
Thank you.
There are a lot of nuances to your question, and I'll try to address as many of them as I can.
I would not store files directly in the database. You certainly can, but in general you're going to get better performance and other ancillary benefits from storing files as a file in the file system. Store metadata in the database, including at the very least the file name and path on disk (perhaps you want to store more, like the uploader's account information, the size, a long-form text description, and so on, but at least store the path and filename). Then, in your application, fetch the filename from the database and serve the file instead of a database BLOB. One reason is that MySQL performance can really suffer if you don't do this properly.
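A minimal sketch of such a metadata table (the name and columns are just illustrative):

    CREATE TABLE uploads (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        file_name    VARCHAR(255) NOT NULL,  -- original name as uploaded
        disk_path    VARCHAR(512) NOT NULL,  -- where the bytes live on disk
        size_bytes   BIGINT UNSIGNED NOT NULL,
        uploader_id  INT UNSIGNED NULL,      -- optional extras
        uploaded_at  DATETIME NOT NULL
    ) ENGINE=InnoDB;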
Let's say you decide to defy my suggestion and store the file as BLOB data in your database; how would you replicate that to your laptop? Your laptop isn't going to be powered on and connected to the internet all the time, and even if you had a server at home running 24 hours a day, your hosting provider should still have better uptime than your home does. What should happen to an upload if you were hosting the database on your laptop, but your laptop was off (or rebooting for system updates)? So you should host the database at the hosting provider and somehow sync it to your local machine. MySQL provides several methods for this: replication, export and import of .sql files, or shipping binary logs. These each have tradeoffs that you'll want to consider depending on your needs.
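The simplest of those, export and import, is just a pair of commands (the database name is a placeholder):

    # on the host: dump the database to a .sql file
    mysqldump --single-transaction mydb > mydb.sql
    # on the laptop: load it into a local server
    mysql mydb < mydb.sql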
But remember how I said you can get other ancillary benefits from storing the file on the file system directly? One of those is that you can rely on file transfer techniques to get the file to your local machine. SFTP, SCP, SyncThing, WebDAV, and any other way you can imagine transferring files can be used to get the remote file to your local system. You wouldn't automatically get the database metadata, but that didn't seem like much of a requirement from your question, so you'd have easy access to the file as uploaded, as quickly as you want.
So there are plenty of ways to accomplish this, and without more details on your question it's tough to recommend a solution, but you have plenty of options available.
I have an application where customers upload files like PowerPoint presentations and Excel spreadsheets through a web UI. The files then have metadata associated with them and are stored as BLOBs in a MySQL database. The users may download these files occasionally, but not very often. The emphasis here is on archiving. Security of data is also important.
If that is the case, what are the pros and cons of storing the files as BLOBs in MySQL as opposed to putting them on Amazon S3? I've never used S3 before but hear that it's popular for storing files.
The main advantage of relational databases (such as MySQL) is the elegance with which they let you query for data. BLOB columns, however, offer very little in terms of rich query semantics compared to other column types, so if that's your main use case, there's hardly any reason to use a relational database at all; it doesn't offer much above and beyond a regular filesystem or a simple key-value datastore (such as S3).
Dollars to bytes, S3 is likely much more cost effective.
On the other hand, there are some things a relational database can bring that would be worthwhile. The most obvious is transactional semantics (only with the InnoDB engine, not available with MyISAM), so that you can safely know that whole groups of uploads or modifications take place consistently. Another advantage is that you can still add metadata about your blobs (even if it's only over time, as your application improves), so you can still benefit from the rich queries MySQL supports.
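For example, transactional semantics means a multi-row upload either lands completely or not at all (a sketch with hypothetical tables):

    START TRANSACTION;
    INSERT INTO uploads (customer, file_name) VALUES ('acme', 'deck.pptx');
    INSERT INTO upload_meta (upload_id, k, v)
        VALUES (LAST_INSERT_ID(), 'department', 'sales');
    COMMIT;  -- both rows become visible together, or neither does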
Storing binary data in a BLOB:
makes your database fat
has size limitations (overcome in later versions of MySQL)
loses data portability (you need a MySQL API/client to access the data)
offers no true security
If you are archiving the binary data, store it in a normal disk file.
If security is important, consider separating your UI server from your storage server, though that makes archiving harder; you can always consider protecting the binary files themselves with a password or encryption.
For security on Amazon S3, see:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?UsingAuthAccess.html
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?S3_QSAuth.html
Security of data is also important.
Do note that files on S3 are not stored on encrypted disks, so you may have to encrypt client-side or on your servers before sending it up to S3.
I've been storing data in S3 for years and completely love it! What I do is upload the file to S3 (where it's copied multiple times, by the way) and then store a reference to the file path and name in my MySQL files table. If nothing else, it takes that much load off the MySQL DB, and S3 now offers AES-256 encryption with rotating master keys, so you know it's secure!
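A minimal sketch of that flow with the AWS SDK for PHP and PDO (bucket, region, credentials, and table names are all placeholders):

    <?php
    require 'vendor/autoload.php';
    use Aws\S3\S3Client;

    $s3 = new S3Client(['region' => 'us-east-1', 'version' => 'latest']);
    $key = 'uploads/' . basename($_FILES['doc']['name']);

    // Push the bytes to S3, asking for server-side AES-256 encryption.
    $s3->putObject([
        'Bucket'               => 'my-files-bucket',
        'Key'                  => $key,
        'SourceFile'           => $_FILES['doc']['tmp_name'],
        'ServerSideEncryption' => 'AES256',
    ]);

    // Keep only the reference in MySQL.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO files (s3_key, name) VALUES (?, ?)');
    $stmt->execute([$key, $_FILES['doc']['name']]);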
I have a web application that stores a lot of user generated files. Currently these are all stored on the server filesystem, which has several downsides for me.
When we move "folders" (as defined by our application) we also have to move the files on disk (although this is more due to strange design decisions on the part of the original developers than a requirement of storing things on the filesystem).
It's hard to write tests for file system actions; I have a mock filesystem class that logs actions like move, delete etc, without performing them, which more or less does the job, but I don't have 100% confidence in the tests.
I will be adding some other jobs which need to access the files from other services to perform additional tasks (e.g. indexing in Solr, generating thumbnails, movie format conversion), so I need to get at the files remotely. Doing this over network shares seems dodgy...
Dealing with permissions on the filesystem has sometimes given us problems in the past, although now that we've moved to a pure Linux environment this should be less of an issue.
So, my main questions are:
What are the downsides of storing files as BLOBs in MySQL?
Do the same problems exist with NoSQL systems like Cassandra?
Does anyone have any other suggestions that might be appropriate, e.g. MogileFS, etc?
Not a direct answer, but some pointers to very interesting and somewhat similar questions (yeah, they are about blobs and images, but this is IMO comparable).
What are the downsides of storing files as BLOBs in MySQL?
Storing Images in DB - Yea or Nay?
Images in database vs file system
https://stackoverflow.com/search?q=images+database+filesystem
Do the same problems exist with NoSQL systems like Cassandra?
NoSQL for filesystem storage organization and replication?
Storing images in NoSQL stores
PS: I don't want to be a killjoy, but I don't think any NoSQL solution is going to solve your problem (NoSQL is just irrelevant for most businesses).
Maybe a hybrid solution:
Use a database to store metadata about each file, and use the file system to actually store the file.
Any restructuring of 'folders' could be modelled in the DB and dereferenced from the actual OS location.
You can store files up to 2GB easily in Cassandra by splitting them into 1MB columns or so. This is pretty common.
You could store it as one big column too, but then you'd have to read the whole thing into memory when accessing it.
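A sketch of the chunking idea; the $chunks->put(...) call is a hypothetical stand-in for whatever Cassandra client you use, writing to a table keyed by (file_id, chunk_no):

    <?php
    // Split a file into 1 MB pieces and write each as its own column/row.
    const CHUNK_SIZE = 1024 * 1024;

    function storeChunked($chunks, string $fileId, string $path): void {
        $in = fopen($path, 'rb');
        for ($no = 0; !feof($in); $no++) {
            $piece = fread($in, CHUNK_SIZE);
            if ($piece === '' || $piece === false) break;
            $chunks->put($fileId, $no, $piece);  // hypothetical client call
        }
        fclose($in);
    }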
If the OS or application doesn't need access to the files, then there's no real need to store the files on the file system. If you want to backup the files at the same time you backup the database, then there's less benefit to storing them outside the database. Therefore, it might be a valid solution to store the files in the database.
An additional downside is that processing files in the DB has more overhead than processing files at the file system level. However, as long as the advantages outweigh the downsides, and it seems they might in your case, you could give it a try.
My main concern would be managing disk storage. As your database files get large, managing your entire database gets more complicated. You don't want to move out of the frying pan and into the fire.
I have a PHP based web application which is currently only using one webserver but will shortly be scaling up to another. In most regards this is pretty straightforward, but the application also stores a lot of files on the filesystem. It seems that there are many approaches to sharing the files between the two servers, from the very simple to the reasonably complex.
These are the options that I'm aware of:
Simple network storage
NFS
SMB/CIFS
Clustered filesystems
Lustre
GFS/GFS2
GlusterFS
Hadoop DFS
MogileFS
What I want is for a file uploaded via one webserver to be immediately available if accessed through the other. The data is extremely important and absolutely cannot be lost, so whatever is implemented needs to a) never lose data and b) have very high availability (as good as or better than a local filesystem).
It seems like the clustered filesystems will also provide faster data access than local storage (for large files), but that isn't of vital importance at the moment.
What would you recommend? Do you have any suggestions to add or anything specifically to look out for with the above options? Any suggestions on how to manage backup of data on the clustered filesystems?
You can look at Mirror File System, which replicates files between servers in real time.
It's very easy to install and set up. One mount command does it, and you can have an HA, load balancing, and backup solution in less than 10 minutes.
http://www.TwinPeakSoft.com/
It looks like the clustered filesystems are the best bet. Backup can be done as for any other filesystem, although with most of them having built in redundancy, they are already more reliable than a standard filesystem.
I am struggling to decide if I should be using the MySQL blob field type in an upcoming project I have.
My basic requirements are: certain database records can be viewed and can have multiple files uploaded and "attached" to them. Viewing said records can be limited to certain people on a case-by-case basis. Any type of file can be uploaded with virtually no restriction.
So looking at it one way, if I go the MySQL route, I don't have to worry about viruses creeping in or random PHP files getting uploaded and somehow executed. I also have a much easier path for permissioning and keeping data tied closely to a record.
The other obvious route is storing the data in a specific folder structure outside of the webroot. In this case I'd have to come up with a special naming convention for folders/files to keep track of what they reference inside the database.
Is there a performance hit with using the MySQL blob field type? I'm concerned about choosing a solution that will hinder future growth of the website, as well as choosing a solution that won't be easy to maintain.
Is there a performance hit with using the MySQL blob field type?
Not inherently, but if you have big BLOBs clogging up your tables and memory cache that will certainly result in a performance hit.
The other obvious route is storing the data in a specific folder structure outside of the webroot. In this case I'd have to come up with a special naming convention for folders/files to keep track of what they reference inside the database.
Yes, this is a common approach. You'd usually do something like have folders named after each table they're associated with, containing filenames based only on the primary key (ideally an integer; certainly never anything user-submitted).
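A minimal sketch of that convention (the base directory and table name are placeholders; $newId is assumed to be the primary key of the metadata row you just inserted):

    <?php
    // Derive the on-disk path purely from table name and primary key;
    // nothing user-submitted ever reaches the filesystem.
    function uploadPath(string $table, int $id): string {
        return '/var/uploads/' . $table . '/' . $id;
    }

    move_uploaded_file($_FILES['attachment']['tmp_name'],
                       uploadPath('attachments', $newId));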
Is this a better idea? It depends. There are deployment-simplicity advantages to having only a single data store, and to not having to worry about giving the web user write access to anything. Also, if there might be multiple copies of the app running (e.g. active-active load balancing), then you need to synchronise the storage, which is much easier with a database than with a filesystem.
If you do use the filesystem rather than a blob, the question is then, do you get the web server to serve it by pointing an Alias at the folder?
+ is super fast
+ caches well
- extra server config: virtual directory; needs appropriate file extension to return desired Content-Type
- extra server config: need to add Content-Disposition: attachment/X-Content-Type-Options headers to stop IE sniffing for HTML as part of anti-XSS measures
or do you serve the file manually by having a server-side script spit it out, as you would have to when serving from a MySQL blob? (Both options are sketched after the following list.)
- is potentially slow
- needs a fair bit of manual If-Modified-Since and ETag handling to cache properly
+ can use application's own access control methods
+ easy to add correct Content-Type and Content-Disposition headers from the serving script
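Rough sketches of both options follow (paths, table, and column names are all placeholders). First, the Alias approach in Apache, with the hardening headers from above (requires mod_headers):

    Alias /uploads /var/uploads/attachments
    <Directory /var/uploads/attachments>
        Require all granted
        Header set X-Content-Type-Options nosniff
        Header set Content-Disposition attachment
    </Directory>

And the manual serving script, which trades speed for access control:

    <?php
    // Look up the file's metadata, enforce the app's own permissions,
    // then stream the bytes from disk.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $id = (int) ($_GET['id'] ?? 0);
    // ... application-specific permission check on $id goes here ...
    $stmt = $pdo->prepare('SELECT file_name, mime_type FROM attachments WHERE id = ?');
    $stmt->execute([$id]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    header('Content-Type: ' . $row['mime_type']);
    header('Content-Disposition: attachment; filename="' . basename($row['file_name']) . '"');
    header('X-Content-Type-Options: nosniff');
    readfile('/var/uploads/attachments/' . $id);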
This is a trade-off for which there's no one globally-accepted answer.
If your web server will be serving these uploaded files over the web, the performance will almost certainly be better if they are stored on the filesystem. The web server will then be able to apply HTTP caching hints such as Last-Modified and ETag which will help performance for users accessing the same file multiple times. Additionally, the web server will automatically set the correct Content-Type for the file when serving. If you store blobs in the database, you'll end up implementing the above mentioned features and more when you should be getting them for free from your web server.
Additionally, pulling large blob data out of your database may end up being a performance bottleneck on your database. Also, your database backups will probably be slower because they'll be backing up more data. If you're doing ad-hoc queries during development, it'll be inconvenient seeing large blobs in result sets for select statements. And if you want to simply inspect an uploaded file, doing so is inconvenient and roundabout because it's stored away in a database column.
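To make that concrete, this is roughly the If-Modified-Since/ETag dance you would have to hand-roll when serving from a blob (a sketch; the identifiers are placeholders, and with blobs you must track freshness yourself, e.g. via an updated_at column):

    <?php
    $updatedAt = strtotime('2012-06-01 12:00:00');  // would come from the row
    $etag = '"' . md5('file-1234-' . $updatedAt) . '"';

    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $updatedAt) . ' GMT');
    header('ETag: ' . $etag);

    // If the client already has this exact version, skip the body entirely.
    if (($_SERVER['HTTP_IF_NONE_MATCH'] ?? '') === $etag) {
        http_response_code(304);
        exit;
    }
    // ... otherwise fetch the blob and echo it with the right Content-Type ...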
I would stick with the common practice of storing the files on the filesystem and the path to the file in the database.
In my experience, storing a BLOB in MySQL is OK, as long as you store only the blob in one table, with the other fields in another (joined) table. By contrast, searching the fields of a table that mixes a few standard fields with one blob field holding 100 MB of data can slow queries dramatically.
I had to change the data layer of a mailing app because of this issue: emails were stored with their content in the same table as the date sent, email addresses, etc. It was taking 9 seconds to search 10,000 emails. Now it takes what it should take ;-)
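A sketch of that split, with hypothetical names:

    -- Narrow table: everything you search on.
    CREATE TABLE emails (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        sent_at  DATETIME NOT NULL,
        address  VARCHAR(255) NOT NULL
    ) ENGINE=InnoDB;

    -- Wide table: the blob lives alone, joined in only when needed.
    CREATE TABLE email_bodies (
        email_id INT UNSIGNED NOT NULL PRIMARY KEY,
        body     LONGBLOB NOT NULL,
        FOREIGN KEY (email_id) REFERENCES emails (id)
    ) ENGINE=InnoDB;

    -- Searches never touch the blob:
    SELECT id, address FROM emails WHERE sent_at > '2012-01-01';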
Data should be stored in one consistent place: the database.
This performance and Content-Type concern is not an issue at all, because nothing stops you from caching those BLOB fields to the local web server and serving them from there the first time they are requested. You do not need to access that table on every page view.
This file system cache can be emptied out at any moment, which will only impact performance temporarily as it is being refilled automagically. It will also enable you to use one database and many web servers as your application grows, they will simply all have a local cache on the file system.
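A sketch of that local cache, with hypothetical table and paths:

    <?php
    // Serve a blob via a disk cache: hit MySQL only on the first request.
    function cachedBlobPath(PDO $pdo, int $id): string {
        $path = '/var/cache/blobs/' . $id;
        if (!file_exists($path)) {
            $stmt = $pdo->prepare('SELECT data FROM files WHERE id = ?');
            $stmt->execute([$id]);
            file_put_contents($path, $stmt->fetchColumn());
        }
        return $path;  // the cache dir can be wiped any time; it refills itself
    }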
Many people recommend against storing file attachments (usually this applies to images) in blobs in the database. Instead they prefer to store a pathname as a string in the database, and store the file somewhere safe on the filesystem. There are some merits to this:
Database and database backups are smaller.
It's easier to edit files on the filesystem if you need to work with them ad hoc.
Filesystems are good at storing files. Databases are good at storing tuples. Let each one do what it's good at.
There are counter-arguments too, that support putting attachments in a blob:
Deleting a row in a database automatically deletes the associated attachment.
Rollback and transaction isolation work as expected when data is in a row, but not when some part of the data is on the filesystem.
Backups are simpler if all data is in the database. No need to worry about making consistent backups of data that's changing concurrently during the backup procedure.
So the best solution depends on how you're going to be using the data in your application. There's no one-size-fits-all answer.
I know you tagged your question with MySQL, but if folks reading this question use other brands of RDBMS, they might want to look into BFILE when using Oracle, or FILESTREAM when using Microsoft SQL Server 2008. These give you the ability store files outside the database but access them like they're part of a row in a database table (more or less).
Large volumes of data will eventually take their toll on performance. MS SQL 2008 has a specialized way of storing binary data in the file system:
http://msdn.microsoft.com/en-us/library/cc949109.aspx
I would employ a similar approach for your project too.
You can create a FILES table that keeps information about files, such as their original names. To safely store the files on disk, rename them using, for example, GUIDs. Store the new file names in your FILES table, and when the user needs to download a file, you can easily locate it on disk and stream it to the user.
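A minimal sketch of that (table and paths are placeholders; bin2hex(random_bytes(16)) stands in for a GUID generator):

    <?php
    // Rename the upload to an opaque identifier before it touches disk.
    $original = $_FILES['upload']['name'];
    $guid = bin2hex(random_bytes(16));
    move_uploaded_file($_FILES['upload']['tmp_name'], '/var/uploads/' . $guid);

    // Remember the mapping so the download side can find it again.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO files (guid, original_name) VALUES (?, ?)');
    $stmt->execute([$guid, $original]);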
In my opinion, storing files in the database is a bad idea. What you can store there is the id, name, type, possibly an MD5 hash of the file, and the date inserted. Files can be uploaded to a folder outside the public location. Also, be aware that it is not advisable to keep more than 1000 files in one folder, so you will have to create a new folder each time the file id increases by 1000.
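That bucketing scheme is just a bit of integer division (base directory is a placeholder):

    <?php
    // Files 0-999 go in folder 0, 1000-1999 in folder 1, and so on.
    function bucketPath(int $fileId): string {
        $dir = '/var/uploads/' . intdiv($fileId, 1000);
        if (!is_dir($dir)) {
            mkdir($dir, 0750, true);
        }
        return $dir . '/' . $fileId;
    }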