Storing and searching files in MySQL

I am trying to store at least 500,000+ small "files" in a database (3 KB average size, up to about 8~10 MB occasionally). This is to remove them from the file system and to speed up searches/user operations. Each record consists of:
Meta data (essentially filename, datetime-created, datetime-modified)
A LONGBLOB of the file contents
Storing them in the database (MySQL) has been fine. The database stores that number of files and searching the meta data (string, datetime, datetime) is also quick with the relevant indexes.
Unfortunately, but unsurprisingly, any attempt to search within the LONGBLOBs is really slow. Here is how the data within the LONGBLOBs looks:
80% are "text files" (e.g. XML) and under 100 KB
15% are "text files" but are over 100 KB (up to 8~10 MB)
5% or less are binary files (which might get corrupted in a "text" container).
Would classifying this data as either text or unknown and then placing it in a separate LONGTEXT table provide performance improvements when doing operations like LIKE "%X%" (as opposed to LONGBLOB)?
Are there any other techniques I can use to improve performance when searching through the BLOBs (in a very "grep"-like style)? The searches are typically for short sequences of data held within the BLOB, and few searches are repeated exactly (though searches are somewhat predictable; some data is more interesting than other data).

Well, you had better build a full-text index (which will be huge on this amount of data) and use MATCH ... AGAINST queries in order to search efficiently. LIKE is painfully slow on large amounts of text; this is well known and should be avoided.
http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html
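To make that concrete, here is a minimal sketch of what a full-text search could look like from Python, assuming the text contents have been copied into a LONGTEXT column named body on a table named files_text (both names are hypothetical, as are the connection details). Note that MySQL 5.5 only supports FULLTEXT indexes on MyISAM tables; InnoDB gained them in 5.6.

import mysql.connector  # assumes the mysql-connector-python package is installed

conn = mysql.connector.connect(user="app", password="secret",
                               host="localhost", database="filestore")
cur = conn.cursor()

# One-time setup: build a FULLTEXT index on the text contents
# (hypothetical table/column names).
cur.execute("ALTER TABLE files_text ADD FULLTEXT INDEX ft_body (body)")

# Full-text search instead of LIKE '%...%'.
cur.execute(
    "SELECT id, filename FROM files_text "
    "WHERE MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE)",
    ("interesting sequence",),
)
for file_id, filename in cur.fetchall():
    print(file_id, filename)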
You could also keep the files on the file system and build command-line tools, called from your server-side language, that do the "grep-style" searching and return the list of file paths that match your "query", but I'm not sure whether this will be efficient.
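For example, that approach could be a thin wrapper around grep itself; a rough sketch in Python (the root directory and the choice of grep flags are just one possible setup):

import subprocess

def grep_files(pattern, root="/var/data/files"):
    """Return the paths of files under root whose contents contain pattern."""
    # grep -r: recurse, -l: list matching file names only,
    # -F: treat the pattern as a fixed string, not a regex.
    result = subprocess.run(
        ["grep", "-r", "-l", "-F", pattern, root],
        capture_output=True, text=True,
    )
    # grep exits with status 1 when nothing matches; that is not an error here.
    return result.stdout.splitlines()

# Example:
# matches = grep_files("some short sequence")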

Related

cost of keys in JSON document database (mongodb, elasticsearch)

I would like to know if anyone has experience with the speed or optimization effects of JSON key size in a document-store database like mongodb or elasticsearch.
So, for example, I have two documents:
doc1: { keeeeeey1: 'abc', keeeeeeey2: 'xyz' }
doc2: { k1: 'abc', k2: 'xyz' }
Let's say I have 10 million records; storing the data in the doc1 format would mean a larger db file size than storing it in the doc2 format.
Other than that, what are the disadvantages or negative effects in terms of speed, RAM, or any other optimization?
You correctly noticed that the documents will have different sizes, so you will save at least 15 bytes per document (60% for documents like these) if you decide to adopt the second schema. That works out to something like 140 MB for your 10 million records, which gives you the following advantages:
HDD savings. The only problem is that, looking at current HDD prices, this saving is mostly negligible.
RAM savings. In contrast with hard disks, this can be useful for indexing. In mongodb, the working set of indexes should fit in RAM to achieve good performance. So if you have indexes on these two fields, you will not only save 140 MB of HDD space but also 140 MB of potential RAM space (which is actually noticeable).
I/O. A lot of bottlenecks happen because of the limits of the input/output system (the speed of reading from and writing to disk is limited). For your documents, this means that with schema 2 you can potentially read/write twice as many documents per second.
Network. In a lot of situations the network is even slower than I/O, and if your DB server is on a different machine than your application server, the data has to be sent over the wire. Here, too, you will be able to send twice as much data.
Having covered the advantages, I have to mention the disadvantages of small keys:
Readability of the database. When you do db.coll.findOne() and see {_id: 1, t: 13423, a: 3, b: 0.2}, it is pretty hard to understand what exactly is stored there.
Readability of the application. This is similar to the database, but at least here you can have a solution: with mapping logic that transforms currentDate to c and price to p, you can write clean code and still have a short schema (see the sketch below).
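For illustration, here is a minimal sketch of such a mapping layer in Python; the field names currentDate and price come from the example above, while the function names and the FIELD_MAP dictionary are made up:

# Map long, readable field names to short keys before writing to the DB,
# and back again after reading.
FIELD_MAP = {"currentDate": "c", "price": "p"}
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}

def shorten(doc):
    # Translate readable keys to their short form; unknown keys pass through.
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def expand(doc):
    # Translate short keys back to readable ones for application code.
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

# Example usage with a pymongo-style collection:
# collection.insert_one(shorten({"currentDate": "2013-05-01", "price": 9.99}))
# doc = expand(collection.find_one({"p": 9.99}))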

Innodb + 8K length row limit

I have an InnoDB table with, amongst other fields, a blob field that can contain up to ~15 KB of data. Reading here and there on the web, I found that my blob field can cause (when the overall row exceeds ~8000 bytes) some records to be split into two parts: on one side the record itself with all fields plus the leading 768 bytes of my blob, and on the other side the remainder of the blob stored in multiple (1-2?) chunks arranged as a linked list of pages.
So my question: in such cases, what is more efficient with regard to the way data is cached by MySQL? 1) Let the engine deal with the split of my data, or 2) handle the split myself in a second table, storing 1 or 2 records depending on the length of my blob data, and have those records cached by MySQL? (I plan to allocate as much memory as I can for this to happen.)
Unless your system's performance is so terrible that you have to take immediate action, you are better off using the internal mechanisms for record splitting.
The people who work on MySQL and its forks (e.g. MariaDB) spend a lot of time implementing and testing optimizations. You will be much happier with simple application code; spend your development and test time on your application's distinctive logic rather than trying to work around internals issues.

To BLOB or not to BLOB

I am in the process of writing a web app backed by a MySQL database, where one of the tables has the potential to get very large (on the order of gigabytes), with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests so far it has reached a size of 289 bytes, but to be on the safe side I want to design for a maximum size of 1 KB. Currently I am storing that column as a MySQL MEDIUMBLOB field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs vs. other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (768, if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also picked up the notion that if a row has more than one BLOB column (which my table does not), then the "elsewhere" is a different location for each BLOB. Apart from that, I have the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this: in light of my BLOB size and the large potential size of the table, should I bother with a BLOB at all? Also, if I use some form of in-row storage instead, will that not have an adverse effect on the maximum number of rows that the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.
I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (maximum size of a table) is 64 TB. Using variable-size types to move that storage into a separate file would certainly change the upper limit on the number of rows, but 64 TB is quite a lot of space, so the effect might be very small.
Having a column with a 1KByte string type that is stored inside the table seems like a viable solution since it's also very small compared to 64TBytes. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64 TB limit might be pushed down by the maximum file size of the OS you're using. You can always link several files together to get more space for your table, but then it starts to get a bit messier.
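To make the in-row option concrete, here is a sketch of the two table definitions being compared, written as Python string constants so they can be fed to any MySQL client library (all table and column names are made up):

# Hypothetical DDL for the two options discussed above; only the column type
# differs. VARBINARY(1024) typically keeps the value inside the row (as long
# as the whole row stays under InnoDB's ~8 KB per-row limit), while MEDIUMBLOB
# data beyond the in-row prefix is moved to overflow pages.
CREATE_INROW = """
CREATE TABLE big_table_inrow (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    payload VARBINARY(1024) NOT NULL
) ENGINE=InnoDB
"""

CREATE_BLOB = """
CREATE TABLE big_table_blob (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    payload MEDIUMBLOB NOT NULL
) ENGINE=InnoDB
"""

# cursor.execute(CREATE_INROW)  # run via any MySQL client library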
If the BLOB data is more than 250 KB, it is not worth it. In your case I wouldn't bother with BLOBs. Read this

MySQL LONGTEXT pagination

I have a table posts which contains a LONGTEXT column. My issue is that I want to retrieve parts of a specific post (basically paging).
I use the following query:
SELECT SUBSTRING(post_content,1000,1000) FROM posts WHERE id=x
This mostly works, but the problem is the position and the length: most of the time, the first word and the last word are not complete, which makes sense.
How can I retrieve complete words from position x for length y?
Presumably you're doing this for the purpose of saving on network traffic overhead between the MySQL server and the machine on which your application is running. As it happens, you're not saving any other sort of workload on the MySQL server. It has to fetch the LONGTEXT item from disk, then run it through SUBSTRING.
Presumably you've already decided based on solid performance analysis that you must save this network traffic. You might want to revisit this analysis now that you know it doesn't save much MySQL server workload. Your savings will be marginal, unless you have zillions of very long LONGTEXT items and lots of traffic to retrieve and display parts of them.
In other words, this is an optimization task. YAGNI? http://en.wikipedia.org/wiki/YAGNI
If you do need it, you are going to have to create software to process the LONGTEXT item word by word. Your best bet is to do this in your client software. Start by retrieving the first page plus a kilobyte or two of the article. Then parse the text looking for complete words. After you find the last complete word on the first page and its following whitespace, that character position is the starting place for the next page.
This kind of task is a huge pain in the neck in a MySQL stored procedure. Plus, when you do it in a stored procedure you're going to use processing cycles on a shared and hard-to-scale-up resource (the MySQL server machine) rather than on a cloneable client machine.
I know I didn't give you clean code to just do what you ask. But it's not obviously a good idea to do what you're suggesting.
Edit:
An observation: A gigabyte of server RAM costs roughly USD20. A caching system like memcached does a good job of exploiting USD100 worth of memory efficiently. That's plenty for the use case you have described.
Another observation: many companies who serve large-scale documents use file systems rather than DBMSs to store them. File systems can be shared or replicated very easily among content servers, and files can be random-accessed trivially without any overhead.
It's a bit innovative to store whole books in single BLOBs or CLOBs. If you can break up the books by some kind of segment -- page? chapter? thousand-word chunk? -- and create separate data rows for each segment, your DBMS will scale up MUCH MUCH better than what you have described.
If you're going to do it anyway, here's what you do:
always retrieve 100 characters more than you need in each segment. For example, when you need characters 30000 - 35000, retrieve 30000 - 35100.
after you retrieve the segment, look for the first word break in the data (except on the very first segment) and display starting from that word.
similarly, find the very first word break in the 100 extra bytes, and display up to that word break.
So your fetched data might be 30000 - 35100 and your displayed data might be 30013 - 35048, but it would be whole words.
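Here is a rough sketch of that client-side logic in Python, using the posts/post_content names from the question; the fetch_page helper and the DB-API cursor it expects are assumptions:

def fetch_page(cursor, post_id, start, length, slack=100):
    """Fetch roughly [start, start+length) of a post, trimmed to whole words.

    Follows the approach above: over-fetch `slack` extra characters, then cut
    the displayed text at word breaks on both ends.
    """
    # MySQL SUBSTRING positions are 1-based.
    cursor.execute(
        "SELECT SUBSTRING(post_content, %s, %s) FROM posts WHERE id = %s",
        (start, length + slack, post_id),
    )
    row = cursor.fetchone()
    chunk = row[0] if row and row[0] else ""

    # Skip the partial word at the front (except on the very first page).
    begin = 0
    if start > 1:
        first_break = chunk.find(" ")
        if first_break != -1:
            begin = first_break + 1

    # Cut at the first word break inside the extra `slack` characters,
    # so the last word shown is complete.
    end = len(chunk)
    if len(chunk) > length:
        last_break = chunk.find(" ", length)
        if last_break != -1:
            end = last_break

    return chunk[begin:end]

# Example: the second 1000-character page of post 42.
# text = fetch_page(cursor, 42, 1001, 1000)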

MySQL Blob vs. Disk for "video frames"

I have a C++ app that generates six relatively small image-like integer arrays per second. The data is 64x48x2-dimensional (i.e., a grid of 64x48 two-dimensional vectors, with each vector consisting of two values). That works out to ~26 KB per image. The app also generates a timestamp and some features describing the data. I want to store the timestamp and the features in MySQL columns, one row per frame. I also need to store the original array as binary data, either in a file on disk or as a blob field in the database. Assume that the app will be running more or less nonstop and that I'll come up with a way to archive data older than a certain age, so that storage does not become a problem.
What are the tradeoffs here for blobs, files-on-disc, or other methods I may not even be thinking of? I don't need to query against the binary data, but I need to query against the other metadata/features in the table (I'll definitely have an index built against timestamp), and retrieve the binary data. Does the equation change if I store multiple frames in a single file on disk, vs. one frame per file?
Yes, I've read MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems and To Do or Not to Do: Store Images in a Database, but I think my question differs because in this case there are going to be millions of identically-dimensioned binary files. I'm not sure how the performance hit to maintaining that many small files in a filesystem compares to storing that many files in db blob columns. Any perspective would be appreciated.
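For concreteness, the per-frame table described above might look something like the following two variants, again written as Python string constants (every name here is made up, and DATETIME(6) for sub-second timestamps needs MySQL 5.6+):

# Option A: frame bytes live in the row as a blob.
CREATE_FRAMES_BLOB = """
CREATE TABLE frames (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    captured_at DATETIME(6) NOT NULL,
    feature_a DOUBLE NOT NULL,
    feature_b DOUBLE NOT NULL,
    frame_data MEDIUMBLOB NOT NULL,       -- ~26 KB of packed 64x48x2 values
    KEY idx_captured_at (captured_at)
) ENGINE=InnoDB
"""

# Option B: frame bytes live on disk; the row only stores where to find them.
CREATE_FRAMES_FILE = """
CREATE TABLE frames (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    captured_at DATETIME(6) NOT NULL,
    feature_a DOUBLE NOT NULL,
    feature_b DOUBLE NOT NULL,
    file_path VARCHAR(255) NOT NULL,      -- container file holding many frames
    byte_offset BIGINT UNSIGNED NOT NULL, -- where this frame starts in the file
    KEY idx_captured_at (captured_at)
) ENGINE=InnoDB
"""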
At a certain point, querying for many blobs becomes unbearably slow, and I suspect this will be the case even with your identically dimensioned binary files. Moreover, you will still need some code to access and process the blobs. And this approach doesn't take advantage of the file-system caching that might speed up serving the images straight from disk.
But! The link you provided did not mention object-based databases, which can store the data you described in a way that lets you access it extremely quickly, and possibly return it in its native format. For a discussion, see the link below or just search Google; there are many discussions:
Storing images in NoSQL stores
I would also look into HBase.
I figured that since you were not sure what to use in the first place (and there were no answers), an alternative solution might be appropriate.