MongoDB high storageSize compared to MySQL

I've just switched from MySQL to MongoDB, and it's pretty awesome, but I'm struggling with the database's data size.
I get about 700 documents per day, and each has about 900 comments embedded inside.
The average object size is about 53 KB (and that's after only a couple of hours), so with easy maths it should be 53 KB * 700 ≈ 37 MB. But storageSize is already about 250 MB (after only 2 hours!).
So I'll be creating more than 1 GB of data every day; in MySQL it was about 100 MB/day (or even less).
Is this normal? How can I deal with it? Thanks!

The reason you are seeing this is fragmentation of record objects.
Each document within MongoDB is held within an internal record object; think of it as a C++ struct which represents a document.
Record objects are single contiguous pieces of hard disk space, so as to limit the number of hard-disk look-ups and keep them sequential. This has a nasty downside, though: if you are constantly growing your documents, they must constantly be moved to larger and larger record objects, sending the old record objects to the $freelists (an internal list of free spaces) to be reused by another object of that size that comes in later.
This creates fragmentation, and I believe that is what you are seeing with your own data.
One way to solve this normally is to use powerOf2Sizes ( http://docs.mongodb.org/manual/reference/command/collMod/ ); unfortunately, given how your documents keep growing, I do not think this will work well here.
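For what it's worth, turning it on is a one-liner. Here is a rough sketch from pymongo (the database and collection names are placeholders, and this applies to the MMAPv1-era storage engine):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["blog"]  # placeholder database name

    # collMod with usePowerOf2Sizes makes record allocations round up to a power of two,
    # so freed slots are more likely to be reused by later documents of a similar size.
    db.command("collMod", "posts", usePowerOf2Sizes=True)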
Another way to solve this would be to manually set the padding so that the document always fits and never moves; however, you cannot do that yet: https://jira.mongodb.org/browse/SERVER-1810
The best way, currently, to solve this problem is to change your schema and factor the comments out into their own collection.
This does mean two queries, but they should be two indexed, super-fast queries, maybe only a couple of microseconds slower than loading the single document from disk.
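A rough sketch of what that split could look like from pymongo; the collection and field names are just illustrative, and the index on post_id is what keeps the second query fast:

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client["blog"]  # placeholder database name

    # Comments get their own collection and are only ever inserted, so the post
    # document itself never grows and never has to be moved to a bigger record.
    db.comments.create_index([("post_id", ASCENDING), ("created_at", ASCENDING)])

    def add_comment(post_id, author, text, created_at):
        db.comments.insert_one({
            "post_id": post_id,
            "author": author,
            "text": text,
            "created_at": created_at,
        })

    def load_post_with_comments(post_id):
        post = db.posts.find_one({"_id": post_id})        # query 1
        comments = list(
            db.comments.find({"post_id": post_id})        # query 2, covered by the index
                       .sort("created_at", ASCENDING)
        )
        return post, comments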

If you are planning to change the schema, have a look at http://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports for patterns that avoid the data-growth and fragmentation issue.

One thing I haven't seen in any of the current answers is document padding on initial insert. You can avoid the data growth (to some extent) by "padding" the documents with some extra space at the beginning to accommodate the comments that will be added in the future.
http://docs.mongodb.org/manual/faq/developers/#faq-developers-manual-padding
Using the data you already have on hand about your average document size, add a little to that and include that padding on your initial insert. It should improve your update performance as well as avoid the Swiss-cheese effect the other answers are talking about.
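A sketch of the trick (the field name and the 60 KB figure are my own placeholders, based on the ~53 KB average mentioned in the question):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["blog"]  # placeholder database name

    PAD_BYTES = 60 * 1024  # a bit above the ~53 KB average document size from the question

    def insert_padded(doc):
        # Insert with a throw-away filler field so MongoDB allocates a record big
        # enough for the comments that will be embedded later...
        doc["_padding"] = "x" * PAD_BYTES
        db.posts.insert_one(doc)
        # ...then remove the filler. The record keeps its allocated size, so later
        # growth happens in place instead of forcing the document to move.
        db.posts.update_one({"_id": doc["_id"]}, {"$unset": {"_padding": ""}})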
For reference, this is why you are seeing so much extra space:
http://docs.mongodb.org/manual/core/record-padding/

Related

Data pipeline proposal

Our product has been growing steadily over the last few years, and we are now at a turning point as far as data size goes for some of our tables, where we expect their growth to double or triple in the next few months, and even more in the next few years. We are talking in the range of 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) we assume around 10M by the end of the year. (M being million, not mega/1000.)
The table we are talking about is essentially a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into said table. Then it is used in the application for a certain amount of time (a couple of weeks or months), after which it becomes rather redundant. That is, if all goes well. If there is some problem down the road, the data in those rows can be useful to inspect for troubleshooting.
What we would like to do is periodically clean up the table, removing rows based on certain criteria, but instead of actually deleting them, move them 'somewhere else'.
We currently use MySQL as the database, and the 'somewhere else' could be MySQL as well, but it can be anything. For other projects we have a master/slave setup where the whole database is involved, but that's not what we want or need here. It's just a few tables, where the master table would need to become shorter and the slave only bigger; not a one-on-one sync.
The main requirement for the secondary store is that the data should be easy to inspect/query when needed, either with SQL or another DSL, or just visual tooling. So we are not interested in backing the data up to one or more CSV files or another plain-text format, since that is not as easy to inspect. The logs would then sit somewhere on S3 and we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it, and maintain it. Every new piece of technology adds to our stack; the lighter it stays, the more we like it ;).
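To make it a bit more concrete, the kind of periodic job we have in mind is roughly the following (table names, column names, and connection details are made up, and it assumes the archive table lives on the same MySQL server):

    import mysql.connector

    # Connection details are placeholders; in reality they come from our config.
    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="***", database="app_db")

    def archive_old_rows(days=90, batch_size=10000):
        """Copy rows older than `days` into an archive table, then delete them,
        in batches so we never hold long locks on the live table."""
        cur = conn.cursor()
        while True:
            cur.execute(
                "INSERT INTO archive_db.import_log_archive "
                "SELECT * FROM app_db.import_log "
                "WHERE created_at < NOW() - INTERVAL %s DAY "
                "ORDER BY id LIMIT %s",
                (days, batch_size),
            )
            copied = cur.rowcount
            cur.execute(
                "DELETE FROM app_db.import_log "
                "WHERE created_at < NOW() - INTERVAL %s DAY "
                "ORDER BY id LIMIT %s",
                (days, batch_size),
            )
            conn.commit()
            if copied < batch_size:
                break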
Thanks!
PS: we are not just being lazy here; we have done some research, but we thought it would be a good idea to get some more insight into the problem.

The effect of field length on query time

I have a MySQL database in which I keep information about items, including a description.
The thing is that the description column can hold up to 150 characters, which I think is long, and I wondered whether it slows down queries. I also wanted to know whether it's recommended to shrink the size of the integer columns; I mean, if I have a price that is normally not that big, should I limit the column to SMALLINT/MEDIUMINT?
The columns are something like this:
id name category publisher mail price description
Thanks in advance.
Store your character data as varchar() and not as char(), and read up on the MySQL documentation on these data types (here). varchar() only stores the characters actually in the description, plus a few bytes of overhead.
As for whether or not the longer fields imply worse-performing queries. That is a complicated subject. Obviously, at the extreme, having the maximum size records is going to slow things down versus a 10-byte record. The reason has to do with I/O performance. MySQL reads in pages and a page can contain one or more records. The records on the page are then processed.
The more records that fit on the page, the fewer the I/Os.
But then it gets more complicated, depending on the hardware and the storage engine. Disks, nowadays, do read-aheads as do operating systems. So, the next read of a page (if pages are not fragmented and are adjacent to each other) may be much faster than the read of the initial page. In fact, you might have the next page in memory before processing on the first page has completed. At that point, it doesn't really matter how many records are on each page.
And, 200 bytes for a record is not very big. You should worry first about getting your application working and second about getting it to meet performance goals. Along the way, make reasonable choices, such as using varchar() instead of char() and appropriately sized numerics (you might consider fixed point numeric types rather than float for monetary values).
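To make that concrete, a table like the one you describe might be declared something like this; the sizes are my guesses (adjust them to your actual data), and the connection details are placeholders:

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="***", database="shop")  # placeholders
    cur = conn.cursor()

    # VARCHAR stores only the characters actually present (plus a length byte or two),
    # and DECIMAL avoids floating-point rounding surprises for money.
    cur.execute("""
        CREATE TABLE item (
            id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            name        VARCHAR(100) NOT NULL,
            category    VARCHAR(50)  NOT NULL,
            publisher   VARCHAR(100) NOT NULL,
            mail        VARCHAR(254) NOT NULL,
            price       DECIMAL(8,2) NOT NULL,
            description VARCHAR(150)
        )
    """)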
It is only you who considers 150 characters long; the database most likely does not, as databases are designed to handle much more than that at once. Do not consider sacrificing your data for "performance". If the nature of your application requires you to store up to 150 characters of text at once, don't be afraid to do so, but do look up optimization tips.
Using proper data types, though, can help you save space. For instance, if you have a field which is meant to store values 0 to 20, there's no need for an INT field type. A TINYINT will do.
The documentation lists the data types and provides information on how much space they use and how they're managed.
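As a rough illustration of the space involved (the row count here is just an example, and the per-type sizes are the ones listed in the MySQL data type documentation):

    # Storage per value for MySQL integer types, from the data type documentation.
    INT_BYTES = 4
    SMALLINT_BYTES = 2
    TINYINT_BYTES = 1

    rows = 10_000_000  # hypothetical table size

    # A column holding values 0..20 fits comfortably in TINYINT UNSIGNED (0..255).
    print(f"INT:      {rows * INT_BYTES / 1e6:.0f} MB")
    print(f"SMALLINT: {rows * SMALLINT_BYTES / 1e6:.0f} MB")
    print(f"TINYINT:  {rows * TINYINT_BYTES / 1e6:.0f} MB")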

To BLOB or not to BLOB

I am in the process of writing a web app backed by a MySQL database, where one of the tables has the potential to get very large (on the order of gigabytes) with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests so far it has reached a size of 289 bytes, but to be on the safe side I want to design for a maximum size of 1 KB. Currently I am storing that column as a MySQL MEDIUMBLOB field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs versus other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (768 if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also got the notion that if a row has more than one BLOB column (which my table does not), the "elsewhere" is a different location for each BLOB. Apart from that, I have got the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this: in light of my BLOB size and the large potential size of the table, should I bother with a BLOB at all? Also, if I use some form of in-row storage instead, won't that have an adverse effect on the maximum number of rows the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.
I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (the maximum size of a table) is 64 TB. Using variable-size types to move that storage into a separate file would certainly change the upper limit on the number of rows, but 64 TB is quite a lot of space, so the difference might be very small.
Having a column with a 1 KB string type stored inside the table seems like a viable solution, since it's also very small compared to 64 TB. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64 TB limit might be pushed down by the maximum file size of the OS you're using. You can always link several files together to get more space for your table, but then it starts to get a bit messier.
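If you do decide to keep the value in the row, the comparison you are weighing looks roughly like this (table and column names are mine, not from your schema, and the connection details are placeholders):

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="***", database="app_db")  # placeholders
    cur = conn.cursor()

    # Option A: cap the value at 1 KB and keep it in the row. A value this small
    # normally stays inline, so reads that touch the row get it with no extra I/O.
    cur.execute("""
        CREATE TABLE seq_inline (
            id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            seq VARBINARY(1024) NOT NULL
        ) ENGINE=InnoDB
    """)

    # Option B: keep the MEDIUMBLOB. Values beyond the in-row prefix go to overflow
    # pages, which costs extra page reads but keeps the main row small.
    cur.execute("""
        CREATE TABLE seq_blob (
            id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            seq MEDIUMBLOB NOT NULL
        ) ENGINE=InnoDB
    """)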
If the BLOB data is more than 250 KB, it is not worth it. In your case I wouldn't bother with BLOBs at all. Read this

How important is it to select the smallest possible data type when designing a database?

How much of a difference does using TINYINT or SMALLINT (when applicable) instead of just INT make? Or restricting a CHAR field to the minimum number of characters needed?
Do these choices affect performance or just allocated space?
On an indexed field in a significantly large table, the size of your field can have a large effect on performance. On a non-indexed field it's not nearly as important, but the engine still has to write the extra data.
That said, the downtime for resizing a large table can be several minutes or even several hours, so don't make columns smaller than you'd imagine ever needing.
Yes it affects performance too.
If the indexes are larger, it takes longer to read them from disk, and less can be cached in memory.
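A very rough back-of-the-envelope of what that can mean (the row count, buffer pool size, and key choices are made-up numbers, and real InnoDB indexes carry extra page and pointer overhead on top of this):

    rows = 100_000_000            # hypothetical table size
    buffer_pool = 2 * 1024**3     # hypothetical 2 GB InnoDB buffer pool
    pk_bytes = 4                  # assume an INT primary key

    # Each secondary-index entry carries roughly the indexed value plus the primary key.
    for name, key_bytes in [("INT key", 4), ("BIGINT key", 8), ("CHAR(36) key", 36)]:
        index_bytes = rows * (key_bytes + pk_bytes)
        fits = "yes" if index_bytes < buffer_pool else "no"
        print(f"{name:12s} ~{index_bytes / 1024**3:.1f} GB of index, fits in cache: {fits}")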
I've frequently seen these three schema design defects causing problems:
A varchar(n) field was created with n only big enough for the sample of data that the designer had pulled in, not the global population: fine in unit tests, silent truncations in the real world.
A varchar(n) used where the data is fixed size. This masks data bugs.
A char(n) used for variable-length data. This can provide performance improvements (by enabling the data to sit in-line in the row on disk), but all the client code (and various stored procs, views, etc.) needs to cope with whitespace-padding issues (and often it doesn't). Whitespace padding can be difficult to track down, because spaces don't show up too well, and various libraries/SQL clients suppress them.
I've never seen a well-intentioned (i.e. not just using varchar(255) for all columns) but conservative selection of the wrong data size cause a significant performance problem. By significant, I mean a factor of 10. I regularly see algorithmic design flaws (missing indexes, sending too much data over the wire, etc.) causing much bigger performance hits.
Both, in some cases. But imo, it's more of a question of design than performance and storage considerations. The reason you don't make everything varchar(...) is because that doesn't accurately reflect what sort of data should be stored there, and it reduces your data's integrity and type-safety.

CPU bound applications vs. IO bound

For 'number-crunching' style applications that use a lot of data (read: hundreds of MB, but not into the GBs, i.e. it will fit nicely into memory beside the OS), does it make sense to read all your data into memory before starting processing, to avoid potentially making your program I/O-bound while reading large related datasets, and instead load them from RAM?
Does this answer change with different data backings? That is, would the answer be the same irrespective of whether you were using XML files, flat files, a full DBMS, etc.?
Your program is only as fast as whatever its bottleneck is. It makes sense to do things like storing your data in memory if that improves overall performance, but there is no hard and fast rule that says it will. When you fix one bottleneck, something else becomes the bottleneck, so resolving one issue may get you a 1% increase in performance or 1000%, depending on what the next bottleneck is. And the thing you're improving may still be the bottleneck afterwards.
I think about these things as generally fitting into one of three levels:
Eager. When you need something from disk or from a network, or the result of a calculation, you go and get it or compute it on the spot. This is the simplest to program and the easiest to test and debug, but the worst for performance. It is fine so long as this aspect isn't the bottleneck;
Lazy. Once you've done a particular read or calculation, don't do it again for some period of time, which may be anything from a few milliseconds to forever. This can add a lot of complexity to your program, but if the read or calculation is expensive it can reap enormous benefits; and
Over-eager. This is much like a combination of the previous two. Results are cached, but instead of doing the read or calculation only when it is requested, there is a certain amount of preemptive activity to anticipate what you might want. For example, if you read 10K from a file, there is a reasonably high likelihood that you will later want the next 10K block, so rather than delay execution you fetch it just in case it's requested (the sketch after this list shows the lazy and over-eager flavours).
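A toy sketch of those two flavours, using block reads from a file as the expensive operation (the class names and the one-block read-ahead are arbitrary choices of mine):

    BLOCK_SIZE = 10 * 1024  # the 10K blocks from the example above

    class LazyBlockReader:
        """Lazy: a block is read from disk the first time it is asked for,
        then served from an in-memory cache afterwards."""
        def __init__(self, path):
            self.path = path
            self.cache = {}

        def block(self, n):
            if n not in self.cache:
                with open(self.path, "rb") as f:
                    f.seek(n * BLOCK_SIZE)
                    self.cache[n] = f.read(BLOCK_SIZE)
            return self.cache[n]

    class OverEagerBlockReader(LazyBlockReader):
        """Over-eager: on a cache miss, also pull in the next block,
        betting that access is likely to be sequential."""
        def block(self, n):
            if n not in self.cache:
                with open(self.path, "rb") as f:
                    f.seek(n * BLOCK_SIZE)
                    self.cache[n] = f.read(BLOCK_SIZE)
                    self.cache[n + 1] = f.read(BLOCK_SIZE)  # one block of read-ahead
            return self.cache[n]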
The lesson to take from this is the (somewhat over-used and often misquoted) line from Donald Knuth that "premature optimization is the root of all evil." Lazy and over-eager solutions add a huge amount of complexity, so there is no point using them for something that won't yield a useful benefit.
Programmers often make the mistake of creating some highly (allegedly) optimized version of something before determining whether they need to and whether it will be useful.
My own take on this is: don't solve a problem until you have a problem.
I would guess that choosing the right data storage method will have more effect than whether you read from disk all at once or as needed.
Most database tables have regular offsets for fields in each row. For example, a customer record may be 50 bytes long with a pants_size column starting at the 12th byte. Selecting all pants sizes is as easy as reading the values at offsets 12, 62, 112, 162, ad nauseam.
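As a toy illustration of why fixed offsets are cheap, here is roughly what that access pattern looks like in code (the 50-byte layout and the 2-byte pants_size field are invented):

    import struct

    RECORD_SIZE = 50
    PANTS_SIZE_OFFSET = 12  # pants_size stored as a 2-byte integer at byte 12 of each record

    def all_pants_sizes(buf):
        """Walk fixed-size records and pull one field straight out of each,
        no parsing needed: offsets 12, 62, 112, and so on."""
        sizes = []
        for base in range(0, len(buf) - RECORD_SIZE + 1, RECORD_SIZE):
            (size,) = struct.unpack_from("<H", buf, base + PANTS_SIZE_OFFSET)
            sizes.append(size)
        return sizes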
XML, however, is a lousy format for fast data access. You'll need to slog through a bunch of variable-length tags and attributes to get at your data, and you won't be able to jump instantly from one record to the next, unless you parse the file into a data structure like the one mentioned above, in which case you'd have something very much like an RDBMS, so there you go.