Is column-based storage on disk still faster even when reading all rows?

So I am deciding whether to save some information in a row-based format (CSV, in this case) or a column-based format (Parquet, in this case). From what I've seen online, CSV is row-based, meaning each row is stored in full, sequentially on disk. Parquet is column-based, meaning all values for a single column are grouped together sequentially on disk.
From what I've seen online, Parquet is far more performant, both in read speed and in space taken on disk. A column-based format makes sense when I want to filter my data by the values of a particular group of columns, but what if I want to read all rows?
If I have 2 GB of data in total and I want to read all rows, I'll have to read through all the data anyway. The only advantage I see for Parquet in this case is that, because the values for a single column are grouped together, you can apply compression like Snappy/Gzip/etc. to reduce the data size, and thus lower storage costs and speed up reads from disk. But you also have to factor in the decompression time. So is Parquet faster even when reading all rows?
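One way to get a concrete answer for your own data is to benchmark a full-table read of the same rows in both formats. Below is a minimal sketch, assuming pandas with pyarrow installed; the synthetic columns and row count are placeholders standing in for your 2 GB, not something from the question.

```python
# Rough full-scan benchmark: write the same DataFrame as CSV and as Snappy-compressed
# Parquet, then time reading *all* rows and *all* columns back from each.
import time
import numpy as np
import pandas as pd

n = 5_000_000  # scale this up or down to approximate your real data volume
df = pd.DataFrame({
    "id": np.arange(n),
    "value": np.random.rand(n),
    "category": np.random.choice(["a", "b", "c"], size=n),
})

df.to_csv("data.csv", index=False)
df.to_parquet("data.parquet", compression="snappy")

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

timed("read_csv    ", lambda: pd.read_csv("data.csv"))
timed("read_parquet", lambda: pd.read_parquet("data.parquet"))
```

Compression is only part of the story: Parquet also stores typed, binary-encoded values, so a full scan skips the text parsing and type inference a CSV reader has to do for every cell.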

Related

Database or storage method for json data

I'm working to store json data. The data will be:
1. For each user, let's estimate 5,000+
2. Large, up to a few megabytes
3. Updated frequently, up to 1,000 times a day
There are 2 special cases here:
A. The data will usually be mostly the same: 98% of the time, changes to the JSON data will only be 1-2% different from the existing data.
B. The json schema is not set. Occasionally keys will change names, may have different data, or will cease to exist. This accounts for some of the "large" changes not included in (A).
My main concern is point (3) above. My experience is with MySQL. Is there any storage option that will allow me to do partial updates to this large data whose schema is not fixed? That would allow the user to send smaller amounts of data, and hopefully make the database updates more efficient.
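To illustrate what a partial update could look like at the application layer, independent of which datastore you pick, here is a minimal sketch using JSON Patch via the jsonpatch package; the library choice and the sample document are my assumptions, not something from the question.

```python
# Sketch of sending only the 1-2% that changed, using RFC 6902 JSON Patch.
# Assumes the jsonpatch package is installed (pip install jsonpatch).
import jsonpatch

stored = {"name": "example", "settings": {"theme": "dark", "lang": "en"}, "items": list(range(100))}
updated = {"name": "example", "settings": {"theme": "light", "lang": "en"}, "items": list(range(100))}

# The client computes a small diff instead of re-sending the whole document...
patch = jsonpatch.make_patch(stored, updated)
print(patch)  # e.g. [{"op": "replace", "path": "/settings/theme", "value": "light"}]

# ...and the server applies it to its stored copy.
stored = patch.apply(stored)
assert stored == updated
```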

Innodb + 8K length row limit

I have an InnoDB table with, amongst other fields, a BLOB field that can contain up to ~15 KB of data. Reading here and there on the web, I found that when the combined fields exceed ~8,000 bytes, some records get split into two parts: on one side, the record itself with all fields plus the leading 768 bytes of the BLOB; on the other side, the remainder of the BLOB, stored in one or more chunks kept as a linked list of overflow pages.
So my question: in such cases, what is more efficient with regard to how MySQL caches the data? 1) Let the engine deal with splitting my data, or 2) handle the split myself in a second table, storing one or two records depending on the length of my BLOB data, and have those records cached by MySQL? (I plan to allocate as much memory as I can for this.)
Unless your system's performance is so terrible that you have to take immediate action, you are better off using the internal mechanisms for record splitting.
The people who work on MySQL and its forks (e.g. MariaDB) spend a lot of time implementing and testing optimizations. You will be much happier with simple application code; spend your development and test time on your application's distinctive logic rather than trying to work around internals issues.
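For context on what option 2 would actually involve, here is a minimal sketch of the application-side bookkeeping a manual split requires; the table names and the 8 KB chunk size are arbitrary choices of mine. This is precisely the kind of glue code the advice above suggests leaving to InnoDB's own off-page storage.

```python
# Illustration only: the bookkeeping a manual split (option 2) would require.
CHUNK_SIZE = 8 * 1024  # arbitrary split point, roughly matching the ~8 KB row limit

MAIN_TABLE_DDL = """
CREATE TABLE records (
    id INT PRIMARY KEY,
    -- other fields ...
    blob_head VARBINARY(8192)      -- first chunk kept with the row
) ENGINE=InnoDB;
"""

OVERFLOW_TABLE_DDL = """
CREATE TABLE record_overflow (
    record_id INT,
    seq INT,
    chunk BLOB,
    PRIMARY KEY (record_id, seq)
) ENGINE=InnoDB;
"""

def split_blob(data: bytes):
    """Split a blob into a head chunk plus zero or more overflow chunks."""
    head, rest = data[:CHUNK_SIZE], data[CHUNK_SIZE:]
    overflow = [rest[i:i + CHUNK_SIZE] for i in range(0, len(rest), CHUNK_SIZE)]
    return head, overflow

def reassemble(head: bytes, overflow_chunks) -> bytes:
    """Reassemble the original blob from its stored pieces."""
    return head + b"".join(overflow_chunks)

# Every read and write now needs two statements plus this glue code --
# complexity the engine's own external storage already handles for you.
```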

Storing and searching files in MySQL

I am trying to store at least 500,000+ small "files" in a database (3 KB average size, occasionally up to about 8-10 MB). This is to get them out of the file system and to speed up searches/user operations. Each entry consists of:
Meta data (essentially filename, datetime-created, datetime-modified)
A LONGBLOB of the file contents
Storing them in the database (MySQL) has been fine. The database stores that number of files and searching the meta data (string, datetime, datetime) is also quick with the relevant indexes.
Unfortunately but unsurprisingly, any attempt to search within the LONGBLOBs is really slow. Here is how the data within the LONGBLOBs looks:
80% are "text files" (e.g. XML) and under 100 KB
15% are "text files" but are over 100 KB (up to 8~10 MB)
5% or less are binary files (which might get corrupted in a "text" container).
Would classifying this data as either text or unknown and then placing it in a separate LONGTEXT table provide performance improvements when doing operations like LIKE "%X%" (as opposed to LONGBLOB)?
Are there any other techniques I can use to improve performance when searching through BLOBs (in a very "grep" style)? The searches are typically for short sequences of data held within the BLOB, and few searches are likely to be repeated (though searches are somewhat predictable: some data is more interesting than other data).
Well, you had better build a full-text index (which will be HUGE on that amount of data) and use MATCH ... AGAINST queries in order to search efficiently. LIKE is painfully slow on huge amounts of text; this is well known and should be avoided.
http://dev.mysql.com/doc/refman/5.5/en//fulltext-search.html
You could also keep them in the FS and build yourself command-line tools, called from within your server-side language, that do the "grep style" searching and return the list of file paths matching your "query", but I'm not sure whether this would be efficient.
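To make the full-text suggestion concrete, here is a minimal sketch; the table and column names are placeholders, and it assumes the text content has been copied into a column on a table whose storage engine supports FULLTEXT indexes (MyISAM on MySQL 5.5, InnoDB from 5.6 onward).

```python
# Sketch of the full-text approach; files_text / contents are placeholder names.
CREATE_INDEX_SQL = """
ALTER TABLE files_text
    ADD FULLTEXT INDEX ft_contents (contents);
"""

SEARCH_SQL = """
SELECT id, filename
FROM files_text
WHERE MATCH (contents) AGAINST (%s IN BOOLEAN MODE);
"""

def search(conn, needle: str):
    """Run a full-text search via any %s-style DB-API connection (e.g. mysql.connector)."""
    cur = conn.cursor()
    cur.execute(SEARCH_SQL, (needle,))
    return cur.fetchall()
```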

Efficient and scalable storage for JSON data with NoSQL databases

We are working on a project which should collect journal and audit data and store it in a datastore for archive purposes and some views. We are not quite sure which datastore would work for us.
we need to store small JSON documents, about 150 bytes, e.g. "audit: {timestamp: '86346512', host: 'foo', username: 'bar', task: 'foo', result: 0}" or "journal: {timestamp: '86346512', host: 'foo', terminalid: 1, type: 'bar', rc: 0}"
we are expecting about one million entries per day, about 150 MB data
data will be stored and read but never modified
data should be stored in an efficient way, e.g. the binary format used by Apache Avro
after a retention time data may be deleted
custom queries, such as 'get audit for user and time period' or 'get journal for terminalid and time period'
replicated database for failover
scalable
Currently we are evaluating NoSQL databases like Hadoop/Hbase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best?
Are there better options?
One million inserts a day is about 10 inserts a second. Most databases can deal with this, and it's well below the maximum insertion rate we get from Cassandra on reasonable hardware (50k inserts/sec).
Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely - when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
"data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
custom queries, such as 'get audit for user and time period' - in Cassandra, you would model this by making the row key the user id and the column key the time of the event (most likely a TimeUUID). You would then use a get_slice call (or, even better, CQL) to satisfy this query.
or 'get journal for terminalid and time period' - as above, have the row key be terminalid and column key be timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries.
Cassandra has a very sophisticated replication model, where you can specify different consistency levels per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!)
Having said all of this, your requirements could easily be satisfied by a more traditional database with simple master-slave replication; nothing here is too onerous.
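In current CQL terms (rather than the Thrift-era get_slice API mentioned above), the audit model sketched here might look roughly like this; the keyspace/table names, the TTL value, and the use of the DataStax Python driver are my assumptions.

```python
# Rough CQL sketch: partition key = user, clustering key = event time, TTL for retention.
# Assumes the DataStax driver (pip install cassandra-driver) and a node on localhost.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS auditlog
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS auditlog.audit (
        username text,
        ts timeuuid,
        host text,
        task text,
        result int,
        PRIMARY KEY (username, ts)
    )
""")

# Insert with a TTL so the retention requirement is handled by Cassandra itself.
session.execute(
    "INSERT INTO auditlog.audit (username, ts, host, task, result) "
    "VALUES (%s, now(), %s, %s, %s) USING TTL 2592000",   # keep for 30 days
    ("bar", "foo", "foo", 0),
)

# 'get audit for user and time period' becomes a slice over the clustering column.
rows = session.execute(
    "SELECT * FROM auditlog.audit WHERE username = %s "
    "AND ts > maxTimeuuid(%s) AND ts < minTimeuuid(%s)",
    ("bar", "2023-01-01", "2023-02-01"),
)
```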
Avro supports schema evolution and is a good fit for this kind of problem.
If your system does not require low latency data loads, consider receiving the data to files in a reliable file system rather than loading directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic won't ever impact the data collection system.
If you will only have a handful of queries to run, you could leave the files in their native format and write custom MapReduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the native data files. Hive will let you run arbitrary, friendly SQL-like queries over your raw data files. Or, since you only have 150 MB/day, you could just batch-load it into MySQL read-only compressed tables.
If for some reason you need the complexity of an interactive system, HBase or Cassandra might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150 MB/day is so little data that you probably don't need the complexity.
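If you do go the file-based route, the Avro encoding the question mentions is straightforward to set up. Here is a minimal sketch using the fastavro package; the library choice and file name are my assumptions, and the schema fields are taken from the question's example record.

```python
# Sketch: append-only Avro file of audit records in a compact binary encoding.
# Assumes fastavro is installed (pip install fastavro).
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Audit",
    "type": "record",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "host", "type": "string"},
        {"name": "username", "type": "string"},
        {"name": "task", "type": "string"},
        {"name": "result", "type": "int"},
    ],
})

records = [
    {"timestamp": 86346512, "host": "foo", "username": "bar", "task": "foo", "result": 0},
]

with open("audit.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")   # compact binary + compression

with open("audit.avro", "rb") as src:
    for rec in reader(src):
        print(rec)
```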
We're using Hadoop/HBase, and I've looked at Cassandra, and they generally use the row key as the means to retrieve data the fastest, although of course (in HBase at least) you can still have it apply filters on the column data, or do it client side. For example, in HBase, you can say "give me all rows starting from key1 up to, but not including, key2".
So if you design your keys properly, you could get everything for 1 user, or 1 host, or 1 user on 1 host, or things like that. But, it takes a properly designed key. If most of your queries need to be run with a timestamp, you could include that as part of the key, for example.
How often do you need to query the data or write the data? If you expect to run your reports and it's fine if they (potentially) take 10, 15, or more minutes, but you do a lot of small writes, then HBase with Hadoop doing MapReduce (or using Hive or Pig as higher-level query languages) would work very well.
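As a concrete illustration of that key design, here is a rough sketch using the happybase client; the library, the table name, the column family, and the "user|timestamp" key layout are my assumptions.

```python
# Sketch of an HBase row-key range scan: key = "<user>|<zero-padded timestamp>".
# Assumes happybase (pip install happybase), an HBase Thrift server on localhost,
# and an existing 'audit' table with a column family 'd'.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("audit")

# Write: everything for one user sorts together; the timestamp in the key keeps it ordered.
table.put(b"bar|0000086346512", {b"d:host": b"foo", b"d:task": b"foo", b"d:result": b"0"})

# Read: "give me all rows from key1 up to, but not including, key2" --
# i.e. all audit rows for user 'bar' within a time window.
for key, data in table.scan(row_start=b"bar|0000086340000",
                            row_stop=b"bar|0000086350000"):
    print(key, data)
```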
If your JSON data has variable fields, then a schema-less model like Cassandra's could suit your needs very well. I'd expand the data into columns rather than storing it in binary format; that will make it easier to query. At the given data rate, it would take you 20 years to fill a 1 TB disk, so I wouldn't worry about compression.
For the example you gave, you could create two column families, Audit and Journal. The row keys would be TimeUUIDs (i.e. timestamp + MAC address to turn them into unique keys). Then the audit row you gave would have four columns, host:'foo', username:'bar', task:'foo', and result:0. Other rows could have different columns.
A range scan over the row keys would allow you to query efficiently over time periods (assuming you use ByteOrderedPartitioner). You could then use secondary indexes to query on users and terminals.

MySQL Blob vs. Disk for "video frames"

I have a C++ app that generates six relatively small, image-like integer arrays per second. The data is a 64x48x2 array of ints (i.e. a grid of 64x48 two-dimensional vectors, with each vector consisting of two floats). That works out to ~26 KB per image. The app also generates a timestamp and some features describing the data. I want to store the timestamp and the features in MySQL columns, one row per frame. I also need to store the original array as binary data, either in a file on disk or as a BLOB field in the database. Assume that the app will be running more or less nonstop, and that I'll come up with a way to archive data older than a certain age, so that storage does not become a problem.
What are the trade-offs here between BLOBs, files on disk, or other methods I may not even be thinking of? I don't need to query against the binary data, but I do need to query against the other metadata/features in the table (I'll definitely have an index built against the timestamp) and then retrieve the binary data. Does the equation change if I store multiple frames in a single file on disk vs. one frame per file?
Yes, I've read MySQL Binary Storage using BLOB VS OS File System: large files, large quantities, large problems and To Do or Not to Do: Store Images in a Database, but I think my question differs because in this case there are going to be millions of identically-dimensioned binary files. I'm not sure how the performance hit to maintaining that many small files in a filesystem compares to storing that many files in db blob columns. Any perspective would be appreciated.
At a certain point, querying for many blobs becomes unbearably slow, and I suspect that will be the case even with your identically dimensioned binary files. Moreover, you will still need some code to access and process the blobs. And this doesn't take advantage of the file caching that might speed up image queries straight from the file system.
But! The link you provided did not mention object-based databases, which can store the data you described in a way that you can access extremely quickly, and possibly return it in native format. For a discussion, see the link below or just search Google; there are many discussions:
Storing images in NoSQL stores
I would also look into HBase.
I figured since you were not sure about what to use in the first place(and there were no answers), an alternative solution might be appropriate.
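For what it's worth, the files-on-disk variant the question asks about is simple to sketch: the frame arrays go to disk and only the metadata (timestamp, features, path) would go into the indexed MySQL row. The use of numpy, the directory layout, and the feature name below are my own assumptions, not something from the question.

```python
# Sketch: one small file per frame on disk; metadata row (timestamp, features, path)
# is what would be inserted into MySQL and queried via its indexes.
import time
from pathlib import Path
import numpy as np

FRAME_DIR = Path("frames")
FRAME_DIR.mkdir(exist_ok=True)

def store_frame(frame: np.ndarray, features: dict) -> dict:
    """Write a 64x48x2 frame to disk and return the metadata row to insert into MySQL."""
    ts = time.time()
    # Bucket by hour so no single directory accumulates millions of small files.
    bucket = FRAME_DIR / time.strftime("%Y%m%d_%H", time.gmtime(ts))
    bucket.mkdir(exist_ok=True)
    path = bucket / f"{ts:.6f}.npy"
    np.save(path, frame)                       # one small file per frame
    return {"timestamp": ts, "path": str(path), **features}

def load_frame(path: str) -> np.ndarray:
    """Read a frame back after finding its path via the indexed metadata table."""
    return np.load(path)

# Example: a 64x48 grid of 2-vectors, as described in the question (dtype is a guess).
row = store_frame(np.zeros((64, 48, 2), dtype=np.float32), {"mean_mag": 0.0})
print(row)
```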