Postgresql JSONB is coming. What to use now? Hstore? JSON? EAV? - json

After going through the relational DB/NoSQL research debate, I've come to the conclusion that I will be moving forward with PG as my data store. A big part of that decision was the announcement of JSONB coming to 9.4. My question is what should I do now, building an application from the ground up knowing that I want to migrate to (I mean use right now!) jsonb? The DaaS options for me are going to be running 9.3 for a while.
From what I can tell, and correct me if I'm wrong, hstore would run quite a bit faster since I'll be doing a lot of queries of many keys in the hstore column and if I were to use plain json I wouldn't be able to take advantage of indexing/GIN etc. However I could take advantage of nesting with json, but running any queries would be very slow and users would be frustrated.
So, do I build my app around the current version of hstore or json data type, "good ol" EAV or something else? Should I structure my DB and app code a certain way? Any advice would be greatly appreciated. I'm sure others may face the same question as we await the next official release of PostgreSQL.
A few extra details on the app I want to build:
-Very relational (with one exception below)
-Strong social network aspect (groups, friends, likes, timeline etc)
-Based around a single object with variable user assigned attributes, maybe 10 or 1000+ (this is where the schema-less design need comes into play)
Thanks in advance for any input!

It depends. If you expect to have a lot of users, a very high transaction volume, or an insane number of attribute fetches per query, I would say use HSTORE. If, however, you app will start small and grow over time, or have relatively few transactions that fetch attributes, or just fetch a few per query, then use JSON. Even in the latter case, if you're not fetching many attributes but checking one or two keys often in the WHERE clause of your queries, you can create a functional index to speed things up:
CREATE INDEX idx_foo_somekey ON foo((bar ->> 'somekey'));
Now, when you have WHERE bar ->> somekey, it should use the index.
And of course, it will be easier to use nested data and to upgrade to jsonb when it becomes available to you.
So I would lean toward JSON unless you know for sure you're going kick your server's ass with heavy use of key fetches before you have a chance to upgrade to 9.4. But to be sure of that, I would say, do some benchmarking with anticipated query volumes now and see what works best for you.

You probably don't give quite enough to give a very detailed answer, but I will say this... If your data is "very relational" then I believe your best course is to build it with a good relational design. If it's just one field with "variable assigned attributes", then that sounds like a good use for an hstore. Which is pretty tried and true at this point. I've been doing some reading on 9.4 and jsonb sounds cool, but, that won't be out for a while. I suspect that a good schema design in 9.3 + a very targeted use of hstore will probably yield a good combination of performance and flexibility.

Related

Most suitable data store for billions of indexes

So we're looking to store two kinds of indexes.
First kind will be in the order of billions, each with between 1 and 1000 values, each value being one or two 64 bit integers.
Second kind will be in the order of millions, each with about 200 values, each value between 1KB and 1MB in size.
And our usage pattern will be something like this:
Both kinds of index will have values added to the top up to thousands of times per second.
Indexes will be infrequently read, but when they are read it'll be the entirety of the index that is read
Indexes should be pruned, either on writing values to the index or in some kind of batch type job
Now we've considered quite a few databases, our favourites at the moment are Cassandra and PostreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra. And a major requirement is that it can't require too much manpower to maintain. I get the feeling that Cassandra's going to throw up unexpected scaling issues, whereas PostgreSQL's just going to be a pain to shard, but at least for us it's a know quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.
So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!
Thanks,
-Alec
You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.
You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.
Billions is not a big number by todays standards, why not writing a benchmark instead of guesswork? That will give you a better decision tool and it's really easy to do. Just install your target OS, and each database engine, then run querys with let's say Perl (because i like it)
It won't take you more than one day to do all this, i've done something like this before.
A nice way to benchmark is writing a script that randomly , or with something like a gauss bell curve, executes querys, "simulating" real usage. Then plot the data or do it like a boss and just read the logs.

Performance of MySql Xml functions?

I am pretty excited about the new Mysql XMl Functions.
Now I can finally embed something like "object oriented" documents in my oldschool relational database.
For an example use-case consider a user who sings up at your website using facebook connect.
You can fetch an object for the user using the graph api, and get nice information. This information however can vary vastly. Some fields may or may not be set, some may be added over time and so on.
Well if you are just intersted in very special fields (for example friends relations, gender, movies...), you can project them into your relational database scheme.
However using the XMl functions you could store the whole object inside a field and then your different models can access the data using the ExtractValue function. You can store everything right away without needing to worry what you will need later.
But what will the performance be?
For example I have a table with 50 000 entries which represent useres.
I have an enum field that states "male", "female" (or various other genders to be politically correct).
The performance of for example fetching all males will be very fast.
But what about something like WHERE ExtractValue(userdata, '/gender/') = 'male' ?
How will the performance vary if the object gets bigger?
Can I maby somehow put an Index on specified xpath selections?
How do field types work together with this functions/performance. Varchar/blob?
Do I need fulltext indexes?
To sum up my question:
Mysql XML functins look great. And I am sure they are really great if you just want to store structured data that you fetch and analyze further in your application.
But how will they stand battle in procedures where there are internal scans/sorting/comparision/calculations performed on them?
Can Mysql replace document oriented databases like CouchDB/Sesame?
What are the gains and trade offs of XML functions?
How and why are they better/worse than a dynamic application that stores various data as attributes?
For example a key/value table with an xpath as key and the value as value connected to the document entity.
Anyone made any other experiences with it or has noticed something mentionable?
I tend to make comments similar to Pekka's, but I think the reason we cannot laugh this off is your statement "This information however can vary vastly." That means it is not realistic to plan to parse it all and project it into the database.
I cannot answer all of your questions, but I can answer some of them.
Most notably I cannot tell you about performance on MySQL. I have seen it in SQL Server, tested it, and found that SQL Server performs in memory XML extractions very slowly, to me it seemed as if it were reading from disk, but that is a bit of an exaggeration. Others may dispute this, but that is what I found.
"Can Mysql replace document oriented databases like CouchDB/Sesame?" This question is a bit over-broad but in your case using MySQL lets you keep ACID compliance for these XML chunks, assuming you are using InnoDB, which cannot be said automatically for some of those document oriented databases.
"How and why are they better/worse than a dynamic application that stores various data as attributes?" I think this is really a matter of style. You are given XML chunks that are (presumably) documented and MySQL can navigate them. If you just keep them as-such you save a step. What would be gained by converting them to something else?
The MySQL docs suggest that the XML file will go into a clob field. Performance may suffer on larger docs. Perhaps then you will identify sub-documents that you want to regularly break out and put into a child table.
Along these same lines, if there are particular sub-docs you know you will want to know about, you can make a child table, "HasDocs", do a little pre-processing, and populate it with names of sub-docs with their counts. This would make for faster statistical analysis and also make it faster to find docs that have certain sub-docs.
Wish I could say more, hope this helps.

Storing JSON in an msSQL database?

I'm developing a form generator, and wondering if it would be bad mojo to store JSON in an SQL database?
I want to keep my database & tables simple, so I was going to have
`pKey, formTitle, formJSON`
on a table, and then store
{["firstName":{"required":"true","type":"text"},"lastName":{"required":"true","type":"text"}}
in formJSON.
Any input is appreciated.
I use JSON extensively in my CMS (which hosts about 110 sites) and I find the speed of access data to be very fast. I was surprised that there wasn't more speed degradation. Every object in the CMS (Page, Layout, List, Topic, etc) has an NVARCHAR(MAX) column called JSONConfiguration. My ORM tool knows to look for that column and reconstitute it as an object if needed. Or, depending on the situation, I will just pass it to the client for jQuery or Ext JS to process.
As for readability / maintainability of my code, you might say it's improved because I now have classes that represent a lot of the JSON objects stored in the DB.
I used JSON.net for all serialization / deserialization. https://www.newtonsoft.com/json
I also use a single query to return meta-JSON with the actual data. As in the case of Ext JS, I have queries that return both the structure of the Ext JS object as well as the data the object will need. This cuts out one post back / SQL round trip.
I was also surprised at how fast the code was to parse a list of JSON objects and map them into a DataTable object that I then handed to a GridView.
The only downside I've seen to using JSON is indexing. If you have a property of the JSON you need to search, then you have to store it as a separate column.
There are JSON DB's out there that might server your needs better: CouchDB, MongoDB, and Cassandra.
A brilliant way to make an object database from sql server. I do this for all config objects and everything else that doesn't need any specific querying. extending your object - easy, just create a new property in your class and init with default value. Don't need a property any more? Just delete it in the class. Easy roll out, easy upgrade. Not suitable for all objects, but if you extract any prop you need to index on - keep using it. Very modern way of using sql server.
It will be slower than having the form defined in code, but one extra query shouldn't cause you much harm. (Just don't let 1 extra query become 10 extra queries!)
Edit: If you are selecting the row by formTitle instead of pKey (I would, because then your code will be more readable), put an index on formTitle
We have used a modified version of XML for exactly the purpose you decribe for seven or eight years and it works great. Our customers' form needs are so diverse that we could never keep up with a table/column approach. We are too far down the XML road to change very easily but I think JSON would work as well and maybe evan better.
Reporting is no problem with a couple of good parsing functions and I would defy anyone to find a significant difference in performance between our reporting/analytics and a table/column solution to this need.
I wouldn't recommend it.
If you ever want to do any reporting or query based on these values in the future it's going to make your life a lot harder than having a few extra tables/columns.
Why are you avoiding making new tables? I say if your application requires them go ahead and add them in... Also if someone has to go through your code/db later it's probably going to be harder for them to figure out what you had going on (depending on what kind of documentation you have).
You should be able to use SisoDb for this. http://sisodb.com
I think it not an optimal idea to store object data in a string in SQL. You have to do transformation outside of SQL in order to parse it. That presents a performance issue and you lose the leverage of using SQL native data parsing capability. A better way would be to store JSON as an XML datatype in SQL. This way, you kill two birds with one stone: You don't have to create shit load of tables and still get all the native querying benefits of SQL.
XML in SQL Server 2005? Better than JSON in Varchar?

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values than foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL parsing etc each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.
Use the right tool for the job.
There are a lot of anti-RDBMSs or BASE systems (Basically Available, Soft State, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability) to choose from here and here.
I've used traditional RDBMSs and though you can store CLOBs/BLOBs, they do
not have built-in indexes customized specifically for searching these objects.
You want to do most of the work (calculating the weighted frequency for
each tuple found) when inserting a document.
You might also want to do some work scoring the usefulness of
each (documentId,searchWord) pair after each search.
That way you can give better and better searches each time.
You also want to store a score or weight for each search and weighted
scores for similarity to other searches.
It's likely that some searches are more common than others and that
the users are not phrasing their search query correctly though they mean
to do a common search.
Inserting a document should also cause some change to the search weight
indexes.
The more I think about it, the more complex the solution becomes.
You have to start with a good design first. The more factors your
design anticipates, the better the outcome.
MapReduce seems like a great way of generating the tuples. If you can get a scala job into a jar file (not sure since I've not used scala before and am a jvm n00b), it'd be a simply matter to send it along and write a bit of a wrapper to run it on the map reduce cluster.
As for storing the tuples after you're done, you also might want to consider a document based database like mongodb if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using lucene or solr to do what you're doing instead of writing your own?

How fast is MySQL compared to a C/C++ program running in the server?

Ok, I have a need to perform some intensive text manipulation operations.
Like concatenating huge (say 100 pages of standard text), and searching in them etc. so I am wondering if MySQL would give me a better performance for these specific operations, compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote for coding and the coding skills. If the database provides out-of-the-box the tools and functionality you need, then why don't give it a try, which should take much less time than coding own tool. If the performance turns out to be an issue then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In Oracle world one has Text Mining and Oracle Text.
There are several good responses that I voted up, but here are more considerations from my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational databases like MySQL even though full-text is a kludge in databases designed for tables of rows and columns. For MySQL use the MyISAM engine to do the indexing and add a full text index on a "blob" column. (Afaik, the InnoDB engine still doesn't handle full text indexing, so you need to use MyISAM). For Postgresql you can use tsearch.
For a bit more difficulty of implementation though you'll see the best performance integrating indexing apps like Xapian, Hyper Estraier or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
Relational Databases are normally not good for handling large text data. The performance-wise strength of realtional DBs is the indexation and autmatically generated query plan. Freeform text does not work well in with this model.
If you are talking about storing plain text in one db field and trying to manipulate with data, then C/C++ sould be faster solution. In simple way, MySQL should be a lot bigger C programm than yours, so it must be slower in simple tasks like string manipulation :-)
Of course you must use correct algorithm to reach good result. There is useful e-book about string search algorithms with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that since my application required that the text be stored somewhere in the first place already, then the entire process of extracting the text from DB, passing it to the C program, and writing back the result into the DB would overall be less efficient than processing it within the DB??
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indices, which will be hundreds times faster, then directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or less records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed, and the same amount transferred over the network, then there probably won't be a big difference either way. If performance is critical then try both and benchmark them (if you don't have time to benchmark both then you probably want to go for whichever is easiest to implemnent anyway).
MySQL is written in C, so it is not correct to compare it to a C program. It's itself a C program