Serializing persistent/functional data structures - language-agnostic

Persistent data structures rely on structure sharing for efficiency. For an example, see here.
How can I preserve the structure sharing when I serialize the data structures and write them to a file or database? If I just naively traverse the data structures, I'll store the correct values, but I'll lose the structure sharing. I'd like to be able to save data structures with shared components to a file, restore them, and still have most of the structure shared in the restored data.

You want some form of hash-consing. This problem has been well studied. Andrew Kennedy's paper on pickler combinators explains in detail how to serialize and deserialize while preserving sharing.

There are two obvious methods I can think of, and they're related.
Don't serialize the structures, serialize the nodes. So you'd store a serialized record for each of the nodes in the example tree you gave, and you'd convert every node reference into a database key for that node. This gives you sharing automatically, but at the cost of doing multiple lookups to chase the references when loading a structure (a sketch of this approach follows below).
Color your nodes by ownership, like in your example. Have a concept of which structure a given node 'belongs' to and only serialize the nodes in a structure that belong to that structure. Links to nodes in other structures get replaced by a reference to that structure and the node in question. This allows you to load an entire structure at once, but can force you to load ALL related structures if they are highly interlinked.
Choosing between these options depends on what you are trying to optimize for, and what sorts of linkage you expect to see in practice.
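Here's a minimal sketch of the first approach in Node.js, with a made-up node shape of { value, children }: every node gets an id, is written out exactly once, and child references are stored as ids, so shared sub-structure is shared again after deserialization.

// Sketch: serialize a tree/DAG while preserving structure sharing.
// Assumes nodes look like { value, children: [node, ...] }.
function serialize(root) {
  const ids = new Map();     // node object -> numeric id
  const records = [];        // flat table of serialized nodes

  function visit(node) {
    if (ids.has(node)) return ids.get(node);   // shared node: reuse its id
    const id = records.length;
    ids.set(node, id);
    records.push(null);                        // reserve the slot before recursing
    records[id] = { value: node.value, children: node.children.map(visit) };
    return id;
  }

  return JSON.stringify({ rootId: visit(root), records });
}

function deserialize(text) {
  const { rootId, records } = JSON.parse(text);
  const nodes = records.map(r => ({ value: r.value, children: [] }));
  records.forEach((r, i) => { nodes[i].children = r.children.map(j => nodes[j]); });
  return nodes[rootId];      // nodes referenced from several parents are shared again
}

Writing that records table to a database, one row per node keyed by id, is essentially the first option above.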

Related

Storing a JSON file in Redis and retrieving it

I am storing the info contained in a JSON file in Redis, using the Node.js redis driver. Do you think I'm losing anything by using a hash to store the info?
The info is simply a large array of several thousand elements under data (each element has several fields, sometimes up to 50), plus a small set of properties under meta.
I understand that you're storing those JSON strings as follows:
hset some-key some-sub-key <the json>
Actually there's another valid approach which involves using the global key space directly:
set some-key:sub-key <the json>
If you're just storing those JSON strings, I would say that using keys in the global key space is the simplest and most effective approach in your case.
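A rough sketch of both with the node redis client (v4-style API; the key names mirror the ones above and the info object stands in for your parsed JSON file):

import { createClient } from 'redis';

const client = createClient();                 // connection details omitted, assumes a local Redis
await client.connect();

const info = { meta: { version: 1 }, data: [/* thousands of elements */] };
const json = JSON.stringify(info);

// Hash approach: hset some-key some-sub-key <the json>
await client.hSet('some-key', 'some-sub-key', json);
const fromHash = JSON.parse(await client.hGet('some-key', 'some-sub-key'));

// Global key space approach: set some-key:sub-key <the json>
await client.set('some-key:sub-key', json);
const fromKey = JSON.parse(await client.get('some-key:sub-key'));

Either way Redis treats the value as an opaque string; the practical difference is mostly in how you group, scan, and expire the keys.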
What do you mean by losing something? Storing JSON values in Redis and retrieving them can be really fast. Plus, Redis comes with some very handy commands like TTL, FLUSHALL, etc.
Personally, I'm using Redis for my profile page. I store my image uploads in Redis and have never had an issue.
My profile page: http://fanjin.computer
Github repo: https://github.com/bfwg/relay-gallery
Although this question has been answered, for future reference some might be asking the same question but looking for a different answer (like me).
If that's the case, I would suggest looking into RedisJSON for creating a JSON type in Redis.
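For completeness, a rough sketch of the RedisJSON route via the same node client (it requires a Redis server with the RedisJSON module loaded; the key and paths are placeholders):

import { createClient } from 'redis';

const client = createClient();
await client.connect();

// Store the parsed object as a real JSON value rather than a string blob.
await client.json.set('some-key', '$', { meta: { version: 1 }, data: [{ id: 1, name: 'first' }] });

// Read back just a fragment instead of fetching and parsing the whole document.
const names = await client.json.get('some-key', { path: '$.data[*].name' });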

Accessing objects from json files on disk

I have ~500 JSON files on my disk that represent hotels all over the world, each around 30 MB; all objects have the same structure.
At certain points in my Spring server I need to get the information for a single hotel, say by its code (which is a field inside the JSON object).
The data is read-only, but I might get updates from the hotel providers at certain times, like extra JSON files or delta changes.
Now, I don't want to migrate my JSON files to a relational database, that's for sure, so I've been investigating the best way to achieve what I want.
I tried Apache Drill, because querying straight from JSON files seemed like it would mean fewer headaches in dealing with the data. I ran a directory query using Drill, something like:
SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474';
but this obviously doesn't seem to be the most efficient way for me, as it takes around 10 seconds to fetch a single hotel.
At the moment I'm trying out CouchDB, but I'm still learning it. Should I merge all the hotels into a single document (that makes a bit of sense to me)? Or should I treat each hotel as its own document?
I'm just looking for pointers on a good solution to achieve what I want, so I'm here to get your opinions.
The main issue here is that JSON files do not have indexes associated with them, and Drill does not create indexes for them. So whenever you run a query like SELECT * FROM dfs.'C:\hotels\' WHERE code='1b3474'; Drill has no choice but to read each JSON file and parse and process all the data in it. The more files and data you have, the longer this query will take. If you need to do lookups like this often, I would suggest not using Drill for this use case. Some alternatives are:
A relational database where you have an index built for the code column.
A key-value store where code is the key (see the sketch after this list).
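As a sketch of the key-value idea (in Node.js for brevity, even though the question mentions a Spring server; the file layout and field names are assumptions): index the hotel codes once, then each lookup reads a single file instead of scanning the whole directory.

import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

const dir = 'C:\\hotels';                 // directory from the question
const index = new Map();                  // hotel code -> file that contains it

// One-time pass over the files to learn which codes live where.
for (const file of await readdir(dir)) {
  if (!file.endsWith('.json')) continue;
  const parsed = JSON.parse(await readFile(path.join(dir, file), 'utf8'));
  const hotels = Array.isArray(parsed) ? parsed : [parsed];   // assumes a file holds one hotel or an array of them
  for (const hotel of hotels) index.set(hotel.code, file);
}

// Per request: read and parse only the one file the index points at.
async function getHotel(code) {
  const file = index.get(code);
  if (!file) return null;
  const parsed = JSON.parse(await readFile(path.join(dir, file), 'utf8'));
  const hotels = Array.isArray(parsed) ? parsed : [parsed];
  return hotels.find(h => h.code === code) ?? null;
}

A real key-value store (or a relational table with an index on code) does the same thing, but persists the index and handles updates for you.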

Risks in using Neo4j as a stand-alone database

I have seen quite a few products with graph-related data models built on both Neo4j and a relational or document database. The other db is generally used to store the metadata of each node.
I am considering building a product relying entirely on Neo4j, storing all my objects' metadata as node properties. Is there any caveat in doing so?
It entirely depends on how much metadata you want to store. 10 primitive / short string properties per node is absolutely fine; 1000 large JSON documents per node... not so much. Neo4j isn't a document store.
What sort of numbers are we talking about? I would suggest you generate a random graph with a similar number of properties and similar values to those you expect in your product, and see how it performs.
Otherwise, no caveats I would say. Oh, and don't refer to internal Neo4j node IDs anywhere; unlike IDs in a relational database, these get re-used.
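A rough sketch of such a test with the official neo4j-driver package (the connection details, label, node count, and property shapes are all invented; swap in numbers that match your product):

import neo4j from 'neo4j-driver';
import { randomUUID } from 'node:crypto';

const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));
const session = driver.session();

// Generate nodes with a handful of short properties and our own stable id,
// rather than relying on the internal (re-usable) node id.
const nodes = Array.from({ length: 1000 }, () => ({
  uid: randomUUID(),
  name: 'node-' + Math.random().toString(36).slice(2, 8),
  score: Math.floor(Math.random() * 100),
}));
await session.run(
  'UNWIND $nodes AS n CREATE (:Thing {uid: n.uid, name: n.name, score: n.score})',
  { nodes }
);

// Always look nodes up by your own property, never by id(n).
const res = await session.run('MATCH (t:Thing {uid: $uid}) RETURN t', { uid: nodes[0].uid });
console.log(res.records[0].get('t').properties);

await session.close();
await driver.close();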

Which database suits my application, MySQL or MongoDB? Using Node.js, Backbone, Now.js

I want to make an application like docs.google.com (without its API, completely on my own server) using
frontend: Backbone
backend: Node
Which database do you think is better, MySQL or MongoDB? It should support good scalability.
I am familiar with MySQL (with PHP), and I will be happy if the answer is MySQL.
But many tutorials I've seen use MongoDB. Why did they use MongoDB instead of MySQL?
What should I use?
Can anyone give me a link to a sample application (with source) built using Backbone, Node, and MySQL (or MongoDB)? Or at least an app with Node and MySQL.
Thanks
With MongoDB, you can just store JSON objects and retrieve them fully formed, so you don't really need an ORM layer and you spend less CPU time translating your data back and forth. The developers behind MongoDB have also made horizontal scaling a higher priority, and they let you run arbitrary JavaScript code to pre-process data on the DB side (allowing map-reduce-style filtering of data).
But you lose some things for these gains: you can't join records. Actually, the JSON structure you store could only be reproduced via joins in SQL, but in MongoDB that one structure is the only shape your data has, while in SQL you can query differently and get your data represented in alternate ways much more easily. So if you need to do a lot of analytics on your database, MongoDB will make that harder.
The query language in MongoDB is "rougher", in my opinion, than SQL's, partly because it's less familiar, and partly because the querying features "feel" haphazardly put together, partially to keep the queries valid JSON, and partially because there are often several ways of doing the same thing, some of them older and less useful or less regularly formatted than the others. And there's the added complexity of the array and sub-object types over SQL's simple row-based design, so the syntax has to handle querying for arrays that contain some of the values you defined, contain all of the values you defined, contain only the values you defined, or contain none of them. The same distinctions apply to object keys and their values, and this makes the query syntax harder to grasp. (And while I can see the need for edge cases, the $where query parameter, which takes a JavaScript function that is run on every record and returns a boolean, is a siren song: it makes it easy to express which objects you want returned, but it has to run on every record in the database and no indexes can be used.)
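To make the array distinctions concrete, roughly (the collection and field names are invented; the rest is the official driver's API):

import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const posts = client.db('example').collection('posts');

await posts.find({ tags: { $in: ['db', 'node'] } }).toArray();   // contains at least one of the values
await posts.find({ tags: { $all: ['db', 'node'] } }).toArray();  // contains all of the values
await posts.find({ tags: ['db', 'node'] }).toArray();            // is exactly this array
await posts.find({ tags: { $nin: ['db', 'node'] } }).toArray();  // contains none of the values

// $where: arbitrary JavaScript per document - flexible, but it runs on every record and uses no indexes.
await posts.find({ $where: 'this.tags && this.tags.length > 3' }).toArray();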
So, it depends on what you want to do, but since you say it's for a Google Docs clone, you probably don't care about any representation but the document representation itself, and you're probably only going to query by document ID, document name, or the owner's ID/name; nothing too complex in the querying.
Then, I'd say being able to take the JSON representation of the document your user is editing and just throw it into the database, with indexes on those important fields, is worth the price of learning a new database.
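To make that concrete, a minimal sketch with the official mongodb driver (the database, collection, and field names are assumptions):

import { MongoClient } from 'mongodb';

const client = new MongoClient('mongodb://localhost:27017');
await client.connect();
const docs = client.db('docsClone').collection('documents');

// Index the few fields you actually query on.
await docs.createIndex({ ownerId: 1 });
await docs.createIndex({ name: 1 });

// Throw the document's JSON representation straight in...
const { insertedId } = await docs.insertOne({
  ownerId: 'user-42',
  name: 'Quarterly report',
  body: { blocks: [{ type: 'paragraph', text: 'Hello' }] },   // arbitrary nested structure
});

// ...and pull it back fully formed, with no ORM mapping in between.
const doc = await docs.findOne({ _id: insertedId });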
I was also struggling with this choice, looking at the hype around using MongoDB for tasks it was not built for. So here are my 2 cents:
Storing and retrieving hierarchical objects, which your documents probably are, is easier in MongoDB, as David says. It becomes more complicated if you want to store documents bigger than 16 MB, though; MongoDB's answer to that is GridFS.
Organising documents into folders and groups, and keeping track of which user owns which documents and who they have granted access to, is definitely easier with MySQL: you have the advantage of powerful SQL queries with joins etc., built-in EXPLAIN optimization, triggers, functions, stored procedures, and so on. MongoDB is nowhere near that.
So what prevents you from using both: MySQL to organize the documents and MongoDB to store one collection of documents identified by id (or several collections, one for each document type)? It seems to me the best choice, and using two databases in one application is really not a problem.
MySQL will store users, groups, folders, permissions - whatever you fancy - and for each document it will store a reference to the collection and the document id (MongoDB has a special format for this, DBRefs). MongoDB will store the documents themselves in collections if they are all under 16 MB, or the previews and metadata of documents in collections and the whole documents in GridFS.
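A rough sketch of that split with the mysql2 and mongodb packages (all table, column, and collection names are invented):

import mysql from 'mysql2/promise';
import { MongoClient } from 'mongodb';

const sql = await mysql.createConnection({ host: 'localhost', user: 'app', password: 'secret', database: 'docs_meta' });
const mongo = new MongoClient('mongodb://localhost:27017');
await mongo.connect();
const documents = mongo.db('docs').collection('documents');

// MongoDB holds the document body itself...
const { insertedId } = await documents.insertOne({ title: 'Quarterly report', body: { blocks: [] } });

// ...while MySQL holds users, folders, permissions, and a reference to the Mongo document.
await sql.execute(
  'INSERT INTO documents (title, owner_id, folder_id, mongo_id) VALUES (?, ?, ?, ?)',
  ['Quarterly report', 42, 7, insertedId.toString()]
);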
David provided a good answer. A few things to add to it.
MongoDB's flexible nature allows for easy agile / iterative development.
MongoDB, like Node.js, is asynchronous in nature and works very well in asynchronous environments.
Mongoose is a good ODM (object document mapper) that makes working with MongoDB from Node.js feel very natural. Unlike ORMs, it is a very thin layer (a minimal sketch follows at the end of this answer).
For Google Docs-like functionality, the flexibility and very rich data structures provided by MongoDB feel like a much better fit.
You can find some good example posts by searching for mongoose, node and MongoDB.
Here's one that also uses Backbone.js and looks good: http://mattkopala.com/blog/2012/02/12/getting-started-with-nodejs/
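And the minimal Mongoose sketch mentioned above (the schema fields and connection string are assumptions):

import mongoose from 'mongoose';

await mongoose.connect('mongodb://localhost:27017/docsClone');

// A thin schema: a couple of indexed fields for querying, a Mixed field for the document body.
const documentSchema = new mongoose.Schema(
  {
    name: { type: String, index: true },
    ownerId: { type: String, index: true },
    body: mongoose.Schema.Types.Mixed,
  },
  { timestamps: true }
);
const Doc = mongoose.model('Doc', documentSchema);

const saved = await Doc.create({ name: 'Quarterly report', ownerId: 'user-42', body: { blocks: [] } });
const found = await Doc.findOne({ ownerId: 'user-42' });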

What are some cons of storing html in a database for use?

Although it's very easy to do a search on the topic, it's not as easy to come to a conclusion. What are some cons of storing HTML in a database for use?
HTML is static, and querying data from a database uses database resources; database resources are typically among the more constrained on moderate- to heavy-use systems. Therefore it makes sense not to store HTML in the database, but to place it on the filesystem, where it can be retrieved without using critical resources.
In the broadest sense, HTML is a document markup language and serves to structure data into a document. The database on the other hand should contain raw data organized along its logical relations. Documents use formatting and may present data redundantly, but the true, underlying data is always fixed. Thus you should store the most immediate, raw form of data that you possibly can, and retrieve it in meaningful ways using both the query language itself to create suitable views for your purposes, and other, output-specific data processing to generate documents.
Of course you may like to cache the result of an output formatting operation, and you may choose to store the cache in a database, too. That's fine of course. But concerning the raw payload data, I would always go for the above.
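As a small illustration of that split (the render function, row shape, and cache are invented for the example): the database holds only the raw fields, and the HTML exists purely as a regenerable cache entry.

// Raw article data stays in the database; rendered HTML is only a cache entry.
const htmlCache = new Map();   // could just as well be Redis, a cache table, or files on disk

function renderArticle(article) {              // article is a plain row: { id, title, body }
  return `<article><h1>${article.title}</h1><p>${article.body}</p></article>`;
}

function articleHtml(article) {
  if (!htmlCache.has(article.id)) {
    htmlCache.set(article.id, renderArticle(article));
  }
  return htmlCache.get(article.id);            // safe to throw away and rebuild at any time
}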
That depends on how the HTML in the database is used. If it's data that you only ever access as a blob (meaning you never or rarely query the contents of the HTML), then I think it can be a good idea in some cases. The question is then essentially the same as "Should I store files in xyz format in my database?", and the answer to questions like that depends on several things:
How large are the files? Would storing them on the filesystem, with just their filename/path in the DB be more efficient?
Do you need to replicate the data to other servers? If so, then storing raw files in the DB may be easier than on the FS, if you already have DB-sync infrastructure in place.
What are your query uses like? Are they friendlier to a DB or a file system storage?
Now, if you're talking about storing HTML data that you frequently have to query, that changes the game entirely.
Any database normalization purist would tell you never to do it. But there might be cases when it's useful. For instance, if you're using some sort of full-text search engine, you may want the HTML in a database, or in whatever form the full-text search engine uses.