How do you organize buckets in Riak? - namespaces

Since Riak uses buckets as a way of separating keys, is it possible to have buckets within buckets? If not how would one go about organizing a Riak setup with many buckets for several apps.
The basic problem is how one would go about representing "databases" and "tables" within Riak. Since a bucket translates to a table, what translates to a database?
Namespaces in programming languages usually have hierarchies. It makes sense for Riak buckets to also allow hierarchies, since buckets are essentially namespaces.

You need to think about Riak as about very big key -> value "table" where buckets are only prefixes for keys. Now when you know that you can do anything with buckets as long as they are still binary objects.
You can create linear "tables":
<<"table1">>
<<"table2">>
Or you can create hierarchies:
<<"db1.table1">>
<<"db1.table2">>
<<"db2.table1">>
<<"db2.table2">>
Or you even can use tuples as buckets:
1> term_to_binary({"db1", "table1"}).
<<131,104,2,107,0,3,100,98,49,107,0,6,116,97,98,108,101,49>>
2> term_to_binary({"db1", "table2"}).
<<131,104,2,107,0,3,100,98,49,107,0,6,116,97,98,108,101,50>>
3> term_to_binary({"db2", "table1"}).
<<131,104,2,107,0,3,100,98,50,107,0,6,116,97,98,108,101,49>>
4> term_to_binary({"db2", "table2"}).
<<131,104,2,107,0,3,100,98,50,107,0,6,116,97,98,108,101,50>>

Related

Why are hash table based data structures not the default when implementing adjacency lists?

I looked at some existing implementations of adjacency lists online, and most if not all of them have been implemented using dynamic arrays. But wouldn't hashtable based data structures be more suitable? (set and map)
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable based data structure, lookups will be O(1) whether the nodes are indexed or otherwise.
So why are maps and sets not the default data structures used when implementing adjacency lists?
The choice of the right container is not quite easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous)
an array (with consecutive storage)
an associated array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / removal element is known) but a look-up has worst case time complexity of O(N).
The look-up in an array has a complexity of O(1) in worst case if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in best case but the worst case might be O(N) (even if it's unlikely to happen often if the hash table isn't completely bad implemented).
An associated array has a time complexity of O(lg N) in worst case.
So the choice always depends on the expected use cases to find the best compromise where the advantages pay off most while the disadvantages doesn't hurt too much.
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
For the configuration (according to a specific use case), there is spent an extra page BGL – adjacency_list.
However, the defaults for vertex (node) list and edge list are in fact vectors (aka. dynamic arrays). Assuming that the average use case is an non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst case of O(1) for look-up in arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add this as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped into the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might be in fact an associated array or hash table.
Considering that a request like e.g. find the shortest route from A to B has to map A and B once to the corresponding internal IDs but may need many look-up of nodes and edges to compute the answer, the choice of the array for storage of nodes and edges makes very sense.
There are very limited scenarios where we would access graph nodes by index.
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, or so on. For almost every algorithm like this, you will find that the only operation you need to use the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.

Azure DocumentDB Data Modeling, Performance & Price

I'm fairly new to NoSQL type databases, including Azure's DocumentDB. I've read through the documentation and understand the basics.
The documentation left me with some questions about data modeling, particularly in how it relates to pricing.
Microsoft charges fees on a "per collection" basis, with a collection being a list of JSON objects with no particular schema, if I understand it correctly.
Now, since there is no requirement for a uniform schema, is the expectation that your "collection" is analogous to a "database" in that the collection itself might contain different types of objects? Or is the expectation that each "collection" is analogous to a "table" in that it contains only objects of similar type (allowing for variance in the object properties, perhaps).
Does query performance dictate one way or another here?
Thanks for any insight!
The normal pattern under DocumentDB is to store lots of different types of objects in the same "collection". You distinguish them by either have a field type = "MyType" or with isMyType = true. The latter allows for subclassing and mixin behavior.
As for performance, DocumentDB gives you guaranteed 10ms read/15ms write latency for your chosen throughput. For your production system, put everything in one big "partitioned collection" and slide the size and throughput levers over time as your space needs and load demands. You'll get essentially infinite scalability and DocumentDB will take care of allocating (and deallocating) resources (secondaries, partitions, etc.) as you increase (or decrease) your throughput and size levers.
A collection is analogous to a database, more so than a relational table. Normally, you would store a type property within documents to distinguish between types, and add the AND type='MyType' filter to each of your queries if restricting to a certain type.
Query performance will not be significantly different if you store different types of documents within the same collection vs. different collections because you're just adding another filter against an indexed property (type). You might however benefit from pooling throughput into a single collection, vs. spreading small amounts of throughput for each type/collection.

Database optimized for searching in large number of objects with different attributes

Im am currently searching for an alternative to our aging MySQL database using an EAV approach. Current projects seem to have outgrown traditional table oriented database structures and especially searches in such database.
I head and researched about various NoSQL database systems but I can't find anything that seems to be what Im looking for. Maybe you can help.
I'll show you a generalized example on what kind of data I have and what operations I want to execute on them:
I have an object that has a small number of META attributes. Attributes that are common to all instanced of my objects. For example these
DataObject Common (META) Attributes
Unique ID (Some kind of string containing a unique identifier)
Created Date (A date time showing creation time of the object)
Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video"
... I think you get the Idea
Then each of my Objects has a variable number of other attributes. Most probably, many Objects will share a number of these attributes, but there is no rule. For my sample, we say each Object instance has between 5 to 20 such attributes. Here are some samples
Data Object variable Attributes
Color (Some CSS like color string)
Name (A string)
Category (The category or Tag of this item) (Maybe we also have more than one of these?)
URL (a url containing some website)
Cost (a number with decimals
... And a whole lot of other stuff mostly being of the usual column types
References to other data is an idea, but not a MUST at the moment. I could provide those within my application logic if needed.
A small sample:
Image
Unique ID = "0s987tncsgdfb64s5dxnt"
Created Date = "2013-11-21 12:23:11"
Type = "Image"
Title = "A cute cat"
Category = "Animal"
Size = "10234"
Mime = "image/jpeg"
Filename = "cat_123.jpg"
Copyright = "None"
Typical Operations
An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.
Apart from the usual stuff like writing one object to database or readin it by it's uid, the most problematic operations are these:
Search by several attributes - Select every DataObject that has Type "News" the Titel contains "blue" and the Created Date is after 2012.
Paged bulk read - Get a large number of objects from a search (see above) starting at element 100 and ending at 250
Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of it's attributes in one call.
Storage Requirements
Persistance - The storage needs to be persistance and not in memory only. If the server reboots, the data has to be at the same point in time as when it shut down before. No memory only systems.
Integrity - All data is important, nothing can be ignored. So every single write action has to be securely stored. Systems (Redis?) that tend to loose something now and then arent usable. Systems with huge asynchronity are also problematic. If data changes, every responsible node should see that.
Complexity - The system should be fairly easy to setup and maintain. So, systems that force the admin to take many week long courses in it's use arent really a solution here. Same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to get a cheap system with one node.
tl;dr
Need super fast database system with object oriented data and fast searched even with hundreds of thousands of items.
A reason as to why I am searching for a better alternative to mysql can be found here: Need MySQL optimization for complex search on EAV structured data
Update
Key-Value stores like Redis weren't an option as we need to do some heavy searching insode our data. Somethng which isnt possible in a typical Key-Value store.
In the end, we are using MongoDB with a slightly optimized scheme to make best use of MongoDBs use of indizes.
Some small drawback still remain but are acceptable at the moment:
- MongoDBs aggregate function can not wotk with very large result sets. We have to use find (and refine our data structure to make that one sufficient)
- You can not sort large datasets on specific values as it would take up to much memory. You also cant create indizes on those values as they are schema free.
I don't know if you wan't a more sophisticated answer than mine. But maybe i can inspire you a little.
MySql are scaleable and can be used for exactly your course. I think it's more of an optimization and server problem if you database i slow. Many system with massive amount of data i using MySql and works perfectly, Though NoSql (Not-Only SQL) is built for large amount of data with different attributes.
There's many diffrent NoSql providers and they have different ways of handling you data.
Think about that before you choose a NoSql platform.
The possibilities are
Key–value Stores - ex. Redis, Voldemort, Oracle BDB
Column Store - ex. Cassandra, HBase
Document Store - ex. CouchDB, MongoDb
Graph Database - ex. Neo4J, InfoGrid, Infinite Graph
Most website uses document based storing, but ex. facebook are using the column based, because of the many dynamic atrribute.
You can try the Document based NoSql at http://try.mongodb.org/
In the end, it really depends on how you build and optimize you database, and not from which technology you choose, though chossing the right technology can save a bunch of time.
The system we have developed are using a a combination of MySql and NoSql depending on what data we are working with. MySql for the system itself and NoSql for all the data we import via API's.
Hope this inspires a little and feel free to ask any westions

External data (keys) mapping within own database

I am currently working on a project of a new system.
This system will be using several different web-services to produce composite data.
Some data is compound and I use relational SQL tables (server, in particular, is MySQL) to compose data for further usage.
My problem is that I have to implement some data mapping.
Take countries for example.
Within our system countries are keyed (primary key based of CHAR2 ascii column) on ISO 3166-1 alpha-2.
One web-service provides data in the very same format. While several other have their own, unique, integer type identifiers.
As I am about to implement in-code mapping, I would like to have a possibility to dynamically update mapping tables, without making changes to the code.
Thus I am thinking about mappings table.
I may produce a table service_mappings, that would contain arbitrary length columns such as service_id (my own identifier for particular service), ref_id (datum provided by web-service), model (what data I am mapping this to in my system), key (what key this [service_id, ref_id] correspond to in my model).
On the other hand, I may choose something like a mapping table for each separate model, that would contain less keys (take model from previous table, as it would be defined by table name). This could be more feasible to use with ORMs of some kind.
So, my question is as following: what is the correct approach, what is the most efficient, and maybe there is some completely different technique?
Cache hint
In response to recent answer by Alexey.
We are likely to use some caching technique (such as memcache), although for primary data source we would like to rely on MySQL as we have methods in-place for creating and restoring back-ups, and we would have to think how to implement them.
Also, MySQL seems to inhibit some methods for faster access, and according to research by DeNA it may, actually, be faster than noSQL alternatives on primary-key/unique-key look-ups.
In your case, i'm would leave model in mapping table instead create separate tables, because it's will be much easer to find proper mapping. If you want more efficient, you may use some nosql storage for this mapping (such as Redis or memcachedb) , which is often much faster and reliable.

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values than foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL parsing etc each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.
Use the right tool for the job.
There are a lot of anti-RDBMSs or BASE systems (Basically Available, Soft State, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability) to choose from here and here.
I've used traditional RDBMSs and though you can store CLOBs/BLOBs, they do
not have built-in indexes customized specifically for searching these objects.
You want to do most of the work (calculating the weighted frequency for
each tuple found) when inserting a document.
You might also want to do some work scoring the usefulness of
each (documentId,searchWord) pair after each search.
That way you can give better and better searches each time.
You also want to store a score or weight for each search and weighted
scores for similarity to other searches.
It's likely that some searches are more common than others and that
the users are not phrasing their search query correctly though they mean
to do a common search.
Inserting a document should also cause some change to the search weight
indexes.
The more I think about it, the more complex the solution becomes.
You have to start with a good design first. The more factors your
design anticipates, the better the outcome.
MapReduce seems like a great way of generating the tuples. If you can get a scala job into a jar file (not sure since I've not used scala before and am a jvm n00b), it'd be a simply matter to send it along and write a bit of a wrapper to run it on the map reduce cluster.
As for storing the tuples after you're done, you also might want to consider a document based database like mongodb if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using lucene or solr to do what you're doing instead of writing your own?