Which one is better between GunDB and Filecoin in terms of pros and cons? Which will be better for storing mutable data?

Why are hash table based data structures not the default when implementing adjacency lists?

I looked at some existing implementations of adjacency lists online, and most if not all of them were implemented using dynamic arrays. But wouldn't hashtable-based data structures (sets and maps) be more suitable?
There are very limited scenarios where we would access graph nodes by index. Even if that's the case, if some indices are missing from the graph, there will be wasted space. And if the nodes are not inserted in order, lookups are O(n).
However, if we use a hashtable-based data structure, lookups will be O(1) whether the nodes are indexed or not.
So why are maps and sets not the default data structures used when implementing adjacency lists?
Choosing the right container is not easy.
I will consider some of the most common:
a list (elements which contain a reference to the next and/or previous element)
an array (with consecutive storage)
an associative array
a hash table.
Each of them has advantages and disadvantages.
Concerning a list, insertions and removals can be very fast (worst case O(1) if the insertion point / element to remove is known), but a look-up has a worst-case time complexity of O(N).
A look-up in an array has a worst-case complexity of O(1) if the index is known (but insertion and removal can be slow if the order must be kept).
A hash table has a look-up of O(1) in the best case, but the worst case might be O(N) (even if that is unlikely to happen often unless the hash table is badly implemented).
An associative array has a worst-case time complexity of O(lg N).
So the choice always depends on the expected use cases: find the best compromise where the advantages pay off most while the disadvantages don't hurt too much.
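
For orientation, the four containers above correspond directly to C++ standard-library types. This mapping is only an illustration added here, not part of the original answer:

#include <list>          // linked list: O(1) insert/remove at a known position, O(N) search
#include <vector>        // dynamic array: O(1) indexed look-up, contiguous storage
#include <map>           // associative array (balanced tree): O(lg N) look-up
#include <unordered_map> // hash table: O(1) expected look-up, O(N) worst case

std::list<int> l;
std::vector<int> a;
std::map<int, int> m;
std::unordered_map<int, int> h;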
For the management of node and edge lists in graphs, OP made the observation that arrays seem to be very common.
I recently had a look into the Boost Graph Library (for curiosity and inspiration). Concerning the data structures, it is mentioned:
The adjacency_list class is the general purpose “swiss army knife” of graph classes. It is highly parameterized so that it can be optimized for different situations: the graph is directed or undirected, allow or disallow parallel edges, efficient access to just the out-edges or also to the in-edges, fast vertex insertion and removal at the cost of extra space overhead, etc.
For configuring it according to a specific use case, a separate documentation page is devoted to this: BGL – adjacency_list.
However, the defaults for the vertex (node) list and the edge list are in fact vectors (a.k.a. dynamic arrays). Assuming that the average use case is a non-mutable graph (loaded once and never modified) which is explored by algorithms to answer certain user questions, the worst-case O(1) look-up of arrays is hard to beat and will very probably pay off.
To organize this, the nodes and edges have to be enumerated. If the input data doesn't provide this, it's easy to add it as a kind of internal ID to the in-memory representation of the graph.
In this case, "public" node references have to be mapped to the internal IDs, and answers have to be mapped back. For the mapping of the public node references, the most appropriate container should be used. This might in fact be an associative array or hash table.
Considering that a request like "find the shortest route from A to B" has to map A and B to the corresponding internal IDs once, but may need many look-ups of nodes and edges to compute the answer, the choice of the array for storage of nodes and edges makes a lot of sense, as the sketch below illustrates.
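
To make this two-layer design concrete, here is a minimal C++ sketch (the names Graph, internalId and addEdge are illustrative, not taken from BGL). Public node names are hashed once to obtain dense internal IDs; all further traversal is O(1) indexed access into vectors:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Graph {
    std::unordered_map<std::string, int> idOf; // public name -> internal ID
    std::vector<std::vector<int>> adjacency;   // adjacency[id] = neighbour IDs

    // Look up a public name, assigning a fresh dense ID on first sight.
    int internalId(const std::string& name) {
        auto it = idOf.find(name);
        if (it != idOf.end()) return it->second;
        int id = static_cast<int>(adjacency.size());
        idOf.emplace(name, id);
        adjacency.emplace_back();
        return id;
    }

    void addEdge(const std::string& from, const std::string& to) {
        int u = internalId(from);
        int v = internalId(to);
        adjacency[u].push_back(v); // directed; push the reverse too for undirected
    }
};

int main() {
    Graph g;
    g.addEdge("A", "B");
    g.addEdge("A", "C");
    // The hash map is consulted once per public name; the traversal itself
    // only performs indexed accesses into the vectors.
    for (int neighbour : g.adjacency[g.internalId("A")])
        std::cout << neighbour << '\n';
}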
There are very limited scenarios where we would access graph nodes by index.
This is true, and exactly what you should be thinking about: you want a data structure which can efficiently do whatever operations you actually want to use it for. So the question is, what operations do you want to be efficient?
Suppose you are implementing some kind of standard algorithm which uses an adjacency list, e.g. Dijkstra's algorithm, A* search, depth-first search, breadth-first search, topological sorting, and so on. For almost every algorithm like this, you will find that the only operation you need the adjacency list for is: for a given node, iterate over its neighbours.
That operation is more efficient for a dynamic array than for a hashtable, because a hashtable has to be sufficiently sparse to prevent too many collisions. Besides that, dynamic arrays will use less memory than hashtables, for the same reason; and the dynamic arrays are more efficient to build in the first place, because you don't have to compute any hashes.
Now, if you have a different algorithm where you need to be able to test for the existence of an edge in O(1) time, then an adjacency list implemented using hashtables may be a good choice; but you should also consider whether an adjacency matrix is more suitable.
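
As a rough sketch of that hashtable-based alternative (the types and the function name hasEdge are illustrative), an expected-O(1) edge-existence test could look like this:

#include <unordered_map>
#include <unordered_set>

// Hashtable-based adjacency: fast edge-existence tests, at the cost of
// extra memory and slower iteration over the neighbours of a node.
using HashGraph = std::unordered_map<int, std::unordered_set<int>>;

bool hasEdge(const HashGraph& g, int u, int v) {
    auto it = g.find(u);
    return it != g.end() && it->second.count(v) > 0;
}

For dense graphs, the adjacency matrix mentioned above gives the same O(1) test with a plain two-dimensional array.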

Standardize/normalize data (binary + numeric) before autoencoders, Ward hierarchical clustering, etc.?

I have a dataset with both binary data (0, 1) and numeric data with different units. If I want to apply some machine learning techniques to classify my data (potentially an autoencoder or hierarchical clustering), should I standardize or normalize the data?
Thank you!
It depends.
For neural networks you may want to standardize continuous variables for numerical reasons. But it depends on your platform. Consider Google's TPUs: they work with one byte of precision, so you want the relevant input domain to use this limited range optimally.
For distance-based methods like clustering, preprocessing the data is crucial, but difficult. Standardizing is not always the right thing to do, even though it is fairly common to apply some normalization. You need a domain expert to find the best normalization.
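
As one possible reading of "standardize continuous variables", here is a minimal C++ sketch that z-scores numeric columns while leaving binary (0/1) indicator columns untouched. Whether the binary columns should also be rescaled is exactly the domain-dependent question raised above:

#include <cmath>
#include <vector>

// Standardize each continuous column to zero mean and unit variance.
// Assumes isBinary holds one flag per column marking the 0/1 indicators.
void standardize(std::vector<std::vector<double>>& rows,
                 const std::vector<bool>& isBinary) {
    if (rows.empty()) return;
    const std::size_t cols = rows[0].size();
    for (std::size_t c = 0; c < cols; ++c) {
        if (isBinary[c]) continue; // leave 0/1 indicators as they are
        double mean = 0.0;
        for (const auto& r : rows) mean += r[c];
        mean /= rows.size();
        double var = 0.0;
        for (const auto& r : rows) var += (r[c] - mean) * (r[c] - mean);
        const double sd = std::sqrt(var / rows.size());
        if (sd == 0.0) continue; // constant column, nothing to scale
        for (auto& r : rows) r[c] = (r[c] - mean) / sd;
    }
}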

Serialized JSON vs UDT: implications for data and schema migrations in Cassandra

I'm coming up short on a real answer to what the different implications are of storing the serialized JSON of a type versus using a UDT in Cassandra. I'm now reaching out, hoping for someone with experience to elaborate.
In terms of performance and of data and schema changes (adding, altering, removing columns), how do they differ?
What are some pros and cons of each approach?
In what other noteworthy ways do they differ?
There is a big difference and I'll try to explain it.
UDTs are awesome if you want "strongly typed" fields in a CQL schema. You can use a UDT as part of your primary key (as a clustering column), and you can add and rename fields. The downside is that when doing selects you always select the whole UDT, and you cannot remove a field. Don't go too crazy with usage, because they are hell to maintain, especially if the same ones are used across multiple tables.
Using a serialized JSON string is good for some cases. I have even heard of people saving compressed data (Protobuf) into fields to solve their problems (I think someone from SoundCloud was talking about this). The problem with JSON is that it is not typed and that you need additional logic in the application to handle the serialization and changes to the data. This also means that you can have a variable structure and insert only the fields that you need.
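
To illustrate the application-side logic mentioned here, the following is a hedged C++ sketch using the nlohmann/json library (the field names and the idea of a "user" record are made up for the example). Note how a field added later simply falls back to a default when reading rows written before the change:

#include <iostream>
#include <string>
#include <nlohmann/json.hpp> // https://github.com/nlohmann/json

// Serialize on the way in; the resulting string goes into a plain text column.
std::string serializeUser(const std::string& name, int age) {
    nlohmann::json j;
    j["name"] = name;
    j["age"] = age;
    return j.dump();
}

// Deserialize on the way out, supplying defaults for fields old rows lack.
void readUser(const std::string& stored) {
    auto j = nlohmann::json::parse(stored);
    std::string country = j.value("country", std::string("unknown"));
    std::cout << j["name"] << ", " << country << '\n';
}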
In the end it's about your preference, as long as you understand the pros and cons of both approaches.

Efficient implementation of multiple return values?

Is it possible to efficiently implement (with little to no runtime overhead) functions that return multiple values / a tuple type?
In a C-like language something like this:
int, float f(int a) {
    return a * 2, a / 2;
}
Is there a reason why very few statically compiled languages do this?
Yes, it can be efficient. You may need to spill registers, but it is possible.
GHC for example, implements the "constructed product return" optimization, that:
determines when a function can profitably return multiple results in registers. The analysis is based only on a function's definition, and not on its uses (so separate compilation is easily supported) and the results of the analysis can be expressed by a transformation of the function definition alone.
CPR is a huge win for returning small structures (i.e. tuples, tagged unions).
More information:
CPR analysis in GHC.
More on demand analysis.
The best paper I have read on exactly this topic is An Efficient Implementation of Multiple Return Values in Scheme (PDF). Although it is about the Scheme programming language, it explains the matter in terms of low-level machine stack/register implementation.
This article actually made me think that many high-level features normally considered inefficient are a solved problem as far as efficient implementation goes, and that only the inertia of the popular languages is in the way.
If your tuple doesn't fit into a single register (32- or 64-bit, depending on your architecture), then there is going to be actual allocation (most likely on the heap) involved in implementing this.
That said, the reasons why very few languages permit this style is unlikely to be related to performance as much as it is likely related to stylistic concerns in the language (i.e., there are other idiomatic ways to achieve the same thing, such as returning a struct). Introducing new primitives to the language can be clumsy and introduce inconsistencies. For example, if tuples become first-class values, can I use them anywhere? How do I access them? Do we enforce immutability? How do I allocate or deallocate them?
Languages with more expressive type systems tend to make it easier to add these kinds of language features in a principled manner, which is why you'll find tuples (and all sorts of other exotic creatures) as first-class values in languages derived from the ML family (amongst others).
In C, just return a struct holding the values. Sure, the result won't fit in a register like an int would, so it will not be quite as efficient as an int return, but it will still be efficient if the struct is a local variable and thus allocated on the stack rather than the heap.
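
A minimal C++ sketch of that struct-return idiom (the names IntFloat and f are illustrative); on common ABIs a small aggregate like this is in fact passed back in registers:

#include <iostream>

struct IntFloat {
    int i;
    float f;
};

// Both values are returned by value in one small struct.
IntFloat f(int a) {
    return { a * 2, a / 2.0f };
}

int main() {
    auto [doubled, halved] = f(21); // C++17 structured bindings
    std::cout << doubled << ' ' << halved << '\n';
}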

Is there a difference between 'data structure' and 'data type'?

Two questions that often come up in the exams at the university where I study are:
Define data types. Classify and explain datatypes
Define data structures. Classify and explain data structures
Aren't they somehow the same thing?
Consider that you are making a Tree<E> in Java. You'd declare your class for Tree<E>, add methods to it and somewhere you would do Tree<String> myTree = new Tree<>(); to make a tree object.
Say you were asked the question: of what type is the variable myTree? The answer would be Tree<E>. Your data 'structure' is now a data 'type'.
Now, since they are the same, they will be classified in the same way, depending on what basis you want to classify them on: primitive or non-primitive, homogeneous or heterogeneous, linear or hierarchical.
That is my understanding. Is this understanding wrong?
To start, I would like to make a correction: you created a class called "Tree" and an object called "myTree", not a variable called "myTree" of data type "Tree". These are different things.
The following is the definition of a data type:
A data type or simply type is a classification identifying one of various types of data, such as real-valued, integer or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored.
Now, as per Wikipedia, there are various definitions for "type" in data type.
The question you have asked is a good one. There are data types in today's modern languages, that are referred to as Abstract Data Types or ADT in short. The definition of an ADT is:
An abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior; or for certain data types of one or more programming languages that have similar semantics. An abstract data type is defined indirectly, only by the operations that may be performed on it and by mathematical constraints on the effects (and possibly cost) of those operations.
It is also written that:
Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages; or described in a formal specification language.
Meaning that ADTs can be implemented using either data types or data structures.
As for data structures:
A data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.
Many textbooks use these words interchangeably. With more complex types, that can lead to confusion.
Take a small example: it is something of a standard to use B-trees to implement databases. Meaning, we know that this type of ADT is well suited for such a problem and can handle it effectively. But in order to realize this effectiveness, you need to create a data structure that gives you the desired output.
Another example: there are many kinds of trees, like B-trees, binary search trees, AA trees, etc. All of these are essentially trees, but each one is its own data structure.
Refer: List of data structures for a huge list of available structures.
The distinction is between abstract and concrete data structures. Some CS textbooks refer to abstract data structures as "data types", which is confusing because not all data types are data structures. They use "data structure" to specifically mean a concrete data structure.
An abstract data structure, also called an abstract data type, is the interface of the data structure. Java often represents them using interfaces; examples are List, Queue, Map, Deque, Set. (But there are others not represented in Java, such as bags/multisets, multimaps, graphs, stacks, and priority queues.) They are distinguished by their behavior and how you use the data structure. For instance, a set is characterized by forbidding duplicates and not recording order, whereas a list allows duplicates and remembers the order. A queue has a restricted interface that only lets you add to one end and remove from the other.
A concrete data structure is an implementation of an abstract data structure. Examples are ArrayList and LinkedList. These are both implementations of lists; while their list interface is the same, the programmer might still care about their different performance characteristics. Note that LinkedList also implements Queue.
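
The same split can be sketched in C++ (an illustrative analogue of the Java interfaces mentioned above, not code from the answer): the abstract type is defined purely by its operations, while the concrete structures differ only in their backing storage.

#include <deque>
#include <vector>

// The abstract data type: defined only by the operations it offers.
struct Stack {
    virtual void push(int value) = 0;
    virtual int pop() = 0;
    virtual ~Stack() = default;
};

// Two concrete data structures implementing the same abstract type.
struct VectorStack : Stack {
    std::vector<int> data;
    void push(int value) override { data.push_back(value); }
    int pop() override { int v = data.back(); data.pop_back(); return v; }
};

struct DequeStack : Stack {
    std::deque<int> data;
    void push(int value) override { data.push_back(value); }
    int pop() override { int v = data.back(); data.pop_back(); return v; }
};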
Also, there exist data structures in programming languages that have no type system. For example, you can model a Map in LISP, or have a dictionary in Python. It would be misleading to speak of a type here, as a type IMHO only makes sense with respect to some type system, or as an abstract concept like "the set of all values that inhabit t".
So it seems that "data structure" has the connotation of a concrete implementation of some abstract type. If, on the other hand, we speak of some object in a programming language with a type system, we would probably say that "it has type XY".