Novel or lesser known data structures for network (graph) data? - language-agnostic

What are some more interesting graph data structures for working with networks? I am interested in structures which may offer some particular advantage in terms of traversing the network, finding random nodes, size in memory or for insertion/deletion/temporary hiding of nodes for example.
Note: I'm not so much interested in database like designs for addressing external memory problems.

One of my personal favorites is the link/cut tree, a data structure for partitioning a graph into a family of directed trees. This lets you solve network flow problems asymptotically faster than more traditional methods and can be used as a more powerful generalization of the union/find structure you may have heard of before.
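For readers who have not met it, here is a minimal sketch of the plain union/find structure mentioned above. It is not a link/cut tree, and the class and method names are just illustrative choices.

```python
class DisjointSet:
    """Union/find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))  # each node starts as its own root
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets containing a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the smaller tree under the larger one
        self.size[ra] += self.size[rb]
        return True


ds = DisjointSet(5)
ds.union(0, 1)
ds.union(3, 4)
print(ds.find(0) == ds.find(1))  # True: 0 and 1 were merged
print(ds.find(1) == ds.find(3))  # False: the two groups are still separate
```

Union/find can only merge sets. A link/cut tree also lets you cut an edge again, which is where the extra power comes from.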

I've heard of Skip Graphs ( http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&gfns=1&q=skip+graphs ), a probabilistic graph structure that is - as far as I know - already in use in some peer-to-peer applications.
These graphs are kind of self-organizing and their goal is to achieve a good connectivity and a small diameter. There is a distributed algorithm that tries to achieve such graphs: http://www14.informatik.tu-muenchen.de/personen/jacob/Publications/podc09.pdf

Related

NIfTI vs DICOM for 3D volumetric data

Are there major benefits to selecting NIfTI over DICOM (or vice versa) as the data format? I am working on 3D volumetric semantic segmentation. I will have to convert either format to a NumPy array or tensor before feeding it to the network, but I am curious about the performance benefits of either choice.
(This question risks being opinion-based, so trying to stick to facts.)
DICOM is a very powerful, flexible but complex format, and its strength is to provide interoperability between different hardware and software. However, DICOM is not particularly efficient for image processing and analysis. One potential drawback of DICOM is that a single volume is stored as a sequence of 2D slices, which can be cumbersome to deal with.
NIfTI is an improved version of the Analyze file format and was designed to be simpler than DICOM while still retaining the essential metadata. It also has the added benefit of storing a volume in a single file, with a simple header followed by raw data, which makes it fast to load and process.
There are several other medical file formats suitable for this task. You may also wish to consider NRRD, which has many features in common with NIfTI. It is a simple format that is fast to parse and load, with flexible storage encoding for 2D, 3D and 4D data. Many tools and libraries can process NRRD files too.
So, given that your primary need is efficient storage and analysis, NIfTI or NRRD would be the better choice.
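To illustrate the "simple header followed by raw data" point, here is a rough sketch that builds a tiny synthetic single-file NIfTI-1 image in memory and reads it back with nothing but the standard library. The helper names are made up for this example. It only handles little-endian int16 data and ignores extensions, scaling and compression, so in practice you would use an existing reader library instead.

```python
import struct

HEADER_SIZE = 348  # a NIfTI-1 header is always 348 bytes


def make_volume(shape, voxels):
    """Build a tiny synthetic single-file NIfTI-1 image (int16, little-endian)."""
    header = bytearray(HEADER_SIZE)
    struct.pack_into("<i", header, 0, HEADER_SIZE)   # sizeof_hdr
    dim = [len(shape)] + list(shape) + [1] * (7 - len(shape))
    struct.pack_into("<8h", header, 40, *dim)        # dim[0] = rank, then extents
    struct.pack_into("<2h", header, 70, 4, 16)       # datatype 4 = int16, bitpix = 16
    struct.pack_into("<f", header, 108, 352.0)       # vox_offset: where the data starts
    header[344:348] = b"n+1\x00"                     # magic for single-file NIfTI-1
    padding = bytes(352 - HEADER_SIZE)
    return bytes(header) + padding + struct.pack("<%dh" % len(voxels), *voxels)


def read_volume(blob):
    """Read the shape and voxel values back from the bytes of such a file."""
    (sizeof_hdr,) = struct.unpack_from("<i", blob, 0)
    if sizeof_hdr != HEADER_SIZE or blob[344:348] != b"n+1\x00":
        raise ValueError("not a little-endian single-file NIfTI-1 image")
    dim = struct.unpack_from("<8h", blob, 40)
    shape = dim[1:1 + dim[0]]
    datatype, _bitpix = struct.unpack_from("<2h", blob, 70)
    if datatype != 4:
        raise ValueError("this sketch only handles int16 data")
    (vox_offset,) = struct.unpack_from("<f", blob, 108)
    count = 1
    for extent in shape:
        count *= extent
    # The voxel data is one contiguous run of values after the header.
    voxels = struct.unpack_from("<%dh" % count, blob, int(vox_offset))
    return shape, voxels


blob = make_volume((2, 3, 4), list(range(24)))
shape, voxels = read_volume(blob)
print(shape)       # (2, 3, 4)
print(voxels[:5])  # (0, 1, 2, 3, 4)
```

The whole volume comes back from one fixed-size header read plus one contiguous data read, whereas a DICOM series needs every slice file to be parsed and ordered first.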

Convolutional filter design in neural networks by data clustering

My understanding is that filters in convolutional neural networks extract features from raw data (or from previous layers), so designing them by supervised learning through backpropagation makes complete sense. But I have seen some papers in which the filters are found by unsupervised clustering of input data samples. It seems strange to me that cluster centers can be regarded as good filters for feature extraction. Does anybody have a good explanation for that?
Certain popular clustering algorithms such as k-means are vector quantization methods.
They try to find a good least-squares quantization of the data, such that every data point can be represented by a similar vector with least-squares difference.
So from a least-squares approximation point of view, the cluster centers are good approximations (we can't afford to find the optimal centers, but we have a good chance of finding reasonably good ones). Whether or not least squares is appropriate depends a lot on the data; for example, all attributes should be of the same kind. For a typical image processing task, where each pixel is represented the same way, this will be a good starting point for later supervised optimization. But I believe soft factorizations, which do not assume every patch is of exactly one kind, will usually be better.
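A minimal sketch of the idea, assuming flattened 2x2 patches and plain Lloyd-style k-means seeded with the first k patches. The function names and toy data are illustrative only, and real pipelines usually preprocess the patches first, which is omitted here.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center (least squares).
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        for j, group in enumerate(groups):
            if group:
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def response(patch, filt):
    """Filter response as a dot product, as in a convolution at one position."""
    return sum(a * b for a, b in zip(patch, filt))


# Toy patches: two that look like a vertical edge, two like a horizontal edge.
patches = [
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.9, 0.1],
    [0.9, 0.9, 0.1, 0.1],
]
filters = kmeans(patches, k=2)

probe = [1.0, 0.0, 1.0, 0.0]  # a new vertical-edge patch
print(response(probe, filters[0]) > response(probe, filters[1]))  # True
```

Each center ends up looking like a typical patch of its cluster, so a new patch of the same kind gives a larger dot product with that center than with the other one.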

How to store a large directed unweighted graph with billions of nodes and edges

The graph has billions of nodes and tens of billions of edges.
It will store webpage URLs and the links between webpages, and it will be used for testing ranking algorithms.
Any language is fine, but Java is preferred.
Solutions I have found so far:
neo4j
storing in sorted flat files
Yes, I have already read Best Way to Store/Access a Directed Graph.
Update
The data can be distributed on multiple computers and does not need to be fully in-memory.
Depending on your implementation, another solution could be Terracotta. I think it supports object graphs of this magnitude using a distributed virtual heap.
http://www.terracotta.org/web/display/docs/Concept+and+Architecture+Guide#ConceptandArchitectureGuide-VirtualHeap
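The "sorted flat files" option from the question is often laid out as two flat arrays, one of offsets and one of link targets. Below is a small in-memory sketch of that layout. It assumes URLs have already been mapped to integer IDs elsewhere, and the function names are made up. At the scale in the question the arrays would have to live in files or be split across machines rather than in Python objects.

```python
from array import array


def build_csr(num_nodes, sorted_edges):
    """Pack (source, target) pairs, sorted by source, into two flat arrays."""
    offsets = array("q", [0] * (num_nodes + 1))
    targets = array("q")
    for source, target in sorted_edges:
        offsets[source + 1] += 1  # count the out-links of each source
        targets.append(target)
    for u in range(num_nodes):
        offsets[u + 1] += offsets[u]  # prefix sum turns counts into start positions
    return offsets, targets


def successors(offsets, targets, u):
    """Out-links of node u as a slice of the flat target array."""
    return targets[offsets[u]:offsets[u + 1]]


edges = [(0, 1), (0, 2), (2, 0), (2, 3)]  # already sorted by source
offsets, targets = build_csr(4, edges)
print(list(successors(offsets, targets, 0)))  # [1, 2]
print(list(successors(offsets, targets, 1)))  # []
print(list(successors(offsets, targets, 2)))  # [0, 3]
```

The cost is one integer per edge plus one per node, and a ranking algorithm can stream through the out-links of every node sequentially.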

Performance evaluation with Message Passing

I have to build a distributed application, using MPI.
One of the decisions I have to make is how to map instances of classes onto processes (and then onto machines) in order to take maximum advantage of a distributed environment.
My question is: is there a model that lets me choose the best mapping? Some arrangements are surely wrong (for example, putting on two different machines two objects that must process a fairly large amount of data together, sequentially, without a stream of tokens to process). Is there a systematic way to identify such wrong arrangements, based on the flow of execution, the message complexity, and the time taken by the computation of the algorithmic components?
Well, there are data flow diagrams. Those can help identify opportunities for parallelism and its pitfalls. The references on the Wikipedia page might give you some more theoretical grounding.
When I worked at Lockheed Martin, I was exposed to CSIM, a tool they developed for modeling algorithm mapping to processing blocks.
Another thing you might try is the Join Calculus. I've found examples of programming with it to be surprisingly intuitive, and I think it's well grounded in theory. I'm not sure why it hasn't caught on more.
The other approach is the Pi Calculus, and I think that might be more popular, though it seems harder to understand.
A practical solution to this would be to use a different model of distributed-memory parallel programming that directly addresses your concerns. I work on the Charm++ programming system, whose model is that of individual objects sending messages to one another. The runtime system facilitates automatic mapping of these objects to available processors, to account for issues of load balance and communication locality.
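To make communication locality a bit more concrete, here is a very rough sketch of how one might score a candidate mapping by the traffic that crosses machine boundaries. This is not how Charm++ or any particular runtime does it. The object names, machine names and byte counts are invented, and load balance and latency are ignored.

```python
def cross_machine_traffic(traffic, placement):
    """Sum the bytes exchanged between objects placed on different machines."""
    return sum(
        volume
        for (a, b), volume in traffic.items()
        if placement[a] != placement[b]
    )


# Hypothetical measurements: bytes sent between pairs of objects.
traffic = {("reader", "parser"): 5_000_000, ("parser", "writer"): 20_000}

split = {"reader": "m1", "parser": "m2", "writer": "m2"}
colocated = {"reader": "m1", "parser": "m1", "writer": "m2"}

print(cross_machine_traffic(traffic, split))      # 5000000
print(cross_machine_traffic(traffic, colocated))  # 20000
```

A mapping that separates the two objects exchanging the most data scores far worse, which is the kind of "surely wrong" arrangement described in the question.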

How well do common programming tasks translate to GPUs?

I have recently begun working on a project to establish how best to leverage the processing power available in modern graphics cards for general programming. It seems that the field of general-purpose GPU programming (GPGPU) has a large bias towards scientific applications with a lot of heavy math, as this fits well with the GPU computational model. This is all well and good, but most people don't spend all their time running simulation software and the like, so we figured it might be possible to create a common foundation for easily building GPU-enabled software for the masses.
This leads to the question I would like to pose: what are the most common types of work performed by programs? It is not a requirement that the work translates extremely well to GPU programming, as we are willing to accept modest performance improvements (better a little than nothing, right?).
There are a couple of subjects we have in mind already:
Data management - Manipulation of large amounts of data from databases and otherwise.
Spreadsheet type programs (Is somewhat related to the above).
GUI programming (Though it might be impossible to get access to the relevant code).
Common algorithms like sorting and searching.
Common collections (And integrating them with data manipulation algorithms)
Which other coding tasks are very common? I suspect a lot of the code being written is of the category of inventory management and otherwise tracking of real 'objects'.
As I have no industry experience, I figured there might be a number of basic types of code that are written more often than I realize but that just don't materialize as external products.
Both high-level programming tasks and specific low-level operations would be appreciated.
General programming translates terribly to GPUs. GPUs are dedicated to performing fairly simple tasks on streams of data at a massive rate, with massive parallelism. They do not deal well with the rich data and control structures of general programming, and there's no point trying to shoehorn that into them.
This isn't too far away from my impression of the situation but at this point we are not concerning ourselves too much with that. We are starting out by getting a broad picture of which options we have to focus on. After that is done we will analyse them a bit deeper and find out which, if any, are plausible options. If we end up determining that it is impossible to do anything within the field, and we are only increasing everybody's electricity bill then that is a valid result as well.
Things that modern computers do a lot of, where a little benefit could go a long way? Let's see...
Data management: relational database management could benefit from faster relational joins (especially joins involving a large number of relations). Involves massive homogeneous data sets.
Tokenising, lexing, parsing text.
Compilation, code generation.
Optimisation (of queries, graphs, etc).
Encryption, decryption, key generation.
Page layout, typesetting.
Full text indexing.
Garbage collection.
I do a lot of simplifying of configuration. That is, I wrap the generation and management of configuration values inside a UI. The primary benefit is that I can control workflow and presentation to make it simpler for non-techie users to configure apps/sites/services.
The other thing to consider when using a GPU is the bus speed. Most graphics cards are designed to have higher bandwidth when transferring data from the CPU out to the GPU, as that is what they do most of the time. The bandwidth from the GPU back to the CPU, which is needed to return results, isn't as fast. So they work best in a pipelined mode.
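As a back-of-the-envelope sketch of why pipelining helps, compare running each batch start to finish against overlapping upload, compute and download. The stage times below are invented, and the model assumes the hardware can overlap transfers in both directions with computation, which is not a given.

```python
def serial_time(batches, upload, compute, download):
    """Every batch waits for upload, compute and download in turn."""
    return batches * (upload + compute + download)


def pipelined_time(batches, upload, compute, download):
    """With full overlap, the slowest stage sets the pace after the first batch."""
    slowest = max(upload, compute, download)
    return upload + compute + download + (batches - 1) * slowest


# Hypothetical per-batch stage times in milliseconds; download is the slow one.
print(serial_time(10, upload=2, compute=1, download=4))     # 70
print(pipelined_time(10, upload=2, compute=1, download=4))  # 43
```

In this toy model the slow return path still dominates, but pipelining hides the upload and compute time behind it.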
You might want to take a look at the March/April issue of ACM's Queue magazine, which has several articles on GPUs and how best to use them (besides doing graphics, of course).