Efficient set intersection - decide whether the intersection is larger than k - language-agnostic

I am faced with a problem where I have to calculate intersections between all pairs in a collection of sets. None of the sets are smaller than a small constant k, and I'm only interested in whether two sets have an intersection larger than k-1 elements or not. I do not need the actual intersections nor the exact size, only whether it's larger than k-1 or not. Is there some clever pre-processing trick or a neat set intersection algorithm that I could use to speed things up?
More info that can be useful to answer the question:
The sets represent maximal cliques in a large, undirected, sparse graph. The number of sets can be in the order of tens of thousands or more, but most of the sets are likely to be small.
The sets are already sorted members of each set are in increasing order. Effectively they are sorted lists - I receive them this way from an underlying library for maximal clique search.
Nothing is known about the distribution of elements in the sets (i.e. whether they are in tight clumps or not).
Most of the set intersections are likely to be empty, so the ideal solution would be a clever data structure that helps me cut down the number of set intersections I have to make.

Consider a mapping with all sets of size k as the keys and corresponding values of lists of all sets from your collection that contain the key as a subset. Given this mapping, you don't need to perform any intersection tests: for each key, all pairs of sets from the list will have an intersection of size at least k. This approach can produce the same pair of sets more than once, so that will need to be checked.
The mapping is easy enough to calculate. For each set in the collection, calculate all the size-k subsets and append the original set to the list for that key set. But is this actually faster? In general, no. The performance of this approach will depend on the distribution of the sizes of the sets in the collection and the value of k. With d distinct elements in the sets, you could have as many as d choose k keys, which can be very large.
However, the basic idea is usable to reduce the number of intersections. Instead of using sets of size k, use smaller ones of fixed size q as the keys. The values are again lists of all sets that have the key as a subset. Now, test each pair of sets from the list for intersection. Thus, with q=1 you only test those pairs of sets that have at least one element in common, with q=2 you only test those pairs of sets that have at least two elements in common, and so on. The optimal value for q will depend on the distribution of sizes of the sets, I think.
For the sets in question, a good choice might be q=2. The keys are then just the edges of the graph, giving a predictable size to the mapping. Since most sets are expected to be disjoint, q=2 should eliminate a lot of comparisons without much additional overhead.

One possible optimization, which is more effective the smaller the range of values contained in each set:
Create a list of all the sets, sorted by their kth-greatest element (this is easy to find, since you already have each set with its elements in order). Call this list L.
For any two sets A and B, their intersection cannot have as many as k elements in it if the kth-greatest element in A is less than the least element in B.
So, for each set in turn, calculate its intersection only with the sets in the relevant part of L.
You can use the same fact to exit early from computing the intersection of any two sets - if there are only n-1 elements left to compare in one of the sets, and the intersection so far contains at most k-n elements, then stop. The above procedure is simply this rule applied to all the sets in L at once, with n=k, at the point where we're looking at the least element of set B and the kth-greatest element of A.

The following strategy should be quite efficient. I've used variations of this for intersecting ascending sequences on a number of occasions.
First I assume that you have some sort of priority queue available (if not, rolling your own heap is pretty easy). And a fast key/value lookup (btree, hash, whatever).
With that said, here is pseudocode for an algorithm that should do what you want quite efficiently.
# Initial setup
sets = array of all sets
intersection_count = key/value lookup with keys = (set_pos, set_pos) and values are counts.
p_queue = priority queue whose elements are (set[0], 0, set_pos), organized by set[0]
# helper function
def process_intersections(current_sets):
for all pairs of current_sets:
if pair in intersection_count:
intersection_count[pair] += 1
else:
intersection_count[pair] = 1
# Find all intersections
current_sets = []
last_element = first element of first thing in p_queue
while p_queue is not empty:
(element, ind, set_pos) = get top element from p_queue
if element != last_element:
process_intersections(current_sets)
last_element = element
current_sets = []
current_sets.append(set_pos)
ind += 1
if ind < len(sets[set_pos]):
add (sets[set_pos][ind], ind, set_pos) to p_queue
# Don't forget the last one!
process_intersections(current_sets)
final answer = []
for (pair, count) in intersection_count.iteritems():
if k-1 < count:
final_answer.append(pair)
The running time will be O(sum(sizes of sets) * log(number of sets) + count(times a point is in a pair of sets). In particular note that if two sets have no intersection, you never try to intersect them.

What if you used a predictive subset as a prequalifier. Pre-sort, but use a subset intersection as a threshold condition. If subset intersection > n% then complete the intersection, otherwise abandon. n then becomes the inverse of your comfort level with the prospect of a false positive.
You could also sort by the subset intersections(m) calculated earlier and begin running the full intersection ordered by m descending. So presumably the majority of your highest m intersections would likely cross your k threshold on the full subset and the probably of hitting your k threshold would continually decrease.
This really starts to treat the problem as NP-Complete.

Related

Would this be a valid Implementation of an ordinal CrossEntropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is say "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require distance_weight of [3,2,1,2,3], the sum of which is 11.
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1,max_place]. This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.)
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

get random index numbers from a matrix, fortran 90

I am looking for a function or a way to get the index numbers of a 2D matrix:
my example is, I have A(Ly,Lx) where Ly = 100 and Lx = 100
I want to get a random index number of the matrix, such as : Random_node(A) = (random y, random x)
Then I want to do this repeatedly having the constraint that I don't want my random points to be repeated or even not to be close one to each other following a threshold of (let's say) 10 nodes of radius. The matrix is an eulerian 2D matrix (y,x).
Is at least the first question straightforward?
Thank you all!
Albert P
Here's one way of getting a random set of locations in your 100x100 matrix. First, declare a 100x100 matrix of reals:
real, dimension(100,100) :: randarray
then, put a random number into each element of that array
call random_number(randarray)
Now, an expression such as
randarray > 0.9
returns a logical array containing, approximately, 10% true values and 90% false. By tracking down the locations of the true values you have the random x-es and y-es that you seek. Indeed you may not need to find those locations at all, you can simply use the expression in masked assignments and similar operations, for example
where(randarray>0.9) a = func()
as long, of course, as func returns a scalar or a 100x100 array.
This approach guarantees that each location is different from all the others.
It does not however, address your constraint that the 'random' locations should not be too close to each other. That constraint, of course, is a little inconsistent with randomness.
You could, I suppose, break your 100x100 array into 10x10 blocks and choose, randomly, one element in each block. Would that be a good compromise between your constraints ?

numerical issues causing the difference in outputs of two programs?

I have two codes that theoretically should return the exact same output. However, this does not happen. The issue is that the two codes handle very small numbers (doubles) to the order of 1e-100 or so. I suspect that there could be some numerical issues which are related to that, and lead to the two outputs being different even though they should be theoretically the same.
Does it indeed make sense that handling numbers on the order of 1e-100 cause such problems? I don't mind the difference in output, if I could safely assume that the source is numerical issues. Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
Thanks.
Does anyone have a good source/reference that talks about issues that come up with stability of algorithms when they handle numbers in such order?
The first reference that comes to mind is What Every Computer Scientist Should Know About Floating-Point Arithmetic. It covers floating-point maths in general.
As far as numerical stability is concerned, the best references probably depend on the numerical algorithm in question. Two wide-ranging works that come to mind are:
Numerical Recipes by Press et al;
Matrix Computations by Golub and Van Loan.
It is not necessarily the small numbers that are causing the problem.
How do you check whether the outputs are the "exact same"?
I would check equality with tolerance. You may consider the floating point numbers x and y equal if either fabs(x-y) < 1.0e-6 or fabs(x-y) < fabs(x)*1.0e-6 holds.
Usually, there is a HUGE difference between the two algorithms if there are numerical issues. Often, a small change in the input may result in extreme changes in the output, if the algorithm suffers from numerical issues.
What makes you think that there are "numerical issues"?
If possible, change your algorithm to use Kahan Summation (aka compensated summation). From Wikipedia:
function KahanSum(input)
var sum = 0.0
var c = 0.0 //A running compensation for lost low-order bits.
for i = 1 to input.length do
y = input[i] - c //So far, so good: c is zero.
t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
//Next time around, the lost low part will be added to y in a fresh attempt.
return sum
This works by keeping a second running total of the cumulative error, similar to the Bresenham line drawing algorithm. The end result is that you get precision that is nearly double the data type's advertised precision.
Another technique I use is to sort my numbers from small to large (by manitude, ignoring sign) and add or subtract the small numbers first, then the larger ones. This has the virtue that if you add and subtract the same value multiple times, such numbers may cancel exactly and can be removed from the list.

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimations and that we also know what we don't know - eg. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra techniques (for eg: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you if there is a unique solution, no solution or an infinite number of solutions (no other possibilities are possible).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe falls under Perturbation Theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing some 'slack' task in each walk (i.e add more variables) and try to pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize sum of s_i), using Linear Programming Techniques.
Assume you have an unlimited number of arbitrary characters to represent each edge. (a,b,c,d etc)
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
if #w[i] ~= 1 then
replace w[2] with the LENGTH of w[i], minus all other values in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first. Replace all instances of "a" with 50, -b,-c,-d,-e.
New data:
50, 50
50,-b,-d, 20
0,c,e 10
And, repeat until one value is left, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat lists of tasks as vectors - every task represented as a component with value equal to it's cost (time to complete in this case.
In tasks are in different orderes initially, that's where to use domain knowledge to bring them to a cannonical form and assign multipliers if domain knowledge tells you that the ratio of costs will be synstantially influenced by ordering / timing. Timing is implicit initial ordering but you may have to make a function of time just for adjustment factors (say drivingat lunch time vs driving at midnight). Function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardnes of doing something). You may need a functional language to do repeated rewrites of your vectors till there's nothing more that romain knowledge and rules can change.
With cannonical vectors consider just presence and absence of task (just 0|1 for this iteratioon) and look for minimal diffs - single task diffs first - that will provide estimates which small number of variables. Keep doing this recursively, be ready to back track and have a heuristing rule for goodness or quality of estimates so far. Keep track of good "rounds" that you backtraced from.
When you reach minimal irreducible state - dan't many any more diffs - all vectors have the same remaining tasks then you can do some basic statistics like variance, mean, median and look for big outliers and ways to improve initial domain knowledge based estimates that lead to cannonical form. If you finsd a lot of them and can infer new rules, take them in and start the whole process from start.
Yes, this can cost a lot :-)

Detecting repetition with infinite input

What is the most optimal way to find repetition in a infinite sequence of integers?
i.e. if in the infinite sequence the number '5' appears twice then we will return 'false' the first time and 'true' the second time.
In the end what we need is a function that returns 'true' if the integer appeared before and 'false' if the function received the integer the first time.
If there are two solutions, one is space-wise and the second is time-wise, then mention both.
I will write my solution in the answers, but I don't think it is the optimal one.
edit: Please don't assume the trivial cases (i.e. no repetitions, a constantly rising sequence). What interests me is how to reduce the space complexity of the non-trivial case (random numbers with repetitions).
I'd use the following approach:
Use a hash table as your datastructure. For every number read, store it in your datastructure. If it's already stored before you found a repetition.
If n is the number of elements in the sequence from start to the repetition, then this only requires O(n) time and space. Time complexity is optimal, as you need to at least read the input sequence's elements up to the repetition point.
How long of a sequence are we talking (before the repetition occurs)? Is a repetition even guaranteed at all? For extreme cases the space complexity might become problematic. But to improve it you will probably need to know more structural information on your sequence.
Update: If the sequence is as you say very long with seldom repetitions and you have to cut down on the space requirement, then you might (given sufficient structural information on the sequence) be able to cut down the space cost.
As an example: let's say you know that your infinite sequence has a general tendency to return numbers that fit within the current range of witnessed min-max numbers. Then you will eventually have whole intervals that have already been contained in the sequence. In that case you can save space by storing such intervals instead of all the elements contained within it.
A BitSet for int values (2^32 numbers) would consume 512Mb. This may be ok if the BitSets are allocated not to often, fast enough and the mem is available.
An alternative are compressed BitSets that work best for sparse BitSets.
Actually, if the max number of values is infinite, you can use any lossless compression algorithm for a monochrome bitmap. IF you imagine a square with at least as many pixels as the number of possible values, you can map each value to a pixel (with a few to spare). Then you can represent white as the pixels that appeared and black for the others and use any compression algorithm if space is at a premium (that is certainly a problem that has been studied)
You can also store blocks. The worst case is the same in space O(n) but for that worst case you need that the number appeared have exactly 1 in between them. Once more numbers appear, then the storage will decrease:
I will write pseudocode and I will use a List, but you can always use a different structure
List changes // global
boolean addNumber(int number):
boolean appeared = false
it = changes.begin()
while it.hasNext():
if it.get() < number:
appeared != appeared
it = it.next()
else if it.get() == number:
if !appeared: return true
if it.next().get() == number + 1
it.next().remove() // Join 2 blocks
else
it.insertAfter(number + 1) // Insert split and create 2 blocks
it.remove()
return false
else: // it.get() > number
if appeared: return true
it.insertBefore(number)
if it.get() == number + 1:
it.remove() // Extend next block
else:
it.insertBefore(number + 1)
}
return false
}
What this code is the following: it stores a list of blocks. For each number that you add, it iterates over the list storing blocks of numbers that appeared and numbers that didn't. Let me illustrate with an example; I will add [) to illustrate which numbers in the block, the first number is included, the last is not.In the pseudocode it is replaced by the boolean appeared. For instance, if you get the 5, 9, 6, 8, 7 (in this order) you will have the following sequences after each function:
[5,6)
[5,6),[9,10)
[5,7),[9,10)
[5,7),[8,10)
[5,10)
In the last value you keep a block of 5 numbers with only 2.
Return TRUE
If the sequence is infinite then there will be repetition of every conceivable pattern.
If what you want to know is the first place in the sequence when there is a repeated digit that's another matter, but there's some difference between your question and your example.
Well, it seems obvious that in any solution we'll need to save the numbers that already appeared, so space wise we will always have a worst-case of O(N) where N<=possible numbers with the word size of our number type (i.e. 2^32 for C# int) - this is problematic over a long time if the sequence is really infinite/rarely repeats itself.
For saving the numbers that already appeared I would use an hash table and then check it each time I receive a new number.