Forming disjoint sets of the nodes of an MST (minimum spanning tree)

I have been given N nodes and N-1 edges, basically an undirected acyclic graph (an MST).
I want to count the number of disjoint sets of nodes.
Two nodes belong to the same set if they have the same parent and the same degree. For example, in the given tree, {3,4} belong to the same set because parent[3] = 1 and parent[4] = 1 and degree[3] = degree[4] = 1; {5} and {2} each form their own set, because parent[2] = parent[5] = 1 but degree[5] = 2 and degree[2] = 3. I have to count the total number of such disjoint sets. I have been trying to do it with a structure, something like:
#define MAX 100000
int set;
struct parent
{
    int z[MAX-1];
} p_data[MAX];
But this makes the p_data array far too big. So please help: what kind of data structure or logic should I follow to count the total number of disjoint sets defined above?
I am unable to post an image here, so please refer to this image: http://postimage.org/image/sw2nrciib/
Could someone please tell me the number of disjoint sets that can be formed?
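A minimal sketch of the pair-counting idea, assuming parent[] and degree[] are filled in while reading the N-1 edges (function and parameter names are illustrative):

#include <algorithm>
#include <utility>
#include <vector>

// Count distinct (parent, degree) pairs over all non-root nodes,
// with no per-node child arrays needed. Whether the root counts as
// its own set is a convention the problem statement must settle.
int count_disjoint_sets(const std::vector<int>& parent,
                        const std::vector<int>& degree, int root)
{
    std::vector<std::pair<int, int>> keys;
    for (int v = 0; v < (int)parent.size(); ++v)
        if (v != root)
            keys.push_back({parent[v], degree[v]});
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    return (int)keys.size();   // one disjoint set per distinct pair
}

This needs O(N) memory and O(N log N) time, instead of the O(MAX^2) child matrix above.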

Related

MySQL: Can we create an array (associative or not) as a MySQL server script variable? [duplicate]

This question already has answers here:
How to store arrays in MySQL?
(7 answers)
I've seen a lot of SQL server script variables of this kind: @variable. But can we, in fact, store an array (associative or not) behind the @variable?
UPDATE
This question turns out to be a duplicate of this one, which suggests considering the possible use of:
The SET and JSON types, which seem to be only column types, not @variable types.
A TEMPORARY TABLE, which seems to be stored on the HDD (right?).
Functions working with JSON strings (e.g., JSON_VALUE and JSON_LENGTH), which are usable entirely within a MySQL server script. However, these functions do not help to derive an array and store it in a @variable; they are merely JSON workarounds. I would accept this variant, but it seems that @json_string is parsed each time we call JSON_VALUE(@json_string).
So, up to now, it seems that there IS a way to CREATE an array (associative or not!) but there is NO way to reliably KEEP the array for further processing!
Regarding the question mentioned at the beginning of this one: right now I've only reached the 5th and 6th answers, which are related to JSON strings. They are interesting! Be sure to check them out if you're interested in the subject!
Thanks to everyone for your time!
UPDATE
As @Panagiotis Kanavos has mentioned, fetching data by value is slower in the case of arrays.
But what if:
We indeed want to simply iterate over M input arrays simultaneously and produce N output arrays? (Maybe we are simply interested in collating parameters along a timeline and keeping the results.) Arrays are perfectly suitable for this task, but of course we could still use tables. The question is: which will be faster, if our iterative process involves many requests to the arrays' elements (can we rely on the server caching the M input arrays so that they'll always be at hand?) and the creation of multiple result arrays (how long does that take with tables, and how do we know the tables are created in RAM for fast access?)?
We want to create an array manually over the course of a server script, use it only in a C-like style (without fetching its data by value), and have no need for it after the script finishes? This would be a classic C-like script-only array. To me, in this case putting the array directly into RAM is what we need, and it would be more effective than creating a table (which will probably go to the HDD), wouldn't it?
And so, the 2nd (and more general) question arises: How far can we rely on the server's optimizations?
I understand that a huge amount of work has been put into optimization in many ways. But has somebody met a situation when a server didn't optimize in the best way? When a programmer had to explicitly rearrange the code in order to manually bring it to the optimal state?
MySQL will implement a data type, ARRAY, to store
variable-sized arrays, in compliance with Standard
SQL (SQL:2003) array functionality.
Syntax
Add a new column data type:
<data type> ARRAY [[<maximum cardinality>]]
-- <data type> may be any data type supported (except for ARRAY itself, and REF, which MySQL does not support). It defines the type of data that the array will contain.
-- The <maximum cardinality> must be an unsigned integer greater than zero. It defines the maximum cardinality of the array, rather than its exact size. Note that the inner set of brackets is mandatory when defining an array with a specific size, e.g. INT ARRAY is correct for defining an array with the default size, while INT ARRAY[5] is the correct syntax for defining an array that will contain 5 elements.
-- As shown in the syntax diagram, [<maximum cardinality>] is optional. If omitted, the maximum cardinality of the array defaults to an implementation-defined default value. Oracle's VARRAY size is limited to the maximum number of columns allowed in a table, so I suggest we make our default match that maximum. Thus, if [<maximum cardinality>] is omitted, the maximum cardinality of the array defaults to 1000, which should also be the absolute maximum cardinality. Thus:
-- <maximum cardinality> defaults to 1000.
-- <maximum cardinality> may range from 1 to 1000.
Function
An array is an ordered collection of elements, possibly
containing data values. The ARRAY data type will be used
to store data arrays in database tables.
Rules
-- An array is an ordered set of elements.
-- The maximum number of elements in the array is known as
the array's maximum cardinality. The maximum cardinality is
defined at the time the array is defined, as is the element
data type.
-- The actual number of elements that contain data values is
known as the array's cardinality. The cardinality of an array
may vary and is not defined at the time the array is defined.
That is, an instance of an array may always contain fewer
elements than the maximum cardinality allows.
-- Each element may contain a data value that corresponds
to the array's defined data type.
-- Each element has three states: blank (no value assigned
to the element), NULL (NULL assigned to the element), and
containing the valid value (data value assigned to the element).
-- Each element is associated with exactly one ordinal position
in the array. The first array element is found at position 1 (one),
the next at position 2 (two), and so on. Thus, assuming n is the
cardinality of an array, the ordinal position of an array element
is an integer in the range 1 (one) <= element <= n.
-- An array has a maximum cardinality and an actual cardinality.
-- It is an error if one attempts to assign a value to an array
element whose position is greater than the maximum cardinality of
the array.
-- It is not an error if one attempts to assign values to only
some of an array's elements.
-- Privileges:
-- No special privileges are required to create a table
with the ARRAY data type, or to utilize ARRAY data.
-- Comparison:
-- See WL#2084 Add the ability to compare ARRAY data.
-- Assignment:
-- See WL#2082 Add ARRAY element reference function and
WL#2083 Add ARRAY value constructor function.
Other statements
-- Two other syntax elements must be implemented for
the ARRAY data type to be useful. See WL#2082 for
Array Element Reference syntax and WL#2083 for Array Constructor
syntax.
-- Also related:
-- CARDINALITY(). See WL#2085.
-- Array concatenation. See WL#.
An example
Create a table with the new data type:
CREATE TABLE ArrayTable (array_column INT ARRAY[3]);
Insert data:
INSERT INTO ArrayTable (array_column)
VALUES (ARRAY[10,20,30]);
Retrieve data:
SELECT array_column FROM ArrayTable
WHERE array_column <> ARRAY[];
-- Returns all cases where array_column is not an empty array
Reserved words
ARRAY, eventually CARDINALITY

CUDA: Scatter communication pattern

I am learning CUDA from Udacity's course on parallel programming. In a quiz, they have given a problem of sorting a pre-ranked variable (players' heights). Since there is a one-to-one correspondence between the input and output arrays, shouldn't it be a Map communication pattern instead of a Scatter?
CUDA makes no canonical definition of these terms, as far as I know. Therefore my answer is merely a suggestion of how it might be or might have been interpreted.
"Since there is a one-to-one correspondence between the input and output arrays"
This statement doesn't appear to be supported by the diagram, which shows gaps in the output array that have no corresponding input point associated with them.
If a smaller set of values is distributed into a larger array (with resultant gaps in the output array, in which no input value corresponds to the gap locations), then a scatter might be used to describe that operation. Both scatter and map have a mapping that describes where the input values go, but the instructor may have defined scatter and map in such a way as to differentiate between these two cases, with the following plausible definitions:
Scatter: one-to-one relationship from input to output (i.e. unidirectional). Every input location has a corresponding output location, but not every output location has a corresponding input location.
Map: one-to-one relationship between input and output (i.e. bidirectional). Every input location has a corresponding output location, and every output location has a corresponding input location.
Gather: one-to-one relationship from output to input (i.e. unidirectional). Every output location has a corresponding input location, but not every input location has a corresponding output location.
The definition of each communication pattern (map, scatter, gather, etc.) varies slightly from one language/environment/context to another, but since I have followed that same Udacity course I'll try to explain that term as I understand it in the context of the course:
The Map operation calculates each output element as a function of its corresponding input element, i.e.:
output[tid] = foo(input[tid]);
The Gather pattern calculates each output element as a function of one or more (usually more) input elements, not necessarily the corresponding one (typically these are elements from a neighborhood). For example:
output[tid] = (input[tid-1] + input[tid+1]) / 2;
Lastly, the Scatter operation has each input element contribute to one or more (again, usually more) output elements. For instance,
atomicAdd( &(output[tid-1]), input[tid]);
atomicAdd( &(output[tid]), input[tid]);
atomicAdd( &(output[tid+1]), input[tid]);
The example given in the question is clearly not a Map, because each output is calculated from an input at a different location.
Also, it is hard to see how the same example can be a scatter, because each input element only causes one write to the output; but it is indeed a scatter, because each input causes a write to an output whose location is determined by the input.
In other words, each CUDA thread processes an input element at the location associated with its tid (thread ID number) and calculates where to write the result. Usually a scatter writes to several places instead of only one, so this is a particular case that might as well be named differently.
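As a sketch of that particular one-write-per-input case (kernel and array names such as rank and height are my own, not from the course):

__global__ void scatter_by_rank(const float* height, const int* rank,
                                float* sorted, int n)
{
    // Each thread reads the input element at its own tid and computes
    // (here: looks up) where to write it; the write location depends
    // on the input, which is what makes this a scatter.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        sorted[rank[tid]] = height[tid];
}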
Each player has 3 properties (name, height, rank), so I think Scatter is correct, because we have to consider these properties to work out where each output element goes. If each player had only one property, like rank, then I think Map would be correct.
reference: Parallel Communication Patterns Recap in this lecture
reference: map/reduce/gather/scatter with image

Stream filter in CUDA

I have an array of values and a linked list of indexes. Now I only want to keep those values from the array that correspond to the indexes in the linked list. Is there a standard algorithm to do this? Please give an example if possible.
So, suppose I have the array 1, 2, 5, 6, 7, 9
and the linked list 2 -> 3.
I want to keep the values at indexes 2 and 3, that is, keep 5 and 6.
Thus my function should return 5 and 6.
In general, a linked list is inherently serial. Having a parallel machine will not speed up the traversal of your list, hence the number of steps of your problem cannot go below O(n), where n is the size of the list.
However, if you have some additional way to access the list, you can do something with it.
For example, all elements of the list could be stored in a fixed-size array (although not necessarily in a consecutive way). A list member could be represented in the array using the following struct:
struct ListNode {
    bool isValid; // is this cell occupied by a valid list member?
    T data;       // the payload (T is the element type)
    int next;     // position of the next list member in the array
};
The isValid flag says whether a given cell in the array is occupied by a valid list member or is just an empty cell.
Now, a parallel algorithm would read all cells at once, check whether each represents valid data, and if so, do something with it.
Second part: each thread holding a valid index idx into your input array A would mark A[idx] as not-to-be-deleted. Once we know which elements of A should be kept and which removed, a parallel compaction algorithm can be applied; a sketch follows.
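A minimal sketch of those two phases, assuming T = int so that data holds an index into A; Thrust's copy_if stands in here for a hand-rolled scan-based compaction:

#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Same layout as the struct above, specialized to T = int.
struct ListNode {
    bool isValid;
    int  data;   // an index into the input array A
    int  next;
};

// Phase 1: every thread inspects one cell of the fixed-size node array
// at once; valid cells mark the referenced position of A as "keep".
// The keep array must be zero-initialized beforehand.
__global__ void mark_kept(const ListNode* nodes, int numCells, int* keep)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numCells && nodes[i].isValid)
        keep[nodes[i].data] = 1;
}

struct IsKept {
    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
};

// Phase 2: compact A using the flags as a stencil.
thrust::device_vector<int> compact(const thrust::device_vector<int>& A,
                                   const thrust::device_vector<int>& keep)
{
    thrust::device_vector<int> out(A.size());
    auto end = thrust::copy_if(A.begin(), A.end(), keep.begin(),
                               out.begin(), IsKept());
    out.resize(end - out.begin());
    return out;
}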

Efficient set intersection - decide whether the intersection is larger than k

I am faced with a problem where I have to calculate intersections between all pairs in a collection of sets. None of the sets are smaller than a small constant k, and I'm only interested in whether two sets have an intersection larger than k-1 elements or not. I do not need the actual intersections nor the exact size, only whether it's larger than k-1 or not. Is there some clever pre-processing trick or a neat set intersection algorithm that I could use to speed things up?
More info that can be useful to answer the question:
The sets represent maximal cliques in a large, undirected, sparse graph. The number of sets can be in the order of tens of thousands or more, but most of the sets are likely to be small.
The sets are already sorted: the members of each set are in increasing order. Effectively they are sorted lists; I receive them this way from an underlying library for maximal clique search.
Nothing is known about the distribution of elements in the sets (i.e. whether they are in tight clumps or not).
Most of the set intersections are likely to be empty, so the ideal solution would be a clever data structure that helps me cut down the number of set intersections I have to make.
Consider a mapping with all sets of size k as the keys and corresponding values of lists of all sets from your collection that contain the key as a subset. Given this mapping, you don't need to perform any intersection tests: for each key, all pairs of sets from the list will have an intersection of size at least k. This approach can produce the same pair of sets more than once, so that will need to be checked.
The mapping is easy enough to calculate. For each set in the collection, calculate all the size-k subsets and append the original set to the list for that key set. But is this actually faster? In general, no. The performance of this approach will depend on the distribution of the sizes of the sets in the collection and the value of k. With d distinct elements in the sets, you could have as many as d choose k keys, which can be very large.
However, the basic idea is usable to reduce the number of intersections. Instead of using sets of size k, use smaller ones of fixed size q as the keys. The values are again lists of all sets that have the key as a subset. Now, test each pair of sets from the list for intersection. Thus, with q=1 you only test those pairs of sets that have at least one element in common, with q=2 you only test those pairs of sets that have at least two elements in common, and so on. The optimal value for q will depend on the distribution of sizes of the sets, I think.
For the sets in question, a good choice might be q=2. The keys are then just the edges of the graph, giving a predictable size to the mapping. Since most sets are expected to be disjoint, q=2 should eliminate a lot of comparisons without much additional overhead.
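A sketch of the q = 2 bucketing (container choices and names are illustrative; each set's elements are assumed sorted, as stated in the question):

#include <map>
#include <set>
#include <utility>
#include <vector>

// Bucket the sets by every 2-element subset they contain (i.e. by graph
// edges). Only sets sharing a bucket, and hence at least two common
// elements, ever reach the pairwise intersection test. The result is
// deduplicated because the same pair of sets can share many buckets.
std::set<std::pair<int, int>>
candidate_pairs(const std::vector<std::vector<int>>& sets)
{
    std::map<std::pair<int, int>, std::vector<int>> buckets;
    for (int s = 0; s < (int)sets.size(); ++s)
        for (size_t i = 0; i < sets[s].size(); ++i)
            for (size_t j = i + 1; j < sets[s].size(); ++j)
                buckets[{sets[s][i], sets[s][j]}].push_back(s);

    std::set<std::pair<int, int>> pairs;
    for (const auto& kv : buckets)
        for (size_t a = 0; a < kv.second.size(); ++a)
            for (size_t b = a + 1; b < kv.second.size(); ++b)
                pairs.insert({kv.second[a], kv.second[b]});
    return pairs;
}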
One possible optimization, which is more effective the smaller the range of values contained in each set:
Create a list of all the sets, sorted by their kth-greatest element (this is easy to find, since you already have each set with its elements in order). Call this list L.
For any two sets A and B, their intersection cannot have as many as k elements in it if the kth-greatest element in A is less than the least element in B.
So, for each set in turn, calculate its intersection only with the sets in the relevant part of L.
You can use the same fact to exit early from computing the intersection of any two sets - if there are only n-1 elements left to compare in one of the sets, and the intersection so far contains at most k-n elements, then stop. The above procedure is simply this rule applied to all the sets in L at once, with n=k, at the point where we're looking at the least element of set B and the kth-greatest element of A.
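A sketch of the early-exit test on two sorted sets (my formulation of the rule above):

#include <algorithm>
#include <vector>

// Return whether the intersection of A and B has at least k elements,
// for two ascending sorted lists, stopping as soon as the remaining
// elements cannot possibly reach k.
bool intersection_at_least_k(const std::vector<int>& A,
                             const std::vector<int>& B, int k)
{
    size_t i = 0, j = 0;
    int common = 0;
    while (i < A.size() && j < B.size()) {
        // Even matching every remaining element cannot reach k: stop.
        if (common + (int)std::min(A.size() - i, B.size() - j) < k)
            return false;
        if (A[i] < B[j])
            ++i;
        else if (B[j] < A[i])
            ++j;
        else {
            if (++common >= k) return true;   // reached k: stop early
            ++i; ++j;
        }
    }
    return common >= k;
}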
The following strategy should be quite efficient. I've used variations of this for intersecting ascending sequences on a number of occasions.
First, I assume that you have some sort of priority queue available (if not, rolling your own heap is pretty easy), and a fast key/value lookup (btree, hash, whatever).
With that said, here is pseudocode for an algorithm that should do what you want quite efficiently.
# Initial setup
sets = array of all sets
intersection_count = key/value lookup with keys (set_pos, set_pos) and counts as values
p_queue = priority queue containing (sets[set_pos][0], 0, set_pos) for every set, organized by the first component

# helper function
def process_intersections(current_sets):
    for all pairs of current_sets:
        if pair in intersection_count:
            intersection_count[pair] += 1
        else:
            intersection_count[pair] = 1

# Find all intersections
current_sets = []
last_element = first element of first thing in p_queue
while p_queue is not empty:
    (element, ind, set_pos) = pop the top element from p_queue
    if element != last_element:
        process_intersections(current_sets)
        last_element = element
        current_sets = []
    current_sets.append(set_pos)
    ind += 1
    if ind < len(sets[set_pos]):
        add (sets[set_pos][ind], ind, set_pos) to p_queue

# Don't forget the last one!
process_intersections(current_sets)

final_answer = []
for (pair, count) in intersection_count.iteritems():
    if k-1 < count:
        final_answer.append(pair)
The running time will be O(sum(sizes of sets) * log(number of sets) + count(times a point is in a pair of sets)). In particular, note that if two sets have no intersection, you never try to intersect them.
What if you used a predictive subset as a prequalifier? Pre-sort, but use a subset intersection as a threshold condition: if the subset intersection is > n%, complete the intersection; otherwise abandon it. n then becomes the inverse of your comfort level with the prospect of a false positive.
You could also sort by the subset intersections (m) calculated earlier and run the full intersections ordered by m descending. Presumably the majority of your highest-m intersections would cross your k threshold on the full set, and the probability of hitting your k threshold would continually decrease.
This really starts to treat the problem as NP-Complete.

Generating unique codes that differ in at least two digits

I want to generate unique code numbers (composed of exactly 7 digits). Each code number is generated randomly and saved in a MySQL table.
I have another requirement: all generated codes should differ in at least two digits. This is useful to prevent errors while typing a user code. Hopefully, it will prevent referring to another user's code during some operation, as it is unlikely to mistype two digits and still match another existing user code.
The generation algorithm works simply, like this (a digit-wise sketch of the comparison follows the list):
1. Retrieve all previous codes, if any, from the MySQL table.
2. Generate one code at a time.
3. Subtract the generated code from each of the previous codes.
4. Check the number of non-zero digits in each subtraction result.
5. If it is > 1, accept the generated code and add it to the previous codes.
6. Otherwise, jump to 2.
7. Repeat steps 2 to 6 for the number of requested codes.
8. Save the generated codes in the DB table.
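For concreteness, a digit-wise sketch of the check in steps 3-4 (my rephrasing: comparing digit positions directly avoids the borrow effects that plain arithmetic subtraction can introduce):

#include <string>

// Treat codes as fixed-width 7-character strings, e.g. "0034912",
// and count the positions where they differ.
bool differ_in_at_least_two_digits(const std::string& a, const std::string& b)
{
    int diff = 0;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (a[i] != b[i] && ++diff >= 2)
            return true;   // two differences found: accept early
    return false;
}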
The algorithm works fine, but the problem is performance. It takes a very long time to finish when a large number of codes is requested, e.g. 10,000.
The question: Is there any way to improve the performance of this algorithm?
I am using Perl + MySQL on an Ubuntu server, if that matters.
Have you considered a variant of the Luhn algorithm? Luhn is used to generate a check digit for strings of numbers in lots of applications, including credit card account numbers. It's part of the ISO-7812-1 standard for generating identifiers. It will catch any number that is entered with one incorrect digit, which implies any two valid numbers differ in at least two digits.
Check out Algorithm::LUHN on CPAN for a Perl implementation.
Don't retrieve the existing codes, just generate a potential new code and see if there are any conflicting ones in the database:
SELECT code FROM table WHERE abs(code-?) regexp '^[1-9]?0*$';
(where the placeholder is the newly generated code).
Ah, I missed the generating lots of codes at once part. Do it like this (completely untested):
my @codes = existing_codes();
my $frontwards_index = {};
my $backwards_index = {};
for my $code (@codes) {
    index_code($code, $frontwards_index);
    index_code(reverse($code), $backwards_index);
}
my @new_codes = map generate_code($frontwards_index, $backwards_index), 1..10000;

sub index_code {
    my ($code, $index) = @_;
    # bucket codes by their front half: two codes that differ in at most
    # one digit must agree on the front half or on the back half
    push @{ $index->{ substr($code, 0, length($code)/2) } }, $code;
    return;
}

sub check_index {
    my ($code, $index) = @_;
    # xor leaves "\0" where digits match; y/\0//c counts the mismatches
    my $found = grep { ($_ ^ $code) =~ y/\0//c <= 1 }
                @{ $index->{ substr($code, 0, length($code)/2) } };
    return $found;
}

sub generate_code {
    my ($frontwards_index, $backwards_index) = @_;
    my $new_code;
    do {
        $new_code = sprintf("%07d", rand(10000000));
    } while check_index($new_code, $frontwards_index)
        || check_index(reverse($new_code), $backwards_index);
    index_code($new_code, $frontwards_index);
    index_code(reverse($new_code), $backwards_index);
    return $new_code;
}
Put the numbers 0 through 9,999,999 in an augmented binary search tree. The augmentation is to keep track of the number of sub-nodes to the left and to the right. So for example when your algorithm begins, the top node should have value 5,000,000, and it should know that it has 5,000,000 nodes to the left, and 4,999,999 nodes to the right. Now create a hashtable. For each value you've used already, remove its node from the augmented binary search tree and add the value to the hashtable. Make sure to maintain the augmentation.
To get a single value, follow these steps.
Use the top node to determine how many nodes are left in the tree. Let's say you have n nodes left. Pick a random number r between 1 and n. Using the augmentation, you can find the r-th node in your tree in log(n) time.
Once you've found that node, compute all the values that would make the value at that node invalid. Let's say your node has value 1,111,111. If you already have 2,111,111 or 3,111,111 or... then you can't use 1,111,111. Since there are 9 other options per digit and 7 digits, you only need to check 63 possible values. Check to see if any of those values are in your hashtable. If you haven't used any of those values yet, you can use your random node. If you have used any of them, then you can't.
Remove your node from the augmented tree. Make sure that you maintain the augmented information.
If you can't use that value, return to step 1.
If you can use that value, you have a new random code. Add it to the hashtable.
Now, checking to see if a value is available takes O(1) time instead of O(n) time. Also, finding another available random value to check takes O(log n) time instead of... ah... I'm not sure how to analyze your algorithm.
Long story short, if you start from scratch and use this algorithm, you will generate a complete list of valid codes in O(n log n). Since n is 10,000,000, it will take a few seconds or something.
Did I do the math right there everybody? Let me know if that doesn't check out or if I need to clarify anything.
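For the "find the r-th remaining node" step, here is a compact sketch using a Fenwick (binary indexed) tree in place of the augmented BST; the two structures are interchangeable for rank-select, and this substitution is mine:

#include <vector>

// Order-statistic structure over the values 0..n-1, all initially present.
// remove() deletes a value; kth() finds the k-th (1-based) remaining value,
// both in O(log n), mirroring the augmented-BST selection described above.
struct FenwickSelect {
    int n;
    std::vector<int> bit;
    explicit FenwickSelect(int n) : n(n), bit(n + 1, 0) {
        for (int i = 1; i <= n; ++i) {        // O(n) build: all values present
            bit[i] += 1;
            int j = i + (i & -i);
            if (j <= n) bit[j] += bit[i];
        }
    }
    void remove(int v) {                      // mark value v as used
        for (int i = v + 1; i <= n; i += i & -i) bit[i] -= 1;
    }
    int kth(int k) const {                    // k-th remaining value, k >= 1
        int pos = 0, pw = 1;
        while ((pw << 1) <= n) pw <<= 1;      // highest power of two <= n
        for (; pw > 0; pw >>= 1)
            if (pos + pw <= n && bit[pos + pw] < k) {
                pos += pw;
                k -= bit[pos];
            }
        return pos;                           // the value, 0-based
    }
};

Picking a uniform random r in [1, number of values remaining] and calling kth(r) replaces the random-node lookup, and remove() maintains the counts as codes are consumed.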
Use a hash.
After generating a successful code (one that does not conflict with any existing code), put that code in the hash table, and also put the 63 other codes that differ from it by exactly one digit into the hash.
To see if a randomly generated code will conflict with an existing code, just check if that code exists in the hash.
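A sketch of that scheme (illustrative code, not the answerer's): each accepted code inserts itself plus its 63 one-digit variants, so a candidate conflicts exactly when it is already present in the set:

#include <string>
#include <unordered_set>

// Record an accepted 7-digit code along with the 63 codes that differ
// from it in exactly one digit. A later candidate is then within one
// digit of some accepted code iff the candidate is already in `seen`.
void add_code(std::unordered_set<std::string>& seen, const std::string& code)
{
    seen.insert(code);
    for (size_t i = 0; i < code.size(); ++i) {
        std::string variant = code;
        for (char d = '0'; d <= '9'; ++d) {
            if (d == code[i]) continue;
            variant[i] = d;
            seen.insert(variant);
        }
    }
}

bool is_available(const std::unordered_set<std::string>& seen,
                  const std::string& candidate)
{
    return seen.find(candidate) == seen.end();   // O(1) expected lookup
}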
How about:
Generate a 6-digit code by autoincrementing the previous one.
Generate a 1-digit code by incrementing the previous one mod 10.
Concatenate the two.
Presto, guaranteed to differ in two digits. :D
(Yes, I'm being slightly facetious. I'm assuming that 'random', or at least quasi-random, is necessary. In which case: generate a 6-digit random key, repeat until it's not a duplicate (i.e. make the column unique and repeat until the insert doesn't fail the constraint), then generate a check digit, as someone already said.)