Prim's and Kruskal's algorithms - minimum-spanning-tree

Prim's and Kruskal's algorithms both produce a minimum spanning tree. According to the cut property, the total cost of the tree will be the same for both algorithms, but is it possible that these two algorithms give different MSTs with the same total cost, given that we choose in alphabetical order when faced with multiple choices? For example, we compare max(source, dest): for edges A->B and B->C, we compare A from A->B with B from B->C.
Thank you

Assuming that your comparator handles the case where both edges are equal in cost and have the same max(source, dest) character, it will never declare any two edges equal. For there to be the possibility of multiple MSTs, at least two edges in the graph must be equal. Therefore, the MST is unique, and both Prim's and Kruskal's algorithm will return the same result.
On the other hand, if your comparator declares the edges A->B (cost 1) and A->C (cost 1) equal, then there is a possibility that the algorithms will generate different MSTs, depending on which edge they consider first (A->B or A->C).
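For illustration, a tie-breaking comparator along those lines might look like the sketch below (the Edge struct and field names are assumptions for this example, not code from the question):

    #include <algorithm>

    // Hypothetical edge type, purely for illustration.
    struct Edge {
        char source;
        char dest;
        int  cost;
    };

    // Orders edges by cost, breaking ties alphabetically on max(source, dest)
    // and then on min(source, dest), so no two distinct (undirected) edges
    // ever compare equal.
    bool edge_less(const Edge& a, const Edge& b) {
        if (a.cost != b.cost) return a.cost < b.cost;
        const char a_hi = std::max(a.source, a.dest);
        const char b_hi = std::max(b.source, b.dest);
        if (a_hi != b_hi) return a_hi < b_hi;
        return std::min(a.source, a.dest) < std::min(b.source, b.dest);
    }

Used as the ordering for Kruskal's sorted edge list or for Prim's priority queue, a strict ordering like this makes every tie-break deterministic, so both algorithms prefer the same edge whenever weights are equal.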

It is definitely possible for one graph to have multiple MSTs, so long as those different representations of the MSTs have the same total weight. Otherwise, the one with the lower total weight would be the true MST and the other would no longer be an MST.
Because Prim's and Kruskal's algorithms have different steps, it is possible that they would choose different edges of the same weight during the actual traversal, yet still end with the same total weight.
However, if you add the restriction that you stated in your question (choosing the edge whose endpoint comes first in the alphabet), the MSTs from Prim's and Kruskal's should be the same tree, because every decision between edges of equal weight is resolved in favour of the same edge for both Kruskal's and Prim's.


Can I find price floors and ceilings with CUDA

Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable for the floor and ceiling. If the current price breaks through the floor or ceiling, or changes more than the reversal threshold, I can take the appropriate action.
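For concreteness, a minimal sequential sketch of that logic (assuming a box size of 1 unit and the 3-unit reversal threshold; the data and names are illustrative):

    #include <iostream>
    #include <vector>

    int main() {
        // Illustrative price series and parameters.
        std::vector<double> prices = {10, 11, 12, 13, 12.5, 13.5, 10, 9, 8, 11.5};
        const double box = 1.0;          // price step per X or O
        const double reversal = 3.0;     // reversal threshold in price units

        bool rising = true;                 // direction of the current column
        double extreme = prices.front();    // current ceiling (rising) or floor (falling)

        for (double p : prices) {
            if (rising) {
                if (p >= extreme + box) {              // broke through the ceiling
                    extreme = p;
                    std::cout << "X";                  // extend the X column
                } else if (extreme - p >= reversal) {  // reversal past the threshold
                    rising = false;
                    extreme = p;
                    std::cout << "\n(new O column)\nO";
                }                                      // smaller pullbacks: do nothing
            } else {
                if (p <= extreme - box) {              // broke through the floor
                    extreme = p;
                    std::cout << "O";
                } else if (p - extreme >= reversal) {  // reversal past the threshold
                    rising = true;
                    extreme = p;
                    std::cout << "\n(new X column)\nX";
                }
            }
        }
        std::cout << "\n";
        return 0;
    }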
My question is, is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite difference algorithms from NVIDIA. These produce local maxima / minima but not the reversal points. Small fluctuations produce numerous relative maxima / minima, but most of them are trivial because the change is not greater than the reversal size.
My question is, is there a way to find these reversal points in parallel?
One possible approach:
use thrust::unique to remove periods where the price is numerically constant
use thrust::adjacent_difference to produce 1st difference data
use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
use these points of change in sign of slope to identify separate regions of data - build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run
Wherever the net change of the run exceeds a threshold, flag as a "reversal"
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.
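A rough host-side sketch of that pipeline is below. The sample data and names are illustrative, and inclusive_scan_by_key is used in place of the exclusive variant so that each element carries the running net change of its run; in the real application the host loops would become device-side transforms on a thrust::device_vector.

    #include <thrust/adjacent_difference.h>
    #include <thrust/host_vector.h>
    #include <thrust/scan.h>
    #include <thrust/unique.h>
    #include <cmath>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<float> raw = {10, 10, 11, 12, 13, 14, 12, 11, 10, 13, 14};
        thrust::host_vector<float> prices(raw.begin(), raw.end());
        const float reversal_threshold = 3.0f;

        // 1. Remove consecutive duplicates (periods where the price is constant).
        prices.erase(thrust::unique(prices.begin(), prices.end()), prices.end());
        const int n = prices.size();

        // 2. First differences; element 0 is just a copy of prices[0], so zero it.
        thrust::host_vector<float> d1(n);
        thrust::adjacent_difference(prices.begin(), prices.end(), d1.begin());
        d1[0] = 0;

        // 3. Flag positions where the sign of the slope changes.
        thrust::host_vector<int> flip(n, 0);
        for (int i = 2; i < n; ++i)
            flip[i] = ((d1[i] > 0) != (d1[i - 1] > 0)) ? 1 : 0;

        // 4. Prefix-sum the flags to build a key vector labelling each "run".
        thrust::host_vector<int> keys(n);
        thrust::inclusive_scan(flip.begin(), flip.end(), keys.begin());

        // 5. Segmented scan of the differences: running net change within each run.
        thrust::host_vector<float> net(n);
        thrust::inclusive_scan_by_key(keys.begin(), keys.end(), d1.begin(), net.begin());

        // 6. Flag a reversal wherever a run's net change crosses the threshold.
        for (int i = 1; i < n; ++i)
            if (std::fabs(net[i]) >= reversal_threshold)
                std::cout << "reversal candidate at index " << i
                          << " (net change " << net[i] << ")\n";
        return 0;
    }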

Dividing a vector of points into two spaces

I have a memory-mapped file of many millions of 3D points as an STL vector using CGAL. Given an arbitrary plane that divides the dataset into approximately equal parts, I would like to sort the dataset such that all inside points are contiguous in the vector, and likewise the outside points. This process then needs to be repeated to an arbitrary depth, creating a non-axis-aligned BSP tree.
Due to the size of the dataset I would like to do this in place if possible. I have a predicate functor that I use to create a filtered_iterator, but of course that doesn't sort the points, it just skips non-matching ones. So I could create a second vector and copy the sorted points into that, and then re-use the original vector round-robin style, but I would like to avoid that if possible, if only to keep the iterators that mark the start and end of each space.
Of course, by invoking the question gods, I received direct communication from them almost as soon as I posted!
I had simply been blind to the STL algorithm std::partition, which does exactly what I need.
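For reference, a minimal sketch of that approach, using plain Point and Plane structs instead of the CGAL types purely for illustration:

    #include <algorithm>
    #include <vector>

    // Plain stand-ins for the CGAL types, purely for illustration.
    struct Point { double x, y, z; };
    struct Plane { double a, b, c, d; };   // ax + by + cz + d = 0

    // True if the point lies on the non-negative ("inside") side of the plane.
    bool inside(const Plane& p, const Point& q) {
        return p.a * q.x + p.b * q.y + p.c * q.z + p.d >= 0.0;
    }

    // Reorders the vector in place so the inside points come first, and
    // returns the iterator marking the boundary between the two half-spaces.
    std::vector<Point>::iterator split(std::vector<Point>& pts, const Plane& plane) {
        return std::partition(pts.begin(), pts.end(),
                              [&plane](const Point& q) { return inside(plane, q); });
    }

std::partition runs in linear time and works in place; std::stable_partition preserves relative order but may allocate a temporary buffer. The returned iterator is exactly the split point to recurse on for the next level of the BSP tree.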

Alternatives to the Big-O notation?

Good afternoon all,
We say that a hashtable has O(1) lookup (provided that we have the key), whereas a linked list has O(1) lookup for the next node (provided that we have a reference to the current node).
However, because of how Big-O notation works, it is not very useful for expressing (or differentiating) the cost of an algorithm that takes x steps versus one that takes x + m steps.
For example, even though we label both the hashtable's lookup and the linked list's lookup as O(1), these two O(1)s boil down to very different numbers of steps.
The linked list's lookup is fixed at x steps. However, the hashtable's lookup is variable. The cost of the hashtable's lookup depends on the cost of the hashing function, so the number of steps required for the hashtable's lookup is x + m,
where x is a fixed number
and m is an unknown variable value
In other words, even though we call both operations O(1), the cost of the hashtable's lookup can be considerably higher than the cost of the linked list's lookup.
Big-O notation is specifically about the size of the input data collection. This has its advantages, but it has its disadvantages as well, as can be seen when we collapse and normalize all non-n terms into a constant: we can no longer see the m variable (the cost of the hashing function) inside it.
Besides Big-O notation, is there another (established) notation we can use to express the fixed-cost O(1), which means x operations, as distinct from the variable-cost O(1), which means x + m (m being the cost of the hashing function) operations?
literal O(1) which means exactly 1 operation
Except it doesn't. Big-O notation concerns the relative comparison of complexity in relation to an input. If the algorithm takes a constant number of steps, completely independent of the size of your input, then the exact number of steps doesn't matter.
Take a look at the (informal) definition of O(g(n)):
It means: there is a certain constant k so that, for every (sufficiently large) n, the function f is no larger than k times the function g.
In the case above, the hashtable lookup and the linked-list lookup would each play the role of f, and g would be g(n) = 1. For each case, you are able to find a k such that f(n) <= k * g(n).
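Spelled out with the x and m from the question (this is just the standard definition, written in the same notation as above):

    f(n) = O(g(n))  <=>  there exist constants k > 0 and n0 such that
                         f(n) <= k * g(n) for every n >= n0

    For the hashtable lookup, f(n) = x + m (independent of n) and g(n) = 1.
    Choosing k = x + m gives f(n) <= k * g(n) for every n, so the lookup is
    O(1) no matter how large the constant m is.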
Now, this k doesn't need to be fixed; it can vary depending on platform, implementation, and specific hardware. The only interesting point is that it exists. That's why both the hashtable lookup and the linked list node lookup are O(1): both have a constant complexity, regardless of input. And when evaluating algorithms, that's what's interesting, not the physical steps.
Specifically concerning the Hashtable lookup
Yes, the hash function does take a variable number of operations (depending on the implementation). However, it doesn't take a variable number of operations depending on the size of the input. Big-O notation is specifically about the size of the input data collection. A hash function takes a single element. For the evaluation of an algorithm it doesn't matter whether a certain function takes 10, 20, 50 or 100 operations; if the number of operations doesn't increase with the input size, it is O(1). There is no way to distinguish this in Big-O notation, as this isn't what Big-O notation is about.
"~" includes the constant factor - see the family of Bachmann-Landau notations.
The issue is that the "number of operations" is highly context-dependent. In fact, that's why big-O notation was invented -- it seems to work rather well in modelling a broad range of computers.
Besides, what a programmer thinks the number of "ops" is doesn't determine how much time it actually takes (e.g. is it already in cache?), or how many steps the hardware actually takes (what does your processor do, exactly? Does it have micro-ops?), or even how many operations are dictated to the processor (what is your compiler doing for you?). And those are all concerns, even when you try to define a precise concept that's abstract enough to be useful.
So. For now, it's Big-O vs. "operations" -- whatever "operations" means to you and your colleagues at the time.

What is the time complexity of lookups in directed acyclic word graphs?

A directed acyclic word graph is a great data structure for certain tasks. I can't find any information on the time complexity of performing a lookup though.
I would guess it depends linearly on the average word length, and logarithmically on the number of words in the graph.
So is it O(L * log W), where W is the number of words and L is the average word length?
I think the complexity is just O(L). The number of operations is proportional to the length of the word, and it does not matter how many entries the structure has. (There might be differences based on the implementation of node searching, but even in the worst case with the worst implementation that is just a constant factor, with an upper limit equal to the size of the alphabet.)
I’d say it’s just O(L). For each lookup of a word of n characters, you always follow at most n edges, irrespective of how many other edges there are.
(That’s assuming a standard DAWG in which each node has outgoing edges for every letter of the alphabet, i.e. 26 for English. Even if you have fewer outgoing edges per node and therefore more levels, the number of edges to follow is still at most a constant multiple of n, so we still get O(L).)
How many words you already have in your structure seems to be irrelevant.
Even if, at each step, you perform a linear search for the correct edge to follow from the current node, this is still constant-time because the alphabet is bounded, and therefore so is the number of outgoing edges from each node.
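As a small sketch of such a lookup, assuming lowercase a-z and an array of 26 child pointers per node (the node layout and names are illustrative):

    #include <array>
    #include <string>

    // Illustrative node layout: one slot per letter of a lowercase a-z alphabet.
    struct DawgNode {
        std::array<const DawgNode*, 26> edges{};  // nullptr where no edge exists
        bool terminal = false;                    // true if a word ends here
    };

    // Follows at most word.size() edges, so the cost is O(L) no matter how
    // many words the structure stores.
    bool contains(const DawgNode* root, const std::string& word) {
        const DawgNode* node = root;
        for (char ch : word) {
            node = node->edges[ch - 'a'];
            if (node == nullptr) return false;
        }
        return node->terminal;
    }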

Full-text search relevance is measured in?

I am making a quiz system, and when quizmakers insert questions into the Question Bank, I am to check the DB for duplicate / very highly similar questions.
Testing MySQL's MATCH() ... AGAINST(), the highest relevance I get is 30+, when I test against a 100% similar string.
So what exactly is the relevance? To quote the manual:
Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.
My problem is how to use the relevance value to tell whether a string is a duplicate. If it's a 100% duplicate, prevent it from being inserted into the Question Bank. But if it is only fairly similar, prompt the quizmaker to verify it and then insert or not. So how do I do that? 30+ for a 100% identical string is not a percentage, so I'm stumped.
Thanks in advance.
The basic data structure for a text retrieval system is an Inverted Index. This is essentially a list of words found in the document collection with a list of the documents they occur in. It can also have metadata about the occurrence for each document, such as the number of times the word appears.
Documents containing the words can be queried by matching on the search terms. To determine relevance, a heuristic known as Cosine Ranking is calculated on the hits. This works by constructing an n-dimensional vector with one component for each of the n search terms. You can also weight the search terms if desired. This vector gives a point in n-dimensional space that corresponds to your search terms.
A similar vector based on the weighted occurrences in each document can be constructed from the inverted index, with each axis in the vector corresponding to the axis for each search term. If you calculate the dot product of these vectors (normalised to unit length), you get the cosine of the angle between them. A value of 1.0 is equivalent to cos(0), meaning the vectors lie on a common line from the origin. The closer the vectors are to each other, the smaller the angle and the closer the cosine is to 1.0.
If you sort the search results by the cosine (or bung them into a priority queue as mg does) you get the most relevant. Cleverer relevance algorithms tend to fiddle with the weights of the search terms, skewing the dot product in favour of terms with high relevance.
If you want to dig a little deeper, Managing Gigabytes by Witten, Moffat and Bell discusses the internal architecture of text retrieval systems.
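As a rough illustration of that cosine calculation (assuming documents and queries have already been reduced to term-weight maps; the types and names are illustrative):

    #include <cmath>
    #include <map>
    #include <string>

    // term -> weighted occurrence count; an illustrative representation only.
    using TermVector = std::map<std::string, double>;

    // Cosine of the angle between two term vectors: 1.0 for identical
    // directions, 0.0 when they share no terms.
    double cosine_similarity(const TermVector& a, const TermVector& b) {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (const auto& kv : a) {
            norm_a += kv.second * kv.second;
            auto it = b.find(kv.first);
            if (it != b.end()) dot += kv.second * it->second;  // shared terms only
        }
        for (const auto& kv : b) norm_b += kv.second * kv.second;
        if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }

Sorting hits by this value (or pushing them into a priority queue keyed on it) gives the most relevant documents first.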
andygeers is on the right track: those numbers have no empirical meaning other than their relations to each other and cannot be used on their own to determine what is or is not an "exact match". You need to determine that yourself. Even aside from the limitations of fulltext search ranking, there's also the open question of just what you consider to constitute an "exact match". (Actual text only, or do soundex matches count? Do synonyms (e.g., "couch" vs. "sofa") count as matching or as distinct? Should an attempt be made to compensate for misspellings? Etc.)
If I had the need to perform such a check, I would grab only the highest-ranked entry returned by the fulltext search, remove any designated stopwords, normalize whitespace, convert to lowercase, do the comparison, and leave it at that until I encountered a case that called for it to be refined further. It's not really all that much extra work - if you specify the language you're using for your application, you could probably find someone around here who could write the normalization function within a dozen or so lines of code.
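A rough sketch of that normalization step (the stopword list and helper names are illustrative assumptions, not anything provided by MySQL):

    #include <cctype>
    #include <set>
    #include <sstream>
    #include <string>

    // Lowercases, collapses whitespace, and drops designated stopwords.
    std::string normalize(const std::string& text,
                          const std::set<std::string>& stopwords) {
        std::string lowered;
        lowered.reserve(text.size());
        for (unsigned char ch : text)
            lowered += static_cast<char>(std::tolower(ch));

        std::istringstream in(lowered);   // stream extraction collapses whitespace
        std::string word, out;
        while (in >> word) {
            if (stopwords.count(word)) continue;   // skip stopwords
            if (!out.empty()) out += ' ';
            out += word;
        }
        return out;
    }

    // Usage: treat the candidate as a duplicate only if the normalized forms
    // of the new question and the top-ranked hit are identical, e.g.
    // bool duplicate = normalize(new_question, stops) == normalize(top_hit, stops);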
I don't know the specifics of the MySQL function you're using, but I imagine it could be that there is no absolute meaning for those numbers - they're just designed to be compared with other values produced by the same function. To check for an absolute match you could select out the text itself and compare manually.