Why do we only put duplicate nodes of a binary search tree in either the right or left subtree, but not both?

A question on my final exam asked why we only put duplicate nodes of the root of a binary search tree in either the right or left subtree, but not both. So why is just putting duplicates of the root anywhere bad, and what good thing happens from selecting either right or left?
The question didn't give any elaboration or a diagram; this was a final exam, so there must have been an implicit assumption that this is a common concept or practice. I still got a B on the exam, but I missed this one and still don't understand why you can't just put any value anywhere regardless of duplicates.

Why do we only put duplicate nodes of the root of a binary search tree in either the right or left subtree?
I suppose you mean duplicate values, which get stored in separate nodes.
The algorithm for storing a new value in a binary search tree needs to decide whether to go left or right. The choice is unique when the value to store is different from the root's value. When it's less, it should be stored in the left subtree, when it's greater, it should be stored in the right subtree.
However, when it is equal, it becomes a matter of choice: both options are fine. So in practice an implementation will in that case always choose left, or will always choose right -- that's just the simplest way to implement it.
why is just putting duplicates of the root anywhere bad
No-one says that is bad, but consider the following:
if insertion sometimes has to go into the left subtree and sometimes into the right one, then you need some additional mechanism to drive that choice, e.g. random number generation, which just slows down the process.
if you ever need to count the number of duplicates of a given value in the tree, it is easier when you know they are all on one side of the root. Otherwise you need to traverse in two directions to find them all. That is also possible, but leads to a bit more code.
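Both points can be illustrated with a minimal sketch in Python, assuming the "duplicates always go right" convention (`Node`, `insert`, and `count_equal` are illustrative names, not from any particular textbook):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key; equal keys always go right, so duplicates form a
    chain down the right spine of the first occurrence."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:  # key >= root.key: the "duplicates go right" convention
        root.right = insert(root.right, key)
    return root

def count_equal(root, key):
    """Count duplicates by always descending the same way on equality."""
    count = 0
    node = root
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            count += 1
            node = node.right  # all remaining duplicates are to the right
    return count
```

Because equal keys consistently go one way, `count_equal` never has to branch in two directions.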
Still, if the binary search tree has additional mechanics to keep it well balanced, like for example AVL or red-black trees have, then duplicate values can end up at either side of the root through rotation. For instance, if your binary search tree is an AVL tree, and the insertion algorithm is such that equal values are always inserted in the right subtree, then see what inserting the value 6 in the following tree does:
    5                5                 5
     \   insertion    \    rotation   / \
      5  =======>      5  =======>   5   6
                        \
                         6 (inserted)
... and so the duplicate value is suddenly at the left side.

Related

MySQL writing efficient procedures for geometry queries

This is a question regarding efficiency because what I want to write is likely to break my machine.
Brief description. I have two sets of data,
Set1 contains ~2500 entries, each entry has a polygon associated.
Set2 contains ~4000 entries, each entry has a point associated.
I want to find out which polygons from set1 enclose which points from set2. All points and polygons are unique and do not overlap.
I was about to embark on writing a procedure using a nested cursor that will look at a point in set2 scroll through all of set1 and find a polygon that encloses the point.
Then I realized how much data I have, that I will want to run it more than once, and this may take quite a while. Is there a better way?
MySQL's geometry support is fine for this, and it does not look like it needs to be a lengthy process; you may not need a nested cursor at all.
You can create a view that pairs each point from set2 with all polygons from set1 enclosing it.
That view can then be queried from your code for selection lists, etc., replacing the cursor you had planned. Once the view is defined, MySQL will have the output almost ready whenever you need it.
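If you do end up testing points in application code instead, the underlying check is a standard ray-casting test. A rough sketch (it does not handle points exactly on a boundary, which a real spatial predicate such as MySQL's ST_Contains is designed to do for you):

```python
def point_in_polygon(pt, poly):
    """Ray casting: cast a horizontal ray from pt and count edge
    crossings; an odd count means the point is inside.
    poly is a list of (x, y) vertices in order."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            # x coordinate where the edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

With ~2500 polygons and ~4000 points, the naive all-pairs loop is ten million such tests, which is why pushing the work into the database (with a spatial index) scales better.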

Adding an order column to database

I have a table containing articles.
By default, the articles are sorted based on their date added (desc.) so newest articles appear first.
However, I would like to give the editor the ability to change the order of the articles so they can be displayed in the order he likes. So I am thinking of adding an integer "order" column.
I am in a dilemma over how to handle this, because when an article's order is edited, I don't want to have to change all the others.
What is the best practice for this problem, and how do other CMSes like WordPress handle it?
Updating the records between the moved record's original position and its new position might be the simplest and most reliable solution, and can be accomplished in two queries, assuming you don't have a unique key on the ordering column.
The idea suggested by Bill's comment sounds like a good alternative, but with enough moves in the same region (about 32 for float, and 64 for double) you could still run into precision issues that will need to be checked for and handled.
Edit: OK, I was curious and ran a test; it looks like you can halve a float column 149 times between 0 and 1 (only taking 0.5, 0.25, 0.125, etc., not counting 0.75 and the like), so it may not be a huge worry.
Edit 2: Of course, all that means is that a malicious user can cause a problem by simply moving the third item between the first and second items 150 times (i.e. "swapping" the 2nd and 3rd by moving the new third).
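The worst case above is easy to reproduce. A sketch in Python (whose floats are IEEE doubles, not MySQL's 4-byte FLOAT, so the numbers differ): repeatedly inserting between a fixed key and the most recent midpoint exhausts the 52-bit mantissa quickly when both keys are near 1.0, while heading toward 0 lasts much longer because the exponent can keep shrinking:

```python
def midpoint_insertions(a, b):
    """Count how many new keys can be placed halfway between a and the
    most recently inserted key before the midpoint collides with an
    existing key (the point where float ordering breaks down)."""
    count = 0
    while True:
        mid = (a + b) / 2
        if mid == a or mid == b:  # no representable value left in between
            return count
        b = mid  # keep inserting between a and the newest key
        count += 1
```

Between 1.0 and 2.0 this stops after 52 insertions (one per mantissa bit); between 0.0 and 1.0 it runs over a thousand times because subnormals extend the range, which matches the spirit of the 149 figure measured for single-precision FLOAT.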
More challenging is the UI to facilitate the migration of items.
First, determine what the main goal(s) are. Interview the Editors, then "read between the lines" -- they won't really tell you what they want.
If the only goal is to move an item to the top of the list once, then you could simply have a flag saying which item needs to come first. (Beware: once the Editors have this feature, they will ask for more!)
Move an item to the 'top' of the list, but newer items will be inserted above it.
Move an item to the 'top' of the list, but newer items will be inserted below it.
Swap pairs of adjacent items. (This is often seen in UIs with only a small number of items; not viable for thousands -- unless the rearrangement is localized.)
Major scrambling.
Meanwhile, the UI needs to show enough info to be clear what the items are, yet compact enough to fit on a single screen. (This may be an unsolvable problem.)
Once you have decided on a UI, the internals in the database are not a big deal. INT vs FLOAT -- either would work.
INT -- easy for swapping adjacent pairs; messier for moving an item to the top of the list.
FLOAT -- runs out of steam after about 20 rearrangements (in the worst case). DOUBLE would last longer; BIGINT could simulate such -- by starting with large gaps between items' numbers.
Back to your question -- I doubt if there is a "standard" way to solve the problem. Think of it as a "simple" problem that can be dealt with.
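The BIGINT-with-gaps idea mentioned above can be sketched as follows (the gap size and function names are illustrative choices, not a standard):

```python
GAP = 1_000_000  # initial spacing between adjacent sort keys

def initial_keys(n):
    """Assign widely spaced integer sort keys to n items."""
    return [GAP * (i + 1) for i in range(n)]

def key_between(prev_key, next_key):
    """Return an integer sort key strictly between two neighbours, or
    None when the gap is exhausted and a renumbering pass is needed."""
    if next_key - prev_key < 2:
        return None  # caller must renumber (one UPDATE over the table)
    return (prev_key + next_key) // 2
```

Moving an article then costs a single-row update in the common case, with an occasional full renumbering when a gap closes -- the same trade-off as the FLOAT scheme, but with an explicit, checkable failure condition instead of silent precision loss.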

Lock free doubly linked skip list

There is a ton of research on lock-free doubly linked lists. Likewise, there is a ton of research on lock-free skip lists. As best I can tell, however, nobody has managed a lock-free doubly linked skip list. Does anybody know of any research to the contrary, or a reason why this is the case?
Edit:
The specific scenario is for building a fast quantile (50%, 75%, etc) accumulator. Samples are inserted into the skip list in O(log n) time. By maintaining an iterator to the current quantile, we can compare the inserted value to the current quantile in O(1) time, and can easily determine whether the inserted value is to the left or right of the quantile, and by how much the quantile needs to move as a result. It's the left move that requires a previous pointer.
As I understand it, any difficulty will come from keeping the previous pointers consistent in the face of multiple threads inserting and removing at once. I imagine the solution will almost certainly involve a clever use of pointer marking.
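Ignoring concurrency (which is the whole difficulty here), the accumulator logic can be sketched single-threaded; a sorted Python list stands in for the skip list, so insertion is O(n) instead of O(log n), but the quantile bookkeeping is the same:

```python
import bisect

class QuantileTracker:
    """Single-threaded sketch of the quantile accumulator described in
    the question: keep samples ordered and index the current quantile.
    A lock-free skip list would replace the sorted list."""
    def __init__(self, q=0.5):
        self.q = q
        self.samples = []

    def insert(self, x):
        # In the skip-list version this is where you'd compare x to the
        # current quantile and nudge the iterator left or right.
        bisect.insort(self.samples, x)

    def quantile(self):
        return self.samples[int(self.q * (len(self.samples) - 1))]
```

The "move left" case corresponds to decrementing the index, which is exactly the step that needs a previous pointer in a singly linked skip list.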
But why would you do such a thing? I've not actually sat down and worked out exactly how skip lists work, but from my vague understanding, you'd never use the previous pointers. So why have the overhead of maintaining them?
But if you wanted to, I don't see why you cannot. Just replace the singly linked list with a doubly linked list. The doubly linked list is logically coherent, so it's all the same.
I have an idea for you. We use a "cursor" to find the item in a skiplist. The cursor also maintains the trail that was taken to get to the item. We use this trail for delete and insert - it avoids a second search to perform those operations, and it embeds the version # of the list that was seen when the traversal was made. I am wondering if you could use the cursor to more quickly find the previous item.
You would have to go up a level on the cursor and then search for the item that is just barely less than your item. Alternatively, if the search made it to the lowest level of the linked list, just save the prev ptr as you traverse. The lowest level is probably used 50% of the time to find your item, so performance would be decent.
Hmm... thinking about it now, it seems that the cursor would have the prev ptr 50% of the time, need to search again from 1 level up 25% of the time, from 2 levels up 12.5% of the time, etc. So in infrequent cases, you have to do almost the entire search again.
I think the advantage to this would be that you don't have to figure out how to "lock free" maintain a double linked skip list, and for the majority of cases you dramatically decrease the cost of locating the previous item.
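A quick back-of-envelope check on those fractions (a rough model, assuming the usual p = 1/2 skip-list level distribution, not a proof): if with probability 2^-(k+1) you must restart the search k levels up the cursor's trail, the expected number of levels to back up is only about 1.

```python
# P(restart k levels up) = 2^-(k+1); k = 0 means the prev ptr is already
# known from the bottom-level traversal. Truncating the series at 60
# terms leaves a negligible tail.
expected_levels = sum(k * 2 ** -(k + 1) for k in range(60))
```

So on average the cursor trick costs about one extra level of searching per lookup of the previous item, consistent with "for the majority of cases you dramatically decrease the cost".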
As an alternative to maintaining backlinks, when a quantile needs to be updated, you could do another search to find the node whose key is less than the current one. As I also just mentioned in a comment to johnnycrash, it's possible to build an array of the rightmost node found at each level -- and from that it would be possible to accelerate the second search. (Fomitchev's thesis mentions this as a possible optimization.)

Searching in graphs trees with Depth/Breadth first/A* algorithms

I have a couple of questions about searching in graphs/trees:
Let's assume I have an empty chess board and I want to move a pawn around from point A to B.
A. When using depth first search or breadth first search must we use open and closed lists ? This is, a list that has all the elements to check, and other with all other elements that were already checked? Is it even possible to do it without having those lists? What about A*, does it need it?
B. When using lists, after having found a solution, how can you get the sequence of states from A to B? I assume when you have items in the open and closed list, instead of just having the (x, y) states, you have an "extended state" formed with (x, y, parent_of_this_node) ?
C. State A has 4 possible moves (right, left, up, down). If I do as first move left, should I let it in the next state come back to the original state? This, is, do the "right" move? If not, must I transverse the search tree every time to check which states I've been to?
D. When I see a state in the tree where I've already been, should I just ignore it, as I know it's a dead end? I guess to do this I'd have to always keep the list of visited states, right?
E. Is there any difference between search trees and graphs? Are they just different ways to look at the same thing?
A. When using depth first search or breadth first search must we use open and closed lists?
With DFS you definitely need to store at least the current path. Otherwise you would not be able to backtrack. If you decide upon maintaining a list of all visited (closed) nodes, you are able to detect and avoid cycles (expanding the same node more than once). On the other side you don't have the space efficiency of DFS anymore. DFS without closed list only needs space proportional to the depth of the search space.
With BFS you need to maintain an open list (sometimes called fringe). Otherwise the algorithm simply can't work. When you additionally maintain a closed list, you will (again) be able to detect/avoid cycles. With BFS the additional space for the closed list might be not that bad, since you have to maintain the fringe anyway. The relation between fringe size and closed list size strongly depends upon the structure of the search space, so this has to be considered here. E.g. for a branching factor of 2, both lists are equal in size and the impact of having the closed list doesn't seem very bad compared to its benefits.
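The BFS bookkeeping just described can be sketched as follows (a minimal version; the `parent` dict doubles as the closed list and stores the back-links needed to extract the path):

```python
from collections import deque

def bfs(start, goal, neighbors):
    """Breadth-first search with a fringe (open list) and a parent map
    that serves as both the closed list and the path back-links."""
    fringe = deque([start])
    parent = {start: None}
    while fringe:
        node = fringe.popleft()
        if node == goal:
            path = []          # walk the back-links to the start
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in parent:   # skip states already seen
                parent[nxt] = node
                fringe.append(nxt)
    return None

def board_moves(pos, size=8):
    """4-directional moves on an empty board (the question's example)."""
    x, y = pos
    steps = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
    return [(a, b) for a, b in steps if 0 <= a < size and 0 <= b < size]
```

Because every state enters `parent` at most once, no state is expanded twice, and the returned path is a shortest one in number of moves.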
What about A*, does it need it?
A*, as it can be seen as a special (informed) type of BFS, needs the open list. Omitting the closed list is more delicate than with BFS, as is deciding whether to update costs of entries already in the closed list. Depending upon those decisions, the algorithm can stop being optimal and/or complete, depending on the type of heuristic used, etc. I won't go into details here.
B.
Yup, the closed list should form some kind of inverse tree (pointers going towards the root node), so you can extract the solution path. You usually need the closed list for doing this. For DFS, your current stack is exactly the solution path (no need for closed list here). Also note that sometimes you are not interested in the path but only in the solution or the existence of it.
C.
Read previous answers and look for the parts which talk about the detection of cycles.
D.
To avoid cycles with a closed list: don't expand nodes that are already inside the closed list. Note: with path-costs coming into play (remember A*), things might get more tricky.
E. Is there any difference between search trees and graphs?
You could consider searches that maintain a closed list to avoid cycles as graph-searches and those without one tree-searches.
A) It's possible to avoid the open/closed lists - you could try all possible paths, but that would take a VERY long time.
B) Once you've reached the goal, you use the parent_of_this_node information to "walk backwards" from the goal. Start with the goal, get its parent, get the parent's parent, etc. until you reach the start.
C) I think it doesn't matter - there's no way that the step you describe would result in a shorter path (unless your steps have negative weight, in which case you can't use Dijkstra/A*). In my A* code, I check for this case and ignore it, but do whatever is easiest to code up.
D) It depends - I believe Dijkstra can never reopen the same node (can someone correct me on that?). A* definitely can revisit a node - if you find a shorter path to the same node, you keep that path, otherwise you ignore it.
E) Not sure, I've never done anything specifically for trees myself.
There's a good introduction to A* here:
http://theory.stanford.edu/~amitp/GameProgramming/
that covers a lot of details about how to implement the open set, pick a heuristic, etc.
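In that spirit, here is a compact A* sketch (my own illustrative version: 4-connected grid, Manhattan heuristic, `heapq` as the open set; stale heap entries are skipped rather than decrease-keyed):

```python
import heapq

def astar(start, goal, passable, width, height):
    """A* on a 4-connected grid with unit move costs. Manhattan
    distance is admissible here, so the first goal pop is optimal."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # (f, g, node)
    g = {start: 0}
    parent = {start: None}
    while open_heap:
        f, cost, node = heapq.heappop(open_heap)
        if node == goal:
            path = []                    # follow back-links to start
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if cost > g[node]:
            continue                     # stale entry; a better g exists
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and passable(nx, ny):
                ng = cost + 1
                if ng < g.get((nx, ny), float('inf')):
                    g[(nx, ny)] = ng
                    parent[(nx, ny)] = node
                    heapq.heappush(open_heap, (ng + h((nx, ny)), ng, (nx, ny)))
    return None
```

The `g` dict plays the role of the closed list here: a node is only re-expanded if a strictly cheaper path to it is found, which is the "reopening" behaviour mentioned in answer D.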
A. Open and Closed lists are common implementation details, not part of the algorithm as such. It's common to do a depth-first tree search without either of these for example, the canonical way being a recursive traversal of the tree.
B. It is typical to ensure that nodes refer back to previous nodes allowing for a plan to be reconstructed by following the back-links. Alternatively you just store the entire solution so far in each candidate, though it would then be misleading to call it a node really.
C. I'm assuming that moving left and then moving right bring you to an equivalent state - in this case, you would have already explored the original state, it would be on the closed list, and therefore should not have been put back onto the open list. You don't traverse the search tree each time because you keep a closed list - often implemented as an O(1) structure - for precisely this purpose of knowing which states have already been fully examined. Note that you cannot always assume that being in the same position is the same as being in the same state - for most game path-finding purposes, it is, but for general purpose search, it is not.
D. Yes, the list of visited states is what you're calling the closed list. You also want to check the open list to ensure you're not planning to examine a given state twice. You don't need to search any tree as such, since you typically store these things in linear structures. The algorithm as a whole is searching a tree (or a graph), and it generates a tree (of nodes representing the state space) but you don't explicitly search through a tree structure at any point within the algorithm.
E. A tree is a type of graph with no cycles/loops in it. Therefore you use the same graph search procedure to search either. It's common to generate a tree structure that represents your search through the graph, which is represented implicitly by the backwards links from each node to the node that preceded/generated it in the search. (Although if you go down the route of holding the entire plan in each state, there will be no tree, just a list of partial solutions.)

Pathing in a non-geographic environment

For a school project, I need to create a way to build personalized queries based on end-user choices.
Since the user can choose basically any fields from any combination of tables, I need to find a way to map the tables in order to make a join and not have extraneous data (This may lead to incoherent reports, but we're willing to live with that).
For up to two tables, I have already designed an algorithm that works fine. However, when I add another table, I can't find a way to path through my database. All tables available for the personalized reports can be linked together, so it really all comes down to finding which path to use.
You might be able to try some form of an A* algorithm. Basically this looks at each of the possible next options to choose and applies a heuristic to it, a function that determines roughly how far it is between this node and your goal. It then chooses the one that is closer and repeats. The hardest part of implementing A* is designing a good heuristic.
Without more information on how the tables fit together, or what you mean by a 'path' through the tables, it's hard to recommend something though.
Looks like it didn't like my link, probably the * in it, try:
http://en.wikipedia.org/wiki/A*_search_algorithm
Edit:
If that is the whole database, I'd go with a depth-first exhaustive search.
I thought about using A* or a similar algorithm, but as you said, the hardest part is about designing the heuristic.
My tables are centered around somewhat of a backbone with quite a few branches, each leading to at most a single leaf node. Here is the actual map (table names removed because I'm paranoid). Assuming I want to view data from the A, B and C tables, I need an algorithm to find the blue path.
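The depth-first exhaustive search suggested above might look like this over a table-adjacency map (all names are made up; for several target tables you would run it pairwise and union the resulting join edges):

```python
def find_join_path(adjacency, start, goal):
    """Depth-first search for a join path between two tables.
    adjacency maps each table name to the tables it can join to."""
    def dfs(table, path, seen):
        if table == goal:
            return path
        for nxt in adjacency.get(table, ()):
            if nxt not in seen:
                result = dfs(nxt, path + [nxt], seen | {nxt})
                if result:
                    return result
        return None
    return dfs(start, [start], {start})
```

On a backbone-and-branches schema like the one described, each branch has at most one way in, so the exhaustive search degenerates to following the backbone and is cheap in practice.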