I am following a MOOC, but I don't understand the correct answer or the other answers.
The MOOC closed and I cannot ask any questions on the forum.
This is the question:
Consider the following relation R:
A B C D
1 0 2 2
4 1 2 2
6 0 6 3
7 1 2 3
1 0 6 1
1 1 2 1
Among all these queries, which one returns the same relation R?
Π_{A,B,C,D}(R ⋈ δ_{A→D, D→F}(R))
R ⋈ δ_{A→D, D→A}(R)
R ⋈ δ_{B→C, C→B}(R)
Π_{A,B,C,D}(R ⋈ δ_{B→G, C→F}(R)) (note: this is the correct answer)
The only given explanation is:
The first three answers lose the tuple (4,1,2,2). In the last join, no tuple is lost.
Could you explain in detail what each of the answers does?
Thank you very much for your attention!
This is a question about Relational Algebra's Natural Join, and attribute naming. I presume the squiggly thing in your formulas is for Rename, usually denoted by the Greek letter rho, ρ (see the Wikipedia link).
For Natural Join, see the Wikipedia example, and note:
The result of the natural join is the set of all combinations of tuples in R and S that are equal on their common attribute names.
Because of the renaming in the four formulas, in general, the result from renamed R will not have the same attribute names as the original R, or will not be equal on the values in the resulting same-named attributes.
I suggest you go through each of the four renamings and work out the 'heading' of each result -- that is, the resulting attribute names.
You'll find that in queries 1, 2, and 3 there's at least one resulting attribute named the same as in the original R, but the values for that attribute are not the same.
In query 4, although attributes B, C are renamed, their new names do not clash with any existing attribute in R. So the Natural Join to the original R will use attributes A, D. This produces an interesting intermediate result: consider the tuples <1, 0, 6, 1>, <1, 1, 2, 1>, which contain equal values in their A attribute and their D attribute.
But then in query 4, the projection throws away the newly-named attributes G, F and collapses back to the original A, B, C, D. So in general, query 4 always returns exactly the original R.
Queries 1, 2, and 3 might sometimes return the original R, depending on the content of R. But with the content you show, there are clashes of newly same-named attributes with non-equal values, so they do 'lose' tuples.
BTW, although tuple <4, 1, 2, 2> does indeed get 'lost' in those three queries, it's not the only tuple that gets 'lost'. In particular, note that in query 3, for the sample data, there are no values in common between B and C, so swapping them round in the rename returns an empty result from the Join.
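If it helps to experiment, here is a minimal sketch (my own illustration, not from the MOOC) that models a relation as a list of attribute-to-value dicts and replays queries 2 and 4 on the sample data:

from itertools import product

# The sample relation R from the question.
R = [dict(zip("ABCD", row)) for row in
     [(1, 0, 2, 2), (4, 1, 2, 2), (6, 0, 6, 3),
      (7, 1, 2, 3), (1, 0, 6, 1), (1, 1, 2, 1)]]

def rename(rel, mapping):
    # Simultaneous rename, e.g. {"A": "D", "D": "A"} swaps A and D.
    return [{mapping.get(k, k): v for k, v in t.items()} for t in rel]

def njoin(r, s):
    # Natural join: combine tuples that agree on all common attribute names.
    common = r[0].keys() & s[0].keys()
    return [{**t, **u} for t, u in product(r, s)
            if all(t[a] == u[a] for a in common)]

def project(rel, attrs):
    # Projection with duplicate elimination (relations are sets).
    return {tuple(t[a] for a in attrs) for t in rel}

# Query 2 joins on all of A, B, C, D, keeping only tuples (a,b,c,d)
# for which (d,b,c,a) is also in R -- here just (1,0,6,1) and (1,1,2,1).
print(project(njoin(R, rename(R, {"A": "D", "D": "A"})), "ABCD"))
# Query 4 joins on A, D only; projecting back to A, B, C, D gives R exactly.
print(project(njoin(R, rename(R, {"B": "G", "C": "F"})), "ABCD"))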
I have four channels in my application: A, B, C, D. Some application users are interested only in documents contained in both channels A and B; this can also be expressed as A ∩ B. Others may be interested in a different combination like A ∩ B ∩ D.
UPDATE
I don't think the following will work anyway
What has been suggested so far is that I can create a new channel (like A_B and A_B_D) for each combination and then tag the documents that meet the intersection criteria accordingly. But you can see how this could easily get out of hand since with just 4 channels, you end up with 15 combinations (11 extra channels).
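Just to illustrate that growth (a throwaway sketch of mine), enumerating the channel names this scheme would need:

from itertools import combinations

channels = ["A", "B", "C", "D"]
combos = [c for r in range(1, len(channels) + 1)
          for c in combinations(channels, r)]
print(len(combos))                                   # 15 non-empty combinations
print(["_".join(c) for c in combos if len(c) > 1])   # the 11 extra channels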
Is there a way to do this with channels or perhaps some other feature I have missed in Couchbase?
The assignment of channels to a document is done via the sync function, so a document is not "contained" in a channel; rather, it has attributes from which the channels it is routed to can be derived. Only in the simplest default case does the document's channel attribute route it to the channel named by that attribute's value.
So what you intend can be achieved by putting statements like
if (doc.areas.includes("A") && doc.areas.includes("B")) {
    channel("AB");
}
into the sync function. (I renamed the channels attribute to areas to make clear to the reader of the program that these are not the actual channels, but that channels are only derived from combinations of them.)
Here is a function I would like to write but am unable to do so. Even if you
don't / can't give a solution I would be grateful for tips. For example,
I know that there is a correlation between the ordered representations of the
sum of an integer and ordered set partitions, but that alone does not help me
find the solution. So here is the description of the function I need:
The Task
Create an efficient* function
List<int[]> createOrderedPartitions(int n_1, int n_2,..., int n_k)
that returns a list of arrays of all ordered set partitions of the set
{0,...,n_1+n_2+...+n_k-1} into k blocks of sizes (in this
order) n_1,n_2,...,n_k (e.g. n_1=2, n_2=1, n_3=1 -> ({0,1},{3},{2}),...).
Here is a usage example:
int[] partition = createOrderedPartitions(2,1,1).get(0);
partition[0]; // -> 0
partition[1]; // -> 1
partition[2]; // -> 3
partition[3]; // -> 2
Note that the number of elements in the list is
(n_1+n_2+...+n_k choose n_1) * (n_2+n_3+...+n_k choose n_2) * ... *
(n_k choose n_k). Also, createOrderedPartitions(1,1,1) would create the
permutations of {0,1,2} and thus there would be 3! = 6 elements in the
list.
* by efficient I mean that you should not initially create a bigger list
like all partitions and then filter out results. You should do it directly.
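To sanity-check those counts, here is a small helper of mine in Python (math.comb needs 3.8+); it is separate from the createOrderedPartitions function being asked for:

from math import comb

def n_ordered_partitions(*sizes):
    # (n_1+...+n_k choose n_1) * (n_2+...+n_k choose n_2) * ... * (n_k choose n_k)
    total, count = sum(sizes), 1
    for s in sizes:
        count *= comb(total, s)
        total -= s
    return count

print(n_ordered_partitions(2, 1, 1))  # 12 = C(4,2) * C(2,1) * C(1,1)
print(n_ordered_partitions(1, 1, 1))  # 6 = 3!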
Extra Requirements
If an argument is 0 treat it as if it was not there, e.g.
createOrderedPartitions(2,0,1,1) should yield the same result as
createOrderedPartitions(2,1,1). But at least one argument must not be 0.
Of course all arguments must be >= 0.
Remarks
The provided pseudo code is quasi Java but the language of the solution
doesn't matter. In fact, as long as the solution is fairly general and can
be reproduced in other languages it is ideal.
Actually, even better would be a return type of List<Tuple<Set>> (e.g. when
creating such a function in Python). However, then the arguments which have
a value of 0 must not be ignored. createOrderedPartitions(2,0,2) would then
create
[({0,1},{},{2,3}),({0,2},{},{1,3}),({0,3},{},{1,2}),({1,2},{},{0,3}),...]
Background
I need this function to make my mastermind-variation bot more efficient and
most of all the code more "beautiful". Take a look at the filterCandidates
function in my source code. There are unnecessary/duplicate
queries because I'm simply using permutations instead of
specifically ordered partitions. Also, I'm just interested in how to write
this function.
My ideas for (ugly) "solutions"
Create the powerset of {0,...,n_1+...+n_k-1}, filter out the subsets of size
n_1, n_2 etc. and create the cartesian product of the k groups of subsets.
However, this won't actually work because the blocks could share elements,
e.g. ({1,2},{1})...
First choose n_1 of x = {0,...,n_1+n_2+...+n_k-1} and put them in the
first set. Then choose n_2 of x without the n_1 chosen elements
beforehand and so on. You then get for example ({0,2},{},{1,3},{4}). Of
course, every possible combination must be created so ({0,4},{},{1,3},{2}),
too, and so on. Seems rather hard to implement but might be possible.
Research
I guess this
goes in the direction I want; however, I don't see how I can utilize it for my
specific scenario.
http://rosettacode.org/wiki/Combinations
You know, it often helps to phrase your thoughts in order to come up with a solution. It seems that then the subconscious just starts working on the task and notifies you when it found the solution. So here is the solution to my problem in Python:
from itertools import combinations

def partitions(*args):
    # Pick args[0] elements for the first block, then recursively
    # partition the remaining elements by the remaining block sizes.
    def helper(s, *args):
        if not args:
            return [[]]  # exactly one way to partition into zero blocks
        res = []
        for c in combinations(s, args[0]):
            s0 = [x for x in s if x not in c]  # the elements not yet used
            for r in helper(s0, *args[1:]):
                res.append([c] + r)
        return res
    s = range(sum(args))
    return helper(s, *args)

print(partitions(2, 0, 2))
The output is:
[[(0, 1), (), (2, 3)], [(0, 2), (), (1, 3)], [(0, 3), (), (1, 2)], [(1, 2), (), (0, 3)], [(1, 3), (), (0, 2)], [(2, 3), (), (0, 1)]]
It should translate readily to Lua/Java. It is basically the second idea I had in the question.
The Algorithm
As I already mentioned in the question, the basic idea is as follows:
First choose n_1 elements of the set s := {0,...,n_1+n_2+...+n_k-1} and put them in the
first set of the first tuple in the resulting list (e.g. [({0,1,2},... if the chosen elements are 0, 1, 2). Then choose n_2 elements of the set s_0 := s without the n_1 elements chosen beforehand, and so on. One such tuple might be ({0,2},{},{1,3},{4}). Of
course, every possible combination is created, so ({0,4},{},{1,3},{2}) is another such tuple, and so on.
The Realization
At first the set to work with is created (s = range(sum(args))). Then this set and the arguments are passed to the recursive helper function helper.
helper does one of the following things: if all the arguments are processed, it returns a list containing a single empty partition ([[]]) to stop the recursion. Otherwise it iterates through all the combinations of the passed set s of the length args[0] (the first argument after s in helper). In each iteration it creates the set s0 := s without the elements in c (the elements in c are the elements chosen from s), which is then used for the recursive call of helper.
So what happens with the arguments in helper is that they are processed one by one. helper may first start with helper([0,1,2,3], 2, 1, 1) and in the next invocation it is for example helper([2,3], 1, 1) and then helper([3], 1) and lastly helper([]). Of course another "tree-path" would be helper([0,1,2,3], 2, 1, 1), helper([1,2], 1, 1), helper([2], 1), helper([]). All these "tree-paths" are created and thus the required solution is generated.
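As a quick check (my addition, reusing the partitions function above), the list sizes match the binomial-product formula from the question (math.comb needs Python 3.8+):

from math import comb

# len(partitions(2, 1, 1)) should be C(4,2) * C(2,1) * C(1,1) = 12
assert len(partitions(2, 1, 1)) == comb(4, 2) * comb(2, 1) * comb(1, 1)
# and partitions(1, 1, 1) yields the 3! = 6 permutations of {0, 1, 2}
assert len(partitions(1, 1, 1)) == 6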
Given a set** S containing duplicate elements, how can one determine the total number of all the possible subsets of S, where each subset is unique?
For example, say S = {A, B, B} and let K be the set of all subsets, then K = {{}, {A}, {B}, {A, B}, {B, B}, {A, B, B}} and therefore |K| = 6.
Another example would be if S = {A, A, B, B}, then K = {{}, {A}, {B}, {A, B}, {A, A}, {B, B}, {A, B, B}, {A, A, B}, {A, A, B, B}} and therefore |K| = 9.
It is easy to see that if S is a real set, having only unique elements, then |K| = 2^|S|.
What is a formula to calculate this value |K| given a "set" S (with duplicates), without generating all the subsets?
** Not technically a set.
Take the product of all the (frequencies + 1).
For example, in {A,B,B}, the answer is (1+1) [the number of As] * (2+1) [the number of Bs] = 6.
In the second example, count(A) = 2 and count(B) = 2. Thus the answer is (2+1) * (2+1) = 9.
The reason this works is that you can define any subset as a vector of counts - for {A,B,B}, the subsets can be described as {A=0,B=0}, {A=0,B=1}, {A=0,B=2}, {A=1,B=0}, {A=1,B=1}, {A=1,B=2}.
For each number in counts[] there are (frequency of that object + 1) possible values (0..frequency).
Therefore, the total number of possibilities is the product of all (frequencies+1).
The "all unique" case can also be explained this way - there is one occurrence of each object, so the answer is (1+1)^|S| = 2^|S|.
I'll argue that this problem is simple to solve, when viewed in the proper way. You don't care about the order of the elements, only whether they appear in a subset or not.
Count the number of times each element appears in the set. For the one element set {A}, how many subsets are there? Clearly there are only two sets. Now suppose we added another element, B, that is distinct from A, to form the set {A,B}. We can form the list of all sets very easily. Take all the sets that we formed using only A, and add in zero or one copy of B. In effect, we double the number of sets. Clearly we can use induction to show that for N distinct elements, the total number of sets is just 2^N.
Suppose that some elements appear multiple times? Consider the set with three copies of A. Thus {A,A,A}. How many subsets can you form? Again, this is simple. We can have 0, 1, 2, or 3 copies of A, so the total number of subsets is 4 since order does not matter.
In general, for N copies of the element A, we will end up with N+1 possible subsets. Now, expand this by adding in some number, M, of copies of B. So we have N copies of A and M copies of B. How many total subsets are there? Yes, this seems clear too. To every possible subset with only A in it (there were N+1 of them) we can add between 0 and M copies of B.
So the total number of subsets when we have N copies of A and M copies of B is simple. It must be (N+1)*(M+1). Again, we can use an inductive argument to show that the total number of subsets is the product of such terms. Merely count up the total number of replicates for each distinct element, add 1, and take the product.
See what happens with the set {A,B,B}. We get 2*3 = 6.
For the set {A,A,B,B}, we get 3*3 = 9.
For a particular project, we acquire data for a number of events and collect variables about those events at the same time. After the data has been collected, we perform a user-customizable analysis on said data to determine whatever it is that the user is interested in.
The data is collected in a form similar to this:
Timestamp  Event
0          x = 0
0          y = 1
3          Event A occurred
3          x = 1
4          Event A occurred
4          x = 2
9          Event B occurred
9          y = 2
9          x = 0
To understand the entire state at any time, the most straightforward approach is to walk over the entire set of data. For example, if I start at time 0, and "analyze" until timestamp 5, I know that at that point x = 2, y = 1, and Event A has occurred twice. That's a really simple example. The user might be (and often is) interested in the time between events, say from A to B, and they might specify the first occurrence of A, then B, or the last occurrence of A, then B (respectively, 9-3 = 6 or 9-4 = 5). Like I said, this is easy to analyze when you're walking over the entire set.
Now, we need to adapt the model to analyze an arbitrary window of time. If we look at 0-N, that's the easy case. But if I look at 1-5, for instance, I have no notion of y unless I begin at 0 and know that y was initially 1 and did not change in the window 1-5.
Our approach is essentially to create a dictionary of variables and run callbacks on events. If one analysis was "What is x when Event A occurs and time is > 3", then we would run that callback on the first Event A, and it would immediately return because time is not greater than 3. It would run again at t=4, and it would report that x was 1 at t=4.
To adapt to the "time-windowing", I think I am going to (in the background) tack on additional conditions to the analysis. If their analysis is just "What is x when Event A occurs", and the current window is 1-5, then I will change it to "What is x when Event A occurs and time >= 1 and time <= 5". Then if the next window is 6-10, I can readjust the condition as necessary.
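To sketch that idea (hypothetical names, not our real API), the window bounds just wrap the user's predicate:

def windowed(predicate, start, end):
    # Run the user's analysis only when the timestamp is inside the window.
    return lambda ts, state: start <= ts <= end and predicate(ts, state)

# "What is x when Event A occurs", restricted to the window 1-5:
analysis = lambda ts, state: state.get("last_event") == "A"
in_window = windowed(analysis, 1, 5)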
My main question is: what pattern does this fit? We are obviously not the first people to approach a problem like this, but I have not been able to find how others have approached it. I probably just don't know what exactly to search for on Google. Is there any other approach besides keeping a dictionary of the entire global state for looking up a single state at a given time? Note also that the data could have several thousand, maybe tens of thousands, of records, so the fewer iterations over the data set, the better.
I think your best approach here would be to take periodic "snapshots" of the full state data, say every 1000 samples (for example), along with recording the deltas. When you're storing your data as offsets from some original value (aka deltas), you don't have any choice but to reconstruct the full data starting with the original values. Storing periodic snapshots will lessen the amount of reconstruction you have to do - the design tradeoff is between low storage requirements but long reconstruction time on the one hand, and higher storage requirements but shorter reconstruction time on the other.
MPEGs, for example, store each frame as the differences between the current frame and the previous frame. Ordinarily, this would force an MPEG to be viewed from the beginning, but the format also periodically stores full frames so that the decoder doesn't have to backtrack all the way to the beginning of the file.
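A minimal sketch of that idea in Python (the names and the snapshot interval are mine, and event occurrences are simplified to plain variable updates):

import bisect

SNAPSHOT_EVERY = 4  # records between snapshots; small here, ~1000 in practice

class EventLog:
    """Append-only delta log with periodic full-state snapshots."""

    def __init__(self):
        self.records = []                  # (timestamp, key, value)
        self.snap_times = [float("-inf")]  # timestamp of each snapshot
        self.snap_pos = [0]                # records covered by each snapshot
        self.snap_state = [{}]             # full state at each snapshot

    def append(self, ts, key, value):
        self.records.append((ts, key, value))
        if len(self.records) % SNAPSHOT_EVERY == 0:
            # Build the next keyframe by replaying since the last one.
            state = dict(self.snap_state[-1])
            for _, k, v in self.records[self.snap_pos[-1]:]:
                state[k] = v
            self.snap_times.append(ts)
            self.snap_pos.append(len(self.records))
            self.snap_state.append(state)

    def state_at(self, t):
        # Log-time search for the last snapshot at or before t, then
        # replay at most SNAPSHOT_EVERY deltas forward.
        i = bisect.bisect_right(self.snap_times, t) - 1
        state = dict(self.snap_state[i])
        for ts, k, v in self.records[self.snap_pos[i]:]:
            if ts > t:
                break
            state[k] = v
        return state

log = EventLog()
for rec in [(0, "x", 0), (0, "y", 1), (3, "A", "occurred"), (3, "x", 1),
            (4, "A", "occurred"), (4, "x", 2), (9, "B", "occurred"),
            (9, "y", 2), (9, "x", 0)]:
    log.append(*rec)
print(log.state_at(5))   # {'x': 2, 'y': 1, 'A': 'occurred'}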
You can search by time in Log(N), and you can have a feeling for how many updates are acceptable... hence here's my solution:
Pick a number, N, of updates that are acceptable in order to return a result. 256 might be good, given the scales you've mentioned so far.
Every N records, commit an entry of all state to a dictionary, with a timestamp.
Now, you have a tradeoff: dictionary size against speed. N → ∞ is regular searching; N = 1 is your current solution; any N in between will require less memory, but be slower.
Your implementation is now (for time X):
A Log(n) search of the subsampled global dictionary for the last entry at or before X (timestamped, say, Y).
A Log(n) search of the event list for timestamp Y, then perform fewer than N updates (forward to X).
Picking N as a power of two will even allow you to do some nice shift tricks to do a rounded-down integer divide nice and fast.
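For instance (a sketch), with N = 256 the snapshot slot for a record index is a single shift:

N_SHIFT = 8                               # N = 2**8 = 256
record_index = 1234
snapshot_slot = record_index >> N_SHIFT   # same as record_index // 256, rounded down
print(snapshot_slot)                      # 4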