Splitting Data into a test set and training set - rapidminer

Which operator in Rapidminer can I use to make an out of bag sample as my training set, and use the remaining data as my test set?

The Split Data operator is one option. This makes 2 or more example sets split up the way you want and you can do what you want with these. An alternative that incorporates the training and test aspects is Split-Validation.

Use the X-validation operator.
Attach your data set to the X-validation operator, then attach the output of the operator to the output node.
After this, go into the x-validation operator by double clicking it or clicking the small double blue window at its bottom right corner.
Once inside the operator, attach whatever model you wish to create (for this instance I used a decision tree model) on the training side of the data then on the testing side, attach the apply model operator to the performance operator. Finally attach the performance operator to the output.
Then press play. It should work.

Related

Questions about the Boundary Value Check

I'm doing my JUnit homework and need some explanations here.
Here's the quotation from my homework description:
One of the issues with boundary conditions is that the system needs to behave well even if the boundary is approached multiple times. This should be obvious, but it doesn't always happen in practice.
Remember that we can characterize an object as state and behavior. Typically, the state is not directly accessible, but instead, is accessed indirectly by means of the behavior. That is, the behavior reflects the state of the object.
Now, if we think about boundaries in math, it might not be too surprising to imagine the the value at some boundary will be different if we approach that boundary in different ways. So, if the value can be likened to the state, the state at the boundary may vary depending on how we got there. This would mean that the behavior could be different.
To make objects that behave consistently, we would have to insure that the internal state at those boundaries is consistent. So, test cases should check this assumption. To receive challenge points for this homework assignment enhance your test cases so that potential problems around the boundaries may be discovered.
Clearly mark the Challenge test cases with the string "### challenge ###" in the comments. Include in those comments what boundary is being tested, and how you're guessing that the state of the object may be different depending on how the boundary is being approached.
I don't understand this especially the highlighted part. What does he mean by "object behave consistently" and the "potential probelms"?
Also, how is this different than general boundary check that will just throws the exception and i expected in the JUnit?
Thank you!
Without knowing the details of the homework, an answer could only be somewhat generic, but I'll try.
Boundary checking is not just exception checking, its about seeing which paths in your code are execution on what condition. If you have control statements, loops, if-else, switch, etc you have to verify, on what conditions (of your internal state) those statements are processed in what way.
To me, boundary testing is that you change certain values of an instance field in a way that would cause the behavior to run through different branches of your code.
for example, you have this behavior:
if(someInstanceValue > 5) {
return "great";
} else {
return "poor";
}
Now you could test with data for someInstanceValue that define the boundary
4 : "poor"
5 : "great"
If you have multiple fields in your class, all of them define the state but only some of them may affect a certain path in your code. As the test is a specification of your class under test, written in code, you should specify which fields are relevant to a function, and which are not (by leaving them out).
So you should set up your instance-under-test accordingly (calling all setters) or if you require more complex objects, you could use frameworks like Mockito to specify the state (in a when().thenReturn() syntax).
If you want to verify if you covered all your boundaries, you could run a mutation test against your suite using a mutation testing tool like PIT. It will flip the switches in your code (i.e. replacing a < with a >=) to check whether your test will fail. Often, it's a good source of inspiration for improving the way you test.
Neverthelss, some parts of the homework assignment sound a bit confusing to me. You may approach a boundary from two sides, ok, but there is no such thing as a state that represents THE boundary, you're either on one or the other side of the boundary. If the way, how you approached one side of a boundary matters, and the object behaves differently depending on that "history" of how you reached that state, the history becomes part of the state. In other words: different history = different state.
Keep in mind: every instance field is part of the state. Every possible combination of values of your instance fields defines a single state. Every transition from one combination to another is a state transition triggered by calling a behavior. No think of your test describing this statemachine, be listing the triple of {currentState,input} -> nextState (with input being method invocation). Wich is basically the Given-When-Then structure good tests should have.

Formating Output in Information Design Tool

When working with IDT 4.1 and when making a query in business layer, is it possible to format the yielded numbers? What I mean is - the output looks something like "1.9982121921**E7**" (please noctice E7 part). I would like BO to display the whole number without any suffixes.
Additionally, it would be even better to add a delimiter after thousands, millions,...
Is 1.9982121921**E7** the value that is returned from the database? Thus not a number but a string (alphanumeric)? In that case, you'll have to change the select statement and use a database function to trim the non-numeric characters off (e.g. SUBSTR, MID, LEFT, …).
Once you have numeric data, you can use the Display Format function to change the layout of your object.
If you're not happy with the predefined formats to choose from, you can always define a custom format. The formatting options are described in the Information Design Tool User Guide (links to the documentation for IDT in BI 4.1 SP3), section 12.10.23 Creating and editing display formats for business layer objects.
Right-click and object and select Create Display Format… from the context menu.
Or click the Create Display Format… button in the object's properties (located in the Advanced tab).
Set the type to numeric and enable the high precision check mark.

Complex Gremlin queries to output nodes/edges

I am trying to implement a query and graph visualisation framework that allows a user to enter a Gremlin query, returning a D3 graph of results. The D3 graph is built using a JSON - this is created using separate vertices and edges outputs from the Gremlin query. For simple queries such as:
g.V.filter{it.attr_a == "foo"}
this works fine. However, when I try to perform a more complicated query such as the following:
g.E.filter{it.attr_a == 'foo'}.groupBy{it.attr_b}{it.outV.value}.cap.next().findAll{k,e->e.size()<=3}
- Find all instances of *value*
- Grouped by unique *attr_b*
- Where *attr_a* = foo
- And *attr_b* is paired with no more than 2 other instances of *value*
Instead, the output is of the following form:
attr_b1: {value1, value2, value3}
attr_b2: {value4}
attr_b3: {value6, value7}
I would like to know if there is a way for Gremlin to output the results as a list of nodes and edges so I can display the results as a graph. I am aware that I could edit my D3 code to take in this new output but there are currently no restrictions to the type/complexity of the query, so the key/value pairs will no necessarily be the same every time.
Thanks.
You've hit what I consider one of the key problems with visualizing Gremlin results. They can be anything. Gremlin results might not just be a list of vertices and edges. There is no way to really control this that I can think of. At the end of the day, you can really only visualize results that match a pattern that D3 expects. I'd start by trying to detect that pattern and visualize only in those cases (simply display non-recognized patterns as JSON perhaps).
Thinking of your specific example that results like this:
attr_b1: {value1, value2, value3}
attr_b2: {value4}
attr_b3: {value6, value7}
What would you want D3 to visualize there? The vertices/edges that were traversed over to get that result? If so, you might be stuck. Gremlin doesn't give you a way to introspect the pipeline to see what's passing through it. In other words, unless the user explicitly gathers vertices and edges within the pipeline that were touched you won't have access to them. It would be nice to be able to "spy" on a pipeline in that way, but at the moment it doesn't do that. There's been internal discussion within TinkerPop to create a new kind of pipeline implementation that would help with that, but at the moment, it doesn't exist.
So, without the "spying" capability, I think your only workarounds would be to:
detect vertex/edge list on your client side and only render those with d3. this would force users to always write gremlin that returned data in such a format, if they wanted visualization. put it in the users hands.
perhaps supply server-side bindings for a list of vertices/edges that a user could explicitly side-effect their vertices/edges into if their results did not conform to those expected by your visualization engine. again, this would force users to write their gremlin appropriately for your needs if they want visualization.

What does Backpatching mean?

What does backpatching mean ? Please illustrate with a simple example.
Back patching usually refers to the process of resolving forward branches that have been planted in the code, e.g. at 'if' statements, when the value of the target becomes known, e.g. when the closing brace or matching 'else' is encountered.
In intermediate code generation stage of a compiler we often need to execute "jump" instructions to places in the code that don't exist yet. To deal with this type of cases a target label is inserted for that instruction.
A marker nonterminal in the production rule causes the semantic action to pick up.
Some statements like conditional statements, while, etc. will be represented as a bunch of "if" and "goto" syntax while generating the intermediate code.
The problem is that, These "goto" instructions, do not have a valid reference at the beginning(when the compiler starts reading the source code line by line - A.K.A 1st pass). But, after reading the whole source code for the first time, the labels and references these "goto"s are pointing to, are determined.
The problem is that can we make the compiler able to fill the X in the "goto X" statements in one single pass or not?
The answer is yes.
If we don't use backpatching, this can be achieved by a 2 pass analysis on the source code. But, backpatching lets us to create and hold a separate list which is exclusively designed for "goto" statements. Since it is done in only one pass, the first pass will not fill the X in the "goto X" statements because the comipler doesn't know where the X is at first glance. But, it does stores the X in that exclusive list and after going through the whole code and finding that X, the X is replaced by that address or reference.
Backpaching is the process of leaving blank entries for the goto instruction where the target address is unkonown in the forward transfer in the first pass and filling these unknown in the second pass.
Backpatching:
The syntax directed definition can be implemented in two or more passes (we have both synthesized attributes and inherited attributes).
Build the tree first.
Walk the tree in the depth-first order.
The main difficulty with code generation in one pass is that we may not know the target of a branch when we generate code for flow of control statements
Backpatching is the technique to get around this problem.
Generate branch instructions with empty targets
When the target is known, fill in the label of the branch instructions (backpatching).
backpatching is a process in which the operand field of an instruction containing a forward reference is left blank initially. the address of the forward reference symbol is put into this field when its definition is encountered in the program.
Back patching is the activity of filling up the unspecified information of labels
by using the appropriate semantic expression in during the code generation process.
It is done by:
boolean expression.
flow of control statement.

Manually evaluating math formula from a string

I know there are solutions to evaluate math formulas with AS3: Using MathParser, or port it to JavaScript and many others.
But what if I want to manually evaluate a math formula, without using any built libraries?
It should be able to solve this: 1+(2*5+(9-6)/(5-2))+(6/2)*5
This is how I intend to go about it:
Seperate the string literal with left and right brackets.
Sort the result with its priority (by scanning the string from left to right, every
time it sees a left bracket then priority increases. Every time it sees a right bracket
then decreases).
Calculate the results from the one with the highest priority.
However, I have not been able to successfully implement it yet.
Transform your expression into Reverse Polish Notation using the Shunting-yard algorithm and evaluate it. There's a good example in the second article.