Impute missing value error in RapidMiner

I have sample data for some students with subject, grade, and semester columns. Some values in the semester column are missing, and I need to impute them by learning from the existing values in that column. I am using RapidMiner for this, with a process made of two operators: one retrieves the data and the other imputes it.
When I try to execute the process it shows an error.
I also tried changing the data type of the semester column from numerical to real, without success, and I could not find a solution on the web. Does anyone have suggestions?
UPDATED
Below is the XML:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve" width="90" x="179" y="85">
<parameter key="repository_entry" value="//Local Repository/testing data 2"/>
</operator>
<operator activated="true" class="impute_missing_values" compatibility="8.1.001" expanded="true" height="68" name="Impute Missing Values" width="90" x="380" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Semester"/>
<parameter key="attributes" value="Subjects|Semester|Grades|GPA|Course Code|Batch"/>
<process expanded="true">
<connect from_port="example set source" to_port="model sink"/>
<portSpacing port="source_example set source" spacing="0"/>
<portSpacing port="sink_model sink" spacing="0"/>
</process>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Impute Missing Values" to_port="example set in"/>
<connect from_op="Impute Missing Values" from_port="example set out" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="158" y="110">Type your comment</description>
</process>
</operator>
</process>
One more thing: when I run the process, along with the above error I can still see the resulting output by clicking the result output icon.
My file has 54 records in total, 7 of which have a missing value in the semester column, and the resulting output shows 47 records, so the records with missing values have been removed. Shouldn't the missing values be replaced with some other value? Why are those records removed?

The problem seems to be that you are trying to connect a data set (called an ExampleSet in RapidMiner) to an input that requires a model.
When you are unsure about the input and output of an Operator you can hover over the ports or press F1 (Show Operator Info in the right-click context menu) and you'll see more information.
In general it is also always very helpful to attach the process xml to a question, so others can directly copy and inspect your process (sans the data of course). The xml view can be found under View -> Show Panel -> XML in the menu bar.

The Impute Missing Values operator requires another operator inside it that takes an example set and produces a model. The idea is that all columns with missing values are iterated over as labels and the model predicts what the missing value would be. It basically assumes that the missing values are test data and the non missing is training data. There is an example process available - if you go to the help for the Impute operator and scroll to the end, you will find a process that loads some data with missing values and imputes these.
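The idea described above can be sketched outside RapidMiner. This is a minimal Python illustration, not RapidMiner's implementation: rows with a known Semester act as training data, a deliberately simple learner is trained on them, and its predictions fill the missing values. The learner (most common Semester per Batch), the helper names, and the sample rows are all invented for the example; only the column names come from the process XML.

```python
from collections import Counter, defaultdict

def train(rows, label, feature):
    """Learn the most common label value for each value of one feature.

    Hypothetical stand-in for the learner placed inside the operator.
    Assumes at least one row with a known label.
    """
    by_feature = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        by_feature[row[feature]][row[label]] += 1
        overall[row[label]] += 1
    fallback = overall.most_common(1)[0][0]  # used for unseen feature values
    table = {key: counts.most_common(1)[0][0] for key, counts in by_feature.items()}
    return lambda row: table.get(row[feature], fallback)

def impute(rows, label, feature):
    """Fill missing labels with model predictions; no row is dropped."""
    known = [row for row in rows if row[label] is not None]  # training data
    model = train(known, label, feature)
    return [row if row[label] is not None else {**row, label: model(row)}
            for row in rows]

# Invented sample rows; None marks a missing Semester.
rows = [
    {"Batch": "2016", "Semester": 1},
    {"Batch": "2016", "Semester": 1},
    {"Batch": "2015", "Semester": 3},
    {"Batch": "2015", "Semester": None},
    {"Batch": "2014", "Semester": None},
]
imputed = impute(rows, label="Semester", feature="Batch")
```

The output has as many rows as the input: the rows with a missing Semester are kept and given a predicted value rather than removed.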


Problem with exercise with Xquery on Basex

I really need your help with a query in BaseX.
The problem is that I really do not understand the logic behind this language which is Xquery.
So I have this first exercise and it is asking me:
"Find the first symptom(s) appearing after June 5, 2012. Report the result in a document having root SYMSAFTER, containing elements SYM."
The database looks like this:
<?xml version="1.0"?>
<PATIENT_SYMS>
<PATIENT>
<NAME>Bob</NAME>
<SYMOCC>
<SYM>
<INT>high</INT>
<DESC> edema </DESC>
</SYM>
</SYMOCC>
</PATIENT>
<PATIENT>
<NAME>Ann</NAME>
<SYMOCC>
<DATE>2015-08-03</DATE>
<SYM>
<INT>low</INT>
<DESC> asthma </DESC>
</SYM>
</SYMOCC>
<SYMOCC>
<DATE>2017-05-03</DATE>
<SYM>
<INT> high </INT>
<DESC> nausea </DESC>
</SYM>
</SYMOCC>
</PATIENT>
<PATIENT>
<NAME> Tom </NAME>
<SYMOCC>
<DATE>2011-01-01</DATE>
<SYM>
<INT>high</INT>
<DESC> headache </DESC>
</SYM>
<SYM>
<INT> low </INT>
<DESC> nausea </DESC>
</SYM>
</SYMOCC>
</PATIENT>
<PATIENT>
<NAME>Sue</NAME>
</PATIENT>
</PATIENT_SYMS>
The answer to the question is the following:
<SYMSAFTER> {
for $s in doc('Ps.xml')//SYMOCC
where $s/DATE > '2012-06-05' and (every $s1 in doc('Ps.xml')//SYMOCC satisfies not($s1/DATE > '2012-06-05') or $s1/DATE >= $s/DATE)
return $s
}
</SYMSAFTER>
The output will be:
<SYMSAFTER>
<SYMOCC>
<DATE>2015-08-03</DATE>
<SYM>
<INT>low</INT>
<DESC>asthma</DESC>
</SYM>
</SYMOCC>
</SYMSAFTER>
I honestly don't understand the logic behind that.
How are instructions executed in this language? Is it comparing every single date in $s with every other date in $s1? Is there any order it follows?
How does satisfies / satisfies not work? To understand what is going on I thought: if this works,
satisfies not($s1/DATE > 2012-06-05)
why does the one below not work?
satisfies ($s1/DATE < 2012-06-05)
Isn't it exactly the same thing?
Why is the last part "OR" and not "AND"? I understand that we're checking whether the first date is actually the first by checking that there isn't another date before it, but shouldn't it be "AND"?
Why in this line
$s1/DATE >= $s/DATE
do we use greater-or-equal (and not just greater)? Isn't it obvious that it is going to find the same date as the one in $s?
As you can imagine I'm a bit confused about this, but the information available online is really poor and I had no idea what I needed to do.
Thank you!
Learning any language from online resources alone can be very tough. There's so much information, but it is typically of very mixed quality, and most of it's written in an hour or two with very little design or review. Get yourself a good old-fashioned book, like Priscilla Walmsley's - you know that's written by an expert, who has spent months thinking carefully about how to present information in a logical sequence, and it will have been carefully reviewed by others.
Now let's look at this example query.
for $s in doc('Ps.xml')//SYMOCC
where $s/DATE > '2012-06-05'
  and (every $s1 in doc('Ps.xml')//SYMOCC
       satisfies not($s1/DATE > '2012-06-05')
                 or $s1/DATE >= $s/DATE)
return $s
I actually think this is a very poor answer to the question, but let's analyse what it means.
Firstly, you have to know the language pretty well to know the precedence of the operators, specifically, whether the "or xxxx" clause is part of the "satisfies" condition or not. In fact it is, as I have tried to show in my indentation - but it would be better to use parentheses to make it clear.
The query is looking for SYMOCC elements in doc('Ps.xml') that satisfy two conditions: (a) the date D must be after 2012-06-05, and (b) every date in the document must either be not after 2012-06-05, or >= D. Those two conditions correspond to the requirement that (a) the date must be after 2012-06-05, and (b) no other date after 2012-06-05 may be earlier than it.
Let's try and answer your questions:
How instructions are executed in this language? Is it comparing every single date in $s with any other date in s1? Is there any order it follows?
It's not an imperative, procedural language, it's a declarative language. It doesn't have instructions, and they aren't executed. It's a logic-based declarative language where you say what conditions the answer must satisfy, and the system works out how to get that answer. Different implementations will do it quite differently depending on their optimization strategy.
The difference between DATE < XXX and not(DATE >= XXX) arises when there is no DATE (some of the SYMOCC elements do not have a DATE child). If there is no DATE, then DATE < XXX and DATE >= XXX are both false.
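The effect of a missing DATE can be imitated in Python. This is only a sketch of the semantics, assuming that an XQuery general comparison is true when at least one pair of items from the two sides satisfies it; the function names are made up for the illustration, and dates are compared as ISO strings.

```python
def general_lt(left, right):
    """True if some item on the left is < some item on the right."""
    return any(a < b for a in left for b in right)

def general_ge(left, right):
    """True if some item on the left is >= some item on the right."""
    return any(a >= b for a in left for b in right)

cutoff = ["2012-06-05"]
with_date = ["2011-01-01"]   # a SYMOCC that has a DATE child
no_date = []                 # a SYMOCC without a DATE child

# With a DATE present, the two formulations agree.
agree_with_date = general_lt(with_date, cutoff) == (not general_ge(with_date, cutoff))

# With no DATE, both comparisons are false, so negating one of them gives true.
lt_no_date = general_lt(no_date, cutoff)
not_ge_no_date = not general_ge(no_date, cutoff)
```

Here `lt_no_date` is false while `not_ge_no_date` is true, which is exactly where the two conditions stop being interchangeable.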
Why is it OR rather than AND? Well, I think the way the query is expressed is a little perverse, but given the approach taken, it's correct. The date D we're looking for is the first one after 2012-06-05 if every other date is either (a) not later than 2012-06-05, or (b) on or after D.
Why is the final condition >= rather than >? Because there can be multiple symptoms appearing on the same date. If you wrote >, then you'd get no results in the event of duplicates.
Most of your questions seem to be less a problem with XQuery notation, and more a lack of understanding of how predicate logic works. But having said that, I would have produced a different solution to this problem. I would start by sorting all the events by date, then removing those before 2012-06-05, then removing those after the first date in the sequence. That would be something like
let $selected :=
  for $s in doc('Ps.xml')//SYMOCC[DATE]
  where $s/DATE > '2012-06-05'
  order by $s/DATE
  return $s
return $selected[DATE = $selected[1]/DATE]
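For readers more used to a general-purpose language, the same sort-then-filter approach can be sketched in Python with the standard library. This is an illustration only, not a replacement for the XQuery; `first_after` is a hypothetical helper, and the sample is an abbreviated copy of the data from the question.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample data from the question.
SAMPLE = """
<PATIENT_SYMS>
  <PATIENT><NAME>Bob</NAME>
    <SYMOCC><SYM><INT>high</INT><DESC>edema</DESC></SYM></SYMOCC>
  </PATIENT>
  <PATIENT><NAME>Ann</NAME>
    <SYMOCC><DATE>2015-08-03</DATE><SYM><INT>low</INT><DESC>asthma</DESC></SYM></SYMOCC>
    <SYMOCC><DATE>2017-05-03</DATE><SYM><INT>high</INT><DESC>nausea</DESC></SYM></SYMOCC>
  </PATIENT>
  <PATIENT><NAME>Tom</NAME>
    <SYMOCC><DATE>2011-01-01</DATE><SYM><INT>high</INT><DESC>headache</DESC></SYM></SYMOCC>
  </PATIENT>
</PATIENT_SYMS>
"""

def first_after(root, cutoff):
    """Return the SYMOCC elements carrying the earliest DATE after cutoff."""
    # Keep only occurrences that have a DATE, like //SYMOCC[DATE].
    dated = [(s.findtext("DATE").strip(), s)
             for s in root.iter("SYMOCC") if s.find("DATE") is not None]
    # ISO dates sort correctly as plain strings.
    selected = sorted((pair for pair in dated if pair[0] > cutoff),
                      key=lambda pair: pair[0])
    if not selected:
        return []
    earliest = selected[0][0]
    # Keep every occurrence sharing the earliest date, in case of duplicates.
    return [s for date, s in selected if date == earliest]

root = ET.fromstring(SAMPLE.strip())
result = first_after(root, "2012-06-05")
first_dates = [s.findtext("DATE") for s in result]
first_symptoms = [s.findtext("SYM/DESC") for s in result]
```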

How to finalize log file just after time is over when using logback SizeAndTimeBasedFNATP?

My goal is size-based (10 MB) and time-based (daily) archiving with zip compression, so I wrote the config like this:
<appender name="Behavior" class="ch.qos.logback.core.rolling.RollingFileAppender">
<Encoding>UTF-8</Encoding>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<FileNamePattern>${LOG_HOME}/%d{yyyy-MM-dd}.%i.log.zip</FileNamePattern>
<timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
<maxFileSize>10MB</maxFileSize>
</timeBasedFileNamingAndTriggeringPolicy>
</rollingPolicy>
<layout class="ch.qos.logback.classic.PatternLayout">
<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %logger{50} - %msg%n
</pattern>
</layout>
</appender>
But I ran into a problem. For example, today is Aug 10th, so logback is writing the log file "2013-08-10.0.log".
But the log file is not finalized (that is, closed and compressed to "2013-08-10.0.log.zip") at Aug 11th 0:00:00. In fact, it is not finalized until the first record after Aug 10th is written.
This means that if the first record after Aug 10th is written at Aug 11th 16:00:00, I can't get "2013-08-10.0.log.zip" when I scan the directory between Aug 11th 0:00:00 and 16:00:00. I only see "2013-08-10.0.log" and I can't tell whether it is finished.
How can I finalize the log file as soon as the day is over?
According to the logback-manual, the rollover is triggered on the first log-event AFTER the rollover time, not the time itself:
"For various technical reasons, rollovers are not clock-driven but depend on the arrival of logging events. For example, on 8th of March 2002, assuming the fileNamePattern is set to yyyy-MM-dd (daily rollover), the arrival of the first event after midnight will trigger a rollover. If there are no logging events during, say 23 minutes and 47 seconds after midnight, then rollover will actually occur at 00:23'47 AM on March 9th and not at 0:00 AM. Thus, depending on the arrival rate of events, rollovers might be triggered with some latency." (http://logback.qos.ch/manual/appenders.html)
So there is no configuration-only way to achieve what you intend. If it's this much of an issue, you could perhaps implement code in your application that sends a logging event right after midnight to make sure the rollover is triggered in a timely fashion. If you have no access to the main application's code, you could write a small separate application that just watches the clock, sends that one logging event after midnight every day, and uses the same appender.
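The timing part of that workaround is simple to sketch. The snippet below is in Python purely for illustration (the real application would be Java and would log through the same logback appender); `seconds_until_next_midnight` is a hypothetical helper, it uses naive local time, and it ignores daylight-saving changes.

```python
from datetime import datetime, timedelta

def seconds_until_next_midnight(now):
    """Seconds from `now` until the next local midnight (naive datetimes)."""
    tomorrow = (now + timedelta(days=1)).date()
    midnight = datetime.combine(tomorrow, datetime.min.time())
    return (midnight - now).total_seconds()

# A scheduler would sleep for this long (plus a small margin), emit one
# log event through the shared appender, and then repeat every day.
delay_late_evening = seconds_until_next_midnight(datetime(2013, 8, 10, 23, 59, 30))
delay_afternoon = seconds_until_next_midnight(datetime(2013, 8, 10, 16, 0, 0))
```

Waking up slightly after midnight rather than exactly at it avoids sending the event a moment too early, in which case it would not count as the first event of the new day.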

What password length should I use for a Jasypt-hashed password?

I'm using Java 6, Jasypt, and BouncyCastle to generate hashed passwords. I'm storing them in a MySQL 5.5 db with a default UTF-8 character encoding. I'm trying to figure out how long my VARCHAR password field should be given that I'm using a SHA-256 hashing algorithm and a RandomSaltGenerator of 20 bytes. Here's my declaration in my Spring application context:
<beans:bean id="bcProvider" class="org.bouncycastle.jce.provider.BouncyCastleProvider" />
<beans:bean id="jasyptStringDigester" class="org.jasypt.digest.StandardStringDigester">
<beans:property name="algorithm">
<beans:value>SHA-256</beans:value>
</beans:property>
<beans:property name="provider">
<beans:ref bean="bcProvider" />
</beans:property>
<beans:property name="saltGenerator">
<beans:ref bean="saltGenerator" />
</beans:property>
<beans:property name="saltSizeBytes" value="20" />
</beans:bean>
Thanks for any guidance, - Dave
The output of the SHA-256 hash function is, perhaps unsurprisingly, 256 bits long. The number of characters that makes depends on how you encode it.
A straight binary encoding, 8 bits per byte, would give you 32 bytes, but if you're storing the hash output in a text field, you're probably encoding it using e.g. Base64 (6 bits per char, padded to a multiple of 4 chars, for a total of 44 chars) or possibly hexadecimal (4 bits per char, 64 chars total).
In addition to the hash, it's common for a password field to contain the salt needed to reconstruct it as well. This will add some additional number of characters, which will depend on the exact password hashing scheme, output encoding and parameters (such as your saltSizeBytes) chosen.
Anyway, if the output length of your password hashing method isn't explicitly documented, the easiest way to find it might be to just feed it some test passwords and see what it returns. (Typically, the output should be of constant length.) Then, if you want, add some margin in the database definition just to be sure.
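The arithmetic above is easy to check empirically. The following Python sketch only measures encoded lengths; it assumes, for illustration, that the 20-byte salt is stored in front of the digest and that the whole thing is Base64-encoded, which may or may not match what your digester actually emits, so verify against its real output.

```python
import base64
import hashlib
import os

salt = os.urandom(20)                                   # 20 random salt bytes
digest = hashlib.sha256(salt + b"test password").digest()

raw_len = len(digest)                                   # bytes in the raw hash
b64_len = len(base64.b64encode(digest))                 # Base64 of the hash alone
hex_len = len(digest.hex())                             # hexadecimal of the hash alone

# Assumed layout: salt followed by hash, Base64-encoded together.
salted_b64_len = len(base64.b64encode(salt + digest))
```

Under that assumed layout the stored value is 52 bytes before encoding and 72 characters after Base64, so a VARCHAR with some margin above that would be enough.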
Oh, and as Waleed Khan notes, you really should be using a password hashing method that implements key stretching, such as PBKDF2.
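As a rough illustration of key stretching, here is a minimal PBKDF2 sketch using Python's standard library. The helper names and the iteration count are assumptions made for the example, not recommendations for a specific deployment, and this is not how Jasypt is configured.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100_000):
    """Derive a 32-byte key from the password with PBKDF2-HMAC-SHA256."""
    salt = os.urandom(20) if salt is None else salt
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, derived

def verify_password(password, salt, expected, iterations=100_000):
    """Recompute the derived key and compare it in constant time."""
    _, derived = hash_password(password, salt, iterations)
    return hmac.compare_digest(derived, expected)

salt, stored = hash_password("s3cret")
ok = verify_password("s3cret", salt, stored)
bad = verify_password("wrong", salt, stored)
```

The salt and the iteration count have to be stored alongside the derived key, since both are needed to verify a password later.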

Method/Function arguments: Value then index or vice versa

Which would you do:
setter(index, value)
or
setter(value, index)
I would say the first. I normally put the high level argument first, i.e. the index decides where to place the value in.
From a computer's perspective: first you need the location of where to store, when that is found, the value can be set.
It is similar to setting a property of e.g. an element of a car:
SetWheelDiameter(CarModel model, Part.Wheels, Wheel.Diameter, 19.0)
Parameters are from high level to low level.
Googling:
"T value int index"
suggests that it is a lot more common than:
"int index T value"

What is the difference between FUNCALL and #'function-name in common lisp?

I am reading through a book for homework, and I understand that using #' is treating the variable as a function instead of a variable. But I am a little hazy on FUNCALL. I understand that lisp makes object out of variables, so is the function name just a 'pointer' (may be a bad word, but hopefully you get what I mean), in which case you use #' to invoke it, or is funcall the only way to invoke them? ex.
(defun plot (fn min max step)
  (loop for i from min to max by step do
    (loop repeat (funcall fn i) do (format t "*"))
    (format t "~%")))
couldn't I just do:
(defun plot (fn min max step)
  (loop for i from min to max by step do
    (loop repeat #'(fn i) do (format t "*"))
    (format t "~%")))
I guess my confusion lies in what exactly is in the function names. When I read the book, it said that the variable's value is what will be the function object.
#'function-name is (function function-name). Nothing is called, evaluating either results in the function associated with function-name (the object representing the function). funcall is used to call functions.
See funcall and function in the HyperSpec.
Sample session using both:
CL-USER> (defun square (x) (* x x))
SQUARE
CL-USER> #'square
#<FUNCTION SQUARE>
CL-USER> (function square)
#<FUNCTION SQUARE>
CL-USER> (funcall #'square 3)
9
CL-USER> (funcall 'square 3)
9
The second invocation of funcall works because it also accepts a symbol as function designator (see the link for funcall above for details).
The #' and funcall notations are needed in Common Lisp because this language is a so-called "Lisp-2" where a given symbol can have two separate and unrelated main "meanings" normally listed as
When used as first element of a form it means a function
When used in any other place it means a variable
These are approximate explanations, as you will see in the following example that "first element of a form" and "any other place" are not correct definitions.
Consider for example:
(defun square (x) (* x x))
(let ((square 12))
  (print (square square)))
the above code prints 144... it may seem surprising at first but the reason is that the same name square is used with two different meanings: the function that given an argument returns the result of multiplying the argument by itself and the local variable square with value 12.
In the first and third uses of the name square, the meaning is the function named square. The second and fourth uses instead refer to a variable named square.
How can Common Lisp decide which is which? the point is the position... after defun it's clearly in this case a function name, like it's a function name in the first part of (square square). Likewise as first element of a list inside a let form it's clearly a variable name and it's also a variable name in the second part of (square square).
This looks pretty psychotic... doesn't it? Well there is indeed some splitting in the Lisp community about if this dual meaning is going to make things simpler or more complex and it's one of the main differences between Common Lisp and Scheme.
Without getting into the details I'll just say that this apparently crazy choice has been made to make Lisp macros more useful, providing enough hygiene to make them working nicely without the added complexity and the removed expressiveness power of full hygienic macros. For sure it's a complication that makes it harder to explain the language to whoever is learning it (and that's why Scheme is considered a better (simpler) language for teaching) but many expert lispers think that it's a good choice that makes the Lisp language a better tool for real problems.
Also in human languages the context plays an important role anyway and there it's not a serious problem for humans that sometimes the very same word can be used with different meanings (e.g. as a noun or as a verb like "California is the state I live in" or "State your opinion").
Even in a Lisp-2 you need however to use functions as values, for example passing them as parameters or storing them into a data structure, or you need to use values as functions, for example calling a function that has been received as parameter (your plot case) or that has been stored somewhere. This is where #' and funcall come into play...
#'foo is indeed just a shortcut for (function foo), exactly like 'x is a shortcut for (quote x). This "function" thing is a special form that given a name (foo in this case) returns the associated function as a value, that you can store in variables or pass around:
(defvar *fn* #'square)
in the above code for example the variable *fn* is going to receive the function defined before. A function value can be manipulated as any other value like a string or a number.
funcall is the opposite, allowing to call a function not using its name but by using a value...
(print (funcall *fn* 12))
the above code will display 144... because the function that was stored in the variable *fn* now is being called passing 12 as argument.
If you know the "C" programming language an analogy is considering (let ((p #'square))...) like taking the address of the function square (as with { int (*p)(int) = &square; ...}) and instead (funcall p 12) is like calling a function using a pointer (as with (*p)(12) that "C" allows to be abbreviated to p(12)).
The admittedly confusing part in Common Lisp is that you can have both a function named square and a variable named square in the same scope and the variable will not hide the function. funcall and function are two tools you can use when you need to use the value of a variable as a function or when you want a function as a value, respectively.
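For contrast, here is a sketch in Python, which keeps functions and variables in a single namespace, so neither #' nor funcall has a counterpart there. The plot function from the question is adapted to return its lines instead of printing them; `plot_lines` and `fn_value` are names chosen for this illustration.

```python
def square(x):
    return x * x

def plot_lines(fn, lo, hi, step):
    """Return one string per sample, with fn(i) asterisks for each i."""
    # The function arrives as an ordinary value and is called directly.
    return ["*" * fn(i) for i in range(lo, hi + 1, step)]

fn_value = square        # roughly (defvar *fn* #'square)
result = fn_value(12)    # roughly (funcall *fn* 12)
lines = plot_lines(square, 1, 3, 1)
```

The price of the single namespace is that a local variable named square would hide the function of the same name, which is the situation Common Lisp's separate namespaces avoid.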