How to format date / time attributes in RapidMiner

I have an .xls file with two columns. One is called "msgdate" and contains values like "20160314" (yyyyMMdd); the other is called "msgtime" and contains values like "111215" (HHmmss). I would like to combine these two columns into a single attribute of type date_time so I can plot the values. I have tried a few things but get an unparsable date error. Things I've tried:
1. Importing the file with msgdate as a date with the format yyyyMMdd. This works, but I can't set a time format during the import without ruining the date format.
2. Importing the file with msgdate as a date with the format yyyyMMdd and msgtime as an integer, then using the Numerical to Date operator. The msgtime value generated is not correct; the result is Wed Dec 31 18:01:51 CST 1969.
Grateful for any knowledge provided, and thank you for taking the time to read this.

The easiest way is to concatenate the two fields into one and then parse the result into a single new date_time attribute. The following process shows this working; it assumes the two input fields are already nominal.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.0.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="136">
<list key="attribute_values">
<parameter key="msgdate" value="&quot;20160314&quot;"/>
<parameter key="msgtime" value="&quot;111215&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
<list key="function_descriptions">
<parameter key="datetime" value="msgdate+msgtime"/>
</list>
</operator>
<operator activated="true" class="nominal_to_date" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Date (3)" width="90" x="380" y="136">
<parameter key="attribute_name" value="datetime"/>
<parameter key="date_type" value="date_time"/>
<parameter key="date_format" value="yyyyMMddHHmmss"/>
<parameter key="keep_old_attribute" value="true"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Nominal to Date (3)" to_port="example set input"/>
<connect from_op="Nominal to Date (3)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hope this helps to get you along the road.
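As a cross-check outside RapidMiner, the same concatenate-then-parse idea can be sketched in Python. This is only an illustration, not part of the process above: the variable names are made up, and "%Y%m%d%H%M%S" is the Python counterpart of the yyyyMMddHHmmss pattern. The second half shows one plausible explanation for the 1969 result in the question, assuming Numerical to Date reads the number as milliseconds since 1970-01-01 UTC.

```python
from datetime import datetime, timezone

# Values taken from the question; both columns are treated as text.
msgdate = "20160314"  # yyyyMMdd
msgtime = "111215"    # HHmmss

# Concatenate first, then parse once with a combined pattern.
combined = datetime.strptime(msgdate + msgtime, "%Y%m%d%H%M%S")
print(combined)

# Assumption: the integer 111215 is interpreted as milliseconds since the
# Unix epoch. That is roughly 111 seconds after 1970-01-01 00:00:00 UTC,
# which in a UTC-6 zone such as CST still falls on Dec 31, 1969.
as_epoch = datetime.fromtimestamp(111215 / 1000, tz=timezone.utc)
print(as_epoch)
```

If msgtime has already been imported as an integer, a time such as 091215 loses its leading zero; padding it back to six characters (for example with str(value).zfill(6) in this sketch) before concatenating keeps the pattern aligned.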

Related

Rapidminer count total occurrence and sort by date

I have a RapidMiner example set like this:
ID Issue Exp
100 9/8/2020 11/8/2020
100 8/5/2019 9/5/2019
101 6/3/2020 10/1/2020
102 8/15/2020 12/12/2020
I want to add a new column that counts the occurrences of each ID cumulatively, sorted by the earliest date, so we know how many occurrences there were as of each date.
The output should look like this:
ID Issue Exp Count
100 8/5/2019 9/5/2019 1
100 9/8/2020 11/8/2020 2
101 6/3/2020 10/1/2020 1
102 8/15/2020 12/12/2020 1
But when I aggregate by ID and count, I just get the total for each ID, repeated on every row with that ID. So ID 100 shows 2 on both rows.
What I want is a running count: for ID 100 there was only one issue date in 2019, so the count there is 1; when ID 100 appears again in 2020, the count becomes 2. The sort by date matters because it puts the occurrences of each ID in the correct order.
Any help is appreciated.
Thanks.
One approach is to use the Loop Values operator to loop through all the possible values of the ID attribute. Inside the loop, use the current value to filter the example set (which has already been sorted) and generate a new incrementing id for the filtered examples. Finally, append all the filtered example sets back together.
Here's the process and corresponding XML to do this.
<?xml version="1.0" encoding="UTF-8"?><process version="9.9.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve occById" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Local Repository/data/occById"/>
</operator>
<operator activated="true" class="blending:sort" compatibility="9.9.000" expanded="true" height="82" name="Sort" width="90" x="179" y="34">
<list key="sort_by">
<parameter key="ID" value="ascending"/>
<parameter key="Issue" value="ascending"/>
</list>
</operator>
<operator activated="true" class="concurrency:loop_values" compatibility="9.9.000" expanded="true" height="82" name="Loop Values" width="90" x="313" y="34">
<parameter key="attribute" value="ID"/>
<parameter key="iteration_macro" value="loop_value"/>
<parameter key="reuse_results" value="false"/>
<parameter key="enable_parallel_execution" value="true"/>
<process expanded="true">
<operator activated="true" class="filter_examples" compatibility="9.9.000" expanded="true" height="103" name="Filter Examples" width="90" x="112" y="34">
<parameter key="parameter_string" value="ID=%{loop_value}"/>
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="ID.eq.%{loop_value}"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
</operator>
<operator activated="true" class="generate_id" compatibility="9.9.000" expanded="true" height="82" name="Generate ID" width="90" x="313" y="34">
<parameter key="create_nominal_ids" value="false"/>
<parameter key="offset" value="0"/>
</operator>
<connect from_port="input 1" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="9.9.000" expanded="true" height="82" name="Append" width="90" x="447" y="34">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_op="Retrieve occById" from_port="output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
<connect from_op="Loop Values" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The input data was handcrafted and stored under the name occById in the local repository.
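To make the intended output concrete, here is a small Python sketch of the same sort-then-count-per-ID logic, using the four rows from the question. It is an illustration only and not a translation of the operators; the date format is assumed to be month/day/year, which a value like 8/15/2020 suggests.

```python
from collections import defaultdict
from datetime import datetime

# Rows from the question: (ID, Issue, Exp).
rows = [
    (100, "9/8/2020", "11/8/2020"),
    (100, "8/5/2019", "9/5/2019"),
    (101, "6/3/2020", "10/1/2020"),
    (102, "8/15/2020", "12/12/2020"),
]

def parse_us_date(text):
    # Assumption: dates are month/day/year.
    return datetime.strptime(text, "%m/%d/%Y")

# Sort by ID, then by issue date (the Sort operator's job).
rows_sorted = sorted(rows, key=lambda r: (r[0], parse_us_date(r[1])))

# Running count per ID (what Filter Examples + Generate ID achieve per loop).
seen = defaultdict(int)
result = []
for rec_id, issue, exp in rows_sorted:
    seen[rec_id] += 1
    result.append((rec_id, issue, exp, seen[rec_id]))

for row in result:
    print(*row)
```

The Issue column has to be sorted as a real date, not as text; sorted as a string, 9/8/2020 would come after 8/15/2020 only by accident of the first character.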

Find a word in excel file in Rapidminer

I have a process that reads a text file and uses a Process Documents from Data operator containing a Tokenize operator.
It works normally, but when I change the source of Process Documents from Data to Read Excel, the output is empty. I suspect I have made a mistake: perhaps Read Excel cannot be connected to Process Documents from Data directly, and I must read each column of the Excel file separately before connecting it.
Can anybody help me connect an Excel file to Process Documents from Data?
PS: My goal is to read an Excel file and show the words that are repeated in a column of the file more than 3 times.
Since you don't include your process or input data, may I simply suggest an alternative without Documents at all?
If your goal is to find entries in a specific column of an Excel file, you can do this in three operators: Read Excel, Aggregate and Filter Examples:
Use Read Excel to extract the column as an example set with a single attribute (e.g. words). Then use Aggregate on the words attribute with the count function, grouping by words; this gives the count per word. Finally, use Filter Examples to keep only words with a count of 3 or more.
Example process (re-run the import configuration wizard for your specific setup):
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="D:\words.xlsx"/>
<parameter key="imported_cell_range" value="A1:A100"/>
<list key="annotations"/>
<parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="words.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.0.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
<list key="aggregation_attributes">
<parameter key="words" value="count"/>
</list>
<parameter key="group_by_attributes" value="words"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="count(words).ge.3"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
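The group, count and filter steps can be illustrated in a few lines of Python. This is a sketch with made-up column contents, since the original sample file is not available; the threshold of 3 or more mirrors the ge.3 filter in the process above.

```python
from collections import Counter

# Hypothetical contents of the "words" column.
words = ["apple", "pear", "apple", "kiwi", "apple", "pear", "kiwi", "kiwi", "fig"]

# Aggregate: count occurrences, grouped by word.
counts = Counter(words)

# Filter Examples: keep words that occur 3 times or more.
frequent = {word: n for word, n in counts.items() if n >= 3}
print(frequent)
```

The question asks for words repeated "more than 3 times"; if that is meant strictly, the comparison would be n > 3 here, and the filter in the process would need a greater-than condition instead of greater-or-equal.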

Calculate Percentage Values

My RapidMiner process produces the following results:
Row No. Count
1 9.0
2 11.0
3 32.0
If I want to calculate:
(9/32)*100 and
(11/32)*100
from this result set, how would I do it?
The solution is not quite straightforward, as RapidMiner normally treats examples (rows) as independent of each other.
What you can do is extract the required value as a macro and use it in the Generate Attributes operator.
See the sample process below for a solution to your particular problem. Just copy and paste the XML into the process window in RapidMiner.
Also feel free to ask further, or re-post, questions in the RapidMiner community forum.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="85">
<list key="attribute_values">
<parameter key="Count" value="9"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="187">
<list key="attribute_values">
<parameter key="Count" value="11"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="112" y="340">
<list key="attribute_values">
<parameter key="Count" value="32"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="124" name="Append" width="90" x="380" y="187"/>
<operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="581" y="187">
<parameter key="macro" value="divisor"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="Count"/>
<parameter key="example_index" value="3"/>
<list key="additional_macros"/>
<description align="center" color="green" colored="true" width="126">Extracting the third value as a macro. It can then be called using the %{macro_name} syntax</description>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="782" y="187">
<list key="function_descriptions">
<parameter key="Percentage" value="(Count/eval(%{divisor}))*100"/>
</list>
<description align="center" color="green" colored="true" width="126">Creating a new Attribute (column) with the desired calculation&lt;br&gt;&lt;br&gt;Check the final paragraph of the help text for the &quot;Generate Attributes&quot; Operator for a description of how to work with macros</description>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
<connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
<connect from_op="Append" from_port="merged set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="581" resized="true" width="444" x="56" y="18">Generating sample data to fit the original problem</description>
</process>
</operator>
</process>
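The arithmetic the process is meant to perform can be checked with a short Python sketch. The names here are illustrative only: divisor plays the role of the extracted macro, and the list index 2 corresponds to the third example (Extract Macro counts examples from 1).

```python
# Count values from the question.
counts = [9.0, 11.0, 32.0]

# Extract Macro: take the value of the third row as the divisor.
divisor = counts[2]

# Generate Attributes: one percentage per row, relative to the divisor.
percentages = [count / divisor * 100 for count in counts]

for count, pct in zip(counts, percentages):
    print(count, pct)
```

With these inputs the first two rows come out as 28.125 and 34.375, and the last row is 100.0 because it is divided by itself.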

Filtering out 'most recent' example record in Rapidminer

I'm trying to filter an example set of commercial properties in RapidMiner. Many of the properties are duplicated because the property transaction history is included in the data table, and many of the properties have been sold more than once over the period the table covers. What I want to do is filter out all but the most recent transaction for each property.
I can't figure out how to keep only the record with the most recent transaction date. Any help would be appreciated.
You should post a standalone reproducible example that includes data to show what you have tried so far.
Without this, the general advice might be along these lines. Use the Aggregate operator to find the maximum date for each property, then use the Join operator to inner join the original example set with the example set containing the maxima.
Here's a toy example using the Iris data set that might be applicable in your case.
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="313" y="187">
<list key="aggregation_attributes">
<parameter key="a1" value="maximum"/>
</list>
<parameter key="group_by_attributes" value="label"/>
</operator>
<operator activated="true" class="join" compatibility="7.4.000" expanded="true" height="82" name="Join" width="90" x="514" y="187">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="label" value="label"/>
<parameter key="a1" value="maximum(a1)"/>
</list>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
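Applied to property data, the aggregate-then-join idea might look like the following Python sketch. The transactions, property identifiers and prices are invented for illustration, since the question does not include any data.

```python
from datetime import date

# Hypothetical transactions: (property id, sale date, price).
transactions = [
    ("P1", date(2015, 3, 1), 250000),
    ("P1", date(2018, 7, 15), 310000),
    ("P2", date(2016, 1, 20), 180000),
    ("P3", date(2014, 5, 5), 420000),
    ("P3", date(2019, 11, 30), 515000),
]

# Step 1 (Aggregate): latest sale date per property.
latest = {}
for prop, sold, _price in transactions:
    if prop not in latest or sold > latest[prop]:
        latest[prop] = sold

# Step 2 (inner Join on property and date): keep only rows matching the maximum.
most_recent = [t for t in transactions if latest[t[0]] == t[1]]

for row in most_recent:
    print(*row)
```

One thing to watch with this approach: if a property has two transactions on the same latest date, both rows match the join and both are kept.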

How to load transaction (basket) data in RapidMiner for association rule?

I have comma-separated transaction (basket) data in itemset format:
citrus fruit,semi-finished,bread,margarine
tropical fruit,yogurt,coffee,milk
yogurt,cream,cheese,meat spreads
etc
where each row indicates the items purchased in a single transaction.
I loaded this file into RapidMiner using the Read CSV operator, but I could not find any operator to transform the data for FP-Growth and association rule mining.
Is there any way to read such type of file in RapidMiner for association rule mining?
I finally understood what you meant; sorry, I was being slow. This can be done using operators from the Text Processing Extension, which you have to install from the RapidMiner repository. Once you have, you can try this process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.000" expanded="true" height="68" name="Read CSV" width="90" x="246" y="85">
<parameter key="csv_file" value="C:\Temp\is.txt"/>
<parameter key="column_separators" value="\r\n"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=","/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The trick is to use Read CSV to read the original file in but use end of line as the delimiter. This reads the entire line in as a polynominal attribute. From there, you have to convert this to text so that the text processing operators can do their work. The Process Documents from Data operator is then used to make the final example set. The important point is to use the Tokenize operator to split the lines into words separated by commas.
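To show what the resulting example set should look like, here is a Python sketch of the same transformation on the three sample baskets from the question: each line is split on commas and turned into one row of item occurrence counts. It illustrates the target shape only and does not use any RapidMiner API.

```python
# Sample lines from the question, one transaction per line.
lines = [
    "citrus fruit,semi-finished,bread,margarine",
    "tropical fruit,yogurt,coffee,milk",
    "yogurt,cream,cheese,meat spreads",
]

# Tokenize on commas only, so multi-word items such as "citrus fruit" stay whole.
baskets = [[item.strip() for item in line.split(",")] for line in lines]

# One column per distinct item (the word list of the document vector).
items = sorted({item for basket in baskets for item in basket})

# Term occurrences: how often each item appears in each transaction.
matrix = [[basket.count(item) for item in items] for basket in baskets]

print(items)
for row in matrix:
    print(row)
```

Splitting on commas rather than on non-letters is the important choice; otherwise "meat spreads" would become two separate items. As far as I know, FP-Growth expects binominal attributes, so a conversion step such as Numerical to Binominal is usually needed after the term occurrence matrix has been built.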