I have a process that reads a text file and passes it to a Process Documents from Data operator containing a Tokenize operator.
It works normally, but when I change the source of Process Documents from Data to Read Excel, the output is empty. I suspect I am making a mistake: perhaps Read Excel cannot be connected to Process Documents from Data directly, and I must first read every column of the Excel file and then connect that to Process Documents from Data.
Can anybody help me connect an Excel file to Process Documents from Data?
PS: My goal is to read an Excel file and show the words that are repeated more than 3 times in a column of that file.
Sample file is: [not preserved]
Since you didn't include your process or input data, may I suggest an alternative that doesn't use Documents at all?
If your goal is to find entries that occur repeatedly in a specific column of an Excel file, you can do this with three operators: Read Excel, Aggregate and Filter Examples.
Use Read Excel to extract the column as an example set with a single attribute (e.g. words). Aggregate the words attribute with the count function and group by words, which gives you the count per word. Finally, use Filter Examples to keep only the words with a count of 3 or more (the example process uses ge.3; if you need strictly more than 3, gt.3 should do it).
Example process (re-run the import configuration wizard for your specific setup):
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="D:\words.xlsx"/>
<parameter key="imported_cell_range" value="A1:A100"/>
<list key="annotations"/>
<parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="words.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.0.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
<list key="aggregation_attributes">
<parameter key="words" value="count"/>
</list>
<parameter key="group_by_attributes" value="words"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="count(words).ge.3"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
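For readers who want to check the logic outside RapidMiner, here is a minimal Python sketch of what the Aggregate (count, grouped by words) and Filter Examples steps compute. The function name `frequent_words`, the `min_count` parameter and the sample list are illustrative assumptions; they are not part of the process above.

```python
from collections import Counter

def frequent_words(words, min_count=3):
    """Return {word: count} for words occurring at least min_count times."""
    # Same role as Aggregate: count of "words", grouped by "words".
    counts = Counter(words)
    # Same role as Filter Examples with count(words) >= min_count.
    return {w: c for w, c in counts.items() if c >= min_count}

# Hypothetical column contents standing in for column A of the Excel file.
column = ["apple", "pear", "apple", "fig", "apple", "pear", "fig", "fig", "fig"]
print(frequent_words(column))
```

Passing `min_count=4` corresponds to the "more than 3 times" reading of the question.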
I'm trying to filter an example set of commercial properties in RapidMiner. Many of the properties are duplicated because the data table includes the transaction history, and many of the properties have been sold more than once over the period it covers. What I want to do is filter out all but the most recent transaction for each property.
I can't figure out how to drop everything except the record with the most recent transaction date. Any help would be appreciated.
You should post a standalone reproducible example, including data, that shows what you have tried so far.
Without this, the general advice is along these lines: use the Aggregate operator to find the maximum date for each property, then use the Join operator to inner join the original example set with the example set containing the maxima, using both the property and the date as keys.
Here's a toy example using the Iris data set that might be applicable in your case.
<?xml version="1.0" encoding="UTF-8"?>
<process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="313" y="187">
<list key="aggregation_attributes">
<parameter key="a1" value="maximum"/>
</list>
<parameter key="group_by_attributes" value="label"/>
</operator>
<operator activated="true" class="join" compatibility="7.4.000" expanded="true" height="82" name="Join" width="90" x="514" y="187">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="label" value="label"/>
<parameter key="a1" value="maximum(a1)"/>
</list>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
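The aggregate-then-join idea can also be sketched in plain Python, which may help when checking the result. The field names `property_id`, `sale_date` and `price` and the sample rows are hypothetical; the real attribute names depend on your data.

```python
from datetime import date

def latest_per_property(rows):
    """Keep only the most recent transaction for each property."""
    # Step 1 (Aggregate): maximum sale date per property.
    latest = {}
    for row in rows:
        pid = row["property_id"]
        if pid not in latest or row["sale_date"] > latest[pid]:
            latest[pid] = row["sale_date"]
    # Step 2 (inner Join on property and date): keep rows matching the maximum.
    return [r for r in rows if r["sale_date"] == latest[r["property_id"]]]

# Hypothetical transaction history: P1 was sold twice, P2 once.
rows = [
    {"property_id": "P1", "sale_date": date(2015, 3, 1), "price": 100},
    {"property_id": "P1", "sale_date": date(2018, 7, 9), "price": 150},
    {"property_id": "P2", "sale_date": date(2016, 1, 20), "price": 80},
]
result = latest_per_property(rows)
print(result)
```

As with the join, if a property has two transactions on the same latest date, both rows are kept.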
I have a .xls file with 2 columns. One is called "msgdate" and contains values like "20160314" (yyyyMMdd); the other is called "msgtime" and contains values like "111215" (HHmmss). I would like to combine these 2 columns into a single date_time attribute so I can plot the values. I have tried a few things but get an unparseable date error. Things I've tried:
Importing the file with msgdate as a date with the format yyyyMMdd. This works, but I can't set a time format during the import without ruining the date format.
Importing the file with msgdate as a date with the format yyyyMMdd and msgtime as an integer, then using the Numerical to Date operator. However, the generated msgtime value is not correct (the result is Wed Dec 31 18:01:51 CST 1969).
Grateful for any knowledge provided, and thank you for taking the time to read this.
The easiest way is to concatenate the two fields into one and then parse the result into a single new date_time attribute. The following process shows this working; it assumes the two input fields are already nominal.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.0.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="136">
<list key="attribute_values">
<parameter key="msgdate" value="&quot;20160314&quot;"/>
<parameter key="msgtime" value="&quot;111215&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.0.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="136">
<list key="function_descriptions">
<parameter key="datetime" value="msgdate+msgtime"/>
</list>
</operator>
<operator activated="true" class="nominal_to_date" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Date (3)" width="90" x="380" y="136">
<parameter key="attribute_name" value="datetime"/>
<parameter key="date_type" value="date_time"/>
<parameter key="date_format" value="yyyyMMddHHmmss"/>
<parameter key="keep_old_attribute" value="true"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Nominal to Date (3)" to_port="example set input"/>
<connect from_op="Nominal to Date (3)" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Hope this helps to get you along the road.
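The same concatenate-then-parse step can be sketched in Python, where the format `yyyyMMddHHmmss` corresponds to `%Y%m%d%H%M%S`. The helper name `combine` and the `zfill` padding are assumptions added for illustration. The second part is only an arithmetic check: reading 111215 as milliseconds since the Unix epoch, shown in a UTC-6 zone, gives the 1969 value reported in the question, which suggests (but does not prove) that this is how the integer was interpreted.

```python
from datetime import datetime, timedelta, timezone

def combine(msgdate: str, msgtime: str) -> datetime:
    """Concatenate date and time strings, then parse them in one go."""
    # zfill is an assumption: it restores leading zeros if msgtime was
    # read as a number (e.g. 91215 instead of 091215).
    return datetime.strptime(msgdate + msgtime.zfill(6), "%Y%m%d%H%M%S")

print(combine("20160314", "111215"))  # 2016-03-14 11:12:15

# Reading 111215 as milliseconds since 1970-01-01 UTC, shown in UTC-6 (CST):
cst = timezone(timedelta(hours=-6))
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_millis = (epoch + timedelta(milliseconds=111215)).astimezone(cst)
print(as_millis)  # 1969-12-31 18:01:51.215000-06:00
```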
I have comma-separated transaction (basket) data in itemset format:
citrus fruit,semi-finished,bread,margarine
tropical fruit,yogurt,coffee,milk
yogurt,cream,cheese,meat spreads
etc
where each row indicates the items purchased in a single transaction.
I loaded this file into RapidMiner using the Read CSV operator, but I could not find any operator to transform the data for FP-Growth and association rule mining.
Is there any way to read this type of file in RapidMiner for association rule mining?
I finally understood what you meant; sorry I was being slow. This can be done using operators from the Text Processing Extension, which you have to install from the RapidMiner repository. Once you have done that, you can try this process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.000" expanded="true" height="68" name="Read CSV" width="90" x="246" y="85">
<parameter key="csv_file" value="C:\Temp\is.txt"/>
<parameter key="column_separators" value="\r\n"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=","/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The trick is to use Read CSV to read the original file in but use end of line as the delimiter. This reads the entire line in as a polynominal attribute. From there, you have to convert this to text so that the text processing operators can do their work. The Process Documents from Data operator is then used to make the final example set. The important point is to use the Tokenize operator to split the lines into words separated by commas.
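To make the shape of the resulting example set concrete, here is a rough Python sketch of the same idea: split each line on commas and build one row per transaction with one column per item, holding term occurrences. The function name `baskets_to_matrix` is hypothetical, and stripping whitespace around items is an assumption that the Tokenize operator may not share.

```python
def baskets_to_matrix(lines):
    """Turn comma-separated basket lines into (items, occurrence matrix)."""
    # Tokenize: split every line on commas; each non-empty token is an item.
    baskets = [
        [item.strip() for item in line.split(",") if item.strip()]
        for line in lines
    ]
    # One column per distinct item, in sorted order for reproducibility.
    items = sorted({item for basket in baskets for item in basket})
    # Term occurrences: how often each item appears in each transaction.
    matrix = [[basket.count(item) for item in items] for basket in baskets]
    return items, matrix

lines = [
    "citrus fruit,semi-finished,bread,margarine",
    "tropical fruit,yogurt,coffee,milk",
    "yogurt,cream,cheese,meat spreads",
]
items, matrix = baskets_to_matrix(lines)
print(items)
for row in matrix:
    print(row)
```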
Sorry if this is a very novice question, but I have only recently started exploring RapidMiner. I have used it to cluster my sample data (using k-means clustering). My question is: if I cluster data from a raw Excel file, how do I get my data back out, split into K clusters, in an Excel file? I know how to create the clusters and switch between the Design and Results screens.
Thanks in advance.
Hi and welcome to Stack Overflow and RapidMiner.
If I understand your question correctly, you read your data from Excel, run a clustering and then want to write the individual clusters back to Excel.
If you want to do it manually, you can use the "Filter Examples" operator and filter for the specific cluster.
You can also do it automatically with the "Loop Values" operator: set the loop attribute to cluster and use the iteration macro inside the loop to filter your data. You could then store your data, using the iteration macro for the file name as well.
See the sample process below (you can copy it and paste it directly into the XML panel in RapidMiner):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.1.000-SNAPSHOT" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.1.000-SNAPSHOT" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"/>
<operator activated="true" class="generate_id" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Generate ID" width="90" x="246" y="34"/>
<operator activated="true" class="k_means" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Clustering" width="90" x="447" y="34">
<parameter key="k" value="5"/>
</operator>
<operator activated="true" class="loop_values" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Loop Values" width="90" x="715" y="34">
<parameter key="attribute" value="cluster"/>
<process expanded="true">
<operator activated="true" breakpoints="after" class="filter_examples" compatibility="7.1.000-SNAPSHOT" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="cluster.equals.%{loop_value}"/>
</list>
</operator>
<connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Loop Values" to_port="example set"/>
<connect from_op="Loop Values" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
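The loop-and-filter pattern, including the "use the loop value in the file name" step that the sample process leaves out, can be sketched in Python as follows. This is only an illustration of the logic: it writes CSV files instead of Excel to stay within the standard library, and the function name `write_clusters`, the column names and the sample rows are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def write_clusters(rows, out_dir, cluster_key="cluster"):
    """Write one CSV file per cluster value; return {value: path}."""
    # Loop Values + Filter Examples: group the rows by their cluster value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_key]].append(row)
    paths = {}
    for value, members in groups.items():
        # The cluster value doubles as the file name, like the loop macro.
        path = Path(out_dir) / f"{value}.csv"
        with path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(members[0].keys()))
            writer.writeheader()
            writer.writerows(members)
        paths[value] = path
    return paths

# Hypothetical clustered data, as it might look after the clustering step.
rows = [
    {"id": 1, "att1": 0.2, "cluster": "cluster_0"},
    {"id": 2, "att1": 5.1, "cluster": "cluster_1"},
    {"id": 3, "att1": 0.4, "cluster": "cluster_0"},
]
with tempfile.TemporaryDirectory() as tmp:
    written = write_clusters(rows, tmp)
    print(sorted(p.name for p in written.values()))
```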
I'm using RapidMiner Studio 5.3 with the 'Read CSV' operator and the 'first row as names' parameter checked. After that I can't use the 'Rename' or 'Set Role' operators because "attribute name is undefined". It's as if the file is read fine but the attribute names aren't passed on.
The Meta Data View, with a breakpoint after the 'Read CSV' operator, shows that it recognizes the attribute names.
But the 'Set Role' operator can't find the attribute names.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="179" y="75">
<parameter key="csv_file" value="C:\Users\lffreitas\Documents\tae.csv"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role" width="90" x="380" y="75">
<list key="set_additional_roles"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Any clue what might be happening here?
Do the following:
In the process pane, look for the paper clip icon at the top right (number 5) and click on it. That will do the trick. Cheers, Alex
The Set Role operator doesn't have an attribute selected so it fails. Fix this by choosing the attribute name from the drop down within the parameters for this operator.
This answer helped me solve the problem. But there is still an issue: the "Read URL" operator can't pass the attribute names to the next operator, "Rename", so there are some messages in the log. The picture shows what happens.