How to import multiple Excel files into RapidMiner

I am trying to load a folder containing three Excel files into RapidMiner at once.
Which operator do I need to do this (without selecting each file individually and using a separate Read Excel operator)?

There is a Loop Files operator that you can use to loop through a directory of files. Inside this operator's subprocess, use the Read Excel operator. The result is a collection of ExampleSets. There are multiple ways to deal with a collection of ExampleSets; to concatenate them into a single ExampleSet, use the Append operator.
Here is a sample process XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="loop_files" compatibility="5.3.007" expanded="true" height="76" name="Loop Files" width="90" x="782" y="30">
<parameter key="directory" value="D:\xls"/>
<parameter key="filter" value="^.*\.xlsx?$"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.007" expanded="true" height="60" name="Read Excel" width="90" x="782" y="30">
<parameter key="excel_file" value="%{file_path}"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_port="out 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="5.3.007" expanded="true" height="76" name="Append" width="90" x="916" y="30"/>
<connect from_op="Loop Files" from_port="out 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
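If you want to sanity-check the loop-and-append logic outside RapidMiner, here is a minimal Python sketch (the file names and columns are made up for illustration) that mirrors Loop Files + Read Excel + Append using in-memory tables:

```python
# Sketch of the Loop Files -> Read Excel -> Append pattern.
# In RapidMiner each iteration reads one file; here each "file"
# is a hypothetical in-memory list of row dicts.
files = {
    "jan.xlsx": [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}],
    "feb.xlsx": [{"id": 3, "value": "c"}],
    "mar.xlsx": [{"id": 4, "value": "d"}],
}

combined = []                     # plays the role of the Append operator
for name, rows in files.items():  # plays the role of Loop Files
    combined.extend(rows)         # one "ExampleSet" per file

print(len(combined))  # 4 rows collected from 3 files
```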


Find a word in an Excel file in RapidMiner

I have a process that reads a text file and contains a Process Documents from Data operator with a Tokenize operator inside.
It works normally, but when I change the source of Process Documents from Data from the text file to Read Excel, the output is empty. I suspect my mistake is that Read Excel cannot be connected to Process Documents from Data directly, and that I must first read every column of the Excel file before connecting it.
Can anybody help me connect an Excel file to Process Documents from Data?
PS: My goal is to read an Excel file and show the words that are repeated more than 3 times in a column.
Sample file is:
Since you don't include your process or input data, may I simply suggest an alternative without Documents at all?
If your goal is to count entries in a specific column of an Excel file, you can do this with three operators: Read Excel, Aggregate and Filter Examples.
Use Read Excel to extract the column as an example set with a single attribute (e.g. words). Aggregate the words attribute with the count function, grouping by words as well (this gives you the desired count per word). Finally, use Filter Examples to keep only the words with a count of 3 or more.
Example process (re-run the import configuration wizard for your specific setup):
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="D:\words.xlsx"/>
<parameter key="imported_cell_range" value="A1:A100"/>
<list key="annotations"/>
<parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="words.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="aggregate" compatibility="9.0.003" expanded="true" height="82" name="Aggregate" width="90" x="179" y="34">
<list key="aggregation_attributes">
<parameter key="words" value="count"/>
</list>
<parameter key="group_by_attributes" value="words"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="count(words).ge.3"/>
</list>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
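The aggregate-then-filter logic is easy to verify outside RapidMiner too. Here is a small Python sketch (the word list is hypothetical) mirroring the count per word and the count(words).ge.3 filter:

```python
from collections import Counter

# Hypothetical contents of the single "words" column.
words = ["apple", "pear", "apple", "plum", "apple", "pear", "pear"]

# Aggregate: count per word (RapidMiner's count function with group-by).
counts = Counter(words)

# Filter Examples: keep words occurring 3 or more times (count(words).ge.3).
frequent = {w: n for w, n in counts.items() if n >= 3}
print(frequent)  # {'apple': 3, 'pear': 3}
```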

Generate ExampleSet with zeros in RapidMiner

What is the easiest/correct way to generate an ExampleSet in RapidMiner that looks like this:
The way I am using now:
Select Attributes was necessary because ‘Generate Data’ gave me a ‘label’ attribute which I don’t want
Three operators seem to be the minimum. You could use Generate Data by User Specification combined with Loop and Append. Here's an example...
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="7.5.000" expanded="true" height="82" name="Loop" width="90" x="246" y="136">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.5.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="238">
<list key="attribute_values">
<parameter key="attribute1" value="0"/>
<parameter key="anotherattribute" value="0"/>
</list>
<list key="set_additional_roles"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="7.5.000" expanded="true" height="82" name="Append" width="90" x="447" y="136"/>
<connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
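The loop-and-append construction above amounts to stacking one all-zero row per iteration. A rough Python sketch (the row count and attribute names are made up for illustration):

```python
# Sketch of Loop + Generate Data by User Specification + Append:
# each loop iteration yields one all-zero row; Append stacks them.
n_rows = 5  # hypothetical number of loop iterations
attributes = ["attribute1", "anotherattribute"]

rows = [{a: 0 for a in attributes} for _ in range(n_rows)]  # the loop
print(len(rows), rows[0])  # 5 {'attribute1': 0, 'anotherattribute': 0}
```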
Andrew

Filtering out the 'most recent' example record in RapidMiner

I'm trying to filter an example set of commercial properties in RapidMiner. Many of the properties are duplicated because the property transaction history is included in the data table, and many of the properties have been sold more than once over the period covered by the table. What I want to do is filter out all but the most recent transaction for each property.
I can't figure out how to filter all but the record with the most recent transaction date. Any help would be appreciated.
You should post a standalone reproducible example that includes data to show what you have tried so far.
Without this, the general advice would be along these lines: use the Aggregate operator to find the maximum date for each property, then use the Join operator to inner join the original example set with the example set containing the maxima.
Here's a toy example using the Iris data set that might be applicable in your case.
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="112" y="187">
<parameter key="repository_entry" value="//Samples/data/Iris"/>
</operator>
<operator activated="true" class="aggregate" compatibility="7.4.000" expanded="true" height="82" name="Aggregate" width="90" x="313" y="187">
<list key="aggregation_attributes">
<parameter key="a1" value="maximum"/>
</list>
<parameter key="group_by_attributes" value="label"/>
</operator>
<operator activated="true" class="join" compatibility="7.4.000" expanded="true" height="82" name="Join" width="90" x="514" y="187">
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="label" value="label"/>
<parameter key="a1" value="maximum(a1)"/>
</list>
</operator>
<connect from_op="Retrieve Iris" from_port="output" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Join" to_port="right"/>
<connect from_op="Aggregate" from_port="original" to_op="Join" to_port="left"/>
<connect from_op="Join" from_port="join" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
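The Aggregate-plus-Join idea translates directly outside RapidMiner as well. Here is a sketch on made-up transaction data, using the per-property maximum date to keep only the latest sale:

```python
# Hypothetical transaction history: several sales per property.
transactions = [
    {"property": "A", "date": "2015-03-01", "price": 100},
    {"property": "A", "date": "2016-07-15", "price": 120},
    {"property": "B", "date": "2014-11-30", "price": 200},
    {"property": "B", "date": "2017-01-02", "price": 210},
]

# Aggregate: maximum date per property (ISO dates compare as strings).
latest = {}
for t in transactions:
    if t["property"] not in latest or t["date"] > latest[t["property"]]:
        latest[t["property"]] = t["date"]

# Join: keep only the rows matching the per-property maximum date.
most_recent = [t for t in transactions if t["date"] == latest[t["property"]]]
print([t["price"] for t in most_recent])  # [120, 210]
```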

How to load transaction (basket) data in RapidMiner for association rule?

I have comma-separated transaction (basket) data in itemset format
citrus fruit,semi-finished,bread,margarine
tropical fruit,yogurt,coffee,milk
yogurt,cream,cheese,meat spreads
etc
where each row indicates the items purchased in a single transaction.
Using the Read CSV operator I loaded this file into RapidMiner, but I could not find any operator to transform this data for FP-Growth and association rule mining.
Is there any way to read such type of file in RapidMiner for association rule mining?
I finally understood what you meant - sorry I was being slow. This can be done using operators from the Text Processing extension. You have to install this from the RapidMiner repository. Once you have done so, you can try this process.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="7.0.000" expanded="true" height="68" name="Read CSV" width="90" x="246" y="85">
<parameter key="csv_file" value="C:\Temp\is.txt"/>
<parameter key="column_separators" value="\r\n"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="att1.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="85"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="85">
<parameter key="vector_creation" value="Term Occurrences"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.0.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=","/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The trick is to use Read CSV to read the original file but with end of line as the delimiter. This reads each entire line in as a single polynominal attribute. From there, you convert it to text so that the text processing operators can work on it. The Process Documents from Data operator is then used to build the final example set. The important point is to use the Tokenize operator to split the lines into words separated by commas.
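The tokenize-and-count step can be mirrored in a few lines of Python. This sketch (using the sample baskets from the question) builds the same kind of term-occurrence vectors that Process Documents from Data produces:

```python
from collections import Counter

# Basket lines from the question, one transaction per line (comma-separated).
lines = [
    "citrus fruit,semi-finished,bread,margarine",
    "tropical fruit,yogurt,coffee,milk",
    "yogurt,cream,cheese,meat spreads",
]

# Tokenize on commas, then build one term-occurrence vector per transaction,
# mirroring Process Documents from Data with "Term Occurrences".
vocabulary = sorted({item for line in lines for item in line.split(",")})
vectors = [
    [Counter(line.split(","))[term] for term in vocabulary]
    for line in lines
]
print(len(vocabulary), len(vectors))  # 11 unique items, 3 transactions
```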

RapidMiner - k-means query

Sorry if this is a very novice question, but I have recently started exploring RapidMiner. I have used it to cluster my sample data using k-means clustering. My question is: if I cluster data from a raw Excel file, how do I get my data back, split into k clusters, as Excel output? I know how to create the clusters and switch between the Design and Results screens.
Thanks in advance.
Hi and welcome to Stack Overflow and RapidMiner.
If I understand your question correctly, you read your data from Excel, perform clustering, and then want to write the individual clusters back to Excel.
If you want to do it manually, you can use the Filter Examples operator and filter for the specific cluster.
You can also do it automatically with the Loop Values operator: set the loop attribute to cluster and use the iteration macro inside the loop to filter your data. Then you can store the data, using the iteration macro for the file name as well.
See the sample process below (you can copy it and paste it in the XML panel directly in RapidMiner):
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.0.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.1.000-SNAPSHOT" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="7.1.000-SNAPSHOT" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34"/>
<operator activated="true" class="generate_id" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Generate ID" width="90" x="246" y="34"/>
<operator activated="true" class="k_means" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Clustering" width="90" x="447" y="34">
<parameter key="k" value="5"/>
</operator>
<operator activated="true" class="loop_values" compatibility="7.1.000-SNAPSHOT" expanded="true" height="82" name="Loop Values" width="90" x="715" y="34">
<parameter key="attribute" value="cluster"/>
<process expanded="true">
<operator activated="true" breakpoints="after" class="filter_examples" compatibility="7.1.000-SNAPSHOT" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="cluster.equals.%{loop_value}"/>
</list>
</operator>
<connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Loop Values" to_port="example set"/>
<connect from_op="Loop Values" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
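The Loop Values pattern boils down to grouping rows by their cluster label and handling each group separately. A minimal Python sketch (the data is made up, and the actual Excel write is left as a comment):

```python
from collections import defaultdict

# Hypothetical clustered rows: each example carries a cluster label,
# as produced by k-Means in RapidMiner.
examples = [
    {"id": 1, "cluster": "cluster_0"},
    {"id": 2, "cluster": "cluster_1"},
    {"id": 3, "cluster": "cluster_0"},
]

# Loop Values over the cluster attribute: group rows by their label.
by_cluster = defaultdict(list)
for row in examples:
    by_cluster[row["cluster"]].append(row)

# Per iteration you would write one file, e.g. f"{label}.xlsx";
# here we just report the group sizes instead of touching the disk.
for label, rows in sorted(by_cluster.items()):
    print(label, len(rows))  # cluster_0 2 / cluster_1 1
```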