RapidMiner Text Processing: How to write n-grams to file

How can I write n-grams extracted from text to a new XLS or CSV file?
The process I created is shown below. I would like to know how to connect the Write Document operator, and at which level: in the Main Process or in the Vector Creation subprocess? Which pipe goes where?
Screenshots: Main Process, Vector Creation process, n-grams produced, Write Document operator.
I am using RapidMiner Studio 6.0.003 Community Edition
EDIT Solution:

There are two outputs from the Process Documents from Files operator. The top one is an example set and will correspond to the document vector generated by the operator. The bottom one is a word list that contains all the different words, including n-grams, that form the attributes within the document vector.
To write the word list to a file, you have to convert it to an example set using the WordList to Data operator. The example set that is produced can then be written to CSV or XLSX using the Write CSV or Write Excel operators.

Related

xlsread in Octave returns zero values

I am trying to read a CSV file in Octave. The file contains a table with both numeric and text data, as well as date and hour information. In addition, the first line is in a different format than the rest of the lines, since it contains titles.
csvread can only read numeric data (according to the Octave help), so I tried using xlsread as follows:
[NUMARR, TXTARR, RAWARR, LIMITS] = xlsread ('Line.csv')
I get only the NUMARR matrix of numeric values. All the other returned variables are empty; their dimensions are 0x0.
How do I get all the text and all other information?
To solve this issue, open your CSV file in Windows Notepad and save it with ANSI encoding instead of Unicode.
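If re-encoding the file is not an option, you can also bypass xlsread and read the mixed file directly with textscan. A minimal Octave sketch; the file name and the '%s %s %f %f' column layout are assumptions, so adjust the format string to the real columns:
fid = fopen('Line.csv', 'r');
header = fgetl(fid);                                     % read and discard the title line
cols = textscan(fid, '%s %s %f %f', 'Delimiter', ',');   % assumed layout: two text columns, two numeric columns
fclose(fid);
texts  = cols{1};                                        % text columns come back as cell arrays of strings
values = cell2mat(cols(3:4));                            % numeric columns collected into one matrix
Unlike xlsread here, textscan returns one cell per column, so the text columns are preserved instead of coming back empty.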

Skip error rows from flat file in SSIS

I am trying to load data from a flat file. The file is around 2.5 GB in size and the row count is close to a billion. I am using a Flat File Source inside a DFT. A few rows inside the file do not follow the column pattern; for example, there is an extra delimiter, or a text qualifier appears as the value of one column. I want to skip those rows and load the rest, which have the correct format. I am using SSIS 2014. The Flat File Source inside the DFT is failing. I have set the AlwaysCheckForRowDelimiters property to false but it still does not work. Since the file is so huge, manually opening and fixing it is not possible. Kindly help.
I have the same idea as Nick.McDermaid but I can maybe help you a bit more.
You can clean your file with a regular expression, applied in a script.
You just need to define a regex that matches lines with the number of delimiters you want; all other lines should be deleted.
Here is a visual example executed in Notepad++ (screenshot).
Here is the pattern used for my example:
^[A-Z]*;[A-Z]*;[A-Z]*;[A-Z]*$
And the data sample:
AA;BB;CC;DD
AA;BB;CC;DD
AA;BB;CC;DD;EE
AA;BB;CC;DD
AA;BB;CC
AA;BB;CC;DD
AA;BB;CC;DD
You can try it online: https://regex101.com/r/PIYIcY/1
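To apply the same idea in a script rather than an editor, here is a minimal Octave sketch; the file names are placeholders, and the character classes are loosened from the letters-only pattern above so any field content is accepted. The file is streamed line by line, so the 2.5 GB input never has to fit in memory:
fin  = fopen('input.csv', 'r');    % placeholder input file name
fout = fopen('clean.csv', 'w');    % placeholder output file name
line = fgetl(fin);
while ischar(line)
  % keep only lines with exactly four semicolon-separated fields
  if ~isempty(regexp(line, '^[^;]*;[^;]*;[^;]*;[^;]*$', 'once'))
    fprintf(fout, '%s\n', line);
  end
  line = fgetl(fin);
end
fclose(fin);
fclose(fout);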

Taking a count in RapidMiner

How do I take a row count of a list that is in a Word document? If the same list is in Excel I am able to take the count using the Aggregate operator, but with a Word document it is not working.
I recommend the answer from @awchisholm as it's the easiest solution. However, if you have several Word documents this might become impractical.
In this case you can use the Loop Zip Files operator to unzip the Word document and look inside it for the file /word/document.xml. Using RapidMiner's text functions (or Read XML), look for each instance of <w:p ...>...</w:p>; each one represents a new line, so you can count them from there.
There is also an XML file in the unzipped directory, /docProps/app.xml, which you can read to find meta information about the document such as the number of words, characters, and pages. Unfortunately I've found it unreliable for the number of lines, which is why I recommend searching for the <w:p> tag instead.
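Once document.xml has been extracted, the counting step itself is short. As a minimal sketch of the idea outside RapidMiner (Octave here; the path is assumed to point at the unzipped file):
xml  = fileread('word/document.xml');        % assumes the .docx has already been unzipped
hits = regexp(xml, '<w:p[ />]', 'start');    % opening <w:p> tags, with or without attributes
rowCount = numel(hits)                       % one <w:p> per paragraph/line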
RapidMiner cannot easily read Word documents. You have to save the document as a text file and use the Read CSV operator to read the file.

How to read a file where one column's data is present in another column using Talend Data Integration

I get data from a CSV format daily.
Example data looks like:
Emp_ID emp_leave_id EMP_LEAVE_reason Emp_LEAVE_Status Emp_lev_apprv_cnt
E121 E121- 21 Head ache, fever, stomach-ache Approved 16
E139 E139_ 5 Attending a marraige of my cousin Approved 03
Here you can see that the emp_leave_id and EMP_LEAVE_reason column data is shifted/scattered into the next columns.
The problem is that with tFileInputDelimited and various reading patterns I couldn't load the data correctly into my target database; mainly, I'm not able to read the data correctly with that component in Talend.
Is there a way I can properly parse this CSV to get my data in the format that I want?
This is probably a TSV file. Not sure about Talend, but uniVocity can parse these files for you:
TsvDataStoreConfiguration tsv = new TsvDataStoreConfiguration("my_TSV_datastore");
tsv.setLimitOfRowsLoadedInMemory(10000);
tsv.addEntities("/some/dir/with/your_files", "ISO-8859-1"); // all files in the given directory become accessible entities

JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("my_Database", myDataSource);
database.setLimitOfRowsLoadedInMemory(10000);

Univocity.registerEngine(new EngineConfiguration("My_ETL_Engine", tsv, database));

DataIntegrationEngine engine = Univocity.getEngine("My_ETL_Engine");
DataStoreMapping dataStoreMapping = engine.map("my_TSV_datastore", "my_Database");
EntityMapping entityMapping = dataStoreMapping.map("your_TSV_filename", "some_database_table");
entityMapping.identity().associate("Emp_ID", "emp_leave_id").toGeneratedId("pk_leave"); // assumes your database does not keep the original IDs
entityMapping.value().copy("EMP_LEAVE_reason", "Emp_LEAVE_Status").to("reason", "status"); // copies whatever columns you need

engine.executeCycle(); // executes the mapping
Do not use a CSV parser to parse TSV inputs. It won't handle escape sequences properly (for example, \t inside a value will come through as the escape sequence instead of a tab character), and it will surely break if a value has a quote in it (a CSV parser will try to find the closing quote character and will keep reading characters until it finds another quote).
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Natural ordering files in directory into a cell array using Octave

I have files being generated by another program/user that have names such as "jh-1.txt, jh-2.txt, ..., jh-100.txt, ..., jh-1024.txt". I'm extracting a column from these files, manipulating the data, and outputting it to a new matrix. The only problem is that Octave uses ASCII ordering, not natural ordering, when reading in the files, so the output matrix is not ordered in a natural way. My question is: can Octave sort file names in a natural order? I'm getting the file names in the standard way:
fileDirectory = '/path/to/directory';
filePattern = fullfile(fileDirectory, '*.txt'); % Selects only the txt files.
dataFiles = dir(filePattern); % Gets the info from the txt files in the directory.
baseFileName = {dataFiles.name}'; % Gets all the txt file names.
I can't rename the files because this is a script for another user. They are on a Windows machine and already have Octave installed with Cygwin, and I don't want to make them use the command line more than they have to because they are unfamiliar with it. Alternatively, it would be nice to have the output with the file names in a column, but I haven't figured that one out either (I'm a bit of a noob with Octave myself). That way the user could use Excel (which they are familiar with) to sort the columns.
I don't think there's a built-in natural sort in Octave. However, there is a natural sort submission on MathWorks' File Exchange. I've not used it, but the comments imply it works in Octave too.
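That said, the naming pattern in the question (one run of digits per name) makes a hand-rolled natural sort quite short. A minimal sketch, assuming every file name contains exactly one numeric run:
names = {dataFiles.name}';                   % e.g. jh-1.txt, jh-10.txt, jh-2.txt in ASCII order
nums  = cellfun(@(s) str2double(regexp(s, '\d+', 'match', 'once')), names);
[~, order] = sort(nums);                     % order by the embedded number
sortedNames = names(order);                  % jh-1.txt, jh-2.txt, ..., jh-10.txt
Writing sortedNames out as the first column of the output matrix would also let the user re-sort in Excel, as suggested above.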