Weka is meant to make it very easy to build classifiers. There are many different kinds, and here I want to use a scheme called “J48” that produces decision trees.
Weka can read Comma Separated Values (.csv) format files by selecting the appropriate File Format in the Open file dialog.
I've created a small spreadsheet file (see the next image), saved it in .csv format, and loaded it into Weka.
The first row of the .csv file has the attribute names, separated by commas, which in this case are classe real and resultado modelo.
I've got the dataset opened in the Explorer.
If I go to the Classify panel, choose a classifier, open trees and click J48, I should just be able to run it (I have the dataset and the classifier). (see the next image)
Well, it doesn't allow me to press Start. (see the next image)
What do I need to do to fix this?
If you look back at the Preprocess tab, you will see that resultado modelo is probably being treated as a numeric attribute. J48 only works with a nominal class attribute. (Predictor attributes can be numeric, as commenter nekomatic noted.)
You can change this by using a filter in the Preprocess tab. Choose the unsupervised attribute filter NumericToNominal and this will convert all your variables (or a subset of them) from numeric to nominal. Then you should be able to run J48 just fine.
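The filter does the conversion inside Weka; if you'd rather fix the source data instead, here is a minimal Python sketch (file and column names are illustrative, borrowed from the question) that rewrites the numeric class values as text labels so Weka infers a nominal attribute when it loads the CSV:

```python
import csv

def relabel_class_column(src, dst, class_col="resultado modelo"):
    """Rewrite a numeric class column as text labels so Weka
    infers a nominal attribute when loading the CSV."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # e.g. "0" becomes "class_0", which Weka treats as nominal
            row[class_col] = "class_" + row[class_col]
            writer.writerow(row)
```

Either route works; the NumericToNominal filter is less intrusive since it leaves your file untouched.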
I've got a 3rd-party file coming in: UTF-8 encoded, 56 columns, a CSV export from MySQL. My intent is to load it into a SQL Server 2019 instance, into a table layout I do not have control over.
SQL Server Import Wizard will automatically do the code page conversions to latin 1 (and a couple of string-to-int conversions), but it will not handle the MySQL "\N" null convention, so I thought I'd try my hand at SSIS to see if I could get the data cleaned up on ingestion.
I got a number of components set up to do various filtering and transforming (like the "\N" stuff) and that was all working fine. Then I tried to save the data using an OLE DB destination, and the wheels kinda fell off the cart.
SSIS appears to drop all of the automatic conversions Import Wizard would do and force you to make the conversions explicit.
I add a Data Transformation Component into the flow and edit all 56 columns to be explicit about the various conversions. The trouble is that while it lets me edit the "Copy of" output columns' code pages, it will not save them, either in the Editor or the Advanced Editor.
I saw another article here saying "Use the Derived Column Transformation", but that seems to work on a column-by-column basis (so I'd have to add 56 of them).
It seems kinda crazy that SSIS is such a major step backwards in this regard from Import Wizard, bcp, or BULK INSERT.
Is there a way to handle the code page switch in SSIS using SSIS components? All the components I've seen recommended don't seem to be working, and all of the other articles say "make another table using different code pages or NVARCHAR and then copy one table to the other", which kinda defeats the purpose.
It took synthesizing a number of different posts on tangentially related issues, but I think I've finally gotten SSIS to do a lot of what Import Wizard and BULK INSERT gave for free.
It seems that reading a UTF-8 CSV file with SSIS and processing it all the way through to a table that's in 1252, without using NVARCHAR, involves the following:
1) Create a Flat File Source component and set the incoming encoding to 65001 (utf-8). In the Advanced editor, convert all string columns from DT_STR/65001 to DT_WSTR (essentially NVARCHAR). It's easier to work with those outputs the rest of the way through your workflow, and (most importantly) a Data Conversion transform component won't let you convert from 65001 to any other code page. But it will let you convert from DT_WSTR to DT_STR in a different code page.
1a) SSIS is pretty annoying about putting a default length of 50 on everything, and about not carrying through any lengths as defaults from one component/transform to the next. So you have to go through and set the appropriate lengths on all the "Column 0" input columns from the Flat File Source and all the WSTR transforms you create in that component.
1b) If your input file contains, as mine apparently does, invalid utf-8 encoding now and then, choose "RD_RedirectRow" as the Truncation error handling for every column. Then add a Flat File Destination to your workflow, and attach the red line coming out of your Flat File Source to it. That's if you want to see which rows were bad. You can just choose "RD_IgnoreError" if you don't care about bad input. But leaving the default means your whole package will blow up if it hits any bad data.
2) Create a Script transform component, and in that script you can check each column for the MySQL "\N" and change it to null.
3) Create a Data Conversion transformation component and add it to your workflow. Because of the DT_WSTR in step 1, you can now change that output back to a DT_STR in a different code page here. If you don't change to DT_WSTR from the get-go, the Data Conversion component will not let you change the code page at this step. 99% of the data I'm getting in just has latinate characters, utf-8 encoded (the accents). There is a smattering of kanji characters in a small subset of the data, so to reproduce what Import Wizard does for you, you must change the Truncation error handling on every column here that might be impacted to RD_IgnoreError. Contrary to some documentation I read, RD_IgnoreError does not put null in the column; it puts the text with the non-mapping characters replaced with "?", like we're all used to.
4) Add your OLE DB destination component and map all of the output columns from step 3 to the columns of your database.
So, a lot of work just to get back to where Import Wizard starts you, before you even get to the extra things SSIS can do for you. And SSIS can be kind of annoying about snapping column widths back to the default 50 when you change something; if you've got a lot of columns, this can get pretty tedious.
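As a side note, the two cleanup rules this flow implements (the "\N" convention from step 2 and the code-page narrowing from step 3) are easy to sanity-check outside SSIS. Here is a minimal Python sketch of the same logic, with Python's errors='replace' standing in for what RD_IgnoreError does to non-mapping characters:

```python
def clean_field(value):
    """Mimic the SSIS flow: MySQL's \\N export convention becomes a
    real null, and everything else is narrowed to code page 1252 with
    unmappable characters (e.g. kanji) replaced by '?'."""
    if value == r"\N":
        return None
    # Round-trip through cp1252: accented latinate characters survive,
    # anything outside the code page comes back as '?'
    return value.encode("cp1252", errors="replace").decode("cp1252")
```

This is only an illustration of the conversion semantics, not something the SSIS package itself runs.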
I have a solution that helps me create a wizard to fill in some data and turn it into JSON. The problem now is that I have to receive an xlsx file and turn specific data from it into JSON: not all the data, only the fields I want, which are documented in the last link.
In this link: https://stackblitz.com/edit/xlsx-to-json I can access the Excel data and turn it into an object (when I print document.getElementById('output').innerHTML = JSON.parse(dataString); it shows [object Object]).
I want to implement this solution and automatically get the fields specified in config.ts, but I can't get it to work. For now, I have these in my HTML and app-component.ts:
https://stackblitz.com/edit/angular-xbsxd9 (It's probably not compiling but it's to show the code only)
It wasn't quite clear what you were asking, but my assumption is that what you are trying to do is:
Given the data in the spreadsheet that is uploaded
Use a config that holds the list of column names you want returned in the JSON when the user clicks to download
based on this, I've created a fork of your sample here -> Forked Stackbliz
what I've done is:
use the map operator on the array returned from the sheet_to_json method
Within the map, the process loops through each key of the record (each key being a column in this case).
If a column in the row is defined in the propertymap file (config), then return it.
This approach strips out all columns you don't care about up front, so that by the time the user clicks to download the file, only the columns you want are returned. If you need to maintain the original columns, you can move this logic somewhere more convenient for you.
I also augmented the property map a little to give you more granular control over how to format the data in the returned JSON, i.e. so numbers aren't treated as strings in the final output. You can use this as a template if it suits your needs for any additional formatting.
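The property-map idea itself is language-agnostic; here is a minimal Python sketch of the same filtering-plus-coercion approach (the map contents are made up for illustration):

```python
# Hypothetical property map: column name -> type to coerce the value to,
# so numbers aren't carried through as strings
PROPERTY_MAP = {"Name": str, "Age": int, "Score": float}

def filter_row(row, prop_map=PROPERTY_MAP):
    """Keep only the configured columns of one sheet row,
    coercing each kept value to its configured type."""
    return {col: cast(row[col]) for col, cast in prop_map.items() if col in row}
```

Mapping this over every row returned by sheet_to_json gives you records containing only the configured columns.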
hope it helps.
I've got parent & child data which I am trying to convert into a flat file using Dell Boomi. The flat file structure is column-based and needs a structure where the lines data is on the same line of the file as the header data.
For instance, a header record which has 4 line items needs to generate a file with a structure of:
[header][line][line][line][line]
Currently what I have been able to generate is either
[header][line]
[header][line]
[header][line]
[header][line]
or
[header]
[line]
[line]
[line]
[line]
I think using the results of the second profile and then using a data processing shape to strip [\r][\n] might be my best option, but I wanted to check before implementing it.
I created a user-defined function for each field of data. For this example, let's just look at "FirstName."
I used a branch immediately. Branch 1 splits the documents, flow-controls them one at a time, enters the map, then stops. Branch 2 contains a message shape where I built the new file. I typed a static value with the header name, then used a dynamic process property as the parameter next to the header.
The user-defined function for "FirstName" accepts the first name as input, appends the dynamic process property (you need to define a dynamic property for each field), prepends your delimiter of choice, then sets the same dynamic process property.
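One reading of that accumulation step, sketched in Python with a plain dict standing in for Boomi's dynamic process properties (names and delimiter are illustrative):

```python
# Stand-in for Boomi's dynamic process properties
props = {}

def append_field(prop_name, value, delimiter=","):
    """Per document: take the running property value, append the
    delimiter plus this document's field value, and store it back.
    After all documents, the property holds the whole flattened run."""
    current = props.get(prop_name, "")
    props[prop_name] = current + delimiter + value
```

Because the delimiter is prepended to every value, the header text placed before the property in the message shape absorbs the leading delimiter of the first value.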
This was all done with NO scripting at all. I hope this helps. I can provide screenshots if you need more clarification.
I have three files: Conf.txt, Temp1.txt and Temp2.txt. I have used a regex to fetch some values from the config.txt file. I want to place those values (which appear under the same names in Temp1.txt and Temp2.txt) and create another two files, say Temp1_new.txt and Temp2_new.txt.
For example: in config.txt I have a value, say IP1, and the same name appears in Temp1.txt and Temp2.txt. I want to create files Temp1_new.txt and Temp2_new.txt, replacing IP1 with, say, 192.X.X.X in Temp1.txt and Temp2.txt.
I'd appreciate it if someone could help me with Tcl code to do the same.
Judging from the information provided, there are basically two ways to do what you want:
File-semantics-aware;
Brute-force.
The first way is to read the source file, parse it to produce a structured in-memory representation of its content, then serialize this content to the new file after replacing the relevant value(s) in the produced representation.
The brute-force method means treating the contents of the source file as plain text (or a series of text strings) and running something like regsub or string replace on this text to produce the new text, which you then save to the new file.
The first way should generally be favoured, especially for complex cases, as it removes any chance of replacing irrelevant bits of text. The brute-force way may be simpler to code (if there's no handy library, see below) and is therefore good for throw-away scripts.
Note that for certain file formats there are ready-made libraries which can be used to automate what you need. For instance, the XSLT facilities of the tdom package can be used to manipulate XML files, INI-style files can be modified using the appropriate library, and so on.
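To illustrate the brute-force way, here is a minimal sketch (in Python; the Tcl equivalent would use regsub or string map in the same read-substitute-write loop). All file and placeholder names are made up for the example:

```python
def render_template(template_path, out_path, values):
    """Brute-force substitution: read the template as plain text and
    replace each placeholder name (e.g. IP1) with its configured value,
    then write the result to a new file."""
    with open(template_path) as f:
        text = f.read()
    for name, value in values.items():
        text = text.replace(name, value)
    with open(out_path, "w") as f:
        f.write(text)
```

Running this once per template (Temp1.txt and Temp2.txt) with the values regexed out of the config file produces the two new files. Note the caveat from above: plain substitution will also hit any irrelevant text that happens to contain a placeholder name.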
Machine learning algorithms in OpenCV appear to use data read in CSV format. See for example this cpp file. The data is read into an OpenCV machine learning class CvMLData using the following code:
CvMLData data;
data.read_csv( filename );
However, there does not appear to be any readily available documentation on the required format for the csv file. Does anyone know how the csv file should be arranged?
Other (non-OpenCV) programs tend to have a line per training example, and begin with an integer or string indicating the class label.
If I read the source for that class, particularly the str_to_flt_elem function, and the class documentation I conclude that valid formats for individual items in the file are:
Anything that can be parsed to a double by strtod
A question mark (?) or the empty string to represent missing values
Any string that doesn't parse to a double.
Items 1 and 2 are only valid for features; anything matched by item 3 is assumed to be a class label, and as far as I can deduce the order of the items doesn't matter. The read_csv function automatically assigns each column in the csv file the correct type, and (if you want) you can override the labels with set_response_index. Delimiter-wise, you can use the default (,) or set it to whatever you like before calling read_csv with set_delimiter (as long as you don't use the decimal point).
So this should work for example, for 6 datapoints in 3 classes with 3 features per point:
A,1.2,3.2e-2,+4.1
A,3.2,?,3.1
B,4.2,,+0.2
B,4.3,2.0e3,.1
C,2.3,-2.1e+3,-.1
C,9.3,-9e2,10.4
You can move your text label to any column you want, or even have multiple text labels.
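The three parsing rules above can be summarized in a small sketch (Python here, with float() standing in for strtod; this mirrors my reading of str_to_flt_elem rather than the exact OpenCV implementation):

```python
def token_kind(token):
    """Classify one CSV item per the rules above: '?' or the empty
    string means a missing value, anything strtod-parseable is a
    numeric feature, and any other string is taken as a class label."""
    if token in ("?", ""):
        return "missing"
    try:
        float(token)  # accepts 1.2, 3.2e-2, +4.1, .1, -9e2, etc.
        return "feature"
    except ValueError:
        return "label"
```

Applying this to each item of the sample rows above reproduces the layout: the letters become labels, the "?" and the empty field become missing values, and everything else becomes features.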