Converting CSV to ARFF

I am working on a school project for data mining, where we were given CSV data from Kaggle. This is how the data looks (2 lines out of 6970):
4,1970,Female,150,DomesticPartnersKids,Bachelor's Degree,Democrat,,Yes,No,No,No,Yes,Public,No,Yes,No,Yes,No,No,Yes,Science,Study first,Yes,Yes,No,No,Receiving,No,No,Pragmatist,No,No,Cool headed,Standard hours,No,Happy,Yes,Yes,Yes,No,A.M.,No,End,Yes,No,Me,Yes,Yes,No,Yes,No,Mysterious,No,No,,,,,,,,,,Mac,Yes,Cautious,No,Umm...,No,Space,Yes,In-person,No,Yes,Yes,No,Yay people!,Yes,Yes,Yes,Yes,Yes,No,Yes,,,,,,,,,,,,,,,,,No,No,No,Only-child,Yes,No,No
5,1997,Male,75,Single,High School Diploma,Republican,,Yes,Yes,No,,Yes,Private,No,No,No,Yes,No,No,Yes,Science,Study first,,Yes,No,Yes,Receiving,No,Yes,Pragmatist,No,Yes,Cool headed,Odd hours,No,Right,Yes,No,No,Yes,A.M.,Yes,Start,Yes,Yes,Circumstances,No,Yes,No,Yes,Yes,Mysterious,No,No,Tunes,Technology,Yes,Yes,Yes,Yes,No,Supportive,No,PC,No,Cautious,No,Umm...,No,Space,No,In-person,No,No,Yes,Yes,Grrr people,Yes,No,No,No,No,No,No,Yes,No,No,Yes,No,Own,Pessimist,Mom,No,No,No,No,Nope,Yes,No,No,No,Yes,No,Yes,No,Yes,No
and we have to get this into .arff format for use in Weka. I manually typed the header (107 attributes):
@ATTRIBUTE user_id NUMERIC
@ATTRIBUTE yob NUMERIC
@ATTRIBUTE gender {Male,Female}
@ATTRIBUTE income {150,100,75,50,25,10}
@ATTRIBUTE householdstatus {MarriedKids,Married,DomesticPartnersKids,DomesticPartners,Single,SingleKids}
@ATTRIBUTE educationlevel {Bachelor's Degree,High School Diploma,Current K-12,Current Undergraduate,Master's Degree,Associate's Degree,Doctoral Degree}
@ATTRIBUTE party {Democrat,Republican}
@ATTRIBUTE Q124742 {Yes,No}
@ATTRIBUTE Q124122 {Yes,No}
and I get this error:
} expected at end of enumeration, read Token[EOL]
Then I tried to use the Weka converter, but it gave me an error:
Wrong number of values. Read 2, expected 1, read Token[EOL], line 4. Problem encountered at line: 3

Here's what I did:
From Kaggle, I downloaded train.csv (5568 instances, highest ID number 6960).
I didn't use the converter -- just loaded it into the Weka Explorer as a CSV file. Some problems and their solutions:
Line 3: the first instance of "Bachelor's Degree". Weka did NOT like that single quote ("line 3, read 7, expected 108"). I got rid of all single quotes (using a global replace in a text editor). Then I tried to load it into Weka again.
The file doesn't have a CR (the Enter key on the keyboard) at the end of the last line, which caused an error ("null on line 5569"). I added one, again in a text editor. Then I loaded it into Weka, and took a look at the variables.
YOB (Year of Birth) is missing for about 300 instances, with "NA" filled in, so the column didn't evaluate as either string or numeric. I edited these to be empty cells instead. Then I loaded it into Weka again.
And, of course, moved Party to be the class variable (at the end). I did this in Weka.
Saved this as train.arff
Loaded it back in, and it seems to work OK. I got 51% accuracy with a OneR classifier, but you wouldn't expect OneR to work well here. I'm sure you can do better.
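If you'd rather script this cleanup than repeat it by hand, here is a minimal Python sketch of the same steps. The file name and the position of the Party column (seventh field, as in the sample rows above) are assumptions; adjust them to your data.
# Sketch: strip single quotes, blank out "NA" cells, move Party to the
# end, and write a clean CSV (with a proper final newline) for Weka.
import csv

with open("train.csv", newline="") as f:
    rows = [["" if v == "NA" else v.replace("'", "") for v in row]
            for row in csv.reader(f)]

for row in rows:
    row.append(row.pop(6))  # Party is the 7th column in the sample rows

with open("train_clean.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # the writer terminates the last line too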
Note I didn't do any manual typing of headers. That must have taken a while!
Good luck!

Related

MS Access - CSV TransferText Import Spec, non-ASCII delimiter?

I receive a CSV file from a 3rd party that I need to import into Access. They claim they are unable to add any sort of text qualifier; all my common delimiter options (comma, tab, pipe, $, ~, ^, etc.) seem to appear in the data, so none are reliable for an Import Spec. I cannot edit the data, but we can adjust the delimiter. Record counts are in the 500K range x 50 columns (250MB).
I tried a non-ASCII character as a delimiter (i.e., ÿ). I can add it to an Import Spec, and the sample data appears to delimit OK, but I get an error (Subscript out of range) when attempting the actual import. I also tried a multi-character delimiter, but no go.
Any suggestions to permit me to receive these CSV tables? It's a daily task, with many low-skilled users, remote locations, and the import function behind a button.
Sample raw data, truncated for width (June 7; not sure if this helps the discussion):
9798ÿ9798ÿ451219417ÿ9033504ÿ9033504ÿPUNCH BIOPSY 4MM UNI-PUNCH SS SEAMLS RAZOR SHARP BLADE...
9798ÿ9798ÿ451219418ÿ1673BXÿ1673BXÿCLEANER INST 1GL KLENZYME LATEXÿSTERIS PLCÿ1673BXÿ1673BX...
9798ÿ9798ÿ451219419ÿA4823PRÿA4823PRÿBAG BIOHAZ THK1.3 MIL 24X23IN RED LDPE PRINT INF WASTE...
9798ÿ9798ÿ451219420ÿCUR9225ÿCUR9225ÿGLOVE EXAM CURAD MEDIUM LATEX FREEÿMEDLINE INDUSTRIES,...
9798ÿ9798ÿ451219421ÿCUR9226ÿCUR9226ÿGLOVE EXAM CURAD LARGE LATEX FREEÿMEDLINE INDUSTRIES, ...
9798ÿ9798ÿ451219422ÿ90176101ÿ90176101ÿDRAPE CONSUMABLE PK EQUIP OEC UROVIEW 2800 STERILE L...
Try another extended-ASCII character (128-254). The chosen delimiter ÿ (255) apparently doesn't work, but it was already a suspicious character, since it has all bits set and sometimes has special meaning for that reason.
It's also good to consider the code page. If you're in the US using the standard English version of Windows, it's likely that Access is using the default "Western European (Windows)" (Windows-1252) code page. But if you're outside the US or have other languages installed, the default code page may treat certain characters differently. For reference, I'm using Access 2013 on Windows 10. In the Access text import wizard, clicking the [Advanced...] button shows more options, including the selection of the import code page. Since you're having problems with the import, it is worth inspecting that setting.
For the record, I had similar results as you and others using the sample data and delimiter ÿ (255).
Next I tried À (192) which is a standard letter character in various code pages, so it should likely work even if the default were not Windows-1252. Indeed, it worked on my system and resulted in no errors.
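If you want to check in advance which bytes are safe for your own data, a quick scan works; here is a minimal Python sketch (the file name is an assumption).
# Sketch: find extended-ASCII bytes (128-254) that never occur in the
# file; any byte left over is a candidate single-character delimiter.
candidates = set(range(128, 255))

with open("export.csv", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        candidates -= set(chunk)

print(sorted(candidates))  # code points unused anywhere in the data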
To get the import working without errors at first, I would set all fields to Short Text and Long Text before specifying integer, date, or other non-text types. If all text columns work, then try specific field types. This way, you can at least differentiate between delimiter errors and other data errors.
This isn't to discourage other options like fixed-width text, especially since in that case you won't have to worry about the delimiter at all.

How can I read in a TXT file in Access that is over 255 char/line and contains control char?

I am running Access 2010. I need to read a TXT file into a string. Each line can be anywhere from 40 to 320 characters long, ending in a CR. The biggest problem is that the TXT file contains commas (,) and quotation marks (") as part of the data.
Is there a trick to doing this? Even if it means reading each character and testing whether it is a CR...
To accomplish this task, you will need to write your own import code that reads directly from the file. The Microsoft Access import features will not handle a file like this very well, and since you want to analyze each line in code, it is better to handle reading it yourself.
There are many approaches you can take, and all will involve file handles and opening the file. But the best approach is to use a class that does all of the dirty work for you.
One such class is the LargeTextFile class that can be found in any of the Microsoft Access Developer's Handbooks (Volume 1) for Access 97, 2000, 2002 or 2003, written by Getz, Litwin, and Gilbert (Sybex), if you have access to one of them.
Another option would be the clsReadTextFile class, available for free on the Access MVP Site (The Access Web) site:
http://www.theaccessweb.com/downloads/clsReadTextFile.txt
Using clsReadTextFile you can process your file, line by line using code similar to this:
Dim file As New clsReadTextFile
Dim line As String

file.FileName = "C:\MyFile.txt"
file.cfOpenFile
Do While Not file.EndOfFile
    file.csGetALine          ' read the next line into the class
    line = file.Text
    If InStr(line, "MySearchText") Then
        'Do something
    End If
Loop
file.cfCloseFile
The line string variable will contain the text of the line just read, and you can write code to parse it how you need and process it appropriately. Then the loop will go on to read the next line. This will allow you to process each line of the file manually in your code.
It is not clear from your post whether you can, or have tried, to use the tools available in the product for this task. Access 2010 offers linking to a .txt file as well as appending a .txt file to a table. These are standard features on the External Data tab of the ribbon.
The Long Text (formerly Memo) field type allows ~64K characters. Not sure if you wish to attempt to bring all the txt data into a single field; if so, this limit is important.
If the CRs of the text document imply a new record/row of data (rather than one continuous string for the entire document), AND if there is any consistent structure within all rows of data, then the import wizard can use either character count or symbols (i.e., commas, if they exist) as the means to separate each individual row of data into separate fields in a single row of a table.

"Badly formed hexadecimal UUID string" error in Django fixture; JSON UUID conversion fails

File "/home/malikarumi/Projects/cannon/local/lib/python2.7/site-packages/django/db/models/fields/__init__.py", line 2390, in get_db_prep_value
value = uuid.UUID(value)
File "/usr/lib/python2.7/uuid.py", line 134, in __init__
raise ValueError('badly formed hexadecimal UUID string')
ValueError: Problem installing fixture '/home/malikarumi/Projects/cannon/jamf/essell/fixtures/test22byhand.json': badly formed hexadecimal UUID string
I've found the following links so far:
https://github.com/dcramer/django-uuidfield/issues/40
https://github.com/dcramer/django-uuidfield/commit/caae1bc4e45445a06dd11bb22da6a9f07395f78a
Django UUIDField modelfield causes error in Django admin: badly formed hexadecimal UUID string
Django Primary Key: badly formed hexadecimal UUID string
I counted my UUIDField value. It is len=36 because it has dashes in it, at least in the string representation I can see. So I replaced it with the same alphanumeric without dashes, as suggested as a test by the bugfix, but I still got the same result.
I checked the model, but there is no max length on any UUID field, nor on the FK link back to the UUID. There's nothing on the FK to suggest it is, or should be, limited to chars, ints, UUIDs, etc.
Then I found this: http://arthurpemberton.com/2015/04/fixing-uuid-is-not-json-serializable, which I hacked into /python2.7/site-packages/django/core/serializers/python.py. (The blogger had put it into models.py.) But I got the same error, before realizing it was NOT coming from serializers/python.py, as it was yesterday, but from /usr/lib/python2.7/uuid.py, line 134, in __init__. The relevant portion of that code is:
if hex is not None:
    hex = hex.replace('urn:', '').replace('uuid:', '')
    hex = hex.strip('{}').replace('-', '')
    if len(hex) != 32:
        raise ValueError('badly formed hexadecimal UUID string')
    int = long(hex, 16)
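Note what that code actually does: braces, dashes, and any urn:/uuid: prefix are stripped before the length check, so a standard 36-character dashed string reduces to 32 hex digits and passes. For example:
# The dashed 36-char form passes the length check once dashes are removed.
s = 'a82857b6-e336-4c6c-8499-47601770b39d'
stripped = s.strip('{}').replace('-', '')
print(len(s), len(stripped))  # 36 32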
Rather than try to hack more core code, given the indication that the problem is JSON, not Python, I left this alone for now.
Finally, I looked at this:
https://code.djangoproject.com/ticket/24012
It is stated a couple of times there that Django's "UUIDField generates UUIDs in Python". Now here is some history. I created one row, a single instance of Model A, in Django with a fixture that had no UUID and no datefield, and had no issues. (The UUIDField is on an abstract model, so it is created when the object is created.) I did that because I needed the UUID of that Model A instance for an FK field in Model B, which is the one I am struggling with now. I did that by copy-pasting the Model A UUID into the FK field on Model B in a CSV file, which I then converted to JSON in order to use it as a fixture.
Is it possible that the uuid ran into problems in this copy paste maneuver, before the conversion to json?
If not, that means even though it was an acceptable Python object when it was created, going thru the json conversion messed it up, correct?
If that's the case, what is a workaround?
Can the Arthur Pemberton code be made to work somewhere else in this process?
If I leave the UUID off, I can probably make this work, but then I have to go back and put all the FK UUIDs in manually. Is there a better solution? Maybe a bulk insert of that field alone?
This may be a recurring issue for me, because I am also using Scrapy, which supports but does not require json. None of my scraped items will come with uuid, but how do I automate adding their fk's into my process in order to get them into Django?
Or is all of this a good reason to forget uuids altogether?
Thanks.
EDIT/UPDATE per @rolf:
Since I just discovered that the Django shell differs more than I realized (the shell can find settings, the regular interpreter can't), I decided to run this once in each one, but the results were the same.
(cannon)malikarumi@Tetuoan2:~/Projects/cannon/jamf$ python manage.py shell
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
IPython 4.0.3 -- An enhanced Interactive Python.
In [1]: uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
File "<ipython-input-1-e282858da374>", line 1
uuid.UUID(a82857b6-e336-4c6c-8499-47601770b39d)
^
SyntaxError: invalid syntax
In [2]: uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
File "<ipython-input-2-befebf1573ba>", line 1
uuid.UUID(a0a69415-6627-43db-8c7a-b57d0c4cefe2)
^
SyntaxError: invalid syntax
In [3]: uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
File "<ipython-input-3-a59ea095e61a>", line 1
uuid.UUID(e6e11b06-ea3b-4e98-a31f-9a83447ad884)
^
SyntaxError: invalid syntax
In [4]: uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
File "<ipython-input-4-c4a04434aa3c>", line 1
uuid.UUID(bd116432-65d7-4612-abfe-9a99dcaf5cad)
^
SyntaxError: invalid syntax
Now that I have posted this, I notice that even Stack Overflow treats these UUIDs differently, i.e., the way they are colored, if that's relevant and meaningful here.
But now that we know this, what do we do with / about it?
2nd Update
This morning I thought, what about a uuid that had never been anywhere but in Django? So here's what I did:
In [5]: e.uuid
Out[5]: UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
In [6]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-6-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
In [7]: uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-3b4d3e5bd156> in <module>()
----> 1 uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
NameError: name 'uuid' is not defined
This is apparently because I left the quotes around the alphanumeric, but why that would generate a 'uuid is not defined' error, instead of a 'string type' or some such error, is beyond me.
In [8]: uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
File "<ipython-input-8-56137f5f4eb6>", line 1
uuid.UUID(61877565-5fe5-4175-9f2b-d24704df0b74)
^
SyntaxError: invalid syntax
The first time, I keyed in the characters by hand. I decided to repeat the test by copying and pasting, but as you can see, it made no difference. If there were something weird about the way only the 5 that the caret is pointing to was generated, we might be on to something; but if so, why do I get the same error in the same place when I typed it in by hand myself?
This no longer seems like a json issue to me, since – as far as I know – json has never touched this uuid, unless it did somehow in the internal workings of Django.
Instead, there is either
1. something wrong with the way uuid.UUID generates uuids, or
2. the way it generates them on my system, (Ubuntu 15.10, Django 1.9.1, Python 2.7.10) or
3. the way it reads and evaluates them when they come back, like in uuid.UUID() or being input outside the internal, automatic uuid generation process.
But that also means people using uuid.UUID() to generate uuids will never know there is an issue unless they do what I did, which is try to bring them in from outside. I remember reading somewhere that all uuids are supposed to be compatible. So, unless someone here has a better insight, I think we might be up for a bug report. But is it a Python bug, a Django bug, or both?
Your syntax is wrong:
uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74') # note the quotes
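Without the quotes, Python tries to parse the dashed value as an expression (5fe5 is not a valid token), hence the SyntaxError; the NameError in In [7] just means the uuid module had not been imported in that shell session. A minimal sketch of correct usage:
import uuid

# Pass the UUID as a string literal; dashed and dashless forms both work.
u = uuid.UUID('61877565-5fe5-4175-9f2b-d24704df0b74')
print(u)                      # dashed canonical form
assert uuid.UUID(u.hex) == u  # u.hex is the same value without dashes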

Test and Training Set are Not Compatible

I have seen various articles about the same issue and tried a lot of solutions, but nothing is working. Kindly advise.
I am getting an error in WEKA:
"Problem Evaluating Classifier: Test and Training Set are Not
Compatible".
I am using J48 as my algorithm.
These are my datasets:
Train set:
https://www.dropbox.com/s/fm0n1vkwc4yj8yn/train.csv
Eval set:
https://www.dropbox.com/s/2j9jgxnoxr8xjdx/Eval.csv
(I am unable to copy and paste due to long code)
I have tried "Batch Filtering" in WEKA (for the training set) but it still does not work.
EDIT: I have even converted my .csv to .arff, but still the same issue.
EDIT2: I have made sure the headers in both CSVs match. Even then, same issue. Please help!
Please advise.
A common error when converting ".csv" files to ".arff" with Weka is that values for nominal attributes appear in a different order, or not at all, from dataset to dataset.
Your evaluation ".arff" file probably looks like this (skipping irrelevant data):
@relation Eval
@attribute a321 {TRUE}
Your train ".arff" file probably looks like this (skipping irrelevant data):
@relation train
@attribute a321 {FALSE}
However, both should contain all possible values for that attribute, and in the same order:
@attribute a321 {TRUE, FALSE}
You can remedy this by post-processing your ".arff" files in a text editor and changing the header so that your nominal values appear in the same order (and quantity) from file to file.
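If there are many nominal attributes, you can script the merge instead of editing by hand. A rough Python sketch, assuming simple one-line "@attribute name {v1,v2}" declarations and the file names used above:
# Sketch: print nominal declarations whose value lists are the union of
# both files, so the train and eval headers can be made identical.
import re

def nominal_values(path):
    vals = {}
    with open(path) as f:
        for line in f:
            m = re.match(r'@attribute\s+(\S+)\s*\{(.*)\}', line, re.I)
            if m:
                vals[m.group(1)] = [v.strip() for v in m.group(2).split(',')]
    return vals

train = nominal_values('train.arff')
eval_ = nominal_values('Eval.arff')
for name, values in train.items():
    merged = list(dict.fromkeys(values + eval_.get(name, [])))
    print('@attribute %s {%s}' % (name, ','.join(merged)))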
How do I divide a dataset into training and test set?
You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer just do the following:
training set:
Load the full dataset
select the RemovePercentage filter in the preprocess panel
set the correct percentage for the split
apply the filter
save the generated data as a new file
test set:
Load the full dataset (or just use undo to revert the changes to the dataset)
select the RemovePercentage filter if not yet selected
set the invertSelection property to true
apply the filter
save the generated data as a new file
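If you'd rather do the split outside Weka, the equivalent is easy to script. A minimal Python sketch (assumes a CSV with a header row; note that RemovePercentage keeps the original instance order, whereas this sketch shuffles first):
# Sketch: 66/34 train/test split of a CSV dataset.
import csv, random

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header, rows = next(reader), list(reader)

random.seed(1)  # make the split reproducible
random.shuffle(rows)
cut = int(len(rows) * 0.66)

for name, part in (('train.csv', rows[:cut]), ('test.csv', rows[cut:])):
    with open(name, 'w', newline='') as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(part)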

Why is SSIS complaining that "There is a partial row at the end of the file"?

I'm importing a flat file into a database using a Data Flow Task in SSIS. The file is very simple: it contains three comma-separated values per row. Whenever I run this task, however, I receive a warning from the Flat File component:
Warning: 0x8020200F: There is a partial row at the end of the file.
This warning seems to happen regardless of the size of the file: even with only a handful of rows in the file, visually validated (with extended characters and whatnot visible), I still receive it. Moreover, it doesn't seem to matter whether I have a blank row at the end of the file or I just end it without a trailing CR+LF.
How can I get rid of this warning so I can run my package with WarnAsError enabled?
(BTW, it seems someone else may have had a similar problem in There is a partial row at the end of the file, though it wasn't much of a question.)
I have found three things to try if you encounter this problem. In at least two out of the three cases, SSIS was ignoring rows of my input file with only the above warning to show for it. Because of that, I do not recommend ignoring this warning!
Step 1: verify that your flat file is valid
This error will appear when you have an invalid input file. This can be especially hard to detect if your input file has millions of lines, as mine do, but it's vital that you discover file format violations because SSIS will happily give you this warning and continue on its way without importing the offending lines or, in some cases, the lines after the offending lines. The easiest way I found to discover a problem with the source file is to check the number of rows that are being imported successfully. If it's vastly different than the number you expect in your flat file, something may have gone wrong in the middle somewhere.
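One quick way to get the expected count is to count the lines in the source file directly; a minimal Python sketch (the file name is an assumption):
# Sketch: count newline-terminated lines in a large flat file.
with open('input.csv', 'rb') as f:
    lines = sum(chunk.count(b'\n')
                for chunk in iter(lambda: f.read(1 << 20), b''))
print(lines)  # compare against the row count SSIS reports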
Step 2: try a dummy line at the end (fixed-width only)
If you are using a fixed-width format input file, Microsoft may have a helpful KB article for you. Basically, they suggest that you add a dummy line at the end of the file.
I am not using fixed-width files, so I can't say how useful this technique is.
Step 3: turn off text qualification for non-text
This is the tricky one, because I believe the TextQualified property is True by default. If your input file uses non-text fields (integers, etc.), then you must tell SSIS that it should not expect those columns to be qualified as text. Essentially, your input file will be invalid in spite of looking perfectly valid.
TextQualified is a property of the columns in your Flat File Connection Manager.
To change it, open up your connection manager, click "Advanced", and then click on a non-text column. Make sure the TextQualified property is set to False. You will need to do this for all of your non-text columns.
If the byte width of a line in the file is known, you can always double check that the total byte size of the file can be divided by the expected line size to give you a nice round line count number (as opposed to a decimal).
It helps also to know from your source just how many records are expected, but if you don't have this, you can at least double-check the loaded table's record count against the line count calculated while loading the file.
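For example, a quick sanity check along those lines, as a Python sketch (the file name and record width are placeholders):
# Sketch: the total file size should be an exact multiple of the fixed
# line width (including CR+LF); a remainder indicates a partial row.
import os

LINE_BYTES = 80 + 2  # hypothetical record width plus CR+LF
size = os.path.getsize('feed.txt')
rows, remainder = divmod(size, LINE_BYTES)
print(rows, 'rows;', 'partial row at end!' if remainder else 'clean')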
I've seen this error often when a source flat text file is missing its last \r\n at the end of the file.
Running on 64-bit Windows 7 was fine and led to no missing rows, but I lost the last row when running on Windows 2008.
My workaround is:
1. Open the SSIS package in BIDS on the Windows 2008 machine.
2. Open the file connection manager and make sure Text Qualifier is set to "none".
3. Rebuild it.
All works fine in both Windows 7 and Windows 2008.