Weka and CSV files

I'm currently trying to import some data into Weka. The data is in a CSV file and consists of a numerical ID followed by some string data (tweets). I'm getting the error "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using quotes as my enclosure characters for the string data. I understand that something (presumably an EOL character?) is causing Weka to incorrectly separate some of the string data into multiple entries on the same line, but I'm not sure how to fix the EOL token problem.
My data set can be viewed here. The current data set is on Sheet 2:
https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing
The text file itself may be found here:
https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing
The error now occurs on the 3rd line, with the same message. The only newline character there is the one at the end of the line denoting a new entry, so I'm not sure why it's having issues.

In its datasets, Weka treats a newline character as the end of an instance. Your line 17 is actually a multi-line tweet, which confuses Weka. You can either
use a regex to strip the newline characters from every single tweet (a sketch follows below), or
clean the tweets during download to remove any newline characters in them.
Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
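For the first option, a minimal sketch in Python (file names are placeholders; it assumes the file is otherwise valid two-column CSV with properly quoted tweet fields):

import csv

# Read the raw file, replace any newline inside a field with a space, and
# write the cleaned rows back out. Python's csv module can handle the
# quoted multi-line fields that Weka cannot.
with open("tweets_raw.csv", newline="", encoding="utf-8") as src, \
        open("tweets_clean.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([f.replace("\r", " ").replace("\n", " ") for f in row])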
EDIT
Okay, here are some other things that need to be fixed (according to your edits to the question):
Replace ' with \'
Replace ` with \`
Many tweets contain quotes inside quotes. The inner double quotes (") should be replaced by \"
If you put your tweets inside double quotes, then your header should be id, "text"
Some tweets contain two consecutive double quotes, get rid of them or replace them with \".
I cannot say exactly where because I lost track, but I think some tweets still contain newlines (or at least one tweet does)
These are just a few things that I noticed. There might be more. Time will tell.
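A minimal sketch of those replacements in Python, applied to the tweet text itself (the part between the enclosing quotes); the order matters so that a doubled "" ends up as \" rather than \"\":

def clean_tweet(text: str) -> str:
    text = text.replace("\r", " ").replace("\n", " ")  # strip embedded newlines
    text = text.replace('""', '"')   # collapse doubled quotes first...
    text = text.replace('"', '\\"')  # ...then escape the remaining ones
    text = text.replace("'", "\\'")
    text = text.replace("`", "\\`")
    return text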

Related

How do I convince Splunk that a backslash inside a CSV field is not an escape character?

I have the following row in a CSV file that I am ingesting into a Splunk index:
"field1","field2","field3\","field4"
Excel and the default Python CSV reader both correctly parse that as 4 separate fields. Splunk does not. It seems to be treating the backslash as an escape character and interpreting field3","field4 as a single mangled field. It is my understanding that the standard escape character for double quotes inside a quoted CSV field is another double quote, according to RFC-4180:
"If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
Why is Splunk treating the backslash as an escape character, and is there any way to change that configuration via props.conf or any other way? I have set:
INDEXED_EXTRACTIONS = csv
KV_MODE = none
for this sourcetype in props.conf, and it is working fine for rows without backslashes in them.
UPDATE: So Splunk's CSV parsing is indeed not RFC-4180 compliant, and there's no real workaround that I could find. In the end I changed the upstream data pipeline to output JSON instead of CSV for ingestion by Splunk. Now it works fine. Let this be a cautionary tale if anyone stumbles across this question while trying to parse CSVs in Splunk!
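For anyone taking the same route, the pipeline change amounts to something like this (file names are placeholders):

import csv, json

# Re-emit each CSV row as one JSON object per line; Splunk then never has
# to apply its non-compliant CSV parsing.
with open("events.csv", newline="") as src, open("events.json", "w") as dst:
    for record in csv.DictReader(src):
        dst.write(json.dumps(record) + "\n")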

concern while importing/linking csv to access database

I have a CSV file delimited by commas, and a few of the data columns in the same file contain commas themselves.
As a result, when linking or importing the file, data gets jumbled into the next column.
I have tried all possible means, like skipping columns, etc., but am not getting any fruitful results.
Please let me know if this can be handled through a VBA function in MS Access.
If the CSV file contains text fields that contain commas and are not surrounded by a text qualifier (usually ") then the file is malformed and cannot be parsed in a bulletproof way. That is,
1,Hello world!,1.414
2,"Goodbye, cruel world!",3.142
can be reliably parsed, but
1,Hello world!,1.414
2,Goodbye, cruel world!,3.142
cannot. However, if you have additional information about the file, e.g., that it should contain three columns
a Long Integer column,
a Short Text column, and
a Double column
then your VBA code could read the file line by line and split each line on commas into an array. The first array element would be the Long Integer, the last array element would be the Double value, and the remaining elements in between could be re-joined with commas to reconstruct the text value.
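Sketched in Python for illustration (the same logic translates directly to VBA):

def parse_line(line):
    # Assumes exactly three logical columns -- Long, Text, Double -- where
    # only the middle Text column may contain commas.
    parts = line.rstrip("\r\n").split(",")
    return int(parts[0]), ",".join(parts[1:-1]), float(parts[-1])

print(parse_line("2,Goodbye, cruel world!,3.142"))
# (2, 'Goodbye, cruel world!', 3.142)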
As you can imagine, that approach could easily be confounded (e.g., if there was more than one text field that might contain commas). Therefore it is not particularly appealing.
(Also worth noting is that the CSV parser in Access has never been able to properly handle text fields that contain line breaks, but at least we can import those CSV files into Excel and then import into Access from the Excel file.)
TL;DR - If the CSV file contains unqualified text containing commas then the system that produced it is broken and should be fixed.

H2 DB CSVREAD does not translate line breaks

I would like to use H2 database embedded in my application to create an in-memory database, populated by a CSV file. So I have used the CSVREAD function.
All is working well except one pesky problem, which is that it doesn't seem to recognize line breaks. It translates \n literally as the two characters \ and n.
The docs say that the default escape character is the quotation mark, but if I try to use the quotation mark to escape anything other than another quotation mark, it simply ends the record there.
Is it possible to put text with line breaks into a CSV file and have H2's CSVREAD interpret it correctly somehow? Thanks!
After much experimentation, I found that putting literal line breaks in the middle of the text is interpreted correctly.
e.g., if you want a record with multiple lines, try this in your CSV:
"column 1", "column 2", "column 3 multi-line
still part of column 3
yet still part of column 3"
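If you generate the file programmatically, any CSV writer that quotes multi-line fields will produce an equivalent layout; for example, with Python's csv module (file name is a placeholder):

import csv

# The "\n" escapes below become real newlines in the output file, inside a
# quoted field -- not the two characters \ and n.
with open("multiline.csv", "w", newline="") as f:
    csv.writer(f).writerow(["column 1", "column 2",
                            "column 3 multi-line\nstill part of column 3"
                            "\nyet still part of column 3"])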

Remove Speech Marks from file where Text Qualifier is "

Introduction
We have a pretty standard way of importing .txt and .csv files into our data warehouse using SSIS.
Our txt/csv files are produced with speech marks as text qualifiers. So a typical file may look like the below:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used - particularly in free text fields such as NAME, where people have nicknames/aliases. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks? i.e. the speech marks surrounding "STAN", so that column would be treated as MOORE, STANLEY STAN.
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.
I have a few suggestions:
I know Excel has a setting that says something like "Treat Consecutive Delimiters as one."
Change your delimiter to something else, like a pipe (|). You can distinguish qualifier quotes from quote marks that are meant to be part of the resulting value, because a qualifier quote either immediately precedes or immediately follows a comma. A quote character anywhere else is not a qualifier.
If you do not need to pass the data through any T-SQL, you might want to replace non-qualifier quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead.
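A minimal sketch of that boundary rule in Python, applied line by line before the file reaches SSIS (it assumes a comma delimiter and that qualifier quotes sit only at field boundaries):

def clean_stray_quotes(line):
    out = []
    for i, ch in enumerate(line):
        if ch != '"':
            out.append(ch)
            continue
        prev_c = line[i - 1] if i > 0 else ","              # treat start and
        next_c = line[i + 1] if i + 1 < len(line) else ","  # end as boundaries
        # A qualifier quote touches a comma; anything else is data.
        out.append(ch if prev_c == "," or next_c == "," else "'")
    return "".join(out)

row = '"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"'
print(clean_stray_quotes(row))
# ...,"ST","MOORE,STANLEY 'STAN'","12/04/19"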
Hope this helps,
Joey

csv parsing, quotes as characters

I have a csv file that contains fields such as this:
""#33CCFF"
I would imagine that should be the text value:
"#33CCFF
But both Excel and OpenOffice Calc will display:
#33CCFF"
What rule of csv am I missing?
When Excel parses the value, it does not first remove the outer quotes and then read what's between them. Even if it did, what would it do with the remaining " in front of #? It cannot show the value you expect, "#33CCFF, because for it to appear like that, the inner double quote would have to be escaped by doubling it. (That is the CSV rule you are missing.)
Excel reads the value from left to right, interprets "" as a single ", reads on, finds an unexpected double quote at the end, 'panics', and simply displays it.
The person/application creating the CSV file made the mistake of adding encapsulation without escaping the encapsulation character within the values, so your data is malformed. With encapsulation, the value should be """#33CCFF"; without encapsulation, it should be "#33CCFF.
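You can see both behaviours with Python's csv module, which applies the same quote-doubling rule:

import csv, io

print(next(csv.reader(io.StringIO('""#33CCFF"'))))   # ['#33CCFF"']  (what you got)
print(next(csv.reader(io.StringIO('"""#33CCFF"'))))  # ['"#33CCFF']  (what you wanted)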