concern while importing/linking csv to access database - csv

I have a csv file with delimiter as , (comma) and few of the data column of same file has comma in it .
Hence while linking / importing the file, data is getting jumbled in next column.
I have tried all possible means like skip column etc , but not getting any fruitful results.
Please let me know if this can be handled through VBA function in ms-access.

If the CSV file contains text fields that contain commas and are not surrounded by a text qualifier (usually ") then the file is malformed and cannot be parsed in a bulletproof way. That is,
1,Hello world!,1.414
2,"Goodbye, cruel world!",3.142
can be reliably parsed, but
1,Hello world!,1.414
2,Goodbye, cruel world!,3.142
cannot. However, if you have additional information about the file, e.g., that it should contain three columns
a Long Integer column,
a Short Text column, and
a Double column
then your VBA code could read the file line-by-line and split the string on commas into an array. The first array element would be the Long Integer, the last array element would be the Double value, and the remaining "columns" in between could be concatenated together to reconstruct the string.
As you can imagine, that approach could easily be confounded (e.g., if there was more than one text field that might contain commas). Therefore it is not particularly appealing.
(Also worth noting is that the CSV parser in Access has never been able to properly handle text fields that contain line breaks, but at least we can import those CSV files into Excel and then import into Access from the Excel file.)
TL;DR - If the CSV file contains unqualified text containing commas then the system that produced it is broken and should be fixed.

Related

Load a CSV onto Apache Beam where there is a comma in some of the fields

I am loading a CSV into Apache Beam, but the CSV I am loading has commas in the fields. It looks like this:
ID, Name
1, Barack Obama
2, Barry, Bonds
How can I go about fixing this issue?
This is not specific to Beam, but a general problem with CSV. It's unclear if the second line should have ID="2, Barry" Name="Bonds" or the other way around.
If you can use some context (e.g. ID is always an integer, only one field could possibly contain commas) you could solve this by reading it as a text file line by line and parsing it into separate fields with a custom DoFn (assuming rows also contain newlines).
Generally, non-separating commas should be inside quotes in well-formed CSV, which makes this much more tractable (e.g. it would just work with the Beam Dataframes read_csv.)

Reading a .dat file in Julia, issues with variable delimeter spacing

I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column including all the data frome each column concatenated into one string. I tried to even specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received an the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CS
V.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix
you can also pre-process the file turning multiple spaces into single spaces
That's probably the easiest way to do this and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
open(csv_path, write=true) do io
for line in eachline(dat_path)
join(io, split(line), ',')
println(io)
end
end
return csv_path
end
function dat2csv(dat_path::AbstractString)
base, ext = splitext(dat_path)
ext == ".dat" ||
throw(ArgumentError("file name doesn't end with `.dat`"))
return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line form it at a time
eachline strips any trailing newline from each line, so each line will be bunch of numbers separated by whitespace with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
Try using the readdlm function in DelimitedFiles library, and convert to DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

Is there any technical difference between CSV, a TSV or a TXT file?

I use these files constantly in my application, but aren't CSV, TSV or TXT files all flat files?
The content is:
"sample","sample"
They are all text files, following the same "guidelines". The difference between the files are - as long as the creator followed some "rules", that:
A csv file will have comma separated values and a tsv file will have tab seperated values.
For .txt files, there is no formatting specified.
.csv stands for comma separated values, .tsv stands for tab separated values.
As the names suggest, different elements in the file are separated by ',' and '\t' respectively.
The type is chosen depending on the data. If we have say numbers larger than 3 digits, we might need commas as part of the content ans it would be better to use a csv in that case.
Both are types of text files and are increasingly used for classification and data mining purposes.
They do not have any other technical distinguishing factor.
A text file (which might have a txt file extension) will have lines separated by a platform specific line separator (CRLF on Windows, LF on Linux, and so on), and it will tend to contain characters human readable as text in some encoding. Apart from that human readability expectation this allows pretty much any file content on some platforms, so this is more of a content classification than a specific file format.
The other two formats are usually considered special cases of a text file intended to allow easy automated processing; tsv, a "tab separated values" file is simpler than csv, a "comma separated values" file.
csv will have commas as field separators, and it may use quoting and escaping especially to handle commas and quotes occurring in those fields. It may also include a header line as the first line in the file. The last line in the file may or may not end with a line separator.
(Details.)
tsv simply disallows tabs in the values, the header line is mandatory, the final line separator is mandatory.
(Details.)
A "flat file", in connection with databases, is a text file as opposed to a machine optimized storage method (such as a fixed size record file or a compressed backup file or a file using more elaborate markup language supporting data validation); a flat file tends to be csv or tsv or similar.
This answer benefited from a comment by Alex Shpilkin.

Weka and CSV files

I'm currently trying to import some data into weka. Currently the data is in a CSV file, and consists of a numerical ID and then some string data(Tweets). I'm getting an error where it is reading "Wrong number of values, Read 1, expected 2 Token[EOL], line 17". I'm using quotes as my enclosure characters for the String data. I understand that something(presumably an EOL character?) is causing weka to incorrectly separate some of the String data into multiple entries on the same line, but I'm not sure how to fix the EOL token problem.
My data set can be viewed here. The current data set is on Sheet 2:
https://docs.google.com/spreadsheets/d/1Yclu0t4ITFWn6itYBsVtkGalmP9BPaWFFP6U6jAeLMU/edit?usp=sharing
The text file itself may be found here:
https://drive.google.com/file/d/0B433FqC3TscQQkRxZklQclA3Z3M/view?usp=sharing
Current error is now on the 3rd line, with the same error. The only newline character there is the one at the end of the line denoting a new entry, so I'm not sure why its having issues.
In its datasets, Weka considers a newline character as an indication of the end of instance. Your line 17 is actually a multi-line tweet which confuses Weka. You can use either
a RegEx to get rid of the newline characters in every single tweet or
during downloading the tweets, clean the tweets to get rid of any newline character in them.
Unfortunately, Weka does not have a mechanism to get rid of this problem by itself (as far as I know).
EDIT
Okay, here are some other things that need to be fixed (according to your EDITS in the question):
Replace ' with \'
Replace grave accent with \grave accent
Many tweets contain quotes inside quotes. The inside double quotes (") should be replaced by \"
If you put your tweets inside double quotes, then your header should be id, "text"
Some tweets contain two consecutive double quotes, get rid of them or replace them with \".
I cannot say exactly where, because I lost trace, but I think still some tweets contain new lines in them (or at least one tweet has it still)
These are just a few things that I noticed. There might be more. Time will tell.

csv parsing, quotes as characters

I have a csv file that contains fields such as this:
""#33CCFF"
I would imagine that should be the text value:
"#33CCFF
But both excel and open office calc will display:
#33CCFF"
What rule of csv am I missing?
When Excel parses the value, it does not first remove the outer quotes, to then proceed reading what's in between them. Even if it would, what would it do with the remaining " in front of #? It can not show this as you expect "#33CCFF, because for it to appear like that, the double quote should have been escaped by duplicating it. (That might be the 'csv' rule you are missing.)
Excel reads the value from left to right, interpreting "" as a single ", it then reads on, and finds an unexpected double quote at the end, 'panics' and simply displays it.
The person/application creating the csv file made the mistake of adding encapsulation without escaping the encapsulation character within the values. Now your data is malformed. Since when using encapsulation, the value should be """#33CCFF", and when not using encapsulation, it should be "#33CCFF.
This might clarify some things