Line feed within a column in CSV

I have a CSV file like the one below. Some columns contain line breaks, as column B does here. When I run wc -l file.csv, Unix returns 4, but there are actually only 3 records. I don't want to replace the line breaks with spaces: I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that Unix treats a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds

Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, and packages like csvkit provide useful command-line utilities.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
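The same count can be obtained with Python's standard csv module. This is only a minimal sketch: count_csv_records is a hypothetical helper name, and the inline sample simply mirrors the file from the question.

```python
import csv
import io


def count_csv_records(stream):
    """Count CSV records; a quoted field with embedded newlines stays in one record."""
    return sum(1 for _ in csv.reader(stream))


# Inline copy of the sample from the question (header plus two data records).
sample = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'

# The text spans 4 physical lines, but the parser sees 3 records.
print(count_csv_records(io.StringIO(sample)))

# For a file on disk, open it with newline="" so the csv module sees the raw line breaks:
# with open("file.csv", newline="") as f:
#     print(count_csv_records(f))
```

Like csvstat -H, this counts the header row as a record; subtract one if only the data records are wanted.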

What does 'Multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data in two CSV files and followed the process below in RIDE:
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are the contents of my two CSV files:
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
Because some of the data does not match, the result is FAIL when I run the code in RIDE. The log shows the following:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
 Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the "--- first", "+++ second" and "@@ -1,4 +1,4 @@" lines.
Thanks in advance!
When Robot Framework compares multiline strings (data that contains newlines), it reports the differences in the same format as the standard Unix tool diff. Those characters are all part of what is called a unified diff. Even though you pass in raw data, it treats the data as two files and shows the differences between them in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output? (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line tells you which line ranges the hunk covers in each input (here, 4 lines starting at line 1 in both), and the + and - show you which lines are different.
In your specific example it is telling you that three lines differ between the two strings: the lines beginning with Divy and Parth, and the last line, which begins with kjhjmb in the first string and with acc in the second. Since the line beginning with Harshil does not show a + or -, it is identical in both strings.
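The format can be reproduced outside Robot Framework. Below is a minimal sketch using difflib.unified_diff from Python's standard library on the two strings from the question. It is meant only to illustrate the unified diff format; whether Robot Framework renders its report this way internally is an assumption, not something stated above.

```python
import difflib

# The two file contents from the question, as plain strings.
csv_a = "Harshil,45,8.03,DMJ\nDivy,55,8,VVN\nParth,1,9,vvn\nkjhjmb,44,0.5,bugg"
csv_b = "Harshil,45,8.03,DMJ\nDivy,55,78,VVN\nParth,1,9,vvnbcb\nacc,5,6,afafa"

diff_lines = list(
    difflib.unified_diff(
        csv_a.splitlines(),
        csv_b.splitlines(),
        fromfile="first",   # label shown after "---"
        tofile="second",    # label shown after "+++"
        lineterm="",        # inputs have no trailing newlines, so add none to the headers
    )
)
print("\n".join(diff_lines))
```

Lines prefixed with a space are unchanged context, lines prefixed with - exist only in the first string, and lines prefixed with + exist only in the second.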

Unix diff with custom line separator

Looking to compare two CSV files. Suppose the field separator is $, each record has two fields, and the file can be formatted something like:
a$simple line$
b$run-on-
line$
c$simple line$
Is there some switch or variety of Unix diff command that will let me run the comparison where the record separator (line separator) is the $ sign immediately followed by a new line?
Ideally I want to be guaranteed that diff outputs the entire record when any change is detected.
With the default behavior, I could potentially get a partial record as diff output (in scenarios where the record runs over several lines).
Is there some smarter way to do this that I'm not considering?
--
Edited to add: sample of expected output
If I compared the CSV file above with:
a$simple line$
b$run-on-changed-
line$
c$simple line$
... I would want to see the entire record b reported as a difference. Something like
2c2
< b$run-on-\nline$
---
> b$run-on-changed-\nline$
Peter, GNU diff has no direct support for a custom line separator: http://man7.org/linux/man-pages/man1/diff.1.html (gnu diffutils)
You could use sed twice: first to convert your format to one record per line, then diff the converted files, then sed again to convert the result back to the multiline record format.
The first sed pass turns each $\n into a real record-ending \n, and each \n without a $ before it into some unique special sequence, like #%#$%#$%#$#.
Then run diff.
The second sed pass converts #%#$%#$%#$# back to \n (or to \\n, for easier viewing of the diff output).
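The same idea can be sketched in Python instead of sed and diff. Note the substitutions: this uses the re module for the conversion and difflib for the comparison, so the output is a unified diff rather than the "2c2" normal format shown in the question. one_record_per_line and PLACEHOLDER are hypothetical names, and the sketch assumes a field never ends with $ immediately before an embedded line break.

```python
import difflib
import re

# Visible stand-in for a line break inside a record (an arbitrary choice for this sketch).
PLACEHOLDER = "\\n"


def one_record_per_line(text):
    """Replace every newline that is NOT preceded by '$' with PLACEHOLDER."""
    return re.sub(r"(?<!\$)\n", lambda _match: PLACEHOLDER, text)


old = "a$simple line$\nb$run-on-\nline$\nc$simple line$\n"
new = "a$simple line$\nb$run-on-changed-\nline$\nc$simple line$\n"

diff_lines = list(
    difflib.unified_diff(
        one_record_per_line(old).splitlines(),
        one_record_per_line(new).splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Because each record now occupies exactly one line, any change inside record b is reported as the whole record being removed and re-added.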
There are diff variants which support working with csv. Some of them may handle csv with line breaks inside fields:
https://pypi.python.org/pypi/csvdiff (python)
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
https://github.com/agardiner/csv-diff (ruby)
Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.
http://csvdiff.sourceforge.net/ (perl)
csvdiff is a Perl script to compare/diff two (comma) separated files with each other. The part that is different from standard diff is that you'll get the number of the record where the difference occurs and the field/column which is different. The separator can be set to the value you want, not just comma. You can also provide a third file which contains the column names in one(!) line separated by your separator.

Load csv file with integers in Octave 3.2.4 under Windows

I am trying to import into Octave a file (data.txt) containing 2 columns of integers, such as:
101448,1077
96906,924
105704,1017
I use the following command:
data = load('data.txt')
However, the "data" matrix that results has a 1 x 1 dimension, with all the content of the data.txt file saved in just one cell. If I adjust the numbers to look like floats:
101448.0,1077.0
96906.0,924.0
105704.0,1017.0
the loading works as expected, and I obtain a matrix with 3 rows and 2 columns.
I looked at the various options that can be set for the load command but none of them seem to help. The data file has no headers, just plain integers, comma separated.
Any suggestions on how to load this type of data? How can I force Octave to cast the data as numeric?
The load function is not meant for reading CSV files. It is meant to load files saved from Octave itself, which define variables.
To read a CSV file, use csvread ("data.txt"). Also, 3.2.4 is a very old version that is no longer supported; you should upgrade.

How to read a file where one column's data spills into another column, using Talend Data Integration

I receive data in CSV format daily.
Example data looks like:
Emp_ID emp_leave_id EMP_LEAVE_reason Emp_LEAVE_Status Emp_lev_apprv_cnt
E121 E121- 21 Head ache, fever, stomach-ache Approved 16
E139 E139_ 5 Attending a marraige of my cousin Approved 03
Here you can see that emp_leave_id and EMP_LEAVE_reason column data is shifted/scattered into the next columns.
The problem is that, using tFileInputDelimited and various reading patterns, I couldn't load the data correctly into my target database. Mainly, I'm not able to read the data correctly with that component in Talend.
Is there a way that I can properly parse this CSV to get my data in the format that I want?
This is probably a TSV file. Not sure about Talend, but uniVocity can parse these files for you:
TsvDataStoreConfiguration tsv = new TsvDataStoreConfiguration("my_TSV_datastore");
tsv.setLimitOfRowsLoadedInMemory(10000);
tsv.addEntities("/some/dir/with/your_files", "ISO-8859-1"); //all files in the given directory path will be accessible as entities.
JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("my_Database", myDataSource);
database.setLimitOfRowsLoadedInMemory(10000);
Univocity.registerEngine(new EngineConfiguration("My_ETL_Engine", tsv, database));
DataIntegrationEngine engine = Univocity.getEngine("My_ETL_Engine");
DataStoreMapping dataStoreMapping = engine.map("my_TSV_datastore", "my_Database");
EntityMapping entityMapping = dataStoreMapping.map("your_TSV_filename", "some_database_table");
entityMapping.identity().associate("Emp_ID", "emp_leave_id").toGeneratedId("pk_leave"); //assumes your database does not keep the original ids.
entityMapping.value().copy("EMP_LEAVE_reason", "Emp_LEAVE_Status").to("reason", "status"); //just copies whatever you need
engine.executeCycle(); //executes the mapping.
Do not use a CSV parser to parse TSV inputs. It won't handle escape sequences properly (for example, for a \t inside a value you will get the escape sequence instead of a tab character), and it will surely break if a value has a quote in it (a CSV parser will try to find the closing quote character and will keep reading characters until it finds another quote).
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Compare CSV files

I am currently using a Windows utility called TableTexCompare.
This tool can take 2 CSV files and compare them. The nice thing about it is that it can make the comparison even if the records of the 2 files are not sorted in the same order or the fields are not positioned in the same order.
As such, the following 2 files would result in a successful comparison:
(File1.csv)
FirstName,LastName,Age
Mona,Sax,30
Max,Payne,43
Jack,Lupino,50
(File2.csv)
FirstName,Age,LastName
Max,43,Payne
Jack,50,Lupino
Mona,30,Sax
What I am looking for is to do the same thing from the command-line with just 1 difference:
I would like the comparison to be performed in one direction only, i.e. if File2.csv is as follows (a subset of File1.csv), the comparison should pass
(File2.csv)
FirstName,Age,LastName
Jack,50,Lupino
I do not particularly care if it's going to be in some programming language, a dedicated cli tool or a shell script (e.g. using awk). I have some experience with Java and Groovy but would like to be pointed to some initial direction.
I can offer a Python solution:
import csv

with open("file1.csv", newline="") as f1, open("file2.csv", newline="") as f2:
    r1 = list(csv.DictReader(f1))  # all rows of file1, as dicts keyed by column name
    r2 = csv.DictReader(f2)
    for item in r2:
        if item not in r1:
            print("r2 is not a subset of r1!")
            break
This is actually a bit more verbose than necessary in Python (but easier to understand); I personally would have used a generator expression:
import csv

with open("file1.csv", newline="") as f1, open("file2.csv", newline="") as f2:
    r1 = list(csv.DictReader(f1))  # all rows of file1, as dicts keyed by column name
    r2 = csv.DictReader(f2)
    if all(item in r1 for item in r2):
        print("r2 is a subset of r1")
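One caveat when comparing DictReader rows directly: on Python 3.6 and 3.7 the rows are OrderedDict objects, whose equality check is sensitive to key order, so files with reordered columns may fail to match there. The sketch below sidesteps this by turning each row into a frozenset of (column, value) pairs, which also turns the check into a set comparison. rows_as_set and is_subset are hypothetical helper names, the inline data mirrors the files from the question, and the code assumes every row has exactly as many fields as the header.

```python
import csv
import io


def rows_as_set(stream):
    """Read CSV rows as a set of frozensets of (column, value) pairs."""
    return {frozenset(row.items()) for row in csv.DictReader(stream)}


def is_subset(small, big):
    """True if every row of `small` also appears in `big`, ignoring row and column order."""
    return rows_as_set(small) <= rows_as_set(big)


# Inline copies of File1.csv and the one-row File2.csv from the question.
file1 = "FirstName,LastName,Age\nMona,Sax,30\nMax,Payne,43\nJack,Lupino,50\n"
file2 = "FirstName,Age,LastName\nJack,50,Lupino\n"

print(is_subset(io.StringIO(file2), io.StringIO(file1)))  # file2 within file1
print(is_subset(io.StringIO(file1), io.StringIO(file2)))  # the reverse direction
```

Because the rows go into sets, duplicate rows collapse into one; if the number of occurrences matters, collections.Counter would be the closer fit.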
If you can afford to do a case insensitive comparison, and if there are no duplicates within File2.csv that must be matched within File1.csv, and if File1.csv does not contain \\ or \", then all you need is a simple FINDSTR command.
The following will list lines in File2.csv that do not appear in File1.csv:
findstr /vxig:"File1.csv" "File2.csv"
If all you want is an indication whether File1.csv is a superset of File2.csv, then
findstr /vxig:"File1.csv" "File2.csv" >nul && (echo File1 is NOT a superset of File2) || (echo File1 IS a superset of File2)
The search should not have to be case insensitive, except there is a nasty FINDSTR bug: it may fail to find matches when there are multiple case sensitive literal search strings of varying size. The case insensitive option avoids the bug. See Why doesn't this FINDSTR example with multiple literal search strings find a match? for more info.
The search will not work properly if File2.csv contains \\ or \" because FINDSTR will treat them as \ and " respectively. See What are the undocumented features and limitations of the Windows FINDSTR command? for more info. The accepted answer has sections describing FINDSTR escape sequences about half way down.
You can take a look at q - Text as a Database, which allows executing SQL directly on CSV files, including joins. This makes a comparison easy, and much more besides, such as matching specific columns for equality or getting specific columns from rows that don't match.
Full disclosure - It's my own open source tool.
Harel Ben-Attia