Unix diff with custom line separator - csv

Looking to compare two CSV files. Suppose the field separator is $, each record has two fields, and the file can be formatted something like:
a$simple line$
b$run-on-
line$
c$simple line$
Is there some switch or variety of Unix diff command that will let me run the comparison where the record separator (line separator) is the $ sign immediately followed by a new line?
Ideally I want to be guaranteed that diff outputs the entire record when any change is detected.
With the default behavior, I could potentially get a partial record as diff output (in scenarios where the record runs over several lines).
Is there some smarter way to do this that I'm not considering?
--
Edited to add: sample of expected output
If I compared the CSV file above with:
a$simple line$
b$run-on-changed-
line$
c$simple line$
... I would want to see the entire record b reported as a difference. Something like
2c2
< b$run-on-\nline$
---
> b$run-on-changed-\nline$

Peter, there is no direct support for a custom line separator in GNU diff: http://man7.org/linux/man-pages/man1/diff.1.html (GNU diffutils)
You may try using sed twice: one sed pass to convert your format to one record per line, then diff the converted files, then a second sed pass to convert back to the multiline record format.
The first sed pass keeps every $\n as a real record-ending \n, and converts every \n not preceded by $ into some unique special sequence, like #%#$%#$%#$#.
Then do the diff.
The second sed pass converts #%#$%#$%#$# back to \n (or to \\n, for easier viewing of the diff output).
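A minimal sketch of that round trip, assuming GNU sed and that the placeholder @@NL@@ never occurs in the data (the function and file names are illustrative):

flatten() {
  # Slurp the whole file into the pattern space, then replace every newline
  # that is not preceded by $ with the placeholder, joining run-on records.
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\([^$]\)\n/\1@@NL@@/g' "$1"
}

flatten file1.csv > file1.flat
flatten file2.csv > file2.flat

# Diff the one-record-per-line versions, then render the placeholder as \n
# so the embedded line breaks stay visible in the diff output.
diff file1.flat file2.flat | sed 's/@@NL@@/\\n/g'

On the two sample files above this reports the whole of record b as a single difference, much like the 2c2 example in the question. Note that the slurping sed holds each file in memory, so this suits moderately sized files.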

There are diff variants that support working with CSV. Some of them may handle CSV with line breaks inside fields:
https://pypi.python.org/pypi/csvdiff (python)
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
https://github.com/agardiner/csv-diff (ruby)
Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.
http://csvdiff.sourceforge.net/ (perl)
csvdiff is a perl script to compare/diff two (comma) separated files with each other. The part that differs from standard diff is that you get the number of the record where the difference occurs and the field/column which is different. The separator can be set to whatever value you want, not just a comma. You can also provide a third file which contains the column names in one(!) line, separated by your separator.

Related

How can I change specific recurring text on a very large HTML file?

I have a very big HTML file (we're talking about 20 MB) and I need to remove a large number of nodes of the form:
<tr><td>SPECIFIC-STRING</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
The file I need to work on is basically made of thousands of these strings, and I only need to remove those that have a specific first string, for instance, all those with the first string being "banana":
<tr><td>banana</td><td>RANDOM-STRING</td><td>RANDOM-STRING</td></tr><tr><td style="padding-top:0" colspan="3">RANDOM-STRING</td></tr>
I tried to achieve this by opening the file in Geany and using the replace feature with this regex:
<tr><td>banana<\/td><td>(.*)<\/td><td>(.*)<\/td><\/tr><tr><td(.*)<\/td><\/tr>
but the console output said it replaced only X occurrences, when I know there are far more occurrences than that in the file.
Firefox, Chrome and Brackets fail even to display the HTML source of the file due to its size. I can't think of another way to do this because of my inexperience with HTML.
You could use a stream editor, which, as the name suggests, streams the file content and thus never loads the whole file into main memory.
A popular stream editor is sed, and it supports regular expressions.
Your command would have the following structure.
sed -i -E 's/SEARCH_REGEX/REPLACEMENT/g' INPUTFILE
-E for support of extended RegEx
-i for in-place editing mode
s denotes that you want to replace values
g is for global. By default sed would only replace the first occurrence on each line, so to replace all occurrences you must provide g
SEARCH_REGEX is the RegEx you need to find the substrings you want to replace
REPLACEMENT is the value you want to replace all matches with
INPUTFILE is the file sed will read line by line and do the replacement in for you.
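For the pattern in the question, a minimal sketch might look like this (the file name is illustrative, banana is taken from the question, and it assumes no node group is split across lines, since sed processes one line at a time; [^<]* is used instead of .* so each match stops at the next tag):

sed -i -E 's|<tr><td>banana</td><td>[^<]*</td><td>[^<]*</td></tr><tr><td[^>]*>[^<]*</td></tr>||g' page.html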
While regex may not be the best tool for this kind of job, try this adjustment to your pattern:
<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>
That makes your .* matches lazy; I suspect the greedy patterns are consuming too much of each line.
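If you want to run the replacement from the command line rather than in Geany, note that sed's basic and extended regexes do not support the lazy .*? quantifier; a sketch of the same substitution with perl instead (the file name is illustrative, and it still assumes no group is split across lines):

perl -i -pe 's/<tr><td>banana<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr><tr><td(.*?)<\/td><\/tr>//g' page.html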

CSV - Deleting lines containing numbers with wrong structure

Good afternoon all,
I was saving data from an oscilloscope to a USB stick as comma-separated files with decimal points, and apparently there was some problem with the transfer, resulting in lines that do not match the usual numerical format. It is hard to explain, easier to show:
1.788400e-04,0.008,0.006,0.008
1.788600e-04,-0.008,0.002,0.02
1.788800e-04,0.016,0.002,0
1.789200e-04,0,0.002.673200e-04,0.008,0.012,0.12
1.673400e-04,0,-0.002,0.008
1.673600e-04,0,0.01,0.012
1.673800e-04,0.008,0.002,0.008
What I mean is the 0.002.673200e-04 on the 4th row. Luckily it is not too frequent, and lines such as this can simply be deleted. They are however hard to find, as the files are around a million lines long. First I thought it would be easy to locate the .002. and delete those lines using:
grep -v ".002." testfile.csv > testfile-fixed.csv
This indeed worked; however, the number between the dots changes. So far I have found .000. and .002., and it may not be limited to those two.
The other thing that changes is the number of columns.
Is there some easy way to get rid of these lines?
thank you
If it is OK to delete all the lines containing a number with two dots, I suggest using sed instead of grep.
sed '/\.[0-9]*\./d' testfile.csv > testfile-fixed.csv
This command deletes the lines matching the regex \.[0-9]*\., i.e. all the lines containing a dot, followed by zero or more digits, followed by another dot.
You can even make the change inside the file itself, but if you make a mistake you can destroy your file, so make a backup first. Use the -i flag with sed:
sed -i '/\.[0-9]*\./d' testfile.csv
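Since the question also mentions that the number of columns changes on the damaged lines, here is a sketch with awk that keeps only lines with exactly four well-formed numeric fields (assuming four columns is the expected record shape):

awk -F, 'NF == 4 {
  ok = 1
  # require every field to look like a plain or scientific-notation number
  for (i = 1; i <= NF; i++)
    if ($i !~ /^-?[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?$/) ok = 0
  if (ok) print
}' testfile.csv > testfile-fixed.csv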

line feed within a column in csv

I have a CSV like the one below. Some columns contain line breaks, like column B below. When I run wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that Unix treats a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
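If you would rather not install csvkit, here is a sketch of the same record count using Python's standard csv module from the shell (assumes python3 is on your PATH; the file name matches the example above):

python3 -c '
import csv, sys
# csv.reader handles quoted fields with embedded newlines, so each logical
# record is counted once no matter how many physical lines it spans.
with open(sys.argv[1], newline="") as f:
    print(sum(1 for _ in csv.reader(f)))
' foo.csv

For the sample file this prints 3, counting the header line as a record, just like the csvstat -H call above.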

How to sort this CSV file by date with the Unix sort command?

I have never used UNIX before, and am using this because I could not find a solution on Windows to sort this list by date for such a large file.
I am trying to sort a CSV file with 14 million entries (the file is about 2 GB). The file is all of the taxi transactions that happened during January 2013. I wanted to sort the list by date so that I could select only the data from the first week.
I found the GNU sort documentation (https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html) and have been trying to write a command that will do what I want. What I have tried so far is
sort -t, -k 6n 8-trip_data_1.csv
that didn't work.
I think I want to tell it to sort by the 6th column (pickup_datetime), and within that by characters 9-10 of the column (the day of the month), because that is all that changes in the date column across the file. I put some of the table below.
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
You don't need the n — indeed, it is counterproductive. The dates are in ISO 8601 format, and they sort in time order when sorted alphanumerically. Numeric sorting only pays attention to the 2013 part of the field; the rest isn't part of a single number. You also don't need to worry about subsetting the time information — the fact that only some parts change won't matter.
You've given a very minimal data set with the pickup-time information already in sorted order, so we have to get a little inventive. The heading information won't sort numerically; you can remove it, or let it float around. To show that the sorting works when the data is sorted, I specify r (reverse order). This puts the heading data at the top and reverses the two lines of actual data.
$ sort -t, -k6r data.file
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
$
Or, in ascending order (the heading goes at the end):
$ sort -t, -k6 data.file
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
$
Also, you can decide which dates are relevant and modify this grep command to select the correct dates for the first week — which reduces the data size to about one quarter of its original size.
grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
That looks for dates in the range 2013-01-01 through 2013-01-07 (allowing any time for each day). You could omit the regex after the blank if you prefer; if the data is valid, it won't make any difference, but the regex avoids selecting some invalid data. Obviously, you can change the dates if you want the first week to run, for example, from the first Sunday through the first Saturday (Sunday 6th to Saturday 12th 2013):
grep -E ',2013-01-(0[6-9]|1[012]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
You could then run this reduced data set through the sort process.
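Putting the two steps together, a sketch (the input file name comes from the question; the output name is illustrative, and note that the header line does not match the pattern, so it is dropped):

grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' 8-trip_data_1.csv |
  sort -t, -k6 > first_week_sorted.csv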
In future, please give 5 lines or so for sample data — it's easier to demonstrate what's working and what's not.
I am sure you do not want to remove the header nor want it to "float", so create an executable file sort_csv:
#!/usr/bin/perl
use strict;

# Compare two records by the day-and-time portion of pickup_datetime, relying
# on the fixed-width layout of the columns that precede it (offset 81).
sub my_cmp($$)
{
    my $a = shift;
    my $b = shift;
    return substr($a, 81, 8) cmp substr($b, 81, 8); # assuming seconds are always zero
}

print scalar (<>);      # pass the header line through unchanged
print sort my_cmp <>;   # sort the remaining lines
And then:
# Make it executable
chmod +x sort_csv
sort_csv <input.csv >sorted.csv

What does 'multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data of two CSV files and followed the process below in RIDE -
${csvA} = Get File ${filePathA}
${csvB} = Get File ${filePathB}
Should Be Equal As Strings ${csvA} ${csvB}
Here are the contents of my two CSV files -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As some of the data does not match, when I run the code in RIDE the result is FAIL, but the log shows the data below:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When Robot Framework compares multiline strings (data that has newlines in it), it shows the differences in the standard unified diff format, the same format produced by tools such as diff -u and git diff. Even though you pass in raw data, it is treating the data as two files and showing the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "## -1 +1 ##" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line gives you a reference for which line numbers are different, and the + and - show you which lines are different.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the line beginning with acc. Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
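You can reproduce the same style of output on the two files with command-line diff; the hunk header @@ -1,4 +1,4 @@ means the hunk covers 4 lines starting at line 1 in each file (the file names below are illustrative, and the timestamps diff normally prints after them are omitted):

$ diff -u csvA.csv csvB.csv
--- csvA.csv
+++ csvB.csv
@@ -1,4 +1,4 @@
 Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa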