Sorting and separating data in UNIX shell - csv

I am relatively new to using the UNIX shell, and I am having trouble with a .csv file. My aim is to create a new file that has all of the same data but sorted. I have achieved this to an extent by using the command
sort -t, datafile.csv>newdatafile.csv
However, I seem to lose some lines. The original file has 271116 lines and the new sorted file has 33889; why have some lines been thrown away?
I would also like to know how I can take the first 100 lines of a csv file and create a new file with just those 100 lines.
Thanks

To print only the first 100 lines of a file, use the head command:
head -n 100 datafile.csv > newdatafile.csv
By default, head prints the first 10 lines.
Use -n xxx to print more or fewer lines.
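For example (the output file name is just a placeholder):
head -n 5 datafile.csv                      # show only the first 5 lines
head -n 1000 datafile.csv > first1000.csv   # save the first 1000 lines to a new file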

awk to the rescue:
Assuming it's a .csv file, then:
awk -F"," 'NR == 1,NR == 100 {print $0 > "newdatafile.csv"}' datafile.csv
It'll save the first 100 lines of the file into a new file named newdatafile.csv.
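An equivalent that stops reading as soon as it has printed 100 lines, which helps on a file as large as the 271116-line one in the question, would be something like:
awk 'NR > 100 { exit } { print }' datafile.csv > newdatafile.csv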
Hope this helps.

You are very close with respect to the sorting requirement of your question:
Use the command below, for example, to sort based on column 3 of your CSV:
sort -t, -k3 datafile.csv>newdatafile.csv
This will sort the file based on column 3 alphabetically.
For numeric columns, use the command below to sort in ascending order:
sort -t, -nk3 datafile.csv>newdatafile.csv
To sort numerically in descending order:
sort -t, -nrk3 datafile.csv>newdatafile.csv
Also, to get the first 100 rows from the sorted file, use:
sort -t, -k3 datafile.csv | head -100 >newdatafile.csv
This will sort datafile.csv alphabetically on column 3, then select the first 100 rows and write them into newdatafile.csv.
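If the file happens to start with a header row that should stay on line 1 rather than be sorted in with the data, one possible approach is to copy the header first and sort only the remaining lines (a sketch, assuming a single header line):
head -n 1 datafile.csv > newdatafile.csv
tail -n +2 datafile.csv | sort -t, -k3 >> newdatafile.csv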

Related

Splitting a CSV File by value of specific column

I have multiple CSV files that I need to split into 67 separate files each. Each sheet has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67 which is what I have to base the split on. I have been doing this split manually by selecting all of the rows within each value (1, 2, 3, etc) and pasting them into their own CSV file and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck in doing so this go around. Any tips or tricks would be greatly appreciated!
If you have access to awk there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> ("code" $1 ".csv")}' data.csv
That will save it to files like code1.csv etc., skipping the header line.
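If each of the output files should also carry the header line, a variation along these lines should work (a sketch, assuming the layout above; the parentheses around the file name keep the string concatenation unambiguous):
awk -F, '
NR == 1       { hdr = $0; next }                                # remember the header line
!($1 in seen) { seen[$1] = 1; print hdr > ("code" $1 ".csv") }  # start each new Code file with the header
              { print $0 > ("code" $1 ".csv") }                 # then append the data row
' data.csv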

How to sort this CSV file by date with the Unix sort command?

I have never used UNIX before, and am using this because I could not find a solution on Windows to sort this list by date for such a large file.
I am trying to sort a CSV file with 14 million entries (the file is 2gigs). The file is all of the taxi transactions that happened in 2013 during the month of January. I wanted to sort the list by date so that I could only select data from the first week.
I found the GNU sort documentation (https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html) and I have been trying to write a command that will do what I want. What I have tried so far is
sort -t, -k 6n 8-trip_data_1.csv
that didn't work.
I think I want to tell it to sort by the 6th column (pickup_datetime) and then by the 9th and 10th characters of that column, because that is all that will be changing in the date column across the file. I put some of the table below.
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
You don't need the n — indeed, it is counterproductive. The dates are in ISO 8601 format, and they sort in time order when sorted alphanumerically. Numeric sorting only pays attention to the 2013 part of the field; the rest isn't part of a single number. You also don't need to worry about subsetting the time information — the fact that only some parts change won't matter.
You've given a very minimal data set with the pickup-time information already in sorted order, so we have to get a little inventive. The heading information won't sort numerically; you can remove it, or let it float around. To show that the sorting works when the data is sorted, I specify r (reverse order). This puts the heading data at the top and reverses the two lines of actual data.
$ sort -t, -k6r data.file
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
$
Or, in ascending order (the heading goes at the end):
$ sort -t, -k6 data.file
A6699B6310BFDF8D1EE42C12622D94FA,66C6E65E8D6476B8DDA075A01D63E78A,VTS,1,,2013-01-16 19:21:00,2013-01-16 19:35:00,2,840,1.71,-73.986603,40.739986,-73.99221,40.719715
B45D26A20BE724B0F752461C624233CB,B240D08915F9F593F219D9109127FF1A,VTS,1,,2013-01-16 19:26:00,2013-01-16 19:32:00,3,360,.67,-73.982338,40.768349,-73.981285,40.774017
medallion,hack_license,vendor_id,rate_code,store_and_fwd_flag,pickup_datetime,dropoff_datetime,passenger_count,trip_time_in_secs,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude
$
Also, you can decide which dates are relevant and modify this grep command to select the correct dates for the first week — which reduces the data size to about one quarter of its original size.
grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
That looks for dates in the range 2013-01-01 through 2013-01-07 (allowing any time for each day). You could omit the regex after the blank if you prefer; if the data is valid, it won't make any difference, but the regex avoids selecting some invalid data. Obviously, you can change the dates if you want the first week to run, for example, from the first Sunday through the first Saturday (Sunday 6th to Saturday 12th 2013):
grep -E ',2013-01-(0[6-9]|1[012]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file
You could then run this reduced data set through the sort process.
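Putting the two steps together in one pipeline (the output file name is just a placeholder):
grep ',2013-01-0[1-7] [0-2][0-9]:[0-5][0-9]:[0-5][0-9],' data.file | sort -t, -k6 > week1_sorted.csv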
In future, please give 5 lines or so for sample data — it's easier to demonstrate what's working and what's not.
I am sure you do not want to remove the header, nor want it to "float", so create an executable file sort_csv:
#!/usr/bin/perl
use strict;
use warnings;

# Compare two lines by characters 81-88, i.e. the day and HH:MM of
# pickup_datetime; the offsets assume the fixed column widths of the sample data.
sub my_cmp($$)
{
    my $a = shift;
    my $b = shift;
    return substr($a, 81, 8) cmp substr($b, 81, 8); # assuming seconds are always zero
}

print scalar (<>);      # pass the header line through unchanged
print sort my_cmp <>;   # sort the remaining lines
And then:
# Make it executable
chmod +x sort_csv
./sort_csv <input.csv >sorted.csv

Adding a prefix to header row in csv file with awk

I am currently working with datasets collected in large CSV files (over 1600 columns and 100 rows). Excel and LibreOffice Calc can't easily handle these files for concatenating a prefix or suffix to the header row, which is what I would have done on a smaller dataset.
Researching the topic I was able to come up with the following command:
awk 'BEGIN { FS=OFS="," } {if(NR==1){print "prefix_"$0}; if(NR>1){print; next}}' input.csv >output.csv
Unfortunately, this only adds the prefix to the first cell. For example:
Input:
head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
Expected Output:
prefix_head_1,prefix_head_2,prefix_head_3,[...],prefix_head_n
"value_1","value_2","value_3",[...],"value_n"
Real Output:
prefix_head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
As the column number may be variable across different csv files, I would like a solution that doesn't require enumeration of all columns as found elsewhere.
This is necessary as the following step is to combine various (5 or 6) large csv files in a single csv database by combining all columns (the rows refer to the same instances, in the same order, across all files).
Thanks in advance for your time and help.
awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) $i="prefix_"$i} 1' file

Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20000 rows and five columns. I want to extract particular columns, but some values contain commas of their own (except in the header), so how do I cut such columns?
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6
In general, for these scenarios it is best to use a proper CSV parser. You can find those in Python, for example.
However, since your data seems to have fields with embedded commas only near the beginning of each row, you can simply print the first field and then the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3
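If GNU awk is available, its FPAT feature gets closer to real CSV parsing by describing what a field looks like rather than what separates fields, so quoted fields with embedded commas count as single fields (a sketch, assuming gawk 4.0 or later and no empty fields):
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")"; OFS = "," } { print $1, $(NF-1), $NF }' file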
In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
  (lambda (line)
    (let ((f (tok-str line #/"[^"]*"|[^,]+/)))
      (put-line `#[f 0],#[f 4],#[f 5]`)))
  (get-lines))
As a condensed one liner:
$ txr -t '(mapcar* (do let ((f (tok-str #1 #/"[^"]*"|[^,]+/)))
`#[f 0],#[f 4],#[f 5]`) (get-lines))' < data

Delete rows of CSV file based on the value of a column

Here's an example of a few lines of my CSV file:
movieID,actorID,actorName,ranking
1,don_rickles,Don Rickles,3
1,jack_angel,Jack Angel,6
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
1,wallace_shawn,Wallace Shawn,5
I would like to remove all rows that have a ranking of > 4. So far I've been trying to use this awk line:
awk -F ',' 'BEGIN {OFS=","} { if (($4) < 5) print }' file.csv > file_out.csv
It should print all the rows with a ranking (4th column) of less than 5 to a new file. I can't tell exactly what this line actually does, but it's not what I want. Can someone tell me where I've gone wrong with that line?
Instead of deleting the records, think of which ones you're going to print. I guess it's <=4. In idiomatic awk you can write this as
$ awk -F, '$4<=4' file
1,don_rickles,Don Rickles,3
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
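If the header line should be kept in the output file as well, the condition can be extended to let line 1 through unconditionally (a sketch, writing to a new file as in the question):
awk -F, 'NR==1 || $4<=4' file.csv > file_out.csv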