I have multiple CSV files that I need to split into 67 separate files each. Each file has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67, which is what I have to base the split on. I have been doing this split manually by selecting all of the rows for each value (1, 2, 3, etc.) and pasting them into their own CSV file and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck with it this time around. Any tips or tricks would be greatly appreciated!
If you have access to awk there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> "code" $1 ".csv"}' data.csv
That will write the rows to files like code1.csv, code2.csv, etc., skipping the header line.
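If each output file should also keep that header row, a small variation handles it; a rough sketch, still assuming Code is the first column:
awk -F, 'NR == 1 { hdr = $0; next }
         { out = "code" $1 ".csv"
           if (!(out in written)) { print hdr > out; written[out] = 1 }
           print > out }' data.csv
Inside awk the > redirection truncates a file only the first time it is used and appends after that, so each codeN.csv ends up with the header once at the top.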
I have a csv like the one below. Some of the columns contain line breaks, like column B below. When I run wc -l file.csv, unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as-is. What should I do so that unix treats a record with an embedded line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, and packages like csvkit provide useful command-line utilities.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
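If you'd rather not install csvkit, the Python csv module mentioned above copes with the quoted newlines too; a rough one-liner, assuming Python 3 and that your file is named foo.csv:
python3 -c 'import csv, sys; print(sum(1 for _ in csv.reader(open(sys.argv[1], newline=""))))' foo.csv
That counts logical CSV records (3 for your sample) rather than physical lines.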
I am relatively new to using the Unix shell, and I am having trouble with a .csv file. My aim is to create a new file that has all of the same data but sorted. I have partly achieved this using the command
sort -t, datafile.csv>newdatafile.csv
However, I seem to lose some lines. The original file has 271116 lines and the new sorted file has 33889; why have some lines been thrown away?
I would also like to know how I can take the first 100 lines of a csv file and create a new file with just those 100 lines.
Thanks
To print only the first 100 lines of a file, use the head command:
head -n 100 datafile.csv > newdatafile.csv
By default head prints the first 10 lines.
Use -n xxx to print more or fewer lines.
awk to the rescue:
Assuming it's a .csv file, then:
awk -F"," 'NR == 1,NR == 100 {print $0 > "newdatafile.csv"}' datafile.csv
It'll save the first 100 lines of the file into a new file named newdatafile.csv.
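Since that version keeps reading the rest of a large file after line 100, you may want to bail out early; a possible variation:
awk 'NR <= 100 { print > "newdatafile.csv" } NR == 100 { exit }' datafile.csv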
Hope this helps.
You are very close with respect to the sorting requirement of your question:
Use the command below, for example, to sort on column 3 of your csv:
sort -t, -k3 datafile.csv>newdatafile.csv
This will sort the file based on column 3 alphabetically.
For numerical columns, use the command below to sort in ascending order:
sort -t, -nk3 datafile.csv>newdatafile.csv
To sort descending numerically:
sort -t, -nrk3 datafile.csv>newdatafile.csv
Also, to get the first 100 rows from the sorted file, use:
sort -t, -k3 datafile.csv | head -n 100 > newdatafile.csv
This will sort datafile.csv alphabetically on column 3, then take the first 100 rows and write them to newdatafile.csv.
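One caveat with all of the above: sort treats a header line as just another row, so it ends up wherever it happens to sort. If your csv has a header you want to keep on top, something along these lines works (assuming a single header line):
(head -n 1 datafile.csv; tail -n +2 datafile.csv | sort -t, -k3) > newdatafile.csv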
I am currently working with datasets collected in large CSV files (over 1600 columns and 100 rows). Excel or LibreOffice Calc can't easily handle these files, so I can't simply concatenate a prefix or suffix to the header row the way I would have done on a smaller dataset.
Researching the topic I was able to come up with the following command:
awk 'BEGIN { FS=OFS="," } {if(NR==1){print "prefix_"$0}; if(NR>1){print; next}}' input.csv >output.csv
Unfortunately, this only adds the prefix to the first cell. For example:
Input:
head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
Expected Output:
prefix_head_1,prefix_head_2,prefix_head_3,[...],prefix_head_n
"value_1","value_2","value_3",[...],"value_n"
Real Output:
prefix_head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
As the number of columns varies across different csv files, I would like a solution that doesn't require enumerating all of the columns, as the ones I found elsewhere do.
This is necessary as the following step is to combine various (5 or 6) large csv files in a single csv database by combining all columns (the rows refer to the same instances, in the same order, across all files).
Thanks in advance for your time and help.
awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) $i="prefix_"$i} 1' file
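The NR==1 block rewrites every field of the header line, and the trailing 1 is an always-true condition, so awk prints every record: the modified header followed by the data lines untouched. If you want a suffix instead of a prefix, the same pattern works; a sketch:
awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) $i=$i"_suffix"} 1' file
For the later step of combining the files column by column, paste -d, file1.csv file2.csv > combined.csv glues the rows side by side, which is fine as long as the rows really are in the same order in every file.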
So I'm just starting out with this whole PowerShell thing and so far so good - until now. I just can't figure out how to do this!
I'm looking at manipulating CSV files which are output from one system (which I can't change at output), renaming some column headers and merging a couple of the results into one column so that it matches the input requirements to upload into another system (again, I can't change those parameters).
So, as an example.
The first file is created:
File1.csv
"A","B","C""1","2","3"
I want a powershell script that will output:
File2.csv
"X","Y""1","23"
So I can import it into another system.
I hope that all makes sense, and thanks in advance for any assistance.
I'm going to assume that your actual/desired formats of your files look like this:
"A","B","C"
"1","2","3"
"X","Y"
"1","23"
rather than having everything in one line. If that's correct, you can import File1.csv with Import-Csv, then rename and merge columns with calculated properties:
... | Select-Object @{n='X';e={$_.A}}, @{n='Y';e={$_.B + $_.C}} | ...
and write the result to File2.csv with Export-Csv.
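Put together, a minimal sketch (keeping the File1.csv and File2.csv names from your question):
Import-Csv File1.csv |
    Select-Object @{n='X';e={$_.A}}, @{n='Y';e={$_.B + $_.C}} |
    Export-Csv File2.csv -NoTypeInformation
Import-Csv turns each row into an object with A, B and C properties, the calculated properties rename and merge them, and -NoTypeInformation keeps Export-Csv from writing a #TYPE line at the top of the output.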
My second homework assignment asks for us to write a command in Unix or Linux to extract certain sections of multiple files using head and tail. I'm not understanding how to go about this. Here is the question:
(5 points) Using head and tail, write a command to extract the second section of a file (i.e. the data section).
Turn this into an executable script called extractdata (you do not need to hand this in). Then, using find and extractdata, write a command to get the second section of all .csv files in the month directories, and place the output into a file called polls.csv. Be sure to keep this file in your homedir. You will use it again on the next assignment. [hint] Inside the script, don't forget the command line variable $1. Example:
head -52 $1
The .csv files consist of three parts: (1) a two-line header describing the fields; (2) 51 lines representing data for each state (plus Washington DC); (3) the rest of the file, which is summary information. The data fields for each state in the second part are comma separated.
I have to get the second section.
Thank you.
Take it in stages:
Read what head and tail each do (they get the first and last n lines of a file),
think about what you need (the middle 51 lines)
how can you do that?
Use head to extract the first 53 lines. Use tail to extract the last 51 lines of the result (effectively ignoring the first 2 header lines).
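In script form, extractdata can be as small as this (assuming the layout described in the question: a two-line header followed by 51 state lines):
#!/bin/sh
# print the data section: lines 3 through 53 of the file given as $1
head -n 53 "$1" | tail -n 51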
The problem I had was figuring out how to get the data from multiple .csv files; I used wildcards to solve it.
In case anyone else needs it, I ended up with something like this:
for f in /usr/local/tmp/election2008/*/*.csv; do head -n 53 "$f" | tail -n 51; done > polls.csv