Adding a prefix to the header row in a csv file with awk

I am currently working with datasets collected in large CSV files (over 1600 columns and 100 rows). Excel or LibreOffice Calc can't easily handle files this wide, so I can't simply concatenate a prefix or suffix onto each header cell the way I would on a smaller dataset.
Researching the topic, I was able to come up with the following command:
awk 'BEGIN { FS=OFS="," } {if(NR==1){print "prefix_"$0}; if(NR>1){print; next}}' input.csv >output.csv
Unfortunately, this only adds the prefix to the first cell, because "prefix_"$0 prepends the string to the whole record rather than to each field. For example:
Input:
head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
Expected Output:
prefix_head_1,prefix_head_2,prefix_head_3,[...],prefix_head_n
"value_1","value_2","value_3",[...],"value_n"
Real Output:
prefix_head_1,head_2,head_3,[...],head_n
"value_1","value_2","value_3",[...],"value_n"
As the number of columns varies across my csv files, I would like a solution that doesn't require enumerating every column, as the solutions I found elsewhere do.
This matters because the next step is to combine several (5 or 6) large csv files into a single csv database by combining all their columns (the rows refer to the same instances, in the same order, across all files).
Thanks in advance for your time and help.

awk 'BEGIN{FS=OFS=","} NR==1{for (i=1;i<=NF;i++) $i="prefix_"$i} 1' file
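The loop rewrites each header field in place, and the trailing 1 is awk shorthand for "print every record". As for the follow-up step of combining the files column-wise, paste may be all you need, since the rows are already aligned across files. A minimal sketch, using hypothetical names file1.csv and file2.csv and assuming every file really does have the same rows in the same order:

paste -d, file1.csv file2.csv > combined.csv

paste -d, glues line N of each input file together with a comma, which is exactly a column-wise concatenation for line-per-record csv files (it will not cope with fields containing embedded newlines).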

Related

Splitting a CSV file by the value of a specific column

I have multiple CSV files that I need to split into 67 separate files each. Each file has over a million rows and dozens of columns. One of the columns is called "Code" and it ranges from 1 to 67, which is what I have to base the split on. I have been doing this split manually by selecting all of the rows for each value (1, 2, 3, etc.), pasting them into their own CSV file and saving them, but this is taking way too long. I usually use ArcGIS to create some kind of batch file split, but I am not having much luck in doing so this go around. Any tips or tricks would be greatly appreciated!
If you have access to awk, there's a good way to do this.
Assuming your file looks like this:
Code,a,b,c
1,x,x,x
2,x,x,x
3,x,x,x
You want a command like this:
awk -F, 'NR > 1 {print $0 >> ("code" $1 ".csv")}' data.csv
That will save each row to a file like code1.csv, skipping the header line. (The parentheses around the file name expression avoid an ambiguity that trips up some awk implementations.)
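If each split file should also carry the header row, a small extension of the same idea works. A sketch, assuming the same data.csv layout as above:

awk -F, '
    NR == 1 { hdr = $0; next }                            # remember the header line
    {
        f = "code" $1 ".csv"
        if (!(f in seen)) { print hdr > f; seen[f] = 1 }  # write the header once per output file
        print > f                                         # then the data row
    }' data.csv

Note that > inside awk truncates each output file only on its first write during the run, so rerunning the command overwrites rather than appends (the >> in the answer above appends across reruns). With 67 output files, gawk happily keeps them all open; some other awk implementations limit open files and would need close() calls.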

Line feed within a column in csv

I have a csv like the one below. Some of the columns contain line breaks, like column B below. When I do wc -l file.csv, unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as-is. What should I do so that unix considers a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, packages like csvkit provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
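Since the stated goal is loading the data into a database, it is also worth noting csvkit's csvsql, which parses the file properly (embedded newlines included) and can insert the records for you. A sketch assuming a local SQLite target, purely for illustration (SQL*Loader implies Oracle, which this does not cover):

$ csvsql --db sqlite:///test.db --insert foo.csv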

Create libsvm from multiple csv files for xgboost external memory training

I am trying to train an xgboost model using its external memory version, which takes a libsvm file as the training set. Right now all the data is stored in a bunch of csv files which, combined, are way larger than the memory I have, say 70G (you can easily read any single one of them). I just wonder how to create one large libsvm file for xgboost, or whether there is any other workaround for this. Thank you.
If your csv files do not have headers, you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers, you'll want to do something slightly trickier, like keeping one copy of the header and taking the last n-1 lines of each file with tail, as sketched below.
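For example, using the file names from above (this assumes every file carries an identical header row):

> head -n 1 file1.csv > combined.csv
> tail -q -n +2 file1.csv file2.csv >> combined.csv

tail -n +2 prints each file from its second line onward, and -q suppresses the ==> filename <== banners tail would otherwise insert between multiple input files.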
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.
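phraug lives on GitHub (github.com/zygmuntz/phraug) and includes a csv2libsvm.py script. If memory serves, it takes the input file, output file, label column index and a skip-headers flag, roughly as below, but treat that exact signature as an assumption and check the script itself:

> python csv2libsvm.py combined.csv combined.libsvm 0 1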

Sorting and separating data in UNIX shell

I am relatively new to the unix shell, and I am having trouble with a .csv file. My aim is to create a new file that has all of the same data but sorted. I have achieved this to an extent with the command
sort -t, datafile.csv > newdatafile.csv
However, I seem to lose some lines: the original file has 271116 lines and the new sorted file has 33889. Why have some lines been thrown away?
I would also like to know how I can take the first 100 lines of a csv file and create a new file with just those 100 lines.
Thanks
To print only the first 100 lines of a file, use the head command:
head -n 100 datafile.csv > newdatafile.csv
By default head prints the first 10 lines.
Use -n N to print more or fewer lines.
awk to the rescue:
Assuming it's a .csv file, then:
awk -F"," 'NR == 1,NR == 100 {print $0 > "newdatafile.csv"}' datafile.csv
It'll save the first 100 lines of the file into a new file named newdatafile.csv.
Hope this helps.
You are very close on the sorting requirement of your question.
Use the command below, for example, to sort on column 3 of your csv:
sort -t, -k3 datafile.csv > newdatafile.csv
This will sort the file on column 3 alphabetically.
For numeric columns, use the command below to sort ascending:
sort -t, -nk3 datafile.csv > newdatafile.csv
To sort descending numerically:
sort -t, -nrk3 datafile.csv > newdatafile.csv
Also, to get the first 100 rows from the sorted output, use:
sort -t, -k3 datafile.csv | head -100 > newdatafile.csv
This sorts datafile.csv alphabetically on column 3, then writes the first 100 rows into newdatafile.csv.
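One caveat with a real csv: sort treats the header line as just another row and will move it into the sorted output. A common workaround is to sort everything except the first line; a sketch using the file names from the question:

( head -n 1 datafile.csv && tail -n +2 datafile.csv | sort -t, -k3,3 ) > newdatafile.csv

Note that -k3,3 limits the sort key to exactly column 3, whereas a bare -k3 sorts from column 3 through the end of the line.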

Renaming CSV column headers and merging results with PowerShell

So I'm just starting out with this whole PowerShell thing and so far so good - until now. I just can't figure out how to do this!
I'm looking at manipulating CSV files which are output from one system (whose output I can't change), renaming some column headers and merging a couple of the results into one column so that the file matches the input requirements for uploading into another system (again, I can't change those parameters).
So, as an example.
The first file is created:
File1.csv
"A","B","C""1","2","3"
I want a powershell script that will output:
File2.csv
"X","Y""1","23"
So I can import it into another system.
I hope that all makes sense, and thanks in advance for any assistance.
I'm going to assume that your actual/desired formats of your files look like this:
"A","B","C"
"1","2","3"
"X","Y"
"1","23"
rather than having everything on one line. If that's correct, you can import File1.csv with Import-Csv, then rename and merge columns with calculated properties:
... | Select-Object @{n='X';e={$_.A}}, @{n='Y';e={$_.B + $_.C}} | ...
and write the result to File2.csv with Export-Csv.
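Put together, the whole pipeline would look something like this (an untested sketch; -NoTypeInformation stops Export-Csv from prepending its #TYPE line, and B and C concatenate as strings because Import-Csv yields string properties):

Import-Csv File1.csv |
    Select-Object @{n='X';e={$_.A}}, @{n='Y';e={$_.B + $_.C}} |
    Export-Csv File2.csv -NoTypeInformation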