Date validation in CSV file [closed] - csv

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 months ago.
I need to validate the DOB field in a CSV and remove invalid data from that field.
The expected DOB format is YYYY-MM-DD only.
Please see the source file below and the expected output. I'm expecting an AWK command to solve this.
name,dob
pater,2022-12-10
john,1900-10-23
cader,apr 10 12056
tina,2020-maple road
mike,2019-01-35
carl,2010-03-18 new york
anne,hi how are you?
I need to clean the 2nd column, the DOB field. Note: in some rows there is other text in the DOB field; for those occurrences I need to keep only the valid date and remove the other text (e.g. row 6).
Expected output
name,dob
pater,2022-12-10
john,1900-10-23
cader,
tina,
mike,
carl,2010-03-18
anne,

I was able to achieve this task by using the command below. Note that awk uses POSIX ERE, which has no PCRE-style (?:...) non-capturing groups, so a plain group is used for the day part, and NR==1 keeps the header line intact:
awk 'BEGIN{FS=OFS=","} NR==1{print;next} {$2=match($2,/[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])/)?substr($2,RSTART,RLENGTH):"";print}' input.csv > output.csv

Something like this might work, but note it only checks the shape of the date, not the ranges (so 2019-01-35 would slip through), and it truncates the field to ten characters to drop trailing text like "new york":
awk 'BEGIN{FS=OFS=","} NR==1{print;next} { $2 = ($2 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}/) ? substr($2,1,10) : "" } 1' input.csv > output.csv


Delete CSV row for matching record in awk [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
My csv file looks like:
data,code,name
2020-02-24,069,AA
2020-02-24,066,BB
2020-02-24,068,CC
2020-02-24,067,DD
2020-02-24,979,Updating
I would like to delete the rows which have "Updating" in the name field.
So the output should be like:
data,code,name
2020-02-24,069,AA
2020-02-24,066,BB
2020-02-24,068,CC
2020-02-24,067,DD
Use this command and you will get the result you requested:
awk -F, '{ if ($3 != "Updating") print $0 }' test    # "test" is your file name
before:
cat test
data,code,name
2020-02-24,069,AA
2020-02-24,066,BB
2020-02-24,068,CC
2020-02-24,067,DD
2020-02-24,979,Updating
After:
data,code,name
2020-02-24,069,AA
2020-02-24,066,BB
2020-02-24,068,CC
2020-02-24,067,DD
So basically, awk looks for the lines whose third field is not "Updating" and prints them out.
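A note on style: awk prints the current line by default when a pattern evaluates true, so the same filter can be written without the explicit if/print. This is just a shorter equivalent, writing to a hypothetical result.csv:

```shell
awk -F, '$3 != "Updating"' test > result.csv
```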

Script that add double quotes to every column in csv file problem

I have a CSV file with 4 columns, e.g.:
1132,John Doe,johndoe#gmail.com,3534534543
53213,John Doe,johndoe#test.com,51352363126
I want to add double quotes around every value, so I use this script on macOS:
sed 's/[^,]*/"&"/g' file.csv > file2.csv
I receive
"1132","John Doe","johndoe#gmail.com","3534534543
"
"53213","John Doe","johndoe#test.com","51352363126
"
So I get the closing quotes on new rows; most probably I should remove the \r\n somehow, but I tried and couldn't. Any ideas? It happens with the files I receive; if I fill in the values manually, it works as expected.
Could you please try the following.
awk 'BEGIN{FS=",";RS="\r\n";s1="\"";OFS="\",\""} {$1=$1;$0=s1 $0 s1} 1' Input_file
In case you want to leave empty lines as they are, then try the following.
awk 'BEGIN{FS=",";RS="\r\n";s1="\"";OFS="\",\""} NF{$1=$1;$0=s1 $0 s1} 1' Input_file
As you suspected, it is possible that the file you received has different control characters at the end of the line.
One easy fix is to exclude control characters, as well as comma, from matching. That is, instead of searching for [^,]*, you can search for [^,[:cntrl:]]*.
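Concretely, the sed command from the question would then become (a sketch, assuming the stray trailing characters are control characters such as a carriage return):

```shell
sed 's/[^,[:cntrl:]]*/"&"/g' file.csv > file2.csv
```

The quoting behavior on ordinary lines is unchanged; the only difference is that a trailing CR is no longer swallowed into the last quoted field.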
I'd use a proper CSV parser on CSV data. Ruby ships with one, so you can write
ruby -rcsv -e '
csv_in = CSV.new(STDIN)
csv_out = CSV.new(STDOUT, force_quotes: true)
csv_in.each {|row| csv_out << row}
' < file.csv

Get rid of last (\n) symbol [duplicate]

This question already has answers here:
Why does my tool output overwrite itself and how do I fix it?
(3 answers)
Closed 4 years ago.
I am trying to parse a csv file, where input is enclosed in '"' and separated by comma ',' with the below code:
split($0,data,",")
print "\""data[1]"\",\""data[7]"\",\""data[2]
It should take the columns separately and perform operations on them if needed, so please don't advise printing the line as-is ;)
The problem is the last column: it's grabbed together with the line-ending character, so the next print overwrites my current line. Initial file:
"00:00:00","87100","2381","",""," ","13"
"00:00:01","56270","0098","",""," ","37"
"00:00:01","86917","0942","",""," ","12"
so instead of this:
"00:00:00","13","87100"
"00:00:01","37","56270"
"00:00:01","12","86917"
I'm getting this:
","87100"
","87100"
","87100"
The printed fields are being overwritten. I removed the last column from the print list, and it worked fine. Also, I can't just add commas after the last column; that is too much. Any other advice on the code?
Rather than splitting each line yourself, you should specify the field separator as ',' (using -F). Then it's much simpler to print each field (still quote-enclosed), and you can still access the entire line as $0. Since your file evidently has Windows line endings, also strip the trailing carriage return so the last field doesn't drag a \r into the output:
awk -F',' '{sub(/\r$/,""); print $1","$7","$2}' csv_file

How can i join various CSV files into only one in same row comma separated? [duplicate]

This question already has answers here:
Paste side by side multiple files by numerical order
(3 answers)
Closed 6 years ago.
I have more or less 100 CSV files and I want to join all of them into one file.csv, side by side in the same row, ordered by modification date.
Currently I use paste -d, *.csv > out.csv, but the files are named like this:
sample_1
And it orders the content like this:
sample_100
sample_101
sample_102
...
sample_10
sample_110
sample_111
...
The desired order is:
sample_1
sample_2
sample_3
...
sample_100
The solution could be to order by modification date, but I don't know how;
maybe something like ls -latr | paste -d, *.csv > out.csv
Thanks!
paste -d, $(ls -t *.csv) > out.csv
A single paste invocation joins all the files side by side; ls -t lists them ordered by modification time (newest first).
Caveats: the command substitution word-splits, so this breaks on filenames containing whitespace, and on a second run out.csv itself matches *.csv, so delete it before re-running.
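If the order you actually want is the natural numeric one shown in the question (sample_1, sample_2, ..., sample_100), GNU ls has a version-sort flag for exactly that. A sketch assuming GNU coreutils and whitespace-free filenames:

```shell
# -v sorts sample_2 before sample_10 (version/natural sort).
paste -d, $(ls -v sample_*.csv) > out.csv
```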

Delete lines if string does not appear exactly thrice

I have a CSV file that contains a few thousand lines. It looks something like this:
abc,123,hello,world
abc,124,goodbye,turtles
def,100,apples,pears
....
I want each unique entry in column one to be repeated exactly three times. For example: If exactly three lines have "abc" in the first column that is fine and nothing happens. But if there is not exactly three lines with "abc" in the first column, all lines with "abc" in column 1 must be deleted.
This
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
def,100,apples,pears
def,10,foo,bar
ghi,2,one,two
ghi,6,three,four
ghi,4,five,six
ghi,9,seven,eight
Should become:
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
Many Thanks,
Awk way
awk -F, 'FNR==NR{a[$1]++;next}a[$1]==3' test{,}
Set the field separator to ,
While reading the first pass of the file (FNR==NR), increment the array counter keyed on field 1
then skip to the next instruction
On the second pass, print the line if its first-field counter is exactly 3
(test{,} is shell brace expansion for "test test", so the same file is read twice)
this awk one-liner should do:
awk -F, 'NR==FNR{a[$1]++;next}a[$1]==3' file file
it doesn't require your file to be sorted.
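If reading the input twice is a problem (for example, it arrives on a pipe), a single-pass variant can buffer the lines per key and emit qualifying groups at the end. Note the for (k in cnt) loop does not guarantee the original group order, so this is a sketch for when ordering doesn't matter:

```shell
awk -F, '{cnt[$1]++; buf[$1] = buf[$1] $0 ORS}
END{for (k in cnt) if (cnt[k] == 3) printf "%s", buf[k]}' file
```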