I have a CSV and in one of the columns I have fields like
1. ABD_1&SC;1233;5665;123445
2. 120585_AOP9_AIDMS3&SC;0947;64820;0173
I need to replace this column with
1. ABD_1
2. AOP9_AIDMS3
Essentially, I need everything from the first alphabetic character (the substring will never start with a digit) up to the &. I thought I could use a
regex [a-zA-Z].+?(?=\&)
and awk to extract the column and replace it, but this is proving beyond my beginner skill set. Iterating over the string in some loop and writing some bash to parse it out is impractical, as the file has some 20 million+ entries.
Can anyone help?
First step, assuming you have only one column in your csv (to understand the complete solution below):
One column
You can use this regex:
sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' test.csv
Explanations:
-r: use extended regular expressions (avoid parenthesis and plus + symbol escaping)
^[^a-zA-Z]*: skip any non-alpha characters at the beginning, ...
([a-zA-Z]+[^&;]+) ... then captures at least one alpha character followed by a sequence of any character except ampersand & and semi-colon ; ...
.*$ ... and skip any remaining characters (if any, they must begin with either an ampersand or a semi-colon, since sed pattern matching is greedy, i.e. it tries to match the longest sequence) until the end of line ...
\1 ... and replace the whole matched text (the line since the regex covers it) by the captured sequence.
Working example:
$ sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' << 'EOF'
> ABD_1&SC;1233;5665;123445
> 120585_AOP9_AIDMS3&SC;0947;64820;0173
> EOF
ABD_1
AOP9_AIDMS3
Multiple columns:
It looks like you want to process a specific column. If you want to process the n-th column, you can use this regex, which is based on the previous one:
sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/'
^(([^,]+,){<n-1>}) captures the first (n-1) columns; replace <n-1> with the real value (0 for the first column works too), and then...
[^a-zA-Z]*([a-zA-Z]+[^&;,]+) captures at least one alpha character followed by a sequence of any character except ampersand &, semi-colon ; or a comma, then ...
[^,]* ... skip any remaining characters which are not a comma ...
(.*)$ ... and captures the remaining columns, i.e. the rest of the line; since any non-comma character was already skipped, if this sequence exists it must begin with a comma; finally ...
\1\3\4 ... replace the whole matched text (the line, since the regex covers it) with the following captured sequences:
\1 : the first (n-1) columns (\2 is contained inside)
\3 : the text we want to keep from the n-th column
\4 : remaining columns if any
Working example (it processes the third column):
$ sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/' << 'EOF'
plaf,plafy,ABD_1&SC;1233;5665;123445,plet
trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot
EOF
plaf,plafy,ABD_1,plet
trouf,troufi,AOP9_AIDMS3,plot
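If awk is already in your toolchain, the same third-column cleanup can be sketched there too (this is an alternative sketch I'm adding, not part of the sed answer above):

```shell
# Hypothetical awk alternative: clean the 3rd field in place.
# The first sub() strips leading non-alpha characters (digits, underscores);
# the second sub() drops everything from the first & or ; onward.
printf '%s\n' 'plaf,plafy,ABD_1&SC;1233;5665;123445,plet' \
              'trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot' |
  awk 'BEGIN{FS=OFS=","} { sub(/^[^a-zA-Z]*/, "", $3); sub(/[&;].*/, "", $3); print }'
# plaf,plafy,ABD_1,plet
# trouf,troufi,AOP9_AIDMS3,plot
```

Field-wise substitution avoids regex anchoring games across commas, at the cost of a full field re-split per line.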
Related
I'm currently coding a CSV validator using awk. Here's an example of the code:
awk 'BEGIN{FS=OFS=","} NF!=17{print "not enough fields"; exit}
!($1~/[[:alnum:]]$/) {print "1st field invalid"; exit}' npp_test.cs
However, the alnum section won't accept both alphabetic and numeric characters.
So if the data is "t" the program will exit, and if the data is "1" the same thing happens. However, if it is "t1" it won't recognise it as valid.
How would I go about getting the code to accept a mix of alpha and numeric data?
Also, the top line isn't really relevant as it's just a field count. :)
If your environment does not support POSIX character classes in awk you may use explicit character ranges in bracket expressions:
!($1 ~ /^[A-Z0-9]{1,25}$/)
Here,
^ - matches the start of the string
[A-Z0-9]{1,25} - matches 1 to 25 uppercase letters or digits
$ - matches the end of the string.
NOTE: To avoid any issues with collations, you may add LANG=C before the awk command.
I have a CSV file that I would like to move the first letter to the end of the first string and insert an underscore in front of the last two characters. I can't find anything on how to move a letter over with sed. Here is my example CSV:
name,number,number1,status,mode
B9AT0582B41,430,30,0,Loop
B8AU0302D11,448,0,0,Loop
B8AU0302D21,448,0,0,Loop
B8AU0302D31,448,0,0,Loop
B8AU0302D41,448,0,0,Loop
For example, the B9AT0582B41, I want it to be 9AT0582B_41B.
It needs to do this for each line and not change the state of the other CSV values.
I am open to forms other than sed.
In awk:
$ awk -F, -v OFS=, \
'NR > 1 { $1 = substr($1, 2, 8) "_" substr($1, 10) substr($1, 1, 1) } 1' infile
name,number,number1,status,mode
9AT0582B_41B,430,30,0,Loop
8AU0302D_11B,448,0,0,Loop
8AU0302D_21B,448,0,0,Loop
8AU0302D_31B,448,0,0,Loop
8AU0302D_41B,448,0,0,Loop
This sets the input and output field separators to ,; then, for each line except the first one, it rearranges the first field (three calls to substr) and prints the line (the 1 at the end).
Or sed, a bit shorter:
sed -E '2,$s/^(.)([^,]*)([^,]{2})/\2_\3\1/' infile
This captures the first letter of each line (for lines 2 and up) in capture group 1, then everything up to two characters before the first comma in capture group 2 and the last two characters before the comma in capture group 3. The substitution then swaps and adds the underscore.
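A quick run of that command over the question's sample (header left untouched, as required):

```shell
# The 2,$ address range skips the header line; the substitution rebuilds
# the first field as <middle eight>_<last two><first letter>.
printf '%s\n' 'name,number,number1,status,mode' 'B9AT0582B41,430,30,0,Loop' |
  sed -E '2,$s/^(.)([^,]*)([^,]{2})/\2_\3\1/'
# name,number,number1,status,mode
# 9AT0582B_41B,430,30,0,Loop
```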
Here's my take on this.
$ sed -E 's/(.)(.{8})([^,]*)(.*)/\2_\3\1\4/' <<<"B9AT0582B41,430,30,0,Loop"
9AT0582B_41B,430,30,0,Loop
This uses an extended regular expression to make things easier to read. Sed's -E option causes the RE to be interpreted in extended notation. If your version of sed doesn't support this, check your man page to see if there's another option that does the same thing, or you can try to use BRE notation:
$ sed 's/\(.\)\(.\{8\}\)\([^,]*\)\(.*\)/\2_\3\1\4/' <<<"B9AT0582B41,430,30,0,Loop"
9AT0582B_41B,430,30,0,Loop
I have a CSV file that I'm working to manipulate using sed. What I'm doing is inserting the current YYYY-MM-DD HH:MM:SS into the 5th field after the IP Address. As you can see below, each value is enclosed by double quotes and each CSV column is separated by a comma.
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
Using the command: sed 'N;s/","/","YYYY-MM-DD HH:MM:SS","/5' FILENAME
I am adding the date after the 5th field. Normally this works, but
certain values in the CSV file often throw off the count, so the date lands in the wrong field. To remedy this issue, how can I not only add the date after the 5th field, but also make sure the 5th field is an IP address?
The final output should be:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
Please respond with how this is done using sed, not awk. And how can I make sure the 5th field is actually an IP address before the date is added in?
This answer currently assumes that the CSV file is beautifully consistent and simple (as in the sample data), so that:
Fields always have double quotes around them.
There are never fields like "…""…" to indicate a double quote embedded in the string.
There are never fields with commas in between the quotes ("this,that").
Given those pre-requisites, this sed script does the job:
sed 's/^\("[^"]*",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Let's split that search pattern into pieces:
^\("[^"]*",\)\{4\}
Match start of line followed by: 4 repeats of a double quote, a sequence of zero or more non-double-quotes, a double quote and a comma.
In other words, this identifies the first four fields.
"\([0-9]\{1,3\}\.\)\{3\}
Match a double quote, then 3 repeats of 1-3 decimal digits followed by a dot — the first three triplets of an IPv4 dotted-decimal address.
[0-9]\{1,3\}",
Match 1-3 decimal digits followed by a double quote and a comma — the last triplet of an IPv4 dotted-decimal address plus the end of a field.
Clearly, for each idiosyncrasy of CSV files that you also need to deal with, you have to modify the regular expressions. That's not trivial.
Using extended regular expressions (enabled by -E on both GNU and BSD sed), you could write:
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"YYYY-MM-DD HH:MM:SS",/'
The pattern to recognize the first 4 fields is more complex than before. It matches 4 repeats of: double quote, zero or more occurrences of { zero or more non-double-quotes followed by two double quotes } followed by zero or more non-double-quotes followed by a double quote and a comma.
You can also write that in classic sed (basic regular expressions) with a liberal sprinkling of backslashes:
sed 's/^\("\(\([^"]*""\)*[^"]*\)",\)\{4\}"\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}",/&"YYYY-MM-DD HH:MM:SS",/'
Given the data file:
"12345","","","None","192.168.2.1","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first script shown produces the output:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","zzz","011"
The first two lines are correctly mapped; the third is correctly unchanged, but the last two should have been mapped and were not.
The second and third commands produce:
"12345","","","None","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"67890","ABC-1234-5678","9.9","Low","192.168.2.1","YYYY-MM-DD HH:MM:SS","qqq","000"
"23456","Quaternions","2.3","Pisces","Heredotus","qqq","000"
"34567","Commas, oh commas!","3.14159","""Quotes"" quoth he","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
"45678","Commas, oh commas!","3.14159","""Quote me"",""or not""","192.168.99.37","YYYY-MM-DD HH:MM:SS","zzz","011"
Note that Heredotus is not modified (correctly), and the last two lines get the date string added after the IP address (also correctly).
Those last regular expressions are not for the faint-of-heart.
Clearly, if you want to insist that the IP addresses only match numbers in the range 0..255 in each component, with no leading 0, then you have to beef up the IP address matching portion of the regular expression. It can be done; it is not pretty. It is easiest to do it with extended regular expressions:
([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])
You'd use that unit in place of each [0-9]{1,3} unit in the regexes shown before.
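As a sketch of how that drops in (using the simple no-embedded-quotes field pattern, with "STAMP" standing in for the date string):

```shell
# oct is the strict 0-255 alternation; a bogus address like 999.1.1.1
# must not be stamped, while 192.168.2.1 is.
oct='([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'
printf '%s\n' '"a","b","c","d","192.168.2.1","x"' \
              '"a","b","c","d","999.1.1.1","x"' |
  sed -E 's/^("[^"]*",){4}"('"$oct"'\.){3}'"$oct"'",/&"STAMP",/'
# "a","b","c","d","192.168.2.1","STAMP","x"
# "a","b","c","d","999.1.1.1","x"
```

The single-quote/double-quote splicing around $oct is the same shell technique used for the date interpolation later in this answer.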
Note that this still does not attempt to deal with fields not surrounded by double quotes.
It also does not determine the value to substitute from the date command. That is doable with (if not elementary then) routine shell scripting carefully managing quotes:
dt=$(date +'%Y-%m-%d %H:%M:%S')
sed -E 's/^("(([^"]*"")*[^"]*)",){4}"([0-9]{1,3}\.){3}[0-9]{1,3}",/&"'"$dt"'",/'
The '…"'"$dt"'",/' sequence is part of what starts out as a single-quoted string. The first double quote is simple data in the string; the next single quote ends the quoting, the "$dt" interpolates the value from date inside shell double quotes (so the space doesn't cause any trouble), then the single quote resumes the single-quoted notation, adding another double quote, a comma and a slash before the string (argument to sed) is terminated.
Try:
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{$5=$5 FS date1 " " date2} 1' OFS=, Input_file
Also, if you want to edit Input_file in place, you could redirect the above command's output to a temp file and then rename it (with the mv command) back to Input_file.
Adding a non-one-liner (expanded) form of the solution too now.
awk -vdate1=$(date +"%Y-%m-%d") -vdate2=$(date +"%H:%M:%S") -F, '
$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{
$5=$5 FS date1 " " date2
}
1
' OFS=, Input_file
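One behavioural detail worth checking before relying on this: the awk variant appends the timestamp as an unquoted field, unlike the quoted "YYYY-MM-DD HH:MM:SS" in the requested output. Pinning the two date variables makes that visible:

```shell
# date1/date2 pinned (D and T) so the output is deterministic; note the
# stamped field, D T, carries no surrounding double quotes.
printf '%s\n' '"a","b","c","d","192.168.2.1","q"' |
  awk -v date1=D -v date2=T -F, \
    '$5 ~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{$5=$5 FS date1 " " date2} 1' OFS=,
# "a","b","c","d","192.168.2.1",D T,"q"
```

If quotes are required around the new field, use $5=$5 FS "\"" date1 " " date2 "\"" instead.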
Looking to compare two CSV files. Suppose the field separator is $, each record has two fields, and the file can be formatted something like:
a$simple line$
b$run-on-
line$
c$simple line$
Is there some switch or variety of Unix diff command that will let me run the comparison where the record separator (line separator) is the $ sign immediately followed by a new line?
Ideally I want to be guaranteed that diff outputs the entire record when any change is detected.
With the default behavior, I could potentially get a partial record as diff output (in scenarios where the record runs over several lines).
Is there some smarter way to do this that I'm not considering?
--
Edited to add: sample of expected output
If I compared the CSV file above with:
a$simple line$
b$run-on-changed-
line$
c$simple line$
... I would want to see the entire record b reported as a difference. Something like
2c2
< b$run-on-\nline$
---
> b$run-on-changed-\nline$
Peter, there is no direct support for a custom line separator in GNU diff: http://man7.org/linux/man-pages/man1/diff.1.html (GNU diffutils)
You may try using sed twice: one sed pass to convert your format to one record per line for diffing; then diff the converted files; then a second sed pass to convert back to the multiline record format.
The first sed pass keeps each $\n (end of record) as a real newline, and converts every \n without a $ before it to some unique special sequence, like #%#$%#$%#$#.
Then do the diff.
The second sed pass converts #%#$%#$%#$# back to \n (or to \\n for easier viewing of the diff output).
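A minimal sketch of that round trip under the question's format, using GNU sed (the @@NL@@ placeholder is an arbitrary choice assumed absent from the data; blank lines inside records are not handled):

```shell
# flatten: slurp the whole file into the pattern space, then turn every
# newline NOT preceded by a record-ending $ into the placeholder,
# giving one record per physical line.
flatten() { sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\([^$]\)\n/\1@@NL@@/g' "$1"; }

printf 'a$simple line$\nb$run-on-\nline$\n'         > old.csv
printf 'a$simple line$\nb$run-on-changed-\nline$\n' > new.csv

flatten old.csv > old.flat
flatten new.csv > new.flat
diff old.flat new.flat    # the whole of record b shows up as one changed line
# 2c2
# < b$run-on-@@NL@@line$
# ---
# > b$run-on-changed-@@NL@@line$
```

A final sed 's/@@NL@@/\\n/g' over the diff output restores readable record contents.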
There are diff variants which support working with csv. Some of them may handle csv with line breaks inside fields:
https://pypi.python.org/pypi/csvdiff (python)
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
https://github.com/agardiner/csv-diff (ruby)
Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.
http://csvdiff.sourceforge.net/ (perl)
csvdiff is a perl script to compare/diff two (comma) separated files with each other. The difference from standard diff is that you get the number of the record where the difference occurs and the field/column which differs. The separator can be set to any value you want, not just a comma. You can also provide a third file which contains the column names in one(!) line, separated by your separator.
I have the following CSV where I have to replace the thousands comma separator with nothing. In the example below, where the amount is "1,000.00", I should have 1000.00 (no comma, no quotes) instead.
I use JREPL to remove header from my csv
jrepl "(?:.*\n){1,1}([\s\S]*)" "$1" /m /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv")
I was wondering if I could do both the removal of the header and the handling of the thousands comma in one step.
I am also open to the option of doing it with another command in a second step...
Tnx ID,Trace ID - Gateway,Profile,Customer PIN,Customer,Ext. ID,Identifier,Amount,Chrg,Curr,Processor,Type,Status,Created By,Date Created,RejectReason
1102845,3962708,SL,John,Mohammad Alo,NA,455015*****9998,900.00,900.00,$,Un,Credit Card,Rejected,Internet,2016-05-16 06:54:10,"-330: Fail by bank, try again later(refer to acquirer)"
1102844,3962707,SL,John,Mohammad Alo,NA,455015*****9998,"1,000.00","1,000.00",$,Un,Credit Card,Rejected,Internet,2016-05-16 06:52:26,"-330: Fail by bank, try again later(refer to acquirer)"
Yes, there is a very efficient and fairly compact and straight-forward solution:
jrepl "\q(\d{1,3}(?:,\d{3})*(?:\.\d*)*)\q" "$1.replace(/,/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
The /JENDLN JScript expression strips the header line by setting $txt to false if it is the first line.
The search string matches any quoted number that contains commas as thousand separators, and $1 is the number without the quotes.
The replace string is a JScript expression that replaces all commas in the matching $1 number with nothing.
EDIT
Note that the above will work with almost any CSV you are likely to have. However, it would fail if a quoted field contains a quoted number string literal. Something like the following would yield a corrupted CSV with the code above:
...,"some text ""123,456.78"" more text",...
This issue can be fixed with a bit more regex code. You only want to modify a quoted number if the opening quote is preceded by a comma or the beginning of the line, and the closing quote should be followed by a comma or the end of line.
A look-ahead assertion can be used for the trailing comma/EOL, but JREPL does not support look-behind, so the leading comma/BOL must be captured and preserved in the replacement:
jrepl "(^|,)\q(\d{1,3}(?:,\d{3})*(?:\.\d*)*)\q(?=$|,)" "$1+$2.replace(/,/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
EDIT in response to changing requirement in comment
The following will simply remove all quotes and commas from quoted CSV fields. I don't like this concept, and I suspect there is a much better way to handle this for import into mysql, but this is what the OP is asking for.
jrepl "(^|,)(\q(?:[^\q]|\q\q)*\q)(?=$|,)" "$1+$2.replace(/,|\x22/g,'')" /x /j /jendln "if (ln==1) $txt=false" /f "csv/Transactions.csv" /o "csv/Transactionsfeed.csv"
May I suggest a different, simpler solution? The five-line Batch file below does what you want; save it with a .bat extension:
@set @a=0 /*
@cscript //nologo //E:JScript "%~F0" < "csv/Transactions.csv" > "csv/Transactionsfeed.csv"
@goto :EOF */
WScript.Stdin.ReadLine();
WScript.Stdout.Write(WScript.Stdin.ReadAll().replace(/(\"(\d{1,3}),(\d{3}\.\d{2})\")/g,"$2$3"));
JREPL.BAT is a large and complex program capable of advanced replacement tasks; however, your request is very simple. This code is also a Batch-JScript hybrid script that uses the replace method in the same way as JREPL.BAT, but is tailored to your specific request.
The first ReadLine() reads the header line of the input file, so the subsequent ReadAll() reads and processes the rest of the lines.
The regexp (\"(\d{1,3}),(\d{3}\.\d{2})\") defines 3 submatches enclosed in parentheses: the first one is the whole number enclosed in quotes, like "1,000.00"; the second submatch is the digits before the comma and the third submatch is the digits after the comma, including the decimal point.
The .replace method changes the text matched by the regexp, that is, the whole number enclosed in quotes, to just the second and third submatches.
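For comparison, the same header-strip plus de-comma pass can be sketched on a Unix shell with sed (my translation of that JScript regexp to BRE; like the original, it handles at most one thousands group):

```shell
# 1d drops the header line; the substitution rewrites quoted
# "d{1,3},d{3}.d{2}" amounts to bare numbers. Amounts with two comma
# groups (over 999,999.99) are NOT handled, matching the JScript version.
printf '%s\n' 'Header line' 'x,"1,000.00",y' |
  sed -e '1d' -e 's/"\([0-9]\{1,3\}\),\([0-9]\{3\}\.[0-9]\{2\}\)"/\1\2/g'
# x,1000.00,y
```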