CSV first letter of line moved to end of field

I have a CSV file that I would like to move the first letter to the end of the first string and insert an underscore in front of the last two characters. I can't find anything on how to move a letter over with sed. Here is my example CSV:
name,number,number1,status,mode
B9AT0582B41,430,30,0,Loop
B8AU0302D11,448,0,0,Loop
B8AU0302D21,448,0,0,Loop
B8AU0302D31,448,0,0,Loop
B8AU0302D41,448,0,0,Loop
For example, the B9AT0582B41, I want it to be 9AT0582B_41B.
It needs to do this for each line and not change the state of the other CSV values.
I am open to forms other than sed.

In awk:
$ awk -F, -v OFS=, \
'NR > 1 { $1 = substr($1, 2, 8) "_" substr($1, 10) substr($1, 1, 1) } 1' infile
name,number,number1,status,mode
9AT0582B_41B,430,30,0,Loop
8AU0302D_11B,448,0,0,Loop
8AU0302D_21B,448,0,0,Loop
8AU0302D_31B,448,0,0,Loop
8AU0302D_41B,448,0,0,Loop
This sets the input and output field separators to ,; then, for each line except the first, it rearranges the first field (three calls to substr) and prints the line (the 1 at the end).
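For completeness, the same rearrangement can be sketched in plain bash with substring expansion (assuming, as the awk version does, that the first field is always one letter, then eight characters, then two); the here-document stands in for infile:

```shell
#!/usr/bin/env bash
# Pass the header through, then rearrange field 1 of every data line.
{
  IFS= read -r header && printf '%s\n' "$header"
  while IFS=, read -r first rest; do
    # ${first:1:8} = chars 2-9, ${first:9} = last two, ${first:0:1} = leading letter
    printf '%s_%s%s,%s\n' "${first:1:8}" "${first:9}" "${first:0:1}" "$rest"
  done
} <<'EOF'
name,number,number1,status,mode
B9AT0582B41,430,30,0,Loop
B8AU0302D11,448,0,0,Loop
EOF
```

This is slower than awk on large files, but avoids any external tool.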
Or sed, a bit shorter:
sed -E '2,$s/^(.)([^,]*)([^,]{2})/\2_\3\1/' infile
This captures the first letter of each line (for lines 2 and up) in capture group 1, then everything up to two characters before the first comma in capture group 2 and the last two characters before the comma in capture group 3. The substitution then swaps and adds the underscore.

Here's my take on this.
$ sed -E 's/(.)(.{8})([^,]*)(.*)/\2_\3\1\4/' <<<"B9AT0582B41,430,30,0,Loop"
9AT0582B_41B,430,30,0,Loop
This uses an extended regular expression to make things easier to read. Sed's -E option causes the RE to be interpreted in extended notation. If your version of sed doesn't support this, check your man page to see if there's another option that does the same thing, or you can try to use BRE notation:
$ sed 's/\(.\)\(.\{8\}\)\([^,]*\)\(.*\)/\2_\3\1\4/' <<<"B9AT0582B41,430,30,0,Loop"
9AT0582B_41B,430,30,0,Loop

Related

Update a CSV file to drop the first number and insert a decimal place in a particular column

I need help to perform the following
My CSV file looks like this
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
What I need to do is generate a new csv file to be as follows
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
If I were to do this as a test
echo '036921' | awk -v range=1 '{print substr($0,range+1)}' | sed 's/.$/.&/'
I get
3692.1
Can anyone help me so I can incorporate that, (or anything similar) to change my CSV file?
Try
sed -E 's/,0*([0-9]*)([0-9]),/,\1.\2,/' myfile.csv
Using awk and with the conditions specified in the comment, you can use:
$ awk -F, '{ printf "%s,%06.1f,%s\n", $1, $2 / 10, $3 }' data
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
$
With the printf format string providing the commas, there's no need to set OFS (because OFS is not used by printf).
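The format string can be checked in isolation: dividing by 10 moves the last digit behind the decimal point, and the 0 flag with a field width of 6 restores the fixed width (including a leading zero where needed):

```shell
awk 'BEGIN { printf "%06.1f\n", 36921 / 10 }'   # 3692.1
awk 'BEGIN { printf "%06.1f\n", 1234  / 10 }'   # 0123.4  (zero-padded back to width 6)
```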
Assuming that values with leading zeros appear solely in the 2nd column, I would use GNU AWK for this task in the following way. Let file.txt content be
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
then
awk 'BEGIN{FS=",0?";OFS=","}{$2=gensub(/([0-9])$/, ".\\1", 1, $2);print}' file.txt
output
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
Explanation: I set the field separator (FS) to be , optionally followed by 0, so a leading zero is discarded as part of the separator. In the 2nd field I replace the last digit with . followed by that digit. Finally I print the changed line, using , as the separator.
(tested in gawk 4.2.1)
I wish to have 4 numbers (including zeros) and the last value (5th value) separated from the 4 values by a decimal point.
If I understand correctly, you don't need all the digits of that field, only the last five.
Using awk you can get the last five with the substr function and then print the field with the last digit separated from the previous 4 by a decimal point, using the sub() function:
awk -F',' -v OFS=',' '{$2 = substr($2, length($2) - 4); sub(/[[:digit:]]$/, ".&", $2); print}' file
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated

bash Replace column in csv with a substring of that column

I have a CSV and in one of the columns I have fields like
1. ABD_1&SC;1233;5665;123445
2. 120585_AOP9_AIDMS3&SC;0947;64820;0173
I need to replace this column with
1. ABD_1
2. AOP9_AIDMS3
Essentially from the first alphabetical character (the substring will never start with a numeric value) to the &. I thought I could use a
regex [a-zA-Z].+?(?=\&)
and awk to extract the column and replace it, but this is proving beyond my beginner skillset. Iterating over the string in some loop and writing some bash to parse it out is impractical as the file has some 20million+ entries.
Can anyone help?
First step, assuming you have only one column in your csv (to understand the complete solution below):
One column
You can use this regex:
sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' test.csv
Explanations:
-r: use extended regular expressions (avoid parenthesis and plus + symbol escaping)
^[^a-zA-Z]*: skip any non-alpha characters at the beginning, ...
([a-zA-Z]+[^&;]+) ... then captures at least one alpha character followed by a sequence of any character except ampersand & and semi-colon ; ...
.*$ ... and skip any remaining characters (if any, they must begin with either an ampersand or a semi-colon, since sed pattern matching is greedy, i.e. it tries to match the longest sequence) until the end of the line ...
\1 ... and replace the whole matched text (the line since the regex covers it) by the captured sequence.
Working example:
$ sed -r 's/^[^a-zA-Z]*([a-zA-Z]+[^&;]+).*$/\1/' << 'EOF'
> ABD_1&SC;1233;5665;123445
> 120585_AOP9_AIDMS3&SC;0947;64820;0173
> EOF
ABD_1
AOP9_AIDMS3
Multiple columns:
It looks like you want to process a specific column. If you want to process the n-th column, you can use this regex, which is based on the previous:
sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/'
^(([^,]+,){<n-1>}) captures the first (n-1) columns; replace <n-1> with the actual value (0 for the first column works too), and then...
[^a-zA-Z]*([a-zA-Z]+[^&;,]+) captures at least one alpha character followed by a sequence of any character except ampersand &, semi-colon ; or a comma, then ...
[^,]* ... skip any remaining characters which are not a comma ...
(.*)$ ... and captures the remaining columns, basically the rest of the line; since any non-comma character was already skipped before, if this sequence is nonempty it must begin with a comma; finally ...
\1\3\4 ... replaces the whole matched text (the line, since the regex covers it) with the following captured sequences:
\1 : the (n-1)th columns (\2 is inside)
\3 : the text we want to keep from the n-th column
\4 : remaining columns if any
Working example (it processes the third column):
$ sed -r 's/^(([^,]+,){2})[^a-zA-Z]*([a-zA-Z]+[^&;,]+)[^,]*(.*)$/\1\3\4/' << 'EOF'
plaf,plafy,ABD_1&SC;1233;5665;123445,plet
trouf,troufi,120585_AOP9_AIDMS3&SC;0947;64820;0173,plot
EOF
plaf,plafy,ABD_1,plet
trouf,troufi,AOP9_AIDMS3,plot

How to reformat file with sed/vim?

I have a .csv file that looks like this.
atomnum,atominfo,metric
238,A-30-CYS-SG,53.7723
889,A-115-CYS-SG,46.2914
724,A-94-CYS-SG,44.6405
48,A-6-CYS-SG,37.2108
630,A-80-CYS-SG,29.574
513,A-64-CYS-SG,23.1925
981,A-127-CYS-SG,19.8903
325,A-41-GLN-OE1,17.6205
601,A-76-CYS-SG,17.5079
I want to change it like this:
atomnum,atominfo,metric
238,C30-SG,53.7723
889,C115-SG,46.2914
724,C94-SG,44.6405
48,C6-SG,37.2108
630,C80-SG,29.574
513,C64-SG,23.1925
981,C127-SG,19.8903
325,Q41-OE1,17.6205
601,C76-SG,17.5079
The part between the commas is an atom identifier: where A-30-CYS-SG is the gamma sulfur of the residue 30, which is a cysteine, in chain A. Residues can be represented with three letters or just one (Table here https://www.iupac.org/publications/pac-2007/1972/pdf/3104x0639.pdf). Basically, I just want to a) change the three letter code to the one letter code, b) remove the chain id (A in this case) and c) put the residue number next to the one letter code.
I've tried matching the patterns between the commas within vim. Something like :%s:\(-\d\+\-\)\(\u\+\):\2\1:g gives me c), i.e. (ACYS-30--SG). I do not know how to do a) with vim. I know how to do it with sed and an input file with all the substitute commands in it. But then maybe it is better to do all the work with sed... I am asking whether it is possible to do a) in vim?
Thanks
This might work for you (GNU sed):
sed -r '1b;s/$/\n:ALAA:ARGR:ASNN:ASPD:CYSC:GLUE:GLNQ:GLYG:HISH:ILEI:LEUL:LYSK:METM:PHEF:PROP:SERS:THRT:TRPW:TYRY:VALV/;s/,A-([0-9]+)-(...)(.*)\n.*:\2(.).*/,\4\1\3/' file
Append a lookup table to each line and use pattern matching to substitute a 3 letter code (and integer value) for a 1 letter code. The lookup key is a colon, followed by the 3 letter key, followed by the 1 letter code.
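The same lookup idea can be sketched with an awk associative array, which may be easier to maintain than the in-pattern table (the here-document stands in for the real file; the NR > 1 guard skips the header):

```shell
awk -F, -v OFS=, '
BEGIN {
  # Build a 3-letter -> 1-letter map from a compact "KEY:VALUE" list.
  n = split("ALA:A ARG:R ASN:N ASP:D CYS:C GLU:E GLN:Q GLY:G HIS:H ILE:I LEU:L LYS:K MET:M PHE:F PRO:P SER:S THR:T TRP:W TYR:Y VAL:V", pairs, " ")
  for (i = 1; i <= n; i++) { split(pairs[i], kv, ":"); code[kv[1]] = kv[2] }
}
NR > 1 {
  split($2, p, "-")   # p[1]=chain, p[2]=residue number, p[3]=3-letter code, p[4]=atom
  $2 = code[p[3]] p[2] "-" p[4]
}
1' <<'EOF'
atomnum,atominfo,metric
238,A-30-CYS-SG,53.7723
325,A-41-GLN-OE1,17.6205
EOF
```

This assumes the second field always has the four dash-separated parts shown in the question.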
Using sed, paste, cut, & bash, given input atoms.csv:
paste -d, <(cut -d, -f1 atoms.csv) \
<(cut -d, -f2 atoms.csv | sed 's/.-//
s/\(.*\)-\([A-Z]\{3\}\)-/\2\1-/
s/^ALA/A/
s/^ARG/R/
s/^ASN/N/
s/^ASP/D/
s/^CYS/C/
s/^GLU/E/
s/^GLN/Q/
s/^GLY/G/
s/^HIS/H/
s/^ILE/I/
s/^LEU/L/
s/^LYS/K/
s/^MET/M/
s/^PHE/F/
s/^PRO/P/
s/^SER/S/
s/^THR/T/
s/^TRP/W/
s/^TYR/Y/
s/^VAL/V/') \
<(cut -d, -f3 atoms.csv)
Output:
atomnum,atominfo,metric
238,C30-SG,53.7723
889,C115-SG,46.2914
724,C94-SG,44.6405
48,C6-SG,37.2108
630,C80-SG,29.574
513,C64-SG,23.1925
981,C127-SG,19.8903
325,Q41-OE1,17.6205
601,C76-SG,17.5079
If you know how to do it in sed why not leverage that knowledge and simply call out from Vim?
:%!sed -e '<your sed script>'
Once you've done that and it works, you can pop it in a Vim function.
function Transform()
your sed command
endfunction
and then just use
:call Transform()
which you can map to a key.
Simples!

how do remove carriage returns in a txt file

I recently received some data in 99 pipe-delimited txt files; however, in some of them (I'll use dataaddress.txt as an example) there is a return in the address, e.g.
14 MakeUp Road
Hull
HU99 9HU
It comes out on 3 rows rather than one; bear in mind there is data before and after this address separated by pipes. It just seems to be this address issue which is causing me problems loading the txt file correctly using SSIS.
Rather than go back to the source, I wondered if there was a way to manipulate the txt file to remove these carriage returns while not affecting the row-ending returns, if that makes sense.
I would use sed or awk. I will show you how to do this with awk, because it is more platform-independent. If you do not have awk, you can download a mawk binary from http://invisible-island.net/mawk/mawk.html.
The idea is as follows: tell awk that your record separator is something different, not carriage return or line feed. I will use a comma.
Then use a regular expression to replace the string that you do not like.
Here is a test file I created. Save it as test.txt:
1,Line before ...
2,Broken line ... 14 MakeUp Road
Hull
HU99 9HU
3,Line after
And call awk as follows:
awk 'BEGIN { RS = ","; ORS=""; s=""; } $0 != "" { gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU"); print s $0; s="," }' test.txt
I suggest that you save the awk code into a file named cleanup.awk. Here is the better formatted code with explanations.
BEGIN {
# This block is executed at the beginning of the file
RS = ","; # Tell awk our records are separated by comma
ORS=""; # Tell awk not to use record separator in the output
s=""; # We will print this as record separator in the output
}
{
# This block is executed for each line.
# Remember, our "lines" are separated by commas.
# For each line, use a regular expression to replace the bad text.
gsub(/MakeUp Road[\n\r]+Hull[\n\r]+HU99 9HU/, "MakeUp Road Hull HU99 9HU");
# Print the replaced text - $0 variable represents the line text.
print s $0; s=","
}
Using the awk file, you can execute the replacement as follows:
awk -f cleanup.awk test.txt
To process multiple files, you can create a bash script:
for f in *.txt; do
  # Execute the cleanup.awk program for each file.
  # Save the cleaned output to a file in the directory ../clean
  awk -f cleanup.awk "$f" > ../clean/"$f"
done
You can use sed to remove the line feed and carriage return characters:
sed ':a;N;$!ba;s/MakeUp Road[\n\r]\+/MakeUp Road /g' test.txt | sed ':a;N;$!ba;s/Hull[\n\r]\+/Hull /g'
Explanation:
:a create a label 'a'
N append the next line to the pattern space
$! if not the last line, ba branch (go to) label 'a'
s substitute command, \n represents new line, \r represents carriage return, [\n\r]+ - match new line or carriage return in a sequence as many times as they occur (at least one), /g global match (as many times as it can)
sed will loop through steps 1 to 3 until it reaches the last line, so that all lines end up in the pattern space, where sed substitutes all the matched \n characters
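If the broken field is not always this exact address, a more general repair is to join lines until a record has the expected number of pipe-delimited fields. This is only a sketch: the field count of 5 and the sample records below are assumptions for illustration; use the real column count of your files.

```shell
awk -F'|' -v nf=5 '{
  buf = (buf == "" ? $0 : buf " " $0)           # glue continuation lines with a space
  if (split(buf, tmp, "|") >= nf) {             # record complete: all fields present
    print buf
    buf = ""
  }
}' <<'EOF'
1|John|14 MakeUp Road
Hull
HU99 9HU|x|y
2|Jane|1 Short Lane|a|b
EOF
```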

Converting lines in chunks into tab delimited

I have the following lines in 2 chunks (actually there are ~10K of them).
In this example each chunk contains 3 lines. The chunks are separated by an empty line, so the chunks are like "paragraphs".
xox
91-233
chicago
koko
121-111
alabama
I want to turn it into tab-delimited lines, like so:
xox 91-233 chicago
koko 121-111 alabama
How can I do that?
I tried tr "\n" "\t", but it doesn't do what I want.
$ awk -F'\n' '{$1=$1} 1' RS='\n\n' OFS='\t' file
xox 91-233 chicago
koko 121-111 alabama
How it works
Awk divides input into records and it divides each record into fields.
-F'\n'
This tells awk to use a newline as the field separator.
$1=$1
This tells awk to assign the first field to the first field. While this seemingly does nothing, it causes awk to treat the record as changed and rebuild it. As a consequence, the output is printed using our assigned value for OFS, the output field separator.
1
This is awk's cryptic shorthand for print the line.
RS='\n\n'
This tells awk to treat two consecutive newlines as a record separator.
OFS='\t'
This tells awk to use a tab as the field separator on output.
This answer offers the following:
* It works with blocks of nonempty lines of any size, separated by any number of empty lines; John1024's helpful answer (which is similar and came first) works with blocks of lines separated by exactly one empty line.
* It explains the awk command used in detail.
A more idiomatic (POSIX-compliant) awk solution:
awk -v RS= -F '\n' -v OFS='\t' '$1=$1""' file
-v RS= tells awk to operate in paragraph mode: consider each run of nonempty lines a single record; RS is the input record separator.
Note: The implication is that this solution considers one or more empty lines as separating paragraphs (line blocks); empty means: no line-internal characters at all, not even whitespace.
-F '\n' tells awk to consider each line of an input paragraph its own field (breaks the multiline input record into fields by lines); -F sets FS, the input field separator.
-v OFS='\t' tells awk to separate fields with \t (tab chars.) on output; OFS is the output field separator.
$1=$1"" looks like a no-op, but, due to assigning to field variable $1 (the record's first field), tells awk to rebuild the input record, using OFS as the field separator, thereby effectively replacing the \n separators with \t.
The trailing "" is to guard against the edge case of the first line in a paragraph evaluating to 0 in a numeric context; appending "" forces treatment as a string, and any nonempty string - even if it contains "0" - is considered true in a Boolean context - see below.
Given that $1 is by definition nonempty and given that assignments in awk pass their value through, the result of assignment $1=$1"" is also a nonempty string; since the assignment is used as a pattern (a condition), and a nonempty string is considered true, and there is no associated action block ({ ... }), the implied action is to print the - rebuilt - input record, which now consists of the input lines separated with tabs, terminated by the default output record separator (ORS), \n.
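The "0" edge case is easy to demonstrate in isolation:

```shell
# Without "": the first paragraph line "0" makes the pattern numerically false,
# so the whole record is dropped.
printf '0\nfoo\n' | awk -v RS= -F '\n' -v OFS='\t' '$1=$1'     # prints nothing
# With "": the value is forced to a string, which is nonempty and therefore true.
printf '0\nfoo\n' | awk -v RS= -F '\n' -v OFS='\t' '$1=$1""'   # prints 0<TAB>foo
```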
another alternative,
$ sed '/^$/d' file | pr -3ats$'\t'
xox 91-233 chicago
koko 121-111 alabama
Remove the empty lines with sed and print in 3 columns with a tab delimiter. In your real file, replace 3 with the number of lines per block.
Note that this will only work if all your blocks are of the same size.
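The same two-step idea also works with paste, using one - operand per line in a block (like pr -3, this assumes fixed-size blocks):

```shell
# Drop empty lines, then merge every 3 consecutive lines with tabs.
sed '/^$/d' <<'EOF' | paste - - -
xox
91-233
chicago

koko
121-111
alabama
EOF
```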
xargs -L3 < filename.log | tr ' ' '\t'
xox 91-233 chicago
koko 121-111 alabama
another version of awk to do this
awk 'NF { a = a (a ? "\t" : "") $1; if (++i % 3 == 0) { print a; a = "" } }' input_file