removing multiple instances of a string in a line with sed - csv

I have a large tab-delimited file. Each line contains multiple (and a variable number of) instances of a string I'd like to keep (GO:#######), and some lines are blank except for a period, which should stay as they are. When I use sed to remove all the non-GO strings, it removes the entire middle of the line. How do I prevent this?
The sed command I'm using (and other permutations I've tried):
sed -r 's/\t`.+`\t//g' file1.txt > file2.txt
What I have
GO:1234567 `text1`moretext` GO:5373845 `diff`text` GO:5438534 `text`text
.
GO:3333333 `txt`text` GO:5553535 `misc`text
.
.
What I'd like
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
What I get
GO:1234567 GO:5438534 `text`text
.
GO:3333333 GO:5553535 `misc`text
.
.

With GNU awk:
awk 'BEGIN{FPAT="GO:[0-9]+"; OFS="\t"} {$1=$1; print}' file
Output is tab delimited:
GO:1234567 GO:5373845 GO:5438534
GO:3333333 GO:5553535
From man awk:
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
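FPAT is gawk-specific. If gawk isn't available, a rough equivalent of the extraction (a sketch of my own, not part of the answer above) is to let grep -o pull out the matches and paste rejoin them with tabs. Note this treats its whole input as one stream, so it is best applied one line at a time; for whole files with the "." lines, the gawk version above is the better fit:

```shell
# Extract every GO:<digits> token from one line and rejoin with tabs.
# grep -o prints each match on its own line; paste -s joins them back up.
printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\t`text`text\n' |
grep -o 'GO:[0-9]*' | paste -sd'\t' -
```

This prints the three GO terms on one tab-separated line.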

sed -E 's/\t`[^\t]*//g'
\t- tab
` - a literal backtick
[^\t]* - any non-tab character 0 or more times
Alternative:
sed -E 's/\t(`[^`]*){2}`?//g'
\t - tab
( - start of group
` - a literal backtick
[^`]* - any non-backticks 0 or more times
) - end of group
{2} - repeat group twice
`? - an optional backtick (since the last column only has 2 instead of 3)
... and substitute with an empty string.
Output:
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
Note: These examples assume there is exactly one tab between columns; the tabs are hard to see here.
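Since the tabs are invisible in the rendered samples, here is a sketch that recreates the first sample line with real tabs and runs the first sed command above, so the behavior can be verified locally (the file name file1.txt is just for illustration):

```shell
# Recreate one sample line, plus a "." line, with real tabs between columns.
printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\t`text`text\n.\n' > file1.txt

# Remove every tab-plus-backtick field; the GO columns and the "." line survive.
sed -E 's/\t`[^\t]*//g' file1.txt
```

This prints the three GO terms (still tab-separated) followed by the "." line.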

This awk solution would work with any version of awk:
awk '
BEGIN {
    FS = OFS = "\t"
}
{
    for (i=1; i<=NF; ++i)
        if ($i ~ /^GO:/)
            s = (s ? s OFS : "") $i
    print s
    s = ""
}' file
GO:1234567 GO:5373845 GO:5438534

GO:3333333 GO:5553535

The "." lines come out as empty lines, since "." does not match /^GO:/.

I would explicitly match non-backtick characters instead:
s/`[^`]*`[^`]*`//
Regexes are greedy: `.+` matches everything from the first backtick up to the last backtick on the line.
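A quick way to see the difference (a sketch on a shortened sample line):

```shell
line='GO:1234567 `text1`moretext` GO:5373845 `diff`text`'

# Greedy: .+ runs from the first backtick to the last, swallowing GO:5373845.
echo "$line" | sed -E 's/`.+`//'

# Negated class: each match stops at its own backticks, so GO:5373845 survives.
echo "$line" | sed -E 's/`[^`]*`[^`]*`//g'
```

The first command leaves only "GO:1234567 "; the second keeps both GO terms.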

The pattern \t`.+`\t matches from the first tab-plus-backtick up to the last backtick-plus-tab on the line, because .+ is greedy, so it removes too much.
There don't seem to be any spaces in the backtick-prefixed parts that you want to remove.
I think awk is better suited for this task, but with sed you can remove all strings that start with a backtick ` followed by non-whitespace characters.
If you remove multiple consecutive fields, or a field at the start or end of a line, gaps with multiple tabs can remain; those can be cleaned up in the same command.
sed -E 's/(\t|^)`[^[:space:]]*//g;s/^\t+|\t+$//g;s/\t{2,}/\t/g' file
The tab delimited content of file
GO:1234567 `text1`moretext` GO:5373845 `diff`text` GO:5438534 `text`text
.
GO:3333333 `txt`text` GO:5553535 `misc`text
..
`txt`text` GO:3333333 `txt`text` `txt`text` `txt`text` GO:5553535 `misc`text `misc`text
Output
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
..
GO:3333333 GO:5553535

Related

Create CSV file with below output line

I have the output line below, and from it I want to create a CSV file. The CSV should contain the whole line as the first column and the string after the second ":" delimiter as the second column. The script I'm using splits the data wherever a "," appears instead. Please help me format the data properly.
output line :/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
shell script :
input="out.txt"
while IFS= read -r LINES
do
    #echo "$LINES"
    if [[ $LINES = /* ]]
    then
        filename=$(echo "$LINES" | cut -d ":" -f1)
        echo "$LINES,$filename" >> out.csv
    fi
done < "$input"
I don't think I understand your question correctly.
You currently have this output
:/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
And you would like a two-column CSV (columns 2 and 3), where column 2 is the line without the leading ":"
/home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
and column 3 is everything after the second ":"
token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
If that's what you want, you can use Miller like this
echo ":/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my(\$lhys,\$lytrn,\$ccdethe" | mlr --n2c --ifs ":" cut -x -f 1 then put '$2=$2.":".$3'
and you will have this two columns CSV
2,3
"/home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe","token=13704value=undef;next};my($lhys,$lytrn,$ccdethe"
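If Miller isn't installed, a plain-awk sketch of the same reshaping is possible. It assumes every line starts with ":" and contains at least two ":", and the quoting is only a minimal stand-in for proper CSV escaping. (Note that in the Miller example above, the shell expanded `$$` to its PID, 13704; the sample below is single-quoted, so `$$` stays literal.)

```shell
# Column 1: the line after the leading ":".  Column 2: text after the second ":".
printf '%s\n' ':/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef' |
awk -F':' '{
    rest = substr($0, length($2) + 3)   # skip the two ":" and the path field
    printf "\"%s:%s\",\"%s\"\n", $2, rest, rest
}'
```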

Add an empty line prior to another matched line in jq?

Say I have a raw input like the following:
"```"
"include <stdio.h>"
"..."
"```"
"''some example''"
"*bob"
"**bob"
"*bob"
And I'd like to add a blank line right before the "*bob":
"```"
"include <stdio.h>"
"..."
"```"
"''some example''"
""
"*bob"
"**bob"
"*bob"
Can this be done with jq?
Yes, but to do so efficiently you'd effectively need jq 1.5 or higher:
foreach inputs as $line (0;
  if $line == "*bob" then . + 1 else . end;
  if . == 1 then "" else empty end,
  $line)
Don't forget to use the -n command-line option!
Here is another solution which uses the -s (slurp) option
.[: .[["*bob"]][0]] + ["\n"] + .[.[["*bob"]][0]:] | .[]
that's a little unreadable but we can make it better with a few functions:
def firstbob: .[["*bob"]][0] ;
def beforebob: .[: firstbob ] ;
def afterbob: .[ firstbob :] ;
beforebob + ["\n"] + afterbob
| .[]
if the above filter is in filter.jq and the sample data is in data then
$ jq -Ms -f filter.jq data
produces
"```"
"include <stdio.h>"
"..."
"```"
"''some example''"
"\n"
"*bob"
"**bob"
"*bob"
One issue with this approach is that beforebob and afterbob won't quite work as you probably want if "*bob" is not in the input. The easiest way to address that is with an if guard:
if firstbob then beforebob + ["\n"] + afterbob else . end
| .[]
with that the input will be unaltered if "*bob" is not present.
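Outside jq, the same one-shot insertion can be sketched with awk (my alternative, not part of the answers above): print an empty JSON string before the first "*bob" only. The quotes are matched literally, since each input line is a JSON string:

```shell
printf '%s\n' '"```"' '"*bob"' '"**bob"' '"*bob"' |
awk '!done && $0 == "\"*bob\"" { print "\"\""; done = 1 } { print }'
```

The done flag ensures only the first occurrence gets the blank "" line; later "*bob" lines pass through unchanged.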

awk: appending columns from multiple csv files into a single csv file

I have several CSV files (all have the same number of rows and columns). Each file follows this format:
1 100.23 1 102.03 1 87.65
2 300.56 2 131.43 2 291.32
. . . . . .
. . . . . .
200 213.21 200 121.81 200 500.21
I need to extract columns 2, 4 and 6, and add them to a single CSV file.
I have a loop in my shell script which goes through all the CSV files, extracts the columns, and appends these columns to a single file:
#output header column
awk -F"," 'BEGIN {OFS=","}{ print $1; }' "$input" > $output
for f in "$1"*.csv;
do
    if [[ -f "$f" ]] #removes symlinks (only executes on files with .csv extension)
    then
        fname=$(basename $f)
        arr+=("$fname") #array to store filenames
        paste -d',' $output <(awk -F',' '{ print $2","$4","$6; }' "$f") > temp.csv
        mv temp.csv "$output"
    fi
done
Running this produces this output:
1 100.23 102.03 87.65 219.42 451.45 903.1 ... 542.12 321.56 209.2
2 300.56 131.43 291.32 89.57 897.21 234.52 125.21 902.25 254.12
. . . . . . . . . .
. . . . . . . . . .
200 213.23 121.81 500.21 231.56 5023.1 451.09 ... 121.09 234.45 709.1
My desired output is a single CSV file that looks something like this:
1.csv 1.csv 1.csv 2.csv 2.csv 2.csv ... 700.csv 700.csv 700.csv
1 100.23 102.03 87.65 219.42 451.45 903.1 542.12 321.56 209.2
2 300.56 131.43 291.32 89.57 897.21 234.52 125.21 902.25 254.12
. . . . . . . . . .
. . . . . . . . . .
200 213.23 121.81 500.21 231.56 5023.1 451.09 ... 121.09 234.45 709.1
In other words, I need a header row containing the file names in order to identify which files the columns were extracted from. I can't seem to wrap my head around how to do this.
What is the easiest way to achieve this (preferably using awk)?
I was thinking of storing the file names into an array, inserting a header row and then print the array but I can't figure out the syntax.
So, based on a few assumptions:
the inputs are called "*.csv" but they're actually whitespace-separated, as they appear.
the odd-numbered input columns just repeat the row number 3 times, and can be ignored
the column headings are just the filenames, repeated 3 times each
they are input to some other program, and the numbers are left-justified anyway, so you aren't particular about the column formatting (columns lining up, decimals aligned, ...)
f=$(set -- *.csv; echo $*)
(echo $f; paste $f) |
awk 'NR==1 { for (i=1; i<=NF; i++) {x=x" "$i" "$i" "$i} }
NR > 1 { x=$1; for (i=2; i<= NF; i+=2) {x=x" "$i} }
{print x}'
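A self-contained run of the same idea, with two tiny hypothetical input files (the names and numbers are made up) so the paste/awk mechanics are visible:

```shell
dir=$(mktemp -d) && cd "$dir"
printf '1 10.5 1 20.5 1 30.5\n2 11.5 2 21.5 2 31.5\n' > 1.csv
printf '1 40.5 1 50.5 1 60.5\n2 41.5 2 51.5 2 61.5\n' > 2.csv

f=$(set -- *.csv; echo "$*")          # "1.csv 2.csv"
(echo "$f"; paste $f) |
awk 'NR == 1 { for (i = 1; i <= NF; i++) x = x " " $i " " $i " " $i }
     NR > 1  { x = $1; for (i = 2; i <= NF; i += 2) x = x " " $i }
     { print x }'
```

The header repeats each file name three times (once per extracted column); each data row keeps the row number plus the even-numbered (value) columns from every file.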

How can I enter similar lines of code with sublime text with just one char different?

I have to write code which is like
apple1 =1
banana1 =10
cat1 =100
dog1 =1000
apple2 =2
banana2 =20
cat2 =200
dog2 =2000
.
.
.
<to be done till>
apple50 =50
banana50 =500
cat50 =5000
dog50 =50000
Is there any shortcut to copy-paste the first 4 lines and keep pasting with a running sequence?
Any shortcut that does this partially or completely is appreciated.
Thanks
As already mentioned, the easiest way to do this is with a programming language, and you can use Python inside Sublime Text.
Open the ST console with ctrl+` and paste:
view.run_command("insert", {"characters": "\n\n".join("apple{0} ={0}\nbanana{0} ={0}0\ncat{0} ={0}00\ndog{0} ={0}000".format(i) for i in range(1, 51))})
this will insert the requested content.
You could also write a plugin using Tools >> New Plugin... and paste:
import sublime
import sublime_plugin


class PasteSequenceCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        view = self.view
        content = sublime.get_clipboard()
        content, sequence_number = content.replace("1", "{0}"), 2
        if content == view.settings().get("ps_content"):
            sequence_number = view.settings().get("ps_sequence_number") + 1
        view.settings().set("ps_content", content)
        view.settings().set("ps_sequence_number", sequence_number)
        view.run_command("insert", {"characters": content.format(sequence_number)})
Afterwards add keybinding:
{
"keys": ["ctrl+shift+v"],
"command": "paste_sequence"
},
Then you can copy the block containing the 1s, and each 1 will be replaced by an increasing number every time you run the paste-sequence command.
This task doesn't seem like one for a text editor; it looks more like a task for a script. For example, in bash it would be the following:
#!/bin/bash
for i in $(seq 1 50)
do
    echo "apple$i =${i}" >> text.txt
    echo "banana$i =${i}0" >> text.txt
    echo "cat$i =${i}00" >> text.txt
    echo "dog$i =${i}000" >> text.txt
done
To run it:
create file, say inserter.sh
make it executable by chmod +x inserter.sh
run it ./inserter.sh
Result will be in text.txt file in the same folder.
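The same output can also be generated without a shell loop, e.g. with a single awk BEGIN loop (a sketch; it assumes the values are simply i, i*10, i*100, i*1000, matching the pattern in the question):

```shell
# Print all 50 blocks; redirect to text.txt if you want them in a file.
awk 'BEGIN {
    for (i = 1; i <= 50; i++)
        printf "apple%d =%d\nbanana%d =%d\ncat%d =%d\ndog%d =%d\n\n",
               i, i, i, i * 10, i, i * 100, i, i * 1000
}'
```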
You need to redirect the output to a file.
#!/bin/bash
cntr=1
banana_cntr=10
cat_cntr=100
dog_cntr=1000
for i in $(seq 1 50)
do
    echo "apple${cntr}=$((cntr * 1))"
    echo "banana${cntr}=$((cntr * banana_cntr))"
    echo "cat${cntr}=$((cntr * cat_cntr))"
    echo "dog${cntr}=$((cntr * dog_cntr))"
    cntr=$((cntr + 1))
    echo " "
done

Read from file into variable - Bash Script

I'm working on a bash script to backup MySQL. I need to read from a file a series of strings and pass them to a variable in my script. Example:
Something like this will be in the file (file.txt)
database1 table1
database1 table4
database2
database3 table2
My script needs to read the file and put these strings in a variable like:
#!/bin/bash
LIST="database1.table1|database1.table4|database2|database3.table2"
Edit. I changed my mind, now I need this output:
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*
You could use tr to replace the newlines and spaces:
LIST=$(tr ' \n' '.|' < file.txt)
Since the file ends with a newline, that would leave a trailing "|". To avoid it, translate only the spaces and let paste join the lines:
LIST=$(tr ' ' '.' < file.txt | paste -sd'|')
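A quick end-to-end check of that pipeline, with the sample file recreated inline (assuming single-space separators as shown in the question):

```shell
printf 'database1 table1\ndatabase1 table4\ndatabase2\ndatabase3 table2\n' > file.txt

# tr turns the spaces into dots; paste -s joins all lines with "|".
LIST=$(tr ' ' '.' < file.txt | paste -sd'|' -)
echo "$LIST"
```

This prints database1.table1|database1.table4|database2|database3.table2 with no trailing "|".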
Using awk:
s=$(awk '{$1=$1}1' OFS='.' ORS='|' file)
LIST="${s%|}"
echo "$LIST"
database1.table1|database1.table4|database2|database3.table2
bash (version 4 or later, for mapfile):
mapfile -t lines < file.txt          # read the lines of the file into an array
lines=("${lines[@]// /.}")           # replace all spaces with dots
str=$(IFS='|'; echo "${lines[*]}")   # join the array with "|"
echo "$str"
database1.table1|database1.table4|database2|database3.table2
mapfile -t lines < file.txt
for ((i=0; i<${#lines[@]}; i++)); do
    [[ ${lines[i]} == *" "* ]] && lines[i]+=" *" || lines[i]+="* *"
done
str=$(IFS='|'; echo "${lines[*]// /.}")
echo "$str"
database1.table1.*|database1.table4.*|database2*.*|database3.table2.*
You can just replace the newlines with a character that you need using sed, if that character doesn't occur in the data.
For example
FOO=$(sed '{:q;N;y/ /./;s/\n/|/g;t q}' /home/user/file.txt)