Converting a \s+ delimited file to CSV using sed

I'm trying to convert a file that has two or more white spaces separating each column.
YP_010083342.1 - 258 VOG00003 - 582 8.6e-22 80.7 0.2 1 1 5.3e-25 1e-21 80.4 0.2 193 363 5 185 1 251 0.60 anti-repressor protein [Staphylococcus phage LH1]
I'd like to convert this to a csv using sed. The following sed commands make no apparent changes to the file.
sed -i 's/\s+/,/g' file.ouput
sed -i 's/$\s+/,/g' file.ouput
sed -i 's/\t+/,/g' file.ouput
sed -i 's/$\t+/,/g' file.ouput
But the following command produces this output:
sed -i 's/\s\s/,/g' file.ouput
YP_010083342.1,,, -,,,,,,258 VOG00003,,,,,, -,,,,,,582, 8.6e-22, 80.7, 0.2, 1, 1, 5.3e-25,, 1e-21, 80.4, 0.2, 193, 363,, 5, 185,, 1, 251 0.60 anti-repressor protein [Staphylococcus phage LH1]
Is anyone able to explain why this is occurring and how to properly solve this?

You can use this sed:
sed -E 's/ {2,}/,/g' file
YP_010083342.1,-,258 VOG00003,-,582,8.6e-22,80.7,0.2,1,1,5.3e-25,1e-21,80.4,0.2,193,363,5,185,1,251 0.60 anti-repressor protein [Staphylococcus phage LH1]
Or this awk:
awk -F ' {2,}' -v OFS=, '{$1=$1} 1' ff

The problem is that + is part of extended regular expressions, which have to be enabled using sed -r (or -E). Some seds such as GNU sed support it as an extension also in basic regular expressions, but it has to be escaped: \+. \s is also an extension, by the way.
Assuming GNU sed, any of these would work:
sed -i 's/\s\s\+/,/g' file.output
sed -E -i 's/\s\s+/,/g' file.output
sed -E -i 's/\s{2,}/,/g' file.output
More portable, working with any sed (redirect output to another file, then rename):
sed 's/[[:blank:]]\{2,\}/,/g' file.output
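To see both effects side by side, here is a minimal sketch. The sample rows and the temporary file names are made up for illustration; only the sed expressions come from the answers above. It first reproduces the run of commas you get when each pair of spaces is replaced separately, then applies the portable expression and writes the result to a new file.

```shell
# Hypothetical sample data; the real file has many more columns.
tmpdir=$(mktemp -d)
printf 'YP_1  -   258  anti-repressor protein\n' > "$tmpdir/file.output"

# Replacing each PAIR of spaces turns a run of six spaces into three commas.
pairs=$(printf 'a      b\n' | sed 's/  /,/g')
printf '%s\n' "$pairs"        # a,,,b

# POSIX basic regex: a run of two or more blanks becomes a single comma.
# Output goes to a new file instead of editing in place.
sed 's/[[:blank:]]\{2,\}/,/g' "$tmpdir/file.output" > "$tmpdir/file.csv"

csv_line=$(cat "$tmpdir/file.csv")
printf '%s\n' "$csv_line"     # YP_1,-,258,anti-repressor protein
rm -r "$tmpdir"
```

Single spaces inside the last field are left alone, which is why the description column stays in one piece.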

Related

Get tsv file with header using sed

I wrote these sed commands to filter .tsv files (in this case by chromosome 19). Unfortunately, I don't know how to keep the header of the .tsv file as well; so far I only get headerless data. How should I modify my code?
wget https://www.dropbox.com/s/dataset.tsv.bgz -O temp.data.99.tsv.bgz
gunzip -c temp.data.99.tsv.bgz > temp.data.99.tsv
sed -n '/^19:/p' temp.data.99.tsv | sed 's/:/ /g' > finished_tsv_files/temp.data.99_Chr_19.tsv
rm temp.data.99.tsv
Replace
/^19:/p
with
1p; /^19:/p
to output first line, too.
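A small sketch of that change on made-up data (the column names and rows are hypothetical, chosen only to have one header line and a mix of chromosomes):

```shell
# Hypothetical three-row TSV with a header line.
tmpdir=$(mktemp -d)
printf 'locus\tvalue\n19:100\ta\n2:50\tb\n19:200\tc\n' > "$tmpdir/sample.tsv"

# 1p prints the first line (the header); /^19:/p prints the matching rows.
filtered=$(sed -n '1p; /^19:/p' "$tmpdir/sample.tsv")
printf '%s\n' "$filtered"
rm -r "$tmpdir"
```

This prints the header followed by the two rows that start with 19:. It assumes the header itself does not start with 19:, otherwise it would be printed twice.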

How to replace a html comments using shell script

I am trying to uncomment a line in an HTML file using a shell script, but I am not able to write a sed command for this.
I have a line
<!--<url="/">-->
I need to uncomment this line using shell script
<url="/"/>
sed -i -e "s|'<!--<url="/"/>-->'|<url="/">|g" myFile.html
Any idea how to replace this comment?
Use:
sed -re 's/(<!--)|(-->)//g'
e.g:
echo '<HTML><!--<url="/">--> <BODY>Test</BODY></HTML>' | sed -re 's/(<!--)|(-->)//g'
Like this?
sed -i 's|<!--<url="/">-->|<url="/">|g' myFile.html
Single quotes are the better choice here because the shell passes everything inside them to sed unchanged, including the double quotes.
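If you use / as the delimiter of the s command, every / inside the pattern and the replacement has to be escaped with a backslash. (sed also accepts other delimiters such as |, in which case no escaping is needed.) With / as the delimiter, use the following line: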
sed -i 's/<!--<url="\/">-->/<url="\/">/g' myFile.html
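As a quick check that the two delimiter styles are interchangeable, here is a sketch that runs both on the sample string from the earlier answer, reading from a pipe instead of editing myFile.html in place:

```shell
line='<HTML><!--<url="/">--> <BODY>Test</BODY></HTML>'

# | as delimiter: the slash in the pattern needs no escaping.
with_pipe=$(printf '%s\n' "$line" | sed 's|<!--<url="/">-->|<url="/">|g')

# / as delimiter: every slash in pattern and replacement is escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/<!--<url="\/">-->/<url="\/">/g')

printf '%s\n' "$with_pipe"    # <HTML><url="/"> <BODY>Test</BODY></HTML>
```

Both variables end up with the same text; which delimiter to pick is mostly a readability question.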

How to add a comma to the end of every line in a json file [duplicate]

given a plain text document with several lines like:
c48 7.587 7.39
c49 7.508 7.345983
c50 5.8 7.543
c51 8.37454546 7.34
I need to add some info 2 spaces after the end of the line, so for each line I would get:
c48 7.587 7.39 def
c49 7.508 7.345983 def
c50 5.8 7.543 def
c51 8.37454546 7.34 def
I need to do this for thousands of files. I guess this is possible to do with sed, but do not know how to. Any hint? Could you also give me some link with a tutorial or table for this cases?
Thanks
If all your files are in one directory:
sed -i.bak 's/$/ def/' *.txt
To do it recursively (GNU find):
find /path -type f -iname '*.txt' -exec sed -i.bak 's/$/ def/' "{}" +;
you can see here for introduction to sed
Other ways you can use,
awk
for file in *
do
awk '{print $0" def"}' "$file" >temp
mv temp "$file"
done
Bash shell
for file in *
do
while IFS= read -r line
do
echo "$line def"
done < "$file" >temp
mv temp "$file"
done
for file in ${thousands_of_files} ; do
sed -i.bak -e 's/$/  def/' "$file"
done
The key here is the search-and-replace s/// command. Here we replace the end of the line $ with 2 spaces and your string.
Find the sed documentation at http://sed.sourceforge.net/#docs
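A minimal sketch of the in-place variant on a throwaway directory, so you can check the result before running it on thousands of files. The directory and file name are made up; -i.bak is not part of POSIX sed, but the attached-suffix form is accepted by both GNU and BSD sed as far as I know.

```shell
# Hypothetical test directory with one small data file.
tmpdir=$(mktemp -d)
printf 'c48 7.587 7.39\nc49 7.508 7.345983\n' > "$tmpdir/a.txt"

# Append two spaces and "def" to every line; the original is kept as a.txt.bak.
sed -i.bak 's/$/  def/' "$tmpdir"/*.txt

first_line=$(head -n 1 "$tmpdir/a.txt")
backup_line=$(head -n 1 "$tmpdir/a.txt.bak")
printf '%s\n' "$first_line"   # c48 7.587 7.39  def
rm -r "$tmpdir"
```

The .bak copies let you undo the change; delete them once you are happy with the output.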

Outputting data from 5gb file with awk

I have a csv file with approximately 300 columns.
I'm using awk to create a subset of this file where the 24th column is "CA".
Example of data:
Here's what I am trying:
awk -F "," '{if($24~/CA/)print}' myfile.csv > subset.csv
After approximately 10 minutes the subset file grew to 400 mb, and then I killed it because this is too slow.
How can I speed this up? Perhaps a combination of sed / awk?
tl;dr:
awk implementations can significantly differ in performance.
In this particular case, see if using gawk (GNU awk) helps.
Ubuntu comes with mawk as the default awk, which is usually considered faster than gawk. However, in the case at hand it seems that gawk is significantly faster (related to line length?), at least based on the following simplified tests, which I ran in a VM on Ubuntu 14.04 on a 1-GB file with 300 columns of length 2.
The tests also include an equivalent sed and grep command.
Hopefully they provide at least a sense of comparative performance.
Test script:
#!/bin/bash
# Pass in test file
f=$1
# Suppress stdout
exec 1>/dev/null
awkProg='$24=="CA"'
echo $'\n\n\t'" $(mawk -W version 2>&1 | head -1)" >&2
time mawk -F, "$awkProg" "$f"
echo $'\n\n\t'" $(gawk --version 2>&1 | head -1)" >&2
time gawk -F, "$awkProg" "$f"
sedProg='/^([^,]+,){23}CA,/p'
echo $'\n\n\t'" $(sed --version 2>&1 | head -1)" >&2
time sed -En "$sedProg" "$f"
grepProg='^([^,]+,){23}CA,'
echo $'\n\n\t'" $(grep --version 2>&1 | head -1)" >&2
time grep -E "$grepProg" "$f"
Results:
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
real 0m11.341s
user 0m4.780s
sys 0m6.464s
GNU Awk 4.0.1
real 0m3.560s
user 0m0.788s
sys 0m2.716s
sed (GNU sed) 4.2.2
real 0m9.579s
user 0m4.016s
sys 0m5.504s
grep (GNU grep) 2.16
real 0m50.009s
user 0m42.040s
sys 0m7.896s
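Note that the test script compares with $24=="CA" while the question used $24~/CA/. The two are not equivalent: the regex form also selects values that merely contain CA. A small sketch with made-up three-column rows (column 3 stands in for column 24):

```shell
# Hypothetical rows; only the first has exactly "CA" in the third column.
rows=$(printf 'a,b,CA\nc,d,CAL\ne,f,NY\n')

# Regex match: any field containing "CA" is selected.
regex_hits=$(printf '%s\n' "$rows" | awk -F, '$3 ~ /CA/' | wc -l | tr -d ' ')

# String comparison: only fields equal to "CA" are selected.
exact_hits=$(printf '%s\n' "$rows" | awk -F, '$3 == "CA"' | wc -l | tr -d ' ')

printf 'regex: %s, exact: %s\n' "$regex_hits" "$exact_hits"   # regex: 2, exact: 1
```

Either way, splitting on a bare comma assumes no field contains a quoted comma.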

Search and replace html tags in sed recursively

I am trying to write a script to search and remove htm and html tags from all files recursively. The starting point is given as input in the command to run the script. The resultant files should be saved in new file at the same place ending with _changed. e.g., start.html > start.html_changed.
Here is the script I wrote so far. It works fine, but the output prints out to the terminal, and I want it to be saved in files respectively.
#!/bin/bash
sudo find $1 -name '*.html' -type f -print0 | xargs -0 sed -n '/<div/,/<\/div>/p'
sudo find $1 -name '*.htm' -type f -print0 | xargs -0 sed -n '/<div/,/<\/div>/p'
Any help is much appreciated.
The following script works just fine, but it is not recursive. How can I make it recursive?
#!/bin/bash
for l in /$1/*.html
do
sed -n '/<div/,/<\/div>/p' $l > "${l}_nobody"
done
for m in /$1/*.htm
do
sed -n '/<div/,/<\/div>/p' $m > "${m}_nobody"
done
Just edit the xargs part as follows:
xargs -0 -I {} sh -c "sed -n '/<div/,/<\/div>/p' {} > {}_changed"
Explanation:
-I {}: sets {} as a placeholder that xargs replaces with each file name
> {}_changed: redirects the output to a file with the _changed suffix
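One caveat with substituting {} directly into the sh -c string is that file names containing spaces or quotes can break the command. A possible alternative, sketched here on a made-up directory tree, is to let find pass the names as arguments to a small inline script. The sed expression is the one from the question; everything else is an assumption for illustration.

```shell
# Hypothetical tree with a space in a directory name.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub dir"
printf '<p>x</p>\n<div>\nkeep\n</div>\n<p>y</p>\n' > "$tmpdir/sub dir/start.html"

# find recurses; each file name arrives in the inline script as "$f",
# so it is never re-parsed by the shell.
find "$tmpdir" -type f \( -name '*.html' -o -name '*.htm' \) -exec sh -c '
  for f do
    sed -n "/<div/,/<\/div>/p" "$f" > "${f}_changed"
  done' sh {} +

changed=$(cat "$tmpdir/sub dir/start.html_changed")
printf '%s\n' "$changed"
rm -r "$tmpdir"
```

This prints the three lines from the opening to the closing div tag, and it handles both extensions in one pass.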