For a data mining project I need to convert 80 tab-delimited files (100 MB each) to CSV files. Is anybody aware of tools that could be handy in this case?
Download python: https://www.python.org/downloads/
Install it.
And run a script similar to the following.
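Save the following as convert_tsv_to_csv.py (or anything ending in .py):
import csv
# Raw strings (r'...') keep the backslashes literal; otherwise '\t' is read as a tab.
# newline='' lets the csv module manage line endings itself.
with open(r'C:\path\to\file', 'r', newline='') as f:
    tab_file = csv.reader(f, dialect=csv.excel_tab)
    with open(r'C:\path\to\outfile.csv', 'w', newline='') as g:
        comma_file = csv.writer(g, dialect=csv.excel)
        for row in tab_file:
            comma_file.writerow(row)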
Change the paths and run it like: python convert_tsv_to_csv.py
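Since the question mentions 80 files, here is a minimal sketch of the same csv-module approach looped over a directory. The function names, the *.tsv pattern, and the choice to write each output next to its input with a .csv suffix are assumptions for illustration, not part of the original answer.

```python
import csv
from pathlib import Path


def convert_tsv_to_csv(src: Path, dst: Path) -> None:
    """Stream one tab-delimited file into a comma-separated file, row by row."""
    # newline='' lets the csv module handle line endings itself.
    with open(src, "r", newline="") as f, open(dst, "w", newline="") as g:
        writer = csv.writer(g, dialect=csv.excel)
        for row in csv.reader(f, dialect=csv.excel_tab):
            writer.writerow(row)


def convert_directory(directory: Path, pattern: str = "*.tsv") -> list:
    """Convert every file matching `pattern` (assumed naming); return the CSV paths written."""
    written = []
    for src in sorted(directory.glob(pattern)):
        dst = src.with_suffix(".csv")  # hypothetical naming choice
        convert_tsv_to_csv(src, dst)
        written.append(dst)
    return written
```

Calling convert_directory(Path(r'C:\path\to')) would then process every matching file in that folder. Rows are streamed one at a time, so the 100 MB size per file should not be a memory concern.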
The basic idea:
If the files are big, read them line by line.
Learn your basic tools.
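On most UNIX/Linux/OS X systems, either of the following commands should do the trick (the sed form as written assumes GNU sed; BSD/OS X sed needs an argument after -i and may not understand \t):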
sed -i -e 's/\t/,/g' *.csv
perl -i -p -e 's/\t/,/g' *.csv
These perform the basic tab-to-comma substitution. They won't take care of things like quoting and escaping if your data contains columns with a tab or comma, or change the file name for you! Note that the syntax of sed and perl is very similar: -i is in-place editing, -e is execute a command, s/// is the syntax for regular expression substitutions, and so on.
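To make the quoting caveat concrete, here is a small sketch comparing a plain tab-to-comma substitution with Python's csv module on one made-up line. The sample data is hypothetical.

```python
import csv
import io

# Hypothetical input: the second field contains a comma.
line = "widget\tred, large\t4\n"

# Naive substitution, the same thing s/\t/,/g does: the comma inside the
# second field now looks like a field separator.
naive = line.replace("\t", ",")

# The csv module quotes the field that contains a comma.
out = io.StringIO()
writer = csv.writer(out, dialect=csv.excel)
for row in csv.reader(io.StringIO(line), dialect=csv.excel_tab):
    writer.writerow(row)
quoted = out.getvalue()

print(naive, end="")   # widget,red, large,4
print(quoted, end="")  # widget,"red, large",4
```

The naive result would be read back as four columns instead of three, which is exactly the case the one-liners do not handle.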
Either way, your basic unix tools for this job are
extremely fast (the "stream editor" sed is well optimized, low-level C code)
handy (just some 10 keypresses!)
easy to use, once you've learned the basics (i.e. read the manual)
Related
I have a Perl script that uses some local variables as per below:
my $cool_variable="Initial value";
COOLVAR="Initial value for COOLVAR"
I would like to replace the content between the quotes using a bash script.
I got it to work for a non-variable like below:
#!/bin/sh
dummy_var="Replaced value"
sed -i -r "s#^(COOLVAR=).*#\1$dummy_var#" perlscript.pl
But if I replace it with cool_variable or $cool_variable:
sed -i -r "s#^($cool_variable=).*#\1$dummy_var#" perlscript.pl
It does not work..
There are multiple code injection bugs in that snippet. You shouldn't be generating code from the shell or sed.
Say you have
var=COOLVAR
val=coolval
As per How can I process options using Perl in -n or -p mode?, you can use any of
perl -spe's{^$var=\K.*}{"\Q$val\E";};' -- -var="$var" -val="$val" perlscript.pl
var="$var" val="$val" perl -pe's{^$ENV{var}=\K.*}{"\Q$ENV{val}\E";};' perlscript.pl
export var
export val
perl -pe's{^$ENV{var}=\K.*}{"\Q$ENV{val}\E";};' perlscript.pl
to transform
COOLVAR="dummy";
HOTVAR="dummy";
into
COOLVAR="coolval";
HOTVAR="dummy";
The values are passed to the program as arguments or environment variables to avoid injecting them into the fixer, and the fixer uses Perl's quotemeta (aka \Q..\E) to quote special characters.
Note that $var is assumed to be a valid identifier. No validation checks are performed. This program is absolutely unsafe using untrusted input.
Use -i to modify the file in place.
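For comparison, the same idea (pass the name and value as data, escape them before they touch a pattern or the generated code) can be sketched in Python. This is only an approximation: perl_double_quoted and set_assignment are hypothetical helper names, and the escaping covers ASCII non-word characters rather than reproducing quotemeta exactly.

```python
import re


def perl_double_quoted(value: str) -> str:
    """Backslash-escape every ASCII non-word character, roughly like \\Q...\\E."""
    return '"' + re.sub(r"([^A-Za-z0-9_])", r"\\\1", value) + '"'


def set_assignment(source: str, var: str, value: str) -> str:
    """Replace the right-hand side of every line that starts with `var=`."""
    # re.escape keeps characters in `var` from acting as regex syntax.
    pattern = re.compile(r"^(" + re.escape(var) + r"=).*", re.MULTILINE)
    # A function replacement keeps backslashes in the value from being
    # interpreted as replacement-template escapes.
    return pattern.sub(
        lambda m: m.group(1) + perl_double_quoted(value) + ";", source
    )


before = 'COOLVAR="dummy";\nHOTVAR="dummy";\n'
print(set_assignment(before, "COOLVAR", "coolval"), end="")
```

As with the Perl version, the variable name is assumed to be a valid identifier and no validation is performed.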
I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values, and use them to do Imagemagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU AWK excels at processing structured text data. It can run commands through its system function, but it is less handy for that than some other languages; Python, for example, has a standard library module called subprocess that is more feature-rich.
If you wish to use awk for this task anyway, I suggest preparing output to be fed into a bash command. Say you have file.txt with the following content
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and the files listed in the 1st column are in the current working directory, you wish to convert them to the files shown in the 2nd column, and you have access to the convert command; then you might do
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equivalent to starting bash and doing
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " to enclose the filenames, so it should work with names containing spaces. Disclaimer: it might fail if a name contains a special character, e.g. ".
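The subprocess route mentioned above might look like the following sketch. It passes each filename as its own argument rather than building a shell command line, so quotes and spaces in names need no escaping. build_commands and run_commands are hypothetical names, and the two-column layout is taken from the file.txt example.

```python
import csv
import io
import subprocess


def build_commands(csv_text: str, program: str = "convert") -> list:
    """Return one argument list per CSV row: [program, source, destination]."""
    rows = csv.reader(io.StringIO(csv_text))
    return [[program, row[0], row[1]] for row in rows if len(row) >= 2]


def run_commands(commands: list) -> None:
    """Run each command without a shell, so names are passed through verbatim."""
    for argv in commands:
        # check=True raises CalledProcessError if the program exits non-zero.
        subprocess.run(argv, check=True)


for argv in build_commands("file1.jpg,file1.bmp\nfile2.png,file2.bmp\n"):
    print(argv)
```

In real use you would read file.txt, pass its contents to build_commands, and hand the result to run_commands.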
I am looking for an elegant way to parse a text file (i.e. a log file containing source and destination IPs and lots of other data) keeping each line intact, and replacing all IPv4 addresses with the same IP followed by a comma and the GeoIP country code of that IP.
I have tried doing this in bash, sed, perl, and python. I tried a hundred perl one-liners and never quite got it because substitution like s/original/replacement/g doesn't want to execute GeoIP lookup in the substitution field. For example:
perl -pe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/($1,system(geoiplookup $1))/g' < log.csv
results in:
"srcip=(110.110.110.110,system(geoiplookup 110.110.110.110))"
instead of executing geoiplookup.
I've tried this with backticks as well as exec, lots of different punctuation, with the same result.
In Python I tried some code that looks like:
rexp_ip = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
repl = { rexp_ip: rexp_ip+".test" }
---
while line:
    line = i.readline()
    print(re.sub(rexp_ip, lambda m: str(repl.get(m.group())), line))
It seems pretty close but I'm not sure whether I'm on the right track here.
I would be open to bash, sed, awk, perl, python, or any other solution.
This seems fairly simple to me and I may be over-thinking it!
I am guessing I'm not the first person who has tried this and maybe I'm 'reinventing the wheel' here.
Any insight would be appreciated.
I may have solved my own problem using perl with /e switch--
$ perl -lpe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/(`printf $1;geoiplookup $1`)/eg' < log.csv
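The Python attempt in the question was close: re.sub accepts a function as the replacement, which plays the same role as Perl's /e switch. Below is a minimal sketch under stated assumptions. geoip_country and annotate are hypothetical names, and the parsing assumes geoiplookup prints something shaped like "GeoIP Country Edition: XX, Country Name", which you should check against your installed version.

```python
import re
import subprocess

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def geoip_country(ip: str) -> str:
    """Ask the geoiplookup CLI for a country code (output format is assumed)."""
    out = subprocess.run(
        ["geoiplookup", ip], capture_output=True, text=True
    ).stdout
    # Assumed shape: "<label>: XX, Country Name"; fall back to "??" otherwise.
    match = re.search(r":\s*([A-Z]{2}),", out)
    return match.group(1) if match else "??"


def annotate(line: str, lookup=geoip_country) -> str:
    """Replace each IPv4 address with 'address,country', leaving the rest intact."""
    return IPV4.sub(lambda m: f"{m.group()},{lookup(m.group())}", line)
```

Because the lookup is a parameter, you can pass a dictionary's get method or a cached function instead of spawning one process per address, which may matter on large logs.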
I am trying to replace page=#" in various html files with page=#/index.html". I have tried using the command:
sed -i -re 's|"(page=[0-9]+)"|"\1/index.html"|' *.html
along with numerous interpretations but have not been successful. The first part of the code sed -i -re 's|"(page=[0-9]+)"| seems to be working properly but I cannot seem to format the end to achieve my goal. Any suggestions to modify this command would be greatly appreciated!
If you're trying to replace page=#" where the actual strings look like page=99", then the first double quote in the RE isn't going to match anything correctly. It would only match if it looks like:
"page=99"
But I'm guessing this is at the end of a link in html so it probably does not have the initial double quote. This should work instead:
sed -i -re 's|(page=[0-9]+)"|\1/index.html"|' *.html
Also to confirm, if you're on OS X, you can't use the GNU option -r or use -i without an argument, so it would look like this:
sed -i '' -Ee 's|(page=[0-9]+)"|\1/index.html"|' *.html
-E means to use Extended Regular Expressions so you can write ( instead of \( for grouping. In GNU sed this is -r.
-i means to edit the files in-place, on GNU it can take no argument, but on other systems you need to pass the extension to make for a backup, or '' for no backup.
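If you want to check the pattern without touching any files, the difference between the two expressions can be tried out in Python first. The sample link is made up; the regular expressions are the ones from the question and the answer.

```python
import re

# Hypothetical link: "page=99" is preceded by "?", not by a double quote.
html = '<a href="list.php?page=99">next</a>'

# Pattern with a leading quote, as in the question: nothing matches here.
with_quote = re.sub(r'"(page=[0-9]+)"', r'"\1/index.html"', html)

# Pattern without the leading quote, as in the answer.
without_quote = re.sub(r'(page=[0-9]+)"', r'\1/index.html"', html)

print(with_quote)
print(without_quote)
```

The first print shows the line unchanged, the second shows /index.html inserted before the closing quote.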
Hello and thank you for any help you can provide
I have my Apache2 web server set up so that when I go to a specific link, it will run and display the output of a shell script stored on my server. I need to output the results of an SVN command (svn log). If I simply put the command 'svn log -q' (-q for quiet), I get the output of:
(of course not blurred), and with exactly 72 dashes in between each line. I need to be able to take these dashes, and turn them into an html line break, like so:
Basically I need the shell script to take the output of the 'svn log -q' command, search and replace every chunk of 72 dashes with an html line break, and then echo the output.
Is this at all possible?
I'm somewhat a noob at shell scripting, so please excuse any mess-ups.
Thank you so much for your help.
svn log -q | sed -E -e 's,-{72},<br/>,'
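If the script ever moves away from the shell, the same replacement is a single regular expression in Python. This is a sketch only: dashes_to_br is a hypothetical name and the sample log line is invented.

```python
import re


def dashes_to_br(log_text: str) -> str:
    """Replace every run of exactly 72 consecutive dashes with an HTML line break."""
    return re.sub(r"-{72}", "<br/>", log_text)


# Invented sample in the shape of `svn log -q` output.
sample = "-" * 72 + "\nr12 | alice | 2020-01-01\n" + "-" * 72 + "\n"
print(dashes_to_br(sample), end="")
```

To feed it real data you could capture the command output with subprocess.run(["svn", "log", "-q"], capture_output=True, text=True).stdout and pass that string in.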
If you want to write it in the script this might help:
${string//substring/replacement}
Replace all matches of $substring with $replacement.
stringZ=abcABC123ABCabc
echo ${stringZ/abc/xyz} # xyzABC123ABCabc
# Replaces first match of 'abc' with 'xyz'.
echo ${stringZ//abc/xyz} # xyzABC123ABCxyz
# Replaces all matches of 'abc' with 'xyz'.