R - running commands in terminal and saving output to a dataframe - mysql

On OSX, I am using the system() function to run commands in terminal from the R console as part of a script I've written. The script requires connecting to a MySQL database through an ssh tunnel, and I type "ps aux | grep ssh" into the command line to see which tunnels I am connected to. For example, some output:
> system("ps aux | grep ssh")
Home 50915 0.0 0.0 2501204 3264 ?? S 10:32AM server info
Home 50092 0.0 0.0 2504172 3048 ?? Ss 9:35AM server2 info
Home 50090 0.0 0.0 2501372 480 ?? Ss 9:35AM server3 info
Home 1155 0.0 0.0 2544220 1368 ?? S Thu07PM server4 info
Home 51333 0.0 0.0 2434840 800 ?? S 11:00AM 0:00.00 grep ssh
Home 51331 0.0 0.0 2438508 1124 ?? S 11:00AM 0:00.00 sh -c ps aux | grep ssh
I would like to turn this output into a dataframe, but cannot. Calls like as.data.frame(system("ps aux | grep ssh")) do not work the way I would hope.
Any thoughts on this would be appreciated!
EDIT - just wanted to highlight the error from one suggested comment:
> read.table(pipe("ps aux | grep ssh"))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 34 elements
> pipe("ps aux | grep ssh")
description class mode text opened can read can write
"ps aux | grep ssh" "pipe" "r" "text" "closed" "yes" "yes"

First redirect your output to an actual text file. The redirection has to happen inside the shell command string, not at the R level:
> system("ps aux | grep ssh > output.txt")
Then read in this file into R using read.table:
df.output <- read.table(file="output.txt", header=FALSE, sep="")
Note: Using sep="" (which is the default for read.table actually) will treat any type/amount of whitespace as a delimeter between columns. This should cover the output you are getting from your call to Linux.

You can get a little closer (to a character vector) with intern=TRUE:
as.data.frame(system("ps aux | grep ssh", intern=TRUE))
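To go all the way to a data frame without a temporary file, the captured character vector can be fed to read.table() via its text argument. A minimal sketch; fill = TRUE pads rows with fewer fields, which should also address the "line 1 did not have 34 elements" error from the EDIT (the COMMAND column of ps contains a varying number of space-separated words):
# Capture the shell output as a character vector, one element per line
lines <- system("ps aux | grep ssh", intern = TRUE)
# Parse on whitespace; fill = TRUE tolerates rows with differing field counts
df <- read.table(text = lines, header = FALSE, fill = TRUE,
                 stringsAsFactors = FALSE)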

Related

BLAST+ exits with error exit status (2) when using nextflow

I'm using Nextflow to analyse MinION data. BLAST+ terminates with an error exit status (2), Command exit status: 2 and Command output: (empty):
-HP-Z6-G4-Workstation:~/nextflow_pipelines/nf_pipeline/20221025_insect$ nextflow cat_working_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `cat_working_nextflow.nf` [admiring_hopper] DSL1 - revision: 2916bc12af
executor > local (78)
[38/2d0584] process > concatinate (AIG363_pass_barcode01_0eb3c2c3_2.fastq) [100%] 38 of 38 ✔
[dd/3cabdf] process > fastqconvert (output.fastq) [100%] 38 of 38 ✔
[47/dab2cd] process > blast_raw (insect.fasta) [ 0%] 0 of 38
executor > local (78)
[38/2d0584] process > concatinate (AIG363_pass_barcode01_0eb3c2c3_2.fastq) [100%] 38 of 38 ✔
[dd/3cabdf] process > fastqconvert (output.fastq) [100%] 38 of 38 ✔
[47/dab2cd] process > blast_raw (insect.fasta) [ 2%] 1 of 37, failed: 1
Error executing process > 'blast_raw (insect.fasta)'
Caused by:
Process `blast_raw (insect.fasta)` terminated with an error exit status (2)
Command executed:
blastn -query insect.fasta -db /home/blast/nt_db_20221011/nt -outfmt 11 -out blastrawreads.asn -evalue 0.1 -num_alignments 1
blast_formatter -archive blastrawreads.asn -outfmt 5 -out blastrawreads.xml
blast_formatter -archive blastrawreads.asn -outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out blastrawreads_unsort.tsv
sort -n -r -k 6 blastrawreads_unsort.tsv > blastrawreads.tsv
Command exit status:
2
Command output:
(empty)
Command error:
Warning: [blastn] Examining 5 or more matches is recommended
BLAST Database error: No alias or index file found for nucleotide database [/home/blast/nt_db_20221011/nt] in search path [/home/nextflow_pipelines/nf_pipeline/20221025_insect/work/96/e885b7e53e1bcf30e33526265e9a3c::]
Work dir:
/home/nextflow_pipelines/nf_pipeline/20221025_insect/work/96/e885b7e53e1bcf30e33526265e9a3c
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
The nf file:
#!/usr/bin/env nextflow

//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
dataset = Channel.fromPath(params.in)
params.db = "/home/blast/nt_db_20221011/nt"

process concatenate {
    tag "$x"
    publishDir "${params.outdir}", mode:'copy'

    input:
    path (x) from dataset

    output:
    path ("output.fastq") into cat_ch

    script:
    """
    cat $x > output.fastq
    """
}

process fastqconvert {
    tag "$y"
    publishDir "${params.outdir}", mode:'copy'

    input:
    path (y) from cat_ch

    output:
    path ("insect.fasta") into convert1_ch,convert2_ch,convert3_ch

    script:
    """
    seqtk seq -a $y > insect.fasta
    """
}

process blast_raw {
    tag "$z"
    publishDir "${params.outdir}", mode:'copy'

    input:
    path (z) from convert1_ch

    output:
    path ('blastrawreads.xml') into blastrawreads_xml_ch

    script:
    """
    blastn \
        -query $z -db ${params.db} \
        -outfmt 11 -out blastrawreads.asn \
        -evalue 0.1 \
        -num_alignments 1 \
    blast_formatter \
        -archive blastrawreads.asn \
        -outfmt 5 -out blastrawreads.xml
    blast_formatter \
        -archive blastrawreads.asn \
        -outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out blastrawreads_unsort.tsv
    sort -n -r -k 6 blastrawreads_unsort.tsv > blastrawreads.tsv
    """
}
I can see that the insect.fasta file has been produced, has the appropriate permissions, and is located in the expected directory.
I used the following command to download the nt database:
update_blastdb.pl --decompress nt --passive --source gcp
(gcp is the Google Cloud source; I'm in Australia.)
The nt database is ~26G in size.
I really need an Excel, ASN and FASTA file from the BLAST results for downstream analysis.
Any help would be much appreciated.
BLAST Database error: No alias or index file found for nucleotide
database [/home/blast/nt_db_20221011/nt]
I think you should be able to re-create the above error independently of Nextflow using:
blastdbcmd -db /home/blast/nt_db_20221011/nt -info
Note that the db argument must be a dbname, not a path. For /home/blast/nt_db_20221011/nt to work correctly, you should be able to list your db files using: ls /home/blast/nt_db_20221011/nt.*
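For reference, a healthy multi-volume nt download contains an alias file plus numbered volume files, so the listing should look roughly like this (file names are illustrative; the exact volume count varies by release):
ls /home/blast/nt_db_20221011/
# nt.nal  nt.00.nhr  nt.00.nin  nt.00.nsq  nt.01.nhr  nt.01.nin  nt.01.nsq  ...
If no such files are present, the download is incomplete or was extracted somewhere else.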
Not sure if there's a typo in your question, but the size of the nt database is about an order of magnitude larger, at approximately 250G. I wonder if simply re-downloading the database fixes the problem? Note that you can get a list of BLAST databases (showing their sizes and dates last updated) using:
update_blastdb.pl --showall pretty --source gcp
Note also that DSL1 is now end-of-life and will be removed going forward. I strongly recommend migrating to DSL2 syntax when you get a chance.
From the comments:
The problem is that when you use params to specify a path, the path or files specified will not be localized inside the process working directory when the task is run. What you want is just a second input (value) channel. For example, using DSL2 syntax:
params.db = "/home/blast/Geminiviridae_db_20221118/geminiviridae"
process blast_raw {
    tag { query_fasta }

    input:
    path query_fasta
    path db

    output:
    path "geminiviridae.xml"

    """
    blastn \\
        -query "${query_fasta}" \\
        -db "${db}" \\
        -max_target_seqs 10 \\
        -outfmt 5 \\
        -out "geminiviridae.xml"
    """
}

workflow {
    db = file( params.db )
    blast_raw( your_upstream_ch, db )
}

Bash - Pipe first line into variable and display the rest without using a temp file

I receive some JSON that I process until it becomes just text lines. The first line contains a value that I would like to keep in a variable, and everything after the first line should be displayed with less or other utilities.
Can I do this without using a temporary file?
The context is this:
aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" |
jq '{message:.nextForwardToken}, .events[] | .message' |
sed 's/^"//g' | sed 's/"$//g'
In the first line there's the nextForwardToken that I want to put in the variable and all the rest is log messages.
The json looks like this:
{
  "events": [
    {
      "timestamp": 1518081460955,
      "ingestionTime": 1518081462998,
      "message": "08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]"
    },
    {
      "timestamp": 1518081461095,
      "ingestionTime": 1518081462998,
      "message": "08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]"
    },
    {
      "timestamp": 1518082421586,
      "ingestionTime": 1518082423001,
      "message": "08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621"
    }
  ],
  "nextBackwardToken": "b/33854347851370569899844322814554152895248902123886870536",
  "nextForwardToken": "f/33854369274157730709515363051725446974398055862891970561"
}
I need to put in a variable this:
f/33854369274157730709515363051725446974398055862891970561
and display (or put in an other variable) the messages:
08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]
08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]
08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621
Thanks in advance for your help.
You might consider it a bit of a trick, but you can use tee to pipe all the output to stderr and fetch the one line you want for your variable with head:
var="$(command | tee /dev/stderr | head -n 1)"
Or you can solve this with a bit of scripting:
first=true
while read -r line; do
    if $first; then
        first=false
        var="$line"
    fi
    echo "$line"
done < <(command)
If you are interested in storing the contents in variables, use mapfile, or read on older bash versions.
Just use read to get the first line. I've added the -r flag to jq to print the output without quotes:
read -r token < <(aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" | jq -r '{message:.nextForwardToken}, .events[] | .message')
printf '%s\n' "$token"
Or using mapfile
mapfile -t output < <(aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" | jq -r '{message:.nextForwardToken}, .events[] | .message')
and loop through the array. The first element will always contain the token-id you want.
printf '%s\n' "${output[0]}"
Rest of the elements can be iterated over,
for ((i=1; i<${#output[@]}; i++)); do
printf '%s\n' "${output[i]}"
done
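Alternatively, bash's array slicing gives you everything but the first element without an explicit loop; a one-line sketch over the same output array:
printf '%s\n' "${output[@]:1}"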
Straightforwardly:
aws logs get-log-events --log-group-name "$logGroup" \
--log-stream-name "$logStreamName" --limit "$logSize" > /tmp/log_data
-- set nextForwardToken variable:
nextForwardToken=$(jq -r '.nextForwardToken' /tmp/log_data)
echo $nextForwardToken
f/33854369274157730709515363051725446974398055862891970561
-- print all message items:
jq -r '.events[].message' /tmp/log_data
08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]
08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]
08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621
I believe the following meets the stated requirements, assuming a bash-like environment:
x=$(aws ... |
tee >(jq -r '.events[] | .message' >&2) |
jq -r .nextForwardToken) 2>&1
This makes the item of interest available as the shell variable $x.
Notice that the string manipulation using sed can be avoided by using the -r command-line option of jq.
Calling jq just once
x=$(aws ... |
jq -r '.nextForwardToken, (.events[] | .message)' |
tee >(tail -n +2 >&2) |
head -n 1) 2>&1
echo "x=$x"

get text from url list to csv

I need to retrieve text from a list of URLs.
I have a CSV (about 150,000 rows) with an ID and a URL. At each URL there is just plain text, without HTML code.
I need to write this text to a CSV together with the ID from the input CSV.
Is this possible with wget, for example?
Input CSV
9788075020536|http://pemic-books.cz/ASPX/Annotation.aspx?kod=0180853
Output CSV
9788075020536|Učebnice je dílem kolektivu autorů katedry ústavního práva Právnické fakulty Univerzity Karlovy v Praze a externích spolupracovníků. V souladu s tradičním pojetím ústavního práva je obecná státověda podávána jako jeho vstupní a neoddělitelná součást. Kniha je reprintem původního vydání z roku 1998, v nakladatelství Leges vychází poprvé. Na učebnici navazuje Ústavní právo a státověda, 2. díl, Ústavní právo České republiky, který byl vydání nakladatelstvím Leges v roce 2011
Suppose you have the following curlcsv file contents:
0001|columnbefore1|https://www.random.org/integers/?num=1&min=1&max=2&col=1&base=10&format=plain&rnd=new|columnafter1
0002|columnbefore2|https://www.random.org/integers/?num=1&min=3&max=4&col=1&base=10&format=plain&rnd=new|columnafter2
0003|columnbefore3|https://www.random.org/integers/?num=1&min=5&max=6&col=1&base=10&format=plain&rnd=new|columnafter3
Here are the "one-liners" you could use:
gawk:
gawk '{
    match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
    req = "curl -s \"" arr[3] "\"";
    req | getline res;
    print arr[1] res arr[4];
}' curlcsv > result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column)
([^|]+) - get the contents of the url column
([|][^|]+)* - save the rest column values
result file would look like this:
0001|columnbefore1|2|columnafter1
0002|columnbefore2|3|columnafter2
0003|columnbefore3|5|columnafter3
This approach would hit limits on open files, since each distinct curl command opens a pipe that is never closed (see JaromírHeimlich's comment below).
solution to this limitation problem could be:
split -l 100 curlcsv && ls | grep -v curlcsv | xargs -n 1 gawk '{
    match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
    req = "curl -s \"" arr[3] "\"";
    req | getline res;
    print arr[1] res arr[4];
}' >> ../result
place curlcsv to an empty folder since split would create a lot of partial lists in that directory.
sed:
cat curlcsv | sed -e 's/^\(\([^|]\+[|]\)\{2\}\)\([^|]\+\)\([|][^|]\+\)*$/echo "\1"$(curl -s "\3")"\4"/' | bash >result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column)
In this example sed constructs the bash script to get the result.
Since this solution generates bash commands, it doesn't have the limitation problem of the gawk solution.
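With only the two columns from the question (ID|URL), a plain bash loop may be the simplest alternative. A minimal sketch, assuming input.csv and output.csv as file names (my choice, not from the question):
# Read the two pipe-separated fields of each row, fetch the URL body,
# and write ID|text to the output CSV
while IFS='|' read -r id url; do
    text=$(curl -s "$url")
    printf '%s|%s\n' "$id" "$text"
done < input.csv > output.csv
This fetches one URL at a time; for 150,000 rows you would probably want to parallelize the downloads, e.g. with xargs -P.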

Only one command line in PROJ.4

I would like to know if there is a way to write only one command line to obtain the expected results. Let me explain:
When you write this :
$ proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
if you want to receive this output data:
500000.000000 4427757.218739
you must write the input data on another line:
-105 40
Is it possible to write a concatenated command line in this style?
$ proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f | -105 40
Thank you
I also ran into this problem and found the solution:
echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
That should do the trick.
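Since proj keeps reading coordinate pairs from standard input, the same idea extends to several points with a here-document; a small sketch:
proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f <<EOF
-105 40
-100 35
EOF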
If you need to do this e.g. from within C#, the command you'd use is this:
cmd.exe /c echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
Note: you may need to double up the % as the command processor interprets this as a variable.
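Concretely, the doubled form would be (an untested sketch of the same command; %% is the usual way to get a literal % past the command processor):
cmd.exe /c echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %%12.6f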

Awk with function calling error in redirection

awk -F "|" 'function decToBin(dec) { printf "ibase=10; obase=2; $dec" | bc; } BEGIN {print $3" "$4" "$5" "$6" "$7" "decToBin($8)}' $Input
where Input is the path to file having
1|2|1.00|0.46|0.44|1.12|49.88|3
2|2|1.00|0.45|0.55|1.13|50.12|11
It was working correctly without the function call, but after introducing decToBin() it gives this error:
awk: fatal: expression for `|' redirection has null string value
I'm stuck and don't know how to fix this. Any help would be appreciated.
myawkscript.awk:
function decToBin(dec) {
    cmd = "echo 'ibase=10; obase=2;" dec ";' | bc"
    cmd | getline var
    close(cmd)    # close the pipe so a fresh bc runs for each row
    return var
}
{ print $3" "$4" "$5" "$6" "$7" "decToBin($8) }
Then
gawk -F"|" -f myawkscript.awk myfile
Gives you
1.00 0.46 0.44 1.12 49.88 11
1.00 0.45 0.55 1.13 50.12 1011
as expected
This can't work. bc is taken as the name of a variable, not the name of a command (awk doesn't behave like the shell). As the variable is undefined, it is treated as the null string. You must quote the command name:
$ awk 'BEGIN {printf "1+1\n" | "bc"}' /dev/null
2
In addition, $dec is not the value of dec, but the value of field number dec. Again, awk is not the shell. What you want is rather something like this:
$ awk 'BEGIN {dec = 21; printf "%d+%d\n", dec, dec | "bc"}' /dev/null
42
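Putting the two corrections together, the original one-liner could be rewritten along these lines (a sketch; note the print moves out of BEGIN, since fields are only available while a record is being processed, and close() resets the pipe for every row):
awk -F'|' '{
    cmd = "echo \"ibase=10; obase=2; " $8 "\" | bc"
    cmd | getline bin
    close(cmd)
    print $3, $4, $5, $6, $7, bin
}' "$Input"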