On macOS, I am using the system() function to run terminal commands from the R console as part of a script I've written. The script requires connecting to a MySQL database through an SSH tunnel, and I type "ps aux | grep ssh" at the command line to see which tunnels I am connected to. For example, some output:
> system("ps aux | grep ssh")
Home 50915 0.0 0.0 2501204 3264 ?? S 10:32AM server info
Home 50092 0.0 0.0 2504172 3048 ?? Ss 9:35AM server2 info
Home 50090 0.0 0.0 2501372 480 ?? Ss 9:35AM server3 info
Home 1155 0.0 0.0 2544220 1368 ?? S Thu07PM server4 info
Home 51333 0.0 0.0 2434840 800 ?? S 11:00AM 0:00.00 grep ssh
Home 51331 0.0 0.0 2438508 1124 ?? S 11:00AM 0:00.00 sh -c ps aux | grep ssh
I would like to turn this output into a data frame, but cannot. Calls like as.data.frame(system("ps aux | grep ssh")) do not work the way I would hope.
Any thoughts on this would be appreciated!
EDIT - just wanted to highlight the error from one suggested comment:
> read.table(pipe("ps aux | grep ssh"))
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1 did not have 34 elements
> pipe("ps aux | grep ssh")
description class mode text opened can read can write
"ps aux | grep ssh" "pipe" "r" "text" "closed" "yes" "yes"
First redirect your output to an actual text file (the redirection must happen inside the shell command string):
> system("ps aux | grep ssh > output.txt")
Then read this file into R using read.table:
df.output <- read.table(file="output.txt", header=FALSE, sep="")
Note: Using sep="" (which is actually the default for read.table) treats any type/amount of whitespace as a delimiter between columns. This should cover the output you are getting from your shell call.
You can get a little closer (to a character vector) with intern=TRUE:
as.data.frame(system("ps aux | grep ssh", intern=TRUE))
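Going one step further, a minimal sketch (untested) that parses the captured vector into a data frame via read.table's text argument; fill = TRUE pads short rows, since rows can have differing element counts (the cause of the "34 elements" error quoted above):
out <- system("ps aux | grep ssh", intern = TRUE)  # character vector, one element per line
df.ssh <- read.table(text = out, header = FALSE, sep = "", fill = TRUE, stringsAsFactors = FALSE)
Note that the trailing COMMAND field itself contains spaces, so its words are spread across the last columns.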
I'm using Nextflow to analyse MinION data. BLAST+ terminates with error exit status (2), Command exit status: 2 and Command output: (empty).
-HP-Z6-G4-Workstation:~/nextflow_pipelines/nf_pipeline/20221025_insect$ nextflow run cat_working_nextflow.nf
N E X T F L O W ~ version 22.04.5
Launching `cat_working_nextflow.nf` [admiring_hopper] DSL1 - revision: 2916bc12af
executor > local (78)
[38/2d0584] process > concatinate (AIG363_pass_barcode01_0eb3c2c3_2.fastq) [100%] 38 of 38 ✔
[dd/3cabdf] process > fastqconvert (output.fastq) [100%] 38 of 38 ✔
[47/dab2cd] process > blast_raw (insect.fasta) [ 0%] 0 of 38
executor > local (78)
[38/2d0584] process > concatinate (AIG363_pass_barcode01_0eb3c2c3_2.fastq) [100%] 38 of 38 ✔
[dd/3cabdf] process > fastqconvert (output.fastq) [100%] 38 of 38 ✔
[47/dab2cd] process > blast_raw (insect.fasta) [ 2%] 1 of 37, failed: 1
Error executing process > 'blast_raw (insect.fasta)'
Caused by:
Process `blast_raw (insect.fasta)` terminated with an error exit status (2)
Command executed:
blastn -query insect.fasta -db /home/blast/nt_db_20221011/nt -outfmt 11 -out blastrawreads.asn -evalue 0.1 -num_alignments 1
blast_formatter -archive blastrawreads.asn -outfmt 5 -out blastrawreads.xml
blast_formatter -archive blastrawreads.asn -outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out blastrawreads_unsort.tsv
sort -n -r -k 6 blastrawreads_unsort.tsv > blastrawreads.tsv
Command exit status:
2
Command output:
(empty)
Command error:
Warning: [blastn] Examining 5 or more matches is recommended
BLAST Database error: No alias or index file found for nucleotide database [/home/blast/nt_db_20221011/nt] in search path [/home/nextflow_pipelines/nf_pipeline/20221025_insect/work/96/e885b7e53e1bcf30e33526265e9a3c::]
Work dir:
/home/nextflow_pipelines/nf_pipeline/20221025_insect/work/96/e885b7e53e1bcf30e33526265e9a3c
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
The nf file:
#!/usr/bin/env nextflow
//data_location
params.outdir = './results'
params.in = "$PWD/*.fastq"
dataset = Channel.fromPath(params.in)
params.db = "/home/blast/nt_db_20221011/nt"
process concatenate {
tag "$x"
publishDir "${params.outdir}", mode:'copy'
input:
path (x) from dataset
output:
path ("output.fastq") into cat_ch
script:
"""
cat $x > output.fastq
"""
}
process fastqconvert {
tag "$y"
publishDir "${params.outdir}", mode:'copy'
input:
path (y) from cat_ch
output:
path ("insect.fasta") into convert1_ch,convert2_ch,convert3_ch
script:
"""
seqtk seq -a $y > insect.fasta
"""
}
process blast_raw {
tag "$z"
publishDir "${params.outdir}", mode:'copy'
input:
path (z) from convert1_ch
output:
path ('blastrawreads.xml') into blastrawreads_xml_ch
script:
"""
blastn \
-query $z -db ${params.db} \
-outfmt 11 -out blastrawreads.asn \
-evalue 0.1 \
-num_alignments 1 \
blast_formatter \
-archive blastrawreads.asn \
-outfmt 5 -out blastrawreads.xml
blast_formatter \
-archive blastrawreads.asn \
-outfmt "6 qaccver saccver pident length evalue bitscore stitle" -out blastrawreads_unsort.tsv
sort -n -r -k 6 blastrawreads_unsort.tsv > blastrawreads.tsv
"""
}
I can see that the insect.fasta file has been produced, has the appropriate permissions, and is located in the expected directory.
I used the following command to download the nt database
update_blastdb.pl --decompress nt --passive --source gcp
gcp is the Google Cloud source in Australia.
The nt database is ~26 GiB in size.
I really need an Excel, ASN.1 and FASTA file from the BLAST results for downstream analysis.
Any help would be much appreciated.
BLAST Database error: No alias or index file found for nucleotide
database [/home/blast/nt_db_20221011/nt]
I think you should be able to re-create the above error independently of Nextflow using:
blastdbcmd -db /home/blast/nt_db_20221011/nt -info
Note that the db argument must be a dbname, not a path. For /home/blast/nt_db_20221011/nt to work correctly, you should be able to list your db files using: ls /home/blast/nt_db_20221011/nt.*
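For example, an intact multi-volume nucleotide database should show volume index files plus an nt.nal alias file, roughly like this (a sketch; the exact set of extensions varies with the BLAST version):
ls /home/blast/nt_db_20221011/nt.*
nt.00.nhr  nt.00.nin  nt.00.nsq  nt.01.nhr  ...  nt.nal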
Not sure if there's a typo in your question, but the size of the nt database is about an order of magnitude larger, at approximately 250G. I wonder if simply re-downloading the database fixes the problem? Note that you can get a list of BLAST databases (showing their sizes and dates last updated) using:
update_blastdb.pl --showall pretty --source gcp
Note also that DSL1 is now end-of-life and will be removed going forward. I strongly recommend migrating to DSL2 syntax when you get a chance.
From the comments:
The problem is that when you use params to specify a path, the path or files specified will not be localized inside the process working directory when the task is run. What you want is just a second input (value) channel. For example, using DSL2 syntax:
params.db = "/home/blast/Geminiviridae_db_20221118/geminiviridae"
process blast_raw {
tag { query_fasta }
input:
path query_fasta
path db
output:
path "geminiviridae.xml"
"""
blastn \\
-query "${query_fasta}" \\
-db "${db}" \\
-max_target_seqs 10 \\
-outfmt 5 \\
-out "geminiviridae.xml"
"""
}
workflow {
db = file( params.db )
blast_raw( your_upstream_ch, db)
}
I receive some JSON that I process until it becomes just text lines. In the first line there's a value that I would like to keep in a variable, and everything after the first line should be displayed with less or other utilities.
Can I do this without using a temporary file?
The context is this:
aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" |
jq '{message:.nextForwardToken}, .events[] | .message' |
sed 's/^"//g' | sed 's/"$//g'
In the first line there's the nextForwardToken that I want to put in the variable; all the rest is log messages.
The json looks like this:
{
"events": [
{
"timestamp": 1518081460955,
"ingestionTime": 1518081462998,
"message": "08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]"
},
{
"timestamp": 1518081461095,
"ingestionTime": 1518081462998,
"message": "08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]"
},
{
"timestamp": 1518082421586,
"ingestionTime": 1518082423001,
"message": "08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621"
}
],
"nextBackwardToken": "b/33854347851370569899844322814554152895248902123886870536",
"nextForwardToken": "f/33854369274157730709515363051725446974398055862891970561"
}
I need to put in a variable this:
f/33854369274157730709515363051725446974398055862891970561
and display (or put in an other variable) the messages:
08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]
08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]
08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621
Thanks in advance for your help.
You might consider it a bit of a trick, but you can use tee to copy all the output to stderr and fetch the one line you want for your variable with head:
var="$(command | tee /dev/stderr | head -n 1)"
Or you can solve this with a bit of scripting:
first=true
while read -r line; do
if $first; then
first=false
var="$line"
fi
echo "$line"
done < <(command)
If you are interested in storing the contents in variables, use mapfile (or read on older bash versions), as sketched below.
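For example, a minimal sketch with mapfile (bash 4+), where command stands in for your aws ... | jq pipeline:
mapfile -t lines < <(command)    # one array element per output line
var="${lines[0]}"                # first line goes into the variable
printf '%s\n' "${lines[@]:1}"    # remaining lines for display (or pipe to less)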
Just use read to get the first line. I've added the -r flag to jq to print output without quotes:
read -r token < <(aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" | jq -r '{message:.nextForwardToken}, .events[] | .message')
printf '%s\n' "$token"
Or using mapfile
mapfile -t output < <(aws logs get-log-events --log-group-name "$logGroup" --log-stream-name "$logStreamName" --limit "$logSize" | jq -r '{message:.nextForwardToken}, .events[] | .message')
and loop through the array. The first element will always contain the token-id you want.
printf '%s\n' "${output[0]}"
The rest of the elements can be iterated over:
for ((i=1; i<${#output[@]}; i++)); do
printf '%s\n' "${output[i]}"
done
Straightforwardly:
aws logs get-log-events --log-group-name "$logGroup" \
--log-stream-name "$logStreamName" --limit "$logSize" > /tmp/log_data
-- set nextForwardToken variable:
nextForwardToken=$(jq -r '.nextForwardToken' /tmp/log_data)
echo $nextForwardToken
f/33854369274157730709515363051725446974398055862891970561
-- print all message items:
jq -r '.events[].message' /tmp/log_data
08.02.2018 09:17:40.955 [SimpleAsyncTaskExecutor-138] INFO o.s.b.c.l.support.SimpleJobLauncher - Job: [SimpleJob: [name=price-update]] launched with the following parameters: [{time=1518081460875, sku=N-W7ZLH9U737B|N-XIBH22XQE87|N-3EXIRFNYNW0|N-U19C031D640|N-6TQ1847FQE6|N-NF0XCNG0029|N-UJ3H0OZROCQ|N-W2JKJD4S6YP|N-VEMA4QVV3X1|N-F40J6P2VM01|N-VIT7YEAVYL2|N-PKLKX1PAUXC|N-VPAK74C75DP|N-C5BLYC5HQRI|N-GEIGFIBG6X2|N-R0V88ZYS10W|N-GQAF3DK7Y5Z|N-9EZ4FDDSQLC|N-U15C031D668|N-B8ELYSSFAVH}]
08.02.2018 09:17:41.095 [SimpleAsyncTaskExecutor-138] INFO o.s.batch.core.job.SimpleStepHandler - Executing step: [index salesprices]
08.02.2018 09:33:41.586 [upriceUpdateTaskExecutor-3] DEBUG e.u.d.a.j.d.b.StoredMasterDataReader - Reading page 1621
I believe the following meets the stated requirements, assuming a bash-like environment:
x=$(aws ... |
tee >(jq -r '.events[] | .message' >&2) |
jq -r .nextForwardToken) 2>&1
This makes the item of interest available as the shell variable $x.
Notice that the string manipulation using sed can be avoided by using the -r command-line option of jq.
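For instance, if the sample JSON above were saved as log.json (hypothetical filename), the difference is:
jq '.nextForwardToken' log.json     prints "f/33854369274157730709515363051725446974398055862891970561"
jq -r '.nextForwardToken' log.json  prints f/33854369274157730709515363051725446974398055862891970561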
Calling jq just once
x=$(aws ... |
jq -r '.nextForwardToken, (.events[] | .message)' |
tee >(tail -n +2 >&2) |
head -n 1) 2>&1
echo "x=$x"
I need to retrieve text from a list of URLs.
I have a CSV (about 150,000 rows) with an ID and a URL. At each URL there is just plain text without HTML code.
I need to write this text to a CSV together with the ID from the input CSV.
Is this possible with wget, for example?
Input CSV
9788075020536|http://pemic-books.cz/ASPX/Annotation.aspx?kod=0180853
Output CSV
9788075020536|Učebnice je dílem kolektivu autorů katedry ústavního práva Právnické fakulty Univerzity Karlovy v Praze a externích spolupracovníků. V souladu s tradičním pojetím ústavního práva je obecná státověda podávána jako jeho vstupní a neoddělitelná součást. Kniha je reprintem původního vydání z roku 1998, v nakladatelství Leges vychází poprvé. Na učebnici navazuje Ústavní právo a státověda, 2. díl, Ústavní právo České republiky, který byl vydání nakladatelstvím Leges v roce 2011
Suppose your curlcsv file has the following contents:
0001|columnbefore1|https://www.random.org/integers/?num=1&min=1&max=2&col=1&base=10&format=plain&rnd=new|columnafter1
0002|columnbefore2|https://www.random.org/integers/?num=1&min=3&max=4&col=1&base=10&format=plain&rnd=new|columnafter2
0003|columnbefore3|https://www.random.org/integers/?num=1&min=5&max=6&col=1&base=10&format=plain&rnd=new|columnafter3
Here are the "one-liners" you could use:
gawk:
gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
print arr[1]""res""arr[4];
}
' curlcsv >result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column)
([^|]+) - get the contents of the url column
([|][^|]+)* - save the rest column values
result file would look like this:
0001|columnbefore1|2|columnafter1
0002|columnbefore2|3|columnafter2
0003|columnbefore3|5|columnafter3
This approach will hit limits on open files (see JaromírHeimlich's comment below). A solution to this limitation could be:
split -l 100 curlcsv && ls | grep -v curlcsv | xargs -n 1 gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
print arr[1]""res""arr[4];
}
' >>../result
Place curlcsv in an empty folder, since split will create a lot of partial lists in that directory.
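For reference, another way around the limitation (a sketch, not part of the original answer): close() each curl pipe after reading from it, which frees the file descriptor without splitting the input:
gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
close(req);
print arr[1]""res""arr[4];
}
' curlcsv >result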
sed:
cat curlcsv | sed -e 's/^\(\([^|]\+[|]\)\{2\}\)\([^|]\+\)\([|][^|]\+\)*$/echo "\1"$(curl -s "\3")"\4"/' | bash >result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column)
In this example sed constructs the bash script to get the result.
Since this solution generates bash commands, it doesn't have the limitation problem of the gawk solution.
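To illustrate, the first sample row expands into a command like this (derived mechanically from the substitution above):
echo "0001|columnbefore1|"$(curl -s "https://www.random.org/integers/?num=1&min=1&max=2&col=1&base=10&format=plain&rnd=new")"|columnafter1"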
I would like to know if there is a way to obtain the expected results by writing only one command line. Let me explain:
When you write this:
$ proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
If you want to receive the output data:
500000.000000 4427757.218739
you must then type the input data on another line:
-105 40
Is it possible to write a concatenated command line in this style?:
$ proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f | -105 40
Thank you
I also ran into this problem and found the solution:
echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
That should do the trick.
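Multiple points can be converted the same way, one coordinate pair per line (a small sketch):
printf '%s\n' "-105 40" "-104 41" | proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f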
If you need to do this e.g. from within C#, the command you'd use is this:
cmd.exe /c echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %12.6f
Note: you may need to double up the % since the command processor interprets it as a variable marker.
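For example, inside a batch file the doubled form would look like this (a sketch):
cmd.exe /c "echo -105 40 | proj +proj=utm +zone=13 +ellps=WGS84 -f %%12.6f"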
awk -F "|" 'function decToBin(dec) { printf "ibase=10; obase=2; $dec" | bc; } BEGIN {print $3" "$4" "$5" "$6" "$7" "decToBin($8)}' $Input
where $Input is the path to a file containing
1|2|1.00|0.46|0.44|1.12|49.88|3
2|2|1.00|0.45|0.55|1.13|50.12|11
It was working correctly without the function call, but after introducing decToBin() it fails with the following error:
awk: fatal: expression for `|' redirection has null string value
I'm stuck and don't know how to fix it. Any help would be appreciated.
myawkscript.awk:
function decToBin(dec) {
    # build a shell command that pipes the expression into bc
    cmd = "echo 'ibase=10; obase=2;" dec ";' | bc";
    cmd | getline var   # read bc's answer back into var
    close(cmd)          # close the pipe so file descriptors are not leaked
    return var
}
// { print $3" "$4" "$5" "$6" "$7" "decToBin($8) }   # empty pattern: runs for every input line
Then
gawk -F"|" -f myawkscript.awk myfile
Gives you
1.00 0.46 0.44 1.12 49.88 11
1.00 0.45 0.55 1.13 50.12 1011
as expected
This can't work. bc is taken as the name of a variable, not the name of a command (awk doesn't behave like the shell). As the variable is undefined, it is treated as the null string. You must quote the command name:
$ awk 'BEGIN {printf "1+1\n" | "bc"}' /dev/null
2
In addition, $dec is not the value of dec, but the value of field number dec. Again, awk is not the shell. What you rather want is something like this:
$ awk 'BEGIN {dec = 21; printf "%d+%d\n", dec, dec | "bc"}' /dev/null
42