Local blasting (16S rRNA) - taxonomy annotations from NCBI different from Silva - blast

I am performing local blasting against both the NCBI database and the Silva database using these commands:
blastn -db db/16SMicrobial -query input.fa -out outputNCBI.csv -task blastn -dust no -max_target_seqs 1 -outfmt "10 pident evalue bitscore score sseqid stitle staxids scomnames sscinames sskingdoms"
blastn -db db/Silva -query input.fa -out outputSilva.csv -task blastn -dust no -max_target_seqs 1 -outfmt "10 pident evalue bitscore score sseqid stitle staxids scomnames sscinames sskingdoms"
The problem comes with the results: the annotated taxonomy is completely different. I have manually checked some of the sequences by blasting them online, and it seems that Silva is giving the right results, i.e. the local Silva results coincide with the online NCBI results.
Both databases are in the same directory. This is what's in it:
#:~/Desktop/db$ ls -a
16SMicrobial.nni Silva_119.fasta taxdb.btd
16SMicrobial.nog Silva_123.fasta taxdb.bti
16SMicrobial.nhr 16SMicrobial.nsd Silva.nhr tax_slv_ssu_123.txt
16SMicrobial.nin 16SMicrobial.nsi Silva.nin
16SMicrobial.nnd 16SMicrobial.nsq Silva.nsq
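For reference, one way to compare what each database actually stores for the same hit is blastdbcmd from the BLAST+ suite (a sketch; the accessions are placeholders, and a %T of 0 would mean the database was built without a taxid map, which would explain divergent staxids):
# print accession, taxid and title for a given entry in each database
blastdbcmd -db db/16SMicrobial -entry SOME_ACCESSION -outfmt "%a %T %t"
blastdbcmd -db db/Silva -entry SOME_SILVA_ID -outfmt "%a %T %t"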
Does anyone have any idea what could be the cause?
I would appreciate any help.
Greets,
Irina

How to fetch JSON/XML data using regex

I am new to regex and Linux, and I want to know how to fetch JSON/XML data using regex from a Linux terminal. I am a Windows user, so I am currently working on the Windows Subsystem for Linux. I know I should use jq for JSON, but the requirement is to use regex; the pattern will be the same in every report, so regex can be used even though it is not really recommended.
I want to know two things:
How I can test my regex in the Windows Subsystem for Linux.
How I can add it to the shell script; as of now I am using jq.
This is how I am using jq to fetch data in the shell script:
cat abc.json | jq -r '.url'
So how can I achieve the same thing using regex?
My abc.json is as below:
{"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24}
I tried this in the Windows Subsystem for Linux:
sed -E '(url|title|tags)":"((\"|[^"])*)' abc.json
and got this error:
sed: -e expression #1, char 1: unknown command: `('
Expected output
\\"url\\":\\"http://www.netcharles.com/orwell/essays.htm\\",\\"domain\\":\\"netcharles.com\\",\\"title\\":\\"Orwell Essays & Journalism Section - Charles\\' George Orwell Links\\",\\"tags\\":[\\"orwell\\",\\"writing\\",\\"literature\\",\\"journalism\\",\\"essays\\",\\"politics\\",\\"essay\\",\\"reference\\",\\"language\\",\\"toread\\"],\\"index\\":2931,\\"time_created\\":1345419323,\\"num_saves\\":24
Or could someone please tell me what the regex would be for accessing something like this? (The value of first_name will be a string.)
cat user.json | jq -r '.user_data.username.first_name'
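For what it's worth, the sed error above happens because the expression is a bare pattern rather than a sed command such as s///. A minimal sketch of pulling one field out with regex (assuming GNU grep with PCRE support; only the url key from the sample is shown):
# \K drops everything matched so far, leaving just the value
grep -oP '"url":"\K[^"]*' abc.json
# or with sed, printing only the captured group
sed -nE 's/.*"url":"([^"]*)".*/\1/p' abc.json
For nested keys like user_data.username.first_name this gets fragile quickly, which is exactly why jq is usually recommended.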

Sorting a file with 1.8 million records using a script

I am trying to remove identical lines in a file of 1.8 million records and create a new file, using the following command:
sort tmp1.csv | uniq -c | sort -nr > tmp2.csv
Running the script creates a new file, sort.exe.stackdump, with the following information:
"Exception: STATUS_ACCESS_VIOLATION at rip=00180144805
..
..
program=C:\cygwin64\bin\sort.exe, pid 6136, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B"
The script works for a small file with 10 lines. It seems like sort.exe cannot handle so many records. How do I work with such a large file of more than 1.8 million records? We do not have any database other than Access, and I was trying to do this manually in Access.
It sounds like your sort command is broken. Since the path says Cygwin, I'm assuming this is GNU sort, which generally should have no problem with this task, given sufficient memory and disk space. Try playing with flags to adjust where and how much it uses the disk: http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
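For example (a sketch of the flags that page describes; the temp path and buffer size are illustrative):
# -T: put temporary files on a disk with room; -S: cap the in-memory buffer
sort -T /cygdrive/c/tmp -S 256M tmp1.csv | uniq -c | sort -nr > tmp2.csv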
The following awk command seemed to be a much faster way to get rid of the duplicate values:
awk '!v[$0]++' "$FILE2" > tmp.csv
where $FILE2 is the name of the file with the duplicate values. (It prints a line only the first time it is seen: the post-increment means the test !v[$0] is true only on a line's first occurrence.)

Parse ClamAV logs in Bash script using Regex to insert in MySQL

Morning/Evening all,
I've got a problem where I'm making a script for work that uses ClamAV to scan for malware and then places its results in MySQL, taking the resultant ClamAV logs and using grep with awk to convert the right parts of the log into variables. I have done the summary OK, but the syntax of the detection lines makes things slightly more difficult. I'm no expert at regex by any means and this is a bit of a learning experience, so there is probably a far better way of doing it than mine!
The lines I'm trying to parse look like these:
/net/nas/vol0/home/recep/SG4rt.exe: Worm.SomeFool.P FOUND
/net/nas/vol0/home/recep/SG4rt.exe: moved to '/srv/clamav/quarantine/SG4rt.exe'
As far as I was able to establish, I need a positive lookbehind to match what comes after and before the colon without actually matching the colon or the space after it, and I can't see a clear way of doing it in RegExr without it thinking I'm trying to look for two colons. To make matters worse, we sometimes get these too...
WARNING: Can't open file /net/nas/vol0/home/laser/samples/sample1.avi: Permission denied
The end result is that I can build a MySQL query that inserts the path, the malware found, and where it was moved to, or, if there was an error, the path and the error encountered, so that each element becomes the contents of a variable in a while statement.
I've done the scan summary as follows:
The summary looks like:
----------- SCAN SUMMARY -----------
Known viruses: 329
Engine version: 0.97.1
Scanned directories: 17350
Scanned files: 50342
Infected files: 3
Total errors: 1
Data scanned: 15551.73 MB
Data read: 16382.67 MB (ratio 0.95:1)
Time: 3765.236 sec (62 m 45 s)
I'm parsing it like this:
SCANNED_DIRS=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned directories" | awk '{gsub("Scanned directories: ", "");print}')
SCANNED_FILES=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Scanned files" | awk '{gsub("Scanned files: ", "");print}')
INFECTED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Infected files" | awk '{gsub("Infected files: ", "");print}')
DATA_SCANNED=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data scanned" | awk '{gsub("Data scanned: ", "");print}')
DATA_READ=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Data read" | awk '{gsub("Data read: ", "");print}')
TIME_TAKEN=$(cat /srv/clamav/$IY-scan-$LOGTIME.log | grep "Time" | awk '{gsub("Time: ", "");print}')
END_TIME=$(date +%s)
mysql -u scanner_parser --password=removed sc_live -e "INSERT INTO bs.live.bs_jobstat VALUES (NULL, '$CURRTIME', '$PID', '$IY', '$SCANNED_DIRS', '$SCANNED_FILES', '$INFECTED', '$DATA_SCANNED', '$DATA_READ', '$TIME_TAKEN', '$END_TIME');"
rm -f /srv/clamav/$IY-scan-$LOGTIME.log
Some of those variables are from other parts of the script and can be ignored. The reason I'm doing this is to save logfile clutter and have a simple web-based overview of the status of the system.
Any clues? Am I going about all this the wrong way? Thanks for the help in advance, I do appreciate it!
From what I can determine from the question, it seems like you are asking how to distinguish the lines you want from the logger lines that start with WARNING, ERROR, or INFO.
You can do this without getting too fancy with lookahead or lookbehind. Just grep for lines beginning with
"/net/nas/vol0/home/recep/SG4rt.exe: "
then use awk to extract the remainder of the line, or gsub the prefix out as you are doing in the summary-processing section.
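A sketch of that idea, generalized to any detection line (GNU sed assumed; $LOG stands in for your log path, and anchoring on the trailing " FOUND" avoids trouble with paths that contain colons):
# path lands in field 1, the malware name in field 2 (tab-separated)
grep ' FOUND$' "$LOG" | sed -E 's/^(.*): (.*) FOUND$/\1\t\2/'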
As far as the question about processing the summary goes, what strikes me most is that you are processing the entire file multiple times, each time pulling out one kind of line. For tasks like this, I would use Perl, Ruby, or Python and make one pass through the file, collecting the pieces of each line after the colon, storing them in regular programming language variables (not env variables), and forming the MySQL insert string using interpolation.
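You can get the single-pass idea even in plain Bash, for what it's worth. A sketch, using the variable names from your script and splitting each summary line at its first colon:
# read the log once; `read` keeps any later colons inside $value
while IFS=: read -r key value; do
    value=${value# }   # drop the space that follows the colon
    case $key in
        "Scanned directories") SCANNED_DIRS=$value ;;
        "Scanned files")       SCANNED_FILES=$value ;;
        "Infected files")      INFECTED=$value ;;
        "Data scanned")        DATA_SCANNED=$value ;;
        "Data read")           DATA_READ=$value ;;
        "Time")                TIME_TAKEN=$value ;;
    esac
done < "/srv/clamav/$IY-scan-$LOGTIME.log"
Because the loop reads from a redirection rather than a pipe, the variables are still set when it finishes.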
Bash is great for some things but IMHO you are justified in using a more general scripting language (Perl, Python, Ruby come to mind).

Processing MySQL result in bash

I currently have a bash script of a few thousand lines which sends various queries to MySQL to generate applicable output for munin.
Up until now the results were simply numbers, which weren't a problem, but now I'm facing the challenge of working with a more complex query of the form:
$ echo "SELECT id, name FROM type ORDER BY sort" | mysql test
id name
2 Name1
1 Name2
3 Name3
From this result I need to store the id and name (and their respective association), and based on the IDs I need to perform further queries, e.g. SELECT COUNT(*) FROM somedata WHERE type = 2, and later output that result paired with the associated name column from the first result.
I'd easily know how to do it in PHP/Ruby, but I'd like to avoid forking another process, especially since it's polled regularly; however, I'm completely lost as to where to start with bash.
Maybe using bash is the wrong approach anyway and I should just fork out?
I'm using GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu).
My example is plain sh rather than Bash, but note the parameters I pass when invoking the mysql command; they suppress the box drawing and the headers.
#!/bin/sh
# -B: batch (tab-separated) output, -N: skip column names, -s: silent mode
mysql dbname -B -N -s -e "SELECT * FROM tbl" | while read -r line
do
echo "$line" | cut -f1 # outputs col #1
echo "$line" | cut -f2 # outputs col #2
echo "$line" | cut -f3 # outputs col #3
done
You would use a while read loop to process the output of that command.
echo "SELECT id, name FROM type ORDER BY sort" | mysql test | while read -r line
do
# you could use an if statement to skip the header line
do_something "$line"
done
or store it in an array:
while read -r line
do
array+=("$line")
done < <(echo "SELECT id, name FROM type ORDER BY sort" | mysql test)
That's a general overview of the technique. If you have more specific questions post them separately or if they're very simple post them in a comment or as an edit to your original question.
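Putting the pieces together for the query in the question (a sketch: it leans on the -B/-N flags shown in the earlier answer, which also remove the header row, and somedata/type come from the question itself):
while IFS=$'\t' read -r id name; do
    # the IDs come from the first query, not from user input
    count=$(mysql test -B -N -e "SELECT COUNT(*) FROM somedata WHERE type = $id")
    echo "$name: $count"
done < <(mysql test -B -N -e "SELECT id, name FROM type ORDER BY sort")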
You're going to "fork out," as you put it, to the mysql command-line client program anyhow, so either way you're going to have process-creation overhead. With your approach of using a new invocation of mysql for each query, you're also going to incur the cost of connecting and authenticating to the mysqld server multiple times. That's expensive, but the expense may not matter if this app doesn't have to scale up.
Making it secure against SQL injection is another matter. If you prompt a user for her name and she answers "sally;drop table type;" she's laughing and you're screwed.
You might be wise to use a language that's more expressive in the areas that are important for database access for some of your logic. Ruby, PHP, and Perl are all good choices. Perl happens to be tuned and designed to run snappily under shell-script control.

Mapnik ignoring my lat long bounding box

Can anyone see anything wrong with the following set of commands? Every time I run these, image.png is an image of the UK and not the JOSM map I exported. I'm guessing there's something awry with the DB import; however, the output mentions that it's processing my coords and data.
Steps:
1 - Exported a .osm file from JOSM or Merkaartor.
2 - Imported it into PostgreSQL using the following command:
osm2pgsql -m -d gis -S ~/mapnik/default.style -b 103,1.3,104,1.4 ion.osm -v -c
The output for this looks like:
marshall#ubuntu:~/mapnik$ osm2pgsql -m -d gis -S ~/mapnik/default.style -b 103,1.3,104,1.4 ion.osm -v -c
osm2pgsql SVN version 0.66-
Using projection SRS 900913 (Spherical Mercator)
Applying Bounding box: 103.000000,1.300000 to 104.000000,1.400000
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads
Mid: Ram, scale=100
Reading in file: ion.osm
Processing: Node(25k) Way(3k) Relation(0k)
Node stats: total(25760), max(844548651)
Way stats: total(3783), max(69993379)
Relation stats: total(27), max(536780)
Writing way(3k)
Writing rel(0k)
Committing transaction for planet_osm_point
Sorting data and creating indexes for planet_osm_point
Committing transaction for planet_osm_line
Committing transaction for planet_osm_roads
Sorting data and creating indexes for planet_osm_line
Committing transaction for planet_osm_polygon
Sorting data and creating indexes for planet_osm_roads
Sorting data and creating indexes for planet_osm_polygon
Completed planet_osm_polygon
Completed planet_osm_roads
Completed planet_osm_point
Completed planet_osm_line
I can see the correct lat/lon coords being passed in, but I'm not sure how to verify this within the database (see the query sketch after these steps).
3 - ./generate_xml.py --accept-none --dbname gis --symbols ./symbols/ --world_boundaries ../world_boundaries/
4 - ./generate_image.py
At this point image.png is a map of the UK, not Singapore, which is what I specified.
Can anyone see anything wrong with this? This is with Mapnik 0.71 on Ubuntu.
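One way to sanity-check the import (a sketch, assuming PostGIS is available; osm2pgsql stores geometry in a column called way, in the 900913 projection the output above mentions):
# bounding box of the imported points, transformed back to lat/lon
psql -d gis -c "SELECT ST_Extent(ST_Transform(way, 4326)) FROM planet_osm_point;"
If the extent comes back near 103-104 E and 1.3-1.4 N, the import is fine and the problem is downstream.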
Found the solution.
The issue is that the generate_image.py script does not read the bounding box from the database but rather has it hardcoded inside (presumably as a UK box by default, which would explain the map of the UK). I'm not sure of the reasoning behind this.
The solution is to edit generate_image.py manually and change the relevant line:
ll = (103,1.3,104,1.4)