Extract reverse BLAST matches with samtools faidx - reverse

I am using samtools faidx to extract the sequences matching the ranges given by a BLAST output file (format is tabular -outfmt 6).
Unfortunately, there are BLAST matches forward and reverse.
The forward ranges are processed without problems by samtools faidx.
The reverse ranges are causing samtools to give and error message.
The syntax is:
samtools faidx file.faa "$scaffold":"$start"-"$end"
And in the case of a reverse match ("$start" > "$end"), the error message is:
>$Scaffold:$start-$end/rc
[faidx] Zero length sequence: $Scaffold:$start-$end
Would anyone know if samtools can process the reverse inputs, with a specific flag?
I have not found a script yep for this situation that should be quite common in biological science!

I have not seen any specific "tag" in samtools which cloud process the reverse inputs.
Try:
samtools faidx file.faa "$scaffold": "$end"-"$start"
It's still the region you want. If you want the extracted sequence in a reverse (and complement) direction, try some online tools.
Also, here's explanation about BLAST output reversed

Related

how to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use tessaract to parse an image of names, and then use the Levenshtein algorithm to compare each line with a list of names in my database and if I get reasonably close matches, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest, my guess is the distance matches for 4 & 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of folks coming in, so if any changes to that I can do to help, please let me know.
Since your goal is to get names only - I would suggest you to reduce tessedit_char_whitelist to english alphabetical ones("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you will not get characters that you don't expect as output like \\) [ .
Your initial approach to calculate L distance is fine if you success to extract text from handwritten image (which is a hard task for tesseract).
I would also suggest to run some preprocessing on your image. For example you can remove horizontal lines and extract text ROIs around them. In the best case you will be able to extract separated characters, but even if you don't do that - you will get better results & will be able to distinguish result names "line by line".
You should also try other recommended output quality improvement stages which you can find in Tesseract OCR wiki (link)

How do I use the modulo operators in Socrata SoQL?

https://dev.socrata.com/docs/datatypes/number.html#, says that % and ^ can be used to get the modulo of one number divided by another number. I cannot get them to work and cannot find examples.
When I try ^ I appear to get exponentiation. Example:
http://data.cityofchicago.org/resource/pubx-yq2d.json?$select=streetnumberto,streetnumberfrom,(streetnumberto-streetnumberfrom)^100%20as%20address_length_in_blocks
When I try % itself, I get a "malformed" error, not so surprisingly. When I try the %25 code for %, I still get a "malformed" error but one that seems to suggest that it correctly inserted the % but does not know what it means. (I am restricted from posting more than two links but just replace the ^ above with % and %25.)
Can anyone help me get this working?
By the way, at the risk of mixing topics, I would ideally like to use an int or round sort of function but they do not seem to exist in SoQL so I was trying to back into getting the integer portion of dividing by 100.
Thank you.
Great question. I just tried it myself and I'm getting a malformed exception too, and it's definitely getting through to the query optimizer:
https://data.cityofchicago.org/resource/erhc-fkv9.json?totalfees=4160%20%25%2010.0
Note I'm using the SODA 2.1 version of that API, which I recommend you migrate to:
https://dev.socrata.com/foundry/#/data.cityofchicago.org/erhc-fkv9
I'll check with our engineering team and see what might be going on. I'll pass on the feature request for round, ceil, floor, etc as well.

Can I read the rest of the line after a positive value of IOSTAT?

I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback Method for 41 different groups. Some of the values are non-existing, though, and the table lists them as "X". I saved the table as a .csv and in my code read the file to an array. An excerpt of two lines from the .csv (the second one contains non-exisiting coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried doing was to read and define an array to hold each IOSTAT value so I can know if an "X" was read (that is, IOSTAT would be positive):
DO I = 1, 41
(READ(25,*,IOSTAT=ReadStatus(I,J)) JobackCoeff, J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values of those line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during the input execution then processing of the input list terminates. Further, all variables specified in the input list become undefined. The short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations, robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i=1, 13
read(line(idx:), *, iostat=iostat) x(i)
if (iostat.gt.0) then
print '("Column ",I0," has an X")', i
x(i) = -HUGE(0.) ! Recall x(i) was left undefined
end if
idx = idx + INDEX(line(idx:), ',')
end do
An alternative, long used by many many Fortran programmers, and programmers in other languages, would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NANs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do) and they will correctly interpret NAN in the input file to a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.

Amending a condition as the stack unwinds in Common Lisp

I'm working on some lisp code that munges a CVS repository in order to get it into shape for a git conversion. As such, when something goes wrong it might not be a bug in my code but might instead be an inaccuracy in the input data which needs fixing.
Deep down in the bowels of my code, I'll have (say) some function that merges two date ranges by taking an intersection. If that intersection turns out to be empty, I'll raise an error. This is all good, but my "merge dates" function is missing lots of the information that the user (me!) needs in order to figure out what went wrong. For example, which CVS master file ("foo,v") was I working on? What branch was I thinking about? Etc. etc.
My partial solution to this problem is to handle and re-raise the error. For example, this sort of code:
(handler-case
(do-something)
(unclear-graft-point (c)
(setf (slot-value c 'master) master)
(setf (slot-value c 'branch) branch)
(error c)))
which sets some extra slots with useful info before re-raising the error.
The report function for the condition then checks whether those slots have been set and, if so, will use them to give a more helpful error message.
Unfortunately, the backtrace that I get stops at the top-level re-raise. That makes sense: it's the code I wrote. But it's not really what I want...
Is there a way in Common Lisp to "annotate" a condition as it hurtles past, without re-raising it and, hence, partially unwinding the stack without reporting that in the back trace?
I gave a reasonable amount of background info about what I'm doing in the hope that, if this isn't possible, someone will say "Ah, what you should do in this situation is...": am I just going about this the wrong way?
Yes, there is a Common Lisp technology applicapble to your case - it is called HANDLER-BIND. You can find a detailed explanation and use cases for it alongside other condition-handling machinery in Peter Seibel's Practical Common Lisp Chapter 19. Conditions and Restarts.

How to compile a complete list of MySQL "Words"

Really getting into MySQL and one thought I've had on mastering one aspect of it is to gather a complete listing of MySQL words. One example of this might be the Reserved Words list, though it appears that's not a complete list; example: CONCAT, CRC32, etc.
Bizarre as it may seem, I was thinking that such a list might exist, or that there might even be a query that would yield it, and/or a way to extract it from the source code of MySQL.
It is a non-scientific method, but what I would do is:
extract all strings from Native_func_registry func_array. Lookup for it sql/item_create.cc , e.g in
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/item_create.cc
Those should cover builtin functions.
extract strings from 'symbols' and 'functions' in lexer :
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/lex.h
extract symbols from bison input http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/sql_yacc.yy from lines
%token SOMETOKEN
except when tokens have _SYM suffix (they are covered by sql/lex.h)
Combine all of those, and the resulting set might come near :)