Extract aligned sections of FASTA alignments to a new file

I've already looked here and in other forums but couldn't find the answer to my question. I want to design baits for a target enrichment sequencing approach, and I have the output of a MarkerMiner search for orthologous loci from four different genomes, with A. thaliana as a reference. These output alignments are separate FASTA files, one for each annotated A. thaliana gene, with the sequences from my datasets aligned to it.
I have already run a script to keep only those loci supported as orthologous by at least two of my four input datasets.
However, now I'm stumped.
My alignments are gappy, since the input data is mostly RNA-seq whereas the reference contains the introns as well. So it looks like this:
AT01G1234567
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC
MyData1
CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC
MyData2
CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC
To effectively design baits I now need to extract all the aligned parts from each file, so that I end up with separate files (or separate alignments within one file) for the parts that are aligned between MyData and the reference sequence, with all the gappy parts excluded. There are about 1,300 of these FASTA files, so doing it manually is not an option.
I have a bit of programming experience in Python and with Linux command-line tools, but I am completely lost on how to go about this. I would appreciate a hint on what kind of tools are out there that I could use, or what kind of algorithm I need to come up with.
Thank you.
Cheers
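If it helps as a starting point, here is a minimal Python sketch built on Biopython's AlignIO. Everything specific in it is an assumption for illustration: the reference is taken to be the first record in each file, gaps are written as "-", the file names are made up, and the 80 bp minimum block length is just a placeholder for whatever your bait design needs. It keeps the columns where every data sequence has a residue and writes each contiguous run of kept columns as its own alignment:

# Minimal sketch (assumptions: reference is the first record, gaps are "-",
# Biopython is installed, file names and the 80 bp threshold are illustrative).
from Bio import AlignIO

def aligned_blocks(aln, min_len=1):
    """Yield (start, end) column ranges where no data sequence has a gap."""
    data = aln[1:]  # every record except the reference
    keep = [all(rec.seq[i] != "-" for rec in data)
            for i in range(aln.get_alignment_length())]
    start = None
    for i, ok in enumerate(keep + [False]):  # sentinel flushes the last block
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                yield start, i
            start = None

aln = AlignIO.read("AT01G1234567.fasta", "fasta")  # one of the ~1300 files
for n, (start, end) in enumerate(aligned_blocks(aln, min_len=80), 1):
    AlignIO.write(aln[:, start:end], f"AT01G1234567_block{n}.fasta", "fasta")

Looping over the ~1,300 files with glob, or relaxing all() to any() if coverage by a single dataset is enough, are the obvious adjustments.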

Related

How can I set U-Boot CONFIG_MYBOARD_XXXX values from the command line?

I'm working on custom hardware, and have added a new myboard board to my U-Boot repo. The make pattern is very standard:
make myboard_defconfig
make
which sets the U-Boot configuration to the myboard defaults and then builds the resulting U-Boot image. It all works, but I need to take it one step further.
The hardware actually comes in two closely-related flavors, and I need to build slightly different U-Boot images for the two flavors. Rather than defining two completely different boards, I'd like to build the same board type twice, but with a CONFIG_MYBOARD_XXXX symbol having different values. My myboard.c file will then have an #if CONFIG_MYBOARD_XXXX == YYYY test to differentiate the results.
Problem: I want to set CONFIG_MYBOARD_XXXX's value from within my parent Makefile, not by running anything interactive like make menuconfig.
What's the "right" way to do this?
The U-Boot make process has a lot of magic in it, and there seem to be a number of unstated rules about how files need to be named. So I assumed that the configs/myboard_defconfig file, and the argument to make myboard_defconfig, had to match the official name of my board followed by _defconfig.
Turns out I was wrong: these files can be named anything, as long as they end in _defconfig. So, to have two closely-related versions of myboard, I just have two different defconfig files, e.g. myboard_one_defconfig and myboard_two_defconfig, with the configuration values in the two files specifying the configuration for the two different flavors of myboard.
Easy peasy!
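To make that concrete (the values and file contents below are illustrative, and assume CONFIG_MYBOARD_XXXX is declared as an int symbol in the board's Kconfig), the two defconfig files could be identical apart from one line:

# configs/myboard_one_defconfig
CONFIG_MYBOARD_XXXX=1
# configs/myboard_two_defconfig
CONFIG_MYBOARD_XXXX=2

and the parent Makefile can then build each flavor non-interactively:

make myboard_one_defconfig && make    # flavor one
make myboard_two_defconfig && make    # flavor two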

How to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of the names, and then use the Levenshtein algorithm to compare each line with a list of names in my database; if I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest; my guess is the distance matches for 4 & 5 would work, but the rest are not even close.
I have control over my sign-in sheet, but not over the handwriting of folks coming in, so if there are any changes to the sheet I can make to help, please let me know.
Since your goal is to get names only, I would suggest reducing tessedit_char_whitelist to uppercase letters, digits, and the period ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you do not get unexpected characters like \\) [ in the output.
Your initial approach of calculating Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract text ROIs around them. In the best case you will be able to extract separated characters, but even if you don't, you will get better results and will be able to distinguish the recognized names line by line.
You should also try the other recommended output-quality-improvement steps, which you can find in the Tesseract OCR wiki (link).
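For the matching step itself, a minimal Python sketch might look like the following. Assumptions: the OCR output has already been written to outtxt.txt by the command shown in the question, the known names live in a roster.txt file (one name per line), and difflib's similarity ratio stands in for a true Levenshtein distance:

# Minimal sketch: fuzzy-match each OCR'd line against a known roster.
# Assumes outtxt.txt is the Tesseract output and roster.txt holds one
# known name per line (both file names are illustrative).
import difflib
import re

with open("roster.txt") as f:
    roster = [line.strip().upper() for line in f if line.strip()]

with open("outtxt.txt") as f:
    for line in f:
        # Drop leading numbering/punctuation noise and normalize case.
        cleaned = re.sub(r"^[^A-Za-z]+", "", line).strip().upper()
        if not cleaned:
            continue
        matches = difflib.get_close_matches(cleaned, roster, n=1, cutoff=0.6)
        print(cleaned, "->", matches[0] if matches else "no confident match")

Swapping in python-Levenshtein (or any edit-distance function) and tuning the cutoff against your real roster is straightforward.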

Generating truth tables for basic logic circuits

Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values – Don 12 mins ago
How many inputs and how many outputs are there in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
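If it helps to see the 'simulate it' part concretely, here is a minimal Python sketch (the asker mentioned C#, but the same structure carries over). The assumptions are: gates are listed in evaluation order, the last gate in the file is the circuit output, and the small gate-type table is only an example set. It parses the <number> <name> <type> <inputs...> lines, treats any reference that is not a gate number as a primary input, and enumerates every input combination:

# Minimal sketch: parse the netlist format from the question and print a
# truth table. Assumptions: gates appear in evaluation order, the last gate
# is the circuit output, and the gate-type table below is illustrative.
from itertools import product

GATES = {
    "AND": lambda ins: all(ins),
    "OR":  lambda ins: any(ins),
    "XOR": lambda ins: sum(ins) % 2 == 1,
    "NOT": lambda ins: not ins[0],
}

def parse(lines):
    """Return a list of (number, name, type, input_refs) tuples."""
    circuit = []
    for line in lines:
        if line.strip():
            num, name, gtype, *inputs = line.split()
            circuit.append((num, name, gtype, inputs))
    return circuit

def primary_inputs(circuit):
    gate_nums = {num for num, _, _, _ in circuit}
    return sorted({ref for _, _, _, ins in circuit for ref in ins
                   if ref not in gate_nums})

def truth_table(circuit):
    primary = primary_inputs(circuit)
    for values in product([0, 1], repeat=len(primary)):
        signals = dict(zip(primary, values))
        for num, name, gtype, ins in circuit:
            out = int(GATES[gtype]([bool(signals[ref]) for ref in ins]))
            signals[num] = signals[name] = out   # reachable by number or name
        yield values, signals[circuit[-1][1]]

circuit = parse(["1 XOR1 XOR A B", "2 SUM XOR 1 C"])
print(" ".join(primary_inputs(circuit)), circuit[-1][1])
for values, out in truth_table(circuit):
    print(" ".join(str(v) for v in values), out)

Reading the real file instead of the inline list and extending GATES is all that should be needed; for large circuits you would also topologically sort the gates rather than trusting file order.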

Drools: Merged cells in CSV

I have a set of Drools rules stored in an Excel document that for various reasons needs to be replaced with a .csv file. The problem is that .csv files don't support merged cells, making it difficult if not impossible to properly convert the rules.
After a lot of googling, I found references to using "..." to indicate merged cells, but no explicit examples on how to use it. Documentation found in the source code gives a few more hints, but is still too ambiguous; I've tried countless different interpretations of it without any success.
Any help would be appreciated.
We had the same issue. After reviewing the Drools source code (CsvParser and DefaultRuleSheetListener), I found a solution; hopefully this post saves you some time.
Only specify ... in the ObjectType/pattern row, i.e. the one below the CONDITION/ACTION row, starting from the beginning of the merged cell and continuing to its end. Please note that in the continuation cells of a merged range you cannot use "..." on its own: after normalizing and trimming, the code treats such a cell as empty and silently ignores it. Put something in front of it, such as a..., b..., etc.
Please also note that Drools uses a buffered reader, not a CSV reader, so it cannot handle a cell value that spans multiple lines, unless you supply your own CsvParser that uses a proper CSV reader.
Here is a simplified example.
CONDITION,CONDITION,CONDITION,ACTION,ACTION
$Client:Client(),$Product:Product()...,anythingButNotJust3Dots...,,
"clientType == ""$param""","planType == ""$param""","accountType == ""$param""","documents.add(""$param"");","documents.add(""$param"");"
INDIVIDUAL,RRSP,CASH,document1,document2
INDIVIDUAL,RESP,CASH,document2,
INDIVIDUAL,RIF,CASH,document3,
INDIVIDUAL,,MARGIN,document4,document6

When could CSV records *not* have the same number of fields?

I am storing a series of events in a CSV file; each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%, ,
, , 12
16mph, 20%, ,
14mph, 20%, ,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are expected to be toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. Although, if the application is going to provide a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a processor. Neither method is more complex to parse with a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications, etc.? With Option A, future changes/versions to an individual event's data set may break the CSV structure (zombie , , fields kept to maintain forwards compatibility), whereas Option B will fail more gracefully.
edit
This may be aimed at students and frameworks like openFrameworks, Plask, Processing, etc., where CSV is easier to implement.
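To make the 'other libraries' concern concrete, here is a minimal Python sketch of reading Option B with a stock CSV parser (events.csv and the dispatch logic are illustrative):

# Minimal sketch: read Option B records, where each row's length depends on
# the event type in the first field (events.csv is an illustrative name).
import csv

with open("events.csv", newline="") as f:
    for row in csv.reader(f):
        if not row:
            continue
        event, values = row[0].strip(), [v.strip() for v in row[1:]]
        if event == "running":
            speed, incline = values        # e.g. "15mph", "20%"
            print("running:", speed, incline)
        elif event == "sleeping":
            snores = int(values[0])
            print("sleeping:", snores)

The trade-off described above is visible here: every consumer has to know the per-event layout, which is exactly the schema information a format like JSON or XML would carry along.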
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e. don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented, and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.