Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Title is very long: 1127 characters (max is 1000) - warnings

I'm running blastx on my de novo transcriptome assembly. While the program is still running, I've been getting errors like this one:
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Title is very long: 1127 characters (max is 1000)
...and others, where the number of characters varies. I've searched for this specific error online but can't find anything about it. I was hoping that someone who has run across it could help me understand what it means and, especially, whether I should stop the run and start again with different parameters or make some change to my assembly.

I'm facing the same problem with ncbi-blast-2.2.29+.
I then tried an older version (2.2.25+), and makeblastdb worked fine for me, without these two error messages:
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Title is very long: 1141 characters (max is 1000)
Error: (1431.1) FASTA-Reader: Warning: FASTA-Reader: Ignoring FASTA modifier(s) found because the input was not expected to have any.
So you can use an older version too, until the developers fix the problem.

Did you ever figure this out? I'm running into the same issue with fasta files generated by a Trinity assembly. The fasta file is not altered in any way, so I'm not sure why there would be a problem. I did some research and found the source code that generates this error:
void CFastaReader::ParseTitle(
    const SLineTextAndLoc & lineInfo, IMessageListener * pMessageListener)
{
    const static size_t kWarnTitleLength = 1000;
    if( lineInfo.m_sLineText.length() > kWarnTitleLength ) {
        FASTA_WARNING(lineInfo.m_iLineNum,
            "FASTA-Reader: Title is very long: " << lineInfo.m_sLineText.length()
            << " characters (max is " << kWarnTitleLength << ")",
            ILineError::eProblem_TooLong, "defline");
    }
This code is from the FASTA reader (CFastaReader) in the NCBI C++ Toolkit source.

I ended up using a one-liner to parse the extraneous information out of the fasta headers:
sed -e 's/>* .*$//' original.fasta > truncated.fasta
But I'd recommend doing that on a test file first, as your headers are most likely going to be different than mine.
Thanks for the pointer!
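For anyone who would rather not rely on sed, the same header truncation can be sketched in Python. This is only an illustration: the function name and the sample record are made up, and unlike the sed one-liner it cuts at the first whitespace of any kind (space or tab), so check that the shortened IDs are still unique in your own file.

```python
import io


def truncate_headers(lines):
    """Yield FASTA lines with each header cut at its first whitespace."""
    for line in lines:
        if line.startswith(">"):
            # Keep only the sequence ID; everything after the first
            # whitespace (the long description) is dropped.
            parts = line.split(None, 1)
            yield (parts[0] if parts else ">") + "\n"
        else:
            # Sequence lines pass through unchanged.
            yield line


# Hypothetical Trinity-style record, used only for illustration.
sample = io.StringIO(">TRINITY_DN1_c0_g1_i1 len=230 path=[0:0-229]\nACGT\n")
result = "".join(truncate_headers(sample))
print(result)
```

With real files, the same generator can be fed an open file handle and its output written to a new file, leaving the original assembly untouched.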


How to validate mysql database URI

I'm trying to integrate a gem named blazer with my Rails application, and I have to specify the MySQL database URL in the blazer.yml file so that it can access data in the staging and production environments.
I believe the standard format for a MySQL database URL is
mysql2://user:password@hostname:3306/database
I defined my URL as a string in the same format, and when I validate the URI I get the error below:
URI::InvalidURIError: bad URI(is not URI?):
mysql2://f77_oe_85_staging:LcCh%264855c6M;kG9yGhjghjZC?JquGVK@factory97-aurora-staging-cluster.cluster-cmj77682fpy4kjl.us-east-1.rds.amazonaws.com/factory97_oe85_staging
Defined MySQL database URL:
'mysql2://f77_oe_85_staging:LcCh%264855c6M;kG9yGhjghjZC?JquGVK@factory97-aurora-staging-cluster.cluster-cmj77682fpy4kjl.us-east-1.rds.amazonaws.com/factory97_oe85_staging'
Please advise.
The URI is invalid.
The problem is the password contains characters which are not valid in a URI. The username:password is the userinfo part of a URI. From RFC 3986...
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment

authority   = [ userinfo "@" ] host [ ":" port ]
userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )
pct-encoded = "%" HEXDIG HEXDIG
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
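Specifically, it's the ? in the password LcCh%264855c6M;kG9yGhjghjZC?JquGVK. A ? is not in any of the character sets allowed in userinfo, so it has to be percent-encoded as %3F. It looks like the password is only partially escaped (the %26 is an encoded &, but the ? was left as it is).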
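To see why a raw ? is a problem, here is a small sketch in Python rather than Ruby. The credentials and host are dummies, not the ones from the question. Python's urlsplit is more lenient than Ruby's URI parser and does not raise, but it shows where the authority gets cut: at the first ?, so the rest of the password and the host end up in the query.

```python
from urllib.parse import urlsplit

# Dummy credentials and host, for illustration only.
bad = urlsplit("mysql2://user:pa?ss@db.example.com:3306/app")
print(bad.netloc)   # user:pa
print(bad.query)    # ss@db.example.com:3306/app

# The same URL with the "?" percent-encoded as %3F parses as intended.
good = urlsplit("mysql2://user:pa%3Fss@db.example.com:3306/app")
print(good.hostname, good.port, good.path)   # db.example.com 3306 /app
```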
I think part of the problem is that the issue is not well isolated. Here is an example strategy for isolating it.
The error URI::InvalidURIError: bad URI(is not URI?): only indicates that the library (the blazer gem) successfully read a file, which may or may not be the file you edited (/YOUR_DIR/blazer.yml or similar), but then failed to parse the URI.
Now, the issues to consider include:
(1) Does the blazer gem really read /YOUR_DIR/blazer.yml?
(2) Does the preprocessor of the yml work as expected?
(3) Is the uri key specified correctly?
(4) Is the scheme mysql: or mysql2:?
(5) Are the formats of the IP, port, account name, password, and database name all correct? In particular, are special characters correctly escaped? (See the MySQL documentation about special characters.)
I suppose the OP knows the answers to some of these questions, but we don't. So, let's assume any of them can be an issue.
Then a proposed strategy is this:
Find a URI that is at least in a correct format and confirm it is parsed and recognised correctly by Gem blazer. Note you only need to test the format and so dummy parameters are fine. For example, try a combination of the following and see which does not issue the error URI::InvalidURIError:
mysql://127.0.0.1/test
mysql://adam:alphabetonly@127.0.0.1/test
jdbc:mysql://adam:alphabetonly@127.0.0.1/test
Now, you know at least the potential issues (1),(3),(4) are irrelevant.
Replace the IP (hostname), account name, password, and database name one by one with the real one and find which raises the error URI::InvalidURIError. Now you have narrowed down which part causes a problem. In the OP's case, I suspect the problem is an incorrect escape of the special characters in the password part. Let's assume that is the case, and then,
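properly escape the part so that it forms a correct URI as a whole. The answer by @Schwern is a good summary of the format. As a tip, you can get an escaped string by opening the Rails console (via rails c) and typing URI.encode('YOUR_PASSWORD') (on Ruby 3.0 and later, where URI.encode has been removed, URI.encode_www_form_component('YOUR_PASSWORD') is a common substitute; unlike URI.encode, it also escapes reserved characters such as ? and ;, but note that it turns a space into +), or alternatively by running ruby directly from the command line in a (UNIX shell) terminal: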
ruby -ruri -e "puts URI.encode('YOUR_PASSWORD')"
Replace the password part of the URI in /YOUR_DIR/blazer.yml with the escaped string, and confirm that it no longer raises URI::InvalidURIError (hopefully).
In this procedure, I deliberately avoided the preprocessor part, (2).
This answer to "Rails not parsing database URL on production" mentions using URI.encode('YOUR_PASSWORD') in a yml file, but it implicitly assumes that the preprocessor works fine. During the test phase, that just adds another layer of complication, so it is better to skip it. If you need it in production (to mask the password etc.), implement it later, once you know everything else works.
Hopefully, by the time the OP has tried all of these, the problem is solved.
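The escaping step can also be sketched outside Ruby. The following Python snippet is an illustration only: build_db_url is a hypothetical helper, the password is made up, and the host and database are the dummy values used in the test URIs above. It percent-encodes the credentials before assembling the URL and then checks that the original password can be recovered from the result.

```python
from urllib.parse import quote, unquote, urlsplit


def build_db_url(user, password, host, port, database, scheme="mysql2"):
    """Assemble a database URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() escape every reserved character,
    # including "/", "?", "@", ";" and ":".
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{userinfo}@{host}:{port}/{database}"


raw_password = "p@ss?word;1&2"   # made-up password with awkward characters
url = build_db_url("adam", raw_password, "127.0.0.1", 3306, "test")
print(url)   # mysql2://adam:p%40ss%3Fword%3B1%262@127.0.0.1:3306/test

# Round trip: the parsed, unquoted password matches the original.
parts = urlsplit(url)
print(unquote(parts.password) == raw_password)   # True
```

Escaping the password on its own, before it is placed into the URL, avoids having to decide which ? or @ in the finished string are delimiters and which are data.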

How to decode an HTTP request with utf-8 and treat the surrogate keys (Emojis)

I'm having a hard time dealing with some parsing issues related to emojis.
I request a JSON document from the Brandwatch site using urllib. (1) Then I must decode it as UTF-8; however, when I do so, I get surrogates that json.loads cannot deal with. (2)
I've tried BeautifulSoup4, which works great; however, when there's a &quot in the site's result, it is transformed into ", and then json.loads cannot deal with it and says that a , is missing. After a lot of searching, I gave up trying to escape the ", which would be the ideal fix. (3)
So now I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Obs: This is a program that fetches data from Brandwatch and puts it into a MySQL database, so performance is an issue here.
Obs2: PyJQ is a jq binding for Python which does the request itself, and I can change the opener.
(1) - For the first approach, using urllib, these are the relevant parts of the code:
def downloader(url):
    return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
Error raised:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to json.loads, this is what appears. These escapes seem to be emojis.
"fullname":"Botinhas\uD83D\uDC62"
(2) - For the second approach, using BeautifulSoup4, here's the relevant part of the code. (Same as above; only the downloader function changed.)
def downloader(url):
    return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
And this is the error raised:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Printing the result shows that the " before Diretas Já should have been escaped:
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex; however, I'm not sure whether that would be the most appropriate solution in this case, as performance is an issue.
(3) - Part of Brandwatch result with the &quot problem mentioned above
UPDATE:
As Martin suggested in the comments, I ran a replace that swaps &quot for nothing. That brought back the earlier emoji problem:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
UPDATE2:
I've added this to the downloader function:
re.sub(r'\\u(d|D)([a-z|A-Z|0-9]{3})', "", urllib.request.urlopen(url).read().decode('utf-8','ignore'))
It solved the issue; however, I don't think it's the best way to solve it. If anybody knows a better option, please share it.
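One possible alternative to deleting the escapes with a regex is to parse the JSON first and clean the strings afterwards. The sketch below rests on two assumptions that are not confirmed in the question: that the UnicodeEncodeError comes from lone (unpaired) surrogates somewhere in the data, since json.loads already merges a valid pair such as \uD83D\uDC62 into a single character, and that the &quot entities sit inside JSON string values. The clean function and the sample payload are made up for illustration.

```python
import html
import json
import re

# Any UTF-16 surrogate code point still present after JSON decoding is unpaired.
LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def clean(value):
    """Recursively unescape HTML entities and replace lone surrogates in parsed JSON."""
    if isinstance(value, str):
        # Unescaping happens after parsing, so the resulting quotes
        # can no longer break the JSON syntax.
        return LONE_SURROGATE.sub("\ufffd", html.unescape(value))
    if isinstance(value, list):
        return [clean(item) for item in value]
    if isinstance(value, dict):
        return {key: clean(item) for key, item in value.items()}
    return value


# Made-up payload: a valid pair, an HTML entity, and a truncated (lone) surrogate.
raw = ('{"fullname": "Botinhas\\uD83D\\uDC62", '
       '"title": "Por &quot;Diretas J\\u00e1&quot;", '
       '"cut": "x\\uD83D"}')
data = clean(json.loads(raw))

# Every string can now be encoded as UTF-8 without a surrogate error.
encoded = json.dumps(data, ensure_ascii=False).encode("utf-8")
```

Inside the downloader function, this would mean returning clean(json.loads(...)) from the plain urllib version, so BeautifulSoup is not needed at all. Whether the extra pass over the parsed data is acceptable for performance would have to be measured.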

CSV::MalformedCSVError: Illegal quoting in line 1 with SmarterCSV

I have an issue when trying to process a CSV file using SmarterCSV.
The error I get is:
CSV::MalformedCSVError: Illegal quoting in line 1
This is the code I use to process the CSV file:
SmarterCSV.process(file_path)
I have gone through similar questions, but nowhere did I find an answer that fits my case.
I tried to resolve it using some SmarterCSV options such as
:remove_empty_values, :remove_empty_hashes etc., but in vain.
Any suggestions or refactoring to make this work are welcome. Thanks, all.
This is usually due to unexpected Unicode characters in your file, such as a byte order mark (BOM) at its start.
You can process a file containing such characters with
f = File.open(file_path, "r:bom|utf-8"); data = SmarterCSV.process(f); f.close
Here, data will contain the parsed data.
Also refer to the official documentation on this: https://github.com/tilo/smarter_csv#notes-about-file-encodings
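The effect of a byte order mark on the first line can be illustrated outside Ruby as well. The Python sketch below assumes the culprit is a UTF-8 BOM in front of a quoted header, which the question does not confirm; the file content and the read_rows helper are made up. Python's csv module is more forgiving than Ruby's and does not raise, but the first header still comes out mangled unless the BOM is stripped while decoding, which is what the bom|utf-8 mode does in the Ruby answer.

```python
import csv
import io

# Hypothetical file content: a UTF-8 BOM followed by a quoted header row.
data = b'\xef\xbb\xbf"name","age"\r\n"Ann","30"\r\n'


def read_rows(raw_bytes, encoding):
    """Decode raw bytes with the given encoding and parse them as CSV."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    return list(csv.reader(text))


with_bom = read_rows(data, "utf-8")         # BOM stays glued to the first field
without_bom = read_rows(data, "utf-8-sig")  # BOM is removed while decoding

print(with_bom[0])      # ['\ufeff"name"', 'age']
print(without_bom[0])   # ['name', 'age']
```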

Why consecutive event jsons fall on the same line in some packages in githubarchive?

In the GitHub Archive (http://www.githubarchive.org/) that Ilya Grigorik provides, I found that in many gz files some consecutive events are logged on the same line,
for example in 2011-03-15-21.json.gz.
To get that file, run:
wget http://data.githubarchive.org/2011-03-15-21.json.gz
In this file, if you search for id 1484832, for example, you can find two consecutive events (JSON objects) on the same line;
see
http://codebeautify.org/jsonviewer/2cb891
The two JSON objects on that line are a combination of
http://codebeautify.org/jsonviewer/c7e18e
and
http://codebeautify.org/jsonviewer/945d56
What is the impact?
When I read each line and parsed it with Python's json.loads (why Python? Because I find Python comfortable for dealing with JSON), it reported the line as invalid, because it was a combination of two JSON objects.
Question:
1) How did you solve this kind of bug when you processed the GitHub Archive data?
2) I already have the data locally, so how can I overcome this problem? Should I write code specific to this case?
The code I wrote was like this:
jsonlist = line.split('}{')
json.loads(jsonlist[0] + '}', "ISO-8859-1") # load and navigate through this json
json.loads('{' + jsonlist[1], "ISO-8859-1") # load and navigate through this json
I found the solution here:
1) How did you solve this kind of bug when you processed the GitHub Archive data?
https://github.com/vadasg/githubarchive-parser/blob/master/src/FixGitHubArchiveDelimiters.rb
This script fixes the problem of two or more events appearing on the same line.
After running it, each JSON object appears on its own line.
2) I already have the data locally, so how can I overcome this problem? Should I write code specific to this case?
This script removes the need to write the code I mentioned above.
Note:
Related issues in the GitHub Archive project on GitHub:
https://github.com/igrigorik/githubarchive.org/issues/53
https://github.com/igrigorik/githubarchive.org/issues/17
WARNING:
When I ran this script, I got an error related to the encoding. By default, the line Yajl::Parser.parse(jsonInputFile) checks whether the characters it parses are valid UTF-8 and throws an error if they are not.
As the GitHub data also contains non-UTF-8 characters, this error is thrown in our case too. To bypass that problem (or maybe fix it), I changed the call to
Yajl::Parser.parse(jsonInputFile, :check_utf8 => false)
If in doubt, refer to the docs: http://rdoc.info/github/brianmario/yajl-ruby/Yajl/Parser.parse
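If you would rather stay in Python than run the Ruby script, the split on '}{' from the question can be replaced by the standard library's incremental decoder. This is a sketch, not what the linked script does: iter_json_objects is a made-up name and the sample line is invented. json.JSONDecoder.raw_decode parses one object and reports where it stopped, so it handles any number of objects per line and is not confused by a '}{' that happens to appear inside a string value.

```python
import json


def iter_json_objects(line):
    """Yield every JSON value found back to back on a single line."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(line)
    while pos < end:
        # raw_decode does not skip leading whitespace itself.
        while pos < end and line[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(line, pos)
        yield obj


# Invented line with two events glued together; the first one even
# contains "}{" inside a string, which a plain split would break on.
line = '{"id": 1, "s": "a}{b"}{"id": 2}\n'
events = list(iter_json_objects(line))
print(events)   # [{'id': 1, 's': 'a}{b'}, {'id': 2}]
```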

IF and ! = ns2 error

I have a problem with a path in a Tcl file. I tried to use
source " /tmp/mob.tcl "
with this command in a bash file:
/opt/ns-allinone-2.35/ns-2.35/indep-utils/cmu-scen-gen/setdest/setdest -v 1 -n $n -p 10 -M 64 -t 100 -x 250 -y 250 >> /tmp/mob.tcl
The terminal gives me this output:
..."
    (procedure "source" line 8)
    invoked from within
"source "/tmp/mob.tcl" "
    (file "mobilita_source.tcl" line 125)
How can I fix this?
Firstly, this:
source " /tmp/mob.tcl "
is very unlikely to be correct. The spaces around the filename inside the quotes will confuse the source command. (It could be correct, but only if you have a directory in your current directory whose name is a single space. That's really unlikely, unless you're a great deal more evil than I am.)
It really helps a lot if you stop making this error.
Secondly, the error message is both
Incomplete, with just an ellipsis instead of a full error on the first line
Really worrying, with source claimed to be a procedure (second line of that short trace).
It's legal to make a procedure called source, and sometimes the right thing to do, but if you're doing it then you have to be ever so careful to duplicate the semantics of the standard Tcl command or odd things will happen.
Thirdly, you've got a file of what is apparently generated code, and you're hitting a problem in it, and you're not telling us what is on/around line 125 of the file (the error trace is pretty clear on that front) or in the contents of the source procedure (which is non-standard; the standard source is implemented in C) and you're expecting us to guess what's going wrong for you??? Seriously?
Tcl error traces are usually quite clear enough for you to figure out what went wrong and where. If there's an unclear error, and it didn't come from user code (by calling error or return -code error) then let us know; we'll help (or possibly even change Tcl to make things clearer in the future). But right now, there's a complete shortage of information.
Here's an example of what a normal source error looks like:
% source /tmp/foo/bar/boo
couldn't read file "/tmp/foo/bar/boo": no such file or directory
% puts $errorInfo
couldn't read file "/tmp/foo/bar/boo": no such file or directory
    while executing
"source /tmp/foo/bar/boo"
If a script generates an error directly, it's encouraged to be as clear as that, but we cannot enforce it. Sometimes you have to be a bit of a detective yourself…