Advice for storing DNA sequence data in MySQL

I am creating a database that will store DNA sequence data, which are strings like 'atcgatcgatcg', and protein sequence data, which are also strings like 'MKLPKRML'.
I am a beginner at MySQL administration, and I want to ask about the proper configuration of these columns in terms of data type, character set, and collation. There will be around one million DNA and protein sequence rows, and I want string comparisons to perform as well as possible.
I've been reading about this problem, and I have these conclusions and doubts:
I could use VARCHAR at its maximum length, because the length of my strings will never exceed 65,535 characters.
BLOB field comparisons are faster. Is BLOB better than VARCHAR in this case? I am also thinking about issues with retrieving the data, because I need to get strings back, not bytes.
Is it better to use latin1 instead of utf8? I am only storing plain alphabetic characters, without special characters.
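To make the question concrete, this is roughly the table I have in mind (the table and column names are placeholders, and the lengths are just an example that fits within MySQL's 65,535-byte row limit):

    CREATE TABLE sequences (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      -- DNA sequence, e.g. 'atcgatcgatcg'
      dna_seq     VARCHAR(30000) CHARACTER SET latin1,  -- or utf8? which collation?
      -- protein sequence, e.g. 'MKLPKRML'
      protein_seq VARCHAR(30000) CHARACTER SET latin1
    ) ENGINE=InnoDB;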
Thank you for your help!

Related

How to insert large protein sequences into a table in MySQL?

I am creating a protein database, which consists of large protein sequences. I have created four tables, one of which stores genomeID, sequenceID, name, and sequence. I can easily store the other values using MySQL's INSERT INTO command, but I am stuck on inserting these large protein sequences in FASTA format, because one genome has thousands of peptide sequences.
I want to figure out a way to store these sequences using MySQL.
Any help would be appreciated!
Thanks
Given that whole genomes contain several billion characters, storing each one as a single value in a SQL column quickly becomes unwieldy, even with the largest text types. Instead, what I would recommend is using a NoSQL key-value database to store your sequences.

How to store a sequence of digits (or sequence of characters from some other set)?

I want to save sequences of digits, which may be numbers (e.g. 12345, 1230) or not (e.g. 00123, 0120). Which type of column is the most efficient (in storage and in indexing speed) for that purpose?
Also, I need to store strings of characters from a specially defined alphabet (e.g. "digits and commas" or "digits, English letters, dots, and commas"). How can I do this efficiently?
Can I set limits on a CHAR/VARCHAR column to reduce the amount of storage it takes?
This is no different from any string column, which implies a VARCHAR/TEXT type.
VARCHAR(n) handles storage dynamically, using only as much space as each value needs plus a small length prefix.
For constraining the allowed symbols you could use CHECK constraints, but they are not enforced in older versions of MySQL (before 8.0.16), so a workaround is to use a trigger and define the accepted characters there.
You could use the REGEXP operator to define your allowed alphabet.
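A rough sketch of that trigger workaround, assuming a placeholder table digit_strings with a column digits and an allowed alphabet of digits and commas:

    DELIMITER //
    CREATE TRIGGER digit_strings_check
    BEFORE INSERT ON digit_strings
    FOR EACH ROW
    BEGIN
      -- reject any value containing characters outside the allowed alphabet
      IF NEW.digits NOT REGEXP '^[0-9,]*$' THEN
        SIGNAL SQLSTATE '45000'
          SET MESSAGE_TEXT = 'digits may contain only digits and commas';
      END IF;
    END//
    DELIMITER ;

The same pattern with a BEFORE UPDATE trigger would cover updates as well.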
For SQL you can go with VARCHAR(), and if you're using PostgreSQL you can make use of something known as citext.

Store a 300-digit number in SQL

Which data type can I use to store a really big integer in SQL? I am using phpMyAdmin to view the data and a Java program for storing and retrieving values. I am working with bilinear maps, which use random numbers generated from Zp, where p is a very large prime, and then "raised to" (exponentiation) operations on those numbers.
I want to store some of those numbers, such as public keys, in the database. What data type can I use for such values in SQL table columns?
You could store them as strings of decimal digits using type CHARACTER. While this does waste some space, an advantage is that the database will be easier for humans to understand.
You could store them as raw binary big-endian values using type BLOB. This is the most efficient for software to access and takes up the least space. However, humans will not be able to easily query the database for these values or understand them in dumps.
Personally, I would opt for the blob unless there's a real need for the database to be understandable by humans using standard query tools. If you can't get around needing to administer the database with tools that don't understand your data format, then just use decimal values in text.
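To illustrate the two options side by side, here is a sketch with placeholder names; the binary column size assumes numbers of up to 300 decimal digits, which need roughly 125 bytes:

    CREATE TABLE key_storage_demo (
      -- option 1: human-readable decimal digits
      key_decimal CHAR(300) CHARACTER SET ascii,
      -- option 2: raw big-endian bytes (about 997 bits for 300 digits)
      key_binary  VARBINARY(128)
    );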
For MySQL, VARCHAR(300) CHARACTER SET ascii.
VAR, because the numbers won't always be exactly 300 digits long.
CHAR (character data), because there is no big advantage to BLOB here.
ascii, because there is no need to involve utf8.
DECIMAL won't work because it has a 65-digit limit.
The space taken will be 2 + length bytes (302 in your example), where the 2 bytes store the length of the VAR value.

Database (MySQL) data type choice: Text vs Binary

What are the tradeoffs in choosing a TEXT datatype as compared to a BLOB or BINARY datatype? I don't intend to index on the column or use it in a WHERE clause, it's just data in the database, that happens to be textual. If there's a performance or storage advantage to my choice of datatype though, that would be good to know... Thanks!
Short answer: If it's text, use the TEXT datatype.
Long answer: TEXT columns are treated as text, i.e. they have an assigned character set. If you do comparisons between TEXT values, it will use your character set's collation rather than just comparing their numeric values. BLOB columns, on the other hand, are just arrays of bytes; they have no defined character set. If you're storing unicode (or other 'wide character' encodings), you will definitely want to use TEXT, since your data will be basically opaque to the db unless you do.
They can both store the same amount of data (depending on which of the subtypes you're using, of course), but you might see a small performance gain in using BLOB for binary data since there won't be any text-related processing going on with it.
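A quick illustration of that collation point (assuming the session uses a case-insensitive collation, which is the MySQL default):

    SELECT 'abc' = 'ABC';                                  -- 1: text comparison uses the collation, so case is ignored
    SELECT CAST('abc' AS BINARY) = CAST('ABC' AS BINARY);  -- 0: binary values compare byte by byte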

Storing very large integers in MySQL

I need to store a very large number (tens of millions) of 512-bit SHA-2 hashes in a MySQL table. To save space, I'd like to store them in binary form rather than as strings of hex digits. I'm using an ORM (DBIx::Class), so the specific details of the storage will be abstracted from the code, which can inflate them into any object or structure that I choose.
MySQL's BIGINT type is 64 bits. So I could theoretically split the hash up amongst eight BIGINT columns. That seems pretty ridiculous though. My other thought was just using a single BLOB column, but I have heard that they can be slow to access due to MySQL's treating them as variable-length fields.
If anyone could offer some wisdom that will save me a couple of hours of benchmarking various methods, I'd appreciate it.
Note: Automatic -1 to anyone who says "just use postgres!" :)
Have you considered BINARY(64)? See the MySQL documentation on the BINARY type.
Use the type BINARY(64)?
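A minimal sketch of that suggestion (the table name, and using the hash itself as the primary key, are assumptions):

    CREATE TABLE sha512_hashes (
      hash BINARY(64) NOT NULL,  -- 512 bits stored as 64 raw bytes
      PRIMARY KEY (hash)
    );

    -- converting between hex strings and raw bytes at the SQL level:
    INSERT INTO sha512_hashes (hash) VALUES (UNHEX(SHA2('example payload', 512)));
    SELECT HEX(hash) FROM sha512_hashes;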