How to insert large protein sequences in a table in MySQL?

I am creating a protein database, which consists of large protein sequences. I have created four tables, one of which stores genomeID, sequenceID, name, and sequence. I can easily store the other values using MySQL's INSERT INTO command, but I am stuck on inserting these large protein sequences in FASTA format, because one genome has thousands of peptide sequences.
I want to figure out a way where I can store these sequences using MySQL.
Any help would be appreciated!
Thanks

Genomes can contain billions of characters, so a single column might seem infeasible, but MySQL's large text types go well beyond the ~65,535-character ceiling of VARCHAR: MEDIUMTEXT holds up to 16 MB per value and LONGTEXT up to 4 GB, which is plenty for individual protein sequences. If your sequences outgrow that, or you want to scale across machines, a NoSQL key-value store for the sequences is a reasonable alternative.
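If you stay in MySQL, a minimal sketch could look like this (table and column names are illustrative, not from the original schema):

create table PROTEIN_SEQUENCE
(
SEQUENCE_ID int not null auto_increment primary key,
GENOME_ID int not null,
NAME varchar(255) not null,
SEQUENCE mediumtext not null -- one FASTA record body; use longtext for very long sequences
);
insert into PROTEIN_SEQUENCE (GENOME_ID, NAME, SEQUENCE)
values (1, 'example_peptide_1', 'MKLPKRML');

Since one genome has thousands of peptides, multi-row INSERT statements (or LOAD DATA INFILE from a pre-parsed file) will load far faster than one INSERT per sequence.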

Related

Better to have multiple columns or single column with multiple values in CSV?

I have a CSV file that contains grain size analysis data. The type of data isn't particularly important for my question - I think the question applies to spreadsheets of data values in general. One of the columns ("mode in phi") that is returned from the lab analysis can contain multiple values if the sample is multi-modal. Usually the highest number of mode values is 3.
Is it better to store the values as a list in a single column or multiple columns with a single value in each column (with "NA" when necessary) for this type of data structure? Is there another option I'm unaware of?
Pros and cons I've considered:
Single column pros: nice to have a single column, values are separated with a semicolon so they're easily distinguished from comma-delimited columns and could be parsed programmatically.
Single column cons: less machine readable because the cell is read as a string rather than numbers.
Multi-column pros: each cell has a single value so it's easily read.
Multi-column cons: how would a user/machine know how many columns of "mode" there will be? It could differ between datasets, could potentially expand to many columns, and produces lots of "NA" values. (Both layouts are sketched below.)
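For concreteness, here is what the two layouts might look like (the values are made up):

Single column, semicolon-separated:
sample_id,mode_in_phi
S1,"2.5;3.0;4.25"
S2,"3.5"

Multiple columns, padded with NA:
sample_id,mode_in_phi_1,mode_in_phi_2,mode_in_phi_3
S1,2.5,3.0,4.25
S2,3.5,NA,NA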
After Googling, I saw this SO post and read about first normal form (1NF), but I'm not sure whether 1NF applies to a single CSV file rather than a relational database. Are there any other standards or recommendations for CSVs of single data files?
I know there are lots of similar questions on SO, but mostly about how to split the multiple values or questions specific to databases. I couldn't find much particular to a single CSV.

What is the effect of storing a lot of rows in a MySQL table (maybe more than one billion) for a search engine?

I want to store more than one billion rows in my search engine's database.
I think that once the row count grows past a certain point, queries come under pressure and results become slow to produce.
What is the best way to store a lot of data in MySQL? One approach I tested was splitting the data across several tables instead of using just one, but that forced me to write complicated methods for fetching the data from the different tables on each search.
What is the effect of storing a lot of rows in a MySQL table (maybe more than one billion) for a search engine, and what can we do to increase the number of rows without hurting fetch speed?
I think Google and other search engines use a technique to prepare some query sets in advance and produce results based on a ranking algorithm; is this true?
I experimented with storing the data in split tables for better efficiency, and it worked, but fetching became complicated.
Please suggest a technical approach for storing more data in one table with minimal resource usage.
In any database, the primary methods for accessing large tables are:
Proper indexing
Partitioning (i.e. storing single tables in multiple "files").
These two techniques usually suffice for data of this size (a sketch follows the list below).
There are some additional techniques that depend on the data and database. Two that come to mind are:
Vertical/column partitioning: storing separate columns in separate table spaces.
Data compression for wide columns.
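As a rough illustration of the first two techniques, indexing and partitioning, here is a sketch (assuming MySQL 8; the table, columns, and partition count are illustrative):

create table SEARCH_DOCS
(
DOC_ID bigint not null,
TERM varchar(64) not null,
SCORE double not null,
primary key (DOC_ID, TERM), -- index matching the main access path
key IDX_TERM (TERM) -- secondary index for term lookups
)
partition by hash(DOC_ID) partitions 64; -- spreads one logical table over 64 physical pieces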
You can easily store billions of rows in MySQL. For fast handling you should index properly; splitting into multiple tables is mainly for normalization (column-wise distribution), not for row-wise distribution.
This earlier link may help: How to store large number of records in MySql database?

Advice for storing DNA sequence data in MySQL

I am creating a database that will store DNA sequence data, which are strings like this: 'atcgatcgatcg' and protein sequence data that are also strings like this: 'MKLPKRML'.
I am a beginner at MySQL administration, and I want to ask about the proper configuration of these columns in terms of data types, character set, and collation. There will be around one million DNA and protein sequence rows, and I want string comparisons to perform as well as possible.
I've been reading about this problem and have reached these conclusions and doubts:
I could use a VARCHAR at its maximum length, because my strings will not exceed 65,535 characters (though in MySQL that limit applies to the whole row, so a TEXT type may be safer).
BLOB field comparisons are faster. Is BLOB better than VARCHAR in this case? I am also thinking about issues with retrieving the data, because the retrieved values must be strings, not bytes.
Is it better to use latin-1 instead of utf-8? I am only storing alphabetic characters, with no special characters.
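For concreteness, this is the kind of definition I am considering (a sketch, assuming MySQL and sequences under 16 MB; all names are illustrative):

create table SEQUENCES
(
SEQ_ID int not null auto_increment primary key,
DNA_SEQ mediumtext character set ascii collate ascii_bin, -- 1 byte per character, fast byte-wise comparisons
PROT_SEQ mediumtext character set ascii collate ascii_bin
);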
Thank you for your help!

How to store a big matrix (data frame) that can be subsetted easily later

I will generate a big matrix (data frame) in R, of size about 1,300,000 x 10,000, around 50 GB. I want to store this matrix in an appropriate format so that later I can feed the data into Python or other programs for analysis. Of course I cannot feed it in all at once, so I have to subset the matrix and feed it in little by little.
But I don't know how to store the matrix. I can think of two ways, but neither seems appropriate:
(1) plain text (including CSV or an Excel table), because it is very hard to subset (e.g. if I just want some columns and some rows of the data);
(2) a database; I have searched for information about MySQL and SQLite, but it seems the number of columns in a SQL database is limited (around 1,024).
So I just want to know whether there are any good strategies for storing the data so that I can subset it by row/column indexes or names.
Have a separate column for each of the few columns you need to search or filter on. Then put the entire 10K columns into some data format that is convenient for the client code to parse; JSON is one common possibility.
So the table would have 1.3M rows and perhaps 3 columns: an id (AUTO_INCREMENT, PRIMARY KEY), the column you search on, and the JSON blob, with datatype JSON or TEXT (depending on your MySQL version) holding the many data values.
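A sketch of that layout (assuming MySQL 5.7+ for the JSON type; names are illustrative):

create table MATRIX_ROWS
(
ID int not null auto_increment primary key,
ROW_NAME varchar(64) not null,
ROW_DATA json not null, -- the ~10K values for this row, stored as a JSON array
key IDX_ROW_NAME (ROW_NAME)
);
-- fetch the 43rd value (0-based index 42) of one row:
select json_extract(ROW_DATA, '$[42]') from MATRIX_ROWS where ROW_NAME = 'row_000001';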

Storing large prime numbers in a database

This problem struck me as a bit odd. I'm curious how you could represent a list of prime numbers in a database. I do not know of a single datatype that would be able to accurately and consistently store a large number of primes. My concern is that once the primes run to thousands of digits, it might be difficult to reference them from the database. Is there a way to represent a large set of primes in a DB? I'm quite sure this topic has been approached before.
One of the issues that makes this difficult is that prime numbers cannot be broken down into factors. If they could, this problem would be much easier.
If you really want to store primes as numbers, and one of the things stopping you is that "prime numbers can not be broken down into factors", there is another option: store each prime as its list of digits in some large base (i.e. successive remainders), ordered by position.
Small example:
2831781 == 2*100^3 + 83*100^2 + 17*100^1 + 81*100^0
List is:
81, 17, 83, 2
In a real application it is useful to split by a modulus of 2^32 (32-bit integers), especially if the application that processes the primes stores them as byte arrays.
Storage in DB:
create table PRIMES
(
PRIME_ID NUMBER not null,
PART_ORDER NUMBER(20) not null,
PRIME_PART_VALUE NUMBER not null
);
alter table PRIMES
add constraint PRIMES_PK primary key (PRIME_ID, PART_ORDER) using index;
Inserts for the example above (the id 1647 is arbitrary):
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 0, 81);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 1, 17);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 2, 83);
insert into primes(PRIME_ID, PART_ORDER, PRIME_PART_VALUE) values (1647, 3, 2);
The prime_id value can be assigned from an Oracle sequence:
create sequence seq_primes start with 1 increment by 1;
Get ID of next prime number to insert:
select seq_primes.nextval from dual;
Select the content of the prime number with a specified id:
select PART_ORDER, PRIME_PART_VALUE
from primes where prime_id = 1647
order by part_order;
You could store them as binary data. They won't be human readable straight from the database, but that shouldn't be a problem.
Databases (depending on which) can routinely store numbers up to 38-39 digits accurately. That gets you reasonably far.
Beyond that you won't be doing arithmetic operations on them (accurately) in databases (barring arbitrary-precision modules that may exist for your particular database). But numbers can be stored as text up to several thousand digits. Beyond that you can use CLOB type fields to store millions of digits.
Also, it's worth noting that if you're storing sequences of prime numbers and your interest is in space-compression of that sequence, you could start by storing the difference between one number and the next rather than the whole number.
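A sketch of that gap idea (assuming MySQL 8 for the window function; names are illustrative):

create table PRIME_GAPS
(
SEQ_NO bigint not null primary key, -- position in the sequence of primes
GAP int not null -- difference from the previous prime; the first row stores 2 itself
);
insert into PRIME_GAPS values (1, 2), (2, 1), (3, 2), (4, 2), (5, 4); -- encodes 2, 3, 5, 7, 11
-- reconstruct: a running sum of the gaps yields the primes back
select SEQ_NO, sum(GAP) over (order by SEQ_NO) as PRIME_VALUE from PRIME_GAPS;

Gaps stay tiny relative to the primes themselves, so each row stores a few digits instead of thousands.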
This is a bit inefficient, but you could store them as strings.
If you are not going to use database-side calculations with these numbers, just store them as bit sequences of their binary representation (BLOB, VARBINARY etc.)
Here's my 2 cents worth. If you want to store them as numbers in a database then you'll be constrained by the maximum size of integer that your database can handle. You'd probably want a 2-column table, with the prime number in one column and its sequence number in the other. Then you'd want some indexes to make finding the stored values quick.
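Something along these lines (a sketch, assuming MySQL, where DECIMAL(65) is the widest exact numeric type available):

create table PRIMES
(
SEQ_NO bigint not null primary key, -- 1 for the 1st prime, 2 for the 2nd ...
PRIME_VALUE decimal(65) not null, -- exact integers up to 65 digits
unique key IDX_PRIME_VALUE (PRIME_VALUE) -- fast lookup of a candidate value
);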
But you don't really want to do that, do you? You want to store humongous (sp?) primes way beyond any integer datatype you've even thought of yet. And you say that you are averse to strings, so it's binary data for you. (It would be for me too.) Yes, you could store them in a BLOB in a database, but what sort of facilities will the DBMS offer you for finding the n-th prime or checking the primality of a candidate integer?
How to design a suitable file structure? This is the best I could come up with after about 5 minutes' thinking:
Set a counter to 2.
Write the 2-bit primes in order (there are two: 2 = 10 and 3 = 11).
Write the last of them again, to mark the end of the section containing the 2-bit primes.
Set the counter to counter + 1.
Write the 3-bit primes in order (there are two: 5 and 7).
Write the last of the 3-bit primes again, to mark the end of the section containing the 3-bit primes.
Go back to step 4 and carry on, mutatis mutandis.
The point about writing the last n-bit prime twice is to provide you with a means to identify the end of the part of the file with n-bit primes in it when you come to read the file.
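For concreteness, the start of such a file would read (spaces added for readability only):
10 11 11  101 111 111  1011 1101 1101 ...
that is, the 2-bit primes 2 and 3 with 3 repeated as the section marker, the 3-bit primes 5 and 7 with 7 repeated, then the 4-bit primes 11 and 13 with 13 repeated, and so on.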
As you write the file, you'll probably also want to make note of the offsets into the file at various points, perhaps the start of each section containing n-bit primes.
I think this would work, and it would handle primes up to 2^(the largest unsigned integer you can represent). I guess it would be easy enough to find code for translating a 325467-bit (say) value into a big integer.
Sure, you could store this file as a BLOB but I'm not sure why you'd bother.
It all depends on what kinds of operations you want to do with the numbers. If you just store and look them up, then use strings with a check constraint / domain datatype to enforce that they are numbers. If you want more control, then PostgreSQL will let you define custom datatypes and functions: you can, for instance, interface with the GMP library to get correct ordering and arithmetic for arbitrary-precision integers. Using such a library will even let you implement a check constraint that uses a probabilistic primality test to check that the numbers really are prime.
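A minimal sketch of the strings-plus-check-constraint option (PostgreSQL; the names are illustrative):

-- digits-only text, so primes of any length fit
create domain numeric_text as text
check (value ~ '^[0-9]+$');

create table primes
(
seq_no bigint primary key,
prime_value numeric_text not null
);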
The real question is actually whether a relational database is the correct tool for the job.
I think you're best off using a BLOB. How the data is stored in your BLOB depends on your intended use of the numbers. If you want to use them in calculations, I think you'll need to create a class or type to store the values as some variety of ordered binary value and allow them to be treated as numbers, etc. If you just need to display them, then storing them as a sequence of characters would be sufficient, and would eliminate the need to convert your calculable values to something displayable, which can be very time consuming for large values.
Share and enjoy.
Probably not brilliant, but what if you stored them in some recursive data structure? You could store a prime as an int, its exponent, and a reference to the lower-bit numbers.
Like the string idea, it probably wouldn't be very good for memory. And query time would be increased due to the recursive nature of the query.