Time series database for scientific experiments - MySQL

I have to perform scientific experiments using time series.
I intend to use MySQL as the data storage platform.
I'm thinking of using the following set of tables to store the data:
Table1 --> ts_id (stores the time series identifiers; I will have to deal with several time series)
Table2 --> ts_id, obs_date, value (should be indexed by {ts_id, obs_date})
Because there will be many time series (hundreds) each with possibly millions of observations, table 2 may grow very large.
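In DDL terms, that initial two-table layout would be roughly the following (a sketch only; the column types are my assumptions):
CREATE TABLE Table1 (
    ts_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE Table2 (
    ts_id    INT UNSIGNED NOT NULL,
    obs_date DATETIME     NOT NULL,
    value    DOUBLE,
    PRIMARY KEY (ts_id, obs_date),
    FOREIGN KEY (ts_id) REFERENCES Table1(ts_id)
);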
The problem is that I have to replicate this experiment several times, so I'm not sure what would be the best approach:
1) add an experiment_id to the tables and let them grow even larger, or
2) create a separate database for each experiment.
If option 2 is better (I personally think so), what would be the best logical way to do this? I have many different experiments to perform, each needing replication. If I create a different database for every replication, I'd end up with hundreds of databases pretty soon. Is there a way to organize them logically, such as making each replication a "sub-database" of its experiment's master database?

You might want to start out by considering how you will need to analyze your data.
Presumably your analysis will need to know about the experiment name, the experiment replicate number, and internal replicates (e.g. at each timepoint there are 3 "identical" subjects measured for each treatment). So your db schema might be something like this:
CREATE TABLE experiments (
    exp_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    exp_name VARCHAR(45)
    -- other fields that any kind of experiment can have
);
CREATE TABLE replicates (
    rep_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    exp_id INT UNSIGNED NOT NULL,
    -- other fields that any kind of experiment replica can have
    FOREIGN KEY (exp_id) REFERENCES experiments (exp_id)
);
CREATE TABLE subjects (
    subject_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    subject_name VARCHAR(45)
    -- other fields that any kind of subject can have
);
CREATE TABLE observations (
    ob_id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    rep_id     INT UNSIGNED NOT NULL,
    subject_id INT UNSIGNED NOT NULL,
    ob_time    TIMESTAMP,
    -- other fields to hold the measurements you make at each timepoint
    FOREIGN KEY (rep_id) REFERENCES replicates (rep_id),
    FOREIGN KEY (subject_id) REFERENCES subjects (subject_id)
);
If you have internal replicates you'll need another table to hold the internal replicate/subject relationship.
Don't worry about your millions of rows. As long as you index sensibly, there likely won't be any problems. But if worst comes to worst, you can always partition your observations table (likely to be the largest) by rep_id.
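A rough sketch of that partitioning, assuming the observations table above. Note that MySQL requires the partitioning column to be part of every unique key (so the primary key must become (ob_id, rep_id)), and partitioned InnoDB tables cannot have foreign keys, so those would have to be dropped first or enforced in the application:
ALTER TABLE observations
    DROP FOREIGN KEY observations_ibfk_1,   -- default constraint names; check yours with SHOW CREATE TABLE
    DROP FOREIGN KEY observations_ibfk_2,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (ob_id, rep_id);
ALTER TABLE observations
    PARTITION BY HASH (rep_id) PARTITIONS 16;   -- 16 partitions is an arbitrary choice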

Should you have more than one data base, one for each experiment?
The answer to your question hinges on your answer to this question: Will you want to do a lot of analysis that compares one experiment to another?
If you will do a lot of experiment-to-experiment comparison, it will be a horrendous pain in the neck to have a separate data base for every experiment.
I think your suggestion of an experiment ID column in your observation table is a fine idea. That way you can build an experiment table with an overall description of your experiment. That table can also hold the units of observation in your value column (e.g. temp, voltage, etc).
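As a sketch of that single-database layout (the description and units columns are my assumptions; the other names follow the question):
CREATE TABLE experiment (
    experiment_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description   VARCHAR(255),
    units         VARCHAR(45)    -- e.g. 'temp', 'voltage'
);
CREATE TABLE observation (
    experiment_id INT UNSIGNED NOT NULL,
    ts_id         INT UNSIGNED NOT NULL,
    obs_date      DATETIME NOT NULL,
    value         DOUBLE,
    PRIMARY KEY (experiment_id, ts_id, obs_date),
    FOREIGN KEY (experiment_id) REFERENCES experiment (experiment_id)
);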
If you have some kind of complex organization of your multiple experiments, you can store that organization in your experiment table.
Notice that MySQL is quite efficient at slinging around short rows of data. You can buy a nice server for the cost of a few dozen hours of your labor, or rent one on a cloud service for the cost of a few hours of labor.
Notice also that MySQL offers the MERGE storage engine (http://dev.mysql.com/doc/refman/5.5/en/merge-storage-engine.html). This allows a bunch of different tables with the same column structure to be accessed as if they were one table. It would let you store results from individual experiments, or groups of them, in their own tables and then access them together. If you have problems scaling up your data collection system, you may want to consider this. The good news is that you can get your database working first and convert to it later.
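A minimal sketch of how that could look (the table names are made up; MERGE requires the underlying tables to be MyISAM with identical column definitions):
CREATE TABLE obs_exp1 (
    ts_id    INT UNSIGNED NOT NULL,
    obs_date DATETIME     NOT NULL,
    value    DOUBLE,
    INDEX (ts_id, obs_date)
) ENGINE=MyISAM;
CREATE TABLE obs_exp2 LIKE obs_exp1;
-- one logical table over both experiments' observations
CREATE TABLE obs_all (
    ts_id    INT UNSIGNED NOT NULL,
    obs_date DATETIME     NOT NULL,
    value    DOUBLE,
    INDEX (ts_id, obs_date)
) ENGINE=MERGE UNION=(obs_exp1, obs_exp2) INSERT_METHOD=LAST;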
Another question: why do you have a table with nothing but ts_id values in it? I don't get that.

Related

Multiple possible data types for the same attribute: null entries, EAV, or store as varchar?

I'm creating a database for combustion experiments. Each experiment has some scientific metadata which I call 'details'. For example ('Fuel', 'C2H6') or ('Pressure', 120). Because the same detail names (like 'Fuel') show up a lot, I created a table just to store the names and units. Here's a simplified version:
CREATE TABLE properties (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(50) NOT NULL,
units NVARCHAR(15) NOT NULL DEFAULT 'dimensionless'
);
I also created a table called 'details' which maps 'properties' to values.
CREATE TABLE details (
id INT AUTO_INCREMENT PRIMARY KEY,
property_id INT NOT NULL,
value VARCHAR(30),
FOREIGN KEY(property_id) REFERENCES properties(id)
);
This isn't ideal because the value attribute is sometimes a chemical name and sometimes a float. In the future, there may even be new entries that have integer values. Storing everything in a VARCHAR seems wasteful. Since it'll be hard to change later, I want to make the right decision now.
I've been researching this for hours and have considered four options:
(1) Store everything as VARCHAR under value (simplest to develop).
(2) Use an EAV model (most complicated to develop).
(3) Create a column for each type (value_float, value_int, value_char) and have plenty of NULL entries.
(4) Use the JSON datatype.
Looking into each one, it seems like they're all bad in different ways. (1) is bad since it takes up extra space and I have to do extra operations to parse strings into numeric values. (2) is bad because of the huge increase in complexity (four extra tables and a lot more join operations), plus I hear EAV is to be avoided. (3) is a middle-ground in complexity, but there will be two NULL values for each table entry. (4) seems similar to (1), and I'm not sure how it might be better or worse.
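For concreteness, option (3) would look something like this, shown as a standalone alternative to the details table above (a sketch only, reusing the column names from the list):
CREATE TABLE details (
    id INT AUTO_INCREMENT PRIMARY KEY,
    property_id INT NOT NULL,
    value_float DOUBLE NULL,
    value_int   INT NULL,
    value_char  VARCHAR(30) NULL,
    FOREIGN KEY (property_id) REFERENCES properties(id)
);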
I don't expect to have huge growth on this database or millions of entries. It just needs to be fast and searchable for researchers. I'm willing to have more backend complexity for a better/faster user experience.
By now I realize that there aren't that many clear-cut answers in database design. I'm simply asking for some insight into these four options, or perhaps another option I haven't thought of.
EDIT: Added JSON as an option.
Well, you have to sacrifice something: either disk space, or performance, or the specific-vs-general dimension, or the easy-vs-complex-to-develop dimension. Choose a mix suitable for your needs and situation. - I solved this back in 2000 with a generalized EAV-style solution: the basic record had the common properties shared by the majority of events, then joins to flag-like properties without values (an associative table), and the very specific properties/values I stored in a BLOB in XML-like tags. This way I combined frequent properties with the very specific ones. As that was intended as a VERY GENERAL solution, you probably don't need it; I'd sacrifice space, it's cheap today. Who cares if you take more space than is "correct according to data modeling theory"? OK, the data model will be ugly, so what? - You'll still need to decide on the specific-vs-general dimension, i.e. how specific attributes will be handled: either as dedicated columns (yes, if they are repeated often) or in a Property/TypeOfProperty/Value kind of table.
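Very roughly, the hybrid that comment describes might look like this (all table and column names here are invented for illustration; only the properties table comes from the question):
CREATE TABLE event (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    fuel      VARCHAR(30),   -- frequent property gets a real column
    pressure  DOUBLE,        -- frequent property gets a real column
    extra_xml TEXT           -- rare, very specific properties stored as '<tag>value</tag>' text
);
CREATE TABLE event_flag (
    event_id    INT UNSIGNED NOT NULL,
    property_id INT NOT NULL,            -- the property is present; no value needed
    PRIMARY KEY (event_id, property_id),
    FOREIGN KEY (event_id) REFERENCES event(id),
    FOREIGN KEY (property_id) REFERENCES properties(id)
);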

MySQL: Which is smaller, storing 2 sets of similar data in 1 table vs 2 tables (+indexes)?

I was asked to optimize (size-wise) statistics system for a certain site and I noticed that they store 2 sets of stat data in a single table. Those sets are product displays on search lists and visits on product pages. Each row has a product id, stat date, stat count and stat flag columns. The flag column indicates if it's a search list display or page visit stat. Stats are stored per day and product id, stat date (actually combined with product ids and stat types) and stat count have indexes.
I was wondering if it's better (size-wise) to store those two sets as separate tables or keep them as a single one. I presume that the part which would make a difference would be the flag column (let's say it's a 1-byte TINYINT) and the indexes. I'm especially interested in how the space taken by the indexes would change in the 2-table scenario. The table in question already has a few million records.
I'll probably do some tests when I have more time, but I was wondering if someone had already challenged a similar problem.
Ordinarily, if two kinds of observations are conformable, it's best to keep them in a single table. By "conformable," I mean that their basic data is the same.
It seems that your observations are indeed conformable.
Why is this?
First, you can add more conformable observations trivially easily. For example, you could add sales to search-list and product-page views, by adding a new value to the flag column.
Second, you can report quite easily on combinations of the kinds of observations. If you separate these things into different tables, you'll be doing UNIONs or JOINs when you want to get them back together.
Third, when indexing is done correctly the access times are basically the same.
Fourth, the difference in disk space usage is small. You need indexes in either case.
Fifth, the difference in disk space cost is trivial. You have several million rows, or in other words, a dozen or so gigabytes. The highest-quality Amazon Web Services storage costs about US$ 1.00 per year per gigabyte. It's less than the heat for your office will cost for the day you will spend refactoring this stuff. Let it be.
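To illustrate the second point, here is what getting both kinds of stats back looks like in each design (the two-table names and the stat_type codes are assumptions; the single table matches the stat_test schema shown below):
-- single table: both kinds of observations are just a filter
SELECT id_off, stat_date, stat_count, stat_type
FROM stat_test
WHERE stat_type IN (1, 2);    -- 1 = search-list display, 2 = page visit (assumed codes)
-- two tables: the same report needs a UNION
SELECT id_off, stat_date, stat_count, 1 AS stat_type FROM stat_test_displays
UNION ALL
SELECT id_off, stat_date, stat_count, 2 AS stat_type FROM stat_test_visits;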
Finally I got a moment to conduct a test. It was just a small scale test with 12k and 48k records.
The table that stored both types of data had following structure:
CREATE TABLE IF NOT EXISTS `stat_test` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
`stat_type` tinyint(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`,`stat_type`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
The other two tables had this structure:
CREATE TABLE IF NOT EXISTS `stat_test_other` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
In the case of 12k records, the 2 separate tables were actually slightly bigger than the one storing everything, but in the case of 48k records, the two tables were smaller, and by a noticeable amount.
In the end I didn't split the data into two tables to solve my initial space problem. I managed to considerably reduce the size of the database by removing the redundant id_off index and adjusting the data types (in most cases an unsigned SMALLINT was more than enough to store all the values I needed). Note that originally stat_type was also of type INT, and for this column an unsigned TINYINT was enough. All in all this reduced the size of the database from 1.5GB to 600MB (and my limit was just 2GB for the database). Another advantage of this solution was that I didn't have to modify a single line of code to make everything work (since the site was written by someone else, I didn't have to spend hours trying to understand the source code).
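The changes described amount to something like this (a sketch; which columns can safely shrink depends on the actual value ranges):
ALTER TABLE stat_test
    DROP KEY id_off,                                   -- redundant: the primary key already starts with id_off
    MODIFY id_off     SMALLINT UNSIGNED NOT NULL,
    MODIFY stat_count SMALLINT UNSIGNED NOT NULL,
    MODIFY stat_type  TINYINT UNSIGNED NOT NULL;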

If threads are getting 3000 posts each is it maybe better to make a new table per thread?

There are 12 million posts already and people seem to be using threads as a chat. I don't know if it's more efficient to have a bunch of little tables than to have the database scan for the last 10 messages in a table with so many entries. I know I'd have to benchmark, but I'm just asking if anyone has observations or anecdotes from similar situations.
Edit: added the schema:
create table reply(
id int(11) unsigned not null auto_increment,
thread_id int(10) unsigned not null default 0,
ownerId int(9) unsigned not null default 0,
ownerName varchar(20),
profileId int(9) unsigned,
profileName varchar(50),
creationDate dateTime,
ip int unsigned,
pic varchar(255) default '',
reply text,
index(thread_id),
primary key(id)) ENGINE=MyISAM;
It's not a good idea to use variable table names. If you've indexed the columns that would be turned into separate tables, the database will do a better job using the index than you can do by creating separate tables. That's what the database was designed for.
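For example, fetching the latest posts in a thread stays an index range scan no matter how big the table gets, especially with a composite index (the index name and thread id below are made up):
ALTER TABLE reply ADD INDEX idx_thread_id_id (thread_id, id);
SELECT id, ownerName, creationDate, reply
FROM reply
WHERE thread_id = 12345      -- hypothetical thread
ORDER BY id DESC             -- newest first, since id is AUTO_INCREMENT
LIMIT 10;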
I assume that "thread" here means thread in a pool of postings.
The way you are going to get long-term scalability here is to develop an architecture in which you can have multiple database instances, and avoid having queries that need to be performed across all instances.
Creating multiple tables on the same DB won't really help in terms of scalability. (In fact, it might even reduce throughput ... due to increasing the load on the DB's caches.) But it sounds like in your application you could partition into "pools" of messages in different databases, provided that you can arrange that a reply to a message goes into the same pool as the message it replies to.
The problem that arises is that certain things will involve querying across data in all DB instances. In this case, it might be listing all of a user's messages, or doing a keyword search. So you really have to look at the entire picture to figure out how best to achieve a partitioning. You need to analyze all of the queries, taking account of their relative frequencies. And at the end of the day, the solution might involve denormalizing the schema so that the database can be partitioned.
Dynamic tables are typically a very bad idea in a relational schema. Key/value stores make different trade-offs, so some are better at things like dynamic tables, but at the cost of things like weak data integrity/consistency guarantees. You don't appear to have defined any foreign key references, and you're using MyISAM, so data integrity/reliability probably isn't a priority; the important thing to understand is that different designs are good at different things, so what's good design for one DB can be bad design for another DB.
I can't help with much else as I focus on Pg and this is a MySQL question. Untagging.
(Note that in PostgreSQL at least, many operations on the relation set are O(n), so huge numbers of relations can be quite harmful.)

mysql table structure proposal?

Is this table design any good for MySQL? I wanted to make it flexible for future storage of this type of data. With this table structure you can't use a PRIMARY KEY, only an index ...
Should I change the format of the table to have headers like Primary Key, Width, Length, Space, Coupling ...?
ID_NUM Param Value
1 Width 5e-081
1 Length 12
1 Space 5e-084
1 Coupling 1.511
1 Metal Layer M3-0
2 Width 5e-082
2 Length 1.38e-061
2 Space 5e-081
2 Coupling 1.5
2 Metal Layer M310
No, this is a bad design for a relational database. This is an example of the Entity-Attribute-Value design. It's flexible, but it breaks most rules of what it means to be a relational database.
Before you descend into the EAV design as a solution for a flexible database, read this story: Bad CaRMa.
More specifically, some of the problems with EAV include:
You don't know what attributes exist for any given ID_NUM without querying for them.
You can't make any attribute mandatory, the equivalent of NOT NULL.
You can't use database constraints.
You can't use SQL data types; the value column must be a long VARCHAR.
Particularly in MySQL, each VARCHAR is stored on its own data page, so this is very wasteful.
Queries are also incredibly complex when you use the EAV design. Magento, an open-source ecommerce platform, uses EAV extensively, and many users say it's very slow and hard to query if you need custom reports.
To be relational, you should store each different attribute in its own column, with its own name and an appropriate datatype.
I have written more about EAV in my presentation Practical Object-Oriented Models in SQL and in my blog post EAV FAIL, and in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
What you suggest is called the EAV model (Entity–Attribute–Value).
It has several drawbacks, like severe difficulties in enforcing referential integrity constraints. In addition, the queries you'll have to come up with will be a bit more complicated than with a normalized table, as in your second suggestion (a table with columns: Primary Key, Width, Length, Space, Coupling, etc).
So, for a simple project, do not use EAV model.
If your plans are for a more complex project and you want maximum flexibility, do not use EAV either. You should look into 6NF (6th Normal Form), which is even harder to implement and certainly not an easy task in MySQL. But if you succeed, you'll have the best of both worlds: flexibility and normalization to the highest level (some people describe EAV as "6NF done wrongly").
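Very roughly, a 6NF-style decomposition of the example data would mean one narrow table per attribute, all keyed by the entity id (a sketch, with invented table names):
CREATE TABLE item (
    id_num INT PRIMARY KEY
);
CREATE TABLE item_width (
    id_num INT PRIMARY KEY,
    width  DOUBLE NOT NULL,
    FOREIGN KEY (id_num) REFERENCES item(id_num)
);
CREATE TABLE item_metal_layer (
    id_num      INT PRIMARY KEY,
    metal_layer VARCHAR(20) NOT NULL,
    FOREIGN KEY (id_num) REFERENCES item(id_num)
);
-- ...and one such table each for Length, Space and Coupling.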
In my experience, this idea of storing fields row-wise needs to be considered extremely carefully; although it seems to give many advantages, it makes many common tasks much more difficult.
On the positive side: It is easily extensible without changes to the structure of the database and in some ways abstracts the details of the data storage.
On the negative side: you need to look at all the everyday things that storing fields column-wise gives you automatically in the DBMS: simple inner/outer joins, one-statement inserts/updates, uniqueness, foreign keys and other db-level constraint checking, and simple filtering and ordering of search results.
Consider in your architecture a query to return all items with MetalLayer = X and Width between y and z, with the results sorted by Coupling, then by Length. This query is much harder for you to construct, and much, much harder for the DBMS to execute, than it would be using columns to store specific fields.
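To see why, compare the two forms of that query (all table and column names below are invented, and the bounds are placeholders):
-- column-per-field design: one table, a plain WHERE and ORDER BY
SELECT *
FROM parts
WHERE metal_layer = 'M3-0'
  AND width BETWEEN 4e-08 AND 6e-08
ORDER BY coupling, length;
-- row-wise (EAV) design: one self-join per attribute, plus casts on the text values
SELECT w.id_num
FROM part_params AS ml
JOIN part_params AS w ON w.id_num = ml.id_num AND w.param = 'Width'
JOIN part_params AS c ON c.id_num = ml.id_num AND c.param = 'Coupling'
JOIN part_params AS l ON l.id_num = ml.id_num AND l.param = 'Length'
WHERE ml.param = 'Metal Layer'
  AND ml.value = 'M3-0'
  AND CAST(w.value AS DECIMAL(20, 12)) BETWEEN 4e-08 AND 6e-08
ORDER BY CAST(c.value AS DECIMAL(20, 12)), CAST(l.value AS DECIMAL(20, 12));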
On balance, the only time I have used a structure like the one you suggest was in a context where random, unstructured additional data needed to be added on an ad-hoc basis. In my opinion it is a last-resort strategy, for when there is no way to make a more traditional table structure work.
A few things to consider here:
First, there is no single primary key. This can be overcome by making the primary key consist of two columns (as in the second example from Carl T).
Second, the Param column is repeated; to normalize this you should look at MGA's example.
Third, the "Metal layer" column is a string and not a float value like the others.
So best to go for a table def like this:
create table Param(
ID int primary key,
Name varchar(20)
);
create table yourTable(
ID int primary key,
ParamId int not null,
Width float,
Length float,
Space float,
Coupling float,
Metal_layer varchar(20),
Value varchar(20),
Foreign key(ParamId) references Param(ID)
);
The main question you have to ask when creating a table, especially one meant for future use, is how the data will be retrieved and what purpose it will serve. Personally, I always give the table a unique identifier, usually an ID column.
Looking at your list, you do not seem to have anything that uniquely defines the entries, so you will not be able to track duplicate entries or uniquely retrieve a record.
If you want to keep this design, you could create a composite primary key composed of the name and the param value.
CREATE TABLE testtest (
ID INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
Name VARCHAR(100) NOT NULL,
value DOUBLE NOT NULL
/*add other fields here*/
);
CREATE TABLE testtest (
name VARCHAR(100) NOT NULL,
value int NOT NULL,
/*add other fields here*/
primary key(name,value)
);
Those CREATE TABLE examples express the two options mentioned above.

Scalable one to many table (MySQL)

I have a MySQL database, and a particular table in that database will need to be self-referencing, in a one-to-many fashion. For scalability, I need to find the most efficient solution possible. The two ways most evident to me are:
1) Add a text field to the table, and store a serialized list of primary keys there
2) Keep a linker table, with each row being a one-to-one.
In case #1, I see the table growing very very wide (using a spatial analogy), but in case #2, I see the linker table growing to a very large number of rows, which would slow down lookups (by far the most common operation).
What's the most efficient manner in which to implement such a one-to-many relationship in MySQL? Or, perhaps, there is a much saner solution keeping the data all directly on the filesystem somehow, or else some other storage engine?
Just keep a table for the "many", with a key column for the primary table.
I guarantee you'll have lots of other, more important problems to solve before you run into efficiency or capacity constraints in a standard industrial-strength relational DBMS.
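A minimal sketch of that suggestion for a self-referencing one-to-many (table and column names are made up): the "many" side simply carries a key back to the "one" side.
CREATE TABLE item (
    item_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    parent_id INT UNSIGNED NULL,      -- NULL for rows that have no parent
    -- ... other columns ...
    INDEX (parent_id),
    FOREIGN KEY (parent_id) REFERENCES item(item_id)
) ENGINE=InnoDB;
-- lookups in either direction are simple index scans, e.g.
-- SELECT * FROM item WHERE parent_id = 42;   -- children of item 42 (42 is a placeholder)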
IMHO the most likely second option (with numerous alternative products) is to use an isam.
If you need to do deep/recursive traversals into the data, a graph database like Neo4j (where I'm on the team) is a good choice. You'll find some information in the article Should you go Beyond Relational Databases? and in this post at High Scalability. For a use case that may be similar to yours, read this thread on MetaFilter. For information on language bindings and other things you may also find the Neo4j wiki and mailing list useful.
Not so much an answer but a few questions and a possible approach....
If you want to make the table self referencing and only use one field ... there are some options. A calculated maskable 'join' field describes a way to associate many rows with each other.
The best solution will probably consider the nature of the data and relationships?
What is the nature of the data and lookups? What sort of relationship are you trying to contain? Association? Related? Parent/Children?
My first comment would be that you'll get better responses if you can describe how the data will be used (frequency of adds/updates vs lookups, adds vs updates, etc) in addition to what you've already described. That being said, my first thought would be to just go with a generic representation of
CREATE TABLE IF NOT EXISTS one_table (
    `one_id` INT UNSIGNED NOT NULL AUTO_INCREMENT
        COMMENT 'The ID of the items in the one table',
    -- ... other data ...
    PRIMARY KEY (`one_id`)
);
CREATE TABLE IF NOT EXISTS many_table (
    `many_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT
        COMMENT 'The ID of the items in the many table',
    `one_id` INT UNSIGNED NOT NULL
        COMMENT 'The ID of the item in the one table that this many item belongs to',
    -- ... other data ...
    PRIMARY KEY (`many_id`)
);
Making sure, of course, to create an index on the one_id in both tables.