I have a lot of spectra that I want to store in a database. A spectrum is basically an array of integers with, in my case, a variable length of typically 512 or 1024. How best to store these spectra? Along with the spectra I want to store some additional data like a time and a label, which will be simple fields in my database. The spectra will not be retrieved often, and if I need them, I need them as a whole.
For storing the spectra I can think of 2 possible solutions:
Storing them as a string, like "1,7,9,3,..."
Storing the spectra in a separate table, with each value in a separate row, containing fields like spectrum_id, index and value
Any suggestions on which one to use? Other solutions are much appreciated of course!
Your first solution is a common mistake when people transition from the procedural/OO programming mindset to the database mindset. It's all about efficiency, least number of records to fetch etc. The database world requires a different paradigm to store and retrieve data.
Here's how I'd do it: make 2 tables:
spectra
---------
spectra_id (primary key)
label
time
spectra_detail
---------
spectra_id
index
value
To retrieve them:
SELECT *
FROM spectra s
INNER JOIN spectra_detail sd ON s.spectra_id = sd.spectra_id
WHERE s.spectra_id = 42
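If it helps, here is roughly what inserting and reading a spectrum back looks like with that layout, using Python's sqlite3 module purely as a stand-in for whatever DBMS you actually use (the index column is renamed idx because INDEX is a reserved word in most SQL dialects):
import sqlite3

conn = sqlite3.connect("spectra.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS spectra (
    spectra_id INTEGER PRIMARY KEY, label TEXT, time TEXT);
CREATE TABLE IF NOT EXISTS spectra_detail (
    spectra_id INTEGER, idx INTEGER, value INTEGER);
""")

def save_spectrum(label, time, values):
    # One row in spectra, plus one row per sample in spectra_detail.
    cur = conn.execute("INSERT INTO spectra (label, time) VALUES (?, ?)", (label, time))
    sid = cur.lastrowid
    conn.executemany("INSERT INTO spectra_detail (spectra_id, idx, value) VALUES (?, ?, ?)",
                     [(sid, i, v) for i, v in enumerate(values)])
    conn.commit()
    return sid

def load_spectrum(sid):
    rows = conn.execute("SELECT value FROM spectra_detail WHERE spectra_id = ? ORDER BY idx",
                        (sid,))
    return [r[0] for r in rows]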
If you have a small dataset (hundreds of MB), there is no problem in using an SQL DBMS with any of the alternatives.
As proposed by Maciej, serialization is an improvement over the other alternative, since it lets you group each spectrum sweep into a single tuple (row in a table), reducing the overhead of keys and other per-row information.
For the serialization, you may consider using objects such as linestring or multipoint, so that you can better process the data with SQL functions. This will require some scaling of the values, but it allows querying the data, and if you use WKB you may also achieve a relevant gain in storage use with little loss in performance.
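To illustrate the linestring idea, the sketch below builds a WKT LINESTRING whose X coordinate is the sample index and whose Y is the (possibly scaled) value, then inserts it through MySQL's ST_GeomFromText. The spectra_geom table and the %s placeholder style of a MySQL DB-API driver are assumptions for the example, not part of the original answer:
def spectrum_to_wkt(values):
    # X = sample index, Y = sample value (apply any scaling here if needed).
    return "LINESTRING(" + ", ".join(f"{i} {v}" for i, v in enumerate(values)) + ")"

def save_spectrum(conn, spectrum_id, values):
    # Hypothetical table: CREATE TABLE spectra_geom (spectrum_id INT, sweep GEOMETRY)
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO spectra_geom (spectrum_id, sweep) VALUES (%s, ST_GeomFromText(%s))",
        (spectrum_id, spectrum_to_wkt(values)))
    conn.commit()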
The problem is that spectrum data tends to accumulate and storage usage may become a problem that will not be easily solved by the serialization trick. You should carefully consider this in your project.
Working on a similar problem, I came to the conclusion that it is a bad idea to use any SQL DBMS (MySQL, SQL Server, PostgreSQL, and such) to manage large numerical matrix data, such as spectrum sweep measurements. It's a bit like trying to create an image library CMS by storing images pixel by pixel in a database.
The following table presents a comparison between a few formats in my experiment. It may help in understanding the problem of using an SQL DBMS to store numerical data matrices.
Format         | Description                                             | Size
MySQL table    | Table with key - Int(10) - and value - decimal(4,1)     | 1 157 627 904 B
TXT            | CSV decimal(4,1), equivalent to 14 bit                  | 276 895 606 B
BIN (original) | Matrix, 1 byte x 51200 columns x 773 rows + metadata    | 40 038 580 B
HDF5           | Matrix, 3 bytes x 51200 columns x 773 rows + metadata   | 35 192 973 B
TXT + Zip      | CSV decimal(4,1) + standard zip compression             | 34 175 971 B
PNG RGBa       | Matrix, 4 bytes x 51200 columns x 773 rows              | 33 997 095 B
ZIP(BIN)       | Original BIN file compressed with standard zip          | 26 028 780 B
PNG 8b indexed | Matrix, 1 byte x 51200 columns x 773 rows + color scale | 25 947 324 B
The example using MySQL didn't use any serialization. I didn't try it, but one may expect a reduction to almost half of the occupied storage by using WKT linestrings or similar features. Even so, the storage used would be almost double that of the corresponding CSV and more than 20 times the size of a PNG 8b file with the same data.
These numbers are expected when you stop to think about how much extra data you are storing in terms of keys and search optimization when you use an SQL DBMS.
For closing remarks, I would suggest that you consider using PNG, TIFF, HDF5, or any other digital format that is more suitable for building your front end, to store the spectrum data (or any other large matrix), and perhaps using an SQL DBMS for the dimensions around this core data: who measured, when, with which equipment, to which end, etc. In short, have a BLOB within the database holding the files, or keep the files outside it, as better suits your system architecture.
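As a minimal sketch of the BLOB variant (sqlite3 purely as an example back end; the spectrum is packed as raw 32-bit integers in machine byte order with the standard array module):
import sqlite3
from array import array

conn = sqlite3.connect("spectra_blob.db")
conn.execute("""CREATE TABLE IF NOT EXISTS spectra (
    spectra_id INTEGER PRIMARY KEY, label TEXT, time TEXT, data BLOB)""")

def save_spectrum(label, time, values):
    # The whole sweep is packed into a single BLOB next to its metadata.
    blob = array("i", values).tobytes()
    cur = conn.execute("INSERT INTO spectra (label, time, data) VALUES (?, ?, ?)",
                       (label, time, blob))
    conn.commit()
    return cur.lastrowid

def load_spectrum(spectra_id):
    (blob,) = conn.execute("SELECT data FROM spectra WHERE spectra_id = ?",
                           (spectra_id,)).fetchone()
    values = array("i")
    values.frombytes(blob)
    return values.tolist()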
Alternatively, it is worthwhile to consider building a big-data solution around some digital format such as HDF5. Each tool to its end.
Related
I am creating an application that stores ECG data.
I want to eventually graph this data in react but for now I need help storing it.
The biggest problem is storing the data points that will go along the x and y axes on the graph. Along the bottom is time, and the y axis is the measured value. There are no limits, but as it's basically a heart rhythm most points will lie close to 0.
What is the best way to store the x and y data??
An example of the y data : [204.77, 216.86 … 3372.872]
The files that I will be getting this data from can contain millions of data points, depending on the sampling rate and the time the experiment took.
What is the best way to store this type of data in MySQL. I cannot use any other DB as they’re not installed on the server this will be hosted on.
Thanks
Well, as you said there are millions of points, so JSON is the best way to store these points.
The space required to store a JSON document is roughly the same as for LONGBLOB or LONGTEXT;
Please have a look into this -
https://dev.mysql.com/doc/refman/8.0/en/json.html
The JSON encoding of your sample data would take 7-8 bytes per reading. Multiply that by the number of readings you will get at a single time. There is a practical limit of 16MB for a string being fed to MySQL. That seems "too big".
A workaround is to break the list into, say, 1K points at a time. Then there would be several rows, each row being manageable. There would be virtually no limit on the number of points you could store.
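A sketch of that workaround; the ecg_chunk(recording_id, seq, points JSON) table and the %s placeholders of a MySQL DB-API driver are assumptions here:
import json

CHUNK = 1000  # readings per database row

def store_readings(conn, recording_id, readings):
    # One row per 1K-point slice: (recording id, chunk sequence number, JSON array).
    rows = [(recording_id, seq, json.dumps(readings[i:i + CHUNK]))
            for seq, i in enumerate(range(0, len(readings), CHUNK))]
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO ecg_chunk (recording_id, seq, points) VALUES (%s, %s, %s)",
        rows)
    conn.commit()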
FLOAT is 4 bytes, but you would need a separate row for each reading, so figure about 25 bytes per row including overhead. Size is not a problem; however, two other problems could arise: 7 digits is about the limit of precision for FLOAT, and fetching a million rows will not be very fast.
DOUBLE is 8 bytes, 16 digits of precision.
DECIMAL(6,2) is 3 bytes and overflows above 9999.99.
Considering that a computer monitor has less than 4 digits of precision (4K pixels < 10^4), I would argue for FLOAT as "good enough".
Another option is to take the JSON string, compress it, then store that in a LONGBLOB. The compression will give you an average of about 2.5 bytes per reading and the row for one complete reading will be a few megabytes.
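For example (zlib here; the ~2.5 bytes per reading is the answer's estimate and the actual ratio will depend on the data):
import json, zlib

def pack_readings(readings):
    # JSON-encode the whole list, then deflate it for storage in a LONGBLOB.
    return zlib.compress(json.dumps(readings).encode("ascii"))

def unpack_readings(blob):
    return json.loads(zlib.decompress(blob))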
I have experienced difficulty INSERTing a row bigger than 1MB. Changing a setting let me go to 16MB; I have not tried any row bigger than that. If you run into trouble there, start a new question on just that topic. I will probably come along and explain how to chunk up the data, thereby even allowing a "row" spread over multiple database rows that could effectively be bigger than 4GB. That is the hard limit for JSON, LONGTEXT and LONGBLOB.
You did not mention the X values. Are you assuming that they are evenly spaced? If you need to provide X,Y pairs, the computations above get a bit messier, but I have provided some of the data for analysis.
I have a datamodel question for a GPS tracking app. When someone uses our app it will save latitude, longitude, current speed, timestamp and burned_calories every 5 seconds. When a workout is completed, the average speed, total time/distance and burned calories of the workout will be stored in a database. So far so good..
What we want is to also store the data that is saved those every 5 seconds, so we can utilize this later on to plot graphs/charts of a workout for example.
How should we store this amount of data in a database? A single workout can contain 720 rows if someone runs for an hour. Perhaps a serialised/gzcompressed data array in a single row, though I'm aware that this is bad practice...
Would a relational one-to-many/many-to-many model be undoable? I know MySQL can easily handle large amounts of data, but we are talking about 720 * workouts
twice a week * 7000 users = over 10 million rows a week.
(Of course we could store the data only every 10 seconds to halve the number of rows, or every 20 seconds, etc., but it would still be a large amount of data over time, and the accuracy of the graphs would decrease.)
How would you do this?
Thanks in advance for your input!
Just some ideas:
Quantize your lat/lon data. I believe that for technical reasons, the data most likely will be quantized already, so if you can detect that quantization, you might use it. The idea here is to turn double numbers into reasonable integers. In the worst case, you may quantize to the precision double numbers provide, which means using 64 bit integers, but I very much doubt your data is even close to that resolution. Perhaps a simple grid with about one meter edge length is enough for you?
Compute differences. Most numbers will be fairly large in terms of absolute values, but also very close together (unless your members run around half the world…). So this will result in rather small numbers. Furthermore, as long as people run with constant speed into a constant direction, you will quite often see the same differences. The coarser your spatial grid in step 1, the more likely you get exactly the same differences here.
Compute a Huffman code for these differences. You might try encoding lat and long movement separately, or computing a single code with 2d displacement vectors at its leaves. Try both and compare the results.
Store the result in a BLOB, together with the dictionary to decode your Huffman code, and the initial position so you can return data to absolute coordinates.
The result should be a fairly small set of data for each data set, which you can retrieve and decompress as a whole. Retrieving individual parts from the database is not possible, but it sounds like you wouldn't be needing that.
The benefit of Huffman coding over gzip is that you won't have to artificially introduce an intermediate byte stream. Directly encoding the actual differences you encounter, with their individual properties, should work much better.
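A rough sketch of steps 1-3 above; the grid size, the tuple layout, and the omission of bit-packing are all simplifications for illustration:
import heapq
from collections import Counter
from itertools import count

GRID = 1e-5  # assumed quantization step, roughly one metre of latitude

def quantize_and_diff(samples):
    # Steps 1 and 2: snap (lat, lon) pairs to an integer grid, keep only differences.
    q = [(round(lat / GRID), round(lon / GRID)) for lat, lon in samples]
    diffs = [(a2 - a1, b2 - b1) for (a1, b1), (a2, b2) in zip(q, q[1:])]
    return q[0], diffs

def huffman_code(symbols):
    # Step 3: classic Huffman construction over the 2d displacement vectors.
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode(samples):
    start, diffs = quantize_and_diff(samples)
    code = huffman_code(diffs)
    bits = "".join(code[d] for d in diffs)
    # The start point, the code table and the (byte-packed) bits go into the BLOB.
    return start, code, bits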
The general idea of the problem is that the data is arranged in the following three columns in a table:
"Entity" "parent entity" "value"
A001 B001 .10
A001 B002 .15
A001 B003 .2
A001 B004 .3
A002 B002 .34
A002 B003 .13
..
..
..
A002 B111 .56
There is a graph of entities, and the values can be seen as the weights of directed edges from parent entity to entity. I have to calculate how many different subsets of the parent entities of a particular entity have a sum greater than .5 (say), in order to then calculate something further (that later part is easy, not computationally complex).
The point is that the data is huge (Excel says data was lost :( ). Which language or tool can I use? Some people have suggested SAS or Stata.
Thanks in advance
You can do this in SQL. Two options for the desktop (without having to install a SQL server of some kind) are MS Access or OpenOffice Database. Both can read CSV files into a database.
In there, you can run SQL queries. The syntax is a bit odd but this should get you started:
select ParentEntity, sum(Value)
from Data
group by ParentEntity
having sum(Value) > .5
Data is the name of the table into which you loaded the data; ParentEntity and Value are the names of columns in the Data table.
If you're considering SAS you could take a look at R, a free language / environment used for data mining.
I'm guessing that the table you refer to is actually in a file, and that the file is too big for Excel to handle. I'd suggest that you use a language that you know well. Of those you know, select the one with these characteristics:
-- able to read files line by line;
-- supports data structures of the type that you want to use in memory;
-- has good maths facilities.
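For example, in Python the file can be streamed line by line and aggregated per entity without ever loading the whole thing into memory; the file name and column order below are taken from the sample data and are otherwise assumptions (the subset enumeration the question ultimately asks for would be built on top of this):
import csv
from collections import defaultdict

weights = defaultdict(list)  # entity -> weights of edges from its parent entities
with open("edges.csv", newline="") as f:
    for entity, parent, value in csv.reader(f):
        weights[entity].append(float(value))

# Simplest aggregate: which entities have parent-edge weights summing above .5
flagged = {e for e, vs in weights.items() if sum(vs) > 0.5}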
SAS is an excellent language for quickly processing huge datasets (hundreds of millions of records in which each record has hundreds of variables). It is used in academia and in many industries (we use it for warranty claims analysis; many clinical trials use it for statistical analysis & reporting).
However, there are some caveats: the language has several deficiencies in my opinion which make it difficult to write modular, reusable code (there is a very rich macro facility, but no user-defined functions until version 9.2). Probably a bigger caveat is that a SAS license is very expensive; thus, it probably wouldn't be practical for a single individual to purchase a license for their own experimentation, though the price of a license may not be prohibitive to a large company. Still, I believe SAS sells a learning edition, which is likely less expensive.
If you're interested in learning SAS, here are some excellent resources:
Official SAS documentation: http://support.sas.com/documentation/onlinedoc/base/index.html
SAS white papers / conference proceedings: http://support.sas.com/events/sasglobalforum/previous/online.html
SAS-L newsgroup (much, much more activity regarding SAS questions than here on Stack Overflow): http://listserv.uga.edu/cgi-bin/wa?A0=sas-l&D=0
There are also regional and local SAS users groups, from which you can learn a lot (for example in my area there is a MWSUG (Midwest SAS Users Group) and MISUG (Michigan SAS User's Group)).
If you don't mind really getting into a language and using some operating system specific calls, C with memory-mapped files is very fast.
You would first need to write a converter that would translate the textual data that you have into a memory map file, and then a second program that maps the file into memory and scans through the data.
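The answer is about C, but purely to show the shape of the approach, here is a rough Python equivalent using the standard mmap and struct modules, assuming the converter wrote fixed-size records of two 32-bit ids plus a double:
import mmap, struct

REC = struct.Struct("<iid")  # entity id, parent id, value: 16 bytes per record

def scan(path):
    # Map the whole file and walk it record by record without copying it.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for off in range(0, len(mm), REC.size):
            yield REC.unpack_from(mm, off)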
I hate to do this, but I would recommend simply C. What you need is to first figure out your problem in the language of math, then implement it in C. How to store a graph in memory is a large research area: you could use an adjacency matrix if the graph is dense (highly connected), or an adjacency list if it is not. Each of the subtree searches will be some fancy code, and it might be a hard problem.
As others have said, SQL can do it, and the code has even been posted. If you need help putting the data from a text file into a SQL database, that's a different question. Look up bulk data inserts.
The problem with SQL is that even though it is a wonderfully succinct language, it is parsed by the database engine and the underlying code might not be the best method. For most data access routines, the SQL database engine will produce some amazing code efficiencies, but for graphs and very large computations like this, I would not trust it. That's why you go to C. Some lower level language that makes you do it yourself will be the most efficient.
I assume you will need efficient code due to the bulk of the data.
All of this assumes the dataset fits into memory. If your graph is larger than your workstation's ram, (get one with 24GB if you can), then you should find a way to partition the data such that it does fit.
Mathematica is quite good in my experience...
Perl would be a good place to start; it is very efficient at handling file input and string parsing. You could then hold the whole set in memory, or only the subsets.
SQL is a good option. Database servers are designed to manage huge amounts of data and are optimized to use every resource available on the machine efficiently to gain performance.
Notably, Oracle 10 is optimized for multi-processor machines, automatically splitting requests across processors if possible (with the correct configuration, search for "oracle request parallelization" on your favorite search engine).
This solution is particularly efficient if you are in a big organization with good database servers already available.
I would use Java's BigInteger library and something functional, like say Hadoop.
At least a simple SQL statement won't work (please read the problem carefully). I need to find the sum of all subsets and check whether the sum of the elements of each set exceeds .5 or not. Thanks. – asin Aug 18 at 7:36
Since your data is in Stata, here is the code to do what you ask in Stata (paste this code into your do-file editor):
//input the data
clear
input str10 entity str10 parent_entity value
A001 B001 .10
A001 B002 .15
A001 B003 .2
A001 B004 .3
A002 B002 .34
A002 B003 .13
A002 B111 .56
end
//create a var. for sum of all subsets
bysort entity : egen sum_subset = total(value)
//flag the sets that sum > .5
bysort entity : gen indicator = 1 if sum_subset>.5
recode ind (.=0)
lab def yn 1 "YES", modify
lab def yn 0 "No", modify
lab val indicator yn
li *, clean
Keep in mind that when using Stata, your data is kept in memory, so you are limited only by your system's memory resources. If you try to open your .dta file and it says 'op. sys refuses to provide mem', then you need to use the -set mem- command to increase the memory available for the data.
Ultimately, StefanWoe's question:
Can you give us an idea of HOW huge the data set is? Millions? Billions of records? Also an important question: do you have to do this only once? Or every day in the future? Or hundreds of times each hour? – StefanWoe Aug 18 at 13:15
This really drives your question more than which software to use... Automating this using Stata, even on an immense amount of data, wouldn't be difficult, but you could max out your resource limits quickly.
We know the MS Access database engine is 'throttled' to allow a maximum file size of 2GB (or perhaps internally wired to be limited to fewer than some power of 2 of 4KB data pages). But what does this mean in practical terms?
To help me measure this, can you tell me the maximum number of rows that can be inserted into a MS Access database engine table?
To satisfy the definition of a table, all rows must be unique, therefore a unique constraint (e.g. PRIMARY KEY, UNIQUE, CHECK, Data Macro, etc) is a requirement.
EDIT: I realize there is a theoretical limit but what I am interested in is the practical (and not necessarily practicable), real life limit.
Some comments:
Jet/ACE files are organized in data pages, which means there is a certain amount of slack space when your record boundaries are not aligned with your data pages.
Row-level locking will greatly reduce the number of possible records, since it forces one record per data page.
In Jet 4, the data page size was increased to 4KBs (from 2KBs in Jet 3.x). As Jet 4 was the first Jet version to support Unicode, this meant that you could store 1GB of double-byte data (i.e., 1,000,000,000 double-byte characters), and with Unicode compression turned on, 2GBs of data. So, the number of records is going to be affected by whether or not you have Unicode compression on.
Since we don't know how much room in a Jet/ACE file is taken up by headers and other metadata, nor precisely how much room index storage takes, the theoretical calculation is always going to be under what is practical.
To get the most efficient possible storage, you'd want to use code to create your database rather than the Access UI, because Access creates certain properties that pure Jet does not need. Not that there are a lot of these: properties set to the Access defaults are usually not set at all (a property is created only when you change it from its default value, which you can see by cycling through a field's properties collection; many of the properties listed for a field in the Access table designer are not there in the properties collection because they haven't been set). But you might want to limit yourself to Jet-specific data types (hyperlink fields are Access-only, for instance).
I just wasted an hour mucking around with this, using Rnd() to populate 4 fields defined as type Byte, with a composite PK on the four fields, and it took forever to append enough records to get up to any significant portion of 2GBs. At over 2 million records, the file was under 80MBs. I finally quit after reaching 7 million records, and the file compacted to 184MBs. The amount of time it would take to get up near 2GBs is just more than I'm willing to invest!
Here's my attempt:
I created a single-column (INTEGER) table with no key:
CREATE TABLE a (a INTEGER NOT NULL);
Inserted integers in sequence starting at 1.
I stopped it (arbitrarily after many hours) when it had inserted 65,632,875 rows.
The file size was 1,029,772 KB.
I compacted the file which reduced it very slightly to 1,029,704 KB.
I added a PK:
ALTER TABLE a ADD CONSTRAINT p PRIMARY KEY (a);
which increased the file size to 1,467,708 KB.
This suggests the maximum is somewhere around the 80 million mark.
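For reference, a naive linear extrapolation from those figures puts the absolute ceiling a bit above 90 million rows; the practical limit will be lower once page slack and other overhead are accounted for, which is consistent with "around the 80 million mark":
rows = 65_632_875            # rows inserted before stopping
size_kb = 1_467_708          # file size in KB with the primary key added
limit_kb = 2 * 1024 * 1024   # the 2GB file size limit, in KB
print(rows * limit_kb // size_kb)  # roughly 94 million rows, before overhead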
As others have stated, it's a combination of your schema and the number of indexes.
A friend had about 100,000,000 historical stock prices, daily closing quotes, in an MDB which approached the 2 GB limit.
He pulled them down using some code found in a Microsoft Knowledge base article. I was rather surprised that whatever server he was using didn't cut him off after the first 100K records.
He could view any record in under a second.
It's been some years since I last worked with Access but larger database files always used to have more problems and be more prone to corruption than smaller files.
Unless the database file is only being accessed by one person or stored on a robust network you may find this is a problem before the 2GB database size limit is reached.
We're not necessarily talking theoretical limits here, we're talking about real world limits of the 2GB max file size AND database schema.
Is your db a single table or multiple?
How many columns does each table have?
What are the datatypes?
The schema is on even footing with the row count in determining how many rows you can have.
We have used Access MDBs to store exports of MS-SQL data for statistical analysis by some of our corporate users. In those cases we've exported our core table structure, typically four tables with 20 to 150 columns varying from a hundred bytes per row to upwards of 8000 bytes per row. In these cases, only a few hundred thousand rows of data were permissible PER MDB that we would ship to them.
So, I just don't think that this question has an answer in absence of your schema.
It all depends. Theoretically, using a single column with a 4-byte data type you could store 300 000 rows. But there is probably a lot of overhead in the database even before you do anything. I read somewhere that you could have 1 000 000 rows, but again, it all depends.
You can also link databases together, limited only by disk space.
Practical = 'useful in practice' - so the best you're going to get is anecdotal. Everything else is just prototyping and testing results.
I agree with others - determining 'a max quantity of records' is completely dependent on schema - # tables, # fields, # indexes.
Another anecdote for you: I recently hit 1.6GB file size with 2 primary data stores (tables), of 36 and 85 fields respectively, with some subset copies in 3 additional tables.
Whether the data is unique or not is only material if the context says it is. Data is data is data, unless duplication affects handling by the indexer.
The total row count making up that 1.6GB is 1.72M.
When working with 4 large DB2 tables, I not only found the limit but it caused me to look really bad to a boss who thought that I could append all four tables (each with over 900,000 rows) into one large table. The real-life result was that, regardless of how many times I tried, the table (which had exactly 34 columns: 30 text and 3 integer) would spit out some cryptic message: "Cannot open database: unrecognized format, or the file may be corrupted". Bottom line: less than 1,500,000 records, and just a bit more than 1,252,000, with 34 columns.
I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down - that's a watchdog-like security measure. Processing files is likely to take long so I need to design the application capable of handling this scenario. If my application gets shut down next time the search system wants to index the same file it will likely give it a different temporary name.
The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue the request to the backend and wait for the result to arrive. If the request times out in the intermediate layer - no problem, the backend will continue working, only the intermediate layer is restarted and it can retrieve the result from the backend when the request is later repeated by the search system.
The problem is how to identify the files. Their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimation from the linked article to compute the probability. If I assume I have no more than 100 000 files, the probability of two files having the same MD5 (128 bit) is about 1.47x10^-29.
Should I care of such collision probability or just assume that equal hash values mean equal file contents?
Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions. (This could be the case if they are downloading stuff from the internet.) If that is the case, go for a SHA-2 based function.
There are no accidental MD5 collisions; 1.47x10^-29 is a really, really, really small number.
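That figure is easy to sanity-check with the usual birthday approximation p ≈ n(n-1)/2 / 2^128:
n = 100_000                     # number of files
p = n * (n - 1) / 2 / 2 ** 128  # probability of at least one MD5 collision (small-p approximation)
print(p)                        # about 1.47e-29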
To overcome the issue of rehashing big files, I would use a 3-phase identity scheme:
Filesize alone
Filesize + a hash of 4 x 64K chunks at different positions in the file
A full hash
So if you see a file with a new size you know for certain you do not have a duplicate. And so on.
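A minimal sketch of that scheme, assuming the 4 x 64K sampling suggested above and leaving the hash algorithm pluggable; in practice you would only compute the later phases when the earlier ones match an existing file:
import hashlib, os

CHUNK = 64 * 1024

def partial_hash(path, algo=hashlib.md5):
    # Phase 2: hash four 64K chunks spread across the file.
    size = os.path.getsize(path)
    h = algo()
    with open(path, "rb") as f:
        for pos in (0, size // 3, 2 * size // 3, max(size - CHUNK, 0)):
            f.seek(pos)
            h.update(f.read(CHUNK))
    return h.hexdigest()

def full_hash(path, algo=hashlib.md5):
    # Phase 3: hash the entire file, streaming it in chunks.
    h = algo()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()

def identity(path):
    # Phase 1 is just the size; compare cheapest first, compute the rest only on a match.
    return os.path.getsize(path), partial_hash(path), full_hash(path)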
Just because the probability is 1/X it does not mean that it won't happen to you until you have X records. It's like the lottery, you're not likely to win, but somebody out there will win.
With the speed and capacity of computers these days (not even talking about security, just reliability) there is really no reason not to use a bigger/better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious then go to SHA-256 and never think about it again.
If performance is truly an issue then use BLAKE2, which is actually faster than MD5 but supports 256+ bits to make collisions less likely while giving the same or better performance. However, while BLAKE2 has been well adopted, it probably would require adding a new dependency to your project.
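In Python, for instance, BLAKE2 has shipped in hashlib since 3.6, so there it isn't even an extra dependency:
import hashlib

def blake2_digest(path, chunk=1 << 20):
    # 256-bit BLAKE2b digest, streaming the file in 1MB chunks.
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()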
I think you shouldn't.
However, you should if you have the notion of two equal files being distinct (by their real names, not by MD5-based identity). For example, in a search system two documents might have exactly the same content but still be distinct because they're located in different places.
I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter

def colltest(exp):
    # Draw random values from a keyspace of size 2**exp until the first
    # collision, then return the (base-2) order of magnitude of draws made.
    uniques = set()
    while True:
        r = randint(0, 2**exp)
        if r in uniques:
            return int(log(len(uniques) + 1, 2))
        uniques.add(r)

# Run 1000 trials on a 20-bit keyspace and tally the results.
for k, v in sorted(Counter(colltest(20) for i in range(1000)).items()):
    print(k, "hash orders of magnitude events before collision:", v)
would print something like:
5 hash orders of magnitude events before collision: 1
6 hash orders of magnitude events before collision: 5
7 hash orders of magnitude events before collision: 21
8 hash orders of magnitude events before collision: 91
9 hash orders of magnitude events before collision: 274
10 hash orders of magnitude events before collision: 469
11 hash orders of magnitude events before collision: 138
12 hash orders of magnitude events before collision: 1
I had heard the formula before: If you need to store log(x/2) keys, use a hashing function that has at least keyspace e**(x).
Repeated experiments show that for a population of 1000 log-20 spaces, you sometimes get a collision as early as log(x/4).
For uuid4, which has 122 random bits, that means I sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak transactions in the system I am thinking about are roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, given that extreme paranoia.
Here's an interactive calculator that lets you estimate probability of collision for any hash size and number of objects - http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/