Best practice for storing GPS data of a tracking app in a MySQL database

I have a data-model question for a GPS tracking app. When someone uses our app, it will save latitude, longitude, current speed, timestamp and burned_calories every 5 seconds. When a workout is completed, the average speed, total time/distance and burned calories of the workout will be stored in a database. So far so good.
What we want is to also store the data that is saved every 5 seconds, so we can use it later on, for example to plot graphs/charts of a workout.
How should we store this amount of data in a database? A single workout can contain 720 rows if someone runs for an hour. Perhaps a serialised/gzcompressed data array in a single row? I'm aware, though, that this is bad practice.
Would a relational one-to-many/many-to-many model be out of the question? I know MySQL can easily handle large amounts of data, but we are talking about 720 rows per workout * two workouts a week * 7000 users = over 10 million rows a week.
(Of course we could store the data only every 10 seconds to halve the number of rows, or every 20 seconds, etc., but it would still be a large amount of data over time, and the accuracy of the graphs would decrease.)
How would you do this?
Thanks in advance for your input!

Just some ideas:
Quantize your lat/lon data. I believe that for technical reasons, the data most likely will be quantized already, so if you can detect that quantization, you might use it. The idea here is to turn double numbers into reasonable integers. In the worst case, you may quantize to the precision double numbers provide, which means using 64 bit integers, but I very much doubt your data is even close to that resolution. Perhaps a simple grid with about one meter edge length is enough for you?
Compute differences. Most numbers will be fairly large in terms of absolute values, but also very close together (unless your members run around half the world…), so the differences will be rather small numbers. Furthermore, as long as people run at a constant speed in a constant direction, you will quite often see exactly the same differences. The coarser your spatial grid in step 1, the more likely you are to get exactly the same differences here.
Compute a Huffman code for these differences. You might try encoding lat and long movement separately, or computing a single code with 2d displacement vectors at its leaves. Try both and compare the results.
Store the result in a BLOB, together with the dictionary to decode your Huffman code, and the initial position so you can return data to absolute coordinates.
The result should be a fairly small set of data for each data set, which you can retrieve and decompress as a whole. Retrieving individual parts from the database is not possible, but it sounds like you wouldn't be needing that.
The benefit of Huffman coding over gzip is that you won't have to artificially introduce an intermediate byte stream. Directly encoding the actual differences you encounter, with their individual properties, should work much better.
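To make the idea concrete, here is a minimal Python sketch of that quantize / delta / Huffman pipeline. The roughly one-metre grid constant, the toy track and all function names are illustrative assumptions rather than anything from the question, and a real implementation would pack the bit string into bytes before writing the BLOB.

import heapq
from collections import Counter

GRID = 1e-5  # roughly one metre of latitude; an assumed grid size, tune as needed

def quantize(track):
    # Turn (lat, lon) doubles into integer grid coordinates.
    return [(round(lat / GRID), round(lon / GRID)) for lat, lon in track]

def deltas(points):
    # Replace absolute positions with per-sample displacement vectors.
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(points, points[1:])]

def huffman_code(symbols):
    # Build a prefix code (symbol -> bit string) from symbol frequencies.
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: perfectly constant movement
        return {next(iter(freq)): "0"}
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    codes = {sym: "" for sym in freq}
    tiebreak = len(heap)
    while len(heap) > 1:
        c1, _, group1 = heapq.heappop(heap)
        c2, _, group2 = heapq.heappop(heap)
        for s in group1:
            codes[s] = "0" + codes[s]   # everything in the first group gets a leading 0
        for s in group2:
            codes[s] = "1" + codes[s]   # everything in the second group gets a leading 1
        heapq.heappush(heap, (c1 + c2, tiebreak, group1 + group2))
        tiebreak += 1
    return codes

track = [(52.370216, 4.895168), (52.370301, 4.895240), (52.370388, 4.895311)]  # toy data
moves = deltas(quantize(track))
codes = huffman_code(moves)
bits = "".join(codes[m] for m in moves)
# Store the packed bits, the code table and the first absolute point in a BLOB;
# decoding walks the bit string with the inverted code table and re-adds the deltas.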

Store large chart data points in MySQL

I am creating an application that stores ECG data.
I want to eventually graph this data in react but for now I need help storing it.
The biggest problem is storing the data points that will go along the x and y axes of the graph. Along the bottom is time, and the y axis will be some measured value. There are no fixed limits, but as it's basically a heart rhythm most points will lie close to 0.
What is the best way to store the x and y data?
An example of the y data : [204.77, 216.86 … 3372.872]
The files that I will be getting this data from can contain millions of data points, depending on the sampling rate and the time the experiment took.
What is the best way to store this type of data in MySQL? I cannot use any other DB, as none are installed on the server this will be hosted on.
Thanks
Well, as you said there are millions of points, so JSON is the best way to store them. The space required to store a JSON document is roughly the same as for LONGBLOB or LONGTEXT.
Please have a look at this: https://dev.mysql.com/doc/refman/8.0/en/json.html
The JSON encoding of your sample data would take 7-8 bytes per reading. Multiply that by the number of readings you will get at a single time. There is a practical limit of 16MB for a string being fed to MySQL. That seems "too big".
A workaround is to break the list into, say, 1K points at a time. Then there would be several rows, each row being manageable. There would be virtually no limit on the number of points you could store.
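As a sketch of that chunking workaround (not code from the answer): assuming a table like ecg_chunk(recording_id INT, seq INT, points JSON) and the MySQL Connector/Python driver, the splitting could look roughly like this; every table, column and connection name here is a placeholder.

import json
import mysql.connector  # assumes MySQL Connector/Python is installed

def store_readings(conn, recording_id, readings, chunk_size=1000):
    # Split a long list of y-values into JSON chunks of chunk_size points each.
    cur = conn.cursor()
    rows = [
        (recording_id, seq, json.dumps(readings[i:i + chunk_size]))
        for seq, i in enumerate(range(0, len(readings), chunk_size))
    ]
    cur.executemany(
        "INSERT INTO ecg_chunk (recording_id, seq, points) VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()

# Hypothetical usage; connection details are placeholders.
conn = mysql.connector.connect(user="app", password="...", database="ecg")
store_readings(conn, recording_id=1, readings=[204.77, 216.86, 3372.872])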
FLOAT is 4 bytes, but you would need a separate row for each reading, so figure about 25 bytes per row (including overhead). Size is not a problem, but two other problems could arise: 7 digits is about the limit of precision for FLOAT, and fetching a million rows will not be very fast.
DOUBLE is 8 bytes, 16 digits of precision.
DECIMAL(6,2) is 3 bytes and overflows above 9999.99.
Considering that a computer monitor has less than 4 digits of precision (4K pixels < 10^4), I would argue for FLOAT as "good enough".
Another option is to take the JSON string, compress it, then store that in a LONGBLOB. The compression will give you an average of about 2.5 bytes per reading, and the row for one complete set of readings will be a few megabytes.
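A tiny sketch of that compress-then-store idea, with zlib standing in for whatever compressor you prefer (the ~2.5 bytes per reading figure is the answer's estimate, not something this snippet guarantees):

import json, zlib

readings = [204.77, 216.86, 3372.872]  # placeholder data
blob = zlib.compress(json.dumps(readings).encode("utf-8"))
# Write blob into a LONGBLOB column; to read it back:
restored = json.loads(zlib.decompress(blob).decode("utf-8"))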
I have experienced difficulty INSERTing a row bigger than 1MB. Changing a setting let me go to 16MB; I have not tried any row bigger than that. If you run into trouble there, start a new question on just that topic. I will probably come along and explain how to chunk up the data, thereby even allowing a "row" spread over multiple database rows that could effectively be even bigger than 4GB. That is the 'hard' limit for JSON, LONGTEXT and LONGBLOB.
You did not mention the X values. Are you assuming that they are evenly spaced? If you need to provide X,Y pairs, the computations above get a bit messier, but I have provided some of the data for analysis.

gnuRadio Dual Tone detection

I am trying to come up with an efficient way to characterize two narrowband tones separated by about 900 kHz (one at around 100 kHz and one at around 1 MHz once translated to baseband). They don't move much in frequency over time, but they may have amplitude variations we want to monitor.
Each tone is roughly 100 Hz wide, and we are required to characterize these two beasts over long periods of time down to a resolution of about 0.1 Hz. The samples are coming in at over 2 MSamples/sec (TBD) to adequately acquire the higher tone.
I'm trying to avoid (if possible) doing brute-force >2 MSample FFTs on the data once a second to extract frequency-domain data. Is there an efficient approach? Something akin to performing two (much) smaller FFTs around the bands of interest? I've looked at Goertzel and chirp-Z methods, but I am not certain they help save processing.
"Something akin to performing two (much) smaller FFTs around the bands of interest"
There is, it's called Goertzel, and is kind of the FFT for single bins, and you already have looked at it. It will save you CPU time.
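For reference, a plain-Python sketch of Goertzel evaluated at a single frequency; the sample rate, block length and test tones below are made up for illustration (real-valued input is assumed), and numpy is used only for convenience.

import numpy as np

def goertzel_power(samples, sample_rate, target_freq):
    # Power of the DFT bin nearest to target_freq, without computing a full FFT.
    n = len(samples)
    k = round(n * target_freq / sample_rate)   # nearest bin index
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:                          # standard Goertzel recurrence
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2

# Made-up test signal: tones at 100 kHz and 1 MHz, sampled at 2.5 MS/s.
fs = 2_500_000
t = np.arange(25_000) / fs
sig = np.sin(2 * np.pi * 100e3 * t) + 0.5 * np.sin(2 * np.pi * 1e6 * t)
print(goertzel_power(sig, fs, 100e3), goertzel_power(sig, fs, 1e6))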
Anyway, there's no reason to do a 2M-point FFT. To merely separate the two tones you only need a resolution of about 1/20 of the sampling rate (they sit roughly 900 kHz apart, so ~100 kHz bins already put them in different bins), hence a 20-point FFT would totally do and should be pretty doable for your CPU at these low rates; since you don't seem to care about the phase of your tones: FFT -> complex_to_mag.
However, there's one thing that you should always do: look at your signal of interest, and decimate down to the rate that fits exactly that. Since GNU Radio's filters are implemented cleverly, the filter itself will only run at the decimated rate, and you can spend the CPU cycles saved on a better filter.
Because a direct decimation from 2 MHz to 100 Hz (a decimation factor of 20000) would need a really ugly filter length, you should do this in multiple rate-reduction stages:
I'd try first decimating by 100, and then in a second step by 100, leaving you with 200Hz observable spectrum. The xlating fir filter blocks will let you use a simple low-pass filter (use the "Low-Pass Filter Taps" block to define a variable that contains such taps) as a band-selector.

mysql getting rid of redundant values

I am creating a database to store data from a monitoring system that I have created. The system takes a bunch of data points (~4000) a couple of times every minute and stores them in my database. I need to be able to downsample based on the timestamp. Right now I am planning on using one table with three columns:
results:
1. point_id
2. timestamp
3. value
so the query I'd like to do would be:
SELECT point_id,
MAX(value) AS value
FROM results
WHERE timestamp BETWEEN date1 AND date2
GROUP BY point_id;
The problem I am running into is that this seems super inefficient with respect to storage. Using this structure, each timestamp would have to be recorded 4000 times, which seems a bit excessive to me. The only solutions I have thought of that reduce the footprint of my database require me either to use separate tables (which to my understanding is super bad practice) or to store the data in CSV files, which would require me to write my own code to search through the data (which to my understanding requires me not to be a bum... and would probably be substantially slower to search). Is there a database structure I could implement that doesn't require me to store so much duplicate data?
A database with your data structure is going to be less efficient than custom code. Guess what: that is not unusual.
First, though, I think you should wait until this is actually a performance problem. A timestamp with no fractional seconds requires 4 bytes (see here), so a record would have, say, 4 + 4 + 8 = 16 bytes (assuming a double-precision floating-point representation for value). By removing the timestamp you would get down to 12 bytes -- a savings of 25%. I'm not saying that is unimportant. I am saying that other considerations -- such as getting the code to work -- might be more important.
Based on your data, the difference is between 184 MB/day and 138 MB/day, or 67 GB/year and 50 GB/year. Either way, you are going to have to deal with biggish-data issues regardless of how you store the timestamp.
Keeping the timestamp in the data will allow you other optimizations, notably the use of partitions to store each day in a separate file. This should be a big benefit for your queries, assuming the where conditions are partition-compatible. (Learn about partitioning here.) You may also need indexes, although partitions should be sufficient for your particular query example.
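For illustration only, here is a hedged sketch of that day-per-partition layout, sent through MySQL Connector/Python. The table follows the question's three columns; the partition names, boundary dates and connection details are placeholders, and in practice you would add or reorganize partitions as days roll over.

import mysql.connector  # assumes MySQL Connector/Python is installed

ddl = """
CREATE TABLE results (
    point_id    INT       NOT NULL,
    `timestamp` TIMESTAMP NOT NULL,
    value       DOUBLE    NOT NULL
)
PARTITION BY RANGE (UNIX_TIMESTAMP(`timestamp`)) (
    PARTITION p20240101 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-02 00:00:00')),
    PARTITION p20240102 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-03 00:00:00')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(user="monitor", password="...", database="metrics")
conn.cursor().execute(ddl)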
The point of SQL is not that it is the most optimal way to solve any given problem. Instead, it offers a reasonable solution to a very wide range of problems, and it offers many different capabilities that would be difficult to implement individually. So, the time to a reasonable solution is much, much less than developing bespoke code.
"Using this structure each time stamp would have to be recorded 4000 times, which seems a bit excessive to me."
Not really. Date values are not that big and storing the same value for each row is perfectly reasonable.
"...use separate tables (which to my understanding is super bad practice)"
Who told you that?! Normalising data (splitting it into separate, linked structures) is actually good practice - so long as you don't overdo it - and SQL is designed to perform well with relational tables. It would be perfectly fine to create a "time" table and link to the data in the other table. It would use a little more space, but that really shouldn't concern you unless you are working in a very limited environment.

Filter records from a database on a minimum time interval for making graph

We have a MySQL database table with statistical data that we want to present as a graph, with timestamp used as the x axis. We want to be able to zoom in and out of the graph between resolutions of, say, 1 day and 2 years.
In the zoomed-out state, we will not want to fetch all data from the table, since that would mean too much data being shipped through the servers, and the graph resolution will be good enough with less data anyway.
In MySQL you can write queries that only select, e.g., every tenth value, which could be usable in this case. However, the intervals between values stored in the database aren't consistent; two values can be separated by as little as 10 minutes or as much as 6 hours, possibly more.
So the issue is that it is difficult to calculate a good stepping interval for the query: if we skip every tenth value for some resolution, that may work for series with 10 minutes between values, but for 6-hour intervals we would throw away too much and the graph would end up with too low a resolution for comfort.
My impression is that MySQL isn't able to make the stepping interval depend on time, so that it would skip rows that are, say, within five minutes of an already-included row.
One solution could be to set 6 hours as a minimal resolution requirement for the graph, so we don't throw away values unless 6 hours is represented by a sufficiently small distance in the graph. I fear that this may result in too much data being read and sent through the system if the interval actually is smaller.
Another solution is to have more intelligence in the Java code, reading sets of data iteratively from low resolution and downwards until the data is good enough.
Any ideas for a solution that would enable us to get optimal resolution in one read, without too large result sets being read from the database, while not putting too much load on the database? I'm having wild ideas about installing an intermediate NoSQL component to store the values in, that might support time intervals the way I want - not sure if that actually is an option in the organisation.
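For what it's worth, a time-based step can at least be approximated in plain MySQL by grouping on a fixed time bucket, so dense stretches collapse to one point per bucket while sparse stretches keep roughly one point per stored row. This is only a hedged sketch with placeholder table and column names, not something proposed in the thread:

# Bind (date_from, date_to, bucket_seconds) as parameters; e.g. 300 for
# five-minute buckets at a given zoom level.
query = """
SELECT MIN(`timestamp`) AS ts,
       AVG(value)       AS value
FROM   stats
WHERE  `timestamp` BETWEEN %s AND %s
GROUP  BY FLOOR(UNIX_TIMESTAMP(`timestamp`) / %s)
ORDER  BY ts
"""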

How would you handle a very large vector in Ruby?

I'm planning to write a program in Ruby to analyse some data which has come back from an online questionnaire. There are hundreds of thousands of responses, and each respondent answers about 200 questions. Each question is multiple-choice, so there are a fixed number of possible responses to each.
The intention is to use a piece of demographic data given by each respondent to train a system which can then guess that same piece of demographic data (age, for example) from a respondent who answers the same questionnaire, but doesn't specify the demographic data.
So I plan to use a vector (in the mathematical sense, not in the data structure sense) to represent the answers for a given respondent. This means each vector will be large (over 200 elements), and the total data set will be huge. I plan to store the data in a MySQL database.
So. 2 questions:
How should I store this in the database? One row per response to a single question, or one row per respondent? Or something else?
I'm planning to use something like the k-nearest neighbour algorithm, or a simple machine learning algorithm like a naive bayesian classifier to learn to classify new responses. Should I manipulate the data purely through SQL or should I load it into memory and store it in some kind of vast array?
First thing that comes to mind: storing it in memory can be absolutely reasonable for processing purposes. Let's say you reserve one byte for each answer; with a million responses and 200 questions you have a 200 MB array. Not small, but definitely not memory-exhausting on a modern desktop, even with a 32-bit OS.
As for the database, I think you should have three tables: one for the respondents with their demographic data, one for the questions, and, since you have an n:m relation between these tables, a third one with the respondent ID, the question ID and the answer code.
If you don't need additional data for the questions (like the question-text or something) you can even optimize away the question table.
Use an array of arrays, in memory. I just created a 500000x200 array and it required about 500MB of RAM. Easily manageable on a 2GB machine, and many, many orders of magnitude faster than using SQL.
Personally, I wouldn't bother putting the data in MySQL at all. Just Marshal it in and out, and/or use JSON or CSV.
If you definitely need database storage, and the comments elsewhere about alternatives are worth considering, then I'd advise against storing 200-odd responses in 200-odd rows: you don't seem to have any obvious need for the flexibility that such a design would give and performance across hundreds of thousands of respondents is going to be dire.
Using a RDBMS gives you the ability to store very large amounts of data, access them in a variety of multi-dimensional ways and extend the structure of your data ad hoc over time. But what you gain in flexibility over a flat file (or Marshalled, or other) option you often lose in performance. I have to confess to reaching for third normal form far too early myself. I guess the questions are, how much flexibility in querying do you expect to need, and how much change do you think your data is likely to undergo? If you think you're at the low end of both, consider leaving the SQL on the shelf. If you abstract your data access into a separate layer then changing should be cheap later. Just a thought...
I'd expect you can encode an individual's response in such a way that it can easily be used in code and it's unlikely to take more than 200 characters, less if you use some sort of packing or bit-mapping. I rather like the idea of bit-mapping, come to think of it - it makes simple comparison using something like Hamming distance an absolute breeze.
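The bit-mapping idea can be made concrete with a toy sketch (in Python for brevity, although the question is about Ruby). It assumes, purely for illustration, that every question has at most four choices, so two bits per answer:

def pack(answers):
    # answers: list of ints in 0..3, one per question (two assumed bits each)
    code = 0
    for a in answers:
        code = (code << 2) | a
    return code

def bit_distance(a, b):
    # Hamming distance between two packed responses, counted in bits
    return bin(a ^ b).count("1")

print(bit_distance(pack([0, 1, 2, 3]), pack([0, 1, 3, 3])))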
I'm not a great database person, so I'll just answer #2:
If you'd really like to save on memory (or foresee a situation where there will be a lot more data) you could take the best of both worlds: Use ruby as essentially a data-mining tool. Have it pull some of the data from the DB, then write the results back to the DB (probably under a different table or database altogether). This has the benefit of only using as much memory as you want it to.
Don't forget that Ruby is a dynamic object language, as such, a simple integer will probably take up more space than a simple int in C. It needs additional space to be able to characterise if it has been 'garnished' with any additional information, methods etc.