I have multiple records of time series data related to human movement, covering multiple repetitions. In this data, rows represent timestamps and columns show the recorded parameters (features) at each timestamp. Most of the records have a different number of timestamps. I want to create a 3D array of shape (number of repetitions, timestamps, features) for movement modelling. For this, I need to make each data record equal in length. Can you please suggest a method (or methods, other than resampling) to make the lengths equal without losing movement-related information?
Related
I'm just thinking about the database design of an application that I am going to create soon. It will use a temperature sensor to track the environmental temperature and display a graph of the temperature over time.
I'm just realizing that I've never done something like this before. In terms of the database, do you just create a row for every measurement you take? That is, to have a table, and insert a new row containing the temperature and the current datetime, at a given interval?
Assuming that you will use a numeric field for the temperature, I don't see any problem with adding a row for each measurement. Even if you measure every second, that is just 3,600 rows per hour. MySQL can easily deal with millions or even billions of rows.
insert a new row containing the temperature and the current datetime, at a given interval?
Yup. Two columns, value and time. Use the datetime datatype native to your DBMS, so you can operate on it to get statistics over intervals or at particular times, such as at noon every Monday or whatever.
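For concreteness, a minimal sketch of such a table and a couple of interval queries in MySQL might look like the following (the table and column names are just placeholders, not anything from the original post):

-- Minimal sketch: one row per reading.
CREATE TABLE temperature_log (
    recorded_at DATETIME NOT NULL,
    temperature DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (recorded_at)
);

-- Average temperature per hour.
SELECT DATE(recorded_at) AS reading_day,
       HOUR(recorded_at) AS reading_hour,
       AVG(temperature)  AS avg_temp
FROM temperature_log
GROUP BY DATE(recorded_at), HOUR(recorded_at);

-- Readings taken during the noon hour every Monday (DAYOFWEEK: 1 = Sunday, 2 = Monday).
SELECT recorded_at, temperature
FROM temperature_log
WHERE DAYOFWEEK(recorded_at) = 2
  AND HOUR(recorded_at) = 12;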
There's nothing new or unusual about capturing time-varying information. That very phrase appears in Codd's paper introducing the relational model, in 1970.
I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created out of this data that will allow the end user to filter it by selecting a date range. Let's assume the user selects the date range D1 to D20. On making this selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the sum of the light-blue segments at the bottom, i.e. 11 days.
An approach that I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will greatly increase the number of records in the fact table. Besides, there are other columns in the fact that will hold redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id,employee_id,hours,...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
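As a rough sketch of that shape (the on_leave flag and the DimDate column names below are assumptions for illustration, not part of the original design), the report question then becomes a simple distinct count over the filtered date range:

-- One row per employee per date; on_leave marks leave days.
CREATE TABLE fact_attendance (
    date_id     INT NOT NULL,              -- FK to DimDate
    employee_id INT NOT NULL,              -- FK to DimEmployee
    hours       DECIMAL(4,2),
    on_leave    TINYINT(1) NOT NULL DEFAULT 0,
    PRIMARY KEY (date_id, employee_id)
);

-- Days in the selected range (D1..D20) with at least one employee on leave.
SELECT COUNT(DISTINCT f.date_id) AS leave_days
FROM fact_attendance f
JOIN DimDate d ON d.date_id = f.date_id
WHERE d.full_date BETWEEN '2019-01-01' AND '2019-01-20'
  AND f.on_leave = 1;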
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk space.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.
I have a scenario and two options to achieve it. Which one will be more efficient?
I am using MySQL to store the attendance of students (around 100k). I will later use this attendance data to plot charts and results based on the user's selection.
Approach 1) Store the attendance data of each student for each day in a new row (which will increase the number of rows exponentially and reduce processing time).
Approach 2) Store a serialized or JSON-formatted row of one year's attendance data for each student (which will increase processing time when updating attendance each day and reduce database size).
First, I think you are confused: the number of rows will increase linearly, not exponentially, and that is a big difference.
Second, 100k students is nothing for a database. Even if you store 365 days, that is only about 36 million rows; I handle that in a week.
Third, storing the data as JSON may complicate future queries.
So I suggest going with Approach 1.
With proper indexes, good design, and a fast disk, a database can handle billions of records.
You may also consider keeping historic data in a different schema so that queries on current data are a little faster, but that is just a minor tune-up.
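A rough sketch of Approach 1 in MySQL (the table, column, and index names are illustrative) could be:

-- One row per student per day.
CREATE TABLE attendance (
    student_id  INT  NOT NULL,
    attend_date DATE NOT NULL,
    status      TINYINT(1) NOT NULL,   -- 1 = present, 0 = absent
    PRIMARY KEY (student_id, attend_date),
    KEY idx_attend_date (attend_date)
);

-- Chart data: how many students were present each day in a selected period.
SELECT attend_date, SUM(status) AS present_count
FROM attendance
WHERE attend_date BETWEEN '2019-01-01' AND '2019-06-30'
GROUP BY attend_date;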
Spatial Indexes
Given a spatial index, is the index's utility, that is to say the overall performance of the index, only as good as the overall geometries it contains?
For example, if I were to take a million geometry values and insert them into a table so that their points are densely located relative to one another, does the index perform better than it would for identical geometry shapes whose relative locations are significantly more sparse?
Question 1
For example, take these two geometry shapes.
Situation 1
LINESTRING(0 0,1 1,2 2)
LINESTRING(1 1,2 2,3 3)
Geometrically they are identical, but their coordinates are offset by a single unit. Imagine this repeated one million times.
Now take this situation,
Situation 2
LINESTRING(0 0,1 1,2 2)
LINESTRING(1000000 1000000,1000001 1000001,1000002 1000002)
LINESTRING(2000000 2000000,2000001 2000001,2000002 2000002)
LINESTRING(3000000 3000000,3000001 3000001,3000002 3000002)
In the above example:
the lines' dimensions are identical to Situation 1,
the lines have the same number of points,
the lines have identical sizes.
However,
the difference is that the lines are massively further apart.
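To make the two situations concrete, a minimal setup might look like the following (assuming MySQL 5.7 or later; the table and column names are arbitrary):

-- A table with a spatial index to hold both situations.
CREATE TABLE geoms (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    g  GEOMETRY NOT NULL,
    SPATIAL INDEX (g)
);

-- Situation 1: densely clustered lines.
INSERT INTO geoms (g) VALUES
    (ST_GeomFromText('LINESTRING(0 0, 1 1, 2 2)')),
    (ST_GeomFromText('LINESTRING(1 1, 2 2, 3 3)'));

-- Situation 2: identical lines, but massively further apart.
INSERT INTO geoms (g) VALUES
    (ST_GeomFromText('LINESTRING(1000000 1000000, 1000001 1000001, 1000002 1000002)')),
    (ST_GeomFromText('LINESTRING(2000000 2000000, 2000001 2000001, 2000002 2000002)'));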
Why is this important to me?
The reason I ask this question is that I want to know whether I should remove as much precision from my input geometries as I possibly can, and reduce their density and closeness to each other as much as my application allows, without losing accuracy.
Question 2
This question is similar to the first, but instead of asking about shapes that are spatially close to one another, it asks: should the shapes themselves be reduced to the smallest possible form that describes what the application requires?
For example, suppose I were to use a SPATIAL index on a geometry datatype to represent dates.
If I wanted to store a range between two dates, I could use a datetime data type in MySQL. However, what if I wanted to use a geometry type instead, so that I represent the date range by taking each individual date and converting it with unix_timestamp()?
For example:
Date("1st January 2011") to Timestamp = 1293861600
Date("31st January 2011") to Timestamp = 1296453600
Now, I could create a LINESTRING based on these two integers.
LINESTRING(1293861600 0,1296453600 1)
If my application is actually only concerned about days, and the number of seconds isn't important for date ranges at all, should I refactor my geometries so that they are reduced to their smallest possible size in order to fulfil what the application needs?
So instead of 1293861600, I would use 1293861600 / (3600 * 24), which happens to be 14975.25.
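Putting that idea together, a sketch of storing and querying such day-number ranges might look like this (assuming MySQL 5.7 or later; the table name and the truncation to whole days are illustrative):

-- Each row stores one date range as a LINESTRING of day numbers.
CREATE TABLE date_ranges (
    id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    range_geom GEOMETRY NOT NULL,
    SPATIAL INDEX (range_geom)
);

-- 1 Jan 2011 .. 31 Jan 2011 as days since the Unix epoch (truncated).
INSERT INTO date_ranges (range_geom)
VALUES (ST_GeomFromText('LINESTRING(14975 0, 15005 1)'));

-- Find stored ranges whose bounding box overlaps a queried range.
SELECT id
FROM date_ranges
WHERE MBRIntersects(range_geom,
                    ST_GeomFromText('LINESTRING(14980 0, 14990 1)'));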
Can someone help fill in these gaps?
When inserting a new entry, the engine chooses the MBR which would be minimally extended.
By "minimally extended", the engine can mean either "area extension" or "perimeter extension", the former being default in MySQL.
This means that as long as your nodes have non-zero area, their absolute sizes do not matter: the larger MBRs remain larger and the smaller ones remain smaller, and ultimately all nodes will end up in the same MBRs.
These articles may be of interest to you:
Overlapping ranges in MySQL
Join on overlapping date ranges
As for density, the MBRs are recalculated on page splits, and there is a high chance that all points too far away from the main cluster will be moved to their own MBR on the first split. That MBR would be large, but it would become a parent to all the outlying points within a few iterations.
This will decrease the search time for the outlying points and increase the search time for the clustered points by one page seek.
It appears to be a common practice to let the time dimension of OLAP cubes be in a table of its own, like the other dimensions.
My question is: why?
I simply don't see what the advantage would be to have a time_dimension table of (int, timestamp) that is joined with the cube on some time_id foreign key, instead of having a timestamp column in the cube itself.
Principally, points in time are immutable and constant, and they are their own value. I don't find it very likely that one would want to change the associated value for a given time_id.
In addition, the timestamp column type is 4 bytes wide (in MySQL), as is the int type that would otherwise typically be the key, so the reason cannot be to save space either.
In discussing this with my colleagues, the only somewhat sensible argument I have been able to come up with is conformity with the other dimensions. But I find this argument rather weak.
I believe that it's often because the time dimension table contains a number of columns such as week/month/year/quarter, which allows for faster queries to get all of X for a particular quarter.
Given that the majority of OLAP cubes are written to get queries over time, this makes sense to me.
Paddy is correct: the time dimension contains useful "aliases" for the time primitives. You can capture useful information about the dates themselves, such as the quarter, national holidays, etc. You can write much quicker queries this way because there's no need to encode every holiday in your query.
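As a sketch of what that buys you (the fact table and its measure below are hypothetical, just to show the join), a date dimension with those aliases lets the filter live in plain columns:

-- Date dimension with pre-computed "aliases" for each calendar date.
CREATE TABLE dim_date (
    date_id             INT  NOT NULL PRIMARY KEY,
    full_date           DATE NOT NULL,
    year_num            SMALLINT NOT NULL,
    quarter_num         TINYINT  NOT NULL,
    month_num           TINYINT  NOT NULL,
    week_of_year        TINYINT  NOT NULL,
    is_national_holiday TINYINT(1) NOT NULL DEFAULT 0
);

-- "All of X for a particular quarter" without hard-coding any dates.
SELECT SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_date d ON d.date_id = f.date_id
WHERE d.year_num = 2011
  AND d.quarter_num = 2;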