Rescale data to be sinuosoidal - regression

I have some time series data I'm looking at in Python that I know should follow a sine2 function, but for various reasons doesn't quite fit it. I'm taking an FFT of it and it has a fairly broad frequency spread, when it should be a very narrow single frequency. However, the errors causing this are quite consistent--if I take data again it matches very closely to the previous data set and gives a very similar FFT.
So I've been trying to come up with a way I can rescale the time axis of the data so that it is at a single frequency, and then apply this same rescaling to future data I collect. I've tried various filtering techniques to smooth the data or to cut frequencies from the FFT without much luck. I've also tried fitting a frequency varying sine2 to the data, but haven't been able to get a good fit (if I was able to, I would use the frequency vs time function to rescale the time axis of the original data so that it has a constant frequency and then apply the same rescaling to any new data I collect).
Here's a small sample of the data I'm looking at (the full data goes for a few hundred cycles). And the resulting FFT of the full data
Any suggestions would be greatly appreciated. Thanks!

Related

Finding images in RAM dump

Extracting screenshots from RAM dumps
Some classical security / hacking challenges include having to analyze the dump of the physical RAM of a system. volatility does a great job at extracting useful information, including wire-view of the windows displayed at the time (using the command screenshot). But I would like to go further and find the actual content of the windows.
So I'd like to reformulate the problem as finding raw images (think matrix of pixels) in a large file. If I can do this, I hope to find the content of the windows, at least partially.
My idea was to rely on the fact that a row of pixels is similar to the next one. If I find a large enough number of lines of the same size, then I let the user fiddle around with an interactive tool and see if it decodes to something interesting.
For this, I would compute a kind of spectrogram. More precise a heatmap where the shade show how likely it is for the block of data #x to be part of an image of width y bytes, with x and y the axis of the spectrogram. Then I'd just have to look for horizontal lines in it. (See the examples below.)
The problem I have right now is to find a method to compute that "kind of spectrogram" accurately and quickly. As an order of magnitude, I would like to be able to find images of width 2048 in RGBA (8192 bytes per row) in a 4GB file in a few minutes. That means processing a few tens of MB per second.
I tried using FFT and autocorrelation, but they do not show the kind of accuracy I'm after.
The problem with FFT
Since finding the length of a mostly repeating pattern looks like finding a frequency, I tried to use a Fourier transform with 1 byte = 1 sample and plot the absolute value of the spectrum.
But the main problem is the period resolution. Since I'm interested in finding the period of the signal (the byte length of a row of pixels), I want to plot the spectrogram with period length on the y axis, not the frequency. But the way the discrete Fourier transform work is that it computes the frequencies multiple of 1/n (for n data points). Which gives me a very low resolution for large periods and a higher-than-needed resolution for short periods.
Here is a spectrogram computed with this method on a 144x90 RGB BMP file. We expect a peak at an offset 432. The window size for the FFT was 4320 bytes.
And the segment plot of the first block of data.
I calculated that if I need to distinguish between periods k and k+1, then I need a window size of roughly k². So for 8192 bytes, that makes the FFT window about 16MB. Which would be way too slow.
So the FFT computes too much information I don't need and not enough information I would need. But given a reasonable window size, it usually show a sharp peak at about the right period.
The problem with autocorrelation
The other method I tried is to use a kind of discrete autocorrelation to plot the spectrogram.
More exactly, what I compute is the cross-correlation between a block of data and half of it. And only compute it for the offsets where the small block is fully inside the large block. The size of the large block has to be twice larger than the max period to plot.
Here is an example of spectrogram computed with this method on the same image as before.
And the segment plot of the autocorrelation of the first block of data.
Altough it produces just the right amount of data, the value of the autocorrelation change slowly, thus not making a sharp peak for the right period.
Question
Is there a way to get both a sharp peak and around the correct period and enough precision around the large periods? Either by tweaking the afformentioned algorithms or by using a completely different one.
I can't judge much about the FFT part. From the title ("Finding images in RAM dump") it seems you are trying to solve a bigger problem and FFT is only a part of it, so let me answer on those parts where I have some knowledge.
analyze the RAM dump of a system
This sounds much like physical RAM. If an application takes a screenshot, that screenshot is in virtual RAM. This results in two problems:
a) the screenshot may be incomplete, because parts of it are paged out to disk
b) you need to perform a physical address to virtual address mapping in order to bring the bytes of the screenshot into correct order
find raw images from the memory dump
To me, the definition of what a raw image is is unclear. Any application storing an image will use an image file format. Storing only the data makes the screenshot useless.
In order to perform an FFT on the data, you should probably know whether it uses 24 bit per pixel or 32 bit per pixel.
I hope to find a screenshot or the content of the current windows
This would require an application that takes screenshots. You can of course hope. I can't judge about the likeliness of that.
rely on the fact that a row of pixels is similar to the next one
You probably hope to find some black on white text. For that, the assumption may be ok. If the user is viewing his holiday pictures, this may be different.
Also note that many values in a PC are 32 bit (Integer, Float) and 0x00000000 is a very common value. Your algorithm may detect this.
images of width 2048
Is this just a guess? Or would you finally brute-force all common screen sizes?
in RGBA
Why RGBA? A screenshot typically does not have transparency.
With all of the above, I wonder whether it wouldn't be more efficient to search for image signatures like JPEG, BMP or PNG headers in the dump and then analyze those headers and simply get the picture from the metadata.
Note that this has been done before, e.g. WinDbg has some commands in the ext debugger extension which is loaded by default
!findgifs
!findjpegs
!findjpgs
!findpngs

gnuRadio Dual Tone detection

I am trying to come up with an efficient way to characterize two narrowband tones separated by about 900kHz (one at around 100kHZ and one at around 1MHz once translated to baseband). They don't move much in freq over time but may have amplitude variations we want to monitor.
Each tone is roughly about 100Hz wide and we are required to characterize these two beasts over long periods of time down to a resolution of about 0.1 Hz. The samples are coming in at over 2M Samples/sec (TBD) to adequately acquire the highest tone.
I'm trying to avoid (if possible) doing brute force >2MSample FFTs on the data once a second to extract frequency domain data. Is there an efficient approach? Something akin to performing two (much) smaller FFTs around the bands of interest? Ive looked at Goertzel and chirp z methods but I am not certain it helps save processing.
Something akin to performing two (much) smaller FFTs around the bands of interest
There is, it's called Goertzel, and is kind of the FFT for single bins, and you already have looked at it. It will save you CPU time.
Anyway, there's no reason to do a 2M-point FFT; first of all, you only want a resolution of about 1/20 the sampling rate, hence, a 20-point FFT would totally do, and should be pretty doable for your CPU at these low rates; since you don't seem to care about phase of your tones, FFT->complex_to_mag.
However, there's one thing that you should always do: look at your signal of interest, and decimate down to the rate that fits exactly that. Since GNU Radio's filters are implemented cleverly, the filter itself will only run at the decimated rate, and you can spend the CPU cycles saved on a better filter.
Because a direct decimation from 2MHz to 100Hz (decimation: 20000) will really have an ugly filter length, you should do this multi-rated:
I'd try first decimating by 100, and then in a second step by 100, leaving you with 200Hz observable spectrum. The xlating fir filter blocks will let you use a simple low-pass filter (use the "Low-Pass Filter Taps" block to define a variable that contains such taps) as a band-selector.

Best practice for storing GPS data of a tracking app in mysql database

I have a datamodel question for a GPS tracking app. When someone uses our app it will save latitude, longitude, current speed, timestamp and burned_calories every 5 seconds. When a workout is completed, the average speed, total time/distance and burned calories of the workout will be stored in a database. So far so good..
What we want is to also store the data that is saved those every 5 seconds, so we can utilize this later on to plot graphs/charts of a workout for example.
How should we store this amount of data in a database? A single workout can contain 720 rows if someone runs for an hour. Perhaps a serialised/gzcompressed data array in a single row. I'm aware though that this is bad practice..
A relational one/many to many model would be undone? I know MySQL can easily handle large amounts of data, but we are talking about 720 * workouts
twice a week * 7000 users = over 10 million rows a week.
(Ofcourse we could only store the data of every 10 seconds to halve the no. of rows, or every 20 seconds, etc... but it would still be a large amount of data over time + the accuracy of the graphs would decrease)
How would you do this?
Thanks in advance for your input!
Just some ideas:
Quantize your lat/lon data. I believe that for technical reasons, the data most likely will be quantized already, so if you can detect that quantization, you might use it. The idea here is to turn double numbers into reasonable integers. In the worst case, you may quantize to the precision double numbers provide, which means using 64 bit integers, but I very much doubt your data is even close to that resolution. Perhaps a simple grid with about one meter edge length is enough for you?
Compute differences. Most numbers will be fairly large in terms of absolute values, but also very close together (unless your members run around half the world…). So this will result in rather small numbers. Furthermore, as long as people run with constant speed into a constant direction, you will quite often see the same differences. The coarser your spatial grid in step 1, the more likely you get exactly the same differences here.
Compute a Huffman code for these differences. You might try encoding lat and long movement separately, or computing a single code with 2d displacement vectors at its leaves. Try both and compare the results.
Store the result in a BLOB, together with the dictionary to decode your Huffman code, and the initial position so you can return data to absolute coordinates.
The result should be a fairly small set of data for each data set, which you can retrieve and decompress as a whole. Retrieving individual parts from the database is not possible, but it sounds like you wouldn't be needing that.
The benefit of Huffman coding over gzip is that you won't have to artificially introduce an intermediate byte stream. Directly encoding the actual differences you encounter, with their individual properties, should work much better.

Efficient represention for growing circles in 2D space?

Imagines there's a 2D space and in this space there are circles that grow at different constant rates. What's an efficient data structure for storing theses circles, such that I can query "Which circles intersect point p at time t?".
EDIT: I do realize that I could store the initial state of the circles in a spatial data structure and do a query where I intersect a circle at point p with a radius of fastest_growth * t, but this isn't efficient when there are a few circles that grow extremely quickly whereas most grow slowly.
Additional Edit: I could further augment the above approach by splitting up the circles and grouping them by there growth rate, then applying the above approach to each group, but this requires a bounded time to be efficient.
Represent the circles as cones in 3d, where the third dimension is time. Then use a BSP tree to partition them the best you can.
In general, I think the worst-case for testing for intersection is always O(n), where n is the number of circles. Most spacial data structures work by partitioning the space cleverly so that a fraction of the objects (hopefully close to half) are in each half. However, if the objects overlap then the partitioning cannot be perfect; there will always be cases where more than one object is in a partition. If you just think about the case of two circles overlapping, there is no way to draw a line such that one circle is entirely on one side and the other circle is entirely on the other side. Taken to the logical extreme, assuming arbitrary positioning of the circles and arbitrary radiuses, there is no way to partition them such that testing for intersection takes O(log(n)).
This doesn't mean that, in practice, you won't get a big advantage from using a tree, but the advantage you get will depend on the configuration of the circles and the distribution of the queries.
This is a simplified version of another problem I have posted about a week ago:
How to find first intersection of a ray with moving circles
I still haven't had the time to describe the solution that was expected there, but I will try to outline it here(for this simplar case).
The approach to solve this problem is to use a kinetic KD-tree. If you are not familiar with KD trees it is better to first read about them. You also need to add the time as additional coordinate(you make the space 3d instead of 2d). I have not implemented this idea yet, but I believe this is the correct approach.
I'm sorry this is not completely thought through, but it seems like you might look into multiplicatively-weighted Voronoi Diagrams (MWVDs). It seems like an adversary could force you into computing one with a series of well-placed queries, so I have a feeling they provide a lower-bound to your problem.
Suppose you compute the MWVD on your input data. Then for a query, you would be returned the circle that is "closest" to your query point. You can then determine whether this circle actually contains the query point at the query time. If it doesn't, then you are done: no circle contains your point. If it does, then you should compute the MWVD without that generator and run the same query. You might be able to compute the new MWVD from the old one: the cell containing the generator that was removed must be filled in, and it seems (though I have not proved it) that the only generators that can fill it in are its neighbors.
Some sort of spatial index, such as an quadtree or BSP, will give you O(log(n)) access time.
For example, each node in the quadtree could contain a linked list of pointers to all those circles which intersect it.
How many circles, by the way? For small n, you may as well just iterate over them. If you constantly have to update your spatial index and jump all over cache lines, it may end up being faster to brute-force it.
How are the centres of your circles distributed? If they cover the plane fairly evenly you can discretise space and time, then do the following as a preprocessing step:
for (t=0; t < max_t; t++)
foreach circle c, with centre and radius (x,y,r) at time t
for (int X = x-r; X < x+r; x++)
for (int Y = x-r; Y < y+r; y++)
circles_at[X][Y][T].push_back (&c)
(assuming you discretise space and time along integer boundaries, scale and offset however you like of course, and you can add circles later on or amortise the cost by deferring calculation for distant values of t)
Then your query for point (x,y) at time (t) could do a brute-force linear check over circles_at[x][y][ceil(t)]
The trade-off is obvious, increasing the resolution of any of the three dimensions will increase preprocessing time but give you a smaller bucket in circles_at[x][y][t] to test.
People are going to make a lot of recommendations about types of spatial indices to use, but I would like to offer a bit of orthogonal advice.
I think you are best off building a few indices based on time, i.e. t_0 < t_1 < t_2 ...
If a point intersects a circle at t_i, it will also intersect it at t_{i+1}. If you know the point in advance, you can eliminate all circles that intersect the point at t_i for all computation at t_{i+1} and later.
If you don't know the point in advance, then you can keep these time-point trees (built based on the how big each circle would be at a given time). At query time (e.g. t_query), find i such that t_{i-1} < t_query <= t_i. If you check all the possible circles at t_i, you will not have any false negatives.
This is sort of a hack for a data structure that is "time dynamics aware", but I don't know of any. If you have a threaded environment, then you only need to maintain one spacial index and be working on the next one in the background. It will cost you a lot of computation for the benefit of being able to respond to queries with low latency. This solution should be compared at the very least to the O(n) solution (go through each point and check if dist(point, circle.center) < circle.radius).
Instead of considering the circles, you can test on their bounding boxes to filter out the ones which do not contain the point. If your bounding box sides are all sorted, this is essentially four binary searches.
The tricky part is reconstructing the sorted sides for any given time, t. To do that, you can start off with the original points: two lists for the left and right sides with the x coordinate, and two lists for top and bottom with the y coordinates. For any time greater than 0, all the left side points will move to left, etc. You only need to check each location to the one next to it to obtain a points where the element and the one next to it are are swapped. This should give you a list of time points to modify your ordered lists. If you now sort these modification records by time, for any given starting time and an ending time you can extract all the modification records between the two, and apply them to your four lists in order. I haven't completely figured out the algorithm, but I think there will be edge cases where three or more successive elements can cross over exactly at the same time point, so you may need to modify the algorithm to handle those edge cases as well. Perhaps a list modification record that contains the position in list, and the number of records to reorder would suffice.
I think it's possible to create a binary tree that solves this problem.
Each branch should contain a growing circle, a static circle for partitioning and the latest time at which the partitioning circle cleanly partitions. Further more the growing circle that is contained within a node should always have a faster growing rate than either of it's child nodes' growing circles.
To do a query, take the root node. First check it's growing circle, if it contains the query point at the query time, add it to the answer set. Then, if the time that you're querying is greater than the time at which the partition line is broken, query both children, otherwise if the point falls within the partitioning circle, query the left node, else query the right node.
I haven't quite completed the details of performing insertions, (the difficult part is updating the partition circle so that the number of nodes on the inside and outside is approximately equal and the time when the partition is broken is maximized).
To combat the few circles that grow quickly case, you could sort the circles in descending order by rate of growth and check each of the k fastest growers. To find the proper k given t, I think you can perform a binary search to find the index k such that k*m = (t * growth rate of k)^2 where m is a constant factor you'll need to find by experimentation. The will balance the part the grows linearly with k with the part that falls quadratically with the growth rate.
If you, as already suggested, represent growing circles by vertical cones in 3d, then you can partition the space as regular (may be hexagonal) grid of packed vertical cylinders. For each cylinder calculate minimal and maximal heights (times) of intersections with all cones. If circle center (vertex of cone) is placed inside the cylinder, then minimal time is zero. Then sort cones by minimal intersection time. As result of such indexing, for each cylinder you’ll have the ordered sequence of records with 3 values: minimal time, maximal time and circle number.
When you checking some point in 3d space, you take the cylinder it belongs to and iterate its sequence until stored minimal time exceeds the time of the given point. All obtained cones, which maximal time is less than given time as well, are guaranteed to contain given point. Only cones, where given time lies between minimal and maximal intersection times, are needed to recalculate.
There is a classical tradeoff between indexing and runtime costs – the less is the cylinder diameter, the less is the range of intersection times, therefore fewer cones need recalculation at each point, but more cylinders have to be indexed. If circle centers are distributed non-evenly, then it may be worth to search better cylinder placement configuration then regular grid.
P.S. My first answer here - just registered to post it. Hope it isn’t late.

Algorithm for online approximation of a slowly-changing, real valued function

I'm tackling an interesting machine learning problem and would love to hear if anyone knows a good algorithm to deal with the following:
The algorithm must learn to approximate a function of N inputs and M outputs
N is quite large, e.g. 1,000-10,000
M is quite small, e.g. 5-10
All inputs and outputs are floating point values, could be positive or negative, likely to be relatively small in absolute value but no absolute guarantees on bounds
Each time period I get N inputs and need to predict the M outputs, at the end of the time period the actual values for the M outputs are provided (i.e. this is a supervised learning situation where learning needs to take place online)
The underlying function is non-linear, but not too nasty (e.g. I expect it will be smooth and continuous over most of the input space)
There will be a small amount of noise in the function, but signal/noise is likely to be good - I expect the N inputs will expain 95%+ of the output values
The underlying function is slowly changing over time - unlikely to change drastically in a single time period but is likely to shift slightly over the 1000s of time periods range
There is no hidden state to worry about (other than the changing function), i.e. all the information required is in the N inputs
I'm currently thinking some kind of back-propagation neural network with lots of hidden nodes might work - but is that really the best approach for this situation and will it handle the changing function?
With your number of inputs and outputs, I'd also go for a neural network, it should do a good approximation. The slight change is good for a back-propagation technique, it should not have to 'de-learn' stuff.
I think stochastic gradient descent (http://en.wikipedia.org/wiki/Stochastic_gradient_descent) would be a straight forward first step, it will probably work nicely given the operating conditions you have.
I'd also go for an ANN. Single layer might do fine since your input space is large. You might wanna give it a shot before adding a lot of hidden layers.
#mikera What is it going to be used for? Is it an assignment in a ML course?