Which DEM format is more efficient: .tiff or .hgt?

I know this is a vague question, but I have no expertise in the GIS world. I'm building a terrain map for a game-like simulation project that I'm invested in. Data accuracy is of little or no concern; what matters is the quality of the generated terrain. I'm not storing 27 GB of data; instead I procedurally generate the height maps from low-resolution SRTM30 data (a couple of GB) as I need them. I mainly use GDAL, interpolation (Kriging), NumPy, and some scikit-learn, but I was unable to do this in real time.
My question is this: which DEM format is most suitable for GDAL (or an expensive loop) to work with in order to produce close-to-real-time results? Are there other choices?
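For what it's worth, whichever format you settle on, GDAL can read a sub-window of a raster straight into a numpy array, which is usually the biggest win for near-real-time generation because you never touch the full dataset. A minimal sketch with GDAL's Python bindings (the file name and window size are placeholders):

from osgeo import gdal
import numpy as np

ds = gdal.Open('dem.tif')          # GDAL also opens .hgt tiles via its SRTMHGT driver
band = ds.GetRasterBand(1)
tile = band.ReadAsArray(xoff=0, yoff=0, win_xsize=512, win_ysize=512).astype(np.float32)
gt = ds.GetGeoTransform()          # maps pixel indices to geographic coordinates

Tiled GeoTIFFs (including Cloud Optimized GeoTIFFs) are laid out so these windowed reads only pull the blocks they need, whereas an .hgt tile is plain raw binary with no internal tiling or overviews; that difference tends to matter more than the file extension itself, but treat this as a general observation rather than a benchmark.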

Related

6D pose estimation of a known 3D CAD object with limited model training for a new object

I'm working on a project where I need to estimate the 6DOF pose of a known 3D CAD object in a single RGB image - i.e. this task: https://paperswithcode.com/task/6d-pose-estimation. There are several constraints on the problem:
Usable commercially (licensed under BSD, MIT, BOOST, etc.), not GPL.
The CAD object is known and we do NOT aim for generality (i.e. recognizing the class of all chairs).
The CAD object can be uploaded by a user, so it may have symmetries and a range of textures.
Inference step will be run on a smartphone, and should be able to run at >30fps.
The inference step can either (a) find the pose of the object once, after which I can write code to continue tracking it, or (b) find the pose of the object continuously. I.e. the model doesn't need any continuous refinement steps after the initial pose estimate is found.
Can be anywhere on the scale from a single instance of a single object to multiple instances of multiple objects (MiMO). MiMO is preferred, but not required.
If a deep learning approach is used, the training time required for a new CAD object should be on the order of hours, not days.
Can either 1) just find the initial pose of an object with no refinement steps afterwards, or 2) find the initial pose of the object and also refine it afterwards.
I am open to traditional approaches (i.e. 2D->3D correspondences then solving with PnP), but it seems like deep learning approaches outperform them (classical ones are too slow; see Real time 6D pose estimation of known 3D CAD objects from a single 2D image or point clouds from RGBD Camera when objects are one on top of the other?). Looking at deep learning approaches (PoseCNN, HybridPose, Pix2Pose, CosyPose), most of them seem to match these constraints, except that they require model training time. Perhaps I could use a single pre-trained model and then specialize it for each new CAD object with a shorter training step, but I am not sure of this, and I think success probably depends on the specific model chosen. For example, this project says it requires 3 hours of training time: https://github.com/DLR-RM/AugmentedAutoencoder.
So, my question: does anybody know of a state-of-the-art, commercially usable implementation that doesn't require extensive training time for a new CAD object?
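For reference, here is a minimal sketch of the classical baseline mentioned above: once you have 2D-3D correspondences (however you obtain them), OpenCV's solvePnP recovers the 6DOF pose. All the arrays below are made-up placeholders, not values from any real model.

import numpy as np
import cv2

object_points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0],
                          [0, 1, 0], [0, 0, 1], [1, 0, 1]], dtype=np.float32)   # CAD-model points
image_points = np.array([[320, 240], [400, 238], [405, 300],
                         [322, 305], [318, 180], [402, 178]], dtype=np.float32)  # matched pixels
camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])      # intrinsics from calibration
dist_coeffs = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)   # rotation matrix; together with tvec this is the 6DOF pose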

NIfTI vs DICOM for 3D volumetric data

Are there major benefits of selecting NIfTI over DICOM (or vice versa) as the data format? I am working on 3D volumetric semantic segmentation. I will have to convert either format to a numpy array or tensor before feeding it to the network, but I'm curious about the performance benefits of the choice.
(This question risks being opinion-based, so trying to stick to facts.)
DICOM is a very powerful and flexible but complex format, and its strength is providing interoperability between different hardware and software. However, DICOM is not particularly efficient for image processing and analysis. One practical drawback is that a single volume is stored as a sequence of 2D slices, which can be cumbersome to deal with.
NIfTI is an improved version of the Analyze file format, designed to be simpler than DICOM while still retaining the essential metadata. It has the added benefit of storing a whole volume in a single file, with a simple header followed by the raw data, which makes it fast to load and process.
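As an illustration of that single-file convenience, loading a NIfTI volume into numpy is a one-liner with nibabel (the file name is a placeholder); a DICOM series, by contrast, typically has to be read slice by slice (e.g. with pydicom), sorted by position, and stacked:

import nibabel as nib

img = nib.load('volume.nii.gz')   # one file = one volume
data = img.get_fdata()            # numpy array, e.g. shape (X, Y, Z)
affine = img.affine               # voxel-to-world mapping kept in the header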
There are several other medical file formats suitable for this task. You may also wish to consider NRRD, which has many features in common with NIfTI: a simple format, fast to parse and load, with flexible storage encoding for 2-, 3-, and 4-D data. Many tools and libraries can process NRRD files too.
So, given that your primary need is efficient storage and analysis, NIfTI or NRRD would be the better choice.

How to input audio data into deep learning algorithm?

I'm very new to deep learning, and I'm aiming to use a GAN (Generative Adversarial Network) to recognize emotional speech. I've only seen images used as inputs to most deep learning algorithms such as GANs, but I'm curious how audio data can be used as an input, besides using images of spectrograms. Also, I'd appreciate it if you could explain it in layman's terms.
Audio data can be represented in the form of numpy arrays, but before moving to that you must understand what audio really is. If you think about what audio looks like, it is nothing but wave-like data, where the amplitude of the audio changes with respect to time.
Assuming our audio is represented in the time domain, we take amplitude values at regular, closely spaced intervals; the number of samples taken per second is called the sampling rate.
Converting the data to the frequency domain gives a more compact representation, which can reduce the amount of computation required.
Now, let's load the data. We'll use a library called librosa, which can be installed using pip.
import librosa
data, sampling_rate = librosa.load('audio.wav')
Now you have both the data and the sampling rate, and we can plot the waveform (librosa.display must be imported separately; in newer librosa versions waveplot has been renamed waveshow).
import librosa.display
librosa.display.waveplot(data, sr=sampling_rate)
Now you have the audio data in the form of a numpy array. You can study the features of the data and extract the ones you find interesting to train your models.
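For example (a rough sketch, continuing from the placeholder file above), librosa can compute common features such as MFCCs directly from that array:

import numpy as np
import librosa

data, sampling_rate = librosa.load('audio.wav')
mfccs = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=13)   # shape (13, frames)
zcr = librosa.feature.zero_crossing_rate(data)                      # shape (1, frames)
clip_vector = np.mean(mfccs, axis=1)     # one fixed-length feature vector per clip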
Further to Ayush's discussion, for information on the challenges and workarounds of dealing with large amounts of audio data at different time scales, I suggest this post on WaveNet: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
After that, it sounds like you want to do classification. In that case a GAN on its own is not suitable. If you have plenty of data you could use a straight LSTM (or another type of RNN), which is designed to model time series, or you can take fixed-size chunks of input and use a 1-D CNN (similar to WaveNet). If you have lots of unlabelled data from the same or a similar domain and limited training data, you could use a GAN to learn to generate new samples, then use the discriminator from the GAN as pre-trained weights for a CNN classifier.
Since you are trying to perform Speech Emotion Recognition (SER) using deep learning, you can go for a recurrent architecture (LSTM or GRU) or a combination of CNN and recurrent network architecture (CRNN) instead of GANs since GANs are complicated and difficult to train.
In a CRNN, the CNN layers will extract features of varying details and complexity, whereas the recurrent layers will take care of the temporal dependencies. You can then finally use a fully connected layer for regression or classification output, depending on whether your output label is discrete (for categorical emotions like angry, sad, neutral etc) or continuous (arousal and valence space).
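This is not the architecture from any particular paper, just a rough PyTorch sketch of the CRNN idea (the framework choice and all layer sizes are arbitrary), assuming a log-mel spectrogram input of shape (batch, 1, n_mels, frames):

import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        # CNN front-end extracts local time-frequency features
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # GRU models temporal dependencies across the remaining frames
        self.rnn = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64,
                          batch_first=True, bidirectional=True)
        # fully connected head: n_classes outputs for categorical emotions,
        # or 2 outputs for continuous arousal/valence regression
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, frames)
        z = self.cnn(x)                          # (batch, 32, n_mels // 4, frames // 4)
        z = z.permute(0, 3, 1, 2).flatten(2)     # (batch, frames // 4, 32 * n_mels // 4)
        out, _ = self.rnn(z)
        return self.fc(out[:, -1])               # last time step -> one label per utterance

model = CRNN()
logits = model(torch.randn(8, 1, 64, 200))       # dummy batch of 8 spectrograms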
Regarding the choice of input, you can use either a spectrogram (2D) or the raw speech signal (1D). For spectrogram input you would use a 2D CNN, whereas for a raw speech signal you would use a 1D CNN. Mel-scale spectrograms are usually preferred over linear spectrograms since our ears perceive frequency on a logarithmic scale, not a linear one.
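A minimal sketch of that mel-spectrogram input with librosa ('audio.wav' and the parameter values are placeholders):

import librosa

y, sr = librosa.load('audio.wav', sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)   # log scale, closer to how we perceive loudness
# log_mel has shape (n_mels, frames): treat it as a 1-channel image for a 2D CNN,
# or skip this step and feed the raw 1D signal y to a 1D CNN instead.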
I have used a CRNN architecture to estimate the level of verbal conflict arising from conversational speech. Even though it is not SER, it is a very similar task.
You can find more details in the paper
http://www.eecs.qmul.ac.uk/~andrea/papers/2019_SPL_ConflictNET_Rajan_Brutti_Cavallaro.pdf
Also, check my github code for the same paper
https://github.com/smartcameras/ConflictNET
and a SER paper whose code I reproduced in Python
https://github.com/vandana-rajan/1D-Speech-Emotion-Recognition
And finally as Ayush mentioned, Librosa is one of the best Python libraries for audio processing. You have functions to create spectrograms in Librosa.

Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park, and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it can score other pictures by the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations (embeddings) from an input image. Here you can use Siamese training to explicitly teach the network similarities between images; such an approach is employed for face recognition, for instance (see FaceNet). Having such embeddings, you can apply well-established methods for outlier detection, for instance a 1-class SVM, or any other classifier. In this case you also need negative examples.
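As a small illustration of that last step (random arrays stand in for real embeddings here), scikit-learn's OneClassSVM is fitted on the positive embeddings and then used to score new images:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
park_embeddings = rng.normal(size=(1000, 128))    # embeddings of amusement-park images
query_embeddings = rng.normal(size=(10, 128))     # embeddings of images to score

clf = OneClassSVM(gamma='scale', nu=0.05)         # nu roughly = expected outlier fraction
clf.fit(park_embeddings)
scores = clf.decision_function(query_embeddings)  # higher score = more "amusement park"-like

Negative examples remain useful for validating the score threshold, as noted above.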
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
In general, your success in this task strongly depends on the task statement (are you restricted to amusement parks only, or any kind of place?) and on having the proper data.

Web Audio Pitch Detection for Tuner

So I have been making a simple HTML5 tuner using the Web Audio API. I have it all set up to respond to the correct frequencies; the problem seems to be with getting the actual frequencies. Using the input, I create an array of the spectrum, where I look for the highest value and use that frequency as the one to feed into the tuner. The problem is that, when creating an analyser in Web Audio, it cannot become more specific than an FFT value of 2048. With that, if I play a 440 Hz note, the closest value in the array is something like 430 Hz, and the next value is higher than 440. Therefore the tuner thinks I am playing those notes, when in fact the loudest frequency should be 440 Hz and not 430 Hz. Since this frequency does not exist in the analyser array, I am trying to figure out a way around this, or whether I am missing something very obvious.
I am very new at this so any help would be very appreciated.
Thanks
There are a number of approaches to implementing pitch detection. This paper provides a review of them. Their conclusion is that using FFTs may not be the best way to go - however, it's unclear quite what their FFT-based algorithm actually did.
If you're simply tuning guitar strings to fixed frequencies, much simpler approaches exist. Building a fully chromatic tuner that does not know a-priori the frequency to expect is hard.
The FFT approach you're using is entirely possible (I've built a robust musical instrument tuner using this approach that is being used white-label by a number of 3rd parties). However you need a significant amount of post-processing of the FFT data.
To start, you address the resolution problem using the Short-Time FFT (STFT), or more precisely a succession of them. The process is described nicely in this article.
If you intend building a tuner for guitar and bass guitar (and let's face it, everyone who asks the question here is), you'll need at least a 4096-point DFT with overlapping windows in order to have enough frequency resolution to resolve the bottom E1 string at ~41 Hz.
You have a bunch of other algorithmic and usability hurdles to overcome. Not least, perceived pitch and the spectral peak aren't always the same. Taking the spectral peak from the STFT doesn't work reliably (this is also why the basic auto-correlation approach is broken).
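Purely for illustration (in Python rather than Web Audio, and only one of the many post-processing steps alluded to above, not a fix for the peak-vs-pitch problem), parabolic interpolation around the FFT peak bin shows how you can estimate frequencies finer than the bin spacing, which is exactly the 430 Hz vs 440 Hz issue in the question:

import numpy as np

def peak_frequency(frame, sample_rate):
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    k = int(np.argmax(spectrum))
    if 0 < k < len(spectrum) - 1:          # fit a parabola through the peak and its neighbours
        a, b, c = spectrum[k - 1], spectrum[k], spectrum[k + 1]
        k = k + 0.5 * (a - c) / (a - 2 * b + c)
    return k * sample_rate / len(frame)

sr = 44100
t = np.arange(4096) / sr
print(peak_frequency(np.sin(2 * np.pi * 440.0 * t), sr))   # ~440 Hz despite ~10.8 Hz bins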