I have a VLP-16 lidar and I want to use it for activity recognition. I want to treat the point clouds as RGB images so that I can use OpenPose. I use this repository to convert my point clouds into depth images, and the following are some examples:
# Use Pillow and OpenCV to check the image format
Pillow: RGB (150, 64)
OpenCV: (64, 150, 3)
As these two images show, I have two choices: a low-resolution image, or a higher-resolution image that is blurry and deformed (because the VLP-16 is sparse in the vertical direction).
I've tried several methods to process these two types of image:
Use OpenCV or other APIs to resize / crop the images (see the sketch after this list).
Use image super-resolution (results shown below).
Use a filter for depth-image completion (results shown below).
Use CycleGAN for image-to-image translation.
Use a network designed for low-resolution depth images.
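For the resize/crop item above, this is roughly what I mean; the file name, target size, and interpolation mode are placeholders, not my actual pipeline values:

    import cv2

    # Placeholder example of the resize step: file name, target size, and
    # interpolation mode are assumptions, not my real pipeline values.
    depth = cv2.imread("depth_64x150.png", cv2.IMREAD_UNCHANGED)
    # INTER_NEAREST keeps the sparse depth values; INTER_LINEAR fills in but blurs
    upscaled = cv2.resize(depth, (600, 256), interpolation=cv2.INTER_NEAREST)
    cv2.imwrite("depth_upscaled.png", upscaled)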
In addition, I also tried lidar super-resolution (point cloud upsampling), but these methods didn't let me achieve my goal: treating point clouds as RGB images for activity recognition.
May I have some suggestions on how to achieve my goal?
Any help is much appreciated!
I am trying to implement a CNN for image denoising. I use noisy image fragments (32x32) and the same-sized clean fragments as targets for the training set. When I ran the trained network on a noisy image, I noticed that the denoised image contains artifacts that look like a 32x32-pixel grid. I visualised the feature maps produced by the convolution layers and noticed that layers with zero-padding give distorted edges. I found this topic, which, as a solution to the same problem, describes a convolution whose last step (e.g. for a 3x3 kernel) is to divide the result by 9 or 4 when using zero-padding.
None of the articles about convolution operations that I have read mention this. Does anyone know where I can read more about it?
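To check that I understood that suggestion, here is a rough NumPy sketch of my interpretation (the weighting scheme is my guess at what was meant, not something I found in a paper): each output is divided by the number of real, non-padded pixels under the kernel, so for a 3x3 kernel the interior is divided by 9 and the corners by 4.

    import numpy as np

    def border_normalized_conv2d(img, kernel):
        # Zero-padded 2D convolution where each output is divided by the number
        # of real (non-padded) pixels under the kernel: /9 in the interior,
        # /6 on an edge, /4 at a corner for a 3x3 kernel.
        kh, kw = kernel.shape
        ph, pw = kh // 2, kw // 2
        padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
        valid = np.pad(np.ones(img.shape, dtype=float), ((ph, ph), (pw, pw)))
        out = np.zeros(img.shape, dtype=float)
        for i in range(img.shape[0]):
            for j in range(img.shape[1]):
                window = padded[i:i + kh, j:j + kw]
                count = valid[i:i + kh, j:j + kw].sum()   # 9, 6 or 4 for a 3x3 kernel
                out[i, j] = (window * kernel).sum() / count
        return out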
I'm looking for a U-Net implementation for a landmark detection task, where the architecture is intended to be similar to the figure above. For reference, please see this: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms.
From the figure, we can see that the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly interpret the cropped output? From what I know, we would ideally expect the output size to be the same as the input size (572x572) so we can apply the mask to the original image to carry out segmentation. However, in some tutorials (like this one), the author recreates the model from scratch and uses "same" padding to sidestep this question, but I would prefer not to use same padding to achieve the same output size.
I couldn't use same padding because I chose to use a pretrained ResNet34 as my encoder backbone; from the PyTorch pretrained ResNet34 implementation, they didn't use same padding on the encoder part, which means the result would be exactly like what you see in the figure above (intermediate feature maps are cropped before being copied). If I were to continue building the decoder this way, the output would be smaller than the input image.
The question being: if I want to use the output segmentation maps, should I pad their outside until the dimensions match the input, or should I just resize the maps? I'm worried that the first will lose information about the boundary of the image, and that the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from what it learned on ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose image size through convolution; its implementation does in fact use same padding. An illustration:
Input (3, 512, 512) -> Layer1 (64, 128, 128) -> Layer2 (128, 64, 64) -> Layer3 (256, 32, 32) -> Layer4 (512, 16, 16)
so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the spatial dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
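To make that concrete, here is a minimal sketch of the kind of decoder I mean; the number of heatmap channels, the additive skip connections, and the 512x512 input size are my own simplifications, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet34

    class TinyUNet(nn.Module):
        def __init__(self, n_heatmaps=19):  # 19 landmark heatmaps is an assumption
            super().__init__()
            # on older torchvision use resnet34(pretrained=True) instead
            backbone = resnet34(weights="IMAGENET1K_V1")
            self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                      backbone.relu, backbone.maxpool)
            self.layer1, self.layer2 = backbone.layer1, backbone.layer2
            self.layer3, self.layer4 = backbone.layer3, backbone.layer4
            self.up4 = nn.ConvTranspose2d(512, 256, 2, stride=2)  # 16 -> 32
            self.up3 = nn.ConvTranspose2d(256, 128, 2, stride=2)  # 32 -> 64
            self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)   # 64 -> 128
            self.head = nn.Conv2d(64, n_heatmaps, 1)

        def forward(self, x):                       # x: (B, 3, 512, 512)
            c1 = self.layer1(self.stem(x))          # (B, 64, 128, 128)
            c2 = self.layer2(c1)                    # (B, 128, 64, 64)
            c3 = self.layer3(c2)                    # (B, 256, 32, 32)
            c4 = self.layer4(c3)                    # (B, 512, 16, 16)
            d3 = self.up4(c4) + c3                  # additive skip connections
            d2 = self.up3(d3) + c2
            d1 = self.up2(d2) + c1
            out = self.head(d1)                     # (B, n_heatmaps, 128, 128)
            # upsample 4x back to the 512x512 input resolution
            return F.interpolate(out, scale_factor=4, mode="bilinear",
                                 align_corners=False)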
In a Convolutional Neural Network, the convolution operation is used everywhere.
It is known that if you take a 5x5 greyscale image (1 channel) and convolve it with a 3x3 filter (containing certain weights) you get a 3x3 feature map as a result as demonstrated by this picture: Convolutions
But what happens once you extend this notion of convolving to RGB images, where you now have 3 channels (R, G, B) to convolve over? You simply add channels to your filter in proportion to the number of channels in your original image, right? Say we did; the process of convolving with an RGB image would then look like the following: a 6x6x3 RGB image convolved with a 3x3x3 filter. This apparently results in a 4x4x1 output rather than the 4x4x3 one would expect.
My question is why is this so?
If you surf the internet for visualizations of feature maps, you get back colorful low- and high-level features. Are those visualizations of the kernels themselves or of the feature maps? Either way, they all have color, which means they must have more than 1 channel, no?
Look at PyTorch's Conv2d and you'll notice that the size of the kernel is determined not only by its spatial width and height (3x3 in your question), but also by the number of input channels and output channels.
So, if you have an RGB input image (= 3 input channels) and a filter of size 3x3x3 (= a single output channel, for 3 input channels and spatial width/height of 3), then your output would indeed be 4x4x1.
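A quick sanity check of those shapes (the random tensor just stands in for your 6x6 RGB image):

    import torch
    import torch.nn as nn

    # One 3x3x3 filter over a 6x6 RGB image.
    conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, bias=False)
    x = torch.randn(1, 3, 6, 6)       # (batch, channels, height, width)
    print(conv.weight.shape)          # torch.Size([1, 3, 3, 3]) -> a single 3x3x3 filter
    print(conv(x).shape)              # torch.Size([1, 1, 4, 4]) -> one 4x4 feature map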
You can visualize this filter since you can interpret it as a tiny 3x3 RGB image.
Visualizing features/filters that are deeper in the network is not at all trivial, and the images you see are usually the result of optimization processes designed to "uncover" the filters. This page gives an overview of some intricate methods for feature visualization.
Well, color images have 3 channels by definition; you can see a color picture as a stack of 3 matrices of values, so two of them (say red and blue) can be set to zero. You should also read about the sparsity of a network...
I am using Pygame 1.9.2a with Python 2.7 to design an experiment. So far I have been using Pygame only on an as-needed basis and am not familiar with all Pygame classes or concepts (Sprites, for instance, I know nothing about).
I am required to draw many (45-50 at one time) shapes on the screen at different locations to create a crowded display. The shapes vary from displaced Ts and displaced Ls to line intersections (like _| or † or ‡, etc.). I'm sorry that I am not able to post an image of this, because I apparently do not have the reputation of 10 necessary to post images.
I also need these shapes in 8 different orientations. I was initially contemplating generating point lists and using them to draw lines. But for a single shape I would need four points, and I need 50 of these shapes. I'm also not sure how to rotate these once drawn. Can I use pygame.transform or something? I think it can be used on, say, Rects. Or will I have to generate points for the different orientations too, so that when drawn, the shapes come out looking rotated, that is, in the desired orientation?
The alternative I was thinking of was to generate images for the shapes in GIMP or similar software. But then, for any one screen, I would have to load around 50 images. Will I have to use pygame.image and make 50 calls for something like this? Or is there an easier way to handle multiple images?
Also, which method would be a bigger hit to performance? Since it is an experiment, I am worried about timing precision too. I don't know if there is a different way to generate shapes in Pygame. Please help me decide which of these two (or a different method) is better for my purposes.
Thank you!
It is easier to use pygame.draw.rect() or pygame.draw.polygon() (because you don't need to know how to use GIMP or Inkscape :) ), but you have to draw the shape on a separate pygame.Surface() (to get a bitmap); then you can rotate it, add alpha (to make it transparent), and then blit it onto the screen.
You can create a function that generates the images (using Surface()) for all shapes in all orientations at program start. If you later need better-looking images, you can change that function to load images created in GIMP.
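For example, here is a minimal sketch of that idea; the T shape, sizes, and screen positions are placeholders you would replace with your own stimuli:

    import pygame

    # Pre-render each shape once on its own Surface at program start, rotate
    # copies for the 8 orientations, and just blit them every frame (cheap,
    # which is good for timing precision).
    pygame.init()
    screen = pygame.display.set_mode((800, 600))

    def make_T(color=(255, 255, 255), size=40, width=3):
        # draw a "T" shape on a transparent Surface
        surf = pygame.Surface((size, size), pygame.SRCALPHA)
        pygame.draw.line(surf, color, (0, 0), (size - 1, 0), width)                  # top bar
        pygame.draw.line(surf, color, (size // 2, 0), (size // 2, size - 1), width)  # stem
        return surf

    base = make_T()
    oriented = [pygame.transform.rotate(base, angle) for angle in range(0, 360, 45)]

    screen.fill((0, 0, 0))
    for i, shape in enumerate(oriented):
        screen.blit(shape, (50 + 90 * i, 100))   # lay the 8 orientations in a row
    pygame.display.flip()
    pygame.time.wait(2000)
    pygame.quit()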
Try every method on your own - that is the best way to check which one works for you.
By the way: you can save the generated images with pygame.image.save() and then load them later. You can also keep all elements on one image and use part of that image with Surface.get_clip().
This may not be programming related, but programmers are probably in the best position to answer it.
For camera calibration I have an 8 x 8 square pattern printed on a sheet of paper. I have to manually enter the corner co-ordinates into a text file; the software then picks them up from there and computes the calibration parameters.
Is there a script or some software that I can run on these images to get the pixel co-ordinates of the 4 corners of each of the 64 squares?
You can do this with a traditional chessboard pattern (i.e. black and white squares with no gaps) using cvFindChessboardCorners(). You can read more about the function in the OpenCV API Reference and see some sample code in O'Reilly's OpenCV Book or elsewhere online. As an added bonus, OpenCV has built-in functions that calculate the intrinsic parameters of the camera and an array of extrinsic parameters for the multiple views of a planar calibration object.
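For example, with the modern Python binding of that function (cv2.findChessboardCorners) you could dump the corner pixel co-ordinates roughly like this; the file name and the 7x7 inner-corner count are assumptions (an 8x8-square chessboard has 7x7 inner corners), and the function expects a gap-free chessboard pattern:

    import cv2

    img = cv2.imread("calib.jpg")                     # assumed file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (7, 7))
    if found:
        # refine to sub-pixel accuracy before writing the coordinates out
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001))
        for (x, y) in corners.reshape(-1, 2):
            print("%.2f %.2f" % (x, y))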
I would:
apply a threshold and get a binarized image.
apply a Sobel X filter to the image. You get an image with the vertical lines; these belong to the sides of the squares that are almost vertical. Keep this as image1.
apply a Sobel Y filter to the image. You get an image with the horizontal lines; these belong to the sides of the squares that are almost horizontal. Keep this as image2.
take (image1 AND image2). Where a vertical and a horizontal edge meet you get a white pixel, so the result is a black image with white pixels indicating the corner positions.
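A rough OpenCV sketch of these steps; the Otsu threshold and the 3x3 Sobel kernels are guesses that would need tuning on your images:

    import cv2
    import numpy as np

    gray = cv2.imread("pattern.jpg", cv2.IMREAD_GRAYSCALE)   # assumed file name
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    sobel_x = cv2.Sobel(binary, cv2.CV_64F, 1, 0, ksize=3)   # responds to vertical edges
    sobel_y = cv2.Sobel(binary, cv2.CV_64F, 0, 1, ksize=3)   # responds to horizontal edges
    image1 = (np.abs(sobel_x) > 0).astype(np.uint8)
    image2 = (np.abs(sobel_y) > 0).astype(np.uint8)

    corners = cv2.bitwise_and(image1, image2)                # white only where both edge types meet
    cv2.imwrite("corners.png", corners * 255)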
Hope it helps.
I'm sure there are many computer vision libraries with varying capabilities and licenses out there, but one that I can remember off the top of my head is ARToolKit, which should be able to recognize this pattern. And if that's not possible, it comes with a set of very good patterns that are tailored so that they can be recognized even if they're partially obscured.
I don't know ARToolKit (although I've heard a lot about it), but with OpenCV this processing is trivial.