I slice inside a NAL unit of type 1 - H.264

I recently came across an interesting H.264 bit-stream and wanted to understand if it is valid in terms of the spec.
Assume there is a bit-stream consisting only of NAL units of type 1 (coded slice of a non-IDR picture), but inside these units there are slices of type 7 (I slice). Looking at the specification this seems valid, but up to this point I lived with the belief that I should always expect at least one NAL unit of type 5 (coded slice of an IDR picture) to start decoding, whereas this shows that I should also examine non-IDR pictures for I slices. Is that correct? Is there any rationale for not using IDR pictures and putting I slices inside non-IDR pictures instead?

Completely normal. If one slice in an AU (frame) is an IDR slice, then ALL VCL slices in that AU MUST also be IDR slices. An I slice inside a non-IDR NAL unit allows you to mix I/P/B slices in the same AU. This allows for features such as periodic intra refresh.
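For anyone who wants to verify this on their own streams, here is a minimal Python sketch (not production code) that reads nal_unit_type from the NAL header and slice_type from the start of the slice header. It assumes the start code has already been stripped and emulation-prevention bytes removed:

def read_ue(bits, pos):
    # Decode one unsigned Exp-Golomb value starting at bit position pos;
    # returns (value, new_pos).
    leading_zeros = 0
    while not bits[pos + leading_zeros]:
        leading_zeros += 1
    pos += leading_zeros + 1  # skip the zeros and the terminating 1 bit
    value = (1 << leading_zeros) - 1
    for i in range(leading_zeros):
        value += bits[pos + i] << (leading_zeros - 1 - i)
    return value, pos + leading_zeros

def inspect_nal(nal_bytes):
    # Report nal_unit_type and, for VCL NAL units (types 1 and 5), the slice_type.
    nal_unit_type = nal_bytes[0] & 0x1F
    info = {"nal_unit_type": nal_unit_type}
    if nal_unit_type in (1, 5):
        # The slice header starts right after the one-byte NAL header.
        bits = [(b >> (7 - i)) & 1 for b in nal_bytes[1:9] for i in range(8)]
        pos = 0
        _, pos = read_ue(bits, pos)          # first_mb_in_slice
        slice_type, _ = read_ue(bits, pos)   # 2 or 7 means an I slice
        info["slice_type"] = slice_type
    return info

A result like {"nal_unit_type": 1, "slice_type": 7} is exactly the combination described in the question, and it is legal.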

Related

How to interpolate medical images of different ranges

I would like to interpolate the PET and CT data so that the underlying arrays have the same dimensions. In the normal case this is simple, but I have a data set where the spatial extents of the PET and CT scans differ, so I would first need to trim the larger study.
The problem that may appear is that choosing a subset of slices will leave a small yet observable part of the larger image overhanging the smaller one (because the voxels are of different sizes), and if I understand it correctly this may spoil the interpolation.
I suppose this is a common problem, so how can I achieve interpolation where not only the voxel dimensions differ but also the extent?
The code below does not work as I would like when the extents of the images differ:
sitk.Resample(imagePET, ctImage)
Link to the mentioned dataset
https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-PET-CT
What you probably want is to read both DICOM series as a 3D image, then do the resampling, and then write the resulting image as a series of slices (in DICOM or another format). This example should be helpful.
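For reference, here is a minimal sketch of that workflow with SimpleITK. The directory paths are placeholders for the extracted Head-Neck-PET-CT series, and the identity transform assumes both series already share the same patient coordinate frame:

import SimpleITK as sitk

# Placeholder paths; point these at the extracted CT and PET series.
ct_reader = sitk.ImageSeriesReader()
ct_reader.SetFileNames(ct_reader.GetGDCMSeriesFileNames("/path/to/CT_series"))
ct_image = ct_reader.Execute()

pet_reader = sitk.ImageSeriesReader()
pet_reader.SetFileNames(pet_reader.GetGDCMSeriesFileNames("/path/to/PET_series"))
pet_image = pet_reader.Execute()

# Resample PET onto the CT grid (same size, spacing, origin and direction).
# CT voxels that fall outside the PET extent receive the default pixel value.
pet_on_ct = sitk.Resample(
    pet_image,
    ct_image,            # reference image defines the output grid
    sitk.Transform(),    # identity transform: images share patient space
    sitk.sitkLinear,     # interpolator
    0.0,                 # default value for out-of-extent voxels
    pet_image.GetPixelID(),
)

sitk.WriteImage(pet_on_ct, "pet_resampled_to_ct.nrrd")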

H.264 encode with B-frames frame count

Suppose I encode three YUV frames using simple H.264 and get a frame pattern IPIP.
But then, allowing B-frames in encoding, I get IBPBIPBP.....
These are clearly more frames than in the simple case, so do we play these frames at a higher rate to get the original three frames back?
In other words, how is this related to actual time?
Coders generate B frames (if they have the capability) not to play tricks with the frame rate, but to encode those frames with fewer bits in the channel, or to get higher quality for the same channel bitrate than just IPPPPIPPPP. The number of coded frames still equals the number of source frames; B frames are simply transmitted out of display order, and each frame carries its own presentation timestamp, so playback timing is unchanged.
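A toy Python illustration of the timing (not tied to any particular encoder): with a display-order pattern I B B P, the stream is transmitted in decode order so that references precede the B frames that use them, but each frame keeps its capture timestamp, so playback time does not change:

FPS = 30.0
display_order = ["I0", "B1", "B2", "P3"]   # capture / presentation order
decode_order  = ["I0", "P3", "B1", "B2"]   # order in the bit-stream

print("decode order:", " ".join(decode_order))
for frame in display_order:
    pts = int(frame[1:]) / FPS             # presentation timestamp in seconds
    print(frame, "shown at t =", round(pts, 3), "s")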

Question about applying Alpha zero general to different games

I’m trying to use alpha-zero-general on a different game (somewhat like chess); this is the original code for Othello:
https://github.com/suragnair/alpha-zero-general
However, after a few iterations (about 300 self-play games), it still seems to make nonsense moves. So I’m wondering whether my code is wrong. Here are a few questions I came up with:
should num_channels be modified?
Note: I’m actually confused about the parameter “number of channels”. In my opinion, it should be at most 3 for Othello (there are only “black”, “white” and “none” types of pieces); however, num_channels is set to 128 in the original case.
Another question is about the “board”, which will be the input of the nnet. The original code uses a 2D array to represent an Othello board, with 1 for a white piece, -1 for a black piece, and 0 for none.
However, there are, for instance, 6 kinds of pieces in chess. Normally the board should be input as a 3D array: two dimensions for the board and another dimension for the different kinds of pieces (i.e. channels, just like an RGB 128*128 picture is a 3D input of shape 3*128*128).
I currently use only a 2D array, where 1, 2, 3, ... represent king, queen, bishop, etc. I’m not sure if this causes a problem.
I’ve tried to figure it out from the written code but I couldn’t find an answer.
Yes, you will need to have a channel for each type of piece, or another more compact representation of the board. The reason has to do with the 'squashing' functions (non-linearities) and the fact that the network is learning a continuous map: you put too much of a 'burden' on a single input neuron. You will need a channel for each type of piece for each player; if you have 8 types of piece and 2 players, you will need 16 channels. Use 0-1 matrices for each channel.
Think about what will happen to the input values, if they are integers, when we run them through such a squashing function.
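As an illustration of the "one 0-1 channel per piece" idea, here is a small sketch; the integer piece codes are hypothetical and just mirror the 2D encoding described in the question:

import numpy as np

# Hypothetical codes in the asker's 2D board: 1..6 for white K,Q,R,B,N,P,
# -1..-6 for the corresponding black pieces, 0 for an empty square.
PIECE_CODES = [1, 2, 3, 4, 5, 6, -1, -2, -3, -4, -5, -6]

def board_to_planes(board_2d):
    # Convert an 8x8 integer board into a stack of 0-1 planes,
    # one channel per (piece type, colour) pair.
    board_2d = np.asarray(board_2d)
    planes = np.zeros((len(PIECE_CODES), 8, 8), dtype=np.float32)
    for channel, code in enumerate(PIECE_CODES):
        planes[channel] = (board_2d == code)
    return planes  # shape (12, 8, 8), ready to feed into the CNN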

Why is it possible to use BT.709 in H.264 to represent more colors than BT.601?

Why is it possible to use BT.709 in H.264 to represent more colors than BT.601? My thinking is that in YUV form it is all just Y, U, and V data, and when converting to RGB only a different matrix is used; a different matrix may just shift some values between U and V. So BT.709 should not be able to represent more colors than BT.601, and there might even be colors that BT.601 can represent but BT.709 cannot. Can anyone tell me where my mistake is?
BT.601 and BT.709 can represent the same number of colors. 601's colors tend to map better to what a CRT can actually display, whereas 709 maps to LCD. However, 601 more commonly uses partial range, meaning each byte is encoded using the range 16-235, whereas it is common to use full range (0-255) with 709. This is because partial range leaves padding at the extremes for analog distribution. Since the broadcast TV digital switchover happened in the US around the same time as everybody changed from CRT to LCD, the range and the color space tend to be linked.
DVD used SMPTE 170M a.k.a. SMPTE C primaries (SMPTE C uses the D65 white point), while the BT.601 matrix was derived from the ITU-R Rec. BT.470 System M a.k.a. NTSC 1953 primaries and white point (that is Illuminant C, not D65). (PAL DVD used the BT.601 matrix too, and so does JPEG, which defaults to BT.709 primaries.) What that means is that there is a mismatch between the primaries/white point and the matrix, while there should be none for optimal use of codepoints. So when a new matrix was derived from the BT.709 primaries, and an even newer one from the BT.2020 primaries, it was a better match. The conversion from primaries and white points to the two values that define the YCbCr matrix is specified in ITU-T Rec. H.273.
Another problem is that the BT.709 primaries are the de facto standard for SDR displays; even with a WCG display one must limit output to BT.709 (that is what the Natural mode does on Galaxy devices, iPhones and LG TVs). But after decoding BT.601 content you will have SMPTE C or PAL primaries, which MUST be color-managed to BT.709 primaries, and that also introduces some loss.
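To make the first point concrete, here is a small sketch deriving both YCbCr-to-R'G'B' matrices from their luma coefficients. Both matrices are invertible maps over the same code values, so neither can encode "more" colors; what differs is the R'G'B' you get back and the primaries those values are then interpreted against:

import numpy as np

# Luma coefficients defined by the two standards.
KR_601, KB_601 = 0.299, 0.114
KR_709, KB_709 = 0.2126, 0.0722

def ycbcr_to_rgb_matrix(kr, kb):
    # 3x3 matrix mapping [Y, Cb, Cr] (normalised, Cb/Cr centred on 0)
    # to R'G'B', derived from the luma coefficients Kr and Kb.
    kg = 1.0 - kr - kb
    return np.array([
        [1.0, 0.0,                          2.0 * (1.0 - kr)],
        [1.0, -2.0 * (1.0 - kb) * kb / kg,  -2.0 * (1.0 - kr) * kr / kg],
        [1.0, 2.0 * (1.0 - kb),             0.0],
    ])

M601 = ycbcr_to_rgb_matrix(KR_601, KB_601)
M709 = ycbcr_to_rgb_matrix(KR_709, KB_709)

# The same (Y, Cb, Cr) triplet decodes to different R'G'B' under each matrix.
ycbcr = np.array([0.5, 0.1, -0.2])
print("BT.601:", M601 @ ycbcr)
print("BT.709:", M709 @ ycbcr)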

What is COLOR_FormatYUV420Flexible?

I wish to encode video/avc on my Android encoder. The encoder (Samsung S5) publishes COLOR_FormatYUV420Flexible as one of its supported formats. Yay!
But I don't quite understand what it is and how I can use it. The docs say:
Flexible 12 bits per pixel, subsampled YUV color format with 8-bit chroma and luma components.
Chroma planes are subsampled by 2 both horizontally and vertically. Use this format with Image. This format corresponds to YUV_420_888, and can represent the COLOR_FormatYUV411Planar, COLOR_FormatYUV411PackedPlanar, COLOR_FormatYUV420Planar, COLOR_FormatYUV420PackedPlanar, COLOR_FormatYUV420SemiPlanar and COLOR_FormatYUV420PackedSemiPlanar formats
This seems to suggest that I can use this const with just about any kind of YUV data: planar, semi-planar, packed, etc. That seems unlikely: how would the encoder know how to interpret the data unless I specify exactly where the U/V values are?
Is there any meta-data that I need to provide in addition to this const? Does it just work?
Almost, but not quite.
This constant can be used with almost any form of YUV (planar, semiplanar, packed and all that). But the catch is that it's not you who chooses the layout that the encoder has to support - it's the other way around. The encoder will choose the surface layout and describe it via the flexible description, and you need to support it, whichever one it happens to be.
In practice, when using this, you don't call getInputBuffers() or getInputBuffer(int index), you call getInputImage(int index), which returns an Image, which contains pointers to the start of the three planes, and their row and pixel strides.
Note - when calling queueInputBuffer afterwards, you have to supply a size parameter, which can be tricky to figure out - see https://stackoverflow.com/a/35403738/3115956 for more details on that.
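If it helps to visualize what that plane description means, here is a toy Python sketch (the names are illustrative, not the actual Android API) of how rowStride and pixelStride determine where each sample lands in the buffer the codec hands you:

import numpy as np

def fill_plane(dst, width, height, row_stride, pixel_stride, src_plane):
    # Copy a tightly packed (height x width) plane into a strided buffer,
    # the way you would fill one plane of a flexible YUV Image.
    for y in range(height):
        for x in range(width):
            dst[y * row_stride + x * pixel_stride] = src_plane[y, x]

# Example: 4x4 luma plane plus a 2x2 U plane. The codec might report a row
# stride larger than the width (padding) and, for semiplanar-style chroma,
# a pixel stride of 2 (U and V samples interleaved).
w, h = 4, 4
y_src = np.arange(w * h, dtype=np.uint8).reshape(h, w)
u_src = np.full((h // 2, w // 2), 128, dtype=np.uint8)

y_dst = np.zeros(h * 8, dtype=np.uint8)          # rowStride 8, pixelStride 1
u_dst = np.zeros((h // 2) * 8, dtype=np.uint8)   # rowStride 8, pixelStride 2

fill_plane(y_dst, w, h, row_stride=8, pixel_stride=1, src_plane=y_src)
fill_plane(u_dst, w // 2, h // 2, row_stride=8, pixel_stride=2, src_plane=u_src)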