GIF Data Storage specification question

In the GIF specification, here:
http://www.w3.org/Graphics/GIF/spec-gif89a.txt
I am having trouble understanding two parts of it:
It specifies for 'LZW minimum code size' that:
This byte determines the initial number of bits used for LZW codes in the image data, as described in Appendix F.
What does it mean 'initial number of bits used for LZW codes'?
How do LZW codes work in the context of GIFs? (I understand it refers to Lempel-Ziv-Welch).
Where is the elusive 'Appendix F' it refers to? (It isn't in the body).
It also specifies that after the single byte for the 'LZW minimum code size', there is a block called 'Image Data', whose actual size is unspecified and which just consists of 'Data Sub-blocks'.
What does it mean 'Data sub-blocks'?
How do I work out the size of the data sub-blocks?
And does this relate to the LZW codes? If so, how should I interpret it?
Sorry for all the questions. Thank you for your time.
As a side-note: Even a partial answer or an answer to any of the questions would be greatly appreciated.

GIF applies the LZW algorithm with a variable (increasing) size of codes as
described in Wikipedia LZW. The 'initial number of bits' is the initial
size of the codes.
This is described in the document you refer to. The list of color codes of the
pixels is LZW compressed (paragraph "a" just above the part you are citing).
It is there in the file (on page 30 ;-), near the end; just search for
"Variable-Length-Code LZW Compression".
The "data sub-blocks" are the actual image data in chunks of 255 (or less bytes).
Maybe this quote explains it better:
The linked lists used by the image data and the extension blocks consist of series of sub-blocks, each sub-block beginning with a byte giving the number of subsequent data bytes in the sub-block (1 to 255), the series terminated by the empty sub-block (a 0 byte).
[Wikipedia GIF]
"How do I work out the size of the data sub-blocks?" See above. It is the first byte of
each of the blocks.
See above.
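For illustration, here is a minimal Python sketch of reading that part of the stream. The name read_image_data is made up, and it assumes f is a binary file object positioned just after the image descriptor:

```python
def read_image_data(f):
    # The single byte after the image descriptor: the LZW minimum code size.
    # Decoding starts with codes that are (this value + 1) bits wide; the clear
    # code is 2**min_code_size and the end-of-information code is one above it.
    lzw_min_code_size = f.read(1)[0]

    # The image data itself is a series of data sub-blocks: a length byte
    # (1..255) followed by that many bytes, terminated by a zero-length block.
    compressed = bytearray()
    while True:
        block_size = f.read(1)[0]
        if block_size == 0:          # the empty sub-block ends the series
            break
        compressed += f.read(block_size)

    # Concatenating the sub-block payloads gives the LZW code stream to decode.
    return lzw_min_code_size, bytes(compressed)
```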

Related

Producing many different hashes of a jpg file with minimal change to picture

My goal is to write a program (e.g. in Python or C++) which takes as input a JPG file (e.g. tux.jpg) and make tiny changes to it, such that it outputs many different images (maybe a thousand images or even more), but in a way that all these images, while having different hash, look almost the same visually, i.e. the changes should have the least impact to the original image as possible.
I first thought to play around with the JPG header, but that might not be enough to make the many thousands of different pictures I want.
As a naive way, I thought to flip a random bit in the file, but that bit can possibly result in a less than desirable result, which can be seen especially in small pictures (e.g. a dark pixel in the white space in the tux picture). Ideally, I would like to change a random pixel with a "neighboring" color, such that the two resulting pictures have almost no visual difference.
For this purpose, I read the JPG codec example but I find it very confusing and hard to understand. Can someone help me what my program should look for as it parses the file in binary format and how to change a random pixel with a "neighboring" color?
You can change the comment part of the file by playing with the file header. A simple way to do that is to use a ready-made open source program that allows you to put in a comment of your choice, for example HLLO repeated 8 times. That gives you 32 bytes (256 bits) to play with. You can then determine where the HLLO pattern is located in the file using a hex editor, load the data in memory, change those 32 bytes, and calculate the hash each time to get a collision (a hash that matches).
By the time you find a collision, the universe will have ended.
Although doable in theory, it's practically impossible to find a SHA-256 collision in a reasonable amount of time; if it were, standard encryption protocols would be broken and hackers would be enjoying their time.
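If the actual goal is just to produce many visually identical files with different hashes (which is what the question asks for), rather than a hash collision, the comment trick above is enough on its own. A rough Python sketch, assuming a 32-byte placeholder comment (e.g. HLLO repeated 8 times) has already been written into the file; make_variants is a made-up name:

```python
import hashlib

MARKER = b"HLLO" * 8          # the 32-byte placeholder comment from the answer above

def make_variants(jpeg_path, count):
    """Yield (sha256 hex digest, file bytes) for `count` files differing only in the comment."""
    data = bytearray(open(jpeg_path, "rb").read())
    offset = data.find(MARKER)            # locate the placeholder inside the comment segment
    if offset < 0:
        raise ValueError("placeholder comment not found")
    for i in range(count):
        variant = bytearray(data)
        # Overwrite the placeholder with a zero-padded counter. Comment bytes never
        # touch pixel data, so every variant decodes to exactly the same image.
        variant[offset:offset + len(MARKER)] = str(i).zfill(len(MARKER)).encode()
        yield hashlib.sha256(variant).hexdigest(), bytes(variant)
```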

How to concat two tensors of size [B,C,13,18] and [B,C,14,18] respectively in Pytorch?

I often run into this problem when the height or width of an image or a tensor is odd.
For example, suppose the original tensor is of size [B,C,13,18]. After forwarding a strided-2 conv and several other conv layers, its size will become [B,C,7,9]. If we upsample the output by 2 and concat it with the original feature map as most cases, the error occurs.
I found that many source codes use even sizes like (512, 512) for training, so this kind of problem doesn't happen. But at test time I use the original image size to keep fine details, and I often run into this problem.
What should I do? Do I need to change the network architecture?
Concatenating tensors with incompatible shapes does not make sense. Information is missing, and you need to specify it yourself. The question is: what do you expect from this concatenation? Usually, you pad the input with zeros, or truncate the output, in order to get compatible shapes (in the general case, being even is not the required condition). If the height and width are large enough, the edge effect should be negligible (well, except right at the edge; it depends).
So if you are dealing with convolutions only, there is no need to change the architecture strictly speaking; just add a padding layer wherever it seems appropriate.
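For example, a small PyTorch sketch of the crop/pad-then-concat idea (match_and_cat is just an illustrative name, not part of any library):

```python
import torch
import torch.nn.functional as F

def match_and_cat(upsampled, skip):
    # Crop the upsampled map down to the skip connection's spatial size
    # (handles the [B,C,14,18] vs [B,C,13,18] case from the question) ...
    upsampled = upsampled[:, :, :skip.size(2), :skip.size(3)]
    # ... and zero-pad on the right/bottom if it is still smaller somewhere.
    dh = skip.size(2) - upsampled.size(2)
    dw = skip.size(3) - upsampled.size(3)
    upsampled = F.pad(upsampled, (0, dw, 0, dh))   # (left, right, top, bottom)
    return torch.cat([upsampled, skip], dim=1)

skip = torch.randn(2, 8, 13, 18)
up = F.interpolate(torch.randn(2, 8, 7, 9), scale_factor=2)  # -> [2, 8, 14, 18]
print(match_and_cat(up, skip).shape)                         # -> [2, 16, 13, 18]
```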

What is the position embedding in the convolutional sequence to sequence learning model?

I don't understand the position embedding in the paper Convolutional Sequence to Sequence Learning. Can anyone help me?
From what I understand, for each word to translate, the input contains both the word itself and its position in the input chain (say, 0, 1, ...m).
Now, encoding such a data with simply having a cell with value pos (in 0..m) would not perform very well (for the same reason we use one-hot vectors to encode words). So, basically, the position will be encoded in a number of input cells, with one-hot representation (or similar, I might think of a binary representation of the position being used).
Then, an embedding layer will be used (just as it is used for word encodings) to transform this sparse and discrete representation into a continuous one.
The representation used in the paper chose to have the same dimension for the word embedding and the position embedding and to simply sum up the two.
From what I perceive, the position embedding is still the procedure of building a low-dimensional representation for one-hot vectors, except that this time the dimension of the one-hot vector is the length of the sentence. BTW, I think the order in which the 'one hot' positions are laid out doesn't really matter; it just gives the model a sense of 'position awareness'.
From what I have understood so far, for each word in the sentence they have 2 vectors:
One hot encoding vector to encode the word.
One hot encoding vector to encode the position of the word in the sentence.
Both of these vectors are passed separately as inputs, and each is embedded into an f-dimensional space. Once they have the activation values from both inputs, each in R^f, they simply add these activations to obtain a combined input element representation.
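A minimal PyTorch sketch of that idea (the class name and the dimensions are made up for illustration):

```python
import torch
import torch.nn as nn

class WordPlusPositionEmbedding(nn.Module):
    """Sum of a word embedding and a learned position embedding, as described above."""

    def __init__(self, vocab_size, max_len, dim):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)   # lookup table over the vocabulary
        self.pos = nn.Embedding(max_len, dim)       # lookup table over positions 0..max_len-1

    def forward(self, tokens):                      # tokens: [batch, seq_len] of word ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        # word(tokens) is [batch, seq_len, dim]; pos(positions) is [seq_len, dim]
        # and broadcasts over the batch dimension when added.
        return self.word(tokens) + self.pos(positions)

emb = WordPlusPositionEmbedding(vocab_size=1000, max_len=50, dim=16)
print(emb(torch.randint(0, 1000, (2, 7))).shape)    # torch.Size([2, 7, 16])
```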
I think khaemuaset's answer is correct.
To reinforce: As I understand from the paper (I'm reading A Convolutional Encoder Model for Machine Translation) and the corresponding Facebook AI Research PyTorch source code, the position embedding is a typical embedding table, but for seq position one-hot vectors instead of vocab one-hot vectors. I verified this with the source code here. Notice the inheritance of nn.Embedding and the call to its forward method at line 32.
The class I linked to is used in the FConvEncoder here.

OCR match frame's position to field in credit card

I am developing an OCR to detect credit card.
After scanning the image I get a list of words with their positions.
Any tips/suggestions about the best approach to detect which words correspond to each field of credit card (number, date, name)?
For example:
position = 96.00 491.00
text = CARDHOLDER
Thanks in advance
Your first problem is that most OCRs are not optimised for small amounts of text that take up most of the "page" (or card image, in your case) in spatially separated chunks. They expect lines, or pages of text from a scanned book or a newspaper. So straight away they're not likely to do that well at analysing the image.
Because the font is fairly uniform they'll likely recognise the characters well, but the layout will confuse the page segmentation algorithm and so the text you get out might not be in the right order. For example, the "1234" of the card number and the smaller "1234" below it constitute a single column of text, likewise the second two sets of four numbers and the expiration date.
For specialized cases where you know the layout in advance you really want to develop your own page segmentation algorithm to break up the image into zones, e.g. card number, card holder name, start and expiration dates. This shouldn't be too hard because I think the locations of these components are standardised on credit cards. Assuming good preprocessing and binarization you could basically do a horizontal histogram and split the image at the troughs.
Then extract each zone as a separate image containing just one line of text and feed it to the OCR.
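A rough numpy sketch of that horizontal-histogram splitting, assuming the card image has already been binarized with dark text on a white background (split_into_zones is a made-up name):

```python
import numpy as np

def split_into_zones(binary):
    """Split a binarized card image (dark text on white background) into horizontal zones."""
    dark_per_row = (binary < 128).sum(axis=1)        # horizontal projection histogram
    is_text_row = dark_per_row > 0                   # troughs are the rows with no dark pixels
    zones, start = [], None
    for y, has_text in enumerate(is_text_row):
        if has_text and start is None:
            start = y                                # a text band begins
        elif not has_text and start is not None:
            zones.append((start, y))                 # the band ended at row y
            start = None
    if start is not None:
        zones.append((start, len(is_text_row)))
    return [binary[top:bottom] for top, bottom in zones]
```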
Alternately (the quick and dirty approach)
Instruct the OCR that what you want to recognise consists of a single column (i.e. prevent it from trying to figure out the page layout itself). You can do this with Tesseract using the -psm (page segmentation mode) parameter, probably set to 6 (but try and see what gives you the best results).
Make Tesseract output hOCR format, which you can set in the configfile. hOCR format includes the bounding boxes of the lines that get output relative to the whole image.
Write an algorithm that compares the bounding boxes in the hOCR to where you know each card component should be (looking for some percentage of overlap; it won't match exactly, for obvious reasons).
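A sketch of that quick-and-dirty route in Python, assuming the tesseract binary is on the PATH and a version recent enough to write out.hocr (the function name and the regex-based parsing are just for illustration):

```python
import re
import subprocess

def card_line_boxes(image_path):
    """Run Tesseract in single-block mode with hOCR output and return line bounding boxes."""
    # --psm 6 = "assume a single uniform block of text" (Tesseract 3.x uses -psm instead)
    subprocess.run(["tesseract", image_path, "out", "--psm", "6", "hocr"], check=True)
    hocr = open("out.hocr", encoding="utf-8").read()
    boxes = []
    # Each hOCR line element carries its box in the title attribute: "bbox x0 y0 x1 y1".
    for m in re.finditer(r"class=.ocr_line.[^>]*?bbox (\d+) (\d+) (\d+) (\d+)", hocr):
        boxes.append(tuple(int(v) for v in m.groups()))
    # These boxes can then be matched by overlap against the known zones of a card
    # (number, cardholder name, expiration date).
    return boxes
```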
In addition to the good tips provided by Mikesname, you can greatly improve the recognition result regardless of which OCR engine you use if you use image processing to convert the image to bitonal (pure black and white), such as the attached copy of your image.

Options to Convert 16 bit Image

When I open a 16-bit image in TIFF format, it opens up as a black image. The 16-bit TIFF image only opens in ImageJ; it does not open in Preview. I am wondering what my options are for viewing the image in an easier way than opening ImageJ, without reducing the resolution. Should I convert it to an 8-bit format, or wouldn't I lose data when the format is reduced from 16 to 8 bit? Also, I was thinking about converting the TIFF image to JPEG, but would that result in a reduction in resolution?
From the ImageJ wiki's Troubleshooting page:
This problem can arise when 12-bit, 14-bit or 16-bit images are loaded into ImageJ without autoscaling. In that case, the display is scaled to the full 16-bit range (0 - 65535 intensity values), even though the actual data values typically span a much smaller range. For example, on a 12-bit camera, the largest possible intensity value is 4095—but with 0 mapped to black and 65535 mapped to white, 4095 ends up (linearly) mapped to a very very dark gray, nearly invisible to the human eye.
You can fix this by clicking on Image ▶ Adjust ▶ Brightness/Contrast... and hitting the Auto button.
You can verify whether the actual data is there by moving the mouse over the image, and looking at the pixel probe output in the status bar area of the main ImageJ window.
In other words, it is unlikely that your image is actually all 0 values, but rather the display range is probably not set to align with the data range. If your image has intensity values ranging from e.g. 67 to 520, but stored as a 16-bit image (with potential values ranging from 0 to 65535), the default display range is also 0=black, 65535=white, and values in between scaled linearly. So all those values (67 to 520) will appear near black. When you autoscale, it resets the display range to match the data range, making values 67 and below appear black, values 520 and above appear white, and everything in between scaled linearly.
If you use the Bio-Formats plugin to import your images into ImageJ, you can check the "Autoscale" option to ensure this dynamic display range scaling happens automatically.
As for whether to convert to JPEG: JPEG is a lossy compression format, which has its own problems. If you are going to do any quantitative analysis at all, I strongly advise not converting to JPEG. See the Fiji wiki's article on JPEG for more details.
Similarly, converting to 8-bit is fine if you want to merely visualize in another application, but it would generally be wrong to perform quantitative analysis on the down-converted image. Note that when ImageJ converts a 16-bit image as 8-bit (using the Image > Type menu), it "burns in" whatever display range mapping you currently have set in the Brightness/Contrast dialog, making the apparent pixel values into the actual 8-bit pixel values.
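To make that scaling concrete, here is a small numpy sketch of the linear display-range mapping that gets "burned in"; the function name and the 67..520 example values come from the explanation above, not from ImageJ's own code:

```python
import numpy as np

def to_8bit(img16, display_min, display_max):
    """Map display_min -> 0 and display_max -> 255 linearly, clipping everything outside."""
    scaled = (img16.astype(np.float64) - display_min) / (display_max - display_min)
    return (np.clip(scaled, 0.0, 1.0) * 255).astype(np.uint8)

# Data that only spans 67..520 out of the full 0..65535 range:
img16 = np.random.randint(67, 521, size=(4, 4), dtype=np.uint16)
print(to_8bit(img16, 0, 65535))                   # only values 0..2 -> looks black
print(to_8bit(img16, img16.min(), img16.max()))   # "autoscaled" -> uses the full 0..255 range
```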
Changing from a 16-bit image to an 8-bit image would potentially reduce the number of distinguishable intensity levels, but not necessarily the resolution. A reduction in resolution would come from changing the number of pixels; converting 16-bit to 8-bit keeps the number of pixels the same and only changes the bit depth.
The maximum pixel value in a 16-bit unsigned grayscale image is 2^16 - 1 = 65535.
The maximum pixel value in an 8-bit unsigned grayscale image is 2^8 - 1 = 255.
One case where the (intensity) resolution would be affected is if the 16-bit image had one set of pixels with value x and another set with value x + 1, and the conversion to 8-bit mapped both sets to the same value y; you would then no longer be able to resolve the two sets of pixels.
If you look at the maximum and minimum pixel values, you may well be able to convert to an 8-bit image without losing any data.
You could perform the conversion and check using various metrics if the information in the 8bit image is reduced. One such metric would be the entropy. This quantity should be the same if you have not lost any data. Note that the converse is not necessarily true i.e. just because the entropy is the same does not mean the data is the same.
If you want some more suggestions on how to validate the conversion and to see if you have lost any data let me know.
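For example, a self-contained entropy check in Python/numpy (the naive min/max conversion here is just for illustration and stands in for whatever conversion you actually use):

```python
import numpy as np

def entropy(img):
    # Shannon entropy of the pixel-value histogram, in bits.
    _, counts = np.unique(img, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

img16 = np.random.randint(67, 521, size=(256, 256), dtype=np.uint16)
# Naive min/max conversion to 8 bit:
scaled = (img16 - img16.min()) / float(img16.max() - img16.min())
img8 = (scaled * 255).round().astype(np.uint8)
print(entropy(img16), entropy(img8))   # equal entropy is necessary, but not sufficient
```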