How to use gdal_rasterize to create a LZW compressed Tiff - gis

I want to use gdal_rasterize to generate a TIFF from a .shp shapefile. Usually the result is big, so I want to compress it using the LZW compress option.
I tried to do so with the command
gdal_rasterize.exe -burn 255 -burn 255 -burn 0 -burn 255 -ot Byte -tr 0.0332147 0.0332147 shp.shp shp0.tif --config COMPRESS LZW
but it seems the --config COMPRESS LZW option doesn't have any effect. (The result is exactly the same size as without the option.)
Maybe I have some misunderstanding of how to use this option.

You should add an = symbol between the option and the value. Without your data I can't test your specific example, but for me this fails:
gdal_translate --config COMPRESS LZW infile.tif outfile.tif
and this works fine:
gdal_translate --config COMPRESS=LZW infile.tif outfile.tif
COMPRESS is a GeoTIFF creation option, so the usual way to pass it is with -co rather than --config. Wrapping it in quotes also works, which is how I usually do it.
gdal_translate -co "COMPRESS=LZW" infile.tif outfile.tif
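Applied to the command from the question, this could look like the sketch below. The wrapper name rasterize_lzw is made up for illustration; the burn values, data type and resolution are copied from the question, and -co is assumed to behave for gdal_rasterize as it does for gdal_translate.

```shell
# Sketch: the question's gdal_rasterize call with LZW passed as a
# creation option (-co) instead of a config option (--config).
# rasterize_lzw is a hypothetical helper name, not part of GDAL.
rasterize_lzw() {   # usage: rasterize_lzw INPUT.shp OUTPUT.tif
    local src=$1 dst=$2
    gdal_rasterize -burn 255 -burn 255 -burn 0 -burn 255 \
        -ot Byte -tr 0.0332147 0.0332147 \
        -co COMPRESS=LZW \
        "$src" "$dst"
}
```

Run it as rasterize_lzw shp.shp shp0.tif, then check the result with gdalinfo shp0.tif, which should report COMPRESSION=LZW in the image structure metadata.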

Related

Multicore gzip uncompression with spliting output file (csv) to parts by 1Gb/file

I have a 10Gb gzip archive (uncompressed it is about 60Gb).
Is there a way to decompress this file with multithreading + on the fly splitting output to parts by 1Gb/part (n-lines/part, maybe)?
If I do something like this:
pigz -dc 60GB.csv.gz | dd bs=8M skip=0 count=512 of=4G-part-1.csv
I can get a 4Gb file, but dd doesn't care about line boundaries, so the lines in my files won't end properly.
Also, as I noticed, my GCE instance with a persistent disk has a maximum block size of 33kb, so I can't actually use a command like the one above and have to write something like:
pigz -dc 60GB.csv.gz | dd bs=1024 skip=0 count=4194304 of=4G-part-1.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=4194304 count=4194304 of=4G-part-2.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=$((4194304*2)) count=4194304 of=4G-part-3.csv
So I need some trick to always start each file on a new line.
UPDATE:
zcat 60GB.csv.gz |awk 'NR%43000000==1{x="part-"++i".csv";}{print > x}'
did the trick.
Based on the sizes you mention in your question, it looks like you get about 6-to-1 compression. That doesn't seem great for text, but anyway...
As Mark states, you can't just dip mid stream into your gz file and expect to land on a new line. Your dd options won't work because dd only copies bytes, it doesn't detect compressed newlines. If indexing is out of scope for this, the following command line solution might help:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%20000000){n++} {print|("gzip>part-"n".gz")}'
This decompresses your file so that we can count lines, then processes the stream, changing the output file name every 20000000 lines. You can adjust your recompression options where you see "gzip" in the code above.
If you don't want your output to be compressed, you can simplify the last part of the line:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} {print>("part-"n".csv")}'
You might have to play with the number of lines to get something close to the file size you're aiming for.
Note that if your shell is csh/tcsh, you may have to escape the exclamation point in the awk script to avoid it being interpreted as a history reference.
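A variant of the same idea, as a minimal sketch. The helper name split_gz_lines and its three parameters are made up for illustration, and gzip -dc stands in for gzcat/zcat. It closes each part before opening the next one, so awk never holds more than one output file open at a time.

```shell
# Sketch: split a gzip-compressed text file into parts of N lines each.
# split_gz_lines is a hypothetical helper, not a standard tool.
split_gz_lines() {   # usage: split_gz_lines FILE.gz LINES_PER_PART PREFIX
    gzip -dc "$1" | awk -v lines="$2" -v prefix="$3" '
        (NR - 1) % lines == 0 {          # first line of a new part
            if (out != "") close(out)    # finish the previous part
            out = prefix (++part) ".csv"
        }
        { print > out }
    '
}
```

For the file in the question this would be called as split_gz_lines 60GB.csv.gz 20000000 part-, which writes part-1.csv, part-2.csv and so on.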
UPDATE:
If you'd like to get status of what the script is doing, awk can do that. Something like this might be interesting:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} !(NR%1000){printf("part=%d / line=%d\r",n,NR)} {print>("part-"n".csv")}'
This should show you the current part and line number every thousand lines.
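If a target size matters more than a line count, GNU split can cut on line boundaries by itself. A sketch, assuming GNU coreutils (the -C, -d and --additional-suffix options are GNU extensions) and a made-up helper name split_gz_size:

```shell
# Sketch: split a gzip-compressed text file into parts of at most SIZE
# bytes each, never breaking a line (unless a single line exceeds SIZE).
# split_gz_size is a hypothetical helper; it requires GNU split.
split_gz_size() {   # usage: split_gz_size FILE.gz SIZE PREFIX
    gzip -dc "$1" |
        split -C "$2" -d --additional-suffix=.csv - "$3"
}
```

For the file in the question this would be split_gz_size 60GB.csv.gz 1G part-, producing part-00.csv, part-01.csv and so on, so there is no need to guess a line count.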
Unless it was especially prepared for such an operation, or unless an index was built for that purpose, then no. The gzip format inherently requires the decompression of the data before any point in the stream, in order to decompress data after that point in the stream. So it cannot be parallelized.
The way out is to either a) recompress the gzip file with synchronization points and save those locations, or b) go through the entire gzip file once and create another file of entry points with the previous context at those points.
For a), zlib provides Z_FULL_FLUSH operations that insert synchronization points in the stream from which you can start decompression with no previous history. You would want to create such points sparingly, since they degrade compression.
For b), zran.c provides an example of how to build an index into a gzip file. You need to go through the stream once in serial order to build the index, but having done so, you can then start decompression at the locations you have saved.

gdalwarp too slow (compared to gdal_merge)

I have 70+ raster images in TIFF format that I am trying to merge.
Originals can be found here:
http://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/
After pre-processing (pct2rgb, gdalwarp individual charts, gdal_translate to cut the collars) I try to run them through gdalwarp to mosaic them using a command like this:
gdalwarp --config GDAL_CACHEMAX 3000 -overwrite -wm 3000 -r bilinear -srcnodata 0 -dstnodata 0 -wo "NUM_THREADS=3" /data/aeronav/sec/c/Albuquerque_c.tif .....70 other file names ...master.tif
After 12 hours of processing:
Creating output file that is 321521P x 125647L.
Processing input file /data/aeronav/sec/c/Albuquerque_c.tif.
0...10...20...30...40...
This means gdalwarp is never going to finish.
In contrast, a gdal_merge command like this:
gdal_merge.py -n 0 -a_nodata 0 -o /data/aeronav/sec/master.tif /data/aeronav/sec/c/Albuquerque_c.tif ......70 plus files.....
finishes in a couple of hours.
Problem with gdal_merge is inferior quality output because of "average" sampling. I would like to use "bilinear" at the minimum - and "cubic" sampling if possible and for that gdalwarp is required.
Why is there such a big difference in performance between the two? Why doesn't gdalwarp want to finish? Is there any other command line option to speed things up in gdalwarp, or is there a way to add a sampling option to gdal_merge?
It seems gdalwarp is not the ideal command to merge these GeoTiffs (since I am not interested in warping again). Instead I used
gdalbuildvrt /data/aeronav/sec/master.virt .... 70+ files in order
to build a virtual mosaic. And then I used gdal_translate to convert the virt file into a GeoTiff:
gdal_translate -of GTiff /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif
That's it. This took less than an hour (even faster than gdal_merge, and it preserves the quality of the original files).

Clipping a geotiff file where it does NOT overlap with a shapefile

I have a geotiff file which overlaps with a shapefile. To clip for the overlapping part of the tif file, I can do this:
gdalwarp -co compress=deflate -dstnodata 255 -cutline shapefile.shp original.tif overlap.tif
But how can I clip for the non-intersecting part? That is, I want to create the complement of "overlap.tif" w.r.t. "original.tif".
You can use gdal_rasterize to burn a value where the shapefile overlaps the file. It works on an existing file, so make sure you use a copy.
gdal_rasterize -burn 255 shapefile.shp copy_of_original.tif
This burns a value of 255; setting -a_nodata 255 doesn't work on my version of GDAL. If you need it to be a real nodata value, running gdal_translate with -a_nodata 255 afterwards would do the trick.
gdal_rasterize also has a convenient -i flag, which inverts the rasterization so that everything outside the shapefile's polygons is burned instead.
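Put together, the steps could be sketched as follows. The function name clip_complement and the DEFLATE creation option on the final copy are assumptions made for illustration; the burn value 255 and the file roles come from the commands above. With no -b option, gdal_rasterize burns into band 1 only, so a multi-band image would need one -b/-burn pair per band.

```shell
# Sketch: keep only the part of a raster that does NOT overlap a shapefile.
# clip_complement is a hypothetical helper name, not part of GDAL.
clip_complement() {   # usage: clip_complement ORIGINAL.tif SHAPE.shp OUT.tif
    local original=$1 cutline=$2 out=$3 work status=0
    work=$(mktemp "${TMPDIR:-/tmp}/complement.XXXXXX") || return 1
    # Work on a copy, because gdal_rasterize modifies the file in place.
    cp "$original" "$work" &&
    # Burn 255 into band 1 wherever the shapefile covers the raster.
    gdal_rasterize -burn 255 "$cutline" "$work" &&
    # Declare 255 as the nodata value in the final output.
    gdal_translate -a_nodata 255 -co COMPRESS=DEFLATE "$work" "$out" || status=$?
    rm -f "$work"
    return "$status"
}
```

For the files in the question this would be clip_complement original.tif shapefile.shp complement.tif.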

How to create a PATCH file for the binary difference output file

I want to know how to create a PATCH for the difference file I got by comparing two binary files.
$ cmp -l oldFile newFile > outputFileName
I checked that for text files "diff" can be used to compare and generate a PATCH file:
$ diff -u oldFile newFile > mods.diff # -u tells diff to output unified diff format
I want to apply the PATCH on the old binary image file to get my new binary image file.
Diff and Patch are designed to work with text files, not arbitrary binary data. You should use something like bsdiff instead.
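A minimal sketch of the bsdiff round trip, assuming the bsdiff and bspatch tools are installed. Both take their arguments in the order old file, new file, patch file; the two wrapper names are invented for illustration.

```shell
# Sketch: create and apply a binary patch with bsdiff/bspatch.
# The wrapper names are hypothetical; bsdiff and bspatch are the real tools.
make_bsdiff_patch() {    # usage: make_bsdiff_patch OLD NEW PATCHFILE
    bsdiff "$1" "$2" "$3"
}
apply_bsdiff_patch() {   # usage: apply_bsdiff_patch OLD REBUILT PATCHFILE
    bspatch "$1" "$2" "$3"
}
```

For example, make_bsdiff_patch old.img new.img mods.bsdiff creates the patch, and apply_bsdiff_patch old.img rebuilt.img mods.bsdiff recreates the new image from the old one.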
If your repository or package uses git, you can make a binary diff with
git diff --patch --binary old_dir patched_dir
Of course you can also use it with commits
git diff --patch --binary commit1 commit2
JDIFF is a program that outputs the differences between two (binary) files.
You can also use the rdiff command.
If you still want to use diff and patch, here is a way...
Write a C program yourself to insert a newline character after every 512/1024/your_choice bytes (this is just to fool diff, as it compares the files line by line). Run this program on your two input files.
Then run 'diff -au file1 file2 > mod.diff' (this gives you the patch).
Patching is simple: 'patch < mod.diff'
Then write another program to remove the newlines from the patched binary file. That is all...
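A variation on the same trick that needs no custom C program, sketched under assumptions: xxd (shipped with Vim) turns the binary into lines of hex text, diff and patch work on that text, and xxd -r -p turns it back into bytes. This swaps the newline insertion described above for a hex dump. The function names are invented for illustration, and patch -o is assumed to be available (it is in GNU patch).

```shell
# Sketch: text-style patching of binaries via a plain hex dump.
# make_hex_patch and apply_hex_patch are hypothetical helper names.
make_hex_patch() {    # usage: make_hex_patch OLD NEW PATCHFILE
    local tmp status=0
    tmp=$(mktemp -d) || return 1
    xxd -p "$1" > "$tmp/old.hex"
    xxd -p "$2" > "$tmp/new.hex"
    diff -u "$tmp/old.hex" "$tmp/new.hex" > "$3" || status=$?
    rm -rf "$tmp"
    [ "$status" -le 1 ]   # diff exits 1 when the files differ; only >1 is an error
}

apply_hex_patch() {   # usage: apply_hex_patch OLD PATCHFILE OUT
    local tmp status=0
    tmp=$(mktemp -d) || return 1
    xxd -p "$1" > "$tmp/old.hex" &&
    patch -s -o "$tmp/new.hex" "$tmp/old.hex" "$2" &&
    xxd -r -p "$tmp/new.hex" > "$3" || status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Note that inserting or deleting even one byte shifts every following hex line, so the patch can grow almost as large as the file itself; tools such as bsdiff handle that case much better.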

DIFF utility works for 2 files. How to compare more than 2 files at a time?

So the utility Diff works just like I want for 2 files, but I have a project that requires comparisons with more than 2 files at a time, maybe up to 10 at a time. This requires having all those files side by side to each other as well. My research has not really turned up anything, vimdiff seems to be the best so far with the ability to compare 4 at a time.
My question: Is there any utility to compare more than 2 files at a time, or a way to hack diff/vimdiff so it can do multiple comparisons? The files I will be comparing are relatively short so it should not be too slow.
Displaying 10 files side-by-side and highlighting differences can be easily done with Diffuse. Simply specify all files on the command line like this:
diffuse 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt 10.txt
Vim can already do this:
vim -d file1 file2 file3
But you're normally limited to 4 files. You can change that by modifying a single line in Vim's source, however. The constant DB_COUNT defines the maximum number of diffed files, and it's defined towards the top of diff.c in versions 6.x and earlier, or about two thirds of the way down structs.h in versions 7.0 and up.
diff has built-in options --from-file and --to-file, which compare one operand to all others.
--from-file=FILE1
Compare FILE1 to all operands. FILE1 can be a directory.
--to-file=FILE2
Compare all operands to FILE2. FILE2 can be a directory.
Note: argument name --to-file is optional.
e.g.
# this will compare foo with bar, then foo with baz .html files
$ diff --from-file foo.html bar.html baz.html
# this will compare src/base-main.js with all .js files in git repo,
# that has 'main' in their filename or path
$ git ls-files :/*main*.js | xargs diff -u --from-file src/base-main.js
Check out "Beyond Compare": http://www.scootersoftware.com/
It lets you compare entire directories of files, and it looks like it runs on Linux too.
If you're running multiple diffs based off one file, you could try writing a script with a for loop that runs through each directory and runs the diff. Although the results wouldn't be side by side, you could at least compare them quickly. Hope that helps.
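Such a loop might look like the sketch below. The function name diff_against_base is hypothetical, and the unified diff format (-u) is an arbitrary choice.

```shell
# Sketch: diff one base file against each of the other files in turn.
# diff_against_base is a hypothetical helper name.
diff_against_base() {   # usage: diff_against_base BASE FILE...
    local base=$1 f
    shift
    for f in "$@"; do
        echo "=== $base vs $f ==="
        # diff exits 1 when the files differ; that is not an error here.
        diff -u "$base" "$f" || true
    done
}
```

For example, diff_against_base base.txt dir1/file.txt dir2/file.txt prints one labelled diff per file.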
Not answering the main question, but here's something similar to what Benjamin Neil has suggested but diffing all files:
Store the filenames in an array, then loop over the combinations of size two and diff (or do whatever you want).
files=(/path/of/files/some-prefix.*) # Array of files to compare
max=${#files[@]} # Take the length of that array
for ((idxA=0; idxA<max; idxA++)); do # iterate idxA from 0 to length
  for ((idxB=idxA + 1; idxB<max; idxB++)); do # iterate idxB from idxA + 1 to length
    echo "A: ${files[$idxA]}; B: ${files[$idxB]}" # Do whatever you're here for.
  done
done
Derived from @charles-duffy's answer: https://stackoverflow.com/a/46719215/1160428
There is a simple and good way to do this: grep.
Depending on the size of the text, you can copy and paste it, or you can redirect the file into the grep command. Use grep -vir /path for an inverted search, or grep -ir /path for a regular one. This is what I do for certification exams.