gdalwarp too slow (compared to gdal_merge)

I have 70+ raster images in TIFF format that I am trying to merge.
Originals can be found here:
http://www.faa.gov/air_traffic/flight_info/aeronav/digital_products/vfr/
After pre-processing (pct2rgb, gdalwarp individual charts, gdal_translate to cut the collars) I try to run them through gdalwarp to mosaic them using a command like this:
gdalwarp --config GDAL_CACHEMAX 3000 -overwrite -wm 3000 -r bilinear -srcnodata 0 -dstnodata 0 -wo "NUM_THREADS=3" /data/aeronav/sec/c/Albuquerque_c.tif .....70 other file names ...master.tif
After 12 hours of processing:
Creating output file that is 321521P x 125647L.
Processing input file /data/aeronav/sec/c/Albuquerque_c.tif.
0...10...20...30...40...
At this rate, gdalwarp is never going to finish.
In contrast, a gdal_merge command like this:
gdal_merge.py -n 0 -a_nodata 0 -o /data/aeronav/sec/master.tif /data/aeronav/sec/c/Albuquerque_c.tif ......70 plus files.....
finishes in a couple of hours.
The problem with gdal_merge is the inferior quality of the output because of "average" sampling. I would like to use "bilinear" at a minimum, and "cubic" sampling if possible, and for that gdalwarp is required.
Why is there such a big difference in the performance of the two? Why doesn't gdalwarp want to finish? Is there any other command line option to speed things up in gdalwarp, or is there a way to add a resampling option to gdal_merge?

It seems gdalwarp is not the ideal command for merging these GeoTIFFs (since I am not interested in warping again). Instead, I used
gdalbuildvrt /data/aeronav/sec/master.virt .... 70+ files in order
to build a virtual mosaic, and then used gdal_translate to convert the .virt file into a GeoTIFF:
gdal_translate -of GTiff /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif
That's it: this took less than an hour (even faster than gdal_merge, and it preserves the quality of the original files).
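For anyone following the same route, here is a sketch of that two-step approach with a file list and some common GeoTIFF creation options added (the glob, the list file name, and the -co settings are my own assumptions, not part of the original run, so check them against your GDAL version):

# build the virtual mosaic from a list file instead of 70+ arguments
# (list order matters: files later in the list win in overlapping areas)
ls /data/aeronav/sec/c/*_c.tif > /data/aeronav/sec/inputs.txt
gdalbuildvrt -srcnodata 0 -vrtnodata 0 -input_file_list /data/aeronav/sec/inputs.txt /data/aeronav/sec/master.virt

# materialize the VRT as a tiled, compressed GeoTIFF
gdal_translate -of GTiff -co TILED=YES -co COMPRESS=DEFLATE -co BIGTIFF=YES \
    /data/aeronav/sec/master.virt /data/aeronav/sec/master.tif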

Related

Removing columns from csv when cut and csvfilter both stop before finishing

I'm working with a large csv file (800,000 rows, 160 columns) and trying to remove select columns while keeping all rows. I've tried two different methods, the standard cut command and csvfilter, but neither of them returns all rows. In fact, they return different numbers of rows, with cut returning a dozen or so more than csvfilter, but both a little over 4,000.
I've looked at the original csv to try to see what might be making it choke, but I can't see anything: no quote marks in the row, no special characters.
Can anyone suggest a reliable method to remove columns from a csv or a way to more effectively troubleshoot csvfilter and/or cut? I'm mostly working on a Mac, but can work on Windows as well.
I recommend GoCSV's select command. It's already built for macOS/darwin, so go straight to the latest release and download the binary of your choosing.
I'm not sure why csvfilter would truncate your file. I'm especially skeptical that cut would eliminate any line, but I haven't tried 800K lines before.
Testing cut; comparing GoCSV
Here's a Python script to generate a CSV, large.csv, that is 800_000 rows by 160 columns:
with open('large.csv', 'w') as f:
    # Write header
    cols = ['Count']
    cols += [f'H{k+1}' for k in range(159)]
    f.write(','.join(cols) + '\n')
    # Write data
    for i in range(800_000):
        cols = [str(i+1)]
        cols += [f'C{k+1}' for k in range(159)]
        f.write(','.join(cols) + '\n')
Ensure large.csv has 800K lines:
wc -l large.csv
800001 large.csv
And with GoCSV's dims (dimensions) command:
gocsv dims large.csv
Dimensions:
Rows: 800000
Columns: 160
(GoCSV always counts the first row/line as the "header", which doesn't have any effect for cutting/selecting columns)
And now cutting the columns:
time cut -d ',' -f1,160 large.csv > cut.csv
cut -d, -f1,160 large.csv > cut.csv 8.10s user 0.38s system 99% cpu 8.483 total
time gocsv select -c 1,160 large.csv > gocsv_select.csv
gocsv select -c 1,160 large.csv > gocsv_select.csv 5.25s user 2.55s system 106% cpu 7.322 total
Compare the two methods:
cmp gocsv_select.csv cut.csv
and since they are the same, looking at the head and tail of one counts for both:
head -n2 cut.csv
Count,H159
1,C159
tail -n2 cut.csv
799999,C159
800000,C159
So both did what looks like the right thing; in particular, cut didn't filter/drop any lines/rows. And GoCSV actually did it faster.
I'm curious what your cut command looks like, but I think the bigger point to stress is to use a CSV-aware tool whenever you can (always).
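To make the "CSV-aware" point concrete, here is a tiny made-up example of the kind of row that trips up cut while a CSV parser handles it fine: a quoted field that contains the delimiter.

printf '%s\n' 'id,name,city' '1,"Smith, John",Boston' > quoted.csv

# cut splits on every comma, so the quoted field is torn in half:
# the second column of the data row comes out as just "Smith
cut -d ',' -f2 quoted.csv

# a CSV-aware tool keeps the quoted field intact ("Smith, John")
gocsv select -c 2 quoted.csv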

Multicore gzip decompression, splitting the output file (csv) into parts of 1Gb each

I have a 10Gb gzip archive (uncompressed it is about 60Gb).
Is there a way to decompress this file with multithreading, while splitting the output on the fly into parts of 1Gb each (or maybe n lines per part)?
If I do something like this:
pigz -dc 60GB.csv.gz | dd bs=8M skip=0 count=512 of=4G-part-1.csv
I can get a 4Gb file, but it doesn't take care to always start at the beginning of a line, so lines in my files won't be terminated properly.
Also, as I noticed, my GCE instance with a persistent disk has a maximum block size of 33kb, so I can't actually use a command like the one above, but have to write something like:
pigz -dc 60GB.csv.gz | dd bs=1024 skip=0 count=4194304 of=4G-part-1.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=4194304 count=4194304 of=4G-part-2.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=$((4194304*2)) count=4194304 of=4G-part-3.csv
So I have to find some trick so that each file always starts on a new line.
UPDATE:
zcat 60GB.csv.gz |awk 'NR%43000000==1{x="part-"++i".csv";}{print > x}'
did the trick.
Based on the sizes you mention in your question, it looks like you get about 6-to-1 compression. That doesn't seem great for text, but anyway...
As Mark states, you can't just dip mid-stream into your gz file and expect to land on a new line. Your dd options won't work as you want because dd only copies bytes; it doesn't split on newlines. If indexing is out of scope for this, the following command line solution might help:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%20000000){n++} {print|("gzip>part-"n".gz")}'
This decompresses your file so that we can count lines, then processes the stream, changing the output file name every 20000000 lines. You can adjust your recompression options where you see "gzip" in the code above.
If you don't want your output to be compressed, you can simplify the last part of the line:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} {print>("part-"n".csv")}'
You might have to play with the number of lines to get something close to the file size you're aiming for.
Note that if your shell is csh/tcsh, you may have to escape the exclamation point in the awk script to avoid it being interpreted as a history reference.
UPDATE:
If you'd like to get status of what the script is doing, awk can do that. Something like this might be interesting:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} !(NR%1000){printf("part=%d / line=%d\r",n,NR)} {print>("part-"n".csv")}'
This should show you the current part and line number every thousand lines.
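If your GNU coreutils is recent enough (an assumption; --line-bytes and --filter are not in every split), split can also do the line-boundary bookkeeping for you, so you don't have to guess a line count at all:

# roughly 1Gb uncompressed parts, never cutting in the middle of a line
pigz -dc 60GB.csv.gz | split --line-bytes=1G -d --additional-suffix=.csv - part-

# the same, but gzip each part as it is written
pigz -dc 60GB.csv.gz | split --line-bytes=1G -d --filter='gzip > $FILE.gz' - part-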
Unless the file was especially prepared for such an operation, or unless an index was built for that purpose, then no. The gzip format inherently requires the decompression of the data before any point in the stream in order to decompress data after that point in the stream, so it cannot be parallelized.
The way out is to either a) recompress the gzip file with synchronization points and save those locations, or b) go through the entire gzip file once and create another file of entry points with the previous context at those points.
For a), zlib provides the Z_FULL_FLUSH operation, which inserts synchronization points in the stream from which you can start decompression with no previous history. You would want to create such points sparingly, since they degrade compression.
For b), zran.c provides an example of how to build an index into a gzip file. You need to go through the stream once in serial order to build the index, but having done so, you can then start decompression at the locations you have saved.

How to annotate a large file with Mercurial?

I'm trying to find out who made the last change to a certain line in a large xml file (~100,000 lines).
I tried to use the log file viewer of thg but it does not give me any results.
Using hg annotate file.xml -l 95000 took forever and eventually died with an error message.
Is there a way to annotate a single line in a large file that does not take forever?
You can use hg grep to dig into a file if you have interest in very specific text:
hg grep "text-to-search" file.txt
You will likely need to add the --all switch to get every change that matches, and then limit your results to a specific changeset range with the -r firstchange:lastchange syntax.
I don't have a file on hand of the size you are working with, so this may also have trouble past a certain point, particularly if the search string matches many many lines in the file. But if you can get as specific as possible with your search string, you should be able to track it down.
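For example, something along these lines (the revision numbers are placeholders) keeps the search to a manageable range while still reporting every matching revision:

# search only changesets 90000 through tip for the string, in that one file
hg grep --all -r 90000:tip "text-to-search" file.xml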

Sorting file with 1.8 million records using script

I am trying to remove identical lines in a file with 1.8 million records and create a new file, using the following command:
sort tmp1.csv | uniq -c | sort -nr > tmp2.csv
Running the script creates a new file sort.exe.stackdump with the following information:
"Exception: STATUS_ACCESS_VIOLATION at rip=00180144805
..
..
program=C:\cygwin64\bin\sort.exe, pid 6136, thread main
cs=0033 ds=002B es=002B fs=0053 gs=002B ss=002B"
The script works for a small file with 10 lines. It seems like sort.exe cannot handle so many records. How do I work with such a large file of more than 1.8 million records? We do not have any database other than ACCESS, and I was trying to do this manually in ACCESS.
It sounds like your sort command is broken. Since the path says cygwin, I'm assuming this is GNU sort, which generally should have no problem with this task, given sufficient memory and disk space. Try playing with flags to adjust where and how much it uses the disk: http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
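As a sketch of what that can look like (the buffer size and temp directory below are placeholders to adapt, not values from the question), keep your original pipeline but give each sort more memory and an explicit temp location:

sort -S 1G -T /cygdrive/c/temp tmp1.csv | uniq -c | sort -S 1G -T /cygdrive/c/temp -nr > tmp2.csv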
The following awk command seemed to be a much faster way to get rid of the duplicate values (it prints each line only the first time it is seen):
awk '!v[$0]++' $FILE2 > tmp.csv
where $FILE2 is the file name with duplicate values.

DIFF utility works for 2 files. How to compare more than 2 files at a time?

So the Diff utility works just like I want for 2 files, but I have a project that requires comparisons of more than 2 files at a time, maybe up to 10 at a time. This requires having all those files side by side as well. My research has not really turned up anything; vimdiff seems to be the best so far with the ability to compare 4 at a time.
My question: Is there any utility to compare more than 2 files at a time, or a way to hack diff/vimdiff so it can do multiple comparisons? The files I will be comparing are relatively short so it should not be too slow.
Displaying 10 files side-by-side and highlighting differences can be easily done with Diffuse. Simply specify all files on the command line like this:
diffuse 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt 10.txt
Vim can already do this:
vim -d file1 file2 file3
But you're normally limited to 4 files. You can change that by modifying a single line in Vim's source, however. The constant DB_COUNT defines the maximum number of diffed files, and it's defined towards the top of diff.c in versions 6.x and earlier, or about two thirds of the way down structs.h in versions 7.0 and up.
diff has the built-in options --from-file and --to-file, which compare one operand to all others.
--from-file=FILE1
Compare FILE1 to all operands. FILE1 can be a directory.
--to-file=FILE2
Compare all operands to FILE2. FILE2 can be a directory.
Note: argument name --to-file is optional.
e.g.
# this will compare foo with bar, then foo with baz .html files
$ diff --from-file foo.html bar.html baz.html
# this will compare src/base-main.js with all .js files in the git repo
# that have 'main' in their filename or path
$ git ls-files ':/*main*.js' | xargs diff -u --from-file src/base-main.js
Checkout "Beyond Compare": http://www.scootersoftware.com/
It lets you compare entire directories of files, and it looks like it runs on Linux too.
If you're running multiple diffs based off one file, you could probably try writing a script that has a for loop to run through each directory and run the diff. Although it wouldn't be side by side, you could at least compare them quickly. Hope that helped.
Not answering the main question, but here's something similar to what Benjamin Neil has suggested but diffing all files:
Store the filenames in an array, then loop over the combinations of size two and diff (or do whatever you want).
files=($(ls -d /path/of/files/some-prefix.*))       # Array of files to compare
max=${#files[@]}                                     # Take the length of that array
for ((idxA=0; idxA<max; idxA++)); do                 # iterate idxA from 0 to length-1
    for ((idxB=idxA+1; idxB<max; idxB++)); do        # iterate idxB from idxA+1 to length-1
        echo "A: ${files[$idxA]}; B: ${files[$idxB]}"    # Do whatever you're here for.
    done
done
Derived from @charles-duffy's answer: https://stackoverflow.com/a/46719215/1160428
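If you want actual pairwise diffs rather than just a list of the pairs, the echo line is the place to swap in a diff call, for example (the output file naming here is arbitrary):

diff -u "${files[$idxA]}" "${files[$idxB]}" > "diff_${idxA}_${idxB}.txt"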
There is a simple and good way to do this: grep.
Depending on the size of the text, you can copy and paste it, or you can redirect the file's contents to the grep command. Do a grep -vir /path for a reverse (inverted) search, or a grep -ir /path for a regular one. This is my way for certification exams.