NCO: extracting data from netCDF with a hyperslab

I have a netCDF file with ten minute resolution data. I would like to extract the hourly data from this and write a new netCDF file that grabs the data at the top of each hour in the original ten minute file. I think that I would do this with the ncks -d hyperslab flag but I am not entirely sure if this is the best way.

Yes, the best way is to use
ncks -d time,min,max,stride in.nc out.nc
e.g.,
ncks -d time,0,,6 in.nc out.nc
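A minimal sketch of the stride arithmetic, assuming the record dimension is named time and the first record falls exactly on an hour (the ncks command is only echoed here, since running it needs NCO and a real file):

```shell
# 10-minute resolution means 60 / 10 = 6 samples per hour, so a stride
# of 6 selects every sixth record, i.e. the value at the top of each hour.
interval_min=10
stride=$(( 60 / interval_min ))

# The resulting ncks invocation (echoed, not executed):
echo "ncks -d time,0,,${stride} in.nc out.nc"
```

The same arithmetic gives the stride for any evenly spaced input: 5-minute data would need a stride of 12, 15-minute data a stride of 4.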


How to import a .csv data file into the Redis database

How do I import a .csv data-file into the Redis database? The .csv file has "id, time, latitude, longitude" columns in it. Can you suggest to me the best possible way to import the CSV file and be able to perform spatial queries?
This is a very broad question, because we don't know what data structure you want or which queries you will run. To solve it you need to:
1. Write down the expected queries and expected partitions. Is this file your complete dataset?
2. Write down your data structure. It will depend heavily on your answers from step 1.
3. Pick any (scripting) language you are comfortable with. Load the file, parse it with a CSV library, map each row to the data structure from step 2, and push it to Redis. You can do the last step with a client library or with redis-cli.
If, for example, you want to lay out your data in sorted sets, where the id is the zset's key, the timestamp is the score, and "lat,lon" is the payload, you can do this:
$ cat data.csv
id1,1528961481,45.0,45.0
id1,1528961482,45.1,45.1
id2,1528961483,50.0,50.0
id2,1528961484,50.1,50.0
cat data.csv | awk -F "," '{print $1" "$2" "$3" "$4}' | xargs -n4 sh -c 'redis-cli -p 6370 zadd $1 $2 "$3,$4"' sh
127.0.0.1:6370> zrange id2 0 -1
1) "50.0,50.0"
2) "50.1,50.0"
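If a Redis server isn't reachable yet, the same row-to-command mapping can be dry-run first: have awk emit the ZADD commands as plain text, which you can inspect and later pipe into redis-cli. The file name and sample rows below just mirror the example above:

```shell
# Sample data in the same id,timestamp,lat,lon layout as above
cat > data.csv <<'EOF'
id1,1528961481,45.0,45.0
id1,1528961482,45.1,45.1
id2,1528961483,50.0,50.0
id2,1528961484,50.1,50.0
EOF

# One ZADD per row: key = id, score = timestamp, member = "lat,lon"
awk -F, '{print "ZADD " $1 " " $2 " " $3 "," $4}' data.csv
```

Feeding the resulting lines to redis-cli produces the same sorted sets as the xargs pipeline above, but lets you eyeball the commands first.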

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (twice a day) I need to compare a list (10,000 entries) against all the csv files. If one of the entries is identical to the third or fourth column of any csv row, I need to write that whole row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line; do
        array+=("$line")
    done < "$1"
}

getArray "entries.log"

for e in "${array[@]}"; do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done
This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 of the roughly 10,000 entries.
MySQL
The next try was to import all csv files into a MySQL database. But there I ran into problems once the table reached around 50,000,000 entries.
So I wrote a script that created a new table after every 49,000,000 entries, and with that I was able to import all csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either: it slowed the import down from a few hours to days.
The select statement was horrible, but it worked. Much faster than the grep solution, but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an ssd. But I hope there are other ways.
Some improvements to your script; the first is mostly cosmetic, the second is where the real speedup is:
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). This alone should give you a real, meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
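A runnable miniature of that single-pass grep, with throwaway file names and contents invented for the demo:

```shell
mkdir -p csv
printf 'a,b,key1,d\n' >  csv/one.csv   # match in 3rd column
printf 'a,b,c,d\n'    >> csv/one.csv   # no match
printf 'a,b,c,key2\n' >  csv/two.csv   # match in 4th column

printf 'key1\nkey2\n' > entries.log

# One grep over all files: -F fixed strings, -w whole words, -f pattern file
grep -Fwf entries.log ./csv/*
```

Note that -w matches at word boundaries anywhere in the line, so a key appearing in the first or second column would also be reported; the awk answers in this thread restrict the match to fields 3 and 4.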
In awk, assuming all the csv files change each time; otherwise it would be wise to keep track of the already-checked files. But first, some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk ' # awk
NR==FNR { # process the list file
a[$1] # hash list entries to a
next # next list item
}
($3 in a) || ($4 in a) { # if 3rd or 4th field entry in hash
print >> "out" # append whole record to file "out"
}' list test/* # first list then the rest of the files
The script hashes all the list entries into a, then reads through the csv files looking for 3rd- and 4th-field entries in the hash, appending the whole record to "out" whenever there is a match.
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$" # find value in third column
printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$" # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches
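Unlike the -Fw approach, these anchored patterns only fire when the value occupies exactly the 3rd or 4th column. A small runnable check of that difference (sample data invented for the demo):

```shell
printf 'key1\n' > entries.log

# Same pattern construction as above
while read -r line; do
    printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$"   # value in 3rd column
    printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$"   # value in 4th column
done < entries.log > patterns.dat

cat > sample.csv <<'EOF'
key1,b,c,d
a,b,key1,d
a,b,c,key1
EOF

# Only the 3rd- and 4th-column hits survive; the 1st-column hit does not.
grep -hEf patterns.dat sample.csv
```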

Multicore gzip decompression, splitting the output file (csv) into parts of 1 GB each

I have a 10 GB gzip archive (about 60 GB uncompressed).
Is there a way to decompress this file with multiple threads, while splitting the output on the fly into parts of about 1 GB each (n lines per part, maybe)?
If I do something like this:
pigz -dc 60GB.csv.gz | dd bs=8M skip=0 count=512 of=4G-part-1.csv
I can get a 4 GB file, but dd takes no care to start each part at the beginning of a line, so lines at the part boundaries get cut in two.
Also, as I noticed, my GCE instance with a persistent disk has a maximum block size of 33 kB, so I can't actually use a command like the one above and have to write something like:
pigz -dc 60GB.csv.gz | dd bs=1024 skip=0 count=4194304 of=4G-part-1.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=4194304 count=4194304 of=4G-part-2.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=$((4194304*2)) count=4194304 of=4G-part-3.csv
So I need some trick to make each part always start at a new line.
UPDATE:
zcat 60GB.csv.gz | awk 'NR%43000000==1{x="part-"(++i)".csv"} {print > x}'
did the trick.
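That one-pass awk idiom can be sanity-checked at a small scale; here, 10 numbered lines are split every 4 lines (4 standing in for 43000000), with parentheses around ++i to keep the string concatenation unambiguous:

```shell
# Split a 10-line stream into 4-line parts, mirroring the NR%N==1 trick
seq 10 | awk 'NR % 4 == 1 { x = "part-" (++i) ".csv" } { print > x }'

# part-1.csv holds lines 1-4, part-2.csv lines 5-8, part-3.csv lines 9-10
wc -l part-*.csv
```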
Based on the sizes you mention in your question, it looks like you get about 6-to-1 compression. That doesn't seem great for text, but anyway...
As Mark states, you can't just dip into the middle of your gz file and expect to land on a new line. Your dd options won't work either, because dd copies a fixed number of bytes with no regard for line boundaries. If indexing is out of scope for this, the following command-line solution might help:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%20000000){n++} {print|("gzip>part-"n".gz")}'
This decompresses your file so that we can count lines, then processes the stream, changing the output file name every 20000000 lines. You can adjust your recompression options where you see "gzip" in the code above.
If you don't want your output to be compressed, you can simplify the last part of the line:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} {print>("part-"n".csv")}'
You might have to play with the number of lines to get something close to the file size you're aiming for.
Note that if your shell is csh/tcsh, you may have to escape the exclamation point in the awk script to avoid it being interpreted as a history reference.
UPDATE:
If you'd like to get status of what the script is doing, awk can do that. Something like this might be interesting:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} !(NR%1000){printf("part=%d / line=%d\r",n,NR)} {print>("part-"n".csv")}'
This should show you the current part and line number every thousand lines.
Unless it was especially prepared for such an operation, or unless an index was built for that purpose, then no. The gzip format inherently requires the decompression of the data before any point in the stream, in order to decompress data after that point in the stream. So it cannot be parallelized.
The way out is to either a) recompress the gzip file with synchronization points and save those locations, or b) go through the entire gzip file once and create another file of entry points with the previous context at those points.
For a), zlib provides Z_FULL_FLUSH operations that insert synchronization points in the stream from which you can start decompression with no previous history. You would want to create such points sparingly, since they degrade compression.
For b), zran.c provides an example of how to build an index into a gzip file. You need to go through the stream once in serial order to build the index, but having done so, you can then start decompression at the locations you have saved.

How to annotate a large file with Mercurial?

I'm trying to find out who made the last change to a certain line in a large xml file (~100,000 lines).
I tried to use the log file viewer of thg but it does not give me any results.
Using hg annotate file.xml -l 95000 took forever and eventually died with an error message.
Is there a way to annotate a single line in a large file that does not take forever?
You can use hg grep to dig into a file if you have interest in very specific text:
hg grep "text-to-search" file.txt
You will likely need to add the --all switch to get every change that matches, and then limit your results to a specific changeset range with the -r firstchange:lastchange syntax.
I don't have a file on hand of the size you are working with, so this may also have trouble past a certain point, particularly if the search string matches many many lines in the file. But if you can get as specific as possible with your search string, you should be able to track it down.

The diff utility works for 2 files. How can I compare more than 2 files at a time?

So the utility Diff works just like I want for 2 files, but I have a project that requires comparisons with more than 2 files at a time, maybe up to 10 at a time. This requires having all those files side by side to each other as well. My research has not really turned up anything, vimdiff seems to be the best so far with the ability to compare 4 at a time.
My question: Is there any utility to compare more than 2 files at a time, or a way to hack diff/vimdiff so it can do multiple comparisons? The files I will be comparing are relatively short so it should not be too slow.
Displaying 10 files side-by-side and highlighting differences can be easily done with Diffuse. Simply specify all files on the command line like this:
diffuse 1.txt 2.txt 3.txt 4.txt 5.txt 6.txt 7.txt 8.txt 9.txt 10.txt
Vim can already do this:
vim -d file1 file2 file3
But you're normally limited to 4 files. You can change that by modifying a single line in Vim's source, however. The constant DB_COUNT defines the maximum number of diffed files, and it's defined towards the top of diff.c in versions 6.x and earlier, or about two thirds of the way down structs.h in versions 7.0 and up.
diff has the built-in options --from-file and --to-file, which compare one operand to all the others.
--from-file=FILE1
Compare FILE1 to all operands. FILE1 can be a directory.
--to-file=FILE2
Compare all operands to FILE2. FILE2 can be a directory.
Note: you don't need to name both; --from-file alone compares FILE1 against all the remaining operands.
e.g.
# this will compare foo with bar, then foo with baz .html files
$ diff --from-file foo.html bar.html baz.html
# this will compare src/base-main.js with all .js files in git repo,
# that has 'main' in their filename or path
$ git ls-files :/*main*.js | xargs diff -u --from-file src/base-main.js
Check out "Beyond Compare": http://www.scootersoftware.com/
It lets you compare entire directories of files, and it looks like it runs on Linux too.
If you're running multiple diffs based off one file, you could try writing a script with a for loop that walks each directory and runs the diff. Although it wouldn't be side by side, you could at least compare the files quickly. Hope that helps.
Not answering the main question, but here's something similar to what Benjamin Neil has suggested but diffing all files:
Store the filenames in an array, then loop over the combinations of size two and diff (or do whatever you want).
files=($(ls -d /path/of/files/some-prefix.*))  # Array of files to compare
max=${#files[@]}                               # Length of that array
for ((idxA=0; idxA<max; idxA++)); do           # iterate idxA from 0 to length
    for ((idxB=idxA+1; idxB<max; idxB++)); do  # iterate idxB from idxA+1 to length
        echo "A: ${files[$idxA]}; B: ${files[$idxB]}"  # Do whatever you're here for.
    done
done
Derived from #charles-duffy's answer: https://stackoverflow.com/a/46719215/1160428
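A runnable miniature of that pairwise loop, using three throwaway files; diff -q only reports whether each pair differs:

```shell
mkdir -p demo
printf 'a\nb\n' > demo/f1
printf 'a\nb\n' > demo/f2   # identical to f1
printf 'a\nc\n' > demo/f3   # differs from both

files=(demo/f1 demo/f2 demo/f3)
max=${#files[@]}
for ((idxA=0; idxA<max; idxA++)); do
    for ((idxB=idxA+1; idxB<max; idxB++)); do
        # -q: exit status says whether the pair differs, no diff output
        if diff -q "${files[$idxA]}" "${files[$idxB]}" >/dev/null; then
            echo "same:   ${files[$idxA]} ${files[$idxB]}"
        else
            echo "differ: ${files[$idxA]} ${files[$idxB]}"
        fi
    done
done
```

With n files this runs n*(n-1)/2 comparisons, so it stays cheap for the handful of short files the question describes.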
There is a simple and good way to do this: grep.
Depending on the size of the text, you can copy and paste pieces into it, or redirect the file's contents to grep. Use grep -ir pattern /path for a recursive, case-insensitive search, or grep -vir pattern /path to invert the match. This is my way for certification exams.