arangoimp of graph from CSV file - csv

I have a network scan in a TSV file that contains data in a form like the following sample
source IP target IP source port target port
192.168.84.3 192.189.42.52 5868 1214
192.168.42.52 192.189.42.19 1214 5968
192.168.4.3 192.189.42.52 60680 22
....
192.189.42.52 192.168.4.3 22 61969
Is there an easy way to import this using arangoimp into the (pre-created) edge collection networkdata?

You could use the TSV importer directly if it didn't fail when converting the IPs (fixed in ArangoDB 3.0), so you need a bit more conversion logic to get valid CSV. You can use the edge attribute conversion options to turn the first two columns into valid _from and _to attributes during the import.
You shouldn't specify column headers with blanks in them, and the separators should really be tabs or a constant number of columns. We need to specify a _from and a _to field in the header line.
To make it work, pipe the above through sed to get valid CSV and proper column names, like this:
cat /tmp/test.tsv | \
sed -e "s;source IP;_from;g;" \
-e "s;target IP;_to;" \
-e "s; port;Port;g" \
-e 's;[[:space:]][[:space:]]*;",";g' \
-e 's;^;";' \
-e 's;$;";' | \
arangoimp --file - \
--type csv \
--from-collection-prefix sourceHosts \
--to-collection-prefix targetHosts \
--collection "ipEdges" \
--create-collection true \
--create-collection-type edge
Sed with these regular expressions will create an intermediate representation that looks like this:
"_from","_to","sourcePort","targetPort"
"192.168.84.3","192.189.42.52","5868","1214"
The generated edges will look like this:
{
"_key" : "21056",
"_id" : "ipEdges/21056",
"_from" : "sourceHosts/192.168.84.3",
"_to" : "targetHosts/192.189.42.52",
"_rev" : "21056",
"sourcePort" : "5868",
"targetPort" : "1214"
}
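To check the sed stage on its own before anything reaches arangoimp, it can be run against a small sample. This is only a minimal sketch: the function name tsv_to_csv is made up for illustration, the two sample lines come from the question, and the whitespace pattern is written as [[:space:]][[:space:]]* on the assumption that columns are separated by tabs or runs of blanks.

```shell
# Hypothetical helper: the same sed expressions as above, without arangoimp.
tsv_to_csv() {
  sed -e "s;source IP;_from;g" \
      -e "s;target IP;_to;" \
      -e "s; port;Port;g" \
      -e 's;[[:space:]][[:space:]]*;",";g' \
      -e 's;^;";' \
      -e 's;$;";'
}

# Two tab-separated lines taken from the sample in the question.
printf 'source IP\ttarget IP\tsource port\ttarget port\n192.168.84.3\t192.189.42.52\t5868\t1214\n' \
  | tsv_to_csv
```

If the printed lines match the intermediate representation shown above, the same pipeline can be pointed at arangoimp.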

Can you separate distinct JSON attributes into two files using jq?

I am following this tutorial from Vault about creating your own certificate authority. I'd like to separate the response (change the output to API call using cURL to see the response) into two distinct files, one file possessing the certificate and issuing_ca attributes, the other file containing the private_key. The tutorial is using jq to parse JSON objects, but my unfamiliarity with jq isn't helpful here, and most searches are returning info on how to merge JSON using jq.
I've tried running something like
vault write -format=json pki_int/issue/example-dot-com \
common_name="test.example.com" \
ttl="24h" \
format=pem \
jq -r '.data.certificate, .data.issuing_ca > test.cert.pem \
jq -r '.data.private_key' > test.key.pem
or
vault write -format=json pki_int/issue/example-dot-com \
common_name="test.example.com" \
ttl="24h" \
format=pem \
| jq -r '.data.certificate, .data.issuing_ca > test.cert.pem \
| jq -r '.data.private_key' > test.key.pem
but no dice.
The issue is not with the jq invocation but with the way the output files get written. In your pipeline, once the first jq has consumed the JSON and written test.cert.pem, nothing is left on the read end of the pipe for the second jq to extract the private_key from.
To duplicate the content at the write end of the pipe, use tee along with process substitution. The following should work in bash, zsh, or ksh93, but not in a POSIX Bourne shell (sh):
vault write -format=json pki_int/issue/example-dot-com \
common_name="test.example.com" \
ttl="24h" \
format=pem \
| tee >( jq -r '.data.certificate, .data.issuing_ca' > test.cert.pem) \
>(jq -r '.data.private_key' > test.key.pem) \
>/dev/null
See this in action
jq -n '{data:{certificate: "foo", issuing_ca: "bar", private_key: "zoo"}}' \
| tee >( jq -r '.data.certificate, .data.issuing_ca' > test.cert.pem) \
>(jq -r '.data.private_key' > test.key.pem) \
>/dev/null
and now observe the contents of both the files.
You could abuse jq's ability to write to standard error (version 1.6 or later) separately from standard output.
vault write -format=json pki_int/issue/example-dot-com \
common_name="test.example.com" \
ttl="24h" \
format=pem \
| jq -r '.data as $f | ($f.private_key | stderr) | ($f.certificate, $f.issuing_ca)' > test.cert.pem 2> test.key.pem
There's a general technique for this type of problem that is worth mentioning because it has minimal prerequisites (just jq and awk), and because it scales well with the number of files. Furthermore, it is quite efficient in that only one invocation each of jq and awk is needed. The idea is to set up a pipeline of the form: jq ... | awk ...
There are many variants of the technique, but in the present case the following would suffice:
jq -rc '
.data
| "test.cert.pem",
"\t\(.certificate)",
"\t\(.issuing_ca)",
"test.key.pem",
"\t\(.private_key)"
' | awk -F\\t 'NF == 1 {fn=$1; next} {print $2 > fn}'
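Notice that, as written, this relies on each item of interest being a single line without embedded tabs: awk prints only the second tab-separated field, and any line that contains no tab is treated as a file name.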
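The awk half of the pipeline can be tried in isolation, which makes the dispatch rule easier to see. The following is a minimal sketch with made-up file names (a.txt, b.txt) and payloads, fed from printf instead of jq; the helper name split_into_files is hypothetical.

```shell
# Hypothetical wrapper around the awk program used above.
# A line with no tab names the current output file;
# a line starting with a tab is payload for that file.
split_into_files() {
  awk -F'\t' 'NF == 1 {fn=$1; next} {print $2 > fn}'
}

# Work in a scratch directory so the demo files do not clutter anything.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

printf 'a.txt\n\tfoo\n\tbar\nb.txt\n\tzoo\n' | split_into_files

cat a.txt   # the two payload lines routed to a.txt
cat b.txt   # the single payload line routed to b.txt
```

Adding another output file only means emitting one more file-name line followed by its tab-prefixed payload, which is why the approach scales with the number of files.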

JMESPath filter with >1 match ANDING

I saw the ORING post; this should cover ANDING; I struggled with this one.
Given this while loop:
while read -r resourceID resourceName; do
pMsg "Processing: $resourceID with $resourceName"
aws emr describe-cluster --cluster-id="$resourceID" --output table > "${resourceName}.md"
done <<< "$(aws emr list-clusters --active --query='Clusters[].Id' \
--output text | sortExpression)"
I need to feed my loop with the ID AND Name of the clusters. One is easy; two is eluding me. Any help is appreciated.
If your goal is to end up with output looking like this from list-clusters:
1 ABCD
2 EFGH
In order to feed it to describe-cluster, then you should create a multiselect list.
Something like:
Clusters[].[Id, Name]
This is actually described in the user guide about text output format, where they show that:
'Reservations[*].Instances[*].[Placement.AvailabilityZone, State.Name, InstanceId]' --output text
Gives
us-west-2a running i-4b41a37c
us-west-2a stopped i-a071c394
us-west-2b stopped i-97a217a0
us-west-2a running i-3045b007
us-west-2a running i-6fc67758
Source: https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-output-format.html#text-output
So you should end up with
while read -r resourceID resourceName; do
pMsg "Processing: $resourceID with $resourceName"
aws emr describe-cluster \
--cluster-id="$resourceID" \
--output table > "${resourceName}.md"
done <<< "$(aws emr list-clusters \
--active \
--query='Clusters[].[Id, Name]' \
--output text | sortExpression \
)"
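How read splits each line of text output can be checked without calling AWS at all. The sketch below uses a hypothetical fake_list_clusters function in place of aws emr list-clusters; the cluster IDs and names in it are invented for illustration.

```shell
# Stand-in for `aws emr list-clusters ... --output text`: one cluster per
# line, ID and name separated by a tab. The IDs and names are made up.
fake_list_clusters() {
  printf 'j-1ABC\tnightly etl\nj-2DEF\tadhoc\n'
}

# read assigns the first field to resourceID and the rest of the line,
# internal spaces included, to resourceName.
fake_list_clusters | while read -r resourceID resourceName; do
  printf '%s -> %s.md\n' "$resourceID" "$resourceName"
done
```

Because the last variable receives the remainder of the line, a cluster name containing spaces stays in one piece, which is also why the redirect target should be quoted.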

How to insert variable in fswatch regex?

I'm trying to use a variable to identify mxf or mov file extensions. The following works where I explicitly name the file extensions with a regular expression.
${FSWATCH_PATH} -0 \
-e ".*" --include ".*\.[ mxf|mov ]" \
--event Updated --event Renamed --event MovedTo -l $LATENCY \
$LOCAL_WATCHFOLDER_PATH \
| while read -d "" event
do
<code here>
done
How can I use a variable for the file extensions, where the variable name is FileTriggerExtensions? The code below doesn't work:
FileTriggerExtensions=mov|mxf
${FSWATCH_PATH} -0 \
-e ".*" --include ".*\.[ $FileTriggerExtensions ]" \
--event Updated --event Renamed --event MovedTo -l $LATENCY \
$LOCAL_WATCHFOLDER_PATH \
| while read -d "" event
do
done
I guess you use Bash or a similar shell?
FileTriggerExtensions=mov|mxf
-bash: mxf: command not found
Use quotes or escape the pipe symbol.
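As a rough illustration of the quoting fix, the sketch below assigns the variable with quotes and checks the resulting pattern with grep -E rather than fswatch. The pattern itself is an assumption, not the asker's: it uses a group with alternation, because a bracket expression such as [ mxf|mov ] matches a single character from that set rather than a whole extension. Whether fswatch accepts this form depends on it being run with extended regular expressions enabled, so check its manual. The file names are made up.

```shell
# Quoted, so the shell stores the pipe character instead of starting a pipeline.
FileTriggerExtensions='mov|mxf'

# Hypothetical pattern: a group with alternation, anchored at the end.
include_pattern=".*\\.(${FileTriggerExtensions})\$"
printf '%s\n' "$include_pattern"

# Check a few made-up file names against the pattern with grep -E.
for name in clip.mov clip.mxf notes.txt; do
  if printf '%s\n' "$name" | grep -Eq "$include_pattern"; then
    printf '%s: match\n' "$name"
  else
    printf '%s: no match\n' "$name"
  fi
done
```

The same quoted variable can then be expanded inside the double-quoted --include argument.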

Import data from Mongodb on GCE to Bigquery

My task is to import data from a mongodb collection hosted on GCE to Bigquery. I tried the following. Since bigquery does not accept '$' symbol in field names, I ran the following to remove the $oid field,
mongo test --quiet \
--eval "db.trial.find({}, {_id: 0})
.forEach(function(doc) {
print(JSON.stringify(doc)); });" \
> trial_noid.json
But, while importing the result file, I get an error that says
parse error: premature EOF (error code: invalid)
Is there a way to avoid these steps and directly transfer the data to bigquery from mongodb hosted on GCE?
In my opinion, the best practice is building your own extractor. That can be done with the language of your choice and you can extract to CSV or JSON.
But if you are looking for a fast way, and your data is not huge and can fit on one server, then I recommend using mongoexport to extract to JSON. Let's assume you have a simple document structure such as the one below:
{
"_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
"statuses" : [
{
"status" : "dc9e5511-466c-4146-888a-574918cc2534",
"score" : 53.24388894
}
],
"stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
{ "name":"_id", "type": "STRING" },
{ "name":"stored_at", "type": "record", "fields": [
{ "name":"date", "type": "STRING" }
]},
{ "name":"statuses", "type": "record", "mode": "repeated", "fields": [
{ "name":"status", "type": "STRING" },
{ "name":"score", "type": "FLOAT" }
]}
]
EOF
Now, the fun part starts :-) Extracting data as JSON from your MongoDB. Let's assume you have a cluster with replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records because I recommend running a sample and loading it into BigQuery before doing the same for all your data. After running the above command you should have your sample data in sample.json, but it contains a $date field that will cause an error when loading into BigQuery. To fix that, we can use sed to replace it with a plain field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
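The effect of that substitution can be previewed on a single record before touching the exported file. This is a small sketch; the sample record is made up, shaped like the stored_at field from the document above.

```shell
# A made-up one-line record in the shape mongoexport writes for a date field.
sample='{"_id":"abc","stored_at":{"$date":"2017-04-12T07:04:23.545Z"}}'

# Same substitution as above, reading from stdin instead of editing in place.
printf '%s\n' "$sample" | sed 's/"\$date"/"date"/g'
```

The result has a plain date key nested under stored_at, which is the shape the schema defined earlier expects.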
Now you can compress, upload to Google Cloud Storage (GCS) and then load to BigQuery using following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything went well, go back, remove --limit 100000 from the mongoexport command, and re-run the commands above to load everything instead of the 100k sample.
With this solution you can import your data into BigQuery with the same hierarchy, but if you want to flatten your data, the alternative solution below would work better.
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extract logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so that it loads into BigQuery much more easily:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
var recToSkip = x * size;
db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
var row = record._id + "," + record.stored_at.toISOString();
record.statuses.forEach(function (l) {
print(row + "," + l.status + "," + l.score)
});
});
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Move all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I used a small trick: piping the output through split breaks it into multiple files with the sample_ prefix. During the split, each file is also gzipped, so it is easier to upload to GCS.
When using NEWLINE_DELIMITED_JSON to import data into BigQuery, one JSON object, including any nested/repeated fields, must appear on each line.
The issue with your input file appears to be that the JSON object is split across many lines; collapsing it to a single line will resolve this error.
Requiring this format allows BigQuery to split the file and process it in parallel without being concerned that splitting the file will put one part of a JSON object in one split, and another part in the next split.

Get JSON value from URL in BASH

I need to get the "snapshot" value at the top of the file from this url: https://s3.amazonaws.com/Minecraft.Download/versions/versions.json
So I should get a variable that contains "14w08a" when I run the command to parse the json.
This will do the trick
$ curl -s "$url" | grep -Pom 1 '"snapshot": "\K[^"]*'
14w08a
The best thing to do is use a tool with a JSON parser. For example:
value=$(
curl -s "$url" |
ruby -rjson -e 'data = JSON.parse(STDIN.read); puts data["latest"]["snapshot"]'
)
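If jq is installed, the same lookup can be sketched with it instead of Ruby. The sample document below is a hypothetical stand-in for the downloaded file; only the latest.snapshot path used in the Ruby example is assumed.

```shell
# Hypothetical stand-in for the downloaded document; only the path used by
# the Ruby example above (latest.snapshot) is assumed.
sample='{"latest":{"snapshot":"14w08a"}}'

# -r prints the string without surrounding quotes.
value=$(printf '%s\n' "$sample" | jq -r '.latest.snapshot')
printf '%s\n' "$value"
```

Against the real URL, the printf would be replaced by curl -s "$url", as in the examples above.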