How to extract certain data using Perl from a file? - json

I have data that needs to be extracted from a file, the lines I need for the moment are name,location and host. This is example of the extract. How would I go about getting these lines into a separate file? I have the Original file and the new file i want to create as the input/output file, there are thousands of devices contained within the output file and they are all the same formatting as in my example.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
#names of files to be input output
my $inputfile = "/home/nmis/nmis_export.csv";
my $outputfile = "/home/nmis/nmis_data.csv";
open(INPUT,'<',$inputfile) or die $!;
open(OUTPUT, '>',$outputfile) or die $!;
my #data = <INPUT>;
close INPUT;
my $line="";
foreach $line (#data)
{
======Sample Extract=======
**"group" : "NMIS8",
"host" : "1.2.3.4",
"location" : "WATERLOO",
"max_msg_size" : 1472,
"max_repetitions" : 0,
"model" : "automatic",
"netType" : "lan",
"ping" : 1,
"polling_policy" : "default",
"port" : 161,
"rancid" : 0,
"roleType" : "access",
"serviceStatus" : "Production",
"services" : null,
"threshold" : 1,
"timezone" : 0,
"version" : "snmpv2c",
"webserver" : 0
},
"lastupdate" : 1616690858,
"name" : "test",
"overrides" : {}
},
{
"activated" : {
"NMIS" : 1
},
"addresses" : [],
"aliases" : [],
"configuration" : {
"Stratum" : 3,
"active" : 1,
"businessService" : "",
"calls" : 0,
"cbqos" : "none",
"collect" : 0,
"community" : "public",
"depend" : [
"N/A"
],
"group" : "NMIS8",
"host" : "1.2.3.5",
"location" : "WATERLOO",
"max_msg_size" : 1472,
"max_repetitions" : 0,
"model" : "automatic",
"netType" : "lan",
"ping" : 1,
"polling_policy" : "default",
"port" : 161,
"rancid" : 0,
"roleType" : "access",
"serviceStatus" : "Production",
"services" : null,
"threshold" : 1,
"timezone" : 0,
"version" : "snmpv2c",
"webserver" : 0
},
"lastupdate" : 1616690858,
"name" : "test2",
"overrides" : {}
},**

I would use jq for this not Perl. You just need to query a JSON document. That's what jq is for. You can see an example here
The jq query I created is this one,
.[] | {name: .name, group: .configuration.group, location: .configuration.location}
This breaks down into
.[] # iterate over the array
| # create a filter to send it to
{ # that produces an object with the bellow key/values
.name,
group: .configuration.group,
location: .configuration.location
}
It provides an output like this,
{
"name": "test2",
"group": "NMIS8",
"location": "WATERLOO"
}
{
"name": "test2",
"group": "NMIS8",
"location": "WATERLOO"
}
You can use this to generate a csv
jq -R '.[] | [.name, .configuration.group, .configuration.location] | #csv' ./file.json
Or this to generate a csv with a header,
jq -R '["name","group","location"], (.[] | [.name, .configuration.group, .configuration.location]) | #csv' ./file.json

You can use the JSON distribution for this. Read the entire file in one fell swoop to put the entire JSON string into a scalar (as opposed to putting it into an array and iterating over it), then simply decode the string into a Perl data structure:
use warnings;
use strict;
use JSON;
my $file = 'file.json';
my $json_string;
{
local $/; # Locally reset line endings to nothing
open my $fh, '<', $file or die "Can't open file $file!: $!";
$json_string = <$fh>; # Slurp in the entire file
}
my $perl_data_structure = decode_json $json_string;

As what you have there is JSON, you should parse it with a JSON parser. JSON::PP is part of the standard Perl distribution. If you want something faster, you could install something else from CPAN.
Update: I included a link to JSON::PP in my answer. Did you follow that link? If you did, you would have seen the documentation for the module. That has more information about how to use the module than I could include in an answer on SO.
But it's possible that you need a little more high-level information. The documentation says this:
JSON::PP is a pure perl JSON decoder/encoder
But perhaps you don't know what that means. So here's a primer.
JSON is a text format for storing complex data structures. The format was initially used in Javascript (the acronym stands for "JavaScript Object Notation") but it is now a standard that is used across pretty much all programming languages.
You rarely want to actually deal with JSON in a program. A JSON document is just text and manipulating that would require some complex regular expressions. When dealing with JSON, the usual approach is to "decode" the JSON into a data structure inside your program. You can then manipulate the data structure however you want before (optionally) "encoding" the data structure back into JSON so you can write it to an output file (in your case, you don't need to do that as you want your output as CSV).
So there are pretty much only two things that a Perl JSON library needs to do:
Take some JSON text and decode it into a Perl data structure
Take a Perl data structure and encode it into JSON text
If you look at the JSON::PP documentation you'll see that it contains two functions, encode_json() and decode_json() which do what I describe above. There's also an OO interface, but let's not overcomplicate things too quickly.
So your program now needs to have the following steps:
Read the JSON from the input file
Decode the JSON into a Perl data structure
Walk the Perl data structure to extract the items that you need
Write the required items into your output file (for which Text::CSV will be useful
Having said all that, it really does seem to me that the jq solution suggested by user157251 is a much better idea.

Related

How to extract data from JSON converted from ruby script?

How to convert a ruby file to json?
I use the above approach (rb2json0.rb is in the above link) to convert a ruby script to JSON. But the JSON is not well formatted as it only has arrays but not dictionaries, making it difficult to work with the JSON output.
I specifically want to extract fields in update_info, e.g.,
Name
Description
License
References
Author
and to extract the fields in register_options, e.g.,
LHOST
SOURCE
FILENAME
DOCAUTHOR
Note that the extraction should not assume the field names are fixed to these specific ones, as other field names can be used in other similar files.
The output should be a two-column TSV, with the field name as the first column and the field value as the second column. For example,
Name<TAB>Microsoft Word UNC Path Injector
...
Could anybody let me know the best jq way to achieve this? Thanks.
word_unc_injector.rb is at
https://github.com/rapid7/metasploit-framework/blob/master/modules/auxiliary/docx/word_unc_injector.rb
$ rb2json0.rb < word_unc_injector.rb | jq . # too long to include all output.
[
"program",
[
[
"command",
[
"#ident",
"require",
[
...
EDIT. The full solution of this problem may be complicated. But the first step might be to extract the part corresponding to update_info. Here is the relevant JSON fragment.
...
[
"method_add_arg",
[
"fcall",
[
"#ident",
"update_info",
[
24,
10
]
]
],
[
"arg_paren",
[
"args_add_block",
[
[
"var_ref",
[
"#ident",
"info",
[
24,
22
]
]
],
[
"bare_assoc_hash",
[
[
"assoc_new",
[
"string_literal",
[
"string_content",
[
"#tstring_content",
"Name",
...
The following illustrates how to convert the array-oriented JSON produced by rb2json0.rb to a more object-oriented JSON, in a way that you can query for update_info#ident in a straightforward way.
def objectify:
if type == "array"
then if length>=2 and (.[0]|type) == "string" and (.[1]|type) == "array"
then {(.[0]): ( .[1:] | map(objectify)) | objectify }
elif length>=3 and .[0][0:1] == "#" and (.[1]|type) == "string"
then { (.[1] + .[0]): ( .[2:] | map(objectify)) | objectify }
else map(objectify)
end
else .
end;
Illustrative query
Given the snippet shown, the following produces the output shown below:
objectify | .. | objects | .["update_info#ident"]? // empty
Output
[[24,10]]

How to find something in a json file using Bash

I would like to search a JSON file for some key or value, and have it print where it was found.
For example, when using jq to print out my Firefox' extensions.json, I get something like this (using "..." here to skip long parts) :
{
"schemaVersion": 31,
"addons": [
{
"id": "wetransfer#extensions.thunderbird.net",
"syncGUID": "{e6369308-1efc-40fd-aa5f-38da7b20df9b}",
"version": "2.0.0",
...
},
{
...
}
]
}
Say I would like to search for "wetransfer#extensions.thunderbird.net", and would like an output which shows me where it was found with something like this:
{ "addons": [ {"id": "wetransfer#extensions.thunderbird.net"} ] }
Is there a way to get that with jq or with some other json tool?
I also tried to simply list the various ids in that file, and hoped that I would get it with jq '.id', but that just returned null, because it apparently needs the full path.
In other words, I'm looking for a command-line json parser which I could use in a way similar to Xpath tools
The path() function comes in handy:
$ jq -c 'path(.. | select(. == "wetransfer#extensions.thunderbird.net"))' input.json
["addons",0,"id"]
The resulting path is interpreted as "In the addons field of the initial object, the first array element's id field matches". You can use it with getpath(), setpath(), delpaths(), etc. to get or manipulate the value it describes.
Using your example with modifications to make it valid JSON:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
paths as $p | select(getpath($p) == $s) | null | setpath($p;$s)'
produces:
{"addons":[{"id":"wetransfer#extensions.thunderbird.net"}]}
Note
If there are N paths to the given value, the above will produce N lines. If you want only the first, you could wrap everything in first(...).
Listing all the "id" values
I also tried to simply list the various ids in that file
Assuming that "id" values of false and null are of no interest, you can print all the "id" values of interest using the jq filter:
.. | .id? // empty

jq construct with value strings spanning multiple lines

I am trying to form a JSON construct using jq that should ideally look like below:-
{
"api_key": "XXXXXXXXXX-7AC9-D655F83B4825",
"app_guid": "XXXXXXXXXXXXXX",
"time_start": 1508677200,
"time_end": 1508763600,
"traffic": [
"event"
],
"traffic_including": [
"unattributed_traffic"
],
"time_zone": "Australia/NSW",
"delivery_format": "csv",
"columns_order": [
"attribution_attribution_action",
"attribution_campaign",
"attribution_campaign_id",
"attribution_creative",
"attribution_date_adjusted",
"attribution_date_utc",
"attribution_matched_by",
"attribution_matched_to",
"attribution_network",
"attribution_network_id",
"attribution_seconds_since",
"attribution_site_id",
"attribution_site_id",
"attribution_tier",
"attribution_timestamp",
"attribution_timestamp_adjusted",
"attribution_tracker",
"attribution_tracker_id",
"attribution_tracker_name",
"count",
"custom_dimensions",
"device_id_adid",
"device_id_android_id",
"device_id_custom",
"device_id_idfa",
"device_id_idfv",
"device_id_kochava",
"device_os",
"device_type",
"device_version",
"dimension_count",
"dimension_data",
"dimension_sum",
"event_name",
"event_time_registered",
"geo_city",
"geo_country",
"geo_lat",
"geo_lon",
"geo_region",
"identity_link",
"install_date_adjusted",
"install_date_utc",
"install_device_version",
"install_devices_adid",
"install_devices_android_id",
"install_devices_custom",
"install_devices_email_0",
"install_devices_email_1",
"install_devices_idfa",
"install_devices_ids",
"install_devices_ip",
"install_devices_waid",
"install_matched_by",
"install_matched_on",
"install_receipt_status",
"install_san_original",
"install_status",
"request_ip",
"request_ua",
"timestamp_adjusted",
"timestamp_utc"
]
}
What I have tried unsuccessfully thus far is below:-
json_construct=$(cat <<EOF
{
"api_key": "6AEC90B5-4169-59AF-7AC9-D655F83B4825",
"app_guid": "komacca-s-rewards-app-au-ios-production-cv8tx71",
"time_start": 1508677200,
"time_end": 1508763600,
"traffic": ["event"],
"traffic_including": ["unattributed_traffic"],
"time_zone": "Australia/NSW",
"delivery_format": "csv"
"columns_order": ["attribution_attribution_action","attribution_campaign","attribution_campaign_id","attribution_creative","attribution_date_adjusted","attribution_date_utc","attribution_matched_by","attribution_matched_to","attributio
network","attribution_network_id","attribution_seconds_since","attribution_site_id","attribution_tier","attribution_timestamp","attribution_timestamp_adjusted","attribution_tracker","attribution_tracker_id","attribution_tracker_name","
unt","custom_dimensions","device_id_adid","device_id_android_id","device_id_custom","device_id_idfa","device_id_idfv","device_id_kochava","device_os","device_type","device_version","dimension_count","dimension_data","dimension_sum","ev
t_name","event_time_registered","geo_city","geo_country","geo_lat","geo_lon","geo_region","identity_link","install_date_adjusted","install_date_utc","install_device_version","install_devices_adid","install_devices_android_id","install_
vices_custom","install_devices_email_0","install_devices_email_1","install_devices_idfa","install_devices_ids","install_devices_ip","install_devices_waid","install_matched_by","install_matched_on","install_receipt_status","install_san_
iginal","install_status","request_ip","request_ua","timestamp_adjusted","timestamp_utc"]
}
EOF)
followed by:-
echo "$json_construct" | jq '.'
I get the following error:-
parse error: Expected separator between values at line 10, column 15
I am guessing it is because of the string literal which spans to multiple lines that jq is unable to parse it.
Use jq itself:
my_formatted_json=$(jq -n '{
"api_key": "XXXXXXXXXX-7AC9-D655F83B4825",
"app_guid": "XXXXXXXXXXXXXX",
"time_start": 1508677200,
"time_end": 1508763600,
"traffic": ["event"],
"traffic_including": ["unattributed_traffic"],
"time_zone": "Australia/NSW",
"delivery_format": "csv",
"columns_order": [
"attribution_attribution_action",
"attribution_campaign",
...,
"timestamp_utc"
]
}')
Your input "JSON" is not valid JSON, as indicated by the error message.
The first error is that a comma is missing after the key/value pair: "delivery_format": "csv", but there are others -- notably, JSON strings cannot be split across lines. Once you fix the key/value pair problem and the JSON strings that are split incorrectly, jq . will work with your text. (Note that once your input is corrected, the longest JSON string is quite short -- 50 characters or so -- whereas jq has no problems processing strings of length 10^8 quite speedily ...)
Generally, jq is rather permissive when it comes to JSON-like input, but if you're ever in doubt, it would make sense to use a validator such as the online validator at jsonlint.com
By the way, the jq FAQ does suggest various ways for handling input that isn't strictly JSON -- see https://github.com/stedolan/jq/wiki/FAQ#processing-not-quite-valid-json
Along the lines of chepner's suggestion since jq can read raw text data you could just use a jq filter to generate a legal json object from your script variables. For example:
#!/bin/bash
# whatever logic you have to obtain bash variables goes here
key=XXXXXXXXXX-7AC9-D655F83B4825
guid=XXXXXXXXXXXXXX
# now use jq filter to read raw text and construct legal json object
json_construct=$(jq -MRn '[inputs]|map(split(" ")|{(.[0]):.[1]})|add' <<EOF
api_key $key
app_guid $guid
EOF)
echo $json_construct
Sample Run (assumes executable script is in script.sh)
$ ./script.sh
{ "api_key": "XXXXXXXXXX-7AC9-D655F83B4825", "app_guid": "XXXXXXXXXXXXXX" }
Try it online!

How to extract the value of "name" from this puppet metadata.json without jq in shell?

For some extreme reason, I can't use jq or other cli tool. I need to extract the value of "name" from any json matching this puppet metadata.json. format.
the json might not be properly formatted and indented but will be valid. Meaning, white spaces, and line breaks, carriage backs might be inserted in eligible places.
Note that there could be "name" elements in dependencies array.
So, how to extract the value only using standard unix commands and/or shell script without installing any application like jq or other tools?
Thank you!!
{
"name": "examplecorp-mymodule",
"version": "0.0.1",
"author": "Pat",
"license": "Apache-2.0",
"summary": "A module for a thing",
"source": "https://github.com/examplecorp/examplecorp-mymodule",
"project_page": "https://forge.puppetlabs.com/examplecorp/mymodule",
"issues_url": "https://github.com/examplecorp/examplecorp-mymodule/issues",
"tags": ["things", "stuff"],
"operatingsystem_support": [
{
"operatingsystem":"RedHat",
"operatingsystemrelease":[ "5.0", "6.0" ]
},
{
"operatingsystem": "Ubuntu",
"operatingsystemrelease": [ "12.04", "10.04" ]
}
],
"dependencies": [
{ "name": "puppetlabs/stdlib", "version_requirement": ">=3.2.0 <5.0.0" },
{ "name": "puppetlabs/firewall", "version_requirement": ">= 0.0.4" }
]
}
It's ugly, awful, horrible, not structure-aware and will give you incorrect results if you have extra contents in your input file that look similar to what you're trying to find -- but...
#!/bin/bash
# ^- NOT /bin/sh; shell-native regexes are a bash extension
contents=$(<in.json)
if [[ $contents =~ '"name":'[[:space:]]*'"'([^\"]*)'"' ]]; then
echo "Found name: ${BASH_REMATCH[1]}"
fi
Now, let's talk about some of the ways this answer is broken (and using jq would be better):
It finds the first name, even if it's not one at an outer nesting layer. That is to say, if "dependencies": [ { "name": "puppetlabs/stdlib", "version_requirement": ">=3.2.0 <5.0.0" } ] comes before "name": "examplecorp-mymodule", guess which result is being found? (The easy workarounds to this would involve making assumptions about whitespace/formatting, and are thus not proof against all possible JSON expressions of the same data).
It won't unescape contents inside your name that require, well, unescaping (think about names containing symbols encoded as &foo;).
It isn't multibyte-character aware, and thus isn't guaranteed to emit output that aligns on codepoint boundaries.
If you have a name with an escaped \" subsequence... well, guess what happens there?
Etc. It's not quite as awful as trying to parse XML with regular expressions (JSON is easier!), but it's still quite a mess.
This should work for you:
jq '.name' metadata.json

Need help! - Unable to load JSON using COPY command

Need your expertise here!
I am trying to load a JSON file (generated by JSON dumps) into redshift using copy command which is in the following format,
[
{
"cookieId": "cb2278",
"environment": "STAGE",
"errorMessages": [
"70460"
]
}
,
{
"cookieId": "cb2271",
"environment": "STG",
"errorMessages": [
"70460"
]
}
]
We ran into the error - "Invalid JSONPath format: Member is not an object."
when I tried to get rid of square braces - [] and remove the "," comma separator between JSON dicts then it loads perfectly fine.
{
"cookieId": "cb2278",
"environment": "STAGE",
"errorMessages": [
"70460"
]
}
{
"cookieId": "cb2271",
"environment": "STG",
"errorMessages": [
"70460"
]
}
But in reality most JSON files from API s have this formatting.
I could do string replace or reg ex to get rid of , and [] but I am wondering if there is a better way to load into redshift seamlessly with out modifying the file.
One way to convert a JSON array into a stream of the array's elements is to pipe the former into jq '.[]'. The output is sent to stdout.
If the JSON array is in a file named input.json, then the following command will produce a stream of the array's elements on stdout:
$ jq ".[]" input.json
If you want the output in jsonlines format, then use the -c switch (i.e. jq -c ......).
For more on jq, see https://stedolan.github.io/jq