Extracting JSON from JSON data in Hive - json

I have the following data, in JSON format, in a column called details of a table named customer in Hive:
{
  "customer" : {
    "given_name" : "Anuvrat",
    "surname" : "Singh"
  },
  "order" : {
    "id" : "123dfe523gd"
  },
  "address" : {
    "city" : "kolkata",
    "pin" : "700091"
  },
  "phone" : {
    "mobile" : "*********"
  }
}
I have to remove the address and phone from the JSON data, so that it looks like this:
{
  "customer" : {
    "given_name" : "Anuvrat",
    "surname" : "Singh"
  },
  "order" : {
    "id" : "123dfe523gd"
  }
}
How do I do this (i.e. update) for every row present in the table?
I tried the following command: hadoop fs -cat /home/customer/* | jq '.details[] |= del(.address,.phone)' but I didn't get the expected output; instead I got an error saying:
parse error: Invalid numeric literal at line 1, column 93
cat: Unable to write to output stream.

If you're open to a solution that doesn't use Hive, I'd point out that this is very easy to do with the jq command-line JSON parser.
Given your input file, you would do:
jq 'del(.address,.phone)' file
If you want to remove address and phone objects for all entries of the table, you can do:
jq '.[] |= del(.address,.phone)' file
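About the error from your hadoop fs -cat attempt: jq's "Invalid numeric literal" usually means it ran into text that isn't JSON, for example when the dump also carries other tab-separated columns. Assuming the files really do hold one JSON document per line and nothing else (paths are taken from your question, the output location is just illustrative), a minimal sketch would be:
# Stream the files out of HDFS and drop the two keys from every document;
# -c keeps each rewritten document on a single line.
hadoop fs -cat /home/customer/* \
  | jq -c 'del(.address, .phone)' \
  > customer_cleaned.json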

This is the query which I ran to get the above result:
INSERT OVERWRITE TABLE customer
SELECT id,
       CASE WHEN id IS NOT NULL
            THEN concat('{"customer":', get_json_object(details, '$.customer'),
                        ',"order":', get_json_object(details, '$.order'), '}')
            ELSE details
       END AS details
FROM customer;
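If you want to sanity-check the rewrite afterwards, something along these lines should do; this is only a sketch and assumes the hive CLI and jq are both available on the machine you run it from:
# Pull a few rewritten rows back out and make sure the details column
# still parses as JSON; jq exits non-zero on a parse error.
hive -S -e "SELECT details FROM customer LIMIT 5" | jq -e . > /dev/null \
  && echo "details column parses as valid JSON"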

Related

How to extract certain data using Perl from a file?

I have data that needs to be extracted from a file; the lines I need for the moment are name, location and host. Below is an example of the extract. How would I go about getting these lines into a separate file? I have the original file and the new file I want to create as the input/output files; there are thousands of devices contained within the output file and they all have the same formatting as in my example.
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# names of files to be input/output
my $inputfile  = "/home/nmis/nmis_export.csv";
my $outputfile = "/home/nmis/nmis_data.csv";

open(INPUT,  '<', $inputfile)  or die $!;
open(OUTPUT, '>', $outputfile) or die $!;

my @data = <INPUT>;
close INPUT;

my $line = "";
foreach $line (@data)
{
======Sample Extract=======
**"group" : "NMIS8",
"host" : "1.2.3.4",
"location" : "WATERLOO",
"max_msg_size" : 1472,
"max_repetitions" : 0,
"model" : "automatic",
"netType" : "lan",
"ping" : 1,
"polling_policy" : "default",
"port" : 161,
"rancid" : 0,
"roleType" : "access",
"serviceStatus" : "Production",
"services" : null,
"threshold" : 1,
"timezone" : 0,
"version" : "snmpv2c",
"webserver" : 0
},
"lastupdate" : 1616690858,
"name" : "test",
"overrides" : {}
},
{
"activated" : {
"NMIS" : 1
},
"addresses" : [],
"aliases" : [],
"configuration" : {
"Stratum" : 3,
"active" : 1,
"businessService" : "",
"calls" : 0,
"cbqos" : "none",
"collect" : 0,
"community" : "public",
"depend" : [
"N/A"
],
"group" : "NMIS8",
"host" : "1.2.3.5",
"location" : "WATERLOO",
"max_msg_size" : 1472,
"max_repetitions" : 0,
"model" : "automatic",
"netType" : "lan",
"ping" : 1,
"polling_policy" : "default",
"port" : 161,
"rancid" : 0,
"roleType" : "access",
"serviceStatus" : "Production",
"services" : null,
"threshold" : 1,
"timezone" : 0,
"version" : "snmpv2c",
"webserver" : 0
},
"lastupdate" : 1616690858,
"name" : "test2",
"overrides" : {}
},
I would use jq for this, not Perl. You just need to query a JSON document; that's what jq is for. You can see an example here.
The jq query I created is this one,
.[] | {name: .name, group: .configuration.group, location: .configuration.location}
This breaks down into
.[]          # iterate over the array
|            # create a filter to send it to
{            # that produces an object with the below key/values
  name: .name,
  group: .configuration.group,
  location: .configuration.location
}
It provides an output like this,
{
  "name": "test",
  "group": "NMIS8",
  "location": "WATERLOO"
}
{
  "name": "test2",
  "group": "NMIS8",
  "location": "WATERLOO"
}
You can use this to generate CSV:
jq -r '.[] | [.name, .configuration.group, .configuration.location] | @csv' ./file.json
Or this to generate CSV with a header:
jq -r '["name","group","location"], (.[] | [.name, .configuration.group, .configuration.location]) | @csv' ./file.json
You can use the JSON distribution for this. Read the entire file in one fell swoop to put the entire JSON string into a scalar (as opposed to putting it into an array and iterating over it), then simply decode the string into a Perl data structure:
use warnings;
use strict;
use JSON;

my $file = 'file.json';
my $json_string;

{
    local $/;    # Locally reset line endings to nothing
    open my $fh, '<', $file or die "Can't open file $file!: $!";
    $json_string = <$fh>;    # Slurp in the entire file
}

my $perl_data_structure = decode_json $json_string;
As what you have there is JSON, you should parse it with a JSON parser. JSON::PP is part of the standard Perl distribution. If you want something faster, you could install something else from CPAN.
Update: I included a link to JSON::PP in my answer. Did you follow that link? If you did, you would have seen the documentation for the module. That has more information about how to use the module than I could include in an answer on SO.
But it's possible that you need a little more high-level information. The documentation says this:
JSON::PP is a pure perl JSON decoder/encoder
But perhaps you don't know what that means. So here's a primer.
JSON is a text format for storing complex data structures. The format was initially used in Javascript (the acronym stands for "JavaScript Object Notation") but it is now a standard that is used across pretty much all programming languages.
You rarely want to actually deal with JSON in a program. A JSON document is just text and manipulating that would require some complex regular expressions. When dealing with JSON, the usual approach is to "decode" the JSON into a data structure inside your program. You can then manipulate the data structure however you want before (optionally) "encoding" the data structure back into JSON so you can write it to an output file (in your case, you don't need to do that as you want your output as CSV).
So there are pretty much only two things that a Perl JSON library needs to do:
Take some JSON text and decode it into a Perl data structure
Take a Perl data structure and encode it into JSON text
If you look at the JSON::PP documentation you'll see that it contains two functions, encode_json() and decode_json() which do what I describe above. There's also an OO interface, but let's not overcomplicate things too quickly.
So your program now needs to have the following steps:
Read the JSON from the input file
Decode the JSON into a Perl data structure
Walk the Perl data structure to extract the items that you need
Write the required items into your output file (for which Text::CSV will be useful)
Having said all that, it really does seem to me that the jq solution suggested by user157251 is a much better idea.

Unknown key for a START_OBJECT in [layers]

When trying to send my json formatted tcpdump to elasticsearch, I get the following error:
curl -X PUT --data-binary @myjson 'localhost:9200/_bulk?pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown key for a START_OBJECT in [layers].",
        "line" : 1,
        "col" : 95
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown key for a START_OBJECT in [layers].",
    "line" : 1,
    "col" : 95
  },
  "status" : 400
}
The json file was obtained using tshark with the "-T json" option.
The json file was modified using jq with the filter "{index: .[]}" and the option -c since elasticsearch requires an entry to fit in a single line.
I am using elasticsearch 5.5.1 with the standard configuration.
jsonformatter marks the json object as valid.
A json object that produces the error looks as follows:
{"index":{"_index":"packets-2017-08-04","_type":"pcap_file","_score":null,"_source":{"layers":{"frame":{"frame.encap_type":"25","frame.time":"Aug 5, 2001 13:10:06.559762000 CEST","frame.offset_shift":"0.000000000","frame.time_epoch":"1501773006.559765000","frame.time_delta":"0.000000000","frame.time_delta_displayed":"0.000000000","frame.time_relative":"0.000000000","frame.number":"1","frame.len":"200","frame.cap_len":"200","frame.marked":"0","frame.ignored":"0","frame.protocols":"sll:ethertype:ip:tcp:data"},"sll":{"sll.pkttype":"4","sll.hatype":"65135","sll.halen":"0","sll.etype":"0x00000800"},"ip":{"ip.version":"4","ip.hdr_len":"20","ip.dsfield":"0x00000010","ip.dsfield_tree":{"ip.dsfield.dscp":"4","ip.dsfield.ecn":"0"},"ip.len":"184","ip.id":"0x000093f2","ip.flags":"0x00000002","ip.flags_tree":{"ip.flags.rb":"0","ip.flags.df":"1","ip.flags.mf":"0"},"ip.frag_offset":"0","ip.ttl":"64","ip.proto":"6","ip.checksum":"0x0000ef4b","ip.checksum.status":"2","ip.src":"0.0.00","ip.addr":"0.0.0.0","ip.src_host":"0.0.0.0","ip.host":"0.0.0.0","ip.dst":"0.0.0.0","ip.dst_host":"0.0.0.0","Source GeoIP: Germany":{"ip.geoip.src_country":"Germany","ip.geoip.country":"Germany","ip.geoip.src_city":"Frankfurt, 1","ip.geoip.city":"Berlin, 1","ip.geoip.src_asnum":"123","ip.geoip.asnum":"123","ip.geoip.src_lat":"701","ip.geoip.lat":"523,01","ip.geoip.src_lon":"2313,4","ip.geoip.lon":"12,13"},"Destination GeoIP: Germany":{"ip.geoip.dst_country":"Germany","ip.geoip.country":"Germany","ip.geoip.dst_asnum":"123","ip.geoip.asnum":"123","ip.geoip.dst_lat":"3321","ip.geoip.lat":"41","ip.geoip.dst_lon":"1","ip.geoip.lon":"2"}},"tcp":{"tcp.srcport":"41","tcp.dstport":"124","tcp.port":"234","tcp.stream":"3","tcp.len":"134","tcp.seq":"1","tcp.nxtseq":"133","tcp.ack":"4","tcp.hdr_len":"32","tcp.flags":"0x00000018","tcp.flags_tree":{"tcp.flags.res":"0","tcp.flags.ns":"0","tcp.flags.cwr":"0","tcp.flags.ecn":"0","tcp.flags.urg":"0","tcp.flags.ack":"1","tcp.flags.push":"1","tcp.flags.reset":"0","tcp.flags.syn":"0","tcp.flags.fin":"0","tcp.flags.str":"·······AP···"},"tcp.window_size_value":"223","tcp.window_size":"31","tcp.window_size_scalefactor":"-1","tcp.checksum":"0x0000b79c","tcp.checksum.status":"1","tcp.urgent_pointer":"0","tcp.options":"123","tcp.options_tree":{"No-Operation (NOP)":{"tcp.options.type":"1","tcp.options.type_tree":{"tcp.options.type.copy":"0","tcp.options.type.class":"0","tcp.options.type.number":"1"}},"Timestamps: TSval 1875055084, TSecr 5726840":{"tcp.option_kind":"8","tcp.option_len":"10","tcp.options.timestamp.tsval":"185084","tcp.options.timestamp.tsecr":"1116840"}},"tcp.analysis":{"tcp.analysis.bytes_in_flight":"123","tcp.analysis.push_bytes_sent":"133"}},"data":{"data.data":"01:01:02","data.len":"265"}}}}}
My question is: What is wrong with this json, so that elasticsearch rejects it?
This is not really a jq problem - "Unknown key for a START_OBJECT" is an elasticsearch error. The [layers] is a hint that the problem is in that object, which unfortunately was elided in the problem description, so there's really not much to go on here.
Since the jq filter you specified is just {index: .[]}, jq is doing nothing to the part of the JSON elasticsearch is complaining about. If your workflow is expecting jq to correct that portion somehow, you'll need to investigate the data closer and use a more sophisticated filter.
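For what it's worth, a very common cause of this exact error is that the whole document ends up nested inside the bulk action line: the _bulk endpoint expects two lines per document, an action/metadata line such as {"index":{...}} followed by the source document on the next line. Here is a sketch of what a more sophisticated filter might look like, assuming the original tshark output is an array of entries that each carry _index, _type and _source as in your sample (file names are illustrative):
# For each entry emit an action line followed by its _source document,
# one JSON value per line (-c), then send the result to the bulk API.
jq -c '.[] | {index: {_index: ._index, _type: ._type}}, ._source' packets.json > bulk.ndjson
curl -X POST --data-binary @bulk.ndjson 'localhost:9200/_bulk?pretty'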
For reference, the elasticsearch test suite contains an example of this particular error:
---
"junk in source fails":
  - do:
      catch: /Unknown key for a START_OBJECT in \[junk\]./
      reindex:
        body:
          source:
            junk: {}
Hope this helps.

How to read and parse the json file and add it into the shell script variable?

I have a file named loaded.json which contains the JSON data below.
{
  "name" : "xat",
  "code" : "QpiAc"
}
{
  "name" : "gbd",
  "code" : "gDSo3"
}
{
  "name" : "mbB",
  "code" : "mg33y"
}
{
  "name" : "sbd",
  "code" : "2Vl1w"
}
From the shell script I need to read and parse the JSON, put the result into a variable and print it, like this:
#!/bin/sh
databasename = cat loaded.json | json select '.name'
echo $databasename
When I run the above script I get errors like:
databasename command not found
json command not found
I am new to shell scripting; please help me solve this problem.
Replace that line with this,
databasename=`cat loaded.json | json select '.name'`
or try the jq command,
databasename=`jq '.name' loaded.json`
For more information read this article.
I was able to get the result using the jq command as below:
databasename=`cat loaded.json | jq '.name'`
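Note that loaded.json holds several objects, so jq prints one name per line and the variable above ends up holding all of them separated by newlines. If you need the names individually, a small sketch (it assumes bash rather than plain sh, and that jq is installed):
#!/bin/bash
# Read one name per line from jq's raw (-r, unquoted) output into an array.
mapfile -t databasenames < <(jq -r '.name' loaded.json)
echo "first database: ${databasenames[0]}"
echo "all databases:  ${databasenames[*]}"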

using logstash to parse csv file

I have an elasticsearch index which I am using to index a set of documents.
These documents are originally in CSV format and I am looking to parse them using Logstash, as it has powerful regular expression tools such as grok.
My problem is that I have something along the following lines:
field1,field2,field3,number@number#number@number#number@number
In the last column I have key value pairs key@value separated by # and there can be any number of these.
Is there a way for me to use Logstash to parse this and get it to store the last column as the following JSON in elasticsearch (or some other searchable format) so I am able to search it?
[
{"key" : number, "value" : number},
{"key" : number, "value" : number},
...
]
First, you can use the csv filter to parse out the last column.
Then, you can use the ruby filter to write your own code to do what you need.
input {
    stdin {
    }
}

filter {
    ruby {
        code => '
            b = event["message"].split("#");
            ary = Array.new;
            for c in b;
                keyvar = c.split("@")[0];
                valuevar = c.split("@")[1];
                d = "{key : " << keyvar << ", value : " << valuevar << "}";
                ary.push(d);
            end;
            event["lastColum"] = ary;
        '
    }
}

output {
    stdout {debug => true}
}
With this filter, when I input
1@10#2@20
The output is
"message" => "1#10#2#20",
"#version" => "1",
"#timestamp" => "2014-03-25T01:53:56.338Z",
"lastColum" => [
[0] "{key : 1, value : 10}",
[1] "{key : 2, value : 20}"
]
FYI. Hope this can help you.

mongoexport JSON assertion: 10340 Failure parsing JSON string

I'm trying to export a CSV file list from MongoDB and save the output file to my directory, which is /home/asaj/. The output file should have the following columns: name, file_name, d_start and d_end.
The query should filter data with status equal to "FU" or "FD", and d_end > Dec. 10, 2012.
In MongoDB, the query works properly. The query below is limited to one result:
> db.Samples.find({ $or : [ { status : 'FU' }, { status : 'FD'} ], d_end : { $gte : ISODate("2012-12-10T00:00:00.000Z") } }, {_id: 0, name: 1, file_name: 1, d_start: 1, d_end: 1}).limit(1).toArray();
[
{
"name" : "sample"
"file_name" : "sample.jpg",
"d_end" : ISODate("2012-12-10T05:1:57.879Z"),
"d_start" : ISODate("2012-12-10T02:31:34.560Z"),
}
]
>
In CLI, mongoexport command looks like this:
mongoexport -d maindb -c Samples -f "name, file_name, d_start, d_end" -q "{'\$or' : [ { 'status' : 'FU' }, { 'status' : 'FD'} ] , 'd_end' : { '\$gte' : ISODate("2012-12-10T00:00:00.000Z") } }" --csv -o "/home/asaj/currentlist.csv"
But I always end up with this error:
connected to: 127.0.0.1
Wed Dec 19 16:58:17 Assertion: 10340:Failure parsing JSON string near: , 'd_end
0x5858b2 0x528cb4 0x52902e 0xa9a631 0xa93e4d 0xa97de2 0x31b441ecdd 0x4fd289
mongoexport(_ZN5mongo11msgassertedEiPKc+0x112) [0x5858b2]
mongoexport(_ZN5mongo8fromjsonEPKcPi+0x444) [0x528cb4]
mongoexport(_ZN5mongo8fromjsonERKSs+0xe) [0x52902e]
mongoexport(_ZN6Export3runEv+0x7b1) [0xa9a631]
mongoexport(_ZN5mongo4Tool4mainEiPPc+0x169d) [0xa93e4d]
mongoexport(main+0x32) [0xa97de2]
/lib64/libc.so.6(__libc_start_main+0xfd) [0x31b441ecdd]
mongoexport(__gxx_personality_v0+0x3c9) [0x4fd289]
assertion: 10340 Failure parsing JSON string near: , 'd_end
I'm getting the error at ", 'd_end'" in the mongoexport CLI. I'm not sure if it is a JSON syntax error, because the query works in MongoDB.
Please help.
After asking someone who knows MongoDB better than me, we found out that the problem is the
ISODate("2012-12-10T00:00:00.000Z")
We found the answer on this question: mongoexport JSON parsing error
To resolve this error, first we convert the date to a Unix timestamp with strtotime:
php > echo strtotime("12/10/2012");
1355126400
Next, multiply the strtotime result by 1000. The date will look like this:
1355126400000
Lastly, change ISODate("2012-12-10T00:00:00.000Z") to new Date(1355126400000) in the mongoexport command.
Now, the CLI mongoexport looks like this and it works:
mongoexport -d maindb -c Samples -f "id,file_name,d_start,d_end" -q "{'\$or' : [ { 'status' : 'FU' }, { 'status' : 'FD'} ] , 'd_end' : { '\$gte' : new Date(1355126400000) } }" --csv -o "/home/asaj/listupdate.csv"
Note: remove the spaces between the field names in the -f or --fields option.
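As an aside, if you don't have PHP handy, the same millisecond value can be computed in the shell; this is just a sketch and assumes GNU date (the -d flag behaves differently on macOS/BSD):
# Seconds since the epoch for midnight UTC on 2012-12-10, then *1000
# to get the milliseconds that new Date(...) expects.
epoch_ms=$(( $(date -u -d '2012-12-10T00:00:00Z' +%s) * 1000 ))
echo "$epoch_ms"    # prints 1355126400000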
I know it has little to do with this question, but the title of this post brought it up in Google so since I was getting the exact same error I'll add an answer. Hopefully it helps someone.
My issue was adding a MongoId query for _id to a mongoexport console command on Windows. Here's the error:
Assertion: 10340:Failure parsing JSON string near: _id
The problem ended up being that I needed to wrap the JSON query in double quotes, and the ObjectId had to be in double quotes (not single!), so I had to escape those quotes. Here's the final query that worked, for future reference:
mongoexport -u USERNAME -pPASSWORD -d DATABASE -c COLLECTION
--query "{_id : ObjectId(\"5148894d98981be01e000011\")}"