Pig: parse bytearray as a string/json

I have some JSON data saved to S3 in SequenceFile format by secor. I want to analyze it using Pig. Using elephant-bird I managed to load it from S3 as a bytearray, but I wasn't able to convert it to chararray, which is apparently needed to parse JSON:
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare BYTES_CONVERTER 'com.twitter.elephantbird.pig.util.BytesWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
grunt> A = LOAD 's3n://...logs/raw_logs/...events/dt=2015-12-08/1_0_00000000000085594299'
USING $SEQFILE_LOADER ('-c $LONG_CONVERTER', '-c $BYTES_CONVERTER')
AS (key: long, value: bytearray);
grunt> B = LIMIT A 1;
grunt> DUMP B;
(85653965,{"key": "val1", other json data, ...})
grunt> DESCRIBE B;
B: {key: long,value: bytearray}
grunt> C = FOREACH B GENERATE (key, (chararray)value);
grunt> DUMP C;
2015-12-08 19:32:09,133 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders.
Cannot determine how to convert the bytearray to string.
Using TextConverter instead of BytesWritableConverter just leaves me with empty values, like:
(85653965,)
It's apparent that Pig was able to cast the byte array to a string to dump it, so it doesn't seem like it should be impossible. How do I do that?


python error on string format with "\n" exec(compile(contents+"\n", file, 'exec'), glob, loc)

I am trying to construct JSON from a string that contains "\n", like this:
ver_str= 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
json_content = json.loads()
d =json.dumps(json_content )
I am getting this error:
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Dev/python/new_tester/simple_main.py", line 18, in <module>
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
KeyError: '"r_content"'
The error arises not because of newlines in your values, but because of { and } characters in your format string other than the placeholders {0} and {1}. If you want to have an actual { or a } character in your string, double them.
Try replacing the line
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
with
comb = '{{"r_content": {0}, "s_version": {1}}}'.format(ver_str,proj_ver_str)
However, this will give you a different error on the next line, loads() missing 1 required positional argument: 's'. This is because you presumably forgot to pass comb to json.loads().
Replacing json.loads() with json.loads(comb) gives you another error: json.decoder.JSONDecodeError: Expecting value: line 1 column 15 (char 14). This tells you that you've given json.loads malformed JSON to parse. If you print out the value of comb, you see the following:
{"r_content": Package ID: version_1234
Build
number: 154
Built
, "s_version": Version_123}
This isn't valid JSON, because the string values aren't surrounded by quotes. So a JSON parsing error is to be expected.
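The two steps so far can be reproduced in a short script. This is only a sketch built from the values in the question; the name parse_error is introduced here for illustration.

```python
import json

ver_str = 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'

# Doubled braces are emitted as literal { and } by str.format.
comb = '{{"r_content": {0}, "s_version": {1}}}'.format(ver_str, proj_ver_str)

# The values are inserted without surrounding quotes,
# so the formatted string is not valid JSON.
try:
    json.loads(comb)
    parse_error = None
except json.JSONDecodeError as exc:
    parse_error = exc

print(comb)
print(parse_error)
```

The format call now succeeds, but json.loads rejects the result at the first unquoted value.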
At this point, let's take a look at what your code is doing and what you seem to want it to do. It seems you want to construct a JSON string from your data, but your code puts together a JSON string from your data, parses it to a dict and then formats it back as a JSON string.
If you want to create a JSON string from your data, it's far simpler to create a dict with your values and use json.dumps on that:
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
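As a quick check that this approach handles the newlines, here is a minimal sketch that round-trips the values from the question; the name restored is only for illustration.

```python
import json

ver_str = 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'

# json.dumps adds the quotes and escapes the newlines for us.
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
print(d)

# Parsing the string again gives back the original values.
restored = json.loads(d)
```

The JSON text contains the two-character escape \n rather than raw line breaks, and parsing it restores the original string.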

Saving json file by dumping dictionary in a for loop, leading to malformed json

So I have the following dictionaries that I get by parsing a text file
keys = ["scientific name", "common names", "colors"]
values = ["somename1", ["name11", "name12"], ["color11", "color12"]]
keys = ["scientific name", "common names", "colors"]
values = ["somename2", ["name21", "name22"], ["color21", "color22"]]
and so on. I am dumping the key value pairs using a dictionary to a json file using a for loop where I go through each key value pair one by one
for loop starts
d = dict(zip(keys, values))
with open("file.json", 'a') as j:
json.dump(d, j)
If I open the saved json file I see the contents as
{"scientific name": "somename1", "common names": ["name11", "name12"], "colors": ["color11", "color12"]}{"scientific name": "somename2", "common names": ["name21", "name22"], "colors": ["color21", "color22"]}
Is this the right way to do it?
The purpose is to query the common name or colors for a given scientific name. So then I do
with open("file.json", "r") as j:
    data = json.load(j)
I get the error, json.decoder.JSONDecodeError: Extra data:
I think this is because I am not dumping the dictionaries to JSON correctly in the for loop. I have to insert some square brackets programmatically. Just doing json.dump(d, j) won't suffice.
JSON may only have one root element. This root element can be [], {} or most other datatypes.
In your file, however, you get multiple root elements:
{...}{...}
This isn't valid JSON, and the error Extra data refers to the second {}, where valid JSON would end instead.
You can write multiple dicts to a JSON string, but you need to wrap them in an array:
[{...},{...}]
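The difference between the two layouts can be seen with small in-memory strings. This is a sketch with made-up keys ("a" and "b"), not the data from the question.

```python
import json

# Two root elements back to back: the parser stops after the first one.
two_roots = '{"a": 1}{"b": 2}'
try:
    json.loads(two_roots)
    message = None
except json.JSONDecodeError as exc:
    message = exc.msg

# The same objects wrapped in one array parse fine.
wrapped = '[{"a": 1},{"b": 2}]'
records = json.loads(wrapped)

print(message)
print(records)
```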
But now off to how I would fix your code. First, I rewrote what you posted, because your code was rather pseudo-code and didn't run directly.
import json
inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]
for keys, values in inputs:
    d = dict(zip(keys, values))
    with open("file.json", 'a') as j:
        json.dump(d, j)
with open("file.json", 'r') as j:
    print(json.load(j))
As you correctly realized, this code fails with
json.decoder.JSONDecodeError: Extra data: line 1 column 105 (char 104)
The way I would write it is:
import json
inputs = [(["scientific name", "common names", "colors"],
           ["somename1", ["name11", "name12"], ["color11", "color12"]]),
          (["scientific name", "common names", "colors"],
           ["somename2", ["name21", "name22"], ["color21", "color22"]])]
jsonData = list()
for keys, values in inputs:
    d = dict(zip(keys, values))
    jsonData.append(d)
with open("file.json", 'w') as j:
    json.dump(jsonData, j)
with open("file.json", 'r') as j:
    print(json.load(j))
Also, with Python's json library it is important that you write the entire JSON file in one go, meaning with 'w' instead of 'a'.
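Since the stated goal is to look up common names or colors by scientific name, here is one possible sketch of that lookup once the file holds a single JSON array. The temporary directory and the names records and by_name are assumptions made for this example.

```python
import json
import os
import tempfile

records = [
    {"scientific name": "somename1",
     "common names": ["name11", "name12"],
     "colors": ["color11", "color12"]},
    {"scientific name": "somename2",
     "common names": ["name21", "name22"],
     "colors": ["color21", "color22"]},
]

# Write the whole list in one go, using 'w' rather than 'a'.
path = os.path.join(tempfile.mkdtemp(), "file.json")
with open(path, "w") as j:
    json.dump(records, j)

with open(path, "r") as j:
    loaded = json.load(j)

# Index the records by scientific name for direct lookups.
by_name = {rec["scientific name"]: rec for rec in loaded}
colors = by_name["somename2"]["colors"]
print(colors)
```

Building the index once keeps each later lookup a plain dictionary access.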

Decoding Dict in Elm failing due to extra backslashes

I'm trying to send a dict to JavaScript via a port to store the value in localStorage, and retrieve it via a flag the next time the Elm app starts.
The code snippets below show the dict sent as well as the raw JSON value received through the flag. The JSON decoding fails with the error message shown at the bottom.
The issue seems to be the extra backslashes (as in \"{\\"Left\\") contained in the raw flag value. Interestingly, console.log shows that the flag value passed by JavaScript is "dict1:{"Left":"fullHeightVerticalCenter","Right":"fullHeightVerticalCenter","_default":"fullHeightVerticalBottom"}" as intended, so the extra backslashes seem to be added by Elm, but I can't figure out why. Also, I'd be interested to find a better way to pass a dict to and from JavaScript.
import Json.Decode as JD
import Json.Encode as JE
dict1 = Dict.fromList[("_default", "fullHeightVerticalBottom")
    , ("Left", "fullHeightVerticalCenter")
    , ("Right", "fullHeightVerticalCenter")]
type alias FlagsJEValue =
    {dict1: String}
port setStorage : FlagsJEValue -> Cmd msg
-- inside Update function Cmd
setStorage {dict1 = JE.encode 0 (dictEncoder JE.string model.dict1)}
dictEncoder enc dict =
    Dict.toList dict
        |> List.map (\(k,v) -> (k, enc v))
        |> JE.object
--
type alias Flags =
    {dict1: Dict String String}
flagsDecoder : Decoder Flags
flagsDecoder =
    JD.succeed Flags
        |> required "dict1" (JD.dict JD.string)
-- inside `init`
case JD.decodeValue MyDecoders.flagsDecoder raw_flags of
    Err e ->
        _ = Debug.log "raw flag value" (Debug.toString (JE.encode 2 raw_flags) )
        _ = Debug.log "flags error msg" (Debug.toString e)
        ... omitted ...
    Ok flags ->
        ... omitted ...
-- raw flag value
"{\n \"dict1\": \"{\\\"Left\\\":\\\"fullHeightVerticalCenter\\\",\\\"Right\\\":\\\"fullHeightVerticalCenter\\\",\\\"_default\\\":\\\"fullHeightVerticalBottom\\\"}\"\n}"
--flags error msg
"Failure \"Json.Decode.oneOf failed in the following 2 ways:\\n\\n\\n\\n
(1) Problem with the given value:\\n \\n \\\"{\\\\\\\"Left\\\\\\\":\\\\\\\"fullHeightVerticalCenter\\\\\\\",\\\\\\\"Right\\\\\\\":\\\\\\\"fullHeightVerticalCenter\\\\\\\",\\\\\\\"_default\\\\\\\":\\\\\\\"fullHeightVerticalBottom\\\\\\\"}\\\"\\n \\n Expecting an OBJECT\\n\\n\\n\\n
(2) Problem with the given value:\\n \\n \\\"{\\\\\\\"Left\\\\\\\":\\\\\\\"fullHeightVerticalCenter\\\\\\\",\\\\\\\"Right\\\\\\\":\\\\\\\"fullHeightVerticalCenter\\\\\\\",\\\\\\\"_default\\\\\\\":\\\\\\\"fullHeightVerticalBottom\\\\\\\"}\\\"\\n \\n Expecting null\" <internals>"
You don't need to use JE.encode there.
You can just use your dictEncoder to produce a Json.Encode.Value and pass that directly to setStorage.
The problem you're encountering is that you've encoded the dict to a JSON string (using JE.encode), then sent that string over a port, and the port has encoded that string as JSON again. You see extra backslashes because the JSON string is double encoded.
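The same effect can be reproduced outside Elm. The sketch below uses Python's json module purely as an analogy for what happens at the port; it is not Elm code, and the names once, twice and fixed are introduced only for this example.

```python
import json

dict1 = {
    "Left": "fullHeightVerticalCenter",
    "Right": "fullHeightVerticalCenter",
    "_default": "fullHeightVerticalBottom",
}

# First encoding: the dict becomes a JSON string (the JE.encode step).
once = json.dumps(dict1)

# Second encoding: that string is embedded as a string value,
# so its quotes are escaped with backslashes.
twice = json.dumps({"dict1": once})
decoded = json.loads(twice)

# Passing the dict itself, not a pre-encoded string, avoids the problem.
fixed = json.loads(json.dumps({"dict1": dict1}))

print(twice)
print(type(decoded["dict1"]).__name__, type(fixed["dict1"]).__name__)
```

After double encoding, the decoder sees a string where it expects an object, which matches the "Expecting an OBJECT" message in the question.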

Converting epgsql results to JSON

I am a total beginner with Erlang and functional programming in general. For fun, to get me started, I am converting an existing Ruby Sinatra REST(ish) API that queries PostgreSQL and returns JSON.
On the Erlang side I am using Cowboy, Epgsql and Jiffy as the JSON library.
Epgsql returns results in the following format:
{ok, [{column,<<"column_name">>,int4,4,-1,0}], [{<<"value">>}]}
But Jiffy expects the following format when encoding to JSON:
{[{<<"column_name">>,<<"value">>}]}
The following code works to convert epgsql output into suitable input for jiffy:
Assuming Data is the Epgsql output and Key is the name of the JSON object being created:
{_, C, R} = Data,
Columns = [X || {_, X, _, _, _, _} <- C],
Rows = tuple_to_list(hd(R)),
Result = {[{atom_to_binary(Key, utf8), {lists:zip(Columns, Rows)}}]}.
However, I am wondering if this is efficient Erlang?
I've looked into the documentation for Epgsql and Jiffy and can't see any more obvious ways to perform the conversion.
Thank you.
Yes, you need to convert it yourself.
For example, here is a function that parses the result:
parse_result({error, #error{ code = <<"23505">>, extra = Extra }}) ->
{match, [Column]} =
re:run(proplists:get_value(detail, Extra),
"Key \\(([^\\)]+)\\)", [{capture, all_but_first, binary}]),
throw({error, {non_unique, Column}});
parse_result({error, #error{ message = Msg }}) ->
throw({error, Msg});
parse_result({ok, Cols, Rows}) ->
to_map(Cols, Rows);
parse_result({ok, Counts, Cols, Rows}) ->
{ok, Counts, to_map(Cols, Rows)};
parse_result(Result) ->
Result.
And a function that converts the result to a list of maps:
to_map(Cols, Rows) ->
[ maps:from_list(lists:zipwith(fun(#column{name = N}, V) -> {N, V} end,
Cols, tuple_to_list(Row))) || Row <- Rows ].
Then encode the result to JSON. You can adapt my code to produce proplists instead of maps.
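For readers less familiar with Erlang, the idea behind to_map (pair each column name with the value at the same position in every row) can be sketched in Python. This is only an analogy: the column names and rows below are hard-coded stand-ins, not Epgsql's actual return format.

```python
import json

# Hypothetical stand-ins for the column names and row tuples of a query result.
columns = ["column_name"]
rows = [("value",)]

# One dict per row, pairing each column name with its value.
result = [dict(zip(columns, row)) for row in rows]
encoded = json.dumps(result)
print(encoded)
```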

Json Parsing in Apache Pig

I have the following JSON:
{"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}
I found that we are able to load json into PigScript.
A = LOAD 'data.json'
USING PigJsonLoader();
But how do I parse JSON in Apache Pig?
--Sampling.pig
--pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
--Load data
inputdata = LOAD '$input' using PigStorage('$delimiter');
--Group data
groupedByAll = group inputdata all;
--output into hdfs
sampled = SAMPLE inputdata $fraction;
store sampled into '$output' using PigStorage('$delimiter');
Above is my pig script.
How do I parse the JSON (each element) in Apache Pig?
I need to take the JSON above as input, parse its source, delimiter, fraction and output values, and pass them in as $input, $delimiter, $fraction and $output respectively.
Please advise.
Try this:
--Load data
inputdata = LOAD '/input.txt' using JsonLoader('Name:chararray,elementInfo:(fraction:chararray),destination:chararray,source:chararray');
--Group data
groupedByAll = group inputdata all;
store groupedByAll into '/OUT/pig' using PigStorage(',');
Your output now looks like this:
all,{(sampling1,(4),/user/sree/OUT1,/user/sree/foo1.txt),(sampling,(3),/user/sree/OUT,/user/sree/foo.txt)}
In the input file, the fraction value ({"fraction":"3"}) is wrapped in double quotes, so I declared fraction as chararray. That means the SAMPLE command can't be run on it directly, which is why I used the script above to get the result.
If you want to perform the sample operation, cast the fraction data to int first.
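If the goal is mainly to feed the JSON values into the script's -param options, another option is to parse the JSON outside Pig and build the command line from it. The sketch below assumes the JSON from the question and the pig invocation shown in the script's header comment. The delimiter is not present in the JSON, so the "," used here is an assumption, and the command is only printed, not executed.

```python
import json
import shlex

config_text = (
    '{"Name":"sampling","elementInfo":{"fraction":"3"},'
    '"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}'
)
config = json.loads(config_text)

# Map the JSON fields onto the parameter names used in Sampling.pig.
params = {
    "input": config["source"],
    "output": config["destination"],
    "fraction": config["elementInfo"]["fraction"],
    "delimiter": ",",  # assumption: the JSON carries no delimiter field
}

command = ["pig", "-x", "mapreduce", "-f", "Sampling.pig"]
for name, value in params.items():
    command += ["-param", "{0}={1}".format(name, value)]

# Print the command with shell quoting instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```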