Using Pig (0.14), I'm interested in the following use case: I wish to process my raw JSON into multiple output directories based upon their key and store the result (aggregated data) as JSON. The JSON has an evolving (dynamic) schema, which is read in with elephant-bird and (so far) has not caused any problems.
I can either store the output in the correct directories (using MultiStorage) or as JSON (using JsonStorage), but not both. As far as I can tell, there is no publicly available UDF for this purpose.
Have I missed something, or is it just a case of writing my own UDF to perform this? This seems like a simple use case that I would have thought would be supported.
For those who are looking for an answer to this; a UDF is required.
It is possible (and relatively straightforward) to combine the piggybank UDFs JsonStorage and MultiStorage to create a pseudo "JsonMultiStorage" class.
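A purely hypothetical usage sketch of such a store function; the jar, class name and constructor arguments below are invented, and the arguments simply mirror piggybank's MultiStorage (the output root and the index of the key field):

-- Hypothetical: JsonMultiStorage writes JSON while splitting output directories by the first field
REGISTER json-multi-storage.jar;
STORE aggregated INTO '/data/out' USING com.example.JsonMultiStorage('/data/out', '0');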
I'm recording request and response headers and bodies for all traffic to our API, and from our API to 3rd-party services, into S3 as tiny objects.
I want to be able to query this data infrequently. For example (pseudo-code):
select $.cars[0].color from "objects" where object_path in (....);
Other info:
Many "objects" in S3 won't have a valid path to $.cars[0].color (it's just one example).
I hope to not use Glue.
Cost is important - this is something that will be queried very infrequently. Configuring an Elasticsearch or similar solution is well out of budget for this use case.
I hope to not define my own set of schemas (this is simply not feasible).
Athena says it can search unstructured JSON. I'm having trouble creating a proof-of-concept to show this is true.
Is Athena right for me? Am I missing a better solution?
I think Athena will work for your case.
Athena handles missing properties in JSON objects. For example, if you define the cars column as array<struct<color:string>> (see the sketch after this list):
the property can be missing ⇒ SELECT cars … will be NULL
it can be an empty list ⇒ SELECT cars[1] … (Athena arrays start at 1) will result in an error, but element_at(cars, 1) and try(cars[1]) will return NULL
the object may be missing the color property ⇒ SELECT cars[1].color … will be NULL
for completely free-form JSON define the column as string and use the JSON functions to query it.
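A minimal sketch of that table and the resulting queries; the table name, bucket, SerDe choice and the free-form payload column are assumptions for illustration:

CREATE EXTERNAL TABLE api_logs (
  cars array<struct<color:string>>,
  payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/api-logs/';

-- cars[1].color is NULL when an object has no color property;
-- "$path" is Athena's built-in pseudo-column holding each row's source object
SELECT cars[1].color FROM api_logs
WHERE "$path" IN ('s3://example-bucket/api-logs/obj1.json');

-- element_at returns NULL instead of failing when the list is empty
SELECT element_at(cars, 1) FROM api_logs;

-- a column declared as string can hold arbitrary JSON and be queried with the JSON functions
SELECT json_extract_scalar(payload, '$.cars[0].color') FROM api_logs;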
Glue is not necessary. Create the table manually, from your application, or with CloudFormation, and configure it to use partition projection; then you will not have to think about Glue crawlers at all.
Athena doesn't cost anything when you don't use it, and if you will query only infrequently this is key. Make sure to compress your data, and partition it in a way that supports your query patterns (e.g. by date or month if you most often will query recent data).
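For example, a date-partitioned table with partition projection might look roughly like this (all names, the date range and the S3 layout are hypothetical):

CREATE EXTERNAL TABLE api_logs_by_day (
  payload string
)
PARTITIONED BY (day string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/api-logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.day.type' = 'date',
  'projection.day.range' = '2020/01/01,NOW',
  'projection.day.format' = 'yyyy/MM/dd',
  'storage.location.template' = 's3://example-bucket/api-logs/${day}/'
);

Queries that filter on day then only scan the matching prefixes, and no crawler or MSCK REPAIR is ever needed.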
Not sure what you mean by having to define your "own set of schemas", so perhaps you can clarify that part?
Here is my JSON code.
{
"user_email": "{User.Email}",
"activity_date": "{Lead.LastAction.Date}",
"record_id": "{Lead.Id}-{Lead.LastAction.Date}",
"action_type": "{Lead.LastAction}",
"milestone": "{Lead.Milestone}",
"date_added": "{Lead.Date}"
}
Is it possible to add calculations in the code?
For example, can I add a line where the date_added is subtracted from activity_date?
No: JSON is just a way to transport JS objects.
You can do that while you build the JSON in your native language (for example in PHP or JS server-side), basically creating the JSON object with the result of the calculation already in it.
JSON by itself cannot do that; it's just a data format, totally passive, like a text file. (If you happen to use JSONP the story would be a bit different and it might be possible, but using JSONP for such things would step into 'hack/exploit' territory, and it probably should not be used that way.)
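For example, if you control the code that produces this JSON, a rough server-side sketch in plain JavaScript could look like this (the variable names and the day-based difference are assumptions for illustration):

// Sketch: compute the value first, then emit it as ordinary JSON.
const activityDate = new Date(lead.lastActionDate);  // assumed input values
const dateAdded = new Date(lead.dateAdded);
const payload = {
  activity_date: activityDate.toISOString(),
  date_added: dateAdded.toISOString(),
  // difference in whole days, calculated here rather than "inside" the JSON
  inactivity_period: Math.round((activityDate - dateAdded) / 86400000)
};
const json = JSON.stringify(payload);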
However, I see you are not using only JSON - there is some extra markup like {User.Email}. This is totally outside the JSON spec, so clearly you are using some form of text-templating engine. These can be quite intelligent at times. Check that path: see which one you are using and what its features are; maybe you can write a custom function or expression to do that subtraction for you. Maybe, just maybe, it's as easy as
"inactivity_period": "{Lead.LastAction.Date - Lead.Date}"
or
"inactivity_period": "{myFunctionThatIWrote(Lead.LastAction.Date, Lead.Date)}"
but that all depends on the templating engine.
Is it possible, using NiFi, to load a JSON file into a structured table?
I've retrieved the following weather forecast data (from 6000 weather stations), which I'm currently loading into HDFS. It all appears on one line:
{"SiteRep":{"Wx":{"Param":[{"name":"F","units":"C","$":"Feels Like Temperature"},{"name":"G","units":"mph","$":"Wind Gust"},{"name":"H","units":"%","$":"Screen Relative Humidity"},{"name":"T","units":"C","$":"Temperature"},{"name":"V","units":"","$":"Visibility"},{"name":"D","units":"compass","$":"Wind Direction"},{"name":"S","units":"mph","$":"Wind Speed"},{"name":"U","units":"","$":"Max UV Index"},{"name":"W","units":"","$":"Weather Type"},{"name":"Pp","units":"%","$":"Precipitation Probability"}]},"DV":{"dataDate":"2017-01-12T22:00:00Z","type":"Forecast","Location":[{"i":"14","lat":"54.9375","lon":"-2.8092","name":"CARLISLE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"50.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"WNW","F":"-3","G":"25","H":"67","Pp":"0","S":"13","T":"2","V":"EX","W":"1","U":"1","$":"720"}}},{"i":"22","lat":"53.5797","lon":"-0.3472","name":"HUMBERSIDE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"24.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"NW","F":"-2","G":"43","H":"63","Pp":"3","S":"25","T":"4","V":"EX","W":"3","U":"1","$":"720"}}}, .....
Ideally, I want this structured into a 6000-row table.
I've tried writing a schema to pass the above into Pig, but haven't been successful, probably because I'm not familiar enough with JSON to translate it correctly.
Casting around for an easy way to add some structure to the data, I've spotted that there's a PutHBaseJson processor in NiFi.
Can anyone advise if this PutHBaseJson processor would work with the above data structure? And if so, can anyone point me towards a decent tutorial to give me a starting point on the configuration?
Greatly appreciate any guidance.
You probably want to use the SplitJson processor to split the 6000-record JSON structure into 6000 individual flowfiles. If you need to "inject" the parameter definitions from the top-level response, you can use a ReplaceText or JoltTransformJSON operation to manipulate the individual JSON records. There is a good article by Yolanda Davis describing how to perform Jolt transforms (JSON -> JSON) in NiFi.
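For the structure shown in the question, the SplitJson processor's JsonPath Expression property would point at the per-station array, roughly:

$.SiteRep.DV.Location

Each resulting flowfile then contains a single station's object, ready for any per-record Jolt transform and the HBase step below.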
Once you have the individual flowfiles containing a single JSON record, putting them into HBase is very easy. Bryan Bende wrote an article describing the necessary configurations for the PutHBaseJson processor.
I have a huge number (35k) of small (16 KB) JSON files stored in an S3 bucket. I need to load them into a DataFrame for further processing; here is my code for the extract:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data in a temp table and using it for further operations (exploding nested structures into separate DataFrames):
jsonData.registerTempTable("jsonData")
So now I have an auto-generated schema for this deeply nested DataFrame.
With the above code I have terrible performance issues. I presume this is caused by not using sc.parallelize during the bucket load; moreover, I'm pretty sure that the schema auto-generation in the read.json() method is taking a lot of time.
Questions:
How should my bucket load look in order to be more efficient and faster?
Is there any way to declare this schema in advance (I need to work around the case class tuple problem, though) to avoid auto-generation?
Does filtering data during the load make sense, or should I simply load everything and filter afterwards?
Found so far:
sqlContext.jsonRDD(rdd, schema)
It handles the auto-generated schema part, but IntelliJ screams about a deprecated method; is there any alternative for it?
As an alternative to a case class, use a custom class that implements the Product interface; the DataFrame will then use the schema exposed by your class members without the case class constraints. See the in-line comment here: http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
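You can also hand the reader an explicit schema so that the inference pass over all 35k files is skipped; a rough sketch, with the field names and types guessed from the filters in the question:

import org.apache.spark.sql.types._
import sqlContext.implicits._  // for the $"..." column syntax

// Declare only the fields you actually need; names and types here are assumptions.
val schema = StructType(Seq(
  StructField("nod1", StructType(Seq(
    StructField("filter1", StringType)
  ))),
  StructField("nod2", StructType(Seq(
    StructField("subNode1", StructType(Seq(
      StructField("subSubNode2", StructType(Seq(
        StructField("created", ArrayType(StringType))
      )))
    )))
  )))
))

val jsonData = sqlContext.read
  .schema(schema)                      // no sampling pass to infer the schema
  .json("s3a://bucket/dir1/dir2")
  .where($"nod1.filter1" === "filterValue")
  .where($"nod2.subNode1.subSubNode2.created"(0) === "filterValue2")

With an explicit schema, Spark does not need a separate pass over all the objects just to infer the structure, which is usually the dominant cost with tens of thousands of small files.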
If your JSON is composed of unrooted fragments, you could use s3distcp to group the files and concatenate them into fewer files. Also try the s3a protocol, as it performs better than s3n.
I would really appreciate any help.
I would like to convert this list
[[{id1,1},{id2,2},{id3,3},{id4,4}],[{id1,5},{id2,6},{id3,7},{id4,8}],[...]]
to a JSON object.
Need some inspiration :)
please help.
Thank you.
Since you asked for inspiration, I can imagine two directions you can take:
You can write code to hand-roll your own JSON, which, if your need is modest enough, can be a very lightweight and appropriate solution. It would be pretty simple Erlang to take that one data structure and convert it to the JSON:
"[[{\"id1\":1},{\"id2\":2},{\"id3\":3},{\"id4\":4}],[{\"id1\":5},{\"id2\":6} {\"id3\":7},{\"id4\":8}]]"
You can produce a data structure that mochiweb's mochijson:encode/1 and decode/1 can handle. I took your list and hand-coded it to JSON, getting the string above.
X = "[[{\"id1\":1},{\"id2\":2},{\"id3\":3},{\"id4\":4}],[{\"id1\":5},{\"id2\":6},{\"id3\":7},{\"id4\":8}]]".
Then I used mochijson:decode(X) to see what structure mochiweb uses to represent JSON (too lazy to look at the documentation).
Y = mochijson:decode(X).
{array,[{array,[{struct,[{"id1",1}]},
                {struct,[{"id2",2}]},
                {struct,[{"id3",3}]},
                {struct,[{"id4",4}]}]},
        {array,[{struct,[{"id1",5}]},
                {struct,[{"id2",6}]},
                {struct,[{"id3",7}]},
                {struct,[{"id4",8}]}]}]}
So, if you can create this slightly more elaborate data structure than the one you are using, then you can get the JSON by using mochijson:encode/1. Here is an example embedded in an io:format statement so that it prints as a string; often you would use io_lib:format/2 instead, depending on your application.
io:format("~s~n",[mochijson:encode(Y)]).
[[{"id1":1},{"id2":2},{"id3":3},{"id4":4}],[{"id1":5},{"id2":6},{"id3":7},{"id4":8}]]