Parsing complex JSON in Pig?

I have a JSON file in the following format:
{ "_id" : "foo.com", "categories" : [], "h1" : { "bar==" : { "first" : 1281916800, "last" : 1316995200 }, "foo==" : { "first" : 1281916800, "last" : 1316995200 } }, "name2" : [ "foobarl.com", "foobar2.com" ], "rep" : null }
So, how do I parse this JSON in Pig?
Also, categories and rep can contain some characters and might not always be empty.
I made the following attempt:
a = load 'sample_json.json' using JsonLoader('id:chararray,categories:[chararray], hostt:{ (variable_a: {(first:int,last:int)})}, ns:[chararray],rep:chararray ');
But I get this error:
org.codehaus.jackson.JsonParseException: Unexpected character ('D' (code 68)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream@4795b8e9; line: 1, column: 50]
at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1291)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:385)
at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:306)
at org.codehaus.jackson.impl.Utf8StreamParser._handleUnexpectedValue(Utf8StreamParser.java:1582)
at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:386)
at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:173)
at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

You can use the Elephant Bird Pig jar for parsing JSON. It can parse all sorts of JSON data.
Here are some examples of parsing JSON with Elephant Bird Pig using this jar:
https://github.com/twitter/elephant-bird/tree/master/examples/src/main/pig
It doesn't break even if an expected JSON tag isn't present.
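For reference, a minimal sketch of that approach; the jar names and versions below are placeholders for whatever build you actually have on the classpath, and the field names follow the sample record above:
REGISTER 'elephant-bird-core-4.1.jar';
REGISTER 'elephant-bird-pig-4.1.jar';
REGISTER 'elephant-bird-hadoop-compat-4.1.jar';
REGISTER 'json-simple-1.1.jar';

-- '-nestedLoad' keeps nested objects and arrays as nested maps/bags
a = LOAD 'sample_json.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

-- Each record comes back as a single map ($0); missing keys simply yield null
b = FOREACH a GENERATE (chararray) $0#'_id' AS id, $0#'rep' AS rep, $0#'h1' AS h1;
DUMP b;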

Related

Promtail: how to trim the non-JSON part from a log

I have a multiline log that consists of a correct JSON part (one or more lines), followed by a stack trace.
Is it possible to parse the first part of the log as JSON, and for the stack trace create a new label ("stackTrace" for example) and put all the lines after the first part there?
Unfortunately, the logs can contain a different number of fields in the JSON part, so it is unlikely they can be parsed with a regex.
{ "timestamp" : "2022-03-28 14:33:00,000", "logger" : "appLog", "level" : "ERROR", "thread" : "ktor-8080", "url" : "/path","method" : "POST","httpStatusCode" : 400,"callId" : "f7a22bfb1466","errorMessage" : "Unexpected JSON token at offset 184: Encountered an unknown key 'a'. Use 'ignoreUnknownKeys = true' in 'Json {}' builder to ignore unknown keys. JSON input: { \"entityId\" : \"TGT-8c8d950036bf\", \"processCode\" : \"test\", \"tokenType\" : \"SSO_CCOM\", \"ttlMills\" : 600000, \"a\" : \"a\" }" }
com.example.info.core.WebApplicationException: Unexpected JSON token at offset 184: Encountered an unknown key 'a'.
Use 'ignoreUnknownKeys = true' in 'Json {}' builder to ignore unknown keys.
JSON input: {
"entityId" : "TGT-8c8d950036bf",
"processCode" : "test",
"tokenType" : "SSO_CCOM",
"ttlMills" : 600000,
"a" : "a"
}
at com.example.info.signtoken.SignTokenApi$signTokenModule$2$1$1.invokeSuspend(SignTokenApi.kt:94)
at com.example.info.signtoken.SignTokenApi$signTokenModule$2$1$1.invoke(SignTokenApi.kt)
at com.example.info.signtoken.SignTokenApi$signTokenModule$2$1$1.invoke(SignTokenApi.kt)
at io.ktor.util.pipeline.SuspendFunctionGun.loop(SuspendFunctionGun.kt:248)
at io.ktor.util.pipeline.SuspendFunctionGun.proceed(SuspendFunctionGun.kt:116)
at io.ktor.util.pipeline.SuspendFunctionGun.execute(SuspendFunctionGun.kt:136)
at io.ktor.util.pipeline.Pipeline.execute(Pipeline.kt:78)
at io.ktor.routing.Routing.executeResult(Routing.kt:155)
at io.ktor.routing.Routing.interceptor(Routing.kt:39)
at io.ktor.routing.Routing$Feature$install$1.invokeSuspend(Routing.kt:107)
at io.ktor.routing.Routing$Feature$install$1.invoke(Routing.kt)
at io.ktor.routing.Routing$Feature$install$1.invoke(Routing.kt)
Update:
I've made a Promtail pipeline like this:
scrape_configs:
  - job_name: Test_AppLog
    static_configs:
      - targets:
          - ${HOSTNAME}
        labels:
          job: INFO-Test_AppLog
          host: ${HOSTNAME}
          __path__: /home/adm_web/app.log
    pipeline_stages:
      - multiline:
          firstline: ^\{\s?\"timestamp\"
          max_lines: 128
          max_wait_time: 1s
      - match:
          selector: '{job="INFO-Test_AppLog"}'
          stages:
            - regex:
                expression: '(?P<log>^\{ ?\"timestamp\".*\}[\s])(?s)(?P<stacktrace>.*)'
            - labels:
                log:
                stacktrace:
            - json:
                expressions:
                  logger: logger
                  url: url
                  method: method
                  statusCode: httpStatusCode
                  sla: sla
                source: log
But in fact the json stage does not work; the result in Grafana is only two fields, log and stacktrace.
Any help would be appreciated.
If the format is always like this, maybe the easiest way is to analyze the whole log string, find the index of the last "}" character, and split the string at that index + 1; the JSON should end up in the first part of the output and the stack trace in the second.
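A minimal Python sketch of that idea (the function name is just for illustration; see the caveat in the comments if the stack trace can itself contain "}"):
def split_log_entry(entry):
    """Split a multiline log entry into its leading JSON part and the trailing stack trace."""
    # Index of the last '}' in the whole string, as suggested above.
    # Caveat: if the stack trace itself embeds '}' (as the example message
    # "JSON input: { ... }" above does), limit the search to the first line,
    # e.g. entry.rfind('}', 0, entry.find('\n')).
    idx = entry.rfind('}')
    if idx == -1:
        return entry, ''
    return entry[:idx + 1], entry[idx + 1:].lstrip('\n')

# Example:
json_part, stack = split_log_entry('{"level": "ERROR"}\ncom.example.Exception\n\tat com.example.Foo')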

UnicodeDecodeError from urllib2 output from a webpage with no non-unicode characters

I am trying to read data off an API webpage using urllib2 in Python 2.7. I am using the following lines to read the page:
url = 'https://api.edamam.com/api/nutrition-data?app_id=<my_app_id>&app_key=<my_app_key>&ingr=1cheeseburger'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
These lines give me this error (the error occurs on the last line of the above code):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 0: invalid start byte
I understand that this error means there are non-unicode characters in json_obj, but I am not sure why that is the case, because the same URL opens in a browser and the first few lines on the webpage look like the following:
{
  "uri" : "http://www.edamam.com/ontologies/edamam.owl#recipe_2a58ff3e1fec41d79da72f0be446baaa",
  "calories" : 312,
  "totalWeight" : 119.0,
  "dietLabels" : [ "BALANCED" ],
  "healthLabels" : [ "PEANUT_FREE", "TREE_NUT_FREE", "ALCOHOL_FREE" ],
  "cautions" : [ ],
  "totalNutrients" : {
    "ENERC_KCAL" : {
      "label" : "Energy",
      "quantity" : 312.96999999999997,
      "unit" : "kcal"
    },
As you can see, there are no non-unicode characters on this webpage, so I don't really follow what is going on.
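One way to narrow this down is to inspect the raw response before decoding it. This is only a diagnostic sketch in Python 2, reusing the url from the question; whether the body is gzip-compressed or declared with a non-UTF-8 charset is an assumption to check against the headers:
import json
import urllib2

resp = urllib2.urlopen(url)                        # url as defined above
raw = resp.read()                                  # read the raw bytes first
print resp.info().getheader('Content-Type')        # declared charset, if any
print resp.info().getheader('Content-Encoding')    # e.g. 'gzip' would explain invalid UTF-8 bytes
print repr(raw[:20])                               # inspect the first bytes (where 0xb5 shows up)
data = json.loads(raw.decode('utf-8'))             # once the real encoding is known, decode explicitly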

Can't validate JSON when using a different language. Error: invalid characters found

This is my first time trying to create a JSON file.
I'm trying to create a JSON file in a language other than English, but when I try to validate it, it shows "Error: Invalid characters found".
I tried this:
{
"data": [
{
"id": "1",
"title": "Oru Velli Thaaram Vaana Veedhiyil",
"lyrics": "ഒരു വെള്ളിത്താരം വനവീഥിയിൽ തെളിയവേ
കുളിരീറൻ കാറ്റും കുഞ്ഞുതരാട്ട് മൂളവേ
ഇരുളിനലകൾ മൂടും ധരയിതിലൊരു ദീപം
കദനഭാരമെല്ലാം നീക്കിടുന്ന സ്നേഹം
പിറന്നു മണ്ണിലുഷസ്സിൻ ശോഭ പോലെ
(ഒരു വെള്ളിത്താരം…
മരുഭൂവിൽ അലയുമ്പോൾ ആ താരം മുൻപേ
മറയാതെ രാജക്കൾക്കതുമാർഗമായി
മരുഭൂവിൽ അലയുമ്പോൾ ആ താരം മുൻപേ
മറയാതെ രാജക്കൾക്കതുമാർഗമായി
പുൽക്കൂടും തേടിത്തേടി ബെത്ലഹേമിലവരണയുമ്പോൾ
ഗീതങ്ങൾ പാടിപ്പാടി വാനദൂതരും അണയുന്നൂ
തിരുസുതനെ കാണുംനേരം പാടുന്നു ഗ്ലോറിയ …
(ഒരു വെള്ളിത്താരം…
ശാരോനിൻ താഴ്വാരം തഴുകുന്ന കാറ്റെ
വരുമോ എൻ നാഥൻറെ അരികിൽ നീ മെല്ലെ
ശാരോനിൻ താഴ്വാരം തഴുകുന്ന കാറ്റെ
വരുമോ എൻ നാഥൻറെ അരികിൽ നീ മെല്ലെ
തഴുകൂ നിൻ വിരലാൽ നെറുകിൽ സ്നേഹനാഥനെ ആലോലം
പാടൂ നൽ ശ്രുതിയാൽ കാതിൽ സാന്ദ്രമാനന്ദ സംഗീതം
ഈ രാവിൽ പാരാകെ പാടുന്നു ഗ്ലോറിയ
(ഒരു വെള്ളിത്താരം…",
},
{
"id": "2",
"title": "Pukootil Vannu Jaathanayi",
"lyrics": "പുൽക്കൂട്ടിൽ വന്നു ജാതനായി
നക്ഷത്രം ഇന്ന് മിന്നി നിന്നു
ക്രിസ്മസ് രാവിൻറെ ഗാനമായി
വിണ്ണിൽ ആനന്ദമേളമായി താരകം ദീപമായ്
കൺകളിൽ തിളങ്ങി നിന്നു (2 )
ദൂതരാ വീണകൾ മീട്ടിടുന്നിതാ
ലോകരാ കീർത്തനം കേട്ടിടുന്നിതാ
ദേവദാരു പൂത്തു പാതിരാവു പെയ്തു
മഞ്ഞുതുള്ളി വീണവീഥി മിന്നിടുന്നു
( പുൽക്കൂട്ടിൽ)
വിദ്വരോ കാഴ്ചകൾ നല്കിടുന്നിതാ
വിന്നതിൽ നോക്കി സംപ്രീതരായിതാ
കീറ്റുശീല തന്നിൽ ദിവ്യശോഭ കണ്ടു
ആട്ടിടയരെത്തി ആർത്തു പാടിടുന്നു
( പുൽക്കൂട്ടിൽ)",
}
]
}
The error shows as INVALID JSON, Invalid characters found.
Please help me resolve this problem.
The problem comes from:
"lyrics": "ഒരു വെള്ളിത്താരം വനവീഥിയിൽ തെളിയവേ
Error type:
Expecting 'STRING', 'NUMBER', 'NULL', 'TRUE', 'FALSE', '{', '[', got 'undefined'
Reason:
There are line breaks inside your string; encode them as \n. A JSON string is a sequence of zero or more Unicode characters, but literal control characters such as newlines are not allowed inside it. (The trailing commas after the "lyrics" values are also invalid JSON.)
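A minimal Python sketch of the fix (the song data here is just a shortened stand-in for your real file; json.dumps escapes the embedded newlines as \n automatically):
# -*- coding: utf-8 -*-
import json

# Build the lyrics with ordinary newlines in the string...
lyrics = u"ഒരു വെള്ളിത്താരം വനവീഥിയിൽ തെളിയവേ\nകുളിരീറൻ കാറ്റും കുഞ്ഞുതരാട്ട് മൂളവേ"
song = {"id": "1", "title": "Oru Velli Thaaram Vaana Veedhiyil", "lyrics": lyrics}

# ...and let the serializer emit valid JSON: the newlines come out as \n
print(json.dumps({"data": [song]}, ensure_ascii=False, indent=2))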

Elasticsearch bulk data insertion

In my Node app I am using Elasticsearch as my backend. I am trying to insert data from a JSON file, but I get an error.
My JSON:
{"index":{"_index":"mfissample", "_type":"place_mfi", "_id": "1"}}
{"PAR" : 42.31,"Center":"xx","District":"yy","Country" : "vv","GLP" : 13073826.63,"State" : "zz","SSScore" :null, "location":"80.102134,12.897401"}
{"index":{"_index":"mfissample", "_type":"place_mfi", "_id": "2"}}
{"PAR" : 42.31,"Center" : "xx","District" : "yy","Country" : "zz","GLP" : 13073826.63,"State" : "vv","SSScore" :null,
"location":"80.102134,12.897401"}
My command:
curl -XPOST 'http://localhost:9200/_bulk' --data-binary @jsonbulk.json
The error:
{"error":"JsonParseException[Unexpected character (':' (code 58)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')\n at [Source: [B#792c4b55; line: 1, column: 12]]","status":500}
Remove the newline after "SSScore" :null, and before "location":"80.102134,12.897401"; the bulk API requires each action line and each document to sit on exactly one line.
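For reference, the corrected jsonbulk.json would look like this (one line per action, one line per document, with a trailing newline at the end of the file):
{"index":{"_index":"mfissample", "_type":"place_mfi", "_id": "1"}}
{"PAR" : 42.31,"Center":"xx","District":"yy","Country" : "vv","GLP" : 13073826.63,"State" : "zz","SSScore" :null, "location":"80.102134,12.897401"}
{"index":{"_index":"mfissample", "_type":"place_mfi", "_id": "2"}}
{"PAR" : 42.31,"Center" : "xx","District" : "yy","Country" : "zz","GLP" : 13073826.63,"State" : "vv","SSScore" :null, "location":"80.102134,12.897401"}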

Aptana gives error with JSON format

Format:
{
  "lastUpdate" : "20/9/2012-12:12",
  "data" : [{
    "user" : "_name_",
    "username" : "_fullname_",
    "photoURL" : "_url_"
  }, {
    "user" : "_name_",
    "username" : "_fullname_",
    "photoURL" : "_url_"
  }, {
    "user" : "_name_",
    "username" : "_fullname_",
    "photoURL" : "_url_"
  }]
}
Aptana gives errors at the colons:
Screenshot: Aptana JSON format
Why is that? It seems I'm not having any problems receiving and processing the data.
[EDIT 1] Error given: Syntax Error: unexpected token ":"
In Aptana, JSON is parsed as JSON only when you create/open a file with the .json extension.
When you have a JSON object inside a .js file, only the JavaScript parser runs, and that is why you see the error: a bare { "key" : ... } block is not valid JavaScript, so the ":" is reported as an unexpected token.
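A small sketch of the difference, assuming the data is meant to live in a .js file rather than in a separate .json file:
// A bare object literal at statement level is read as a block, so the
// first ":" after a string "label" is a syntax error for the JS parser.
// Assigning the object to a variable parses fine:
var data = {
    "lastUpdate" : "20/9/2012-12:12",
    "data" : [ { "user" : "_name_", "username" : "_fullname_", "photoURL" : "_url_" } ]
};
// Alternatively, keep the payload in its own .json file (or fetch it at
// runtime and call JSON.parse on the response string).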