I'm using Fluentd with Elasticsearch for logs from Kubernetes, but I noticed that some JSON logs cannot be correctly indexed because the nested JSON is stored as a string.
Logs from kubectl logs look like:
{"timestamp":"2016-11-03T15:48:12.007Z","level":"INFO","thread":"cromwell-system-akka.actor.default-dispatcher-4","logger":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger started","context":"default"}
But the logs saved to files under /var/log/containers/... have escaped quotes, which turns the nested JSON into a plain string and spoils the indexing:
{"log":"{\"timestamp\":\"2016-11-03T15:45:07.976Z\",\"level\":\"INFO\",\"thread\":\"cromwell-system-akka.actor.default-dispatcher-4\",\"logger\":\"akka.event.slf4j.Slf4jLogger\",\"message\":\"Slf4jLogger started\",\"context\":\"default\"}\n","stream":"stdout","time":"2016-11-03T15:45:07.995443479Z"}
I'm trying to get logs that look like this:
{
  "log": {
    "timestamp": "2016-11-03T15:45:07.976Z",
    "level": "INFO",
    "thread": "cromwell-system-akka.actor.default-dispatcher-4",
    "logger": "akka.event.slf4j.Slf4jLogger",
    "message": "Slf4jLogger started",
    "context": "default"
  },
  "stream": "stdout",
  "time": "2016-11-03T15:45:07.995443479Z"
}
Can you suggest how to do this?
I ran into the same issue; however, I'm using Fluent Bit, the C version of Fluentd (which is written in Ruby). Since this is an older question, I'm answering for the benefit of others who find it.
Fluent Bit v0.13 addressed this issue. You can now specify which parser to use through Kubernetes annotations, and the parser can be configured to decode the log field as JSON.
- fluent-bit issue detailing the problem
- blog post about annotations for specifying the parser
- JSON parser documentation - the Docker container's log lines come out as JSON, but your application's logs are also JSON, so an additional decoder is needed.
The final parser with decoder looks like this:
[PARSER]
Name embedded-json
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
# Command | Decoder | Field | Optional Action
# =============|==================|=======|=========
Decode_Field_As escaped log do_next
Decode_Field_As json log
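With the parser defined, you tell Fluent Bit to use it for a given pod via an annotation. A minimal sketch (the pod name is a placeholder, and, if I recall correctly, the Kubernetes filter needs K8S-Logging.Parser On for annotations to be honored):
# Point this pod's logs at the embedded-json parser defined above (pod name is a placeholder)
kubectl annotate pods my-app-pod fluentbit.io/parser=embedded-json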
I am a rank beginner with jq, and I've been going through the tutorial, but I think there is a conceptual difference I don't understand. A common problem I encounter is that a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
In the tutorial, they do this:
# We can use jq to extract just the first commit.
$ curl 'https://api.github.com/repos/stedolan/jq/commits?per_page=5' | jq '.[0]'
Here is an example with one object; here, I'd like to get that whole object back (just as my_array=['foo']; my_array[0] would return foo in Python).
wget https://hacker-news.firebaseio.com/v0/item/8863.json
I can access and pretty-print the whole thing with .
$ cat 8863.json | jq '.'
{
"by": "dhouston",
"descendants": 71,
"id": 8863,
"kids": [
9224,
...
8876
],
"score": 104,
"time": 1175714200,
"title": "My YC app: Dropbox - Throw away your USB drive",
"type": "story",
"url": "http://www.getdropbox.com/u/2/screencast.html"
}
But trying to get the first element fails:
$ cat 8863.json | jq '.[0]'
jq: error (at <stdin>:0): Cannot index object with number
I get the same error with jq '.[0]' 8863.json, but strangely, echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0. What is the difference? Also, is this not the correct way to get the zeroth member of the JSON?
I've looked at other SO posts with this error message and at the manual, but I'm still confused. I think of the file as an array of JSON objects, and I'd like to get the first one. But it looks like jq works with something called a "stream" and does operations on all of it (say, returning a given field from every object).
Clarification:
Let's say I have 2 objects in my JSON:
{
"by": "pg",
"id": 160705,
"poll": 160704,
"score": 335,
"text": "Yes, ban them; I'm tired of seeing Valleywag stories on News.YC.",
"time": 1207886576,
"type": "pollopt"
}
{
"by": "dpapathanasiou",
"id": 16070,
"kids": [
16078
],
"parent": 16069,
"text": "Dividends don't mean that much: Microsoft in its dominant years (when they had 40%-plus margins and were raking in the cash) never paid a dividend (they did so only recently).",
"time": 1177355133,
"type": "comment"
}
How would I get the entire first object (lines 1-9) with jq?
Cannot index object with number
This error message says it all: you can't index objects with numbers. If you want to get the value of the by field, you need to do
jq '.by' file
Regarding
echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0.
note that echo sends the literal text 8863.json to jq, not the file's contents. Since you didn't specify the -R/--raw-input flag, jq tries to parse that text as JSON, reads 8863 as a number, and then chokes on .json, hence the parse error. (With -R, jq would see the JSON string "8863.json", and one cannot apply array indexing to JSON strings; to get the first character as a string, you'd write .[0:1].)
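To make the difference concrete, here is a quick sketch (assuming jq is on the PATH and 8863.json is the file downloaded above):
echo 8863.json | jq '.'          # parse error: the literal text 8863.json is not valid JSON
echo 8863.json | jq -R '.'       # "8863.json" (with -R the input line is read as a raw string)
echo 8863.json | jq -R '.[0:1]'  # "8" (first character of that string)
cat 8863.json | jq '.by'         # "dhouston" (the file's contents are parsed as JSON)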
If your input file consists of several separate entities, to get the first one:
jq -n 'input' file
or,
jq -n 'first(inputs)' file
To get the entity at a given index (0-based; for example, index 5):
jq -n 'nth(5; inputs)' file
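For instance, if the two objects from the clarification were saved as items.json (a hypothetical filename), a sketch of this approach would be:
jq -n 'input' items.json            # prints the first object (the "pollopt" one)
jq -n 'nth(1; inputs)' items.json   # prints the second object (the "comment" one)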
a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
As implied in @OguzIsmail's response, there are important differences between:
- a JSON file (i.e., a file containing exactly one JSON entity);
- a file containing a sequence (i.e., stream) of JSON entities;
- a file containing an array of JSON entities.
In the first two cases, you can write jq -n input to select the first entity, and in the case of an array of entities, jq .[0] will suffice.
(In JSON-speak, a "JSON object" is a kind of dictionary, and is not to be confused with JSON entities in general.)
If you have a bunch of JSON objects (whether as a stream or array or whatever), just looking at the first often doesn't really give an accurate picture of all of them. For getting a bird's-eye view of a bunch of objects, using a "schema inference engine" is often the way to go. For this purpose, you might like to consider my schema.jq schema inference engine. It's usually very simple to use, but of course how you use it will depend on whether you have a stream or an array of JSON entities. For basic details, see https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed; for related topics (e.g. verification), see the entry for JESS at https://github.com/stedolan/jq/wiki/Modules
Please note that schema.jq infers a structural schema that mirrors the entities under consideration. Such structural schemas have little in common with JSON Schema schemas, which you might also like to consider.
I am writing a bash script for uploading a certificate from a Linux server to Azure Key Vault using the armclient tool.
I'm following this guide on how to use armclient:
https://blogs.msdn.microsoft.com/appserviceteam/2016/05/24/deploying-azure-web-app-certificate-through-key-vault/
The command I want to run is this:
ARMClient.exe PUT /subscriptions/<Subscription Id>/resourceGroups/<Server Farm Resource Group>/providers/Microsoft.Web/certificates/<User Friendly Resource Name>?api-version=2016-03-01 "{'Location':'<Web App Location>','Properties':{'KeyVaultId':'<Key Vault Resource Id>', 'KeyVaultSecretName':'<Secret Name>', 'serverFarmId':'<Server Farm (App Service Plan) resource Id>'}}"
I have built a string that populates all the required fields:
putparm=$resolved_armapi" \"{'Location':'$resolved_locationid','Properties':{'KeyVaultId':'$resolved_keyvaultid','KeyVaultSecretName':'$certname','serverFarmId':'$resolved_farmid'}}"\"
When I echo the variable putparm, the result looks as expected (names/IDs X-ed out):
/subscriptions/f073334f-240f-4261-9db5-XXXXXXXXXXXXX/resourceGroups/XXXXXXXX/providers/Microsoft.Web/certificates/XXXX-XXXXX-XXXXX?api-version=2016-03-01 "{'Location':'Central US','Properties':{'KeyVaultId':'/subscriptions/f073334f-240f-4261-9db5-XXXXXXXXXXXXX/resourceGroups/XXXXXXXX/providers/Microsoft.KeyVault/vaults/XXXXXXXX','KeyVaultSecretName':'XXXX-XXXXX-XXXXX','serverFarmId':'/subscriptions/f073334f-240f-4261-9db5-XXXXXXXXXXXXX/resourceGroups/XXXXXXXX/providers/Microsoft.Web/serverfarms/ServicePlan59154b1c-XXXX'}}"
When I run armclient put $putparm in the script, I get this error:
"error": {
"code": "InvalidRequestContent",
"message": "The request content was invalid and could not be deserialized: 'Unterminated string. Expected delimiter: \". Path '',
line 1, position 21.'." }
But when I take the echoed output of the $putparm variable and run the command "manually" on the server, it works.
I guess it's something to do with the way Linux stores the variable and the API expecting JSON (or something...).
Happy for any help.
The way you define your variable putparm is wrong.
It is likely interpreted as a literal string and not as an object. Note that a simple string like "hello" is valid JSON data, but it is probably not what your server is expecting.
You should quote your variable correctly:
putparm="{\"Location\":\"$resolved_locationid\",\"Properties\":{\"KeyVaultId\":\"$resolved_keyvaultid\",\"KeyVaultSecretName\":\"$certname\",\"serverFarmId\":\"$resolved_farmid\"}}"
and use it like this:
armclient put "$resolved_armapi" "$putparm"
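As a quick sanity check (a sketch, assuming jq is installed), you can confirm that the variable now holds valid JSON before calling armclient:
echo "$putparm" | jq .   # pretty-prints the payload, or reports a parse error if the quoting is still wrong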
I was trying to use a JSON file as a small database. After registering a temporary table on the DataFrame, I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON as a one-liner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is the newlines.
UPD3
I found the following sentence in the docs:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such a format. Is there any workaround to get rid of the multi-line structure of the JSON, or to convert it to a one-liner?
Spark >= 2.2
Spark 2.2 introduced the multiLine option (originally named wholeFile), which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.
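Applied to the example from the question, a minimal Spark 2.2+ sketch looks like this (createOrReplaceTempView is the 2.x replacement for registerTempTable):
val df = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("/path/to/user.json")

df.printSchema()  // now shows age, id and name instead of _corrupt_record
df.createOrReplaceTempView("user_tt")
spark.sql("SELECT name FROM user_tt").show()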
Spark < 2.2
Well, using JSONL-formatted data may be inconvenient, but I will argue that it is not an issue with the API but with the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema, and without making some very specific assumptions about its formatting and shape, it is almost impossible to correctly identify top-level documents. Arguably this is the worst possible format to use in systems like Apache Spark. It is also quite tricky, and typically impractical, to write valid JSON from distributed systems.
That being said, if the individual files are valid JSON documents (either a single document or an array of documents), you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
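Alternatively, if you would rather convert the file to JSON Lines up front (one entity per line), a quick shell sketch with jq does the job (jq and the array-file name are assumptions here, not part of the original setup):
jq -c '.' /path/to/user.json > user.jsonl            # one compact line per top-level entity
jq -c '.[]' /path/to/users_array.json > users.jsonl  # explode a top-level array, one line per element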
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark SQL documentation).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
Are there any format standards for writing and parsing JSON log files?
The problem I see is that you can't have a "pure" JSON log file since you need matching brackets and trailing commas are forbidden. So while the following may be written by an application, it can't be parsed by standard JSON parsers:
[{date:'2012-01-01 02:00:01', severity:"ERROR", msg:"Foo failed"},
{date:'2012-01-01 02:04:02', severity:"INFO", msg:"Bar was successful"},
{date:'2012-01-01 02:10:12', severity:"DEBUG", msg:"Baz was notified"},
So you must have some conventions on how to structure your log files in a way that a parser can process them. The easiest thing would be "one log message object per line, newlines in string values are escaped". Are there any existing standards and tools?
You're not going to write a single JSON object per FILE; you're going to write a JSON object per LINE. Each line can then be parsed individually. You don't need to worry about trailing commas, enclosing the whole set of objects in brackets, etc. See http://blog.nodejs.org/2012/03/28/service-logging-in-json-with-bunyan/ for a more detailed explanation of what this can look like.
Also check out Fluentd http://fluentd.org/ for a neat toolset to work with.
Edit: this format is now called JSON Lines (jsonl), as pointed out by @Mnebuerquo below - see http://jsonlines.org/
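For illustration, the example from the question rewritten as JSON Lines is simply one self-contained, valid JSON object per line:
{"date": "2012-01-01 02:00:01", "severity": "ERROR", "msg": "Foo failed"}
{"date": "2012-01-01 02:04:02", "severity": "INFO", "msg": "Bar was successful"}
{"date": "2012-01-01 02:10:12", "severity": "DEBUG", "msg": "Baz was notified"}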
The log_formatter gem is the Ruby choice; as a collection of formatters, it now supports JSON formatters for plain Ruby logging and for Log4r.
It's simple to get started with in Ruby:
gem 'log_formatter'
require 'log_formatter'
require 'log_formatter/ruby_json_formatter'
logger.debug({data: "test data", author: 'chad'})
Result:
{
"source": "examples",
"data": "test data",
"author": "chad",
"log_level": "DEBUG",
"log_type": null,
"log_app": "app",
"log_timestamp": "2016-08-25T15:34:25+08:00"
}
For Log4r:
require 'log4r'
require 'log_formatter'
require 'log_formatter/log4r_json_formatter'
logger = Log4r::Logger.new('Log4RTest')
outputter = Log4r::StdoutOutputter.new(
"console",
:formatter => Log4r::JSONFormatter::Base.new
)
logger.add(outputter)
logger.debug( {data: "test data", author: 'chad'} )
Advanced usage: README
Full example code: examples
Trying to use a query with mongoexport results in an error, but the same query is evaluated by the mongo client without an error.
In the mongo client:
db.listing.find({"created_at":new Date(1221029382*1000)})
With mongoexport:
mongoexport -d event -c listing -q '{"created_at":new Date(1221029382*1000)}'
The generated error:
Fri Nov 11 17:44:08 Assertion: 10340:Failure parsing JSON string near:
$and: [ {
0x584102 0x528454 0x5287ce 0xa94ad1 0xa8e2ed 0xa92282 0x7fbd056a61c4
0x4fca29
mongoexport(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
mongoexport(_ZN5mongo8fromjsonEPKcPi+0x444) [0x528454]
mongoexport(_ZN5mongo8fromjsonERKSs+0xe) [0x5287ce]
mongoexport(_ZN6Export3runEv+0x7b1) [0xa94ad1]
mongoexport(_ZN5mongo4Tool4mainEiPPc+0x169d) [0xa8e2ed]
mongoexport(main+0x32) [0xa92282]
/lib/libc.so.6(__libc_start_main+0xf4) [0x7fbd056a61c4]
mongoexport(__gxx_personality_v0+0x3d9) [0x4fca29]
assertion: 10340 Failure parsing JSON string near: $and: [ {
But doing the multiplication beforehand for mongoexport:
mongoexport -d event -c listing -q '{"created_at":new Date(1221029382000)}'
works!
Why is mongo evaluating the queries differently in these two contexts?
The mongoexport command-line utility supports passing a query in JSON format, but you are trying to evaluate JavaScript in your query.
The JSON format was originally derived from JavaScript's object notation, but the contents of a JSON document can be parsed without eval()ing it in a JavaScript interpreter.
You should consider JSON as representing "structured data" and JavaScript as "executable code". So there are, in fact, two different contexts for the queries you are running.
The mongo command-line utility is an interactive JavaScript shell which includes a JavaScript interpreter as well as some helper functions for working with MongoDB. While the JavaScript object format looks similar to JSON, you can also use JavaScript objects, function calls, and operators.
Your example of 1221029382*1000 is a math expression that would be evaluated by the JavaScript interpreter if you ran it in the mongo shell; in JSON it is an invalid value for new Date, so mongoexport exits with a "Failure parsing JSON string" error.
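One practical workaround (a sketch, relying on the new Date(<milliseconds>) form that the question already shows working with mongoexport) is to do the arithmetic in the shell and pass only a literal value:
# Do the math in the shell, then hand mongoexport a plain millisecond literal
ts_ms=$((1221029382 * 1000))
mongoexport -d event -c listing -q "{\"created_at\": new Date($ts_ms)}"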
I also got this error doing a mongoexport, but for a different reason. I'll share my solution here though since I ended up on this SO page while trying to solve my issue.
I know it has little to do with this question, but the title of this post brought it up in Google, so since I was getting the exact same error I'll add an answer. Hopefully it helps someone.
I was trying to do a MongoId _id query in the Windows console. The problem was that I needed to wrap the JSON query in double quotes, and the ObjectId also had to be in double quotes (not single!). So I had to escape the ObjectId quotes.
mongoexport -u USERNAME -pPASSWORD -d DATABASE -c COLLECTION
--query "{_id : ObjectId(\"5148894d98981be01e000011\")}"
If I wrap the JSON query in single quote on Windows, I get this error:
ERROR: too many positional options
And if I use single quotes around the ObjectId, I get this error:
Assertion: 10340:Failure parsing JSON string near: _id
So, yeah. Good luck.
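For comparison, on a Linux/macOS shell the single-quote form works here, because the shell passes the inner double quotes through untouched (a sketch with the same placeholder credentials):
# Single quotes protect the inner double quotes from the shell
mongoexport -u USERNAME -pPASSWORD -d DATABASE -c COLLECTION --query '{_id : ObjectId("5148894d98981be01e000011")}'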