I crawled a lot of JSON files into a data folder, all named by timestamp (./data/2021-04-05-12-00.json, ./data/2021-04-05-12-30.json, ./data/2021-04-05-13-00.json, ...).
Now I'm trying to use the ELK stack to load these ever-growing JSON files.
Each JSON file is pretty-printed like:
{
  "datetime": "2021-04-05 12:00:00",
  "length": 3,
  "data": [
    {
      "id": 97816,
      "num_list": [1, 2, 3],
      "meta_data": "{'abc', 'cde'}",
      "short_text": "This is data 97816"
    },
    {
      "id": 97817,
      "num_list": [4, 5, 6],
      "meta_data": "{'abc'}",
      "short_text": "This is data 97817"
    },
    {
      "id": 97818,
      "num_list": [],
      "meta_data": "{'abc', 'efg'}",
      "short_text": "This is data 97818"
    }
  ]
}
I tried using the Logstash multiline plugin to parse the JSON files, but it seems to handle each file as one event. Is there any way to turn each record in the JSON data field into its own event?
Also, what is the best practice for loading multiple, growing, pretty-printed JSON files into ELK?
Using multiline is correct if you want to handle each file as one input event.
Then you need to leverage the split filter in order to create one event for each element in the data array:
filter {
  split {
    field => "data"
  }
}
So Logstash reads one file as a whole and passes its content as a single event to the filter layer; the split filter shown above then spawns one new event for each element in the data array.
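For reference, a minimal pipeline sketch that ties this together might look like the following; the file path, the multiline pattern (which assumes only each file's top-level braces start at column 0), the flush interval, and the Elasticsearch settings are all assumptions to adapt:
input {
  file {
    path => "/path/to/data/*.json"            # your ./data folder (assumption)
    start_position => "beginning"
    sincedb_path => "/var/lib/logstash/sincedb-data"
    codec => multiline {
      pattern => "^\{"                         # an unindented '{' starts a new document
      negate => true
      what => "previous"
      auto_flush_interval => 5                 # flush the last document after 5s of inactivity
      max_lines => 10000                       # raise if your files are long
    }
  }
}

filter {
  json {
    source => "message"                        # parse the whole pretty-printed document
  }
  split {
    field => "data"                            # one event per element of the data array
  }
  mutate {
    remove_field => ["message"]                # drop the raw text once parsed
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]         # assumption
    index => "crawled-data"
  }
}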
Related
I have a JSON file that I need to move to Cosmos DB. I currently have a PowerShell script that modifies this file into the proper format to be used in a Data Flow or Copy activity in Azure Data Factory. However, I was wondering if there is a way to do all of these modifications in Azure Data Factory without using the PowerShell script.
The PowerShell script can manipulate a 50 MB file in a matter of seconds. I would like similar speed if we build something directly in Azure Data Factory.
Without the modification, I get an error because of the "#" sign. Furthermore, if I want to use companyId as my partition key, that is not allowed because it is inside of an array.
The current JSON file looks similar to the below:
{
"Extract": {
"consumptionInfo": {
"Name": "Test Stuff",
"createdOnTimestamp": "20200101161521Z",
"Version": "1.0",
"extractType": "Incremental",
"extractDate": "20200101113514Z"
},
"company": [{
"company": {
"#action": "create",
"companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
"Status": "1",
"StatusName": "1 - Test - Calendar"
}
}]
}
}
I would like it converted to the below:
{
"action": "create",
"companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
"Status": "1",
"StatusName": "1 - Test - Calendar"
}
Create a new data flow that reads in your JSON file. Add a Select transformation to choose the properties you wish to send to Cosmos DB. If some of those properties are embedded inside of an array, then first use Flatten. You can also use the Select transformation to rename "#action" to "action".
Data Factory / Data Flow doesn't work well with nested JSON files. In my experience the workaround is a little complex but works well:
1. Source 1 + Flatten activity 1 to flatten the data under the key 'Extract'.
2. Source 2 (same as Source 1) + Flatten activity 2 to flatten the data under the key 'company'.
3. Add a Union activity in the Source 1 flow to join the data after Flatten activity 2.
4. Create a Derived Column to filter the columns/keys you want after the Union activity.
5. Then create the Azure Cosmos DB sink.
Using my Scala HTTP Client I retrieved a response in JSON format from an API GET call.
My end goal is to write this JSON content to an AWS S3 bucket in order to make it available as a table on RedShift running a simple AWS Glue crawler.
My thinking is to parse this JSON message and somehow convert it into a Spark DataFrame, so that later on I can save it to my preferred S3 location in the format of .csv, .parquet, or whatever.
The JSON file looks like this:
{
"response": {
"status": "OK",
"start_element": 0,
"num_elements": 100,
"categories": [
{
"id": 1,
"name": "Airlines",
"is_sensitive": false,
"last_modified": "2010-03-19 17:48:36",
"requires_whitelist_on_external": false,
"requires_whitelist_on_managed": false,
"is_brand_eligible": true,
"requires_whitelist": false,
"whitelist": {
"geos": [],
"countries_and_brands": []
}
},
{
"id": 2,
"name": "Apparel",
"is_sensitive": false,
"last_modified": "2010-03-19 17:48:36",
"requires_whitelist_on_external": false,
"requires_whitelist_on_managed": false,
"is_brand_eligible": true,
"requires_whitelist": false,
"whitelist": {
"geos": [],
"countries_and_brands": []
}
}
],
"count": 148,
"dbg_info": {
"warnings": [],
"version": "1.18.1621",
"output_term": "categories"
}
}
}
The content I would like to map to a DataFrame is the one contained in the "categories" JSON array.
I have managed to parse the message using json4s's JsonMethods.parse this way:
val parsedJson = parse(request) \\ "categories"
Obtaining the following:
output: org.json4s.JValue = JArray(List(JObject(List((id,JInt(1)), (name,JString(Airlines)), (is_sensitive,JBool(false)), (last_modified,JString(2010-03-19 17:48:36)), (requires_whitelist_on_external,JBool(false)), (requires_whitelist_on_managed,JBool(false)), (is_brand_eligible,JBool(true)), (requires_whitelist,JBool(false)), (whitelist,JObject(List((geos,JArray(List())), (countries_and_brands,JArray(List()))))))), JObject(List((id,JInt(2)), (name,JString(Apparel)), (is_sensitive,JBool(false)), (last_modified,JString(2010-03-19 17:48:36)), (requires_whitelist_on_external,JBool(false)), (requires_whitelist_on_managed,JBool(false)), (is_brand_eligible,JBool(true)), (requires_whitelist,JBool(false)), (whitelist,JObject(List((geos,JArray(List())), (countries_and_brands,JArray(List()))))))))
However, I am completely lost on how to proceed. I have even tried using another library for Scala called uJson:
val json = ujson.read(request)
val tuples = json("response")("categories").arr /* <-- categories is an array */ .map { item =>
  (item("id"), item("name"))
}
This time I have only parsed two fields for testing, but this shouldn't change much. Hence, I obtained the following structure:
tuples: scala.collection.mutable.ArrayBuffer[(ujson.Value, ujson.Value, ujson.Value, ujson.Value)] = ArrayBuffer((1,"Airlines",false,"2010-03-19 17:48:36"), (2,"Apparel",false,"2010-03-19 17:48:36"))
However, this time too I do not know how to move forward, and everything I try results in errors, mostly related to format incompatibility.
Please feel free to propose any other approach to achieve my goal, even if it totally changes my workflow. I'd rather learn something properly. Thanks.
We can use the following code to convert a JSON string into a Spark DataFrame/Dataset:
import spark.implicits._
val df00 = spark.read.option("multiline", "true").json(Seq(JSON_OUTPUT).toDS())
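Building on that, a rough end-to-end sketch for this case might look like the following; the SparkSession setup, the request value holding the raw response string, and the S3 path are assumptions to adapt:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().appName("categories-to-s3").getOrCreate()
import spark.implicits._

// `request` is the raw JSON string returned by the HTTP client (assumption).
val raw = spark.read.json(Seq(request).toDS())

// One row per element of response.categories, flattened into top-level columns.
val categoriesDf = raw
  .select(explode($"response.categories").as("category"))
  .select("category.*")

// Write to S3 in whichever format you prefer; the bucket/prefix is a placeholder.
categoriesDf.write.mode("overwrite").parquet("s3://my-bucket/categories/")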
I'm trying to extract JSON objects and store them in HDFS. I'm targeting the message attribute, whose value is a6,b6,c6,d6,e6.
JSON sample:
{
"#timestamp":"2020-07-06T07:35:29.047Z",
"#metadata":{
"beat":"filebeat",
"type":"_doc",
"version":"7.7.1"
},
"log":{
"offset":91,
"file":{
"path":"C:\\Program Files\\Filebeat\\test-kafka\\test_csv.csv"
}
},
"message":"a6,b6,c6,d6,e6",
"input":{
"type":"log"
},
"ecs":{
"version":"1.5.0"
},
"host":{
"name":"host"
},
"agent":{
"version":"7.7.1",
"type":"filebeat",
"ephemeral_id":"0b4a288f-f7ac-4db9-835e-60ca07a45fff",
"hostname":"host",
"id":"5e2fec03-bbdc-4f91-acc9-4ab36c7268db"
}
}
GenerateFlowFile properties
EvaluateJsonPath properties
The problem is that EvaluateJsonPath is not working as I expected; I thought it would extract only the message attribute.
hadoop#ambari:~$ hdfs dfs -cat /user/test/5a422f02-9074-4384-a3c9-f3e3ce7c2e40
{
"#timestamp":"2020-07-06T07:35:29.047Z",
"#metadata":{
"beat":"filebeat",
"type":"_doc",
"version":"7.7.1"
},
"log":{
"offset":91,
"file":{
"path":"C:\\Program Files\\Filebeat\\test-kafka\\test_csv.csv"
}
},
"message":"a6,b6,c6,d6,e6",
"input":{
"type":"log"
},
"ecs":{
"version":"1.5.0"
},
"host":{
"name":"host"
},
"agent":{
"version":"7.7.1",
"type":"filebeat",
"ephemeral_id":"0b4a288f-f7ac-4db9-835e-60ca07a45fff",
"hostname":"host",
"id":"5e2fec03-bbdc-4f91-acc9-4ab36c7268db"
}
}
Am I missing something?
Since you used EvaluateJsonPath with the destination set to flow file attributes, it extracted message into a flow file attribute, and the content of the flow file is still the same as it was before. You would need to use another processor like AttributesToJSON before PutHDFS to rewrite the flow file content with the attributes you want. An alternative might be to set the EvaluateJsonPath destination to flow file content, but I'm not sure that produces valid JSON.
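As a rough sketch of the relevant processor settings (the JsonPath and attribute name below are based on the sample above and are assumptions to adapt, not a verified configuration):
EvaluateJsonPath
  Destination:  flowfile-attribute
  Return Type:  auto-detect
  message:      $.message            <- dynamic property, pulls "a6,b6,c6,d6,e6" into an attribute

AttributesToJSON
  Attributes List:  message
  Destination:      flowfile-content <- rewrites the content as {"message":"a6,b6,c6,d6,e6"}

PutHDFS
  (unchanged; it now writes only the rewritten content)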
In my pipeline I use a GET request to reach a 3rd-party database through its REST API. As output I receive a bunch of JSON files. The number of JSON files I have to download (the same as the number of iterations I will have to use) is in one of the fields of the JSON file. The problem is that the field's name is 'page-count', which contains a "-".
#activity('Lookup1').output.firstRow.meta.page.page-count
Data Factory treats the dash in the field's name as a minus sign, so I get an error instead of the value of that field.
{"code":"BadRequest","message":"ErrorCode=InvalidTemplate, ErrorMessage=Unable to parse expression 'activity('Lookup1').output.firstRow.meta.page.['page-count']'","target":"pipeline/Product_pull/runid/f615-4aa0-8fcb-5c0a144","details":null,"error":null}
This is what the structure of the JSON file looks like:
"firstRow": {
"meta": {
"page": {
"number": 1,
"size": 1,
"page-count": 7300,
"record-count": 7300
},
"non-compliant-record-count": 7267
}
},
"effectiveIntegrationRuntime": "intergrationRuntimeTest1",
"billingReference": {
"activityType": "PipelineActivity",
"billableDuration": [
{
"meterType": "SelfhostedIR",
"duration": 0.016666666666666666,
"unit": "Hours"
}
]
},
"durationInQueue": {
"integrationRuntimeQueue": 1
}
}
How can I solve this problem?
The syntax below works when retrieving the value of a JSON element with a hyphen, which is otherwise treated as a minus sign by the parser. It does not seem to be documented by Microsoft; however, I managed to get this to work through trial and error on a project of mine.
#activity('Lookup1').output.firstRow.meta.page['page-count']
This worked for us too. We had the same issue where we couldn't reference an output field that contained a dash (-). We referenced this post and used the square brackets and single quotes, and it worked!
Example below.
#activity('get_token').output.ADFWebActivityResponseHeaders['Set-Cookie']
Would it be possible, in any way, to produce JSON that Zabbix can understand and plot on a graph?
E.g., I have this JSON:
{
"response:" {
"success": true,
"server": {
"name": "Test Server",
"alive": true,
"users": 25
}
}
}
And I would like to have a simple graph where I can see the value of users.
I might be asking nonsense here, but I was reading about the URL element and it looks like it should be possible; however, I couldn't find any template or any info on how to send the data.
Create a Zabbix trapper item and send such values with the zabbix_sender. The values will be processed as any normal item values by Zabbix, and graphs will be available as well.
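As a minimal sketch (the Zabbix server address, host name, item key, and file name below are assumptions; the trapper item key must match what you configure in Zabbix):
# Pull the value out of the JSON and push it to a Zabbix trapper item.
# Assumes a trapper item with key "server.users" exists on host "Test Server".
users=$(jq -r '.response.server.users' response.json)
zabbix_sender -z zabbix.example.com -s "Test Server" -k server.users -o "$users"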