How to take any CSV file and convert it to JSON? (with Python as a script engine) [Novice user trying to learn NiFi]

1) There is a CSV file containing the following information (the first row is the header):
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
[
{
"first": 1,
"second": 4,
"third": 9,
"total": 14
},
{
"first": 7,
"second": 5,
"third": 2,
"total": 14
},
{
"first": 3,
"second": 8,
"third": 7,
"total": 18
}
]
But it does not work, and I am not sure how to fix it. Can anyone help me understand how to approach this problem?
NiFi flow: (screenshot omitted)

Although I'm not much of a Python user, a bit of googling suggests this might do it:
import csv
import json

with open("YOURFILE.csv") as f:
    reader = csv.DictReader(f)
    data = [r for r in reader]

with open("result.json", "w") as outfile:
    json.dump(data, outfile)
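Note that csv.DictReader returns every field as a string. If you also want numeric values and a recomputed total column (as in the desired output above), here is a slightly longer sketch, assuming the same column names as in the sample file:

import csv
import json

rows = []
with open("YOURFILE.csv") as f:
    for r in csv.DictReader(f):
        # DictReader yields strings, so convert every field to int
        row = {k: int(v) for k, v in r.items()}
        # recompute (or add) the total column from the three value columns
        row["total"] = row["first"] + row["second"] + row["third"]
        rows.append(row)

with open("result.json", "w") as outfile:
    json.dump(rows, outfile, indent=2)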

You can use the QueryRecord processor and add a new property named total whose value is the query:
select first, second, third, first + second + third as total from FLOWFILE
Configure the CSVReader controller service with a matching Avro schema that uses int as the datatype for all fields, and configure a JsonRecordSetWriter controller service whose schema includes the total field, so that the output of the QueryRecord processor contains all the original columns plus their sum as total.
Connect the total relationship from the QueryRecord processor for further processing.
Refer to the NiFi documentation on the QueryRecord processor and on configuring record readers/writers.
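For reference, a matching Avro schema with int for every field might look like this (a sketch; the record name is arbitrary):

{
  "type": "record",
  "name": "csvRecord",
  "fields": [
    { "name": "first",  "type": "int" },
    { "name": "second", "type": "int" },
    { "name": "third",  "type": "int" },
    { "name": "total",  "type": "int" }
  ]
}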

Related

Postgres json to view

I have a table like this (with a jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this JSON with Python into a DataFrame:
import pandas as pd
import json
data = {
    "id": [1, 2],
    "myyear": [2016, 2017],
    "value": [5, 9]
}
data = json.dumps(data)
df = pd.read_json(data)
print(df)
I get this result:
id myyear value
0 1 2016 5
1 2 2017 9
How can I get this result directly from the json column via SQL in a Postgres view?
Note: this assumes that your id, year, and value arrays are consistent and have the same length.
This answer uses PostgreSQL's jsonb_array_elements_text function to explode the array elements into rows.
select jsonb_array_elements_text(payload -> 'id') as "id",
jsonb_array_elements_text(payload -> 'bv_year') as "myyear",
jsonb_array_elements_text(payload -> 'value') as "value"
from main
This gives the following output:
id myyear value
1 2016 5
2 2017 9
That said, storing the properties as parallel arrays in a jsonb object is not the best design and could lead to data inconsistencies later. If it's in your control, I would suggest storing the data so that each property's mapping is clear. Some suggestions:
You can instead have separate columns for each property.
If you want to keep it as jsonb, consider storing it as [{"id": "", "year": "", "value": ""}] instead.
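For illustration, a small Python sketch (using the hypothetical records-style layout suggested above) showing that pandas then reads it straight into the desired frame:

import json
import pandas as pd

# hypothetical payload in the suggested records-style layout
payload = [
    {"id": 1, "year": 2016, "value": 5},
    {"id": 2, "year": 2017, "value": 9},
]

df = pd.read_json(json.dumps(payload), orient="records")
print(df)
#    id  year  value
# 0   1  2016      5
# 1   2  2017      9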

JMeter: How can I randomize post body data several times?

I have POST body data like this:
"My data": [{
"Data": {
"var1": 6.66,
"var2": 8.88
},
"var3": 9
}],
Here, if I post these details in the Body Data section, "My data" is sent just once. I want to repeat it a random number of times, from 1 to 10; for example, if the random value is 2, "My data" should appear twice.
Help appreciated!
If you need to generate more blocks like this one:
{
"Data": {
"var1": 6.66,
"var2": 8.88
},
"var3": 9
}
It can be done using a JSR223 PreProcessor and the following Groovy code:
def myData = []
// repeat the block below twice; change 2 to 10 for ten copies
1.upto(2, {
    def entry = [:]
    entry.put('Data', [var1: 6.66, var2: 8.88])
    entry.put('var3', 9)
    myData.add(entry)
})
// store the generated JSON in the ${myData} JMeter variable
vars.put('myData', new groovy.json.JsonBuilder(myData).toPrettyString())
log.info(vars.get('myData'))
The above example will generate 2 blocks.
If you want 10, change the 2 in the 1.upto(2, { line to 10.
The generated data can then be accessed as ${myData} wherever it is needed.
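For example, in the HTTP Request sampler's Body Data you would reference the variable in place of the literal array (a sketch based on the fragment from the question):

"My data": ${myData},

At run time ${myData} expands to the generated JSON array of blocks.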
More information:
Apache Groovy - Parsing and producing JSON
Apache Groovy - Why and How You Should Use It

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a DataFrame like the following:
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but the following code gives me an error:
_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like this:
[
{
"id": 1,
"owner": "Charlie",
"pet_id": "pet_1"
},
{
"id": 2,
"owner": "Charlie",
"pet_id": "pet_2"
}
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PS:
I also tried structuring the files in the following way:
{
"id": [1, 2],
"owner": ["Charlie", "Charlie"],
"pet_id": ["pet_1", "pet_2"]
}
But Dask interprets the data incorrectly:
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file (line-delimited is the more common case for Dask); you are not assuming that a newline character means a new record
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
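Putting the pieces together, a minimal sketch of the corrected call (assuming the files live under mypets/ and each contains one top-level JSON array):

import dask.dataframe as dd
import pandas as pd

# empty frame describing the expected columns and dtypes
_meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"}
)

ddf = dd.read_json(
    "mypets/*.json",
    meta=_meta,
    blocksize=None,     # one whole JSON document per file, so do not split on newlines
    orient="records",   # each file holds a list of objects
    lines=False,        # not line-delimited JSON
)
print(ddf.compute())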

Problem printing json data from python script

I have a python script that should print json data.
This is what I have in my script:
finaldata = {
    "date": datetime.datetime.utcnow().isoformat(),
    "voltage_mv": emeter["voltage_mv"],
    "current_ma": emeter["current_ma"],
    "power_mw": emeter["power_mw"],
    "energy_wh": emeter["total_wh"],
}
print(finaldata)
I am running the script from Node-RED because I need to send the data to a storage account (in JSON format, of course). The problem is that the data being sent looks like this:
{'power_mw': 0, 'date': '2019-04-16T07:12:19.858159', 'energy_wh': 2, 'voltage_mv': 225045, 'current_ma': 20}
when it should look like this in order to be correctly stored in my storage account:
{"power_mw": 0, "date": '2019-04-16T07:12:19.858159', "energy_wh": 2, "voltage_mv": 225045, "current_ma": 20}
(important for later use, since I already get errors in the storage account).
Does anyone know why this is happening and how I can fix it? Thanks in advance
You should use the Python json module and dump your Python dict to JSON:
import json

finaldata = {"power_mw": 0, "date": '2019-04-16T07:12:19.858159',
             "energy_wh": 2, "voltage_mv": 225045, "current_ma": 20}
print(json.dumps(finaldata))
See the json module reference.
If key order matters, check OrderedDict in the collections module reference.
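For example, a minimal sketch (assuming the same field names as above) that keeps a fixed key order in the serialized JSON:

import json
from collections import OrderedDict

finaldata = OrderedDict([
    ("date", "2019-04-16T07:12:19.858159"),
    ("voltage_mv", 225045),
    ("current_ma", 20),
    ("power_mw", 0),
    ("energy_wh", 2),
])

# keys are emitted in insertion order; sort_keys=True would sort them alphabetically instead
print(json.dumps(finaldata))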

How do you print multiple key values from sub keys in a .json file?

I'm pulling a list of AMI IDs from my AWS account, and it's being written into a JSON file.
The JSON looks basically like this:
{
"Images": [
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-XXXXXXXX"
},
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-aaaaaaaa"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-bbbbbbb"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-cccccccc"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-ddddddd"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-eeeeeeee"
}
]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
# writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
    json.dump(response, outfile, ensure_ascii=False, indent=4,
              sort_keys=True, separators=(',', ': '))

# searches file...
print('opening response.json...')
with open("response.json") as f:
    file_parsed = json.load(f)
The next part I'm stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
    #print ami['CreationDate'] #THIS WORKS
    #print ami['ImageId'] #THIS WORKS
    #print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately, what I'm looking to do is iterate through the entries that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
I'm pretty much not a programmer here, so don't drown me.
TIA
You have almost got it; for the desired output you just need to concatenate the 'CreationDate' and 'ImageId' values like this:
for ami in file_parsed['Images']:
    print(ami['CreationDate'] + " " + ami['ImageId'])
ami['CreationDate'] evaluates to a string, and you can only use integer indices on a string, which is why ['CreationDate']['ImageId'] leads to a TypeError. Your other two commented lines, however, were correct.
To check whether a date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, build another datetime for your cut-off date, and compare the two.
Something to this effect:
from datetime import datetime

def checkIfOlder(isoformat, targetDate):
    # parse the ISO 8601 string (e.g. 2017-11-24T11:05:32.000Z) into a datetime
    parsedDate = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsedDate <= targetDate

certainDate = datetime(2017, 11, 30)  # or whichever cut-off date you want
So in your for loop:
for ami in file_parsed['Images']:
    creationDate = ami['CreationDate']
    if checkIfOlder(creationDate, certainDate):
        pass  # write code to deregister AMIs here
Resources that would help are Python's datetime documentation and, in particular, the strftime/strptime format directives. HTH!
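Putting it all together, a rough end-to-end sketch under the assumptions above; the actual deregistration call (boto3's deregister_image) is left commented out so you can verify the filtering first:

import json
from datetime import datetime
# import boto3  # only needed for the real deregistration step

def checkIfOlder(isoformat, targetDate):
    parsedDate = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
    return parsedDate <= targetDate

certainDate = datetime(2017, 11, 30)  # cut-off date

with open("response.json") as f:
    file_parsed = json.load(f)

old_amis = [ami['ImageId'] for ami in file_parsed['Images']
            if checkIfOlder(ami['CreationDate'], certainDate)]

for image_id in old_amis:
    print('would deregister', image_id)
    # ec2 = boto3.client('ec2')
    # ec2.deregister_image(ImageId=image_id)  # uncomment once you are sure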