Save json list in a text file - json

I have a JSON log file that I rearrange into the correct form; after that I am trying to save the results back to the same file. The results are a list, but the problem is that I am unable to save it and I get the following error:
write() argument must be str, not list
Here is the code itself:
import regex as re
import re

f_name = 'test1.txt'
splitter = r'"Event\d+":{(.*?)}'  # a search pattern to capture the stuff in braces

# Open the file as Read.
with open(f_name, 'r') as src:
    data = src.readlines()

# tokenize the data source...
tokens = re.findall(splitter, str(data))
# print(tokens)

# now we can operate on the tokens and split them up into key-value pairs and put them into a list
result = []
for token in tokens:
    # make an empty dictionary to hold the row elements
    line_dict = {}
    # we can split the line (token) by comma to get the key-value pairs
    pairs = token.split(',')
    for pair in pairs:
        # another regex split needed here, because the timestamps have colons too
        splitter = r'"(.*)"\s*:\s*"(.*)"'  # capture two groups of things in quotes on opposite sides of colon
        parts = re.search(splitter, pair)
        key, value = parts.group(1), parts.group(2)
        line_dict[key] = value
    # add the dictionary of line elements to the result
    result.append(line_dict)

with open(f_name, 'w') as src:
    for line in result:
        src.write(result)
Note: the code itself was not written by me -> Log file management with python (thanks AirSquid).
Thanks for the assistance, new at Python here.
I tried to import json and use json.dump, and also tried appending the text, but in most cases I end up with just [] or an empty file.
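For what it's worth, here is a minimal sketch of the saving step (not the original code; it assumes result is the list of dictionaries built above): json.dump serializes the whole list in one call, and writing to a separate output file avoids truncating the source before it has been parsed.
import json

# hypothetical output name; write the whole list of dicts in one call
with open('test1_parsed.json', 'w') as out:
    json.dump(result, out, indent=2)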

Related

Convert text based key/value pair to JSON format

I have a text file with a lot of key/value pairs in the given format:
secret_key="XXXXX"
database_password="1234"
timout=30
.
.
.
and list continues...
I want these key/value pairs to be stored in JSON format so that I can make use of this data as JSON. Is there any way of doing this? I mean any website or any method to do it automatically?
The Python 3.8 script below would do the job ◡̈
import json

with open('text', 'r') as fp:
    dic = {}
    while line := fp.readline().strip():
        key, value = line.split('=')
        dic[key] = eval(value)

print(json.dumps(dic))
Note: eval is used so that the double quotes around the values do not end up escaped in the JSON output.
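If eval feels too permissive for untrusted input, a variant of the same idea (my own sketch, not part of the original answer) uses ast.literal_eval, which only accepts Python literals such as quoted strings and numbers:
import ast
import json

dic = {}
with open('text', 'r') as fp:
    for line in fp:
        line = line.strip()
        if not line or '=' not in line:
            continue
        key, value = line.split('=', 1)
        # literal_eval parses '"XXXXX"' -> 'XXXXX' and '30' -> 30 without executing code
        dic[key] = ast.literal_eval(value)
print(json.dumps(dic))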
I guess that is an .env file, so I would suggest implementing something like this in Python:
import json
import sys

try:
    dotenv = sys.argv[1]
except IndexError:
    dotenv = '.env'

with open(dotenv, 'r') as f:
    content = f.readlines()

# removes whitespace chars like '\n' at the end of each line;
# split on the first '=' only, in case a value itself contains '='
content = [x.strip().split('=', 1) for x in content if '=' in x]
print(json.dumps(dict(content)))
Reference: https://gist.github.com/GabLeRoux/d6b2c2f7a69ebcd8430ea59c9bcc62c0
*Please let me know if you want to implement it in a different language, such as JavaScript.

Apache NiFi: How to compare multiple rows in a csv and create new column

I have a csv which looks like this.
Jc,TXF,timer,alpha,beta
15,44,55,12,33
18,87,33,111
9,87,61,29,77
Alpha and beta combined make up a city code. I want to add the name of the city to the CSV as a new column.
Jc,TXF,timer,alpha,beta,city
15,44,55,12,33,York
18,87,33,111,London
9,87,61,29,77,Sydney
I have another CSV with only the columns alpha, beta, city, which looks like this:
alpha,beta,city
12,33,York
33,111,London
29,77,Sydney
How can I achieve this using Apache NiFi? Please suggest the processors and workflow needed to achieve this.
I see two ways of solving this.
First, by using CsvLookupService. However, CsvLookupService only supports a single key, while you have two: alpha and beta. So to use this solution you would have to concatenate both keys into a single key, like 12_33.
Second, by using the ExecuteScript processor. This one is better, because you don't have to modify your source data. Strategy:
Split the CSV text into lines
Enrich each line with the city column by looking up the alpha and beta keys in the mapping file
Merge the individual lines into a single CSV file.
Overall flow: GenerateFlowFile -> SplitText -> ExecuteScript -> MergeContent.
In SplitText, set the header line count to 1 to include the header line in the split content. For the ExecuteScript processor, set python as the scripting engine and provide the following script body:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import csv

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # fetch the mapping CSV file
        with open('/home/nifi/mapping.csv', 'r') as mapping:
            # read the mapping file
            mappingContent = csv.reader(mapping, delimiter=',')
            # flowfile content is CSV text with two lines, header and actual content
            # split by newline to get access to each individual line
            lines = IOUtils.toString(inputStream, StandardCharsets.UTF_8).split('\n')
            # the result will contain the header line with the additional city column
            result = lines[0] + ',city\n'
            # take the second line and split it
            # to get access to the alpha and beta values
            lineSplit = lines[1].split(',')
            # Go through the mapping file
            # item[0] -> alpha
            # item[1] -> beta
            # item[2] -> city
            # See if alpha and beta match the line content
            matched = False
            for item in mappingContent:
                if item[0] == lineSplit[3] and item[1] == lineSplit[4]:
                    result += lines[1] + ',' + item[2]
                    matched = True
                    break
            if not matched:
                raise Exception('No matching found.')
            outputStream.write(bytearray(result.encode('utf-8')))
# end class

flowFile = session.get()
if flowFile is not None:
    try:
        flowFile = session.write(flowFile, PyStreamCallback())
        session.transfer(flowFile, REL_SUCCESS)
    except Exception as e:
        session.transfer(flowFile, REL_FAILURE)
See the comments for a detailed description of the script. /home/nifi/mapping.csv has to be available on your NiFi instance. If you want to learn more about the ExecuteScript processor, refer to the ExecuteScript Cookbook. Finally, merge all the lines back into a single CSV file:
Set a CSV reader and writer and leave their default properties. Adjust the MergeContent properties to control how many lines should end up in each resulting CSV file.

Python 2.7: Load a JSON file search for a value, replace it, and save as new JSON

As mentioned in the title, I'm trying to make a simple Python script that can be run from the terminal to do the following:
Find all JSON files in current working directory and nested folders (this part works well)
Load said files
Recursively search them for a specific value or a substring
If the value is matching, replace it with a new established value by the user
Once finished, save all modified json files to a "converted" folder in the current directory.
That said, the issue is with the recursive search method posted below. Since I'm pretty much new to Python, I would appreciate any help with what I suppose the problem is: either the JSON files I'm using or the search method I'm employing.
Simplifying the issue: the value I search for never matches anything inside the object, be that a key or purely some string value. I have tried multiple methods to perform a recursive search but can't get a match.
For example, taking into account the sample JSON, I want to replace the value "selectable_parts" or "static_parts", or even deeper in the structure "1h_mod310_door_00", but it seems like my method of searching can't reach this value in "object[object][children][0][children][5][name]" (hope this helps).
Sample JSON: (https://drive.google.com/open?id=0B2-Bn2b0ujjVdW5YVGg3REg3OWs)
"""KEYWORD REPLACING MODULE."""
import os
import json
# functions
def get_files():
"""lists files"""
exclude = set(['.vscode', 'sample'])
json_files = []
for root, dirs, files in os.walk(os.getcwd(), topdown=True):
dirs[:] = [d for d in dirs if d not in exclude]
for name in files:
if name.endswith('.json'):
json_files.append(os.path.join(root, name))
return json_files
def load_files(json_files):
"""works files"""
for js_file in json_files:
with open(js_file) as json_file:
loaded_json = json.load(json_file)
replace_key_value(loaded_json, os.path.basename(js_file))
def write_file(data_file, new_file_name):
"""writes the file"""
if not os.path.exists('converted'):
os.makedirs('converted')
with open('converted/' + new_file_name, 'w') as json_file:
json.dump(data_file, json_file)
def replace_key_value(js_file, js_file_name):
"""replace and initiate save"""
recursive_replace(js_file, SKEY, '')
# write_file(js_file, js_file_name)
def recursive_replace(data, match, repl):
"""search for needed value and replace its value"""
for key, value in data.items():
if value == match:
print data[key]
print "AHHHHHHHH"
elif isinstance(value, dict):
recursive_replace(value, match, repl)
# main
print "\n" + '- on ' + os.getcwd()
NEW_DIR = raw_input('Work dir (leave empty if current): ')
if not NEW_DIR:
print NEW_DIR
NEW_DIR = os.getcwd()
else:
print NEW_DIR
os.chdir(NEW_DIR)
# get_files()
JS_FILES = get_files()
print '- files on ' + os.getcwd()
# print "\n".join(JS_FILES)
SKEY = raw_input('Value to search: ')
RKEY = raw_input('Replacement value: ')
load_files(JS_FILES)
The issue was the way I navigated the JSON object, because the method didn't consider whether the current node was a dict or a list (I believe...).
So, to answer my own question, here's the recursive search I'm now using to check the values:
def get_recursively(search_dict, field):
    """
    Takes a dict with nested lists and dicts,
    and searches all dicts for a key of the field
    provided.
    """
    fields_found = []
    for key, value in search_dict.iteritems():
        if key == field:
            print value
            fields_found.append(value)
        elif isinstance(value, dict):
            results = get_recursively(value, field)
            for result in results:
                if SEARCH_KEY in result:
                    fields_found.append(result)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    more_results = get_recursively(item, field)
                    for another_result in more_results:
                        if SEARCH_KEY in another_result:
                            fields_found.append(another_result)
    return fields_found

# write_file(js_file, js_file_name)
Hope this helps someone.
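Since the goal in the question was to replace the matched value rather than just collect it, here is a minimal sketch of a recursive replace (my own addition, not from the original post) that walks both dicts and lists and substitutes the value in place:
def recursive_replace(node, match, repl):
    """Walk dicts and lists in place and replace any value equal to match."""
    if isinstance(node, dict):
        for key, value in node.items():
            if value == match:
                node[key] = repl
            else:
                recursive_replace(value, match, repl)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            if item == match:
                node[index] = repl
            else:
                recursive_replace(item, match, repl)
Called as recursive_replace(loaded_json, SKEY, RKEY), using the names from the script above, it would also cover the nested-list case that the original recursive_replace missed.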

Reading the data written to s3 by Amazon Kinesis Firehose stream

I am writing records to a Kinesis Firehose stream that are eventually written to an S3 file by Amazon Kinesis Firehose.
My record object looks like:
ItemPurchase {
String personId,
String itemId
}
The data written to S3 looks like:
{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}
NO COMMA SEPARATION.
NO STARTING BRACKET as in a Json Array
[
NO ENDING BRACKET as in a Json Array
]
I want to read this data and get a list of ItemPurchase objects.
List<ItemPurchase> purchases = getPurchasesFromS3(IOUtils.toString(s3ObjectContent))
What is the correct way to read this data?
It boggles my mind that Amazon Firehose dumps JSON messages to S3 in this manner, and doesn't allow you to set a delimiter or anything.
Ultimately, the trick I found to deal with the problem was to process the text file using the JSON raw_decode method
This will allow you to read a bunch of concatenated JSON records without any delimiters between them.
Python code:
import json

decoder = json.JSONDecoder()

with open('giant_kinesis_s3_text_file_with_concatenated_json_blobs.txt', 'r') as content_file:
    content = content_file.read()

content_length = len(content)
decode_index = 0

while decode_index < content_length:
    try:
        obj, decode_index = decoder.raw_decode(content, decode_index)
        print("File index:", decode_index)
        print(obj)
    except json.JSONDecodeError as e:
        print("JSONDecodeError:", e)
        # Scan forward and keep trying to decode
        decode_index += 1
I also had the same problem; here is how I solved it:
Replace "}{" with "}\n{".
Split lines by "\n".
input_json_rdd.map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE)) \
              .flatMap(lambda line: line.split("\n"))
A nested JSON object has several "}"s, so splitting a line by "}" doesn't solve the problem.
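For context, here is a sketch of how that snippet could sit in a full PySpark read (my own illustration; the S3 prefix and SparkSession setup are assumptions, not part of the original answer):
import json
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# hypothetical S3 prefix written by the Firehose delivery stream
input_json_rdd = spark.sparkContext.textFile("s3://my-bucket/firehose-output/")

records = (input_json_rdd
           .map(lambda x: re.sub("}{", "}\n{", x, flags=re.UNICODE))
           .flatMap(lambda line: line.split("\n"))
           .map(json.loads))
print(records.take(3))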
I've had the same issue.
It would have been better if AWS allowed us to set a delimiter but we can do it on our own.
In my use case, I've been listening to a stream of tweets, and once I receive a new tweet I immediately put it to Firehose.
This, of course, resulted in a 1-line file which could not be parsed.
So, to solve this, I have concatenated the tweet's JSON with a \n.
This, in turn, let me use some packages that can output lines when reading stream contents, and parse the file easily.
Hope this helps you.
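A minimal sketch of that producer-side fix (my own illustration; the stream name is hypothetical and boto3 is assumed as the client library):
import json
import boto3

firehose = boto3.client('firehose')

def send_tweet(tweet_dict):
    # append a newline so each record lands on its own line in the S3 object
    firehose.put_record(
        DeliveryStreamName='tweets-stream',  # hypothetical stream name
        Record={'Data': (json.dumps(tweet_dict) + '\n').encode('utf-8')}
    )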
I think the best way to tackle this is to first create a properly formatted JSON file containing well-separated JSON objects. In my case I added ',' to the events that were pushed into Firehose. Then, after a file is saved in S3, all the files will contain JSON objects separated by some delimiter (a comma, in our case). Another thing that must be added is '[' and ']' at the beginning and end of the file. Then you have a proper JSON file containing multiple JSON objects, and parsing them is now possible.
If the input source for the Firehose is an Analytics application, this concatenated JSON without a delimiter is a known issue, as cited here. You should have a Lambda function, as here, that outputs JSON objects on multiple lines.
I used a transformation Lambda to add a line break at the end of every record
import base64
import copy

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Decode from base64 (Firehose records are base64 encoded)
        payload = base64.b64decode(record['data'])
        # Read json as utf-8
        json_string = payload.decode("utf-8")
        # Add a line break
        output_json_with_line_break = json_string + "\n"
        # Encode the data
        encoded_bytes = base64.b64encode(bytearray(output_json_with_line_break, 'utf-8'))
        encoded_string = str(encoded_bytes, 'utf-8')
        # Create a deep copy of the record and append to output with transformed data
        output_record = copy.deepcopy(record)
        output_record['data'] = encoded_string
        output_record['result'] = 'Ok'
        output.append(output_record)
    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
Use this simple Python code.
import json

input_str = '''{"personId":"p-111","itemId":"i-111"}{"personId":"p-222","itemId":"i-222"}{"personId":"p-333","itemId":"i-333"}'''
data_str = "[{}]".format(input_str.replace("}{", "},{"))
data_json = json.loads(data_str)
And then (if you want) convert to Pandas.
import pandas as pd
df = pd.DataFrame().from_records(data_json)
print(df)
And this is the result:
itemId personId
0 i-111 p-111
1 i-222 p-222
2 i-333 p-333
If there's a way to change how the data is written, separate all the records with a newline. That way you can read the data simply, line by line. If not, then build a scanner object which takes "}" as a delimiter and use the scanner to read. That would do the job.
You can find each valid JSON object by counting the brackets. Assuming the file starts with a {, this Python snippet should work:
import json

def read_block(stream):
    open_brackets = 0
    block = ''
    while True:
        c = stream.read(1)
        if not c:
            break
        if c == '{':
            open_brackets += 1
        elif c == '}':
            open_brackets -= 1
        block += c
        if open_brackets == 0:
            yield block
            block = ''

if __name__ == "__main__":
    with open('firehose_json_blob', 'r') as f:
        for block in read_block(f):
            record = json.loads(block)
            print(record)
This problem can be solved with a JSON parser that consumes objects one at a time from a stream. The raw_decode method of the JSONDecoder exposes just such a parser, but I've written a library that makes it straightforward to do this with a one-liner.
from firehose_sipper import sip
for entry in sip(bucket=..., key=...):
do_something_with(entry)
I've added some more details in this blog post
In Spark, we had the same problem. We're using the following:
import re
from pyspark.sql.functions import *

@udf
def concatenated_json_to_array(text):
    final = "["
    separator = ""
    for part in text.split("}{"):
        final += separator + part
        separator = "}{" if re.search(r':\s*"([^"]|(\\"))*$', final) else "},{"
    return final + "]"

def read_concatenated_json(path, schema):
    # assumes an active SparkSession named spark
    return (spark.read
            .option("lineSep", None)
            .text(path)
            .withColumn("value", concatenated_json_to_array("value"))
            .withColumn("value", from_json("value", schema))
            .withColumn("value", explode("value"))
            .select("value.*"))
It works as follows:
Read the data as one string per file (no delimiters!)
Use a UDF to introduce the JSON array and split the JSON objects by introducing a comma. Note: be careful not to break any strings with }{ in them!
Parse the JSON with a schema into DataFrame fields.
Explode the array into separate rows
Expand the value object into columns.
Use it like this:
from pyspark.sql.types import *

schema = ArrayType(
    StructType([
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("id", IntegerType(), True),
            StructField("joke", StringType(), True),
            StructField("categories", ArrayType(StringType()), True)
        ]), True)
    ])
)

path = '/mnt/my_bucket_name/messages/*/*/*/*/'
df = read_concatenated_json(path, schema)
I've written more details and considerations here: Parsing JSON data from S3 (Kinesis) with Spark. Do not just split by }{, as it can mess up your string data! For example: { "line": "a\"r}{t" }.
You can use the script below.
If the streamed data size does not exceed the buffer size you set, each S3 file will end up with one pair of brackets ([]) and comma-separated records.
import base64

print('Loading function')

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        print(record['recordId'])
        payload = base64.b64decode(record['data']).decode('utf-8') + ',\n'
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(payload.encode('utf-8'))
        }
        output.append(output_record)

    last = len(event['records']) - 1
    print('Successfully processed {} records.'.format(len(event['records'])))

    start = '[' + base64.b64decode(output[0]['data']).decode('utf-8')
    end = base64.b64decode(output[last]['data']).decode('utf-8') + ']'
    output[0]['data'] = base64.b64encode(start.encode('utf-8'))
    output[last]['data'] = base64.b64encode(end.encode('utf-8'))
    return {'records': output}
Using a JavaScript regex:
JSON.parse(`[${item.replace(/}\s*{/g, '},{')}]`);

ASCII number pattern matching in binary file in Python 2

I'm trying to read a binary file and match number(s) in each of its records.
If the number matches then the record is to be copied to another file.
The number should be present between the 24th and 36th byte of each record.
The script takes the numbers as arguments. Here's the script I'm using:
#!/usr/bin/env python
# search.py
import re
import sys
import glob
import os
import binascii

list = sys.argv[1:]
list.sort()
rec_len = 452

filelist = glob.glob(os.getcwd() + '*.bin')
print('Input File : %s' % filelist)

for file in filelist:
    outfile = file + '.out'
    f = open(file, "rb")
    g = open(outfile, "wb")
    for pattern in list:
        print pattern
        regex_search = re.compile(pattern).search
        while True:
            buf = f.read(rec_len)
            if len(buf) == 0:
                break
            else:
                match = regex_search(buf)
                match2 = buf.find(pattern)
                # print match
                # print match2
                if ((match2 != -1) | (match != None)):
                    g.write(buf)
    f.close()
    g.close()
print("Done")
print ("Done")
I'm running it like:
python search.py 1234 56789
I'm using python 2.6.
The code is not matching the number.
I also tried using binascii to convert the number to binary before matching but even then it didn't return any record.
If I give any string it works correctly, but if I give any number as an argument it doesn't match.
Where am I going wrong?
You are depleting the file buffer by reading all the bytes while checking for the first pattern. Hence, the second pattern will never be matched (an attempt will not even be made), because you've already reached the end of the file by reading all the records during the for-loop pass for the first pattern.
That means that if your first pattern is nowhere to be found in those records, your script will not give you any output.
Consider adding f.seek(0) at the end of the while loop, or changing the order of the two loop constructs such that you first read a record from the file, then match the regex for each of the patterns in the argument list.
Also, try not to shadow the Python built-ins by using list as the name of your array. It won't give you problems in the code you've shown, but it's definitely something you should be aware of.
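To illustrate the second suggestion, here is a minimal sketch (mine, not the answerer's code, and keeping the question's Python 2 style): read each record once and test every pattern against it, so no pattern is starved by an exhausted file handle.
#!/usr/bin/env python
# sketch: record-outer / pattern-inner loop order
import glob
import os
import sys

patterns = sys.argv[1:]
rec_len = 452

for path in glob.glob(os.path.join(os.getcwd(), '*.bin')):
    f = open(path, "rb")
    g = open(path + '.out', "wb")
    while True:
        buf = f.read(rec_len)
        if not buf:
            break
        # write the record if any of the given numbers appears in it
        if any(buf.find(p) != -1 for p in patterns):
            g.write(buf)
    f.close()
    g.close()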