Python Regex: How to match the string and then modify that string by adding something at the end - json

UPDATED CODE: It is working but now the problem is that the code is attaching same random_value to every Path.
Following is my code with a sample chunk of text. I want to read Path and it's value then add (/some unique random alphabet and number combination) at the end of every Path value without changing the already existed value. For example I want the Path to be like
"Path" : "already existed value/1A" e.t.c something like that.
I am unable to make the exact regex pattern of replacing it.
Any help would be appreciated.
It can be done by json parse but the requirement of the task is to do it via REGEX.
from io import StringIO
import re
import string
import random
reader = StringIO("""{
"Bounds": [
{
"HasClip": true,
"Lang": "no",
"Page": 0,
"Path": "//Document/Sect[2]/Aside/P",
"Text": "Potsdam, den 9. Juni 2021 ",
"TextSize": 12.0
}
],
},
{
"Bounds": [
{
"HasClip": true,
"Lang": "de",
"Page": 0,
"Path": "//Document/Sect[3]/P[4]",
"Text": "this is some text ",
"TextSize": 9.0,
}
],
}""")
def id_generator(size=3, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
text = reader.read()
random_value = id_generator()
pattern = r'"Path": "(.*?)"'
replacement = '"Path": "\\1/'+random_value+'"'
text = re.sub(pattern, replacement, text)
#This is working but it is only attaching one same random_value on every Path
print(text)

Use group 1 in the replacement:
replacement = '"Path": "\\1/1A"'
See live demo.
The replacement regex \1 puts back what was captured in group 1 of the match via (.*?).

Since you already have a json structure, maybe it would help to use the json module to parse it.
import json
myDict = json.loads("your json string / variable here")
# now myDict is a dictionary that you can use to loop/read/edit/modify and you can then export myDict as json.

Related

Pulling specific Parent/Child JSON data with Python

I'm having a difficult time figuring out how to pull specific information from a json file.
So far I have this:
# Import json library
import json
# Open json database file
with open('jsondatabase.json', 'r') as f:
data = json.load(f)
# assign variables from json data and convert to usable information
identifier = data['ID']
identifier = str(identifier)
name = data['name']
name = str(name)
# Collect data from user to compare with data in json file
print("Please enter your numerical identifier and name: ")
user_id = input("Numerical identifier: ")
user_name = input("Name: ")
if user_id == identifier and user_name == name:
print("Your inputs matched. Congrats.")
else:
print("Your inputs did not match our data. Please try again.")
And that works great for a simple JSON file like this:
{
"ID": "123",
"name": "Bobby"
}
But ideally I need to create a more complex JSON file and can't find deeper information on how to pull specific information from something like this:
{
"Parent": [
{
"Parent_1": [
{
"Name": "Bobby",
"ID": "123"
}
],
"Parent_2": [
{
"Name": "Linda",
"ID": "321"
}
]
}
]
}
Here is an example that you might be able to pick apart.
You could either:
Make a custom de-jsonify object_hook as shown below and do something with it. There is a good tutorial here.
Just gobble up the whole dictionary that you get without a custom de-jsonify and drill down into it and make a list or set of the results. (not shown)
Example:
import json
from collections import namedtuple
data = '''
{
"Parents":
[
{
"Name": "Bobby",
"ID": "123"
},
{
"Name": "Linda",
"ID": "321"
}
]
}
'''
Parent = namedtuple('Parent', ['name', 'id'])
def dejsonify(json_str: dict):
if json_str.get("Name"):
parent = Parent(json_str.get('Name'), int(json_str.get('ID')))
return parent
return json_str
res = json.loads(data, object_hook=dejsonify)
print(res)
# then we can do whatever... if you need lookups by name/id,
# we could put the result into a dictionary
all_parents = {(p.name, p.id) : p for p in res['Parents']}
lookup_from_input = ('Bobby', 123)
print(f'found match: {all_parents.get(lookup_from_input)}')
Result:
{'Parents': [Parent(name='Bobby', id=123), Parent(name='Linda', id=321)]}
found match: Parent(name='Bobby', id=123)

how to generate schema from a newline delimited JSON file in python

I want to generate schema from a newline delimited JSON file, having each row in the JSON file has variable-key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Exptected Schema:
[
{"name": "col1", "type": "INTEGER"},
{"name": "col2", "type": "STRING"},
{"name": "col3", "type": "FLOAT"},
{"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any tool, as files loaded into an inbound location dynamically. The code will use to trigger an event as-soon-as file arrives and perform schema comparison.
Your first problem is, that json does not have a date-type. So you will get str there.
What I would do, if I was you is this:
import json
# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""
schema = {}
# Split it at newlines
for line in inp.split('\n'):
# each line contains a "dict"
tmp = json.loads(line)
for key in tmp:
# if we have not seen the key before, add it
if key not in schema:
schema[key] = type(tmp[key])
# otherwise check the type
else:
if schema[key] != type(tmp[key]):
raise Exception("Schema mismatch")
# format however you like
out = []
for item in schema:
out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using python types for simplicity, but you can write your own function to get the type, e.g. if you want to check if a string is actually a date.

Cleaning of JSON Objects using Spark

I have been trying to clean my json file.I used RDD to read the Json file and then tried to clean it using replace function but still I am not getting the correct json file because of the escape sequences present in the JSON value.
Here is my code with which I am trying to clean the JSON file of various disturbances.
The cleaned JSON shows errors.Please review and tell the issue**
val readjson = sparkSession
.sparkContext.textFile("dev.json")
val json=readjson.map(element=>element
.replace("\"\":\"\"","\":\"")
.replace("\"\",\"\"","\",\"")
.replace("\"\":","\":")
.replace(",\"\"",",\"")
.replace("\"{\"\"","{\"")
.replace("\"\"}\"","\"}"))
.saveAsTextFile("JSON")
HERE IS MY JSON FILE
"{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":"""",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":"""",""GENDR_CD"":"""",""PROV_NPI"":"""",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":"""",""PROFPROFL_DESGTN"":"""",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"
I tried cleaning the above json and got the following result:-
{
"SEQ_NO": 596514,
"PROV_DEMOG_SK": 596514,
"PROV_ID": "QMP000003370581",
"FRST_NM": "",
"LAST_NM": "RICHARD WHITTINGTON BUTCHER",
"FUL_NM": "",
"GENDR_CD": "",
"PROV_NPI": "",
"PROV_STAT": "Incomplete",
"PROV_TY": "03",
"DT_OF_BRTH": "",
"PROFPROFL_DESGTN": "",
"ETL_LAST_UPDT_DT_TM": "2020-04-28 11:43:31.000000",
"PROV_CLSFTN_CD": "A",
"SRC_DATA_KEY": 50,
"OPRN_CD": "I",
"REC_SET": "F"
}
The JSON validators present online show that it is incorrect
Looks like your JSON has one or few control character \u0009 try replacing them with
.replaceAll("\\u0009"," ")
You can do it in below sequence
val replacedVal = """{""SEQ_NO"":596514,""PROV_DEMOG_SK"":596514,""PROV_ID"":""QMP000003370581"",""FRST_NM"":\"\"\"",""LAST_NM"":""RICHARD WHITTINGTON BUTCHER"",""FUL_NM"":\"\"\"",""GENDR_CD"":\"\"\"",""PROV_NPI"":\"\"\"",""PROV_STAT"":""Incomplete"",""PROV_TY"":""03"",""DT_OF_BRTH"":\"\"\"",""PROFPROFL_DESGTN"":\"\"\"",""ETL_LAST_UPDT_DT_TM"":""2020-04-28 11:43:31.000000"",""PROV_CLSFTN_CD"":""A"",""SRC_DATA_KEY"":50,""OPRN_CD"":""I"",""REC_SET"":""F""}"""
.replace("""\"""",""""""")
.replace("""""""",""""""")
.replaceAll("\\u0009"," ")

Emit Python embedded object as native JSON in YAML document

I'm importing webservice tests from Excel and serialising them as YAML.
But taking advantage of YAML being a superset of JSON I'd like the request part of the test to be valid JSON, i.e. to have delimeters, quotes and commas.
This will allow us to cut and paste requests between the automated test suite and manual test tools (e.g. Postman.)
So here's how I'd like a test to look (simplified):
- properties:
METHOD: GET
TYPE: ADDRESS
Request URL: /addresses
testCaseId: TC2
request:
{
"unitTypeCode": "",
"unitNumber": "15",
"levelTypeCode": "L",
"roadNumber1": "810",
"roadName": "HAY",
"roadTypeCode": "ST",
"localityName": "PERTH",
"postcode": "6000",
"stateTerritoryCode": "WA"
}
In Python, my request object has a dict attribute called fields which is the part of the object to be serialised as JSON. This is what I tried:
import yaml
def request_presenter(dumper, request):
json_string = json.dumps(request.fields, indent=8)
return dumper.represent_str(json_string)
yaml.add_representer(Request, request_presenter)
test = Test(...including embedded request object)
serialised_test = yaml.dump(test)
I'm getting:
- properties:
METHOD: GET
TYPE: ADDRESS
Request URL: /addresses
testCaseId: TC2
request: "{
\"unitTypeCode\": \"\",\n
\"unitNumber\": \"15\",\n
\"levelTypeCode": \"L\",\n
\"roadNumber1\": \"810\",\n
\"roadName\": \"HAY\",\n
\"roadTypeCode\": \"ST\",\n
\"localityName\": \"PERTH\",\n
\"postcode\": \"6000\",\n
\"stateTerritoryCode\": \"WA\"\n
}"
...only worse because it's all on one line and has white space all over the place.
I tried using the | style for literal multi-line strings which helps with the line breaks and escaped quotes (it's more involved but this answer was helpful.) However, escaped or multiline, the result is still a string that will need to be parsed separately.
How can I stop PyYaml analysing the JSON block as a string and make it just accept a block of text as part of the emitted YAML? I'm guessing it's something to do with overriding the emitter but I could use some help. If possible I'd like to avoid post-processing the serialised test to achieve this.
Ok, so this was the solution I came up with. Generate the YAML with a placemarker ahead of time. The placemarker marks the place where the JSON should be inserted, and also defines the root-level indentation of the JSON block.
import os
import itertools
import json
def insert_json_in_yaml(pre_insert_yaml, key, obj_to_serialise):
marker = '%s: null' % key
marker_line = line_of_first_occurrence(pre_insert_yaml, marker)
marker_indent = string_indent(marker_line)
serialised = json.dumps(obj_to_serialise, indent=marker_indent + 4)
key_with_json = '%s: %s' % (key, serialised)
serialised_with_json = pre_insert_yaml.replace(marker, key_with_json)
return serialised_with_json
def line_of_first_occurrence(basestring, substring):
"""
return line number of first occurrence of substring
"""
lineno = lineno_of_first_occurrence(basestring, substring)
return basestring.split(os.linesep)[lineno]
def string_indent(s):
"""
return indentation of a string (no of spaces before a nonspace)
"""
spaces = ''.join(itertools.takewhile(lambda c: c == ' ', s))
return len(spaces)
def lineno_of_first_occurrence(basestring, substring):
"""
return line number of first occurrence of substring
"""
return basestring[:basestring.index(substring)].count(os.linesep)
embedded_object = {
"unitTypeCode": "",
"unitNumber": "15",
"levelTypeCode": "L",
"roadNumber1": "810",
"roadName": "HAY",
"roadTypeCode": "ST",
"localityName": "PERTH",
"postcode": "6000",
"stateTerritoryCode": "WA"
}
yaml_string = """
---
- properties:
METHOD: GET
TYPE: ADDRESS
Request URL: /addresses
testCaseId: TC2
request: null
after_request: another value
"""
>>> print(insert_json_in_yaml(yaml_string, 'request', embedded_object))
- properties:
METHOD: GET
TYPE: ADDRESS
Request URL: /addresses
testCaseId: TC2
request: {
"unitTypeCode": "",
"unitNumber": "15",
"levelTypeCode": "L",
"roadNumber1": "810",
"roadName": "HAY",
"roadTypeCode": "ST",
"localityName": "PERTH",
"postcode": "6000",
"stateTerritoryCode": "WA"
}
after_request: another value

How do you print multiple key values from sub keys in a .json file?

Im pulling a list of AMI ids from my AWS account and its being written into a json file.
The json looks basically like this:
{
"Images": [
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-XXXXXXXX"
},
{
"CreationDate": "2017-11-24T11:05:32.000Z",
"ImageId": "ami-aaaaaaaa"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-bbbbbbb"
},
{
"CreationDate": "2017-10-24T11:05:32.000Z",
"ImageId": "ami-cccccccc"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-ddddddd"
},
{
"CreationDate": "2017-12-24T11:05:32.000Z",
"ImageId": "ami-eeeeeeee"
}
]
}
My code looks like this so far after gathering the info and writing it to a .json file locally:
#writes json output to file...
print('writing to response.json...')
with open('response.json', 'w') as outfile:
json.dump(response, outfile, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))
#Searches file...
print('opening response.json...')
with open("response.json") as f:
file_parsed = json.load(f)
The next part im stuck on is how to iterate through the file and print only the CreationDate and ImageId values.
print('printing CreationDate and ImageId...')
for ami in file_parsed['Images']:
#print ami['CreationDate'] #THIS WORKS
#print ami['ImageId'] #THIS WORKS
#print ami['CreationDate']['ImageId']
The last line there gives me this no matter how I have tried it: TypeError: string indices must be integers
My desired output is something like this:
2017-11-24T11:05:32.000Z ami-XXXXXXXX
Ultimately what im looking to do is then iterate through lines that are a certain date or older and deregister those AMIs. So would I be converting these to a list or a dict?
Pretty much not a programmer here so dont drown me.
TIA
You have almost parsed the json but for the desired output you need to concatenate the 'CreationDate' and 'ImageId' like this:
for ami in file_parsed['Images']:
print(ami['CreationDate'] + " "+ ami['ImageId'])
CreationDate evaluates to a string. So you can only take numerical indices of a string which is why ['CreationDate']['ImageId'] leads to a TypeError. Your other two commented lines, however, were correct.
To check if the date is older, you can make use of the datetime module. For instance, you can take the CreationDate (which is a string), convert it to a datetime object, create your own based on what that certain date is, and compare the two.
Something to this effect:
def checkIfOlder(isoformat, targetDate):
dateAsString = datetime.strptime(isoformat, '%Y-%m-%dT%H:%M:%S.%fZ')
return dateAsString <= targetDate
certainDate = datetime(2017, 11, 30) # Or whichever date you want
So in your for loop:
for ami in file_parsed['Images']:
creationDate = ami['CreationDate']
if checkIfOlder(creationDate, certainDate):
pass # write code to deregister AMIs here
Resources that would benefit would be Python's datetime documentation and in particular, the strftime/strptime directives. HTH!