how to INSERT a HTML formatted text to MySQL? - mysql

I am creating a database and inserting data. our backend engineer said he need a column to save whole articles with HTML format. But when I am inserting data it gives me an error like this:
and I check the exact where the error comes from, I found:
looks like this part has some quote or punctuation issues, and the same line occurs multiple times. And I use str() function to convert the formatted HTML text(use type() to see the datatype is bs4.element.Tag) to string, but the problem still exists.
My database description is:
('id', 'mediumint(9)', 'NO', 'PRI', None, 'auto_increment')
('weburl', 'varchar(200)', 'YES', '', None, '')
('picurl', 'varchar(200)', 'YES', '', None, '')
('headline', 'varchar(200)', 'YES', '', None, '')
('abstract', 'varchar(200)', 'YES', '', None, '')
('body', 'longtext', 'YES', '', None, '')
('formed', 'longtext', 'YES', '', None, '')
('term', 'varchar(50)', 'YES', '', None, '')
And the function I used to collect full text is:
def GetBody(url,plain=False):
# Fetch the html file
response = urllib.request.urlopen(url)
html_doc = response.read()
# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')
#find the article body
body = soup.find("section", {"name":"articleBody"})
if not plain:
return body
else:
text = ""
for p_tag in body.find_all('p'):
text = ' '.join([text,p_tag.text])
return text
And I import the data by this function:
def InsertDatabase(section):
s = TopStoriesSearch(section)
count1 = 0
formed = []
while count1 < len(s):
# tr = GetBody(s[count1]['url'])
# formed.append(str(tr))
# count1 = count1 + 1
(I use this to convert HTML to string, or use the code below)
formed.append(GetBody(s[count1]['url']))
count1 = count1 + 1
and this is my insert function:
for each in overall(I save everything in this list named overall):
cur.execute('insert into topstories(formed) values("%s")' % (each["formed"]))
Any tips to solve the problem?

The syntax of execute() function is as follows (link):
cursor.execute(operation, params=None, multi=False)
Therefore, you can provide values to be used in the query as an argument to the execute() function. In that case, it will handle values automatically, eliminating your problem:
import mysql.connector
cnx = mysql.connector.connect(...)
cur = cnx.cursor()
...
for each in overall:
# If 'each' is a dictionary containing 'formed' as key,
# i.e. each = {..., 'formed': ..., ...}, you can do as follows
cur.execute('INSERT INTO topstories(formed) VALUES (%s)', (each['formed']))
# You can also use dictionary directly if you use named placeholder in the query
cur.execute('INSERT INTO topstories(formed) VALUES (%(formed)s)', each)
...
cnx.commit()
cnx.close()

Related

Filter Django model with json values

I have this function:
# create a function to upload an object one to one given a json
def upload_object_values(model, json_values, update_if_exists=True):
if json_values:
# the json values contain key value that match to the model
# use a copy to avoid runtime error dictionary changing size
for json_value in json_values.copy():
# remove all ids in model copy
if json_value[-3:] == '_id' or json_value == 'id':
json_values.pop(json_value)
# assign a value to each key which is a the field in the given model
for key, value in json_values.items():
setattr(model, key, value)
# if the object is get or create, or an update if exists
if update_if_exists:
# if it exists
if model.__class__.objects.seal().exists():
# retrieve the existing object
retrieved_object = model.__class__.objects.seal().filter(json_values).first() #TODO: filter the model with the json_values
# assign values into it
retrieved_object = model
# save
retrieved_object.save()
print(retrieved_object)
print('existing, saved successfully')
# if it does not exist
else:
# save
model.save()
print('does not exist, saved successfully')
# else just save
else:
model.save()
print('created successfully')
with sample json_values like:
{'notes': '', 'name': 'test issuer', 'phone': None}
I need to be able to filter using the json_values that I have, as shown in the line:
retrieved_object = model.__class__.objects.seal().filter(json_values).first() #TODO: filter the model with the json_values
How may I able to assign these values inside the filter() in django?
I think I understand what you are trying to achieve. Try this:
retrieved_object = model.__class__.objects.seal().filter(**json_values).first()
if json_values = {'notes': '', 'name': 'test issuer', 'phone': None}, the above line is equivalent to:
retrieved_object = model.__class__.objects.seal().filter(notes='', name='test issuer', phone=None).first()
To filter by JSONField in Django you should use __ (double).
YourObject.objects.filter(json_values__notes='')
YourObject.objects.filter(json_values__name='test issuer')
If you want to have deeper dict levels, the rule stays the same:
{'notes': {'casual': 'nice', 'pro': 'elegant'}, 'name': 'test issuer', 'phone': None}
YourObject.objects.filter(json_values__notes__pro='elegant')

Useful way to convert string to dictionary using python

I have the below string as input:
'name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'
I wrote function which convert it to dictionary using python:
def str_2_json(string):
str_arr = string.split(',')
#str_arr{0} = name SP2
#str_arr{1} = status Online
json_data = {}
for i in str_arr:
#remove whitespaces
stripped_str = " ".join(i.split()) # i.strip()
subarray = stripped_str.split(' ')
#subarray{0}=name
#subarray{1}=SP2
key = subarray[0] #key: 'name'
value = subarray[1] #value: 'SP2'
json_data[key] = value
#{dict 0}='name': SP2'
#{dict 1}='status': online'
return json_data
The return turns the dictionary into json (it has jsonfiy).
Is there a simple/elegant way to do it better?
You can do this with regex
import re
def parseString(s):
dict(re.findall('(?:(\S+) ([^,]+)(?:, )?)', s))
sample = "name SP1, status Offline, size 4764771 MB, free 2406182 MB, path /dev/sdb, log 230 MB, port 5660, guid a48134c00cda2c37005b30b0e40e3ed6, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sdb /dev/sdc /dev/sdd, dare 0"
parseString(sample)
Output:
{'name': 'SP1',
'status': 'Offline',
'size': '4764771 MB',
'free': '2406182 MB',
'path': '/dev/sdb',
'log': '230 MB',
'port': '5660',
'guid': 'a48134c00cda2c37005b30b0e40e3ed6',
'clusterUuid': '-8650609094877646407--116798096584060989',
'disks': '/dev/sdb /dev/sdc /dev/sdd',
'dare': '0'}
Your approach is good, except for a couple weird things:
You aren't creating a JSON anything, so to avoid any confusion I suggest you don't name your returned dictionary json_data or your function str_2_json. JSON, or JavaScript Object Notation is just that -- a standard of denoting an object as text. The objects themselves have nothing to do with JSON.
You can use i.strip() instead of joining the splitted string (not sure why you did it this way, since you commented out i.strip())
Some of your values contain multiple spaces (e.g. "size 4764771 MB" or "disks /dev/sde /dev/sdf /dev/sdg"). By your code, you end up everything after the second space in such strings. To avoid this, do stripped_str.split(' ', 1) which limits how many times you want to split the string.
Other than that, you could create a dictionary in one line using the dict() constructor and a generator expression:
def str_2_dict(string):
data = dict(item.strip().split(' ', 1) for item in string.split(','))
return data
print(str_2_dict('name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'))
Outputs:
{
'name': 'SP2',
'status': 'Online',
'size': '4764771 MB',
'free': '2576353 MB',
'path': '/dev/sde',
'log': '210 MB',
'port': '5660',
'guid': '7478a0141b7b9b0d005b30b0e60f3c4d',
'clusterUuid': '-8650609094877646407--116798096584060989',
'disks': '/dev/sde /dev/sdf /dev/sdg',
'dare': '0'
}
This is probably the same (practically, in terms of efficiency / time) as writing out the full loop:
def str_2_dict(string):
data = dict()
for item in string.split(','):
key, value = item.strip().split(' ', 1)
data[key] = value
return data
Assuming these fields cannot contain internal commas, you can use re.split to both split and remove surrounding whitespace. It looks like you have different types of fields that should be handled differently. I've added a guess at a schema handler based on field names that can serve as a template for converting the various fields as needed.
And as noted elsewhere, there is no json so don't use that name.
import re
test = 'name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'
def decode_data(string):
str_arr = re.split(r"\s*,\s*", string)
data = {}
for entry in str_arr:
values = re.split(r"\s+", entry)
key = values.pop(0)
# schema processing
if key in ("disks"): # multivalue keys
data[key] = values
elif key in ("size", "free"): # convert to int bytes on 2nd value
multiplier = {"MB":10**6, "MiB":2**20} # todo: expand as needed
data[key] = int(values[0]) * multiplier[values[1]]
else:
data[key] = " ".join(values)
return data
decoded = decode_data(test)
for kv in sorted(decoded.items()):
print(kv)
import json
json_data = json.loads(string)

Parsing Json like structure in spark

I have a file with data structured like this
{'analytics_category_id': 'Default', 'item_sub_type': '', 'deleted_at': '', 'product_category_id': 'Default', 'unit_price': '0.000', 'id': 'myntramprdii201907174a72fb2475d84103844083d1348acb9e', 'is_valid': True, 'measurement_uom_id': '', 'description': '', 'invoice_type': 'receivable_debit_note', 'linked_core_invoice_item_id': '', 'ref_3': '741423', 'ref_2': '6001220139357318', 'ref_1': '2022-07-04', 'tax_rate': '0.000', 'reference_id': '', 'ref_4': '', 'product_id': 'Default', 'total_amount': '0.000', 'tax_auth_party_id': '', 'item_type': 'Product', 'invoice_item_attributes': '', 'core_invoice_id': 'myntramprdi20190717a1e925911345463393bc4ac1b124dbe5', 'tax_auth_geo_id': '', 'quantity': 1}
{'analytics_category_id': 'Default', 'item_sub_type': '', 'deleted_at': '', 'product_category_id': 'Default', 'unit_price': '511.000', 'id': 'myntramprdii20190717c749a96d2e7144aea7fc5125287717f7', 'is_valid': True, 'measurement_uom_id': '', 'description': '', 'invoice_type': 'receivable_debit_note', 'linked_core_invoice_item_id': '', 'ref_3': '741424', 'ref_2': '6001220152640260', 'ref_1': '2022-07-07', 'tax_rate': '0.000', 'reference_id': '', 'ref_4': '', 'product_id': 'Default', 'total_amount': '511.000', 'tax_auth_party_id': '', 'item_type': 'Product', 'invoice_item_attributes': '', 'core_invoice_id': 'myntramprdi20190717a1e925911345463393bc4ac1b124dbe5', 'tax_auth_geo_id': '', 'quantity': 1}
I am trying to parse this in Spark using scala and create a dataframe off of it, but not able to do so because of the structure. I thought about replacing the ' with " but my text can also have the same. What I need is a key value pair of the data.
So far I have tried:
read.option("multiline", "true").json("s3://******/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
I did get some success reading this as a multiline text:
read.option("multiline", "true").textFile("s3://********/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
| value|
+--------------------+
|{'analytics_categ...|
|{'analytics_categ...|
+--------------------+
How do I read the keys as columns now?
Your issue is linked to the True value used as a boolean in your entries: this is not valid in JSON which requires true or false as boolean values
If your dataset is not very large, the easiest way is to load as text, fix this issue, write the fixed data then reopen it as json.
import spark.implicits._
import org.apache.spark.sql.types._
val initial = spark.read.text("s3://******/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
val fixed = initial
.select(regexp_replace('value,"\\bTrue\\b","true") as "value")
.select(regexp_replace('value,"\\bFalse\\b","false") as "value")
fixed.write.mode("overwrite").text("/tmp/fixed_items")
val json_df = spark.read.json("/tmp/fixed_items")
json_df: org.apache.spark.sql.DataFrame = [analytics_category_id: string, core_invoice_id: string ... 23 more fields]
If you don't want to a temporary dataset, you can directly use from_json to parse the fixed text value but you'll need to manually define your schema in spark beforehand and do some column renaming after parsing:
val jsonSchema = StructType.fromDDL("`analytics_category_id` STRING,`core_invoice_id` STRING,`deleted_at` STRING,`description` STRING,`id` STRING,`invoice_item_attributes` STRING,`invoice_type` STRING,`is_valid` BOOLEAN,`item_sub_type` STRING,`item_type` STRING,`linked_core_invoice_item_id` STRING,`measurement_uom_id` STRING,`product_category_id` STRING,`product_id` STRING,`quantity` BIGINT,`ref_1` STRING,`ref_2` STRING,`ref_3` STRING,`ref_4` STRING,`reference_id` STRING,`tax_auth_geo_id` STRING,`tax_auth_party_id` STRING,`tax_rate` STRING,`total_amount` STRING,`unit_price` STRING")
val jsonParsingOptions: Map[String,String] = Map()
val json_df = fixed
.select(from_json('value, jsonSchema, jsonParsingOptions) as "j")
.select(jsonSchema.map(f => 'j.getItem(f.name).as(f.name)):_*)
json_df: org.apache.spark.sql.DataFrame = [analytics_category_id: string, core_invoice_id: string ... 23 more fields]
As an aside, from the snippet you posted, you don't seem to require the multiline option but if you actually do you'll need to add the option to the jsonParsingOptions map.

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a python script that can read scanned, and tabular .pdfs and extract some important data and insert it into a JSON to later be implemented into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is I have never worked with any JSON files before but that was the format I was recommended to output to. The scraping script works, the pre-processing could be a lot cleaner, but for now it works. The issue I run into is the keys, and values are in the same list, and some of the values because they had a decimal point are two different list items. Not really sure where to even start.
I don't really know where to start, I suppose since I know what the indexes of the list are I can easily assign keys and values, but then it may not be applicable to any .pdf, that is the script cannot be coded explicitly.
import PyPDF2 as pdf2
import textract
with "TestSpec.pdf" as filename:
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.pdfFileReader(pdfFileObj)
num_pages = pdfReader.numpages
count = 0
text = ""
while count < num_pages:
pageObj = pdfReader.getPage(0)
count += 1
text += pageObj.extractText()
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
'''
This function takes the byte data extracted from scanned PDFs, and cleans it of all
unnessary data.
Requires re
'''
stringedText = str(x)
cleanText = stringedText.replace('\n','')
splitText = re.split(r'\W+', cleanText)
caseingText = [word.lower() for word in splitText]
cleanOne = [word for word in caseingText if word != 'n']
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
return clean
cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
here is a screenshot of the data from my sample pdf
So, i have figured out some of this. I am still having issues with grabbing the last 3rd of the data i need without explicitly programming it in. but here is what i have so far. Once i have everything working then i will worry about optimizing it and condensing.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json
# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
text = text
else:
text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json",'w') as file:
primerDataSlice = clean[clean.index("molecular"): -1]
primerData = re.split(": |\n", primerDataSlice)
primerKeys = primerData[0::2]
primerValues = primerData[1::2]
primerDict = {"Primer Data": dict(zip(primerKeys,primerValues))}
# Generatring the JSON array "Primer Data"
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall('(\d{2}[\/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[\/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code. A minimal working example with input would help. As for JSON handling, python dictionaries can dump to json easily. See examples here.
https://docs.python-guide.org/scenarios/json/
Get a json string from a dictionary and write to a file. Figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
So, I did figure this out, the problem was really just that because of the way my pre-processing was pulling all the data into a single list wasn't really that great of an idea considering that the keys for the dictionary never changed.
Here is the semi-finished result for making the Dictionary and JSON file.
# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") +
10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])
def gc_content(x):
count = 0
for i in x:
if i == 'G' or i == 'C':
count += 1
else:
count = count
return round((count / bases) * 100, 1)
gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") +
10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {"Primer Data": {
"Sequence": sequence,
"Bases": bases,
"TM (50mM NaCl)": tm,
"% GC content": gc,
"Molecular weight": moleWeight,
"ug/0D260": dilWeight,
"Dilution volume (uL)": dilution
},
"Shipment Info": {
"Ref. No.": refNo,
"Order No.": orderNo,
"Ordered by": ordered,
"Date of Order": dateOrder,
"Received By": received,
"Date Received": str(dateReceived.strftime("%d-%b-%Y"))
}}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
primerJSON = json.dumps(primerDict, ensure_ascii=False)
file.write(primerJSON)

Getting values from Json data in Python

I have a json file that I am trying to pull specific attribute data from. The Json data is essentially a dictionary. Before the data is turned into a file, it is first held in a variable like this:
params = {'f': 'json', 'where': '1=1', 'geometryType': 'esriGeometryPolygon', 'spatialRel': 'esriSpatialRelIntersects','outFields': '*', 'returnGeometry': 'true'}
r = requests.get('https://hazards.fema.gov/gis/nfhl/rest/services/CSLF/Prelim_CSLF/MapServer/3/query', params)
cslfJson = r.json()
and then written into a file like this:
path = r"C:/Workspace/Sandbox/ScratchTests/cslf.json"
with open(path, 'w') as f:
json.dump(cslfJson, f, indent=2)
within this json data is an attribute called DFIRM_ID. I want to create an empty list called dfirm_id = [], get all of the values for DFIRM_ID and for that value, append it to the list like this dfirm_id.append(value). I am thinking I need to somehow read through the json variable data or the actual file, but I am not sure how to do it. Any suggestions on an easy method to accomplish this?
dfirm_id = []
for k, v in cslf:
if cslf[k] == 'DFIRM_ID':
dfirm.append(cslf[v])
As requested, here is what print(cslfJson) looks like:
It actually prints a huge dictionary that looks like this:
{'displayFieldName': 'CSLF_ID', 'fieldAliases': {'OBJECTID':
'OBJECTID', 'CSLF_ID': 'CSLF_ID', 'Area_SF': 'Area_SF', 'Pre_Zone':
'Pre_Zone', 'Pre_ZoneST': 'Pre_ZoneST', 'PRE_SRCCIT': 'PRE_SRCCIT',
'NEW_ZONE': 'NEW_ZONE', 'NEW_ZONEST': .... {'attributes': {'OBJECTID':
26, 'CSLF_ID': '13245C_26', 'Area_SF': 5.855231804165408e-05,
'Pre_Zone': 'X', 'Pre_ZoneST': '0.2 PCT ANNUAL CHANCE FLOOD HAZARD',
'PRE_SRCCIT': '13245C_STUDY1', 'NEW_ZONE': 'A', 'NEW_ZONEST': None,
'NEW_SRCCIT': '13245C_STUDY2', 'CHHACHG': 'None (Zero)', 'SFHACHG':
'Increase', 'FLDWYCHG': 'None (Zero)', 'NONSFHACHG': 'Decrease',
'STRUCTURES': None, 'POPULATION': None, 'HUC8_CODE': None, 'CASE_NO':
None, 'VERSION_ID': '2.3.3.3', 'SOURCE_CIT': '13245C_STUDY2', 'CID':
'13245C', 'Pre_BFE': -9999, 'Pre_BFE_LEN_UNIT': None, 'New_BFE':
-9999, 'New_BFE_LEN_UNIT': None, 'BFECHG': 'False', 'ZONECHG': 'True', 'ZONESTCHG': 'True', 'DFIRM_ID': '13245C', 'SHAPE_Length':
0.009178426056888393, 'SHAPE_Area': 4.711699932249018e-07, 'UID': 'f0125a91-2331-4318-9a50-d77d042a48c3'}}, {'attributes': .....}
If your json data is already a dictionary, then take advantage of that. The beauty of a dictionary / hashmap is that it provides an average time complexity of O(1).
Based on your comment, I believe this will solve your problem:
dfirm_id = []
for feature in cslf['features']:
dfirm_id.append(feature['attributes']['DFIRM_ID'])