I'm trying to iterate through a list of IP addresses, and extracting the JSON data from my url, and trying to put that JSON data into a nested list.
It seems as if my code is overwriting my list over and over, and will only show one JSON object, instead of the many I have specified.
Here's my code:
for x in range(0, 10):
try:
url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
response = urlopen(url)
json_obj = json.load(response)
except:
continue
camera_details = [[i['name'], i['serial']] for i in json_obj['cameras']]
for x in camera_details:
#This only prints one object, and not 10.
print x
How can I append my JSON objects into a list, and then extract the 'name' and 'serial' values into a nested list?
try this
camera_details = []
for x in range(0, 10):
try:
url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
response = urlopen(url)
json_obj = json.load(response)
except:
continue
camera_details.extend([[i['name'], i['serial']] for i in json_obj['cameras']])
for x in camera_details:
print x
in your code you where only getting the last requests data
Best would be using append and avoiding list comprehension
camera_details = []
for x in range(0, 10):
try:
url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
response = urlopen(url)
json_obj = json.load(response)
except:
continue
for i in json_obj['cameras']:
camera_details.append([i['name'], i['serial']])
for x in camera_details:
print x
Try breaking up your code into smaller, easier to digest parts. This will help you to diagnose what's going on.
camera_details = []
for obj in json_obj['cameras']:
if 'name' in obj and 'serial' in obj:
camera_details.append([obj['name'], obj['serial']])
Related
I'm having trouble with finding json elements in a nested json.
It seems that my code only finds the element on the root level.
My code is not able to find the elements recursively it seems.
import json
import pandas as pd
jsonString = '{"airplane": {"wings": {}, "wheels": {}, "cockpit": {}}}'
jsonObj = json.loads(jsonString)
data = ['airplane','wings','wheels','cockpit']
dfProp = pd.DataFrame(data, columns=['object'])
# find elements in JSON
for index, row in dfProp.iterrows():
if row['object'] in jsonObj:
print(row['object'] + ' ' + 'FOUND')
else:
print(row['object'] + ' ' + 'NOT FOUND')
I want to find all elements regardless of how many nesting levels there are in json files.
Can someone point me into the right direction?
If I understand you correctly, you want to check if all values from list data is found as a key in jsonObj:
import json
jsonString = '{"airplane": {"wings": {}, "wheels": {}, "cockpit": {}}}'
jsonObj = json.loads(jsonString)
data = ["airplane", "wings", "wheels", "cockpit"]
def find(o):
if isinstance(o, dict):
for k, v in o.items():
yield k
yield from find(v)
elif isinstance(o, list):
for v in o:
yield from find(v)
s = set(data).difference(find(jsonObj))
if not s:
print("All values from data found in jsonObj")
else:
print("Not all values from data found in jsonObj", s)
Prints:
All values from data found in jsonObj
I have some code which collects the description, price, and old price(if on sale) from online retailers over multiple pages. I'm looking to export this into a DataFrame and have had a go but run into the following error:
ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210).
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
# Start Timer
then = time.time()
# Headers
headers = {"User-Agent": "Mozilla/5.0"}
# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1
scraped_data = []
while Code == 200:
# Put url together
url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
url = url + str(i)
# Request URL
r = requests.get(url, allow_redirects=False, headers=headers) # No redirects to allow infinite page count
data = r.text
Code = r.status_code
# Soup
soup = BeautifulSoup(data, 'lxml')
# For loop each product then scroll through title price, old price and description
divs = soup.find_all('article', attrs={'class': '_2qG85dG'}) # want to cycle through each of these
for div in divs:
# Get Description
Description = div.find('div', attrs={'class': '_3J74XsK'})
Description = Description.text.strip()
scraped_data.append(Description)
# Fetch TitlePrice
NewPrice = div.find('span', attrs={'data-auto-id':'productTilePrice'})
NewPrice = NewPrice.text.strip("£")
scraped_data.append(NewPrice)
# Fetch OldPrice
try:
OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
OldPrice = OldPrice.text.strip("£")
scraped_data.append(OldPrice)
except AttributeError:
OldPrice = ""
scraped_data.append(OldPrice)
print('page', i, 'scraped')
# Print Array
#array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
#print(array)
i = i + 1
else:
i = i - 2
now = time.time()
pd.DataFrame(scraped_data, columns=["A", "B", "C"])
print('Parse complete with', i, 'pages' + ' in', now-then, 'seconds')
Right now your data is appended to list based on an algorithm that I can describe like this:
Load the web page
Append to list value A
Append to list value B
Append to list value C
What this creates for each run through the dataset is:
[A1, B1, C1, A2, B2, C2]
There exists only one column with data, which is what pandas is telling you. To construct the dataframe properly, either you need to swap it into a format where you have, on each row entry, a tuple of three values (heh) like:
[
(A1, B1, C1),
(A2, B2, C2)
]
Or, in my preferred way because it's far more robust to coding errors and inconsistent lengths to your data: creating each row as a dictionary of columns. Thus,
rowdict_list = []
for row in data_source:
a = extract_a()
b = extract_b()
c = extract_c()
rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})
And the data frame is constructed easily without having to explicitly specify columns in the constructor with df = pd.DataFrame(rowdict_list).
You can create a DataFrame using the array dictionary.
You would want to set the values of the array dict to empty lists that way you can append the values from the webpage into the correct list. Also move the array variable outside of the while loop.
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
...
On the line where you were previously defining the array variable you would then want to append the desciption, price and old price values like so.
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
Then you can to create a DataFrame using the array variable
pd.DataFrame(array)
So the final solution would look something like
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
...
# For loop
for div in divs:
# Get Description
Description = div.find('h3', attrs={'class': 'product__title'})
Description = Description.text.strip()
# Fetch TitlePrice
try:
NewPrice = div.find('div', attrs={'class': 'price product__price--current'})
NewPrice = NewPrice.text.strip()
except AttributeError:
NewPrice = div.find('p', attrs={'class': 'price price--reduced'})
NewPrice = NewPrice.text.strip()
# Fetch OldPrice
try:
OldPrice = div.find('p', attrs={'class': 'price price--previous'})
OldPrice = OldPrice.text.strip()
except AttributeError:
OldPrice = ""
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
# Print Array
print(array)
df = pd.DataFrame(array)
i = i + 1
else:
i = i - 2
now = time.time()
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Finally make sure you've imported pandas at the top of the module
import pandas as pd
I just discovered the json_normalize function which works great in taking a JSON object and giving me a pandas Dataframe. Now I want the reverse operation which takes that same Dataframe and gives me a json (or json-like dictionary which I can easily turn to json) with the same structure as the original json.
Here's an example: https://hackersandslackers.com/json-into-pandas-dataframes/.
They take a JSON object (or JSON-like python dictionary) and turn it into a dataframe, but I now want to take that dataframe and turn it back into a JSON-like dictionary (to later dump to json file).
I implemented it with a couple functions
def set_for_keys(my_dict, key_arr, val):
"""
Set val at path in my_dict defined by the string (or serializable object) array key_arr
"""
current = my_dict
for i in range(len(key_arr)):
key = key_arr[i]
if key not in current:
if i==len(key_arr)-1:
current[key] = val
else:
current[key] = {}
else:
if type(current[key]) is not dict:
print("Given dictionary is not compatible with key structure requested")
raise ValueError("Dictionary key already occupied")
current = current[key]
return my_dict
def to_formatted_json(df, sep="."):
result = []
for _, row in df.iterrows():
parsed_row = {}
for idx, val in row.iteritems():
keys = idx.split(sep)
parsed_row = set_for_keys(parsed_row, keys, val)
result.append(parsed_row)
return result
#Where df was parsed from json-dict using json_normalize
to_formatted_json(df, sep=".")
A simpler approach:
Uses only 1 function...
def df_to_formatted_json(df, sep="."):
"""
The opposite of json_normalize
"""
result = []
for idx, row in df.iterrows():
parsed_row = {}
for col_label,v in row.items():
keys = col_label.split(sep)
current = parsed_row
for i, k in enumerate(keys):
if i==len(keys)-1:
current[k] = v
else:
if k not in current.keys():
current[k] = {}
current = current[k]
# save
result.append(parsed_row)
return result
df.to_json(path)
or
df.to_dict()
I just implemented this using 2 functions.
Get a full list of fields from the DataFrame that are part of a nested field. Only the parent i.e. if location.city.code fits the criteria, we only care about location.city. Sort it by the deepest level of nesting, i.e. location.city is nested further than location.
Starting with the deepest nested parent field, find all child fields by searching in the column name. Create a field in the DataFrame for the parent field, which is a combination of all child fields (renamed so that they lose the nesting structure, e.g. location.city.code becomes code) converted to JSON and then loaded to a dictionary value. Finally, drop all of the child fields.
def _get_nested_fields(df: pd.DataFrame) -> List[str]:
"""Return a list of nested fields, sorted by the deepest level of nesting first."""
nested_fields = [*{field.rsplit(".", 1)[0] for field in df.columns if "." in field}]
nested_fields.sort(key=lambda record: len(record.split(".")), reverse=True)
return nested_fields
def df_denormalize(df: pd.DataFrame) -> pd.DataFrame:
"""
Convert a normalised DataFrame into a nested structure.
Fields separated by '.' are considered part of a nested structure.
"""
nested_fields = _get_nested_fields(df)
for field in nested_fields:
list_of_children = [column for column in df.columns if field in column]
rename = {
field_name: field_name.rsplit(".", 1)[1] for field_name in list_of_children
}
renamed_fields = df[list_of_children].rename(columns=rename)
df[field] = json.loads(renamed_fields.to_json(orient="records"))
df.drop(list_of_children, axis=1, inplace=True)
return df
let me throw in my two cents
after backward converting you might need to drop empty columns from your generated jsons
therefore, i checked if val != np.nan. but u cant directly do it, instead you need to check val == val or not, because np.nan != itself.
my version:
def to_formatted_json(df, sep="."):
result = []
for _, row in df.iterrows():
parsed_row = {}
for idx, val in row.iteritems():
if val == val:
keys = idx.split(sep)
parsed_row = set_for_keys(parsed_row, keys, val)
result.append(parsed_row)
return result
This is a solution which looks working to me. It is designed to work on a dataframe with one line, but it can be easily looped over large dataframes.
class JsonRecreate():
def __init__(self, df):
self.df = df
def pandas_to_json(self):
df = self.df
# determine the number of nesting levels
number_levels = np.max([len(i.split('.')) for i in df.columns])
# put all the nesting levels in an a list
levels = []
for level_idx in np.arange(number_levels):
levels.append(np.array([i.split('.')[level_idx] if len(i.split('.')) > level_idx else ''
for i in df.columns.tolist()]))
self.levels = levels
return self.create_dict(upper_bound = self.levels[0].shape[0])
def create_dict(self, level_idx = 0, lower_bound = 0, upper_bound = 100):
''' Function to create the dictionary starting from a pandas dataframe generated by json_normalize '''
levels = self.levels
dict_ = {}
# current nesting level
level = levels[level_idx]
# loop over all the relevant elements of the level (relevant w.r.t. its parent)
for key in [i for i in np.unique(level[lower_bound: upper_bound]) if i != '']:
# find where a particular key occurs in the level
correspondence = np.where(level[lower_bound: upper_bound] == key)[0] + lower_bound
# check if the value(s) corresponding to the key appears once (multiple times)
if correspondence.shape[0] == 1:
# if the occurence is unique, append the value to the dictionary
dict_[key] = self.df.values[0][correspondence[0]]
else:
# otherwhise, redefine the relevant bounds and call the function recursively
lower_bound_, upper_bound_ = correspondence.min(), correspondence.max() + 1
dict_[key] = self.create_dict(level_idx + 1, lower_bound_, upper_bound_)
return dict_
I tested it with a simple dataframe such as:
df = pd.DataFrame({'a.b': [1], 'a.c.d': [2], 'a.c.e': [3], 'a.z.h1': [-1], 'a.z.h2': [-2], 'f': [4], 'g.h': [5], 'g.i.l': [6], 'g.i.m': [7], 'g.z.h1': [-3], 'g.z.h2': [-4]})
The order in the json is not exactly preserved in the resulting json, but it can be easily handled if needed.
As mentioned in the title, i'm trying to make a simple py script that can be run from terminal to do the following:
Find all JSON files in current working directory and nested folders (this part works well)
Load said files
Recursively search them for a specific value or a substring
If the value is matching, replace it with a new established value by the user
Once finished, save all modified json files to a "converted" folder in the current directory.
That said, the issue is when i try the recursive search method posted below, since i'm pretty much new to python i would appreciate any help with this issue, what i suppose it is... either the json files i'm using or the search method i'm employing.
Simplifying the issue, the value i search for never matches with anything inside the object, be that a key or purely some string value. Tried multiple methods to perform a recursive search but can't get a match.
For example: taking in account the sample json, i want to replace the value "selectable_parts" or "static_parts" or even deeper in the structure "1h_mod310_door_00" but seems like my method of searching can't reach this value in "object[object][children][0][children][5][name]" (hope this helps).
Sample JSON: (https://drive.google.com/open?id=0B2-Bn2b0ujjVdW5YVGg3REg3OWs)
"""KEYWORD REPLACING MODULE."""
import os
import json
# functions
def get_files():
"""lists files"""
exclude = set(['.vscode', 'sample'])
json_files = []
for root, dirs, files in os.walk(os.getcwd(), topdown=True):
dirs[:] = [d for d in dirs if d not in exclude]
for name in files:
if name.endswith('.json'):
json_files.append(os.path.join(root, name))
return json_files
def load_files(json_files):
"""works files"""
for js_file in json_files:
with open(js_file) as json_file:
loaded_json = json.load(json_file)
replace_key_value(loaded_json, os.path.basename(js_file))
def write_file(data_file, new_file_name):
"""writes the file"""
if not os.path.exists('converted'):
os.makedirs('converted')
with open('converted/' + new_file_name, 'w') as json_file:
json.dump(data_file, json_file)
def replace_key_value(js_file, js_file_name):
"""replace and initiate save"""
recursive_replace(js_file, SKEY, '')
# write_file(js_file, js_file_name)
def recursive_replace(data, match, repl):
"""search for needed value and replace its value"""
for key, value in data.items():
if value == match:
print data[key]
print "AHHHHHHHH"
elif isinstance(value, dict):
recursive_replace(value, match, repl)
# main
print "\n" + '- on ' + os.getcwd()
NEW_DIR = raw_input('Work dir (leave empty if current): ')
if not NEW_DIR:
print NEW_DIR
NEW_DIR = os.getcwd()
else:
print NEW_DIR
os.chdir(NEW_DIR)
# get_files()
JS_FILES = get_files()
print '- files on ' + os.getcwd()
# print "\n".join(JS_FILES)
SKEY = raw_input('Value to search: ')
RKEY = raw_input('Replacement value: ')
load_files(JS_FILES)
The issue was the way i navigated the json obj because the method didn't considerate if it was a dict or a list (i believe...).
So to answer my own question here's the recursive search i'm using to check the values:
def get_recursively(search_dict, field):
"""
Takes a dict with nested lists and dicts,
and searches all dicts for a key of the field
provided.
"""
fields_found = []
for key, value in search_dict.iteritems():
if key == field:
print value
fields_found.append(value)
elif isinstance(value, dict):
results = get_recursively(value, field)
for result in results:
if SEARCH_KEY in result:
fields_found.append(result)
elif isinstance(value, list):
for item in value:
if isinstance(item, dict):
more_results = get_recursively(item, field)
for another_result in more_results:
if SEARCH_KEY in another_result:
fields_found.append(another_result)
return fields_found
# write_file(js_file, js_file_name)
Hope this helps someone.
I have the following code in my scrapy spider:
def parse(self, response):
jsonresponse = json.loads(response.body_as_unicode())
htmldata = jsonresponse["html"]
for sel in htmldata.xpath('//li/li'):
-- more xpath codes --
yield item
But i am having this error:
raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded
After checking the json reply, i found out about **<!--WPJM-->** and **<!--WPJM_END-->** which is causing this error.
<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->
How do i parse my scrapy without looking at the !--WPJM-- and !--WPJM_END-- code?
EDIT: This is the error that i have:
File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
for sel in htmldata.xpath('//li'):
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'
def parse(self, response):
rawdata = response.body_as_unicode()
jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
# print jsondata # For debugging
# pass
data = json.loads(jsondata)
htmldata = data["html"]
# print htmldata # For debugging
# pass
for sel in htmldata.xpath('//li'):
item = ProjectjomkerjaItem()
item['title'] = sel.xpath('a/div[#class="position"]/div[#id="job-title-job-listing"]/strong/text()').extract()
item['company'] = sel.xpath('a/div[#class="position"]/div[#class="company"]/strong/text()').extract()
item['link'] = sel.xpath('a/#href').extract()
The easiest approach would be to get rid of the comments tags manually using replace():
data = response.body_as_unicode()
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)
Though it is not quite pythonic and reliable.
Or, a better option would to be to get the text() by xpath:
$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'