Scrapy - Getting duplicated items when appending items using for loop - json

I am crawling data from a JSON response, extracting it into items using a for loop, and all I get is the last record overwriting all the previous records made by the loop.
Here is my code:
def parse_centers_and_ambulances(self, response):
    json_response = json.loads(response.body_as_unicode())
    facility = MedFacilityItem()
    facility["name"] = "Med Facility #1"
    centers = []
    med_centers = MedCenterItem()
    for center in json_response:
        if center["name"].startswith("Center"):
            med_centers["response_url"] = center["product_id"]
            med_centers["name"] = center["name"]
            med_centers["address"] = (center["name_short"] + "." +
                                      center["adr_name"] + " " +
                                      center["adr_dom"])
            med_centers["lat"] = center["latitude"]
            med_centers["lon"] = center["longitude"]
            med_centers["phoneInfo"] = [{"number": center["tel1"],
                                         "description": center["tel1_descr"]},
                                        {"number": center["tel2"],
                                         "description": center["tel2_descr"]}]
            centers.append(med_centers)
    facility["facility_type"] = centers
    return facility
What am I missing?

Since Scrapy items basically behave like dicts, I'm going to use dicts for the following examples. Consider this:
In [1]: dict_list = []
   ...: d = {}
   ...: for i in range(3):
   ...:     d['i'] = i
   ...:     dict_list.append(d)
   ...: print dict_list
   ...: print [id(e) for e in dict_list]
   ...:
[{'i': 2}, {'i': 2}, {'i': 2}]
[4557722520, 4557722520, 4557722520]
Dicts are mutable objects, and in this case you are appending the same dict instance multiple times to a list. The resulting list does not contain different items, only several references to the same dict object. The following example shows the same behaviour, appending the same dict three times to a list and then setting a value for it:
In [2]: dict_list = []
   ...: d = {}
   ...: for i in range(3):
   ...:     dict_list.append(d)
   ...: d['some'] = 'value'
   ...: print dict_list
   ...:
[{'some': 'value'}, {'some': 'value'}, {'some': 'value'}]
What you need to do is create different dicts by initializing them inside the for loop, as follows:
In [3]: dict_list = []
   ...: for i in range(3):
   ...:     d = {}
   ...:     d['i'] = i
   ...:     dict_list.append(d)
   ...: print dict_list
   ...: print [id(e) for e in dict_list]
   ...:
[{'i': 0}, {'i': 1}, {'i': 2}]
[4557901904, 4557724760, 4557843264]
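If you do want to keep reusing a single variable, an alternative (not part of the original answer, but standard Python) is to append a copy of the dict on each iteration, so every append stores an independent snapshot; for nested structures you would need copy.deepcopy instead of a shallow copy:
In [4]: dict_list = []
   ...: d = {}
   ...: for i in range(3):
   ...:     d['i'] = i
   ...:     dict_list.append(dict(d))  # shallow copy: a new dict each time
   ...: print dict_list
   ...:
[{'i': 0}, {'i': 1}, {'i': 2}]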

You can try defining your item inside the loop, instead of outside it.
def parse_centers_and_ambulances(self, response):
    json_response = json.loads(response.body_as_unicode())
    facility = MedFacilityItem()
    facility["name"] = "Med Facility #1"
    centers = []
    # med_centers = MedCenterItem()    # <-- this
    for center in json_response:
        if center["name"].startswith("Center"):
            med_centers = MedCenterItem()  # <-- should be here
            med_centers["response_url"] = center["product_id"]
            med_centers["name"] = center["name"]
            med_centers["address"] = (center["name_short"] + "." +
                                      center["adr_name"] + " " +
                                      center["adr_dom"])
            med_centers["lat"] = center["latitude"]
            med_centers["lon"] = center["longitude"]
            med_centers["phoneInfo"] = [{"number": center["tel1"],
                                         "description": center["tel1_descr"]},
                                        {"number": center["tel2"],
                                         "description": center["tel2_descr"]}]
            centers.append(med_centers)
    facility["facility_type"] = centers
    return facility

Related

BeautifulSoup4 & Python - multiple pages into DataFrame

I have some code which collects the description, price, and old price (if on sale) from online retailers over multiple pages. I'm looking to export this into a DataFrame; I've had a go, but run into the following error:
ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210).
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd

# Start Timer
then = time.time()

# Headers
headers = {"User-Agent": "Mozilla/5.0"}

# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1
scraped_data = []

while Code == 200:
    # Put url together
    url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
    url = url + str(i)
    # Request URL
    r = requests.get(url, allow_redirects=False, headers=headers)  # No redirects to allow infinite page count
    data = r.text
    Code = r.status_code
    # Soup
    soup = BeautifulSoup(data, 'lxml')
    # For loop each product then scroll through title price, old price and description
    divs = soup.find_all('article', attrs={'class': '_2qG85dG'})  # want to cycle through each of these
    for div in divs:
        # Get Description
        Description = div.find('div', attrs={'class': '_3J74XsK'})
        Description = Description.text.strip()
        scraped_data.append(Description)
        # Fetch TitlePrice
        NewPrice = div.find('span', attrs={'data-auto-id': 'productTilePrice'})
        NewPrice = NewPrice.text.strip("£")
        scraped_data.append(NewPrice)
        # Fetch OldPrice
        try:
            OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
            OldPrice = OldPrice.text.strip("£")
            scraped_data.append(OldPrice)
        except AttributeError:
            OldPrice = ""
            scraped_data.append(OldPrice)
    print('page', i, 'scraped')
    # Print Array
    #array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
    #print(array)
    i = i + 1
else:
    i = i - 2

now = time.time()
pd.DataFrame(scraped_data, columns=["A", "B", "C"])
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Right now your data is appended to the list by an algorithm that can be described like this:
Load the web page
Append to list value A
Append to list value B
Append to list value C
What this creates for each run through the dataset is:
[A1, B1, C1, A2, B2, C2]
That gives you a single column of data, which is what pandas is telling you. To construct the DataFrame properly, you either need to reshape it so that each row entry is a tuple of three values (heh), like:
[
    (A1, B1, C1),
    (A2, B2, C2)
]
Or, in my preferred way because it's far more robust to coding errors and inconsistent data lengths: create each row as a dictionary of columns. Thus,
rowdict_list = []
for row in data_source:
    a = extract_a()
    b = extract_b()
    c = extract_c()
    rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})
And the DataFrame is constructed easily, without having to explicitly specify columns in the constructor, with df = pd.DataFrame(rowdict_list).
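To make the two shapes concrete, here is a minimal self-contained sketch (the column names and values are invented for illustration):
import pandas as pd

# Row tuples: column names must be given explicitly.
rows = [("shirt", "12.00", "15.00"),
        ("jumper", "20.00", "")]
df1 = pd.DataFrame(rows, columns=["Description", "CurrentPrice", "OldPrice"])

# Row dicts: each row carries its own column names,
# and any missing key simply becomes NaN in that row.
rowdicts = [{"Description": "shirt", "CurrentPrice": "12.00", "OldPrice": "15.00"},
            {"Description": "jumper", "CurrentPrice": "20.00"}]
df2 = pd.DataFrame(rowdicts)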
You can create a DataFrame using the array dictionary.
You would want to set the values of the array dict to empty lists, so that you can append the values from the webpage into the correct list. Also move the array variable outside of the while loop.
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
...
On the line where you were previously defining the array variable, you would then want to append the description, price and old price values like so:
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
Then you can create a DataFrame from the array variable:
pd.DataFrame(array)
So the final solution would look something like:
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
...
# For loop
for div in divs:
# Get Description
Description = div.find('h3', attrs={'class': 'product__title'})
Description = Description.text.strip()
# Fetch TitlePrice
try:
NewPrice = div.find('div', attrs={'class': 'price product__price--current'})
NewPrice = NewPrice.text.strip()
except AttributeError:
NewPrice = div.find('p', attrs={'class': 'price price--reduced'})
NewPrice = NewPrice.text.strip()
# Fetch OldPrice
try:
OldPrice = div.find('p', attrs={'class': 'price price--previous'})
OldPrice = OldPrice.text.strip()
except AttributeError:
OldPrice = ""
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
# Print Array
print(array)
df = pd.DataFrame(array)
i = i + 1
else:
i = i - 2
now = time.time()
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Finally, make sure you've imported pandas at the top of the module:
import pandas as pd
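One caveat worth noting (my addition, not part of the original answer): pd.DataFrame(array) requires every list in the dict to have the same length, which is why the try/except above always appends something, even an empty string, for the old price:
import pandas as pd

ok = {"A": [1, 2], "B": [3, 4]}
pd.DataFrame(ok)    # fine: columns have equal lengths

bad = {"A": [1, 2], "B": [3]}
# pd.DataFrame(bad) # would raise ValueError: all arrays must be of the same length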

I cannot access dropdown widget output in a loop

I have been going in circles for hours.
I am trying to get the dropdown output into the loop, to check that the result is correct.
I get the dropdown list, but the output is None.
If I select 'DEV' or "DEV", it prints DEV, but the output (w) is None and the conditional ends up at else, not if. Why?
The Python code (Jupyter):
source = ["Select Source", "DEV", "TEMP", "PROD"]
source_ = widgets.Dropdown(
options=source,
value=source[0],
description='Select variable:',
disabled=False,
button_style=''
)
def sourceURL(b):
clear_output()
print(source_.value)
### Drop Down
print("Drop Down")
display(source_)
w = source_.observe(sourceURL)
## print("output: ")
print(w) ### output is None
#### LOOP
if w == 'DEV':
print("This is Dev")
elif w == "TEST":
print("This is TEST")
else:
print("This is PROD")
When you do source_.observe(sourceURL), there is no return value from this call. Hence this is equivalent to w = None.
To get the behaviour you want, I think you would need to move the code at the end of your script into your sourceURL function.
import ipywidgets as widgets
from IPython.display import display, clear_output

source = ["DEV", "TEMP", "PROD"]

source_ = widgets.Dropdown(
    options=source,
    value=source[0],
    description='Select variable:',
    disabled=False,
    button_style=''
)

def sourceURL(b):
    # clear_output()
    w = b['new']
    if w == 'DEV':
        print("This is Dev")
    elif w == "TEMP":
        print("This is TEMP")
    else:
        print("This is PROD")

### Drop Down
print("Drop Down")
display(source_)
source_.observe(sourceURL, names='value')  # observe() returns None, so there is nothing useful to assign
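For reference, a detail from the traitlets observer API (not spelled out in the original answer): the callback receives a change dictionary that also carries the old value and the name of the trait that changed, which is handy for debugging:
def on_change(change):
    # change is a dict-like object describing the trait update
    print("trait:", change['name'])  # e.g. 'value'
    print("old:  ", change['old'])   # previous selection
    print("new:  ", change['new'])   # new selection

source_.observe(on_change, names='value')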

How to turn csv file data into a list WITHOUT the 'import csv'

So what I'm trying to do is read the data in the CSV file into the empty lists I've defined at the top.
How can I do this without import csv?
L = []
F = []
G = []
A = []
class client():
    fh = open('fit_clinic_20.csv', 'r')
    for line in fh:
        data = fh.readlines()
        L, F, G, A = fh.split(',')
I would try:
L = []
F = []
G = []
A = []

fh = open('fit_clinic_20.csv', 'r')
# first: you read lines
data = fh.readlines()
for line in data:
    # you split every line into values (strip the trailing newline first)
    L_value, F_value, G_value, A_value = line.strip().split(',')
    # you append values to lists
    L.append(L_value)
    F.append(F_value)
    G.append(G_value)
    A.append(A_value)
Surely there are more compact ways to do this, but I think this way is easy to understand; one such shorter version is sketched below.
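For instance, a transpose with zip does the same job in three lines (this assumes, as above, that every row has exactly four comma-separated fields):
with open('fit_clinic_20.csv', 'r') as fh:
    rows = (line.strip().split(',') for line in fh)
    L, F, G, A = (list(col) for col in zip(*rows))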

Python: Append JSON objects to nested list

I'm trying to iterate through a list of IP addresses, extract the JSON data from each URL, and put that JSON data into a nested list.
It seems as if my code is overwriting my list over and over, and it will only show one JSON object instead of the many I have specified.
Here's my code:
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    camera_details = [[i['name'], i['serial']] for i in json_obj['cameras']]

for x in camera_details:
    # This only prints one object, and not 10.
    print x
How can I append my JSON objects into a list, and then extract the 'name' and 'serial' values into a nested list?
Try this:
camera_details = []
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    camera_details.extend([[i['name'], i['serial']] for i in json_obj['cameras']])

for x in camera_details:
    print x
In your code you were only getting the last request's data.
Best would be to use append and avoid the list comprehension:
camera_details = []
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    for i in json_obj['cameras']:
        camera_details.append([i['name'], i['serial']])

for x in camera_details:
    print x
Try breaking up your code into smaller, easier-to-digest parts. This will help you diagnose what's going on.
camera_details = []
for obj in json_obj['cameras']:
    if 'name' in obj and 'serial' in obj:
        camera_details.append([obj['name'], obj['serial']])
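For instance, the request logic could live in its own helper so each part can be tested separately; a minimal sketch (the fetch_json name is made up, and the urllib2 import is a guess at where the question's urlopen comes from, since that line wasn't shown):
import json
from urllib2 import urlopen  # Python 2, matching the question's print syntax

def fetch_json(ip):
    # Hypothetical helper: fetch and decode one /system/ids/ endpoint.
    try:
        return json.load(urlopen('http://' + ip + ':8080/system/ids/'))
    except Exception:
        return None

camera_details = []
for addr in ip_addr[:10]:
    json_obj = fetch_json(addr[0])
    if json_obj is None:
        continue
    for cam in json_obj.get('cameras', []):
        if 'name' in cam and 'serial' in cam:
            camera_details.append([cam['name'], cam['serial']])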

Scrapy Extract Values of tags

I am handling a JSON request that has HTML markup as a value.
Now I want to get only the values of data-content-value
<li id="term_100800962" data-content-value='{"nl_term_id":100800962,"c_price_from":33415,"nd_price_discount":0,"nl_tour_id":1017864,"nl_hotel_id":[49316],"d_start":"2017-04-12","d_end":"2017-04-17"}' >
and store them in 'dates', 'id' and 'price', but I can't figure out a way to do this.
Is there an easy way?
In [2]: from scrapy.selector import Selector

In [3]: text = """<li id="term_100800962" data-content-value='{"nl_term_id":100
   ...: 800962,"c_price_from":33415,"nd_price_discount":0,"nl_tour_id":1017864,"
   ...: nl_hotel_id":[49316],"d_start":"2017-04-12","d_end":"2017-04-17"}' >"""

In [4]: sel = Selector(text=text)

In [5]: data_string = sel.xpath('//li/@data-content-value').extract_first()

In [6]: import json

In [7]: json.loads(data_string)
Out[7]:
{'c_price_from': 33415,
 'd_end': '2017-04-17',
 'd_start': '2017-04-12',
 'nd_price_discount': 0,
 'nl_hotel_id': [49316],
 'nl_term_id': 100800962,
 'nl_tour_id': 1017864}
First get the string value of the attribute, then use json.loads() to convert it to a Python dict.
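To map the parsed dict onto the 'dates', 'id' and 'price' names the question asks for, a minimal sketch (those three key names are just the questioner's labels; the right-hand field names come from the JSON shown above):
data = json.loads(data_string)
item = {
    'id': data['nl_term_id'],
    'price': data['c_price_from'],
    'dates': (data['d_start'], data['d_end']),
}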
This URL returns a JSON response; we should parse the whole response as JSON and select the info we need:
In [11]: fetch('https://dovolena.invia.cz/direct/tour_search/ajax-next-boxes/?nl
    ...: _country_id%5B0%5D=28&nl_locality_id%5B0%5D=19&d_start_from=23.01.2017&
    ...: d_end_to=19.04.2017&nl_transportation_id%5B0%5D=3&sort=nl_sell&page=1&g
    ...: etOptionsCount=true&base_url=https%3A%2F%2Fdovolena.invia.cz%2F')

In [12]: j = json.loads(response.text)

In [15]: j['boxes_html']  # this will return the html embedded in the json

In [15]: from scrapy.selector import Selector

In [16]: sel = Selector(text=j['boxes_html'])  # load the html into a selector

In [17]: datas = sel.xpath('//li/@data-content-value').extract()  # return all data in a list

In [21]: [json.loads(d) for d in datas]  # parse each string into a dict
# this returns a list of dicts generated by json.loads(d);
# you can use json.loads(d)['d_end'] to access an element
out:
[{'c_price_from': 15690,
  'd_end': '2017-04-16',
  'd_start': '2017-04-09',
  'nd_price_discount': 27,
  'nl_hotel_id': [24810],
  'nl_term_id': 93902083,
  'nl_tour_id': 839597},
 {'c_price_from': 27371,
  'd_end': '2017-04-17',
  'd_start': '2017-04-12',
  'nd_price_discount': 4,
  'nl_hotel_id': [49316],
  'nl_term_id': 100804770,
  'nl_tour_id': 1017864},
 {'c_price_from': 32175,
  'd_end': '2017-04-17',
  'd_start': '2017-04-12',
  'nd_price_discount': 4,
  'nl_hotel_id': [49316],
  'nl_term_id': 100800962,
  'nl_tour_id': 1017864},