BeautifulSoup4 & Python - multiple pages into DataFrame

I have some code which collects the description, price, and old price (if on sale) from online retailers over multiple pages. I'm looking to export this into a DataFrame and have had a go, but I run into the following error:
ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210).
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
# Start Timer
then = time.time()
# Headers
headers = {"User-Agent": "Mozilla/5.0"}
# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1
scraped_data = []
while Code == 200:
    # Put url together
    url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
    url = url + str(i)
    # Request URL
    r = requests.get(url, allow_redirects=False, headers=headers)  # No redirects to allow infinite page count
    data = r.text
    Code = r.status_code
    # Soup
    soup = BeautifulSoup(data, 'lxml')
    # For loop each product then scroll through title price, old price and description
    divs = soup.find_all('article', attrs={'class': '_2qG85dG'})  # want to cycle through each of these
    for div in divs:
        # Get Description
        Description = div.find('div', attrs={'class': '_3J74XsK'})
        Description = Description.text.strip()
        scraped_data.append(Description)
        # Fetch TitlePrice
        NewPrice = div.find('span', attrs={'data-auto-id': 'productTilePrice'})
        NewPrice = NewPrice.text.strip("£")
        scraped_data.append(NewPrice)
        # Fetch OldPrice
        try:
            OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
            OldPrice = OldPrice.text.strip("£")
            scraped_data.append(OldPrice)
        except AttributeError:
            OldPrice = ""
            scraped_data.append(OldPrice)
    print('page', i, 'scraped')
    # Print Array
    #array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
    #print(array)
    i = i + 1
else:
    i = i - 2

now = time.time()
pd.DataFrame(scraped_data, columns=["A", "B", "C"])
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')

Right now your data is appended to one flat list based on an algorithm that I can describe like this:
Load the web page
Append to list value A
Append to list value B
Append to list value C
What this creates for each run through the dataset is:
[A1, B1, C1, A2, B2, C2]
That gives pandas only one column of data, which is what the error is telling you. To construct the DataFrame properly, you either need to reshape it so that each row entry is a tuple of three values, like:
[
(A1, B1, C1),
(A2, B2, C2)
]
Or, in my preferred way, because it's far more robust to coding errors and inconsistent data lengths: create each row as a dictionary of columns. Thus:
rowdict_list = []
for row in data_source:
    a = extract_a()
    b = extract_b()
    c = extract_c()
    rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})
The DataFrame is then constructed with df = pd.DataFrame(rowdict_list), without having to explicitly specify columns in the constructor.
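Applied to your scraping loop, that could look something like this (a minimal sketch reusing the selectors from your code; untested against the live site):

scraped_rows = []
while Code == 200:
    ...
    for div in divs:
        # Description
        description = div.find('div', attrs={'class': '_3J74XsK'}).text.strip()
        # Current price
        new_price = div.find('span', attrs={'data-auto-id': 'productTilePrice'}).text.strip("£")
        # Old price tag may be missing when the item is not on sale
        old_price_tag = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
        old_price = old_price_tag.text.strip("£") if old_price_tag else ""
        # One dictionary per product row
        scraped_rows.append({"Description": description,
                             "CurrentPrice": new_price,
                             "Old Price": old_price})
    ...

# After the loop, one call builds the frame
df = pd.DataFrame(scraped_rows)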

You can create a DataFrame using the array dictionary.
You would want to set the values of the array dict to empty lists so that you can append the values from the webpage into the correct list. Also, move the array variable outside of the while loop:
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
    ...
On the line where you were previously defining the array variable, you would then want to append the description, price and old price values like so:
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
Then you can create a DataFrame from the array variable:
pd.DataFrame(array)
So the final solution would look something like this:
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
...
# For loop
for div in divs:
# Get Description
Description = div.find('h3', attrs={'class': 'product__title'})
Description = Description.text.strip()
# Fetch TitlePrice
try:
NewPrice = div.find('div', attrs={'class': 'price product__price--current'})
NewPrice = NewPrice.text.strip()
except AttributeError:
NewPrice = div.find('p', attrs={'class': 'price price--reduced'})
NewPrice = NewPrice.text.strip()
# Fetch OldPrice
try:
OldPrice = div.find('p', attrs={'class': 'price price--previous'})
OldPrice = OldPrice.text.strip()
except AttributeError:
OldPrice = ""
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
# Print Array
print(array)
df = pd.DataFrame(array)
i = i + 1
else:
i = i - 2
now = time.time()
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Finally, make sure you've imported pandas at the top of the module:
import pandas as pd

Related

Loop through list of dictionaries and append to csv

I'm currently trying to collect tweets with the Twitter API. I want to merge two lists of dictionaries into a csv. The ['data'] list consists of ID and tweet text; the second list, ['includes']['users'], consists of username and location. I have tried two for loops in order to merge these elements, one for ['data'] and one for ['includes']['users']. But I end up with the exact same tweet and ID for every user in my csv output.
print(json.dumps(json_response, indent=4, sort_keys=True))
My data looks like this (not real tweets):
{"data": [{"author_id": "1234","id": "9999","text": "This is tweet number 1"},{"author_id": "9876","id": "1111","text": "This is another tweet"},],"includes": {"users": [{"id": "9999","location": "Earth","name": "George Huston","username": "George_Huston"},{"id": "1111","name": "Adam Sandler,"username": "adam_sandler"}]
json_response['includes']['users']
[{'name': 'George Huston', 'location': 'Earth', 'id': '9876', 'username': 'George_Huston'},
 {'name': 'Adam Sandler', 'id': '9999', 'username': 'adam_sandler'}]
Creating a csv:
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
#Create headers for the data you want to save, in this example, we only want save these columns in our dataset
csvWriter.writerow(['id', 'username', 'text', 'location'])
csvFile.close()
def append_to_csv(json_response, fileName):
    #A counter variable
    counter = 0
    #Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    #Loop through each tweet
    for tweet in json_response['data']:
        tweet_id = tweet['id']
        text = tweet['text']
        for element in json_response['includes']['users']:
            username = element['username']
            if ('location' in tweet):
                location = element['location']
            else:
                location = " "
            # Assemble all data in a list
            res = [tweet_id, username, text, location]
            # Append the result to the CSV file
            csvWriter.writerow(res)
            counter += 1
    # When done, close the CSV file
    csvFile.close()
    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)

append_to_csv(json_response, "data.csv")
But get this csv output:
id,username,text,location
9999,George_Huston,"This is tweet number 1",
9999,adam_sandler,"This is tweet number 1",
The id, text, location is always the same, while the username is different. How can I solve this problem?
In your for tweet in json_response['data'] loop you overwrite tweet_id and text as the loop goes on. The output you see is whatever they were set to in the last iteration of the loop.
It seems from the Twitter API that you can get usernames from the Tweet Objects as well, without json_response['includes']['users'] that you used.
Does this do what you want?
# Create file
fileName = 'data.csv'
csvFile = open("data.csv", "w", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)
#Create headers for the data you want to save, in this example, we only want save these columns in our dataset
csvWriter.writerow(['id', 'username', 'text', 'location'])
csvFile.close()

def append_to_csv(json_response, fileName):
    #A counter variable
    counter = 0
    #Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    #Loop through each tweet
    for tweet in json_response['data']:
        tweet_id = tweet['id']
        text = tweet['text']
        username = tweet['username']
        if ('location' in tweet):
            location = tweet['location']
        else:
            location = " "
        # Assemble all data in a list
        res = [tweet_id, username, text, location]
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1
    # When done, close the CSV file
    csvFile.close()
    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)

append_to_csv(json_response, "data.csv")
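If your responses don't carry a username directly on each tweet object, an alternative (a sketch, assuming the payload shape shown in the question, where each entry in includes.users has an id matching a tweet's author_id) is to build a user lookup once and join on author_id:

import csv

def append_to_csv(json_response, fileName):
    # Map user id -> user object so each tweet can find its author
    users_by_id = {user['id']: user for user in json_response['includes']['users']}
    counter = 0
    with open(fileName, "a", newline="", encoding='utf-8') as csvFile:
        csvWriter = csv.writer(csvFile)
        for tweet in json_response['data']:
            user = users_by_id.get(tweet['author_id'], {})
            # One row per tweet, with the matching user's details
            csvWriter.writerow([tweet['id'],
                                user.get('username', ''),
                                tweet['text'],
                                user.get('location', ' ')])
            counter += 1
    print("# of Tweets added from this response: ", counter)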

Instaloader JSON files: Convert 200 JSON files into a Single CSV (Python 3.7)

I want to automatically download pictures (or videos) along with their captions and other data from a specific Instagram hashtag (e.g. #moodoftheday) using Instaloader. Instaloader returns JSON files including post metadata.
The following code worked, but only for a single user profile's metadata.
I want to do the same, but for a #hashtag, not a specific #user.
The ultimate goal is to have all of the JSON files (e.g. 200) in a single csv file.
How can I process my downloaded data into a clean Excel/CSV file?
Here is my code:
# Install Instaloader
import instaloader

def get_instagram_posts(username, startdate, enddate):
    # Create an instaloader object with parameters
    L = instaloader.Instaloader(download_pictures=False, download_videos=False, download_comments=False, compress_json=False)
    # Log in with the instaloader object
    L.login("username", "password")
    # Search the instagram profile
    profile = instaloader.Profile.from_username(L.context, username)
    # Scrape the posts
    posts = profile.get_posts()
    for post in takewhile(lambda p: p.date > startdate, dropwhile(lambda p: p.date > enddate, posts)):
        print(post.date)
        L.download_post(post, target=profile.username)

'''
This function will now save all instagram posts and related data to a folder in your current working directory.
Let's call this function on the instagram account of "moodoftheday" and let the script do its magic.
This might take a while, so be patient.
'''
import os
import datetime

# instagram username
username = "realdonaldtrump"
# daterange of scraping
startdate = datetime(2020, 9, 1)
enddate = datetime(2020, 10, 1)
# get your current working directory
current_wkdir = os.get_cwd()
# Call the function. This will automatically store all the scrape data in a folder in your current working directory
get_instagram_posts(username, startdate, enddate)

'''
You notice that this data is NOT yet in the right format, since each post has a separate json file.
You will need to process all these json files into a consolidated excel file in order to perform analyses on the data.
'''
def parse_instafiles(username, path):
    """
    This function loads in all the json files generated by the instaloader package and parses it into a csv file.
    """
    #print('Entering provided directory...')
    os.chdir(os.path.join(path, username))
    columns = ['filename', 'datetime', 'type', 'locations_id', 'locations_name', 'mentions', 'hashtags', 'video_duration']
    dataframe = pd.DataFrame(columns=[])
    #print('Traversing file tree...')
    glob('*UTC.json')
    for file in glob('*UTC.json'):
        with open(file, 'r') as filecontent:
            filename = filecontent.name
            #print('Found JSON file: ' + filename + '. Loading...')
            try:
                metadata = orjson.loads(filecontent.read())
            except IOError as e:
                #print("I/O Error. Couldn't load file. Trying the next one...")
                continue
            else:
                pass
            #print('Collecting relevant metadata...')
            time = datetime.fromtimestamp(int(metadata['node']['taken_at_timestamp']))
            type_ = metadata['node']['__typename']
            likes = metadata['node']['edge_media_preview_like']['count']
            comments = metadata['node']['edge_media_to_comment']['count']
            username = metadata['node']['owner']['username']
            followers = metadata['node']['owner']['edge_followed_by']['count']
            try:
                text = metadata['node']['edge_media_to_caption']['edges'][0]['node']['text']
            except:
                text = ""
            try:
                post_id = metadata['node']['id']
            except:
                post_id = ""
            minedata = {'filename': filename, 'time': time, 'text': text,
                        'likes': likes, 'comments': comments, 'username': username, 'followers': followers, 'post_id': post_id}
            #print('Writing to dataframe...')
            dataframe = dataframe.append(minedata, ignore_index=True)
            #print('Closing file...')
            del metadata
            filecontent.close()
    #print('Storing dataframe to CSV file...')
    #print('Done.')
    dataframe['source'] = 'Instagram'
    return dataframe
'''
You can then use this function to process the "moodoftheday" Instagram data.
'''
df_instagram = parse_instafiles(username, os.getcwd() )
df_instagram.to_excel("moodoftheday.csv")
I am very new to Python and programming overall, therefore any help is very much appreciated!!
Thank you in advance! Sofia
I made some changes; it's not showing errors anymore, but it still needs some professional work:
import instaloader
from datetime import datetime
import datetime
from itertools import takewhile
from itertools import dropwhile
import os
import glob as glob
import json
import pandas as pd
import csv

lusername = ''
lpassword = ''

def get_instagram_posts(username, startdate, enddate):
    # Create an instaloader object with parameters
    L = instaloader.Instaloader(download_pictures=False, download_videos=False, download_comments=False, compress_json=False)
    # Log in with the instaloader object
    L.login(lusername, lpassword)
    # Search the instagram profile
    profile = instaloader.Profile.from_username(L.context, username)
    # Scrape the posts
    posts = profile.get_posts()
    for post in takewhile(lambda p: p.date > startdate, dropwhile(lambda p: p.date > enddate, posts)):
        print(post.date)
        L.download_post(post, target=profile.username)

# instagram username
username = "realdonaldtrump"
# daterange of scraping
startdate = datetime.datetime(2020, 9, 1, 0, 0)
enddate = datetime.datetime(2022, 2, 1, 0, 0)
# get your current working directory
current_wkdir = os.getcwd()
# Call the function. This will automatically store all the scrape data in a folder in your current working directory
get_instagram_posts(username, startdate, enddate)
def parse_instafiles(username, path):
    #print('Entering provided directory...')
    os.chdir(os.path.join(path, username))
    columns = ['filename', 'datetime', 'type', 'locations_id', 'locations_name', 'mentions', 'hashtags', 'video_duration']
    dataframe = pd.DataFrame(columns=[])
    #print('Traversing file tree...')
    # glob('*UTC.json')
    for file in glob.glob('*UTC.json'):
        with open(file, 'r') as filecontent:
            filename = filecontent.name
            #print('Found JSON file: ' + filename + '. Loading...')
            try:
                metadata = json.load(filecontent)
            except IOError as e:
                #print("I/O Error. Couldn't load file. Trying the next one...")
                continue
            else:
                pass
            #print('Collecting relevant metadata...')
            time = datetime.datetime.fromtimestamp(int(metadata['node']['taken_at_timestamp']))
            type_ = metadata['node']['__typename']
            likes = metadata['node']['edge_media_preview_like']['count']
            comments = metadata['node']['edge_media_to_comment']['count']
            username = metadata['node']['owner']['username']
            followers = metadata['node']['owner']['edge_followed_by']['count']
            try:
                text = metadata['node']['edge_media_to_caption']['edges'][0]['node']['text']
            except:
                text = ""
            try:
                post_id = metadata['node']['id']
            except:
                post_id = ""
            minedata = {'filename': filename, 'time': time, 'text': text,
                        'likes': likes, 'comments': comments, 'username': username, 'followers': followers, 'post_id': post_id}
            #print('Writing to dataframe...')
            dataframe = dataframe.append(minedata, ignore_index=True)
            #print('Closing file...')
            del metadata
            filecontent.close()
    #print('Storing dataframe to CSV file...')
    #print('Done.')
    dataframe['source'] = 'Instagram'
    return dataframe
'''
You can then use this function to process the "moodoftheday" Instagram data.
'''
df_instagram = parse_instafiles(username, os.getcwd() )
df_instagram.to_csv("moodoftheday.csv")
Instaloader has an example of hashtag search in its documentation; here's the code:
from datetime import datetime
import instaloader

L = instaloader.Instaloader()

posts = instaloader.Hashtag.from_name(L.context, "urbanphotography").get_posts()

SINCE = datetime(2020, 5, 10)  # further from today, inclusive
UNTIL = datetime(2020, 5, 11)  # closer to today, not inclusive

k = 0  # initiate k
#k_list = []  # uncomment this to tune k

for post in posts:
    postdate = post.date
    if postdate > UNTIL:
        continue
    elif postdate <= SINCE:
        k += 1
        if k == 50:
            break
        else:
            continue
    else:
        L.download_post(post, "#urbanphotography")
        # if you want to tune k, uncomment below to get your k max
        #k_list.append(k)
        k = 0  # set k to 0

#max(k_list)
Here's the link for more info:
https://instaloader.github.io/codesnippets.html
I'm trying to do something similar, but I'm still very new to programming, so I'm sorry if I can't offer much help.
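To get from the downloaded hashtag posts to a single CSV, one possible follow-up (a sketch, assuming the snippet above wrote its per-post JSON files into a "#urbanphotography" folder in your working directory and that the parse_instafiles function from earlier in this thread is defined) would be:

import os

# Parse every per-post JSON file in the "#urbanphotography" folder into one DataFrame
df_hashtag = parse_instafiles("#urbanphotography", os.getcwd())

# Consolidate everything into a single CSV file
df_hashtag.to_csv("urbanphotography.csv", index=False)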

Inverse of Pandas json_normalize

I just discovered the json_normalize function, which works great in taking a JSON object and giving me a pandas DataFrame. Now I want the reverse operation, which takes that same DataFrame and gives me a json (or json-like dictionary which I can easily turn into json) with the same structure as the original json.
Here's an example: https://hackersandslackers.com/json-into-pandas-dataframes/.
They take a JSON object (or JSON-like python dictionary) and turn it into a dataframe, but I now want to take that dataframe and turn it back into a JSON-like dictionary (to later dump to a json file).
I implemented it with a couple of functions:
def set_for_keys(my_dict, key_arr, val):
    """
    Set val at path in my_dict defined by the string (or serializable object) array key_arr
    """
    current = my_dict
    for i in range(len(key_arr)):
        key = key_arr[i]
        if key not in current:
            if i == len(key_arr) - 1:
                current[key] = val
            else:
                current[key] = {}
        else:
            if type(current[key]) is not dict:
                print("Given dictionary is not compatible with key structure requested")
                raise ValueError("Dictionary key already occupied")
        current = current[key]
    return my_dict

def to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for idx, val in row.iteritems():
            keys = idx.split(sep)
            parsed_row = set_for_keys(parsed_row, keys, val)
        result.append(parsed_row)
    return result

#Where df was parsed from json-dict using json_normalize
to_formatted_json(df, sep=".")
A simpler approach:
Uses only 1 function...
def df_to_formatted_json(df, sep="."):
    """
    The opposite of json_normalize
    """
    result = []
    for idx, row in df.iterrows():
        parsed_row = {}
        for col_label, v in row.items():
            keys = col_label.split(sep)
            current = parsed_row
            for i, k in enumerate(keys):
                if i == len(keys) - 1:
                    current[k] = v
                else:
                    if k not in current.keys():
                        current[k] = {}
                    current = current[k]
        # save
        result.append(parsed_row)
    return result
df.to_json(path)
or
df.to_dict()
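A quick illustration of those built-ins (my own example, assuming a df produced by json_normalize; note the column names stay flattened rather than being re-nested):

import pandas as pd

df = pd.json_normalize({"a": {"b": 1}, "c": 2})
print(df.to_dict(orient="records"))      # keys stay flat, e.g. 'a.b'
df.to_json("out.json", orient="records")  # same flat keys in the written file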
I just implemented this using 2 functions.
1. Get a full list of fields from the DataFrame that are part of a nested field. Keep only the parent, i.e. if location.city.code fits the criteria, we only care about location.city. Sort it by the deepest level of nesting, i.e. location.city is nested further than location.
2. Starting with the deepest nested parent field, find all child fields by searching in the column names. Create a field in the DataFrame for the parent field, which is a combination of all child fields (renamed so that they lose the nesting structure, e.g. location.city.code becomes code) converted to JSON and then loaded into a dictionary value. Finally, drop all of the child fields.
import json
from typing import List

import pandas as pd

def _get_nested_fields(df: pd.DataFrame) -> List[str]:
    """Return a list of nested fields, sorted by the deepest level of nesting first."""
    nested_fields = [*{field.rsplit(".", 1)[0] for field in df.columns if "." in field}]
    nested_fields.sort(key=lambda record: len(record.split(".")), reverse=True)
    return nested_fields

def df_denormalize(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convert a normalised DataFrame into a nested structure.
    Fields separated by '.' are considered part of a nested structure.
    """
    nested_fields = _get_nested_fields(df)
    for field in nested_fields:
        list_of_children = [column for column in df.columns if field in column]
        rename = {
            field_name: field_name.rsplit(".", 1)[1] for field_name in list_of_children
        }
        renamed_fields = df[list_of_children].rename(columns=rename)
        df[field] = json.loads(renamed_fields.to_json(orient="records"))
        df.drop(list_of_children, axis=1, inplace=True)
    return df
Let me throw in my two cents: after the backward conversion you might need to drop empty columns from your generated JSONs.
I therefore checked whether val != np.nan, but you can't do that directly; instead you need to check val == val, because np.nan != np.nan.
My version:
def to_formatted_json(df, sep="."):
    result = []
    for _, row in df.iterrows():
        parsed_row = {}
        for idx, val in row.iteritems():
            if val == val:
                keys = idx.split(sep)
                parsed_row = set_for_keys(parsed_row, keys, val)
        result.append(parsed_row)
    return result
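A quick illustration of the val == val trick used above:

import numpy as np

val = np.nan
print(val == val)  # False: NaN never equals itself, so NaN cells are skipped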
This is a solution that works for me. It is designed to work on a dataframe with one row, but it can easily be looped over larger dataframes.
import numpy as np
import pandas as pd

class JsonRecreate():
    def __init__(self, df):
        self.df = df

    def pandas_to_json(self):
        df = self.df
        # determine the number of nesting levels
        number_levels = np.max([len(i.split('.')) for i in df.columns])
        # put all the nesting levels in a list
        levels = []
        for level_idx in np.arange(number_levels):
            levels.append(np.array([i.split('.')[level_idx] if len(i.split('.')) > level_idx else ''
                                    for i in df.columns.tolist()]))
        self.levels = levels
        return self.create_dict(upper_bound=self.levels[0].shape[0])

    def create_dict(self, level_idx=0, lower_bound=0, upper_bound=100):
        ''' Function to create the dictionary starting from a pandas dataframe generated by json_normalize '''
        levels = self.levels
        dict_ = {}
        # current nesting level
        level = levels[level_idx]
        # loop over all the relevant elements of the level (relevant w.r.t. its parent)
        for key in [i for i in np.unique(level[lower_bound: upper_bound]) if i != '']:
            # find where a particular key occurs in the level
            correspondence = np.where(level[lower_bound: upper_bound] == key)[0] + lower_bound
            # check if the value(s) corresponding to the key appears once (or multiple times)
            if correspondence.shape[0] == 1:
                # if the occurrence is unique, append the value to the dictionary
                dict_[key] = self.df.values[0][correspondence[0]]
            else:
                # otherwise, redefine the relevant bounds and call the function recursively
                lower_bound_, upper_bound_ = correspondence.min(), correspondence.max() + 1
                dict_[key] = self.create_dict(level_idx + 1, lower_bound_, upper_bound_)
        return dict_
I tested it with a simple dataframe such as:
df = pd.DataFrame({'a.b': [1], 'a.c.d': [2], 'a.c.e': [3], 'a.z.h1': [-1], 'a.z.h2': [-2], 'f': [4], 'g.h': [5], 'g.i.l': [6], 'g.i.m': [7], 'g.z.h1': [-3], 'g.z.h2': [-4]})
The key order of the original json is not exactly preserved in the resulting json, but that can easily be handled if needed.
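A quick usage sketch (assuming the class and the test dataframe above):

result = JsonRecreate(df).pandas_to_json()
print(result)
# e.g. {'a': {'b': 1, 'c': {'d': 2, 'e': 3}, 'z': {'h1': -1, 'h2': -2}},
#       'f': 4, 'g': {'h': 5, 'i': {'l': 6, 'm': 7}, 'z': {'h1': -3, 'h2': -4}}}

# For a dataframe with several rows, loop over single-row slices
results = [JsonRecreate(df.iloc[[i]]).pandas_to_json() for i in range(len(df))]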

Python: Append JSON objects to nested list

I'm trying to iterate through a list of IP addresses, extract the JSON data from my url, and put that JSON data into a nested list.
It seems as if my code is overwriting my list over and over, and it will only show one JSON object instead of the many I have specified.
Here's my code:
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    camera_details = [[i['name'], i['serial']] for i in json_obj['cameras']]

for x in camera_details:
    #This only prints one object, and not 10.
    print x
How can I append my JSON objects into a list, and then extract the 'name' and 'serial' values into a nested list?
try this
camera_details = []
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    camera_details.extend([[i['name'], i['serial']] for i in json_obj['cameras']])

for x in camera_details:
    print x
In your code you were only getting the last request's data.
Best would be to use append and avoid the list comprehension:
camera_details = []
for x in range(0, 10):
    try:
        url = 'http://' + ip_addr[x][0] + ':8080/system/ids/'
        response = urlopen(url)
        json_obj = json.load(response)
    except:
        continue
    for i in json_obj['cameras']:
        camera_details.append([i['name'], i['serial']])

for x in camera_details:
    print x
Try breaking up your code into smaller, easier to digest parts. This will help you to diagnose what's going on.
camera_details = []
for obj in json_obj['cameras']:
    if 'name' in obj and 'serial' in obj:
        camera_details.append([obj['name'], obj['serial']])

Issue with defining and executing the function in Python

The data I am scraping using BeautifulSoup contains a device name field, and device names have colors mentioned in them, e.g. Lumia 800 Black. I want to create a new column which contains this color.
I want to search each device name for any color against a list of colors, and if a color is present in the device name, I want to remove that color from the device name and put it in a new column named Color.
I am using the code referred to below to accomplish this: I am creating a function named color and trying to search the device name string for the presence of a color, and if one is present, I am trying to feed that color into a new variable named color_column. But my output csv is not returning any values at all. It is empty.
Please check the referred code below:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import urllib
import time
import mechanize
import cookielib
from bs4 import BeautifulSoup
from itertools import islice
colors = ["Black","Gray"]
def color(arg):
    for colors_1 in colors:
        if arg.find(colors_1) == -1:
            return color_column == ""
        return color_column == colors_1
url = 'http://www.t-mobile.com/shop/phones/default.aspx?shape=smartphones'
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)'
values = {
'Phones':'MBBDevice',
'__ASYNCPOST':'true',
'__EVENTARGUMENT':'',
'__EVENTTARGET':'pgrTop$lnkPageShowAll',
'__LASTFOCUS':'',
'__VIEWSTATE':'/wEPDwULLTE1NTE5NDk1ODIPFgIeEEN1cnJlbnRQYWdlSW5kZXgCARYCAgEPZBYCAgEPZBYCAgEPZBYCZg9kFgICAQ9kFhgCCg9kFgJmD2QWAmYPZBYCZg8UKwACZDKJBAABAAAA/////wEAAAAAAAAADAIAAABfVE1vYmlsZS5XZWIuVE1vYmlsZURvdENvbS5VSS5XZWJDb250cm9scywgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwFAQAAAEFUTW9iaWxlLldlYi5UTW9iaWxlRG90Q29tLlVJLldlYkNvbnRyb2xzLkJyZWFkQ3J1bWJJdGVtQ29sbGVjdGlvbgEAAAATQ29sbGVjdGlvbkJhc2UrbGlzdAMcU3lzdGVtLkNvbGxlY3Rpb25zLkFycmF5TGlzdAIAAAAJAwAAAAQDAAAAHFN5c3RlbS5Db2xsZWN0aW9ucy5BcnJheUxpc3QDAAAABl9pdGVtcwVfc2l6ZQhfdmVyc2lvbgUAAAgICQQAAAACAAAABQAAABAEAAAABAAAAAkFAAAACQYAAAANAgUFAAAAN1RNb2JpbGUuV2ViLlRNb2JpbGVEb3RDb20uVUkuV2ViQ29udHJvbHMuQnJlYWRDcnVtYkl0ZW0DAAAABV90ZXh0BF91cmwJX3Nob3dMaW5rAQEAAQIAAAAGBwAAAARIb21lBggAAAAAAQEGAAAABQAAAAYJAAAAGVNtYXJ0cGhvbmVzICYgQ2VsbCBQaG9uZXMGCgAAAAtzaG9wL3Bob25lcwELZAIMD2QWAgIDDxYCHgxIdG1sT3ZlcnJpZGUFkwI8aW1nIHN0eWxlPSJGTE9BVDogcmlnaHQ7IENVUlNPUjogcG9pbnRlciEgaW1wb3J0YW50IiBvbmNsaWNrPSJqYXZhc2NyaXB0OnBvcFVwKCAnL3RlbXBsYXRlcy9wb3B1cC5hc3B4P1BBc3NldD1TaHBfUGhuX3NoaXBwaW5nRGV0YWlscycsICczNDAnLCAnNTY4JywgJzQ1JywgJzMwJywgJzAnLCAnMCcsICcxJyApIiBhbHQ9IkZyZWUgU2hpcHBpbmcgb24gYWxsIGNlbGwgcGhvbmVzIGFuZCBkZXZpY2VzLiIgc3JjPSIuLi9pbWFnZXMvZnJlZV9zaGlwcGluZy1iYW5uZXIuZ2lmIiAvPmQCDg8PFgIeB1Zpc2libGVoZGQCGA9kFgJmD2QWAmYPZBYCZg9kFggCAQ9kFgICAQ8QDxYEHgdDaGVja2VkaB4HRW5hYmxlZGgWAh4LbWFrZWVuYWJsZWQFBWZhbHNlZGRkAgUPZBYCAgEPEA9kFgIfBQUEdHJ1ZWRkZAIHD2QWAgIBDxAPZBYCHwUFBHRydWVkZGQCCQ9kFgICAQ8QD2QWAh8FBQR0cnVlZGRkAhoPZBYCZg9kFgJmD2QWAmYPZBYEAgMPZBYCAgEPEA9kFgIfBQUEdHJ1ZWRkZAIFD2QWAgIBDxAPFgIeBFRleHQF2AU8dGFibGUgaGVpZ2h0PSIxNSIgY2VsbHNwYWNpbmc9IjAiIGNlbGxwYWRkaW5nPSIwIiB3aWR0aD0iNzciIGJvcmRlcj0iMCI+CiAgICAgIDx0Ym9keT4KICAgICAgICA8dHI+CiAgICAgICAgICA8dGQgY2xhc3M9InJlZnVyYmlzaGVkIj5SZWZ1cmJpc2hlZDwvdGQ+CgogICAgICAgICAgPHRkIGNsYXNzPSJyZWZ1cmJpc2hlZCI+CiAgICAgICAgICAgIDxkaXYgb25tb3VzZW92ZXI9ImphdmFzY3JpcHQ6ZGlzcENPQkRlc2MoKTsiIHN0eWxlPSJGTE9BVDogbGVmdCIgb25tb3VzZW91dD0iamF2YXNjcmlwdDpoaWRlQ09CRGVzYygpOyIgcnVuYXQ9InNlcnZlciI+CiAgICAgICAgICAgICAgPGltZyBzcmM9Ii9pbWFnZXMvaWNvbl9oZWxwLmdpZiIgLz4gPGRpdiBjbGFzcz0idG9vbHRpcCIgaWQ9ImRpdkNPQkRlc2NyaXB0aW9uIiBzdHlsZT0iRElTUExBWTogbm9uZSI+CiAgICAgIDxkaXYgY2xhc3M9InRvb2x0aXAtYnRtLWJrZyI+CiAgICAgICAgPGRpdiBjbGFzcz0idG9vbHRpcC1jb250YWluZXIiPgogICAgICAgICAgR2V0IGEgZ3JlYXQgdmFsdWUgb24gYSBsaWtlLW5ldyBwaG9uZQogICAgICAgICAgPGJyIC8+CiAgICAgICAgICAgd2l0aCBhIDkwLWRheSB3YXJyYW50eS4KICAgICAgICA8L2Rpdj4KICAgICAgPC9kaXY+CiAgICA8L2Rpdj4KICAgICAgICAgICAgPC9kaXY+CiAgICAgICAgICA8L3RkPgogICAgICAgIDwvdHI+CiAgICAgIDwvdGJvZHk+CiAgICA8L3RhYmxlPhYCHwUFBHRydWVkZGQCIA8WAh4Fc3R5bGUFDmRpc3BsYXk6YmxvY2s7FgJmD2QWAmYPZBYCZg9kFgYCAw9kFgICAQ8QD2QWAh8FBQR0cnVlZGRkAgUPZBYCAgEPEA9kFgIfBQUEdHJ1ZWRkZAIHD2QWAgIBDxAPZBYCHwUFBHRydWVkZGQCKg9kFgJmD2QWAmYPZBYEZg8PFgIfAmcWAh4HT25DbGljawUKQ2xlYXJJRFMoKWQCAQ8PZBYCHwgFCkNsZWFySURTKClkAi4PZBYCZg9kFgJmD2QWAgIKD2QWCAIBDw8WAh8CaGRkAgMPFgIeCl9QYWdlQ291bnQCBBYGAgIPFgIfAmhkAgcPD2QWAh8HBQxkaXNwbGF5Om5vbmVkAggPDxYCHwJnZGQCBw8WAh8JAgQWBgICDxYCHwJoZAIIDw9kFgIfBwUMZGlzcGxheTpub25lZAIJDw8WAh8CZ2RkAgsPFgIfAmhkAjAPFgIeE0Ntc0NvbGxlY3Rpb25TdHJpbmdlZAI0D2QWAmYPZBYCZg9kFgQCAQ8WAh4MQ21zQXNzZXROYW1lBRVUb3V0X0ZBUV9EZXZBbGxQaG9uZXNkAgQPFgIfCgUPdG91dF9odG1sX2xvZ2luZAI2D2QWBGYPZBYCZg9kFgJmDxYCHwJoZAIBD2QWAmYPZBYCZg8WAh8LBRJzaHBfcGhuX2xlZ2FsTm90ZXNkAjgPDxYCHhxUaXRsZXBvcHVwUGxhbkNoYW5nZVJlcXVpcmVkZWQWBAIPDxYCHwJoZAITDxYCHwJoZBgBBR5fX0NvbnRyb2xzUmVxdWlyZVBvc3RCYWNrS2V5X18WNAUJTUJCRGV2aWNlBQ1QcmVQYWlkUGhvbmVzBQ1QcmVQYWlkUGhvbmVzBSFyZXBQcmljZVJhbmdlJGN0bDAwJGNoa1ByaWNlUmFuZ2UFDmNoa05ld0Fycml2YWxzBQ9jaGtXZWJPbmx5RGVhbHMFEmNoa1dlYk9ubHlQcm9kdWN0cwUP
Y2hrTmV3Q29uZGl0aW9uBQZjaGtDT0IFFnJlcFR5cGVzJGN0bDAwJGNoa1R5cGUFFnJlcFR5cGVzJGN0bDAyJGNoa1R5cGUFFnJlcFR5cGVzJGN0bDA0JGNoa1R5cGUFFnJlcFR5cGVzJGN0bDA1JGNoa1R5cGUFFnJlcFR5cGVzJGN0bDA2JGNoa1R5cGUFDGNoa0FuZHJvaWRPUwUPY2hrQmxhY2tCZXJyeU9TBQhjaGtXaW5PUwUgcmVwRmVhdHVyZUZpbHRlciRjdGwwMCRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMDEkY2hrRmlsdGVyBSByZXBGZWF0dXJlRmlsdGVyJGN0bDAyJGNoa0ZpbHRlcgUgcmVwRmVhdHVyZUZpbHRlciRjdGwwMyRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMDQkY2hrRmlsdGVyBSByZXBGZWF0dXJlRmlsdGVyJGN0bDA1JGNoa0ZpbHRlcgUgcmVwRmVhdHVyZUZpbHRlciRjdGwwNiRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMDckY2hrRmlsdGVyBSByZXBGZWF0dXJlRmlsdGVyJGN0bDA4JGNoa0ZpbHRlcgUgcmVwRmVhdHVyZUZpbHRlciRjdGwwOSRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMTAkY2hrRmlsdGVyBSByZXBGZWF0dXJlRmlsdGVyJGN0bDExJGNoa0ZpbHRlcgUgcmVwRmVhdHVyZUZpbHRlciRjdGwxMiRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMTMkY2hrRmlsdGVyBSByZXBGZWF0dXJlRmlsdGVyJGN0bDE0JGNoa0ZpbHRlcgUgcmVwRmVhdHVyZUZpbHRlciRjdGwxNSRjaGtGaWx0ZXIFIHJlcEZlYXR1cmVGaWx0ZXIkY3RsMTYkY2hrRmlsdGVyBSdyZXBNYW51ZmFjdHVyZXJzJGN0bDAwJGNoa01hbnVmYWN0dXJlcnMFJ3JlcE1hbnVmYWN0dXJlcnMkY3RsMDEkY2hrTWFudWZhY3R1cmVycwUncmVwTWFudWZhY3R1cmVycyRjdGwwMiRjaGtNYW51ZmFjdHVyZXJzBSdyZXBNYW51ZmFjdHVyZXJzJGN0bDA0JGNoa01hbnVmYWN0dXJlcnMFJ3JlcE1hbnVmYWN0dXJlcnMkY3RsMDUkY2hrTWFudWZhY3R1cmVycwUncmVwTWFudWZhY3R1cmVycyRjdGwwNiRjaGtNYW51ZmFjdHVyZXJzBSdyZXBNYW51ZmFjdHVyZXJzJGN0bDA3JGNoa01hbnVmYWN0dXJlcnMFJ3JlcE1hbnVmYWN0dXJlcnMkY3RsMDgkY2hrTWFudWZhY3R1cmVycwUabXJwUGhvbmVzJGN0bDAwJGNoa0NvbXBhcmUFGm1ycFBob25lcyRjdGwwMiRjaGtDb21wYXJlBRptcnBQaG9uZXMkY3RsMDQkY2hrQ29tcGFyZQUabXJwUGhvbmVzJGN0bDA2JGNoa0NvbXBhcmUFGm1ycFBob25lcyRjdGwwOCRjaGtDb21wYXJlBRptcnBQaG9uZXMkY3RsMTAkY2hrQ29tcGFyZQUabXJwUGhvbmVzJGN0bDEyJGNoa0NvbXBhcmUFGm1ycFBob25lcyRjdGwxNCRjaGtDb21wYXJlBRptcnBQaG9uZXMkY3RsMTYkY2hrQ29tcGFyZQUabXJwUGhvbmVzJGN0bDE4JGNoa0NvbXBhcmVnDy0KUN8keEvS5/wEmJXssTUSNw==',
'ctl09':'ctl13|pgrTop$lnkPageShowAll',
'ddlSort':'0',
'hdnBlackBerryID':'3c2c3562-aa1c-4fe4-a0ca-da5dd8e4bd84',
'hdnCapCode':'',
'hdnDeviceId':'',
'hdnFeature':'',
'hdnFeatureNames':'',
'hdnFilter':'',
'hdnIsPricingOptionLockedB':'false',
'hdnLocationParameter':'',
'hdnManufacturer':'',
'hdnManufacturerID':'',
'hdnManufacturerNames':'',
'hdnOtherFilters':'',
'hdnPageIndex':'',
'hdnPriceRange':'',
'hdnPriceRangeText':'',
'hdnProductType':'GSM',
'hdnSelectedDeviceId':'',
'hdnSelections':'',
'hdnSortFilter':'0',
'hdnTitle':'',
'hdnType':'smp,',
'hdnTypeNames':'Smartphone|',
'popupPlanChangeRequired$hdnDeviceID':'',
'popupPlanChangeRequired$hdnFamilyID':'',
'popupPlanChangeRequired$hiddenImagePath':'',
'repTypes$ctl05$chkType':'on',
'txtSelectedDevices':'0',
'txtSelectedFeatures':'0'}
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
page = response.read()
soup = BeautifulSoup(page)
with open('tmob_colortest.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    items = soup.findAll('div', {"class": "phonename"}, text=colors)
    prices = soup.findAll('p', {"class": "totalitemprice"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 0, 2, 1))
        textcontent2 = u' '.join(price.stripped_strings)
        name_1 = unicode(textcontent).encode('utf8').replace('Nexus 4','LG Nexus 4').replace(' T-Mobile Refurbished Device','').replace('™','').replace('®','').replace(' ›','').replace("NEW! ","").replace(" Web-only offer -- now thru Thu 1/3/13","").replace(" Web-only offer!","").strip()
        oem = list(name_1)
        pos = oem.index(' ')
        if name_1.find('Refurbished') == -1:
            name = name_1
            refur = "N"
        else:
            name = name_1.replace("Refurbished","").replace(" -","")
            refur = "Y"
        spamwriter.writerow(["US", "T-Mobile",
                             name[0:pos], name, refur, color_column,
                             "24 Months", "$", unicode(textcontent2).encode('utf8').replace("FREE","0").replace('$','')])
Please help me to solve this issue and pardon my ignorance as I am new to coding.
You never actually use your function, so color_column is never filled.
What you want to do is make your function return the changed product name, and the color detected, as two separate values:
def handle_color(arg):
    for col in colors:
        if col.lower() not in arg.lower():
            continue
        # color found, remove it from arg (case insensitively)
        start = arg.lower().index(col.lower())
        arg = arg[:start] + arg[start + len(col):]
        return arg, col
    # No matching color found, return arg unchanged and an empty value for the color
    return arg, ''
Now all you have to do is call this function and unpack its return value into two variables for your CSV:
name, color_column = handle_color(name)
and color_column will either be an empty value or the matched color (now removed from name).
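For example, with the colors list from the question, a quick check (illustrative only):

colors = ["Black", "Gray"]

name, color_column = handle_color("Lumia 800 Black")
print(repr(name), repr(color_column))  # 'Lumia 800 ' 'Black'

# strip the trailing space left where the color was removed
name = name.strip()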