Issue outputting data scraped with BeautifulSoup into two CSV columns using spamwriter.writerow

I am scraping 2 sets of data from a website using BeautifulSoup and I want them to appear in a CSV file in 2 columns, side by side. I am using spamwriter.writerow([x, y]) for this, but I think because of some error in my loop structure I am getting the wrong output in my CSV file. Below is the referred code:
import csv
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_20decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for anchor in soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True):
        if anchor.string:
            print unicode(anchor.string).encode('utf8').strip()
    for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
        textcontent = u' '.join(anchor1.stripped_strings)
        if textcontent:
            print textcontent
            spamwriter.writerow([unicode(anchor.string).encode('utf8').strip(), textcontent])
The output I am getting in the CSV is:
Samsung Focus® 2 (Refurbished) $99.99
Samsung Focus® 2 (Refurbished) $99.99 to $199.99 8 to 16 GB
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $149.99 to $349.99 16 to 64 GB
The problem is that I am getting only one device name in column 1 instead of all of them, while the price is coming through for all devices.
Please pardon my ignorance, as I am new to programming.

You are using anchor.string instead of anchor1. anchor is the last item from the previous loop, not the item in the current loop.
Clearer variable names might help avoid confusion here; perhaps singleitem and gridprice?
It could be I misunderstood though and you want to combine each anchor1 with a corresponding anchor. You'll have to loop over them together, perhaps using zip():
items = soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})

for item, price in zip(items, prices):
    textcontent = u' '.join(price.stripped_strings)
    if textcontent:
        print textcontent
        spamwriter.writerow([unicode(item.string).encode('utf8').strip(), textcontent])
Normally it should be easier to loop over the parent table row instead, then find the cells within that row within a loop. But the zip() should work too, provided the clickStreamSingleItem cells line up with the listGrid-price matches.
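A sketch of that row-based approach, assuming each phone sits in one parent container per product (the listItem class name here is a guess; inspect the page source for the real one):

# Hypothetical per-product container; check the page for the actual class name.
for row in soup.findAll('div', {'class': 'listItem'}):
    item = row.find('a', {'class': 'clickStreamSingleItem'})
    price = row.find('div', {'class': 'listGrid-price'})
    if item and price:  # skip rows missing either piece
        name = unicode(item.string).encode('utf8').strip()
        textcontent = u' '.join(price.stripped_strings)
        spamwriter.writerow([name, textcontent])

This keeps each name paired with the price from the same container, so one missing price cannot shift every later row the way zip() can.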

Related

selenium by_xpath not returning any results

I am using Selenium 4+, and I do not seem to get back any results when requesting elements in a div.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "search-key")))
# wait for the page to load
driver.implicitly_wait(10)
# search for the suppliers you want to message
search_input = driver.find_element(By.ID, "search-key")
search_input.send_keys("suppliers")
search_input.send_keys(Keys.RETURN)
# find all the supplier stores on the page
supplier_stores_div = driver.find_element(By.CLASS_NAME, "list--gallery--34TropR")
print(supplier_stores_div)
supplier_stores = supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")
print(supplier_stores)
The logging statements gave me <selenium.webdriver.remote.webelement.WebElement (session="a3be4d8c5620e760177247d5b8158823", element="5ae78693-4bae-4826-baa6-bd940fa9a41b")> and an empty list for the Element objects, []
The html code is here:
<div class="list--gallery--34TropR" data-spm="main" data-spm-max-idx="24">flex
<a class="v3--container--31q8BOL cards--gallery--2o6yJVt" href="(link)" target="_blank" style="text-align: left;" data-spm-anchor-id="a2g0o.productlist.main.1"> (some divs) </a>flex
That is just one <a> element; there are more.
Before scraping the supplier names, you have to scroll slowly down to the bottom of the page; only then are all the supplier names loaded. Try the code below:
driver.get("https://www.aliexpress.com/premium/supplier.html?spm=a2g0o.best.1000002.0&initiative_id=SB_20221218233848&dida=y")
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
driver.execute_script("window.scrollBy(0, 800);")
sleep(1)
new_height = driver.execute_script("return document.body.scrollHeight")
if new_height == last_height:
break
last_height = new_height
suppliers = driver.find_elements(By.XPATH, ".//*[#class='list--gallery--34TropR']//span/a")
print("Total no. of suppliers:", len(suppliers))
print("======")
for supplier in suppliers:
print(supplier.text)
Output:
Total no. of suppliers: 60
======
Reading Life Store
ASONSTEEL Store
Paper, ink, pen and inkstone Store
Custom Stationery Store
ZHOUYANG Official Store
WOWSOCOOL Store
IFPD Official Store
The 9 Store
QuanRun Store
...
...
It returns you the Element object.
find_elements() gives you a list, so to get the text you need to take a single element (for example with find_element) and read its .text property:
supplier_stores_div.find_element(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']").text
or, for getting an attribute (for example the href):
supplier_stores_div.find_element(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']").get_attribute("href")
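And a minimal loop over all matches from find_elements(), since .text only exists on the individual elements (reusing the locator from the question, assuming those class names are still current):

stores = supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")
for store in stores:
    # .text and .get_attribute() work per element, not on the list itself
    print(store.text, store.get_attribute("href"))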

How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example, I want to omit "1." and then the first "5" (as in the example below).
Additionally, I would like to write a condition that the letter "W" should be changed to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate the ResultSet and decompose() the unwanted tag:
e.select_one('.wld--tbd').decompose()
Extract the texts with stripped_strings and join() them into your expected string:
data.append(';'.join(e.stripped_strings))
Example
Also making some replacements based on a dict, just to demonstrate how this would work, without knowing what R or P stand for.
...
soup = BS2(page, 'html.parser')

data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W': 'WIN', 'R': 'RRR', 'P': 'PPP'}
    data.append(';'.join([pattern.get(i, i) for i in e.stripped_strings]))
data
To get the result for Ajax only:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W': 'WIN', 'R': 'RRR', 'P': 'PPP'}
    data.append(';'.join([pattern.get(i, i) for i in e.stripped_strings]))
data
Output
Based on actual data, it may differ from the question's example.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
You had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text. As you can see, that's not going to work. Here is a simple way to hack out what you want: I keep grabbing divs from the table div you found, and grab the text of the next eight divs after finding Ajax. Then I do some dirty string manipulation, because the WWWWW is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip field 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate values and drop the first '?'
res = res[:-1] + [i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content; you should put in a lot of error checking. I also replaced the sleep with a wait call and added webdriver-manager, which lets me use Selenium with Chrome.
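For the error checking mentioned above, an explicit wait on the table itself is usually sturdier than a blanket implicitly_wait. A minimal sketch, waiting on the same ui-table__body class the answers already use:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until the standings table is actually in the DOM, then grab the source.
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "ui-table__body")))
page = driver.page_source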

Downloading a spreadsheet from a link in Python

I am trying to download a spreadsheet from this website: https://www.theice.com/clear-singapore/risk-management#margin-rates. The file I want to download is called Margin Scanning (at the bottom of the page). Normally I would click the "Margin Scanning" link to download it, but I want to automate this with Python. Then I want the contents to become my dataframe so that I can amend it and save it to my drive. Any idea how I can get this done? I know how to save a table from the web, but I am not sure how to download a file from a link.
This is what I have tried:
import pandas as pd
import requests

url = "https://www.theice.com/publicdocs/clear_singapore/irmParameters/ICSG_MARGIN_SCANNING_20200702.CSV"
response = requests.get(url)
margin_scanning = pd.DataFrame()
margin_scanning = response.content
Working code (pandas version 1.0.3):
import pandas as pd
import requests, csv

url = "https://www.theice.com/publicdocs/clear_singapore/irmParameters/ICSG_MARGIN_SCANNING_20200702.CSV"
response = requests.get(url)
response = response.content.decode('utf-8')
cr = csv.reader(response.splitlines(), delimiter=',')
data = pd.DataFrame(cr, index=None)
print(data.head())
Output:
0 1 ... 29 30
0 Effective Date Exchange Code ... Position Allocation Margin Erosion
1 02-JUL-20 G ... No No
2 02-JUL-20 G ... No No
3 02-JUL-20 G ... No No
4 02-JUL-20 G ... No No
[5 rows x 31 columns]
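Note that the approach above keeps the header row as data row 0. pandas can also download and parse the CSV in a single call, turning the first row into proper column names (assuming the server keeps serving a plain CSV to non-browser clients; if it refuses, fall back to requests as above):

import pandas as pd

url = "https://www.theice.com/publicdocs/clear_singapore/irmParameters/ICSG_MARGIN_SCANNING_20200702.CSV"
data = pd.read_csv(url)  # pandas fetches the URL and parses it in one step
print(data.head())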

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percentage two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])   # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n', ''))
        print(parser.gold[target+9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
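If the fixed offsets (+2, +5, +9) ever break, a variant that tracks <td> tags directly is a bit sturdier while staying standard library only. A sketch, assuming the layout from the question holds (the price in the second <td> and the spot-change percent in the fourth):

from html.parser import HTMLParser

class TdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []  # text of every <td>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

parser = TdParser()
parser.feed(str(resulttext))
print(parser.cells[1], parser.cells[3])  # second and fourth <td> on the page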

python color entire pandas dataframe rows based on column values

I have a script that downloads a .csv, does some manipulation, and then emails pandas dataframes in a nice HTML format using df.to_html.
I would like to enhance these tables by highlighting, or coloring, different rows based on their text value in a specific column.
I tried using the pandas Styler, which appears to work; however, I cannot convert the result to HTML using to_html. I get "AttributeError: 'str' object has no attribute 'to_html'".
Is there another way to do this?
As an example lets say my DF looks like the following and I want to highlight all rows for each manufacturer. i.e Use three different colors for Ford, Chevy, and Dodge:
Year Color Manufacturer
2011 Red Ford
2010 Yellow Ford
2000 Blue Chevy
1983 Orange Dodge
I noticed I can pass formatters into to_html, but it appears they cannot do the coloring I am trying to accomplish. I would like to be able to do something like:
def colorred():
    return ['background-color: red']

def color_row(value):
    if value == "Ford":
        result = colorred()
        return result

df1.to_html("test.html", escape=False, formatters={"Manufacturer": color_row})
Surprised this was never answered, as looking back at it I do not believe this is even possible with to_html formatters. After revisiting this several times, I have found a very nice solution I am happy with. I have not seen anything close to this online, so I hope it helps someone else.
d = {'Year': [2011, 2010, 2000, 1983],
     'Color': ['Red', 'Yellow', 'Blue', 'Orange'],
     'Manufacturer': ['Ford', 'Ford', 'Chevy', 'Dodge']}
df = pd.DataFrame(d)
print(df)

def color_rows(s):
    df = s.copy()
    # Key:Value dictionary of Column Name:Color
    color_map = {}
    # Unique column values
    manufacturers = df['Manufacturer'].unique()
    colors_to_use = ['background-color: #ABB2B9', 'background-color: #EDBB99',
                     'background-color: #ABEBC6', 'background-color: #AED6F1']
    # Loop over our column values and associate one color with each
    for manufacturer in manufacturers:
        color_map[manufacturer] = colors_to_use[0]
        colors_to_use.pop(0)
    for index, row in df.iterrows():
        if row['Manufacturer'] in manufacturers:
            manufacturer = row['Manufacturer']
            # Get the color to use based on this row's Manufacturer value
            my_color = color_map[manufacturer]
            # Update the row using loc
            df.loc[index, :] = my_color
        else:
            df.loc[index, :] = 'background-color: '
    return df

df.style.apply(color_rows, axis=None)
Output:
[screenshot: Pandas row coloring]
Since I do not have the cred to embed images, here is how I email it: I convert it to HTML with the following.
styled = df.style.apply(color_rows, axis=None).set_table_styles(
    [{'selector': '.row_heading',
      'props': [('display', 'none')]},
     {'selector': '.blank.level0',
      'props': [('display', 'none')]}])
html = styled.render()
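On newer pandas (1.3 and later) Styler.render() is deprecated; Styler.to_html() produces the same markup and is the supported call:

html = styled.to_html()  # replacement for styled.render() on pandas >= 1.3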