Selenium By.XPATH not returning any results

I am using Selenium 4+, and I don't seem to get back any results when requesting the elements in a div.
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "search-key")))
driver.implicitly_wait(10)
# search for the suppliers you want to message
search_input = driver.find_element(By.ID, "search-key")
search_input.send_keys("suppliers")
search_input.send_keys(Keys.RETURN)
# find all the supplier stores on the page
supplier_stores_div = driver.find_element(By.CLASS_NAME, "list--gallery--34TropR")
print(supplier_stores_div)
supplier_stores = supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")
print(supplier_stores)
The logging statements gave me <selenium.webdriver.remote.webelement.WebElement (session="a3be4d8c5620e760177247d5b8158823", element="5ae78693-4bae-4826-baa6-bd940fa9a41b")> and an empty list for the Element objects, []
The html code is here:
<div class="list--gallery--34TropR" data-spm="main" data-spm-max-idx="24">
<a class="v3--container--31q8BOL cards--gallery--2o6yJVt" href="(link)" target="_blank" style="text-align: left;" data-spm-anchor-id="a2g0o.productlist.main.1"> (some divs) </a>
That is just one <a> element; there are more like it.

Before scraping the supplier names, you have to scroll down the page slowly to the bottom; only then can you get all the supplier names. Try the code below:
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.aliexpress.com/premium/supplier.html?spm=a2g0o.best.1000002.0&initiative_id=SB_20221218233848&dida=y")
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollBy(0, 800);")
    sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
suppliers = driver.find_elements(By.XPATH, ".//*[@class='list--gallery--34TropR']//span/a")
print("Total no. of suppliers:", len(suppliers))
print("======")
for supplier in suppliers:
    print(supplier.text)
Output:
Total no. of suppliers: 60
======
Reading Life Store
ASONSTEEL Store
Paper, ink, pen and inkstone Store
Custom Stationery Store
ZHOUYANG Official Store
WOWSOCOOL Store
IFPD Official Store
The 9 Store
QuanRun Store
...
...

It returns you the Element objects. find_elements() gives you a list, so to get the text you have to iterate over it and read each element's .text property:

for store in supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']"):
    print(store.text)

or, for getting an attribute (for example the href):

    print(store.get_attribute("href"))

Difficulties with web scraping

I have just come across an article called The 500 Greatest Songs of All Time and thought "oh that's cool, I bet they also made a Spotify/Apple Music list that I can follow". Well... they don't.
So, in a nutshell, I wonder if it's possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the list.
Songs' titles and authors are structured like this on the website:
(Website screenshot.) I have already tried to scrape the web with the importxml() formula in Google Sheets, but with no success.
I understand the scraping part is easier than the other and, as I am new to programming, I would be happy to manage to even partially achieve this goal. I am sure this task can be achieved easily in Python.
I feel like explaining everything would go beyond the scope here, so I tried to comment the code well enough.
1. Scrape the songs
I used Python 3 and Selenium; their website doesn't block that.
Be sure to adjust your chromedriver path, and the output path of the .txt file at the bottom if necessary. Once it's done and you have your .txt file, you can close it.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=s)

# just setting some vars, I used XPath because I know that
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "//button[@id='onetrust-accept-btn-handler']"
div_containing_links_xpath = "//div[@id='pmc-gallery-list-nav-bar-render']//child::a"
song_names_xpath = "//article[@class='c-gallery-vertical-album']/child::h2"
links = []
songs = []

driver.get(top_500)
# accept cookies, give it time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)

# extracting all the links, since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
for element in links_to_next_pages:
    l = element.get_attribute('href')
    links.append(l)

# extracting the songs, then going to the next page and so on until we hit 500
counter = 1  # we're starting with 1 here since links[0] is the current page we are already on
while True:
    song_elements = driver.find_elements(By.XPATH, song_names_xpath)
    for element in song_elements:
        s = element.text
        songs.append(s)
    if len(songs) == 500:
        break
    driver.get(links[counter])
    counter += 1
    time.sleep(2)

# verify that there are no duplicates; if there were, something would be off
if len(songs) != len(set(songs)):
    print('you f***** up')
else:
    print('seems fine')

with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
    file.writelines(line + '\n' for line in songs)
2. Prepare Spotify
Go to the Spotify Developer Dashboard and create an account (use your Spotify account).
Then create an app, call it whatever you want.
On your app, click settings and whitelist http://localhost:8888/callback
On your app, click "Users and Access" and add your Spotify account
Leave the tab open, we'll come back to it
3. Prepare Your Environment
You need Node.js, so make sure it is installed on your machine
Download this from Spotify's GitHub
Unzip it, cd into the folder and run npm install
Go into the authorization_code folder and open app.js in an editor
Find var scope and append ' playlist-modify-public' to the string; this is so that your app can access your Spotify playlists, see here
Now go back to the app in your Spotify Developer Dashboard; we'll need to copy the Client ID and the Client Secret into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.
4. Run the Spotify side of things
cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
Now, if that works, leave it running and go to http://localhost:8888 and authorise your Spotify account there
Then copy the full token, including the overflow; use inspect element to get it
Adjust the user_id and auth variables, as well as the path to output_songs.txt (at with open), in the following Python script and run it. Songs which are not found will be printed out at the end; give them a search on Google. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).
import requests
import re
import json

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from spotify; it can time out, then you have to get a new
# one, so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}

playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()

# this queries spotify for every song and appends the track's spotify uri to an
# array; each scraped line looks something like Artist, 'Title', so the regex
# grabs the quoted title and the slice grabs the artist part in front of it
def query_song_uris():
    for n, entry in enumerate(songs):
        x = re.findall(r"'([^']*)'", entry)
        title_len = len(entry) - len(x[0]) - 4
        title = x[0]
        artist = entry[:title_len]
        payload = {
            'q': (entry),
            'track:': (title),
            'artist:': (artist),
            'type': 'track',
            'limit': 1
        }
        url = base_url + search_method
        try:
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            c = r.content.decode('UTF-8')
            dic = json.loads(c)
            track_uri = dic["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except:
            err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
            err_log.append(err)
    playlist.reverse()

query_song_uris()

# creates a playlist and returns the playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearences. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    c = r.content.decode('UTF-8')
    dic = json.loads(c)
    print(f'\n\ncreating playlist #{dic["id"]}; ', r)
    return dic["id"]

# adds the collected uris to the playlist in chunks of 100 (the api limit per request)
def add_to_playlist():
    playlist_id = create_playlist()
    while True:
        if len(playlist) > 100:
            p = playlist[:100]
        else:
            p = playlist
        payload = {"uris": (p)}
        url = base_url + f'/playlists/{playlist_id}/tracks'
        r = requests.post(url, headers=auth, json=payload)
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[:len(p)]
        if len(playlist) == 0:
            break

add_to_playlist()

print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
Done
If you don't want to run the code yourself, here's the playlist:
https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS
If you have trouble, everything from step 2 onwards is also described in the Web API quick start or, in general, in the Web API docs.
Regarding Apple Music
So Apple seems very closed up (surprise, haha). What I found, though, is that you can query the iTunes Store; the response also contains a direct link to the song(s) on Apple Music.
You might be able to go from there.
Get ISRC code from iTunes Search API (Apple music)
PS: undeniably regex is witchcraft, but y'all here got my back

Scraping - find the last 5 scores of each match

I would like your help to get the last 5 scores; I can't get them, please help me.
from selenium import webdriver
import pandas as pd
from pandas import ExcelWriter
from openpyxl.workbook import Workbook
import time as t
import xlsxwriter

pd.set_option('display.max_rows', 5, 'display.max_columns', None, 'display.width', None)
browser = webdriver.Firefox()
browser.get('https://www.mismarcadores.com/futbol/espana/laliga/resultados/')
print("Current Page Title is : %s" % browser.title)
aux_ids = browser.find_elements_by_css_selector('.event__match.event__match--static.event__match--oneLine')
ids = []
i = 0
for aux in aux_ids:
    if i < 1:
        ids.append(aux.get_attribute('id'))
        i += 1
data = []
for idt in ids:
    id_clean = idt.split('_')[-1]
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[1])
    browser.get(f'https://www.mismarcadores.com/partido/{id_clean}/#h2h;overall')
    t.sleep(5)
    p_ids = browser.find_elements_by_css_selector('h2h-wrapper')
    # here the code for the last 5 scores of each match
I believe you can use your Firefox browser, but I have not tested with it. I use Chrome, so if you want to use chromedriver, check the version of your browser and download the right one; also add it to your system path. The only thing with this approach is that it opens a browser window until the page is loaded (because we are waiting for the JavaScript to generate the matches data). If you need anything else, let me know. Good luck!
https://chromedriver.chromium.org/downloads
Known issues: sometimes it will throw an index out of range error when retrieving matches data. This is something I am looking into, because it looks like the xpath on each link sometimes changes a little.
from selenium import webdriver
from lxml import html
from lxml.html import HtmlElement


def test():
    # Here we specify the urls, for testing purposes
    urls = ['https://www.mismarcadores.com/partido/noIPZ3Lj/#h2h;overall']
    # a loop to go over all the urls
    for url in urls:
        # We print the string formatted with the url we are currently checking, together with
        # the result of get_last_5(url), where url is the current url in the for loop.
        print("Scores after this match {u}".format(u=url), get_last_5(url))


def get_last_5(url):
    print("processing {u}, please wait...".format(u=url))
    # here we get an instance of the webdriver
    browser = webdriver.Chrome()
    # now we pass the url we want to get
    browser.get(url)
    # in this variable we "store" the html data as a string. We get it from here because we
    # need to wait for the page to load and execute its javascript code in order to generate
    # the matches data.
    innerHTML = browser.execute_script("return document.body.innerHTML")
    # Now we assign this to a variable of type HtmlElement
    tree: HtmlElement = html.fromstring(innerHTML)
    # The following variables: first_team, second_team, match_date and rows are obtained via
    # the xpath() method. To get an xpath, open one of the urls in Chrome to check the DOM,
    # right click the element -> Inspect -> the element appears selected in the inspect
    # panel -> right click it -> Copy -> Copy XPath. first_team, second_team and match_date
    # are obtained from the "title" section; rows are obtained from the table of last matches
    # in the tbody content.
    # xpath() returns a list of HtmlElement, because it finds all the elements that match our
    # xpath; that is why we use [0] (the first element of the list). This gives us access to
    # an HtmlElement object, so now we can access its text attribute.
    first_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[1]/div[2]/div/div/a')[0].text
    print(type(first_team))
    second_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[3]/div[2]/div/div/a')[0].text
    # [0:8] slices the string, because the title also contains the time of the match, e.g.
    # (10.08.2020 13:00). To compare against each row we only need (10.08.20), so we take 8
    # characters starting from position 0 ([0:8]).
    match_date = tree.xpath('//*[@id="utime"]')[0].text[0:8]
    # Getting the first element with [0] gives us an HtmlElement object (the "table" holding
    # all the matches data), and we want all of its children, i.e. all the "rows" inside it.
    # getchildren() also returns a list of HtmlElement objects. Here we additionally slice the
    # list with [:-1], because the last element inside the "table" is the button
    # "Mostar mas partidos", which we want to drop.
    rows = tree.xpath('//*[@id="tab-h2h-overall"]/div[1]/table/tbody')[0].getchildren()[:-1]
    # we quit the browser since we do not need it anymore; we could do it right after
    # assigning innerHTML, but there is no harm doing it here, unless you wish to close it
    # before doing all this assignment of variables.
    browser.quit()
    # match_position will be the position of the match we currently have in the title.
    match_position = None
    # Now we iterate over the rows and find the match. range(len(rows)) just gives us the row
    # indices so we know when to stop iterating.
    for i in range(len(rows)):
        # We call is_match with first_team, second_team, match_date and the current row,
        # rows[i]. If it returns True we have found the match position and assign (i + 1) to
        # match_position; i + 1 because we iterate from 0.
        if is_match(first_team, second_team, match_date, rows[i]):
            match_position = i + 1
            # we stop the for loop, no need to go further once we find it.
            break
    # Since we only want the scores of the following 5 matches, we check whether there are 5
    # rows beneath our match. If match_position + 5 is less than the number of rows we can
    # take all 5; if not, we only get the rows beneath it (maybe 0, 1, 2, 3 or 4 rows).
    if (match_position + 5) < len(rows):
        # We slice the list twice: [match_position:] drops all the rows before the match
        # position, and [:5] takes the first 5 of what remains [start:stop]. Slicing returns
        # a new list, so we have to assign it to a variable; here we reuse rows, but you can
        # assign it to a new one if you wish.
        rows = rows[match_position:][:5]
    else:
        # we do not have enough rows, so just take the rows beneath our position.
        rows = rows[match_position:len(rows)]
    # Now we collect the scores, using a list here and a plain for loop. Before that, you need
    # to know that each row (<tr> element) has 6 td elements inside it, and the 5th one holds
    # the score of the match; inside that "score element" there is a span and then a strong
    # element, something like:
    # <tr>
    #   <td></td>
    #   <td></td>
    #   <td></td>
    #   <td></td>
    #   <td><span><strong>1:2</strong></span></td>
    #   <td></td>
    # </tr>
    # That said, since each row is an HtmlElement object, we can loop as follows:
    scores = []
    for row in rows:
        data = row.getchildren()[4].getchildren()[0].text_content()
        # not the best way, but we take all the text content of the span element. If the
        # string has more than 5 characters, i.e. "1 : 2(0 : 1)" instead of "1 : 2", we slice
        # it: from the 6th character from the right (the whitespace before the 2 in our
        # example) up to -1 (just before the closing ')').
        score = data if len(data) == 5 else data[-6:-1]
        scores.append(score)
    print("finished processing {u}.".format(u=url))
    # now we return the scores
    return scores


def is_match(t1, t2, match_date, row):
    # For each row we compare t1, t2 and match_date (obtained from the title) with the row's
    # team1, team2 and date. Each row has 6 elements inside it; the date is in position 0,
    # team1 in 2, team2 in 3. Please read all the code in get_last_5 before reading this
    # explanation.
    # <td><span>10.03.20</span></td>
    date = row.getchildren()[0].getchildren()[0].text
    # <td><span>TeamName</span></td> (when the team lost) or
    # <td><span><strong>TeamName</strong></span></td> (when the team won)
    team1element = row.getchildren()[2].getchildren()[0]  # this is the span element
    # using a ternary expression (value_if_true if condition else value_if_false)
    # https://book.pythontips.com/en/latest/ternary_operators.html
    # if the span element has children (len(getchildren()) > 0) then the team name is
    # team1element.getchildren()[0].text, i.e. the text of the strong element; if not, just
    # take the text of the span element.
    mt1 = team1element.getchildren()[0].text if len(team1element.getchildren()) > 0 else team1element.text
    # repeat the same for team 2
    team2element = row.getchildren()[3].getchildren()[0]
    mt2 = team2element.getchildren()[0].text if len(team2element.getchildren()) > 0 else team2element.text
    # we could compare only the date, but just to be sure we compare the names as well. If the
    # dates and the names are the same, this is our match row.
    if match_date == date and t1 == mt1 and t2 == mt2:
        # we found it, so return True
        return True
    # otherwise return False
    return False
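To actually run it, add a call at the bottom of the file; note it opens a real Chrome window per url, so chromedriver has to be on your PATH:

if __name__ == "__main__":
    test()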

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically, I want the "Gold Price per Ounce" and the "Spot Change" percentage two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use an HTML element id to scrape effectively, but inspecting the source for this page, it's just a table. Specifically, I want the second and fourth <td> elements that appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target + 2])  # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target + 5].replace('\\n', ''))
        print(parser.gold[target + 9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
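The fixed offsets are brittle if the page layout shifts. A slightly more structural variant, still standard library only, is to track when the parser is inside a <td> and collect cell texts, then pick the cells you need. This is a sketch under the assumption (taken from the question) that the gold row comes first, so its price is the second <td> and the spot-change percent the fourth:

import urllib.request
from html.parser import HTMLParser

class TdParser(HTMLParser):
    # collects the text of every <td> on the page, in document order
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
parser = TdParser()
parser.feed(urllib.request.urlopen(URL).read().decode('utf-8'))
# cells[1] is the second <td> (price), cells[3] the fourth (spot change %),
# assuming the gold row is the first row of the first table on the page
print(parser.cells[1], parser.cells[3])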

How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: one is the API, and the other is a database dump.
I'd prefer not to download the dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even then it would need > 4 million requests, which would probably get me blocked from any further requests anyway.
So my question is:
Is there a way to obtain only the titles of Wikipedia articles via the API?
Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
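For illustration, here is a minimal sketch of paging through the allpages module with the requests library; the parameters are the documented MediaWiki API ones, and apfilterredir is included so redirects are skipped:

import requests

URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "list": "allpages",
    "aplimit": "max",                  # up to 500 titles per request
    "apfilterredir": "nonredirects",   # articles only, no redirects
}

session = requests.Session()
while True:
    data = session.get(URL, params=params).json()
    for page in data["query"]["allpages"]:
        print(page["title"])
    # the response carries a 'continue' block until the enumeration is done
    if "continue" not in data:
        break
    params["apcontinue"] = data["continue"]["apcontinue"]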
Right now, as per the current statistics, the number of articles is around 5.8M.
To get the list of pages I used the AllPages API. However, the number of pages I get is around 14.5M, which is ~3 times what I was expecting. I restricted myself to namespace 0 to get the list. The following is the sample code that I am using:
# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

while len(pagelist['query']['allpages']) > 0:
    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)
    pagesInQuery = pagelist['query']['allpages']
    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr
    numQueries += 1
    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries

print numQueries
listOfPagesFile.close()
The number of queries fired is around 28900, which results in approx. 14.5M page names.
I also tried the all-titles link mentioned in the above answer. In that case as well, I am getting around 14.5M pages.
I thought that this overestimate of the actual number of pages was because of the redirects, so I added the 'nonredirects' option to the request object:
requestObj['apfilterredir'] = 'nonredirects'
After doing that I get only 112340 pages, which is far too small compared to 5.8M.
With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
Is there any other option that I should be trying to get the actual (~5.8M) set of page names?
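A likely culprit (this is my reading, and it matches what the answer below does): the loop stops as soon as one response contains an empty allpages list, but with apfilterredir='nonredirects' the API can return short or even empty batches while there are still pages left to enumerate, so the loop bails out long before the end. The robust stop condition is the absence of the continue block. A sketch of the adjusted loop, reusing the names from the code above:

pagelist = wiki.call(requestObj)
numQueries = 1
while True:
    for eachPage in pagelist['query']['allpages']:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        listOfPagesFile.write(str(pageId) + "; " + title + "\n")
    numQueries += 1
    # an empty batch is not a reliable end signal when redirects are being
    # filtered out; only a missing 'continue' block marks the real end
    if 'continue' not in pagelist:
        break
    requestObj['apcontinue'] = pagelist['continue']['apcontinue']
    pagelist = wiki.call(requestObj)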
Here is an asynchronous program that will generate MediaWiki page titles; note that it stops only when the response no longer carries a continue block, rather than when a batch comes back empty:
async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }
    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)
        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue

Issue outputting data scraped with Beautiful Soup into two columns of a csv using spamwriter.writerow

I am scraping 2 sets of data from a website using Beautiful Soup, and I want them to be output in a csv file in 2 columns, side by side. I am using the spamwriter.writerow([x, y]) call for this, but I think because of some error in my loop structure I am getting the wrong output in my csv file. Below is the referred code:
import csv
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_20decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for anchor in soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True):
        if anchor.string:
            print unicode(anchor.string).encode('utf8').strip()
    for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
        textcontent = u' '.join(anchor1.stripped_strings)
        if textcontent:
            print textcontent
            spamwriter.writerow([unicode(anchor.string).encode('utf8').strip(), textcontent])
The output I am getting in the csv is:
Samsung Focus® 2 (Refurbished) $99.99
Samsung Focus® 2 (Refurbished) $99.99 to $199.99 8 to 16 GB
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $149.99 to $349.99 16 to 64 GB
The problem is I am getting only 1 device name in column 1 instead of all of them, while the price is coming through for all devices.
Please pardon my ignorance as I am new to programming.
You are using anchor.string instead of anchor1. anchor is the last item from the previous loop, not the item in the current loop.
Perhaps using clearer variable names would help avoid confusion here; use singleitem and gridprice perhaps?
It could be I misunderstood though and you want to combine each anchor1 with a corresponding anchor. You'll have to loop over them together, perhaps using zip():
items = soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})
for item, price in zip(items, prices):
    textcontent = u' '.join(price.stripped_strings)
    if textcontent:
        print textcontent
        spamwriter.writerow([unicode(item.string).encode('utf8').strip(), textcontent])
Normally it should be easier to loop over the parent table row instead, then find the cells within that row within a loop. But the zip() should work too, provided the clickStreamSingleItem cells line up with the listGrid-price matches.
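A sketch of that row-based variant might look like the following; the listGrid-item container class is my guess at the page's structure (inspect the actual grid to confirm it), while the rest reuses names from the question:

for row in soup.findAll('div', {"class": "listGrid-item"}):  # hypothetical container class
    item = row.find('a', {"class": "clickStreamSingleItem"})
    price = row.find('div', {"class": "listGrid-price"})
    if item and price:
        textcontent = u' '.join(price.stripped_strings)
        spamwriter.writerow([unicode(item.string).encode('utf8').strip(), textcontent])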