The srcset= strings on the website are https links, but when I use selenium's driver.find('img')['srcset'], it gives me base64 strings - html

I'm trying to scrape images from a website. In the website's HTML, the srcset attributes exist and are of the form
srcset="https://...."
For example,
srcset="https://secure.img1-fg.wfcdn.com/im/80458162/resize-h300-w300%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 300w,https://secure.img1-fg.wfcdn.com/im/19496430/resize-h400-w400%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 400w,https://secure.img1-fg.wfcdn.com/im/75516274/resize-h500-w500%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 500w"
However, when I try to get these srcset links using Selenium and Beautiful Soup, I get the following:
"data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
Moreover, every time the srcset lookup fails to return a valid link, the string it gets is always
"data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
I tried a bunch of different lines of code, but haven't had success with any of them. Here is the full code I currently have:
def get_info_from_product_link(product_link): #get the price and correctly filtered image link
    info = dict()
    driver = webdriver.Chrome('C:/Users/Brian/Downloads/chromedriver_win32/chromedriver.exe')
    driver.implicitly_wait(200)
    try:
        driver.get(product_link)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        time.sleep(60)
        image_carousel = soup.find_all('li', {"class" : "ProductDetailImageCarouselVariantB-carouselItem"})
        print("Number of images in gallery: ", len(image_carousel))
        #deal with captcha
        while len(image_carousel) <= 0:
            print("CAPTCHA ENCOUNTERED. FIX")
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            image_carousel = soup.find_all('li', {"class" : "ProductDetailImageCarouselVariantB-carouselItem"})
            time.sleep(30)
        valid_image_links = []
        highest_resolution_images = []
        #get correct image links
        #i = 1
        for image_block in image_carousel:
            try:
                #print("image_block:")
                #print(image_block)
                #print("Image: ", i)
                #i += 1
                images = image_block.find('div', {"class" : "ImageComponent ImageComponent--overlay"})
                #image_links = images.find('img').get_attribute('srcset').split(',')
                print(images)
                #driver.implicitly_wait(60)
                #wait = WebDriverWait(images, 30)
                #image_links = wait.until(EC.visibility_of_element_located((By.tagName, "img"))).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.text_to_be_present_in_element_value((By.tagName, 'img'), "https")).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.text_to_be_present_in_element_value((By.CSS_SELECTOR, "img [srcset*='https']"), "https")).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "img[src*='https']"))).get_attribute("src").split(',')
                images.implicitly_wait(30)
                image_links = images.find_element_by_tag_name('img').get_attribute('srcset').split(',')
                #"div[class='ajax_enabled'] [style='display:block']"
                #image_links = images.find('img')['srcset'].split(',')
                #print("Image links:")
                #print(image_links)
                #print("Number of links: ", len(image_links))
                for link in image_links:
                    print(link)
                for image_info in image_links:
                    image_link = image_info.split(" ")[0]
                    try:
                        if hasValidBackground(image_link) and hasValidSize(image_link):
                            valid_image_links.append(image_link)
                        else:
                            print("Invalid image size or background")
                    except:
                        print('ERROR when reading image: ' + image_link)
                        continue
                if len(valid_image_links) > 0:
                    highest_resolution_images.append(valid_image_links[-1])
                valid_image_links.clear()
            except:
                print("Error. Invalid image link.")
                pass
        #extract one link to a correctly filtered image
        if len(highest_resolution_images) <= 0:
            return -1
        valid_image_link = highest_resolution_images[0]
        info['img_url'] = valid_image_link
        #get price information
        standard_price_block = soup.find('div', {"class" : "StandardPriceBlock"})
        base_price_block = standard_price_block.find('div', {"class" : "BasePriceBlock BasePriceBlock--highlight"})
        if base_price_block is None:
            base_price_block = standard_price_block.find('div', {"class" : "BasePriceBlock"})
        base_price = base_price_block.find('span').text
        #price_block = soup.find('span', {"class" : "notranslate"})
        #base_price = standard_price_block.find('span').text
        info['price'] = base_price
        print(base_price)
        #print(f"Image link: {image_link}\n")
        #print(f"Link to product: {product_link}\n")
        driver.close()
        #browser.switch_to.window(browser.window_handles[0])
        return info
    except TimeoutException as e:
        print("Page Load Timeout Occurred. Quitting...")
        driver.close()
I was testing using this website:
https://www.wayfair.com/furniture/pdp/foundstone-derry-84-reversible-sectional-w001832490.html
My goal is to process each image in the image gallery/carousel and find one that has a white background and a valid size (height >= 80 and width >= 80).
I'm just starting to learn web scraping, so any help would be much appreciated!!
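The data:image/gif;base64,R0lGOD... value is a 1x1 transparent GIF, which usually means the page lazy-loads its images and driver.page_source is captured before the real srcset values are injected. Below is a minimal sketch of one way to work around that, assuming the carousel class name from the code above and that waiting is enough to trigger the loader (the srcset_is_loaded helper and the 30-second timeout are illustrative, not part of the original code):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

product_link = 'https://www.wayfair.com/furniture/pdp/foundstone-derry-84-reversible-sectional-w001832490.html'
driver = webdriver.Chrome('C:/Users/Brian/Downloads/chromedriver_win32/chromedriver.exe')
driver.get(product_link)

def srcset_is_loaded(drv):
    # True once at least one carousel <img> carries a real https srcset
    imgs = drv.find_elements(By.CSS_SELECTOR,
                             "li.ProductDetailImageCarouselVariantB-carouselItem img")
    return any("https" in (img.get_attribute("srcset") or "") for img in imgs)

WebDriverWait(driver, 30).until(srcset_is_loaded)

# only hand the rendered page to BeautifulSoup once the real links are present
soup = BeautifulSoup(driver.page_source, 'html.parser')

If the images are only swapped in as they scroll into view, you may also need to scroll the carousel first (for example with driver.execute_script('arguments[0].scrollIntoView();', element)) before the wait can succeed.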

Related

Formatting json text in discord.py bot

@client.command()
async def show(ctx, player, *args): # General stats
    rs = requests.get(apiLink + "/checkban?name=" + str(player))
    if rs.status_code == 200: # HTTP OK
        rs = rs.json()
        joined_array = ','.join({str(rs["otherNames"]['usedNames'])})
        embed = discord.Embed(title="Other users for" + str(player),
                              description="""User is known as:
""" + joined_array)
        await ctx.send(embed=embed)
My goal here is to have every username on a different line after each comma, and preferably without the [] at the start and end. I have tried adding
joined_array = ','.join({str(rs["otherNames"]['usedNames'])})
but the response from the bot is the same as shown in the image.
Any answer or tip/suggestion is appreciated!
Try this:
array = ['user1', 'user2', 'user3', 'user4', 'user5', 'user6'] #your list
new = ",\n".join(array)
print(new)
Output:
user1,
user2,
user3,
user4,
user5,
user6
In your case I think array should be replaced with rs["otherNames"]['usedNames']
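Applied to the command from the question, that might look roughly like this (a sketch that assumes rs["otherNames"]['usedNames'] is a plain list of strings, and reuses client and apiLink from the original code):

@client.command()
async def show(ctx, player, *args): # General stats
    rs = requests.get(apiLink + "/checkban?name=" + str(player))
    if rs.status_code == 200: # HTTP OK
        rs = rs.json()
        # join the list itself, one name per line, instead of str()-ing the whole list
        joined_names = ",\n".join(rs["otherNames"]['usedNames'])
        embed = discord.Embed(title="Other users for " + str(player),
                              description="User is known as:\n" + joined_names)
        await ctx.send(embed=embed)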

Collecting Tweets using max_id is not working as expected

I am currently doing a tweet search using the Twitter API. However, using the tweet id is not working for me.
Here is my code:
searchQuery = '#BLM' # this is what we're searching for
searchQuery = searchQuery + "-filter:retweets"
Geocode = "39.8, -95.583068847656, 2500km"
maxTweets = 1000000 # Some arbitrary large number
tweetsPerQry = 100 # this is the max the API permits
fName = 'tweetsBLM.json' # We'll store the tweets in a json file.
sinceId = None
#max_id = -1 # initial search
max_id = 1278836959926980609 # the last id of previous search
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) + '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break
print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
This code works perfectly fine. I initially ran it and got about 40,000 tweets. Then I took the id of the last tweet of the previous/initial search to go back in time. However, I was disappointed to see that there were no tweets anymore. I cannot believe that for a second. I must be going wrong somewhere, because #BLM has been very active in the last 2-3 months.
Any help is very welcome. Thank you.
I may have found the answer. Using the Twitter API, it is not possible to get older tweets (7 days old or more), and using max_id to get around this is not possible either.
The only way is to stream and wait for more than 7 days.
Finally, there is also this package that looks for older tweets:
https://pypi.org/project/GetOldTweets3/ (an extension of Jefferson Henrique's original work)
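For reference, a minimal sketch of a search with that package (the query, date window, and tweet cap here are illustrative, not taken from the question):

import GetOldTweets3 as got

# build the search criteria: hashtag, date window, and a cap on results
criteria = (got.manager.TweetCriteria()
            .setQuerySearch('#BLM')
            .setSince('2020-05-01')
            .setUntil('2020-07-01')
            .setMaxTweets(100))

# fetch and print the matching tweets
for tweet in got.manager.TweetManager.getTweets(criteria):
    print(tweet.date, tweet.text)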

How to get location from Twitter user profile using Beautiful Soup4?

So, I am trying to get the location text in the profile of a given Twitter account:
handles = ['IndieWire' , 'AFP', 'UN']
for x in handles:
    url = "https://twitter.com/" + x
    try:
        html = req.get(url)
    except Exception as e:
        print(f"Failed to fetch page for url {url} due to: {e}")
        continue
    soup = BeautifulSoup(html.text, 'html.parser')
    try:
        label = soup.find('span', {'class': "ProfileHeaderCard-locationText"})
        label_formatted = label.string.lstrip()
        label_formatted = label_formatted.rstrip()
        if label_formatted != "":
            location_list.append(label_formatted)
            print(x + ' : ' + label_formatted)
        else:
            location_list.append(label_formatted)
            print(x + ' : ' + 'Not found')
    except AttributeError:
        try:
            label2 = soup.findAll('span', {"class": "ProfileHeaderCard-locationText"})[0].get_text()
            label2 = str(label2)
            label2_formatted = label2.lstrip()
            label2_formatted = label2_formatted.rstrip()
            location_list.append(label_formatted)
            print(x + ' : ' + label2_formatted)
        except:
            print(x + ' : ' + 'Not found')
    except:
        print(x + ' : ' + 'Not found')
This code used to work when I used it a few months ago. I changed it a little bit now after checking the Twitter page source, but I still can't get the locations. Hope you can help.
Use the mobile version of Twitter to get the location.
For example:
import requests
from bs4 import BeautifulSoup

handles = ['IndieWire' , 'AFP', 'UN']

ref = 'https://twitter.com/{h}'
headers = {'Referer': '',}
url = 'https://mobile.twitter.com/i/nojs_router?path=/{h}'

for h in handles:
    headers['Referer'] = ref.format(h=h)
    soup = BeautifulSoup( requests.post(url.format(h=h), headers=headers).content, 'html.parser' )
    loc = soup.select_one('.location')
    if loc:
        print(h, loc.text)
    else:
        print(h, 'Not Found')
Prints:
IndieWire New York, NY
AFP France
UN New York, NY

View MySQL results in a TreeWidget - PyQt4

I am trying to view a simple MySQL SELECT in a TreeWidget from PyQt4. I am able to reach my current database and print the results, but I have no idea how to get them into a TreeWidget.
I found some documentation and tutorials, but none of them were really helpful. I hope some of you can help me.
Here is my code:
def SqlConnectionTest(self):
    try:
        cnn = mysql.connector.connect(
            user = 'root',
            host = 'localhost',
            database = 'employeedb')
        self.ui.textBrowser_exceptionDisplay.setText("Connection works!")
        cursor = cnn.cursor()
        cursor.execute("SELECT * FROM employeeinfo_table;")
        results = cursor.fetchall()
        for row in results:
            print(row)
            #self.ui.treeWidget_EmployeeList.addTopLevelItems()
        if cnn:
            cnn.close()
    except mysql.connector.Error as e:
        if e.errno == errorcode.ER_ACCESS_DENIED_ERROR:
            self.ui.textBrowser_exceptionDisplay.setText("Error. Check your username or password!")
        elif e.errno == errorcode.ER_BAD_DB_ERROR:
            self.ui.textBrowser_exceptionDisplay.setText("Database is not available or does not exist!")
        else:
            self.ui.textBrowser_exceptionDisplay.setText(e)
At the moment I am not adding any items to the treeWidget, because I don't know how except this way. This was just testing. It is not looking good, I know (Mitarbeiter (DE) = employee (EN)):
item = QtGui.QTreeWidgetItem([emp1.displayEmployeeName()])
child = QtGui.QTreeWidgetItem(["Details"])
child1 = QtGui.QTreeWidgetItem([emp1.displayFullEmployeeInfo()])
child2 = QtGui.QTreeWidgetItem(["Mitarbeiter"])
child3 = QtGui.QTreeWidgetItem([emp2.displayEmployeeName()])
child4 = QtGui.QTreeWidgetItem(["Details"])
child5 = QtGui.QTreeWidgetItem([emp2.displayFullEmployeeInfo()])
child6 = QtGui.QTreeWidgetItem([emp3.displayEmployeeName()])
child7 = QtGui.QTreeWidgetItem(["Details"])
child8 = QtGui.QTreeWidgetItem([emp3.displayFullEmployeeInfo()])
child9 = QtGui.QTreeWidgetItem(["Mitarbeiter"])
child10 = QtGui.QTreeWidgetItem([emp4.displayEmployeeName()])
child11 = QtGui.QTreeWidgetItem(["Details"])
child12 = QtGui.QTreeWidgetItem([emp4.displayFullEmployeeInfo()])
item.addChild(child)
child.addChild(child1)
child.addChild(child2)
child2.addChild(child3)
child3.addChild(child4)
child4.addChild(child5)
child2.addChild(child6)
child6.addChild(child7)
child7.addChild(child8)
child7.addChild(child9)
child9.addChild(child10)
child10.addChild(child11)
child11.addChild(child12)
self.ui.treeWidget_EmployeeList.addTopLevelItem(item)
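A minimal sketch of how the rows returned by the query could be pushed into the tree instead of the hard-coded emp objects (assuming, purely for illustration, that the first column of each row is the employee name and the remaining columns are the details; it reuses results and self.ui.treeWidget_EmployeeList from the code above):

for row in results:
    # first column as the top-level label, remaining columns under a "Details" node
    item = QtGui.QTreeWidgetItem([str(row[0])])
    details = QtGui.QTreeWidgetItem(["Details"])
    details.addChild(QtGui.QTreeWidgetItem([", ".join(str(col) for col in row[1:])]))
    item.addChild(details)
    self.ui.treeWidget_EmployeeList.addTopLevelItem(item)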

GitHub Pages mangling syntax highlighting after upgrade to Jekyll 3

I use GitHub Pages for my personal website. They're upgrading from Jekyll 2 to Jekyll 3 and sending deprecation warnings. I complied with their warnings and switched from redcarpet to kramdown and from pygments to rouge. When I build locally (with bundle exec jekyll serve) everything works. But when I push the changes, the syntax highlighting gets mangled wherever I have linenos in my code blocks.
This is the code block:
{% highlight python linenos %}
'''
scrape lyrics from vagalume.com.br
(author: thiagomarzagao.com)
'''
import json
import time
import pickle
import requests
from bs4 import BeautifulSoup

# get each genre's URL
basepath = 'http://www.vagalume.com.br'
r = requests.get(basepath + '/browse/style/')
soup = BeautifulSoup(r.text)
genres = [u'Rock',
          u'Ax\u00E9',
          u'Forr\u00F3',
          u'Pagode',
          u'Samba',
          u'Sertanejo',
          u'MPB',
          u'Rap']
genre_urls = {}
for genre in genres:
    genre_urls[genre] = soup.find('a', class_ = 'eA', text = genre).get('href')

# get each artist's URL, per genre
artist_urls = {e: [] for e in genres}
for genre in genres:
    r = requests.get(basepath + genre_urls[genre])
    soup = BeautifulSoup(r.text)
    counter = 0
    for artist in soup.find_all('a', class_ = 'top'):
        counter += 1
        print 'artist {} \r'.format(counter)
        artist_urls[genre].append(basepath + artist.get('href'))
        time.sleep(2) # don't reduce the 2-second wait (here or below) or you get errors

# get each lyrics, per genre
api = 'http://api.vagalume.com.br/search.php?musid='
genre_lyrics = {e: {} for e in genres}
for genre in artist_urls:
    print len(artist_urls[genre])
    counter = 0
    artist1 = None
    for url in artist_urls[genre]:
        success = False
        while not success: # foor loop in case your connection flickers
            try:
                r = requests.get(url)
                success = True
            except:
                time.sleep(2)
        soup = BeautifulSoup(r.text)
        hrefs = soup.find_all('a')
        for href in hrefs:
            if href.has_attr('data-song'):
                song_id = href['data-song']
                print song_id
                time.sleep(2)
                success = False
                while not success:
                    try:
                        song_metadata = requests.get(api + song_id).json()
                        success = True
                    except:
                        time.sleep(2)
                if 'mus' in song_metadata:
                    if 'lang' in song_metadata['mus'][0]: # discard if no language info
                        language = song_metadata['mus'][0]['lang']
                        if language == 1: # discard if language != Portuguese
                            if 'text' in song_metadata['mus'][0]: # discard if no lyrics
                                artist2 = song_metadata['art']['name']
                                if artist2 != artist1:
                                    if counter > 0:
                                        print artist1.encode('utf-8') # change as needed
                                        genre_lyrics[genre][artist1] = artist_lyrics
                                    artist1 = artist2
                                    artist_lyrics = []
                                lyrics = song_metadata['mus'][0]['text']
                                artist_lyrics.append(lyrics)
                                counter += 1
                                print 'lyrics {} \r'.format(counter)
    # serialize
    with open(genre + '.json', mode = 'wb') as fbuffer:
        json.dump(genre_lyrics[genre], fbuffer)
{% endhighlight %}
This is what I see locally:
This is what I see on Github Pages:
(Without linenos the syntax highlighting works fine.)
What could be happening?
I think I got it!
Your code block seems to be fine. No problem there.
Make sure you have added this into your _config.yml:
highlighter: rouge
markdown: kramdown
kramdown:
  input: GFM
Probably what you're missing is kramdown input: GFM, isn't it?
Well, I tested locally and it worked fine. When uploaded to GitHub, it worked fine as well. It should work for you too.
Let me know how it goes, ok? :)
UPDATE!
Add this to your stylesheet and check how it goes:
.lineno { width: 35px; }
Looks like it's something about your CSS styles that is breaking the layout! Keep tweaking your CSS and you're gonna be fine!