I was following this past question (Extracting image src based on attribute with BeautifulSoup) to try to extract all the images from a google images page. I was getting a "urllib2.HTTPError: HTTP Error 403: Forbidden" error but was able to get past it using this:
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
However, I then got a new error that seems to be telling me that the src attribute does not exist:
Traceback (most recent call last):
File "Desktop/webscrapev2.py", line 13, in <module>
print(tag['src'])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 958, in __getitem__
return self.attrs[key]
KeyError: 'src'
I was able to get over that error by checking specifically for the 'src' attribute, but most of my images, when extracted, don't have the src attribute. It seems like Google is doing something to obscure my ability to extract even a few images (I know requests are limited, but I thought it was at least 10).
For example printing out the variable tag (see code below) gives me this:
<img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f" jsaction="load:str.tbn" name="jxlMHbZd-duNgM:" onload="google.aft&&google.aft(this)"/>
But printing out the variable v gives 'None'. I have no idea why this is happening nor how to get the actual image from what it's returning. Does anyone know how to get the actual images? I'm especially worried since the data-src URL starts with encrypted... Should I query data-src to get the image instead of src? Any assistance or advice would be super appreciated!
Here is my full code (in Python):
from bs4 import BeautifulSoup
import urllib2
url = "https://www.google.com/search? q=baseball+pitcher&espv=2&biw=980&bih=627&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5h8-9lfjLAhUE7mMKHdgKD0YQ_AUIBigB"
#'http://www.imdb.com/title/tt%s/' % (id,)
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
soup = BeautifulSoup(urllib2.urlopen(req).read(), "lxml")
print "before FOR"
for tag in soup.findAll('img'):
    print "inside FOR"
    v = tag.get('src', tag.get('dfr-src'))  # gets "src", else "dfr-src"; if both are missing - None
    print v
    print tag
    if v is None:
        print("v is NONE")
        continue
    print(tag['src'])
Oh, boy. You picked the wrong site to scrape from. :)
Google's Defenses
First off, Google is (obviously) Google. It knows web crawlers and web scrapers very well - its entire business is founded on them.
So it knows all of the tricks that ordinary people get up to, and, more importantly, it has a strong mandate to make sure nobody except end users gets their hands on its images.
Didn't pass a User-Agent header? Now Google knows you're a scraper bot that doesn't bother pretending to be a browser, and forbids you from accessing its content. That's why you got a 403: Forbidden error the first time - the server realised you were a bot and prevented you from accessing material. It's the simplest technique to block automated bots.
Google Builds Pages through Javascript
Don't have Javascript parsing capability (which Python's requests, urllib and their ilk don't)? Then you can't view half of the images, because of the way Google Image search results work: if you inspect the Network tab in your Chrome console as Google Images loads, you'll see a few bundled requests go out to various content providers, which then systematically add a src attribute to each placeholder img tag through inline obfuscated Javascript code.
At the very beginning of time, all of your images are essentially blank, with just a custom data-src attribute to coordinate activities. Requests are made to image source providers as soon as the browser begins to parse Javascript (because Google probably makes use of its own CDN, these images are transferred to your computer very quickly), and then page Javascript does the arduous task of chunking the received data, identifying which img placeholder it should go to and then updating src appropriately. These are all time-intensive operations, and I won't even pretend to know how Google can make them happen so fast (although note that messing with network throttling operations in Dev Tools on Chrome 48 can cause Google Images to hang, for some bizarre reason, so there's probably some major network-level code-fu going on over there).
These image source providers appear to begin with https://encrypted..., which doesn't seem to be something to worry about - it probably just means that Google applies a custom encryption scheme to the data as it's being sent over the network on top of HTTPS, which is then decoded by the browser. Google practises end-to-end encryption beyond just HTTPS - I believe every layer of the stack works only with encrypted data, with encryption and decryption happening only at the entry and final points - and I wouldn't be surprised to see the same technology behind, for example, Google Accounts.
(Note: all of the above comes from poking around in Chrome Dev Tools for a bit and spending time with de-obfuscators. I am not affiliated with Google, and my understanding is most likely incomplete or even woefully wrong.)
Without a bundled Javascript interpreter, it is safe to say that Google Images is effectively a blank wall.
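That said, the static HTML is not completely empty. As the tag in the question shows, each placeholder carries a data-src attribute pointing at a gstatic thumbnail, and those URLs can be fetched directly. Here's a minimal sketch (Python 3 with requests and BeautifulSoup; whether Google still serves these placeholders to non-JS clients is an assumption on my part):

import requests
from bs4 import BeautifulSoup

# Spoof a browser User-Agent, as in the question, to avoid the 403.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 "
                         "(KHTML, like Gecko) Chromium/12.0.742.112 Safari/534.30"}
params = {"q": "baseball pitcher", "tbm": "isch"}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, "lxml")

for img in soup.find_all("img"):
    thumb = img.get("data-src") or img.get("src")  # placeholder thumbnail, if present
    if thumb and thumb.startswith("http"):
        print(thumb)  # low-res gstatic thumbnail, not the full-size image

Note that these are only the low-resolution thumbnails; the full-size images are filled in later by Javascript.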
Google's Final Dirty Trick
But now say you use a scraper that is capable of parsing and executing Javascript to update the page HTML - something like a headless browser (here's a list of such browsers). Can you still expect to be able to get all the images just by visiting the src?
Not exactly. Google Images embeds images directly in its result pages.
In other words, it does not link out to image files; it writes each image straight into the page in base64 encoding. This significantly reduces the number of connections needed and improves page loading time.
You can see this for yourself if you navigate to Google Images, right click on any image, and hit Inspect element. Here's a typical HTML tag for an image on Google Images:
<img data-sz="f" name="m4qsOrXytYY2xM:" class="rg_i" alt="Image result for google images" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAPEA4NEBAPEBETDxAQFhEREA8QDxAPFhEWFhURFRgYKCgiGBsnHRcVIzIiJykrLi4vGB8zODMsNygtLisBCgoKDg0OGxAQGy0mHyUrLS4rLy8tNy8tNy0tKy4rLS0tKystLTArLS0tLS0tLy0tLS0tLS0tLS0tKy0tLS0tLf/AABEIAOEA4QMBEQACEQEDEQH/xAAcAAEAAwADAQEAAAAAAAAAAAAABQYHAQMECAL/xAA8EAACAgEBBgMFBQUIAwAAAAAAAQIDBBEFBhITITFBYXEHUYGRsRQiMlKhQmJyc7MVIyUzQ1OSwTXC8P/EABsBAQACAwEBAAAAAAAAAAAAAAAEBQIDBgEH/8QAMxEBAAIBAgQDBQgCAwEAAAAAAAECAwQRBRIhMRNBgRRRYXGRIjIzQqGxwdE08AZS4SP/2gAMAwEAAhEDEQA/ANxAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHDYFR2zvzXXZ9mxKpZd76KMPwa+vivPt5lnh4ZM08TNblr+v0Rb6nry443l1Qo29euKVuJiJ/sKHMmvXv9T2cmgp0rS1vjM7HJnt3tEejpyMPeClccMnGydP2OGNcn6cS0/VGdMvD7ztak1+U7vJpnjtbdG4vtKtpm6c3GcJx/FFJ12LzSl0l80iVbg2PLXn0994ao1d6zteF62Dt3Hz6udjz44p8Mk04yhPTXhkn2fVFLqNPk09+TJHVNpkreN4SRoZgAAAAAAAAAAAAR+2MGd0OGORbQkm3y1DWXlq1qvga8lOaO+zVlpz17zHyZTsnOuyczDxZ3WqFljU+GycXKMYSlpqn014Sq09rXyREzLn9Fa2TNFbTO3zbJCKSSXZLT4Fw6V+j0AAAAAAAAM69pG9EoN4NDevSM+HvKcu1a+a19UveX/CdDXbx8nby/tX6rNMz4dVi3L3ajg0JySlkWJSts8eJ/sL3RRWa7WW1OSZ8o7R8ErDijHXaFjRDbnAFV353T/tNY0Fy6+CzWdzTdyq061w09769e2iJ2i1ttLNpjrvHSPL5y05cUZNt09sfZdOHTDHogoQiu3i34yk/Fv3kTJlvktNrzvLbWsVjaHtMHoAAAAAHGp5zQOT0AAAAB+bfwy9H9Dyezy3aWJ7pP/E9n/zLP6Mym0n4sOa4b/kfVtpdOmcgAAAAAAAdWVcq4TsfaEJTfpFa/wDR7WvNaKx5vJnaN2IbuSeXtXDdnXivldLXxkoys+qR2eujwNFatfdEKfT/AG80TLdDi1yAAAACC3n3ihhQ6Ljta1jHwXgm9PPsvE05cvJ0jun6HQzqJmZnasd5R+Lu9k5UVbnZV8XJa8imSrjWn+zJru//ALqeRjm3W8tl9bjxTy6ekbR5zG8yid4tg5Wz4Sy8PJyJwh96yuc+KcYeM4vtJLxTXbX0Mb4pr1rKbo9bh1FvC1FK9e0xGz37lb6LLksW/hV3C3Ca6RuS7rTwkl18+vuM8WXm6S08T4TOnjxMf3fP4f8Ai422KKcn0SWplkyVx1m1u0KatZtO0INZFuXOVdcuVXHTikustX2ivP8AReZU4r5ddaZ3mtI93eUy9a4IjpvZ3T2BHT7ttyl+ZyjJa+a0+mhvycKwWjpvE+/eWFdXeJ67TCMp2ndjWSqs+9w94t6qUX2lFspfbNTw/N4eSeav8fBOnT49RTnr0WfHujZGM4vWLWqOpxZa5aRevaVTek0tNZ7qrtjea2eTHZ2AoO5uSlbP70KlH8TS8dPr0LfFoq0w+Pn328ojzQrZptfkx+sunae7m0+W7Ktp222pa8uUY1Vzf5YuGnD8dTLDrNNzbXxREfWf1L4cm28XndSNkbeyMnIhi35uXiOcuXxqTlw3a6KE0301fTyZdanS48WLxcWOttuvoh472m3La0w1rZmzvs1Lr5t170bdl1jnOT08+y8kcnmyeJMztEfKNlnFeWuzH9z3/imz/wCZZ/RmUek/Fc9w3/I+rbcm+NcJWTekYrVvyLiZiI3l09KWvaK17yp+LkZO1bLOC2eNi1y4G69FbZLvw8Xppr4dSPWbZevaFvmpi0MRXli15jfr2j0fjbe6E6a534mRkcyEXNwnZxcxJatJrRxl+h7fDtG9ZZaXicWvFM9azWfh2ebcHfZ5Nv2G+XFNxcq7HopS4Vq65e96atP3JjDeZ6WbOLcMphjxcXbzj+WgkhQAAAB49s1ueNkwXeVFsV6uDSNmGYrkrM++P3Y3jeswwjc7NVWfg3SeiVsYvyU4uH/sdxxLH4mlvEe7f6dVNp7cuSJfQZwa7AAADgDIMzaP2nadLl1i8+EdPDgjYoxXySK+J5su8+92XgeDw+Yr35f36tfRYONcTimnFrVNNNPs0+6BE7dXz1tRS2dm2KDaeNlOUPfwKXFFf8Wl8SFMct3f47RqdHHN+av+/q2zbuV/d1adp/f+CSf/AGV3HM01x1pHnP7OP0WPe8z7nG6P+RJ+Lusb9eiX6aErg+3ssfOf3Ya78aflCcLRDVve/H/ybl3UnW/OLTa+j+Zzv/IMUTjrk907LThl/tTT1eTd3PksbOUesqoynH1cJNL5x/Uk/wDHJ56clu3NH6sOLU5bRaPOFL9l+antFcT1dmNZFN93LWMv1SkfRuN4dtNHL2iXMaK3/wBOvm2I5JbMJ9qOKqNo5Dh05ka7lp35klo2vNyjr8Ts+EZebRxzeW8eip1NdsvRt9bk6k5fi5a1/i4epxl9t52Wk/dYnuZ/5TZ/8yz+jMp9J+K53h3+Q0b2j5jrx6oL/Ut0fmoxb0+enyJ+qnauzvuB4YvnmZ8oR25u2eRs+Crx8jIsdt7caq24682WnFN9F00PcNtscbQy4ngi+stzWisdO/y9z2X7P2ptBON1scCiS0ddOk75RfhKfh8DPlvPdHjPpdP+FXnt77dvSP7SO7+5mDgNTppTtX+tY3Ozto9G/wAPw0M4rEImfWZs/wB+3p5LEZIwAAAcMD593z2RLBzbqNGoSk7apeDrk9Uk/fF6r4L3ndcO1MajBEz3jpP+/FT
Z8U0u17cTeOOfjR4mufWlC2Pi34WLylpr66rwOV4jo502WY/LPWP69Fjp80ZK/HzWUr0hGbc2ZZlQVccq/Gj14uTwKc17uJptfDQ3YcsY7c3LE/NhenNG2+zN9+thLZVFN1WRlWynkKpq2zVcPKnLVaaddYov+G6mdXlml6xtEb9I+KDqMMY67xM/V6fY5DnzzMuyU5TrlCuGs5OMIyi3LRdte3U08cnkmuOsbRtu2aOOkzKrbx1zw8++HVSryOfXr04oufMg15eHwZyNq8l930rR5K6rSRHvrtP02bls7NhkVVZFb1hZCM16Na6epOid43cNlxWxXmlu8Ts9J61sG3wxnmbauxavvO3Jrr6eCjXBTl8OGXyI0xvd2OHN7Pw2s277T+szs1feulwpqsj2qai/KDSWvzSK3jmnnJhi8flc9w/JHiTWfN17n5K/vate75kfPolL6I0cC1EctsM9+8NnEsUxMX9FmOiVaC3wtUcda/nT+SZScdtHs8V98wsOGVmc3T3PDuBivkW3yX+dZ91PxritE/i3I38JwTiwbz59TiWWL5eWPLoou8W5+Zs7KWXhQnbTGzm1uuPHZS9deXKC6uPhqvDud9puJ4NRh8LUTtO23w+bncmmvS3NjXDE9oVXKUsjHyKLeHV1yhwptd3Fy06fAq54Teb7YrRavv3/AHbva4iPtxMSitg7AntbL/trLio08UZUU9+OMPwSl+7r182btVqo0uL2TDPX80/Ge+xixzkt4t/Ro9v4Zfwv6FFKXbtLDdzH/imz/wCZZ/RmVGk/Fc5w7/IaJ7UcOU8HnRTbotjY9P8AbacZP4ap/Ass9earvOCZ4x6nafzRsgPZXt2MJ2YU5aK2XMrbfTmaJSh8Uk16M16e232ZWPHtHNojPXy6T/E/w08lOWcgAAAAAAgd792KdpU8qz7lkW5V2payrk/rF9NV5eRL0esvpr81e3nDVlxRkjaWO34e0tiXq2UJw4W0r6050Tj4pv3P8stH9TqK6nS67HyWn0nurpxZMVuaF92F7VcS2KWSpUz06yinOp+ei+8vTR+pT6jgeas74p5o/VKprKz96NliW++zNOL7bQl5yafya1IM8N1UTtyS3e0Y/wDsoXtM3nx9o1UYuDzcqyGQrG66bHBJVzjp21b1kvDTuWvC8FtJecmaYrG23fr3R894yRtTqmvY3snJxqct5NFlPMtrlFTSTlFQab07r4kTi+ox5slZxzvtDdp6TSvVOb7bn17ShGSlysitPgt01TX5Jrxj9ClvTmW2h199Lbp1ie8Klu5lbW2PxY1+DbkY/E2nQ1Zwt9XKDXg34NIwrzU6bLTV20eu+3F+W/xj90/lb05+RB14OzcmFkunNylCquv97TX7xlzWntCBXS6bHO+XLEx7q+bt3H3KjgOeVfNXZdmvFZ14YJvWUY6+9934ntaRDXrtfbUztHSsdoW62uM4yhJKUZJpp9mn3RlasWjaUGJmJ3hS8nd/KxLObif3taeqhxJWw8uvSS/UoM/CsmPJ4unnrC2prseWnJmj1S9O37tNJYWTx+5QfDr6+BNpqtVttbDO/wAJjZEtp8O+9ckbPJfsfIz7IzylyaI9qVJOcl7m10S/X0NddFkz5Iy6nbp2rDZ7VTDSaYe895n+FoqrUYqMUoxikkl0SS7Itlf3eTaO1sfGWt91VX8c0m/Rd2bsWDJlnalZlhbJWv3pZJ7Tds4uZbRfi3Suca5VTjGq5RUdeJSUmkn3afwOm4RTJpq2pmiI3neOsK/VTGSYmnX0SPs/39hTXVhZLSqjpCu7/bj4Rs/d/e8PHp1NXE+FTeZzYu/nH9f090+p5Y5L/Vp+dZNVWSqhzZ8EnGClGPHLTotX0XqcxbfZYW3mvRj+x92NsY2VRlrChJ1SclHn0rXWLi1rr06Nlfj016WiypwaHLivzx3atsq2++qf2vGjQ5ax5XNjfxQa0fFokuvu6k+szMfahbY7X7z0ll283s+y8WyVuDF5FDlxKtS0vp8dFr+JLwa6/U03w+cOq0fGqTXkz/8AkvXsnfTbFaVNmz772uic6LoT+MktGe1m8eTTn0nDrzzVycv6rnsC3aeRKN2XCrDpXVUQ+/dZL9+XaMfJdTZXmnuqtR7NSOXDvaffPb0hZjNCAAAAAA4lFNNNJp+D6pgQ+Vups+18VmHjSfv5UE/0N9dTmr2tP1YTSs94cY+6ez63rDCxk/5UX9RbU5rd7z9SMdY7QlaMeFa0hCEF7oxUV+hpmZnuzdp4AAAAAAAAACsb+7xPBoioPS21yjF/ljFaykvPrFfEseGaP2nL17R3RtTm8OvTvKB3A3bqyqltTMXPstlNwjZrKMIRk46tPvJtPuSeJ6u1LzgxfZrHTp0YabDHLz26zLQK6IRWkYQivcopIpZmZlMZb7Tty6qa57SxoqtKUedVHpBqclHmQXg9WtV266+vR8H4jebxgv190oGqwREc0LT7K82d2zKHNtuErKU33cIS0j8l0+BXcWx1pqrRXz6/VI00zOON1uK1vAAAAAAAAAAAAAAAAAAAAAAAAAAApHtU3fty8eq6iLnbjynLlr8U6ppcaj75fdi9PJltwjWV0+WYv2si6rFOSu8d4Vf2db+VYtf2HK4oQjOXBPRt18T1lXOPfu29fPQseJ8Mtnt42Hrv3j+YaMGo8OOS7RY72bPceL7bjaedsE/k+pRTodTE7eHP0TIz4/8AtCo727e/taD2VsyLv5k4c3I0aoqhGSl+J9+qXw9ek/S4fYrePn6THavnM/w05L+L9inrK7bubIhg41OJB6quOjk+85t6yk/VtlXnzWzZJvbvKTSsVjaEkamQAAAAAAAAAAAAAAAAAAAAAAAAAADQCI2puzg5T478WmyX53BKf/JaM349TmxfctMerCcdbd4eGrcHZUXqsOp/xOc18pPQ224hqZjabz9WMYcceUJ/FxK6oqFUIVxX7MIqMfkiJMzad5lt2dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=" style="width: 167px; height: 167px; margin-left: -7px; margin-right: -6px; margin-top: 0px;">
Note the massive wall of text buried in src. That is quite literally the image itself, written in base 64. When we see an image on our screen, we are actually viewing the result of this very text parsed and rendered by a suitable graphics engine. Modern browsers support decoding and rendering of base64-encoded URIs, so it's not a surprise you can literally copy-paste the relevant text into your address bar, hit Enter and view the image at once.
To get the image back, you can decode this wall of text (after suitably parsing it to remove the data:image/jpeg;base64,) using the base64 module in Python:
import base64
base64_string = ... # that monster you saw above
decoded_string = base64.b64decode(base64_string)
You must also make sure to parse the image type from the start of the src attribute, write decoded_string to a file, and save the file with the extension you extracted from the data URI. Phew.
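Here's a minimal sketch of that whole round-trip (the tiny 1x1 transparent GIF data URI below is my own stand-in for the jpeg monster above):

import base64

# A tiny but complete data URI (a 1x1 transparent GIF) standing in for
# the giant jpeg above.
data_uri = ("data:image/gif;base64,"
            "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

header, b64_payload = data_uri.split(",", 1)          # "data:image/gif;base64"
mime_type = header.split(":", 1)[1].split(";", 1)[0]  # "image/gif"
extension = mime_type.split("/", 1)[1]                # "gif"

# Decode the payload and save it with the extension taken from the MIME type.
with open("scraped_image." + extension, "wb") as f:
    f.write(base64.b64decode(b64_payload))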
tl;dr
- Don't go after Google Images as your first major scraping project. It's hard. Wikipedia is much easier to get ahold of.
- Scraping Google Images is in violation of their Terms of Service (although what scraping isn't? and note I am not a lawyer and this doesn't constitute legal advice), where they explicitly say: "Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide."
- Google's defenses are really impossible to predict or work around. I wouldn't be surprised if Google was using additional authentication mechanisms even after you spoof a human browser as much as possible (for instance, a custom HTTP header), and no one except an anonymous rebellious Google employee eager to reduce his/her master to rubble (unlikely) could help you out then.
- It's significantly easier to use Google's provided Custom Search API, which lets you simply ask Google for a set of images programmatically without the hassle of scraping. This API is rate-limited to about a hundred requests a day, which is more than enough for a hobby project. Here are some instructions on how to use it for images; a minimal sketch follows after this list. As a rule, use an API before considering scraping.
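For reference, here's a minimal sketch of an image query against the Custom Search JSON API (the API_KEY and SEARCH_ENGINE_ID placeholders are values you create yourself in the Google developer console; the endpoint and parameters follow the documented API):

import requests

API_KEY = "YOUR_API_KEY"            # placeholder: create in the developer console
SEARCH_ENGINE_ID = "YOUR_CX_ID"     # placeholder: your custom search engine id

params = {
    "key": API_KEY,
    "cx": SEARCH_ENGINE_ID,
    "q": "baseball pitcher",
    "searchType": "image",  # restrict results to images
}

response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in response.json().get("items", []):
    print(item["link"])  # direct URL of each result image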
To scrape Google Images using the requests and beautifulsoup libraries, you need to parse data from the page source code, inside <script> tags, using regular expressions.
If you only need to parse thumbnail-size images, you can do it by passing a content-type query param (solution found by MendelG) in the HTTP request:
import requests
from bs4 import BeautifulSoup

params = {
    "q": "batman wallpaper",
    "tbm": "isch",
    "content-type": "image/png",
}

html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')

for img in soup.select("img"):
    print(img["src"])
To scrape the full-res image URLs with requests and beautifulsoup, you need to extract data from the page source code via regex.
Find all <script> tags:
all_script_tags = soup.select('script')
Match the image data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
Match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data_json)
Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
If you need to save them, you can do it via urllib.request.urlretrieve(url, filename) (more in-depth):
import urllib.request

# it will often throw a 404 error; to avoid that we need to pass a user-agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg')  # you can skip the folder path and it will save to the current working directory
Code and full example in the online IDE:
import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():
    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decoding, Unicode characters are still present; after the second pass, they are decoded
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full-resolution image matches
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')

get_images_data()
-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
Alternatively, you can achieve the same thing by using Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the source code of the page; instead, you only need to iterate over structured JSON and get what you want.
Code to integrate:
import os, urllib.request, json  # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()
---------------
'''
[
...
{
"position": 100, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
"source": "pexels.com",
"title": "Close-up of Cat · Free Stock Photo",
"link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
"original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
]
'''
P.S. - I wrote a more in-depth blog post about how to scrape Google Images and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
The best way to solve this problem is by using a headless browser like Chrome (via ChromeDriver) together with a browser-automation library like Selenium for Python. Beautiful Soup alone isn't adequate.
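For what it's worth, here's a minimal sketch of that approach (assumes Selenium 4 and a matching Chrome/chromedriver install; the selector is deliberately generic):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://www.google.com/search?q=baseball+pitcher&tbm=isch")

# Once the page's JavaScript has run, thumbnails carry a real src
# (often a base64 data: URI, as described in the answer above).
for img in driver.find_elements(By.CSS_SELECTOR, "img"):
    src = img.get_attribute("src")
    if src:
        print(src)

driver.quit()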
I want to get the HTML code of Windows Phone Marketplace pages. So far I have not run into any problems, but today the following error is displayed every time I retrieve data:
[...] Your request appears to be from an automated process.
If this is incorrect, notify us by clicking here to be redirected [...].
I tried using a proxy in case too many requests were being made from one IP, but this brought no progress. Do you happen to know why this problem takes place, or have any ideas about possible ways out? Any help would be very much appreciated. The main goal is to somehow get information about Windows Phone apps from the marketplace.
It seems that they detect the user agent and block the request if it is not valid / known for a device.
I managed to make it work with curl, e.g.:
curl -A 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9' http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9
For ASP.NET, if you use HttpWebRequest to get the HTML content, try the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9";
For PHP you can set your user agent as well via curl_setopt.
I was not able to find out, whether there is an IP-based block after several requests.
This seems to only happen in Chrome (latest version 31.0.1650.48 m, but also earlier), but since it doesn't always happen it's hard to say for sure.
When streaming audio stored in Azure Blob storage, Chrome will occasionally play about 30-50% of the track and then stop. It's hard to reproduce, but if I clear the cache and play the file over and over again, it eventually happens. An example file can be found here.
The error is pretty much the same as what's described here, but I've yet to see the problem on any files hosted elsewhere.
Update:
The Azure Blog log only gives AnonymousSuccess messages, no error messages. This is what I get:
1.0;2013-11-14T12:10:10.6629155Z;GetBlob;AnonymousSuccess;200;3002;269;anonymous;;p3urort;blob;"http://p3urort.blob.core.windows.net/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";"/p3urort/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";c377a003-ea6b-4982-9335-15ebfd3cf1b1;0;160.67.18.120:54132;2009-09-19;419;0;318;7663732;0;;;"0x8D09A26E7479CEB";Friday, 18-Oct-13 14:38:53 GMT;;"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.48 Safari/537.36";"http://***.azurewebsites.net/";
Apparently you have to set the content type to audio/mpeg3.
Here's how I do it:
CloudBlockBlob blockBlob = container.GetBlockBlobReference(fileName);
blockBlob.UploadFromStream(theStream);
theStream.Close();
blockBlob.Properties.ContentType = "audio/mpeg3";
blockBlob.SetProperties();
From here: https://social.msdn.microsoft.com/Forums/azure/en-US/0139d27a-0325-4be1-ae8d-fbbaf1710629/unable-to-load-audio-in-html5-audio-tag-from-storage?forum=windowsazuredevelopment
[edit] - This didn't actually work for me. I'm trying to troubleshoot, but I don't know what's wrong; I'm going to ask a new question.
This mp3 only plays for 1.5 min and then stops. When downloaded, the file plays fully...
https://orator.blob.core.windows.net/mycontainer/zenhabits.net.unsolved.mp3
I am using Adobe AIR 3.8 and manually generating all of the OAuth calls for my application. Tracing out the user agent of the HTMLLoader that I am using for the OAuth calls in AS3, I get the follow:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/533.19.4 (KHTML, like Gecko) AdobeAIR/3.8
I am able to get past the user login portion and up to the part where they need to accept the permissions to move on. However, AS3 traces out the following error upon load complete of this page:
TypeError: Result of expression 'window.sessionStorage' [undefined] is not an object.
at https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.bI438WBuHt0.O/m=googleapis_client,plusone/exm=appcirclepicker/rt=j/sv=1/d=1/ed=1/am=IA/rs=AItRSTNuPHIoFBjGmVBeSqIsgUIKEsrbzA/cb=gapi.loaded_1 : 13
at https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.bI438WBuHt0.O/m=googleapis_client,plusone/exm=appcirclepicker/rt=j/sv=1/d=1/ed=1/am=IA/rs=AItRSTNuPHIoFBjGmVBeSqIsgUIKEsrbzA/cb=gapi.loaded_1 : 23
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 149
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 152
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 149
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 151
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 151
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 151
at https://ssl.gstatic.com/gb/js/smm_5a5968e7804546d31a076ff436e35b36.js : 151
This problem is similar to TypeError upon authenticating user using Google OAuth 2, which I assume has been resolved, since that original problem had the accept button disabled.
My issue has the accept button enabled, but when you click on it, it just grays out and the HTMLLoader (internal AIR browser) goes nowhere. I notice that the option to select which of the user's circles are allowed to see the app activity doesn't render correctly either. Here is a screenshot of what I see:
Authenticating via a normal browser using the generated OAuth URL works fine.
I'd also like to note that the cancel button works just fine in AIR. When they hit cancel, the HTMLLoader redirects to the redirect URL with "?error=access_denied" like it should.
Are you using the G+ Sign-In javascript library?