I was following this past question (Extracting image src based on attribute with BeautifulSoup) to try to extract all the images from a Google Images page. I was getting a "urllib2.HTTPError: HTTP Error 403: Forbidden" error, but was able to get past it using this:
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
However, I then got a new error that seems to be telling me that the src attribute does not exist:
Traceback (most recent call last):
File "Desktop/webscrapev2.py", line 13, in <module>
print(tag['src'])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 958, in __getitem__
return self.attrs[key]
KeyError: 'src'
I was able to get over that error by checking specifically for the 'src' attribute, but most of my images, when extracted, don't have the src attribute. It seems like Google is doing something to obscure my ability to extract even a few images (I know requests are limited, but I thought it was at least 10).
For example printing out the variable tag (see code below) gives me this:
<img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f" jsaction="load:str.tbn" name="jxlMHbZd-duNgM:" onload="google.aft&&google.aft(this)"/>
But printing out the variable v gives 'None'. I have no idea why this is happening, nor how to get the actual image from what it's returning. Does anyone know how to get the actual images? I'm especially worried since the data-src URL starts with encrypted... Should I query data-src to get the image instead of src? Any assistance or advice would be super appreciated!
Here is my full code (in Python):
from bs4 import BeautifulSoup
import urllib2
url = "https://www.google.com/search? q=baseball+pitcher&espv=2&biw=980&bih=627&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5h8-9lfjLAhUE7mMKHdgKD0YQ_AUIBigB"
#'http://www.imdb.com/title/tt%s/' % (id,)
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
soup = BeautifulSoup(urllib2.urlopen(req).read(), "lxml")
print "before FOR"
for tag in soup.findAll('img'):
    print "inside FOR"
    v = tag.get('src', tag.get('dfr-src'))  # gets "src", else "dfr-src"; if both are missing - None
    print v
    print tag
    if v is None:
        print("v is NONE")
        continue
    print(tag['src'])
Oh, boy. You picked the wrong site to scrape from. :)
Google's Defenses
First off, Google is (obviously) Google. It knows web crawlers and web scrapers very well - its entire business is founded on them.
So it knows all of the tricks that ordinary people get up to and, more importantly, it has a strong incentive to make sure nobody except end users gets their hands on its images.
Didn't pass a User-Agent header? Now Google knows you're a scraper bot that doesn't bother pretending to be a browser, and forbids you from accessing its content. That's why you got a 403: Forbidden error the first time - the server realised you were a bot and prevented you from accessing material. It's the simplest technique to block automated bots.
Google Builds Pages through Javascript
Don't have Javascript parsing capability (which Python's requests, urllib and their ilk don't)? Now you can't view half your images, because of the way Google Image search results work: if you inspect the Network tab in your Chrome console as Google Images is loading, you'll see a few bundled requests made to various content providers, which then systematically add a src attribute to a placeholder img tag through inline obfuscated Javascript code.
At the very beginning of time, all of your images are essentially blank, with just a custom data-src attribute to coordinate activities. Requests are made to image source providers as soon as the browser begins to parse Javascript (because Google probably makes use of its own CDN, these images are transferred to your computer very quickly), and then page Javascript does the arduous task of chunking the received data, identifying which img placeholder it should go to and then updating src appropriately. These are all time-intensive operations, and I won't even pretend to know how Google can make them happen so fast (although note that messing with network throttling operations in Dev Tools on Chrome 48 can cause Google Images to hang, for some bizarre reason, so there's probably some major network-level code-fu going on over there).
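(For what it's worth, the placeholder tags in the static HTML, like the one printed in the question, do carry a thumbnail URL in data-src, so a fallback that reads data-src when src is missing can at least recover those thumbnails. A minimal sketch, assuming the markup still looks like that tag, using Python 3 with requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

# Assumption: the static HTML still exposes thumbnails via src or data-src,
# as in the <img class="rg_i" data-src="https://encrypted-tbn0..."> tag shown in the question.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36"}
response = requests.get("https://www.google.com/search",
                        params={"q": "baseball pitcher", "tbm": "isch"},
                        headers=headers)
soup = BeautifulSoup(response.text, "lxml")
for tag in soup.find_all("img"):
    thumbnail = tag.get("src") or tag.get("data-src")  # fall back to the unfilled placeholder
    if thumbnail and thumbnail.startswith("http"):
        print(thumbnail)
This only gets you the low-resolution thumbnails, not the full images.)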
These image source providers appear to begin with https://encrypted..., which doesn't seem to be something to worry about - it probably just means that Google applies a custom encryption scheme to the data as it's being sent over the network on top of HTTPS, which is then decoded by the browser. Google practises end-to-end encryption beyond just HTTPS - I believe every layer of the stack works only with encrypted data, with encryption and decryption happening only at the entry and exit points - and I wouldn't be surprised to see the same technology behind, for example, Google Accounts.
(Note: all the above comes from poking around in Chrome Dev Tools for a bit and spending time with de-obfuscators. I am not affiliated with Google, and my understanding is most likely incomplete or even woefully wrong.)
Without a bundled Javascript interpreter, it is safe to say that Google Images is effectively a blank wall.
Google's Final Dirty Trick
But now say you use a scraper that is capable of parsing and executing Javascript to update the page HTML - something like a headless browser (here's a list of such browsers). Can you still expect to be able to get all the images just by visiting the src?
Not exactly. Google Images embeds images in its result pages.
In other words, it does not link out to separate image files; it copies the images in textual format, literally writing each image down in base64 encoding. This significantly reduces the number of connections needed and improves page loading time.
You can see this for yourself if you navigate to Google Images, right click on any image, and hit Inspect element. Here's a typical HTML tag for an image on Google Images:
<img data-sz="f" name="m4qsOrXytYY2xM:" class="rg_i" alt="Image result for google images" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAPEA4NEBAPEBETDxAQFhEREA8QDxAPFhEWFhURFRgYKCgiGBsnHRcVIzIiJykrLi4vGB8zODMsNygtLisBCgoKDg0OGxAQGy0mHyUrLS4rLy8tNy8tNy0tKy4rLS0tKystLTArLS0tLS0tLy0tLS0tLS0tLS0tKy0tLS0tLf/AABEIAOEA4QMBEQACEQEDEQH/xAAcAAEAAwADAQEAAAAAAAAAAAAABQYHAQMECAL/xAA8EAACAgEBBgMFBQUIAwAAAAAAAQIDBBEFBhITITFBYXEHUYGRsRQiMlKhQmJyc7MVIyUzQ1OSwTXC8P/EABsBAQACAwEBAAAAAAAAAAAAAAAEBQIDBgEH/8QAMxEBAAIBAgQDBQgCAwEAAAAAAAECAwQRBRIhMRNBgRRRYXGRIjIzQqGxwdE08AZS4SP/2gAMAwEAAhEDEQA/ANxAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHDYFR2zvzXXZ9mxKpZd76KMPwa+vivPt5lnh4ZM08TNblr+v0Rb6nry443l1Qo29euKVuJiJ/sKHMmvXv9T2cmgp0rS1vjM7HJnt3tEejpyMPeClccMnGydP2OGNcn6cS0/VGdMvD7ztak1+U7vJpnjtbdG4vtKtpm6c3GcJx/FFJ12LzSl0l80iVbg2PLXn0994ao1d6zteF62Dt3Hz6udjz44p8Mk04yhPTXhkn2fVFLqNPk09+TJHVNpkreN4SRoZgAAAAAAAAAAAAR+2MGd0OGORbQkm3y1DWXlq1qvga8lOaO+zVlpz17zHyZTsnOuyczDxZ3WqFljU+GycXKMYSlpqn014Sq09rXyREzLn9Fa2TNFbTO3zbJCKSSXZLT4Fw6V+j0AAAAAAAAM69pG9EoN4NDevSM+HvKcu1a+a19UveX/CdDXbx8nby/tX6rNMz4dVi3L3ajg0JySlkWJSts8eJ/sL3RRWa7WW1OSZ8o7R8ErDijHXaFjRDbnAFV353T/tNY0Fy6+CzWdzTdyq061w09769e2iJ2i1ttLNpjrvHSPL5y05cUZNt09sfZdOHTDHogoQiu3i34yk/Fv3kTJlvktNrzvLbWsVjaHtMHoAAAAAHGp5zQOT0AAAAB+bfwy9H9Dyezy3aWJ7pP/E9n/zLP6Mym0n4sOa4b/kfVtpdOmcgAAAAAAAdWVcq4TsfaEJTfpFa/wDR7WvNaKx5vJnaN2IbuSeXtXDdnXivldLXxkoys+qR2eujwNFatfdEKfT/AG80TLdDi1yAAAACC3n3ihhQ6Ljta1jHwXgm9PPsvE05cvJ0jun6HQzqJmZnasd5R+Lu9k5UVbnZV8XJa8imSrjWn+zJru//ALqeRjm3W8tl9bjxTy6ekbR5zG8yid4tg5Wz4Sy8PJyJwh96yuc+KcYeM4vtJLxTXbX0Mb4pr1rKbo9bh1FvC1FK9e0xGz37lb6LLksW/hV3C3Ca6RuS7rTwkl18+vuM8WXm6S08T4TOnjxMf3fP4f8Ai422KKcn0SWplkyVx1m1u0KatZtO0INZFuXOVdcuVXHTikustX2ivP8AReZU4r5ddaZ3mtI93eUy9a4IjpvZ3T2BHT7ttyl+ZyjJa+a0+mhvycKwWjpvE+/eWFdXeJ67TCMp2ndjWSqs+9w94t6qUX2lFspfbNTw/N4eSeav8fBOnT49RTnr0WfHujZGM4vWLWqOpxZa5aRevaVTek0tNZ7qrtjea2eTHZ2AoO5uSlbP70KlH8TS8dPr0LfFoq0w+Pn328ojzQrZptfkx+sunae7m0+W7Ktp222pa8uUY1Vzf5YuGnD8dTLDrNNzbXxREfWf1L4cm28XndSNkbeyMnIhi35uXiOcuXxqTlw3a6KE0301fTyZdanS48WLxcWOttuvoh472m3La0w1rZmzvs1Lr5t170bdl1jnOT08+y8kcnmyeJMztEfKNlnFeWuzH9z3/imz/wCZZ/RmUek/Fc9w3/I+rbcm+NcJWTekYrVvyLiZiI3l09KWvaK17yp+LkZO1bLOC2eNi1y4G69FbZLvw8Xppr4dSPWbZevaFvmpi0MRXli15jfr2j0fjbe6E6a534mRkcyEXNwnZxcxJatJrRxl+h7fDtG9ZZaXicWvFM9azWfh2ebcHfZ5Nv2G+XFNxcq7HopS4Vq65e96atP3JjDeZ6WbOLcMphjxcXbzj+WgkhQAAAB49s1ueNkwXeVFsV6uDSNmGYrkrM++P3Y3jeswwjc7NVWfg3SeiVsYvyU4uH/sdxxLH4mlvEe7f6dVNp7cuSJfQZwa7AAADgDIMzaP2nadLl1i8+EdPDgjYoxXySK+J5su8+92XgeDw+Yr35f36tfRYONcTimnFrVNNNPs0+6BE7dXz1tRS2dm2KDaeNlOUPfwKXFFf8Wl8SFMct3f47RqdHHN+av+/q2zbuV/d1adp/f+CSf/AGV3HM01x1pHnP7OP0WPe8z7nG6P+RJ+Lusb9eiX6aErg+3ssfOf3Ya78aflCcLRDVve/H/ybl3UnW/OLTa+j+Zzv/IMUTjrk907LThl/tTT1eTd3PksbOUesqoynH1cJNL5x/Uk/wDHJ56clu3NH6sOLU5bRaPOFL9l+antFcT1dmNZFN93LWMv1SkfRuN4dtNHL2iXMaK3/wBOvm2I5JbMJ9qOKqNo5Dh05ka7lp35klo2vNyjr8Ts+EZebRxzeW8eip1NdsvRt9bk6k5fi5a1/i4epxl9t52Wk/dYnuZ/5TZ/8yz+jMp9J+K53h3+Q0b2j5jrx6oL/Ut0fmoxb0+enyJ+qnauzvuB4YvnmZ8oR25u2eRs+Crx8jIsdt7caq24682WnFN9F00PcNtscbQy4ngi+stzWisdO/y9z2X7P2ptBON1scCiS0ddOk75RfhKfh8DPlvPdHjPpdP+FXnt77dvSP7SO7+5mDgNTppTtX+tY3Ozto9G/wAPw0M4rEImfWZs/wB+3p5LEZIwAAAcMD593z2RLBzbqNGoSk7apeDrk9Uk/fF6r4L3ndcO1MajBEz3jpP+/FT
Z8U0u17cTeOOfjR4mufWlC2Pi34WLylpr66rwOV4jo502WY/LPWP69Fjp80ZK/HzWUr0hGbc2ZZlQVccq/Gj14uTwKc17uJptfDQ3YcsY7c3LE/NhenNG2+zN9+thLZVFN1WRlWynkKpq2zVcPKnLVaaddYov+G6mdXlml6xtEb9I+KDqMMY67xM/V6fY5DnzzMuyU5TrlCuGs5OMIyi3LRdte3U08cnkmuOsbRtu2aOOkzKrbx1zw8++HVSryOfXr04oufMg15eHwZyNq8l930rR5K6rSRHvrtP02bls7NhkVVZFb1hZCM16Na6epOid43cNlxWxXmlu8Ts9J61sG3wxnmbauxavvO3Jrr6eCjXBTl8OGXyI0xvd2OHN7Pw2s277T+szs1feulwpqsj2qai/KDSWvzSK3jmnnJhi8flc9w/JHiTWfN17n5K/vate75kfPolL6I0cC1EctsM9+8NnEsUxMX9FmOiVaC3wtUcda/nT+SZScdtHs8V98wsOGVmc3T3PDuBivkW3yX+dZ91PxritE/i3I38JwTiwbz59TiWWL5eWPLoou8W5+Zs7KWXhQnbTGzm1uuPHZS9deXKC6uPhqvDud9puJ4NRh8LUTtO23w+bncmmvS3NjXDE9oVXKUsjHyKLeHV1yhwptd3Fy06fAq54Teb7YrRavv3/AHbva4iPtxMSitg7AntbL/trLio08UZUU9+OMPwSl+7r182btVqo0uL2TDPX80/Ge+xixzkt4t/Ro9v4Zfwv6FFKXbtLDdzH/imz/wCZZ/RmVGk/Fc5w7/IaJ7UcOU8HnRTbotjY9P8AbacZP4ap/Ass9earvOCZ4x6nafzRsgPZXt2MJ2YU5aK2XMrbfTmaJSh8Uk16M16e232ZWPHtHNojPXy6T/E/w08lOWcgAAAAAAgd792KdpU8qz7lkW5V2payrk/rF9NV5eRL0esvpr81e3nDVlxRkjaWO34e0tiXq2UJw4W0r6050Tj4pv3P8stH9TqK6nS67HyWn0nurpxZMVuaF92F7VcS2KWSpUz06yinOp+ei+8vTR+pT6jgeas74p5o/VKprKz96NliW++zNOL7bQl5yafya1IM8N1UTtyS3e0Y/wDsoXtM3nx9o1UYuDzcqyGQrG66bHBJVzjp21b1kvDTuWvC8FtJecmaYrG23fr3R894yRtTqmvY3snJxqct5NFlPMtrlFTSTlFQab07r4kTi+ox5slZxzvtDdp6TSvVOb7bn17ShGSlysitPgt01TX5Jrxj9ClvTmW2h199Lbp1ie8Klu5lbW2PxY1+DbkY/E2nQ1Zwt9XKDXg34NIwrzU6bLTV20eu+3F+W/xj90/lb05+RB14OzcmFkunNylCquv97TX7xlzWntCBXS6bHO+XLEx7q+bt3H3KjgOeVfNXZdmvFZ14YJvWUY6+9934ntaRDXrtfbUztHSsdoW62uM4yhJKUZJpp9mn3RlasWjaUGJmJ3hS8nd/KxLObif3taeqhxJWw8uvSS/UoM/CsmPJ4unnrC2prseWnJmj1S9O37tNJYWTx+5QfDr6+BNpqtVttbDO/wAJjZEtp8O+9ckbPJfsfIz7IzylyaI9qVJOcl7m10S/X0NddFkz5Iy6nbp2rDZ7VTDSaYe895n+FoqrUYqMUoxikkl0SS7Itlf3eTaO1sfGWt91VX8c0m/Rd2bsWDJlnalZlhbJWv3pZJ7Tds4uZbRfi3Suca5VTjGq5RUdeJSUmkn3afwOm4RTJpq2pmiI3neOsK/VTGSYmnX0SPs/39hTXVhZLSqjpCu7/bj4Rs/d/e8PHp1NXE+FTeZzYu/nH9f090+p5Y5L/Vp+dZNVWSqhzZ8EnGClGPHLTotX0XqcxbfZYW3mvRj+x92NsY2VRlrChJ1SclHn0rXWLi1rr06Nlfj016WiypwaHLivzx3atsq2++qf2vGjQ5ax5XNjfxQa0fFokuvu6k+szMfahbY7X7z0ll283s+y8WyVuDF5FDlxKtS0vp8dFr+JLwa6/U03w+cOq0fGqTXkz/8AkvXsnfTbFaVNmz772uic6LoT+MktGe1m8eTTn0nDrzzVycv6rnsC3aeRKN2XCrDpXVUQ+/dZL9+XaMfJdTZXmnuqtR7NSOXDvaffPb0hZjNCAAAAAA4lFNNNJp+D6pgQ+Vups+18VmHjSfv5UE/0N9dTmr2tP1YTSs94cY+6ez63rDCxk/5UX9RbU5rd7z9SMdY7QlaMeFa0hCEF7oxUV+hpmZnuzdp4AAAAAAAAACsb+7xPBoioPS21yjF/ljFaykvPrFfEseGaP2nL17R3RtTm8OvTvKB3A3bqyqltTMXPstlNwjZrKMIRk46tPvJtPuSeJ6u1LzgxfZrHTp0YabDHLz26zLQK6IRWkYQivcopIpZmZlMZb7Tty6qa57SxoqtKUedVHpBqclHmQXg9WtV266+vR8H4jebxgv190oGqwREc0LT7K82d2zKHNtuErKU33cIS0j8l0+BXcWx1pqrRXz6/VI00zOON1uK1vAAAAAAAAAAAAAAAAAAAAAAAAAAApHtU3fty8eq6iLnbjynLlr8U6ppcaj75fdi9PJltwjWV0+WYv2si6rFOSu8d4Vf2db+VYtf2HK4oQjOXBPRt18T1lXOPfu29fPQseJ8Mtnt42Hrv3j+YaMGo8OOS7RY72bPceL7bjaedsE/k+pRTodTE7eHP0TIz4/8AtCo727e/taD2VsyLv5k4c3I0aoqhGSl+J9+qXw9ek/S4fYrePn6THavnM/w05L+L9inrK7bubIhg41OJB6quOjk+85t6yk/VtlXnzWzZJvbvKTSsVjaEkamQAAAAAAAAAAAAAAAAAAAAAAAAAADQCI2puzg5T478WmyX53BKf/JaM349TmxfctMerCcdbd4eGrcHZUXqsOp/xOc18pPQ224hqZjabz9WMYcceUJ/FxK6oqFUIVxX7MIqMfkiJMzad5lt2dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=" style="width: 167px; height: 167px; margin-left: -7px; margin-right: -6px; margin-top: 0px;">
Note the massive wall of text buried in src. That is quite literally the image itself, written in base 64. When we see an image on our screen, we are actually viewing the result of this very text parsed and rendered by a suitable graphics engine. Modern browsers support decoding and rendering of base64-encoded URIs, so it's not a surprise you can literally copy-paste the relevant text into your address bar, hit Enter and view the image at once.
To get the image back, you can decode this wall of text (after suitably parsing it to remove the data:image/jpeg;base64,) using the base64 module in Python:
import base64
base64_string = ...  # that monster you saw above
decoded_string = base64.b64decode(base64_string)
You must also make sure to parse the image type from the start of the src attribute, write decoded_string to a file, and save it with the file extension you pulled from the data URI. Phew.
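To make that concrete, a small sketch of the parse-and-save step, assuming src holds a data URI of the form data:image/<type>;base64,<payload> like the one above (tag here stands for whatever img element you pulled out with BeautifulSoup):
import base64

src = tag["src"]  # hypothetical: an img tag you extracted, e.g. "data:image/jpeg;base64,/9j/4AAQ..."
header, _, payload = src.partition(",")          # "data:image/jpeg;base64" | "," | the base64 payload
image_type = header.split("/")[1].split(";")[0]  # -> "jpeg"
with open("image." + image_type, "wb") as f:
    f.write(base64.b64decode(payload))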
tl;dr
Don't go after Google Images as your first major scraping project. It's:
- hard. Wikipedia is much easier to get ahold of.
- in violation of their Terms of Service (although what scraping isn't? and note I am not a lawyer and this doesn't constitute legal advice), where they explicitly say: "Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide."
- really impossible to predict how to improve on. I wouldn't be surprised if Google was using additional authentication mechanisms even after spoofing a human browser as much as possible (for instance, a custom HTTP header), and no one except an anonymous rebellious Google employee eager to reduce his/her master to rubble (unlikely) could help you out then.
It's significantly easier to use Google's provided Custom Search API, which lets you simply ask Google for a set of images programmatically without the hassle of scraping. This API is rate-limited to about a hundred requests a day, which is more than enough for a hobby project. Here are some instructions on how to use it for images. As a rule, use an API before considering scraping.
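For reference, a minimal sketch of a Custom Search image query with requests; the API key and CX (custom search engine ID) are placeholders you would create in the Google developer console, and the exact quota depends on your plan:
import requests

# Placeholders: create an API key and a custom search engine (CX) in the Google developer console.
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

params = {
    "key": API_KEY,
    "cx": CX,
    "q": "baseball pitcher",
    "searchType": "image",  # ask for image results rather than web results
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in response.json().get("items", []):
    print(item["link"])  # direct URL of each image result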
To scrape Google Images using the requests and beautifulsoup libraries, you need to parse data from the page source code, inside <script> tags, using regular expressions.
If you only need thumbnail-size images, you can get them by passing a content-type query parameter (solution found by MendelG) in the HTTP request:
import requests
from bs4 import BeautifulSoup
params = {
    "q": "batman wallpaper",
    "tbm": "isch",
    "content-type": "image/png",
}

html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')

for img in soup.select("img"):
    print(img["src"])
To scrape the full-res image URL with requests and beautifulsoup you need to scrape data from the page source code via regex.
Find all <script> tags:
soup.select('script')
Match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
Match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data_json)
Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
If you need to save them, you can do it via urllib.request.urlretrieve(url, filename) (more in-depth):
import urllib.request
# often it will throw a 404 error; to avoid it we need to pass a user-agent
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg') # you can skip folder path and it will save them in current working directory
Code and full example in the online IDE:
import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
        # after the first decoding, Unicode characters are still present; after the second pass they are decoded
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # remove previously matched thumbnails for easier full resolution image matches
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')
get_images_data()
-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
Alternatively, you can achieve the same thing by using the Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the source code of the page; instead, you only need to iterate over structured JSON and get what you want.
Code to integrate:
import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch
def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()
---------------
'''
[
...
{
"position": 100, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
"source": "pexels.com",
"title": "Close-up of Cat · Free Stock Photo",
"link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
"original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
]
'''
P.S. - I wrote a more in-depth blog post about how to scrape Google Images and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
The best way to solve this problem is to use a headless browser such as Chrome via WebDriver, together with a browser-automation library like Selenium for Python. Beautiful Soup alone isn't adequate.
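A minimal sketch of that approach, assuming chromedriver is installed and on your PATH; the fixed sleep is a crude stand-in for a proper wait:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.google.com/search?q=baseball+pitcher&tbm=isch")
time.sleep(3)  # crude wait for the JavaScript to fill in the src attributes

soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

for img in soup.select("img"):
    src = img.get("src") or img.get("data-src")
    if src:
        print(src)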
Difficult to find answers for this question here; I know, I've been searching for over an hour now. Many come close, and I've tried bits and pieces from some of them, but the solution still evades me. (Until now; see update.)
Trying to pull all MP3 files from https://www.crrow777radio.com/free-episodes/, but they are deeply nested. I thought I could provide that URL and bs would recursively follow all links, and I could filter them to get my specific files to download. Apparently each href found must then be requested and parsed for the links on that page.
I have code that will pull the MP3 from the page that contains it, but doing that for all such pages (from top down recursively) is not as easy as the bs docs lead me to believe it is.
Update: With the help of MendelG and others elsewhere I have revised the code. I believe this will do the job, but getting the [large size] file content into a variable may need improvement via some sort of chunked download-and-write scheme to reduce the memory impact (see the streaming sketch after the code below):
import re
import requests
from bs4 import BeautifulSoup

URL = "https://www.crrow777radio.com/free-episodes/"

def getMP3sOnPageP(session, h, p):
    soup = BeautifulSoup(session.get(p, headers=h).content, "html.parser")
    # Select all the buttons on this page with the text `LISTEN`
    for tag in soup.select("a.button"):
        # Extracts the link from the button, in order to perform a request to that page
        page = tag["href"]
        soup = BeautifulSoup(session.get(page, headers=h).content, "html.parser")
        # Finds the link to the mp3 file
        download_link = soup.select_one("a.btn[download]")["href"]
        file_name = re.search(r'(\d+-Hour-1.mp3)', download_link.split("/")[-1]).group()
        # Request the mp3 file
        print("Downloading ", file_name)
        mp3_file = session.get(download_link).content
        with open(file_name, "wb") as f:
            f.write(mp3_file)

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

with requests.Session() as session:
    soup = BeautifulSoup(session.get(URL, headers=HEADERS).content, "html.parser")
    # Select all the links on this page with a class of "page-numbers"
    for a in soup.select("a.page-numbers"):
        getMP3sOnPageP(session, HEADERS, a.get("href"))
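On the memory concern mentioned above, one improvement is to stream each MP3 to disk in chunks instead of holding the whole response in mp3_file; a sketch of what that could look like with requests:
# Sketch: stream the download in chunks rather than loading the whole file into memory.
def download_mp3(session, url, file_name, chunk_size=1024 * 1024):
    with session.get(url, stream=True) as response:
        response.raise_for_status()
        with open(file_name, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)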
As you can see this required three calls to Beautiful Soup (bs). My reading of the docs led me to believe bs would operate recursively, such that a single call with the correct filtering/parameters would be sufficient. If it can, I definitely don't know how.
The update I provided solved the question I had. Although user MendelG's reply didn't address the entire problem, it was very helpful; his contribution is embodied in the getMP3sOnPageP function in my update.
I'm trying to scrape council tax band data from this website and I'm unable to find the API:
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm?action=pageNavigation&p=0
I've gone through previous answers on Stack Overflow and articles including:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
and
https://medium.com/@williamyeny/finding-ratemyprofessors-private-api-with-chrome-developer-tools-bd6747dd228d
I've gone into the Network tab - XHR/All - Headers/ Preview/ Response and the only thing I can find is:
/**/jQuery11130005376436794664263_1594893863666({ "html" : "<li class='navbar-text myprofile_salutation'>Welcome Guest!</li><li role='presentation' class=''><a href='https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/citizenportal/login.htm?redirect_url=https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/dashboard.htm'> Sign In / Register <span class='icon-arrow-right17 pull-right visible-xs-block'></span></a></li>" });
As a test I used AB24 4DE to search and couldn't find it anywhere within a json response.
As far as I can tell the data isn't hidden behind a web socket.
I ran a get request for the sake of it and got:
JSONDecodeError: Expecting value: line 10 column 1 (char 19)
What am I missing?
You're doing the right thing by looking at the network tools. I find it's best to zoom in on the overview you're given in the Network tab: you can select any portion of the activity in the browser and see what is happening and which requests are made. So you could focus on the requests and responses fired when you click search. That shows two requests being made: one that posts information to the server, and one that grabs information from a separate URL.
Suggestion
My suggestion, having had a look at the website, is to use selenium, which is a package that mimics browser activity. Below you'll see my study of the requests. Essentially the form generates a unique token every time you do a search, and you have to replicate it in order to get the correct response, which is hard to know in advance.
That being said, you can mimic browser activity using selenium, automatically input the postcode, and automate the clicking of the search button. You can then grab the page source HTML and use beautifulsoup to parse it. Here is a minimal reproducible example showing this.
Coding Example
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe')
url = 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm'
driver.get(url)
driver.find_element_by_id('postcode').send_keys('AB24 4DE')
driver.find_element_by_xpath('//input[@class="btn btn-primary cap-submit-inline"]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
There is also scope to make the browser headless, so it doesn't pop up and all you'll get back is the parsed HTML.
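For example, a sketch of the headless variant (the exact option names can vary between Selenium and Chrome versions):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe', options=options)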
Explanation of Code
We are importing webdriver from selenium, this provides the module necessary to load a browser. We then create an instance of that webdriver, in this case I'm using chrome but you can use firefox or other browsers.
You'll need to download chromedriver from here, https://chromedriver.chromium.org/. Webdriver uses this to open a browser.
We use the webdriver's get method to make chromedriver go to the specific page we want.
Webdriver has a list of find_element_by... methods you can use. The simplest here is find_element_by_id. We can find the id of the input box for the postcode in the HTML, which I've done here. send_keys will send whatever text we want, in this case it's AB24 4DE.
find_element_by_xpath takes an XPath selector. '//' goes through all of the DOM, we select input, and the [@class=""] part selects the specific input tag by class. We want the submit button, and the click() method will click it in the browser.
We then grab the page source once this click is complete. This is necessary as we then feed it into BeautifulSoup, which gives us the parsed HTML for the postcode we want.
Reverse Engineering the HTTP requests
Below is for education really, unless someone can get the unique token before sending requests to the server. Here's how the website works in terms of the search form.
Essentially, looking at the process, it's sending cookies, headers, params and data to the server. The cookies have a session ID which doesn't seem to change in my tests. The data variable is where you can change the postcode, but, importantly, the ABCToken also changes every single time you do a search, and the param is a check on the server to make sure it's not a bot.
As an example of the HTTP POST request, we send this:
cookies = {
    'JSESSIONID': '1DBAC40138879EB418C14AD83C05AD86',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://ecitizen.aberdeencity.gov.uk',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('action', 'validateData'),
)

data = {
    'postcode': 'AB24 2RX',
    'address': '',
    'startSearch': 'Search',
    'ABCToken': '35fbfd26-cb4b-4688-ac01-1e35bcb8f63d'
}
To
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm
Then it does an HTTP GET request, with the same JSESSIONID and the unique ABCToken, to grab the data you want from bandsearchresults.htm:
'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm'
So it creates a JSESSIONID, which seems to be the same for any postcode from my testing. Then, when you use that same JSESSIONID together with the ABCToken it supplies against the search results URL, you'll get the correct data back.
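Purely to illustrate that flow (and assuming you could somehow obtain a fresh ABCToken for each search, which is the hard part), the sequence with a requests.Session would look roughly like this:
import requests
from bs4 import BeautifulSoup

# Sketch only: the ABCToken is generated per search, so a hard-coded value will not work in practice.
with requests.Session() as session:
    data = {
        'postcode': 'AB24 2RX',
        'address': '',
        'startSearch': 'Search',
        'ABCToken': 'REPLACE_WITH_A_FRESH_TOKEN',
    }
    # POST the search form with the token...
    session.post('https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm',
                 params={'action': 'validateData'}, data=data)
    # ...then GET the results page using the same session (and therefore the same JSESSIONID).
    results = session.get('https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm')
    soup = BeautifulSoup(results.text, 'html.parser')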
import requests
from bs4 import BeautifulSoup
#Finds the imdb rating of a given movie or TV series
search_term1="What is the imdb rating of "
search_term2=input("Enter the name of the movie or TV Series : ")
search_term=search_term1+search_term2
response=requests.get("https://www.google.co.in/search?q="+search_term)
soup = BeautifulSoup(response.text, 'html5lib')
match=soup.find('div.slp.f')
#i tried 'div',_class="slp.f"
print(match) #this line is returning none
I am trying to extract the IMDb rating of a movie from the Google search engine. Every time it returns None, although the id is correct.
If you try to find before-appbar in the DOM:
import requests
from bs4 import BeautifulSoup
#Finds the imdb rating of a given movie or TV series
search_term1="What is the imdb rating of "
search_term2=input("Enter the name of the movie or TV Series : ")
search_term=search_term1+search_term2
response=requests.get("https://www.google.co.in/search?q="+search_term)
print("before-appbar" in response.text)
The output is False
So clearly "before-appbar" is not an Id of any element here.
My guess is you are trying to determine the DOM element by inspecting it from the browser. However, in most cases the DOM is changed a lot by JS, so it will not match what you get by using requests in Python.
I can suggest you two possible solutions:
1. Save the response in an html file, open it in the browser, and then check which element you need to find:
f = open("response.html", "w")
f.write(response.text)
f.close()
2. Use selenium and a headless browser.
The problem lies in the way you are trying to search for the id. Instead of
print(soup.find(id="before-appbar"))
try
print(soup.find({"id": "before-appbar"}))
Hope this solves the problem.
You're treating find() as if it were the select() method, which accepts CSS selectors. The find() method does not accept CSS selector syntax.
find('div.slp.f') # No
find('div', 'slp f') # will work with find(). Syntax: ('tag', 'class') or ('tag', class_='class')
select('div.slp.f') # Yes
Try to use lxml instead of html5lib, because html5lib is the slowest. Also, there's no need for selenium or for saving the response to a file, as Bishakh Ghosh mentioned, for a task like this.
Make sure you're using a user-agent if you're not using selenium, otherwise Google will eventually block the request: the default user-agent in the requests library is python-requests, so Google understands that it's a bot and not a "real" user visit and will block the request.
Since you haven't mentioned which part of the page you were trying to scrape data from (organic results, knowledge graph, or answer box), I didn't bother finding the one right element that appears on every search result so the rating will always be there.
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "imdb rating of infinity war",
    "hl": "en"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# scrapes from the snippet organic result
rating = soup.select_one('g-review-stars+ span').text
print(rating)
# Rating: 8.4/10
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you iterate over a structured JSON string and get the data you want, rather than figuring out why certain things don't work or don't extract as they should.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "what is imdb rating of infinity war",
    "gl": "us",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
rating = results['organic_results'][0]['rich_snippet']['top']['detected_extensions']['rating']
print(rating)
# 8.4
Disclaimer, I work for SerpApi.
You may get None because more variables need to be passed in the request, not only the URL, but also at minimum "Accept-Language" and "User-Agent" as headers. You can check your own values on this website: http://myhttpheader.com/. So save the data as a dictionary in headers, and after the URL just pass headers={"Accept-Language": "data_you_see", "User-Agent": "data_you_see"}, and you should be ready to scrape.
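A minimal sketch of that, with placeholder header values that you would replace with the ones myhttpheader.com shows for your own browser:
import requests
from bs4 import BeautifulSoup

# Placeholder values: copy the real ones shown for your browser on http://myhttpheader.com/
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
}
response = requests.get("https://www.google.com/search", params={"q": "some query"}, headers=headers)
soup = BeautifulSoup(response.text, "lxml")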
A question regarding Jsoup: I am building a tool that fetches prices from a website. However, this website has streaming content. If I browse manually, I see the prices of 20 mins ago and have to wait about 3 secs to get the current price. Is there any way I can make some kind of delay in Jsoup to be able to obtain the prices in the streaming section? I am using this code:
conn = Jsoup.connect(link).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36");
conn.timeout(5000);
doc = conn.get();
As mentioned in the comments, the site is most likely using some type of scripting that just won't work with Jsoup, since Jsoup only gets the initial HTML response and does not execute any JavaScript.
I wanted to give you some more guidance, though, on where to go now. The best bet, in these cases, is to move to another platform for these types of sites. You can migrate to HTMLUnit, which is a headless browser, or Selenium, which can use HTMLUnit or a real browser like Firefox or Chrome. I would recommend Selenium if you think you will ever need to move past HTMLUnit, as HTMLUnit can sometimes be a less stable browser compared to the consumer browsers Selenium can support. You can use Selenium with the HTMLUnit driver, giving you the option to move to another browser seamlessly later.
You can use a JavaFX WebView with JavaScript enabled. After waiting the two seconds, you can extract the contents and pass them to Jsoup.
(After loading your url into your WebView using the example above)
String html = (String) view.getEngine().executeScript("document.documentElement.outerHTML");
Document doc = Jsoup.parse(html);
I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.
The www.xe.com website provides a historical lookup tool, and using a detailed URL, one can get the rate table for a specific date without populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies on the day of Oct. 15th, 2012.
Now, assume I have a list of dates; I can loop through the list and change the date part of that URL to get the required page. If I can extract the rates list, then a simple grep EUR will give me the relevant exchange rate (I can use awk to specifically extract the rate).
The question is, how can I get the page(s) using a Linux command-line command? I tried wget but it did not do the job.
If not via the CLI, is there an easy and straightforward way to do it programmatically (i.e., one that will require less time than copy-pasting the dates into the browser's address bar)?
UPDATE 1:
When running:
$ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
I get a file which contain:
<HTML>
<HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
<BODY>
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
</BODY>
</HTML>
so it seems like the server can identify the type of query and blocks the wget. Any way around this?
UPDATE 2:
After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:
You agree that you shall not:
...
f. use any automatic or manual process to collect, harvest, gather, or extract
information about other visitors to or users of the Services, or otherwise
systematically extract data or data fields, including without limitation any
financial and/or currency data or e-mail addresses;
which, I guess, concludes the efforts in this front.
Now, for my curiosity, if wget generates an HTTP request, how does the server know that it was a command and not a browser request?
You need to use -O - to write to STDOUT:
wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
But it looks like xe.com does not want you to do automated downloads. I would suggest not doing automated downloads at xe.com
That's because wget is sending certain headers that make it easy to detect.
# wget --debug cnet.com | less
[...]
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: www.cnet.com
Connection: Keep-Alive
[...]
Notice the
User-Agent: Wget/1.13.4
I think that if you change that to
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
it would work.
# wget --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
That seems to be working fine from here. :D
Did you visit the link in the response?
From http://www.xe.com/errors/noautoextract.htm:
We do offer a number of licensing options which allow you to
incorporate XE.com currency functionality into your software,
websites, and services. For more information, contact us at:
XE.com Licensing
+1 416 214-5606
licensing@xe.com
You will appreciate that the time, effort and expense we put into
creating and maintaining our site is considerable. Our services and
data is proprietary, and the result of many years of hard work.
Unauthorized use of our services, even as a result of a simple mistake
or failure to read the terms of use, is unacceptable.
This sounds like there is an API that you could use but you will have to pay for it. Needless to say, you should respect these terms, not try to get around them.