I was following this past question (Extracting image src based on attribute with BeautifulSoup) to try to extract all the images from a Google Images page. I was getting a "urllib2.HTTPError: HTTP Error 403: Forbidden" error but was able to get past it using this:
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
However, I then got a new error that seems to be telling me that the src attribute does not exist:
Traceback (most recent call last):
File "Desktop/webscrapev2.py", line 13, in <module>
print(tag['src'])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 958, in __getitem__
return self.attrs[key]
KeyError: 'src'
I was able to get past that error by checking specifically for the 'src' attribute, but most of my images, when extracted, don't have the src attribute. It seems like Google is doing something to obscure my ability to extract even a few images (I know requests are limited, but I thought it was at least 10).
For example printing out the variable tag (see code below) gives me this:
<img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f" jsaction="load:str.tbn" name="jxlMHbZd-duNgM:" onload="google.aft&&google.aft(this)"/>
But printing out the variable v gives None. I have no idea why this is happening, nor how to get the actual image from what it's returning. Does anyone know how to get the actual images? I'm especially worried since the data-src URL starts with encrypted... Should I query data-src to get the image instead of src? Any assistance or advice would be super appreciated!
Here is my full code (in Python):
from bs4 import BeautifulSoup
import urllib2

url = "https://www.google.com/search?q=baseball+pitcher&espv=2&biw=980&bih=627&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5h8-9lfjLAhUE7mMKHdgKD0YQ_AUIBigB"
#'http://www.imdb.com/title/tt%s/' % (id,)
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
soup = BeautifulSoup(urllib2.urlopen(req).read(), "lxml")

print "before FOR"
for tag in soup.findAll('img'):
    print "inside FOR"
    v = tag.get('src', tag.get('dfr-src'))  # gets "src", else "dfr-src"; if both are missing - None
    print v
    print tag
    if v is None:
        print("v is NONE")
        continue
    print(tag['src'])
Oh, boy. You picked the wrong site to scrape from. :)
Google's Defenses
First off, Google is (obviously) Google. It knows web crawlers and web scrapers very well - its entire business is founded on them.
So it knows all of the tricks ordinary people get up to and, more importantly, has a strong mandate to make sure nobody except end users gets their hands on its images.
Didn't pass a User-Agent header? Now Google knows you're a scraper bot that doesn't bother pretending to be a browser, and forbids you from accessing its content. That's why you got a 403: Forbidden error the first time - the server realised you were a bot and prevented you from accessing material. It's the simplest technique to block automated bots.
Google Builds Pages through Javascript
Don't have Javascript parsing capability (which Python's requests, urllib and their ilk don't)? Then you can't view half your images, because of the way Google Image search results work: if you inspect the Network tab in your Chrome console as Google Images loads, you'll see a few bundled requests made to various content providers, and inline obfuscated Javascript code then systematically adds a src attribute to each placeholder img tag.
When the page first loads, all of your images are essentially blank, with just a custom data-src attribute to coordinate activities. Requests are made to image source providers as soon as the browser begins to parse Javascript (because Google probably makes use of its own CDN, these images are transferred to your computer very quickly), and then page Javascript does the arduous task of chunking the received data, identifying which img placeholder it should go to, and updating src appropriately. These are all time-intensive operations, and I won't even pretend to know how Google makes them happen so fast (although note that messing with network throttling in Dev Tools on Chrome 48 can cause Google Images to hang, for some bizarre reason, so there's probably some major network-level code-fu going on over there).
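That said, the data-src attributes are present in the static HTML even before any Javascript runs, so you can at least collect the thumbnail URLs. A minimal sketch, assuming placeholder tags shaped like the one quoted in the question (Google's markup, so it may change at any time):

from bs4 import BeautifulSoup

# Stand-in for the HTML served to a non-Javascript client; the data-src value
# is shortened for the example.
html = '<img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59X" data-sz="f"/>'

soup = BeautifulSoup(html, "lxml")
for tag in soup.findAll('img'):
    thumbnail = tag.get('data-src')  # src is absent until Javascript populates it
    if thumbnail is not None:
        print(thumbnail)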
These image source providers appear to begin with https://encrypted..., which doesn't seem to be something to worry about - it probably just means that Google applies a custom encryption scheme on the data as it's being sent over the network on top of HTTPS, which is then decoded by the browser. Google practises end-to-end encryption beyond just HTTPS - I believe every layer of the stack works only with encrypted data, with encryption and decryption only at the entry and final points - and I wouldn't be surprised to see the same technology behind, for example, Google Accounts.
(Note: all the above comes from poking around in Chrome Dev Tools for a bit and spending time with de-obfuscators. I am not affiliated with Google, and my understanding is most likely incomplete or even woefully wrong.)
Without a bundled Javascript interpreter, it is safe to say that Google Images is effectively a blank wall.
Google's Final Dirty Trick
But now say you use a scraper that is capable of parsing and executing Javascript to update the page HTML - something like a headless browser (here's a list of such browsers). Can you still expect to be able to get all the images just by visiting the src?
Not exactly. Google Images embeds images directly in its result pages.
In other words, it does not link out to the image files; it copies each image in textual form, literally writing the image into the page in base64 encoding. This significantly reduces the number of connections needed and improves page loading time.
You can see this for yourself if you navigate to Google Images, right click on any image, and hit Inspect element. Here's a typical HTML tag for an image on Google Images:
<img data-sz="f" name="m4qsOrXytYY2xM:" class="rg_i" alt="Image result for google images" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAPEA4NEBAPEBETDxAQFhEREA8QDxAPFhEWFhURFRgYKCgiGBsnHRcVIzIiJykrLi4vGB8zODMsNygtLisBCgoKDg0OGxAQGy0mHyUrLS4rLy8tNy8tNy0tKy4rLS0tKystLTArLS0tLS0tLy0tLS0tLS0tLS0tKy0tLS0tLf/AABEIAOEA4QMBEQACEQEDEQH/xAAcAAEAAwADAQEAAAAAAAAAAAAABQYHAQMECAL/xAA8EAACAgEBBgMFBQUIAwAAAAAAAQIDBBEFBhITITFBYXEHUYGRsRQiMlKhQmJyc7MVIyUzQ1OSwTXC8P/EABsBAQACAwEBAAAAAAAAAAAAAAAEBQIDBgEH/8QAMxEBAAIBAgQDBQgCAwEAAAAAAAECAwQRBRIhMRNBgRRRYXGRIjIzQqGxwdE08AZS4SP/2gAMAwEAAhEDEQA/ANxAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHDYFR2zvzXXZ9mxKpZd76KMPwa+vivPt5lnh4ZM08TNblr+v0Rb6nry443l1Qo29euKVuJiJ/sKHMmvXv9T2cmgp0rS1vjM7HJnt3tEejpyMPeClccMnGydP2OGNcn6cS0/VGdMvD7ztak1+U7vJpnjtbdG4vtKtpm6c3GcJx/FFJ12LzSl0l80iVbg2PLXn0994ao1d6zteF62Dt3Hz6udjz44p8Mk04yhPTXhkn2fVFLqNPk09+TJHVNpkreN4SRoZgAAAAAAAAAAAAR+2MGd0OGORbQkm3y1DWXlq1qvga8lOaO+zVlpz17zHyZTsnOuyczDxZ3WqFljU+GycXKMYSlpqn014Sq09rXyREzLn9Fa2TNFbTO3zbJCKSSXZLT4Fw6V+j0AAAAAAAAM69pG9EoN4NDevSM+HvKcu1a+a19UveX/CdDXbx8nby/tX6rNMz4dVi3L3ajg0JySlkWJSts8eJ/sL3RRWa7WW1OSZ8o7R8ErDijHXaFjRDbnAFV353T/tNY0Fy6+CzWdzTdyq061w09769e2iJ2i1ttLNpjrvHSPL5y05cUZNt09sfZdOHTDHogoQiu3i34yk/Fv3kTJlvktNrzvLbWsVjaHtMHoAAAAAHGp5zQOT0AAAAB+bfwy9H9Dyezy3aWJ7pP/E9n/zLP6Mym0n4sOa4b/kfVtpdOmcgAAAAAAAdWVcq4TsfaEJTfpFa/wDR7WvNaKx5vJnaN2IbuSeXtXDdnXivldLXxkoys+qR2eujwNFatfdEKfT/AG80TLdDi1yAAAACC3n3ihhQ6Ljta1jHwXgm9PPsvE05cvJ0jun6HQzqJmZnasd5R+Lu9k5UVbnZV8XJa8imSrjWn+zJru//ALqeRjm3W8tl9bjxTy6ekbR5zG8yid4tg5Wz4Sy8PJyJwh96yuc+KcYeM4vtJLxTXbX0Mb4pr1rKbo9bh1FvC1FK9e0xGz37lb6LLksW/hV3C3Ca6RuS7rTwkl18+vuM8WXm6S08T4TOnjxMf3fP4f8Ai422KKcn0SWplkyVx1m1u0KatZtO0INZFuXOVdcuVXHTikustX2ivP8AReZU4r5ddaZ3mtI93eUy9a4IjpvZ3T2BHT7ttyl+ZyjJa+a0+mhvycKwWjpvE+/eWFdXeJ67TCMp2ndjWSqs+9w94t6qUX2lFspfbNTw/N4eSeav8fBOnT49RTnr0WfHujZGM4vWLWqOpxZa5aRevaVTek0tNZ7qrtjea2eTHZ2AoO5uSlbP70KlH8TS8dPr0LfFoq0w+Pn328ojzQrZptfkx+sunae7m0+W7Ktp222pa8uUY1Vzf5YuGnD8dTLDrNNzbXxREfWf1L4cm28XndSNkbeyMnIhi35uXiOcuXxqTlw3a6KE0301fTyZdanS48WLxcWOttuvoh472m3La0w1rZmzvs1Lr5t170bdl1jnOT08+y8kcnmyeJMztEfKNlnFeWuzH9z3/imz/wCZZ/RmUek/Fc9w3/I+rbcm+NcJWTekYrVvyLiZiI3l09KWvaK17yp+LkZO1bLOC2eNi1y4G69FbZLvw8Xppr4dSPWbZevaFvmpi0MRXli15jfr2j0fjbe6E6a534mRkcyEXNwnZxcxJatJrRxl+h7fDtG9ZZaXicWvFM9azWfh2ebcHfZ5Nv2G+XFNxcq7HopS4Vq65e96atP3JjDeZ6WbOLcMphjxcXbzj+WgkhQAAAB49s1ueNkwXeVFsV6uDSNmGYrkrM++P3Y3jeswwjc7NVWfg3SeiVsYvyU4uH/sdxxLH4mlvEe7f6dVNp7cuSJfQZwa7AAADgDIMzaP2nadLl1i8+EdPDgjYoxXySK+J5su8+92XgeDw+Yr35f36tfRYONcTimnFrVNNNPs0+6BE7dXz1tRS2dm2KDaeNlOUPfwKXFFf8Wl8SFMct3f47RqdHHN+av+/q2zbuV/d1adp/f+CSf/AGV3HM01x1pHnP7OP0WPe8z7nG6P+RJ+Lusb9eiX6aErg+3ssfOf3Ya78aflCcLRDVve/H/ybl3UnW/OLTa+j+Zzv/IMUTjrk907LThl/tTT1eTd3PksbOUesqoynH1cJNL5x/Uk/wDHJ56clu3NH6sOLU5bRaPOFL9l+antFcT1dmNZFN93LWMv1SkfRuN4dtNHL2iXMaK3/wBOvm2I5JbMJ9qOKqNo5Dh05ka7lp35klo2vNyjr8Ts+EZebRxzeW8eip1NdsvRt9bk6k5fi5a1/i4epxl9t52Wk/dYnuZ/5TZ/8yz+jMp9J+K53h3+Q0b2j5jrx6oL/Ut0fmoxb0+enyJ+qnauzvuB4YvnmZ8oR25u2eRs+Crx8jIsdt7caq24682WnFN9F00PcNtscbQy4ngi+stzWisdO/y9z2X7P2ptBON1scCiS0ddOk75RfhKfh8DPlvPdHjPpdP+FXnt77dvSP7SO7+5mDgNTppTtX+tY3Ozto9G/wAPw0M4rEImfWZs/wB+3p5LEZIwAAAcMD593z2RLBzbqNGoSk7apeDrk9Uk/fF6r4L3ndcO1MajBEz3jpP+/FT
Z8U0u17cTeOOfjR4mufWlC2Pi34WLylpr66rwOV4jo502WY/LPWP69Fjp80ZK/HzWUr0hGbc2ZZlQVccq/Gj14uTwKc17uJptfDQ3YcsY7c3LE/NhenNG2+zN9+thLZVFN1WRlWynkKpq2zVcPKnLVaaddYov+G6mdXlml6xtEb9I+KDqMMY67xM/V6fY5DnzzMuyU5TrlCuGs5OMIyi3LRdte3U08cnkmuOsbRtu2aOOkzKrbx1zw8++HVSryOfXr04oufMg15eHwZyNq8l930rR5K6rSRHvrtP02bls7NhkVVZFb1hZCM16Na6epOid43cNlxWxXmlu8Ts9J61sG3wxnmbauxavvO3Jrr6eCjXBTl8OGXyI0xvd2OHN7Pw2s277T+szs1feulwpqsj2qai/KDSWvzSK3jmnnJhi8flc9w/JHiTWfN17n5K/vate75kfPolL6I0cC1EctsM9+8NnEsUxMX9FmOiVaC3wtUcda/nT+SZScdtHs8V98wsOGVmc3T3PDuBivkW3yX+dZ91PxritE/i3I38JwTiwbz59TiWWL5eWPLoou8W5+Zs7KWXhQnbTGzm1uuPHZS9deXKC6uPhqvDud9puJ4NRh8LUTtO23w+bncmmvS3NjXDE9oVXKUsjHyKLeHV1yhwptd3Fy06fAq54Teb7YrRavv3/AHbva4iPtxMSitg7AntbL/trLio08UZUU9+OMPwSl+7r182btVqo0uL2TDPX80/Ge+xixzkt4t/Ro9v4Zfwv6FFKXbtLDdzH/imz/wCZZ/RmVGk/Fc5w7/IaJ7UcOU8HnRTbotjY9P8AbacZP4ap/Ass9earvOCZ4x6nafzRsgPZXt2MJ2YU5aK2XMrbfTmaJSh8Uk16M16e232ZWPHtHNojPXy6T/E/w08lOWcgAAAAAAgd792KdpU8qz7lkW5V2payrk/rF9NV5eRL0esvpr81e3nDVlxRkjaWO34e0tiXq2UJw4W0r6050Tj4pv3P8stH9TqK6nS67HyWn0nurpxZMVuaF92F7VcS2KWSpUz06yinOp+ei+8vTR+pT6jgeas74p5o/VKprKz96NliW++zNOL7bQl5yafya1IM8N1UTtyS3e0Y/wDsoXtM3nx9o1UYuDzcqyGQrG66bHBJVzjp21b1kvDTuWvC8FtJecmaYrG23fr3R894yRtTqmvY3snJxqct5NFlPMtrlFTSTlFQab07r4kTi+ox5slZxzvtDdp6TSvVOb7bn17ShGSlysitPgt01TX5Jrxj9ClvTmW2h199Lbp1ie8Klu5lbW2PxY1+DbkY/E2nQ1Zwt9XKDXg34NIwrzU6bLTV20eu+3F+W/xj90/lb05+RB14OzcmFkunNylCquv97TX7xlzWntCBXS6bHO+XLEx7q+bt3H3KjgOeVfNXZdmvFZ14YJvWUY6+9934ntaRDXrtfbUztHSsdoW62uM4yhJKUZJpp9mn3RlasWjaUGJmJ3hS8nd/KxLObif3taeqhxJWw8uvSS/UoM/CsmPJ4unnrC2prseWnJmj1S9O37tNJYWTx+5QfDr6+BNpqtVttbDO/wAJjZEtp8O+9ckbPJfsfIz7IzylyaI9qVJOcl7m10S/X0NddFkz5Iy6nbp2rDZ7VTDSaYe895n+FoqrUYqMUoxikkl0SS7Itlf3eTaO1sfGWt91VX8c0m/Rd2bsWDJlnalZlhbJWv3pZJ7Tds4uZbRfi3Suca5VTjGq5RUdeJSUmkn3afwOm4RTJpq2pmiI3neOsK/VTGSYmnX0SPs/39hTXVhZLSqjpCu7/bj4Rs/d/e8PHp1NXE+FTeZzYu/nH9f090+p5Y5L/Vp+dZNVWSqhzZ8EnGClGPHLTotX0XqcxbfZYW3mvRj+x92NsY2VRlrChJ1SclHn0rXWLi1rr06Nlfj016WiypwaHLivzx3atsq2++qf2vGjQ5ax5XNjfxQa0fFokuvu6k+szMfahbY7X7z0ll283s+y8WyVuDF5FDlxKtS0vp8dFr+JLwa6/U03w+cOq0fGqTXkz/8AkvXsnfTbFaVNmz772uic6LoT+MktGe1m8eTTn0nDrzzVycv6rnsC3aeRKN2XCrDpXVUQ+/dZL9+XaMfJdTZXmnuqtR7NSOXDvaffPb0hZjNCAAAAAA4lFNNNJp+D6pgQ+Vups+18VmHjSfv5UE/0N9dTmr2tP1YTSs94cY+6ez63rDCxk/5UX9RbU5rd7z9SMdY7QlaMeFa0hCEF7oxUV+hpmZnuzdp4AAAAAAAAACsb+7xPBoioPS21yjF/ljFaykvPrFfEseGaP2nL17R3RtTm8OvTvKB3A3bqyqltTMXPstlNwjZrKMIRk46tPvJtPuSeJ6u1LzgxfZrHTp0YabDHLz26zLQK6IRWkYQivcopIpZmZlMZb7Tty6qa57SxoqtKUedVHpBqclHmQXg9WtV266+vR8H4jebxgv190oGqwREc0LT7K82d2zKHNtuErKU33cIS0j8l0+BXcWx1pqrRXz6/VI00zOON1uK1vAAAAAAAAAAAAAAAAAAAAAAAAAAApHtU3fty8eq6iLnbjynLlr8U6ppcaj75fdi9PJltwjWV0+WYv2si6rFOSu8d4Vf2db+VYtf2HK4oQjOXBPRt18T1lXOPfu29fPQseJ8Mtnt42Hrv3j+YaMGo8OOS7RY72bPceL7bjaedsE/k+pRTodTE7eHP0TIz4/8AtCo727e/taD2VsyLv5k4c3I0aoqhGSl+J9+qXw9ek/S4fYrePn6THavnM/w05L+L9inrK7bubIhg41OJB6quOjk+85t6yk/VtlXnzWzZJvbvKTSsVjaEkamQAAAAAAAAAAAAAAAAAAAAAAAAAADQCI2puzg5T478WmyX53BKf/JaM349TmxfctMerCcdbd4eGrcHZUXqsOp/xOc18pPQ224hqZjabz9WMYcceUJ/FxK6oqFUIVxX7MIqMfkiJMzad5lt2dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=" style="width: 167px; height: 167px; margin-left: -7px; margin-right: -6px; margin-top: 0px;">
Note the massive wall of text buried in src. That is quite literally the image itself, written in base64. When we see an image on our screen, we are actually viewing the result of this very text being parsed and rendered by a suitable graphics engine. Modern browsers support decoding and rendering of base64-encoded data URIs, so it's no surprise that you can literally copy-paste the relevant text into your address bar, hit Enter and view the image at once.
To get the image back, you can decode this wall of text (after suitably parsing it to remove the data:image/jpeg;base64, prefix) using the base64 module in Python:

import base64

base64_string = ... # that monster you saw above
decoded_string = base64.b64decode(base64_string)
You must also make sure to parse the image type from the start of the src attribute, write decoded_string to a file, and save that file with the extension you extracted from the data URI. Phew.
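Putting it together, here's a minimal sketch of that round trip; the tiny 1x1 GIF data URI below stands in for the huge jpeg shown above, so the snippet runs as-is:

import base64

src = "data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

header, _, payload = src.partition(",")         # header == "data:image/gif;base64"
extension = header.split("/")[1].split(";")[0]  # "gif"

# Decode the payload and save it with the extension taken from the data URI.
with open("decoded_image." + extension, "wb") as f:
    f.write(base64.b64decode(payload))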
tl;dr
Don't go after Google Images as your first major scraping project. It's:

- hard. Wikipedia is much easier to get ahold of.
- in violation of their Terms of Service (although what scraping isn't? and note I am not a lawyer and this doesn't constitute legal advice), where they explicitly say:

  Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.

- really impossible to predict how to improve on. I wouldn't be surprised if Google was using additional authentication mechanisms even after spoofing a human browser as much as possible (for instance, a custom HTTP header), and no one except an anonymous rebellious Google employee eager to reduce his/her master to rubble (unlikely) could help you out then.
- significantly easier to use Google's provided Custom Search API, which lets you simply ask Google for a set of images programmatically without the hassle of scraping (a minimal sketch follows below). This API is rate-limited to about a hundred requests a day, which is more than enough for a hobby project. Here are some instructions on how to use it for images. As a rule, use an API before considering scraping.
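For illustration, a sketch of the Custom Search route using the Custom Search JSON API; the key and search engine ID below are placeholders you create in the Google developer console:

import requests

params = {
    "key": "YOUR_API_KEY",          # placeholder: your API key
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder: your custom search engine ID
    "q": "baseball pitcher",
    "searchType": "image",          # restrict results to images
}

response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in response.json().get("items", []):
    print(item["link"])  # direct URL of each image result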
To scrape Google Images using the requests and beautifulsoup libraries, you need to parse data out of the page source code, from inside <script> tags, using regular expressions.
If you only need to parse thumbnail-size images, you can do it by passing a content-type query parameter (solution found by MendelG) in the HTTP request:
import requests
from bs4 import BeautifulSoup

params = {
    "q": "batman wallpaper",
    "tbm": "isch",
    "content-type": "image/png",
}

html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')

for img in soup.select("img"):
    print(img["src"])
To scrape the full-resolution image URLs with requests and beautifulsoup, you need to extract data from the page source code via regex.
Find all <script> tags:
soup.select('script')
Match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
Match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
matched_images_data_json)
Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
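Why decode twice: in the page source the URLs carry doubly-escaped sequences such as \\u003d, and each unicode-escape pass peels off one layer. A small illustration with a made-up URL:

# raw contains the characters ...a\\u003d1, as found in the <script> text.
raw = "https://example.com/image?a\\\\u003d1"
once = bytes(raw, 'ascii').decode('unicode-escape')    # ...a\u003d1
twice = bytes(once, 'ascii').decode('unicode-escape')  # ...a=1
print(twice)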
If you need to save them, you can do it via urllib.request.urlretrieve(url, filename) (more in-depth):
import urllib.request
# often it will throw a 404 error; to avoid it we need to pass a user-agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg')  # you can skip the folder path and it will save to the current working directory
Code and full example in the online IDE:
import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():

    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # this step could be refactored into a more compact version
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decoding, Unicode characters are still present; after the second pass they are decoded.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')

get_images_data()
-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
Alternatively, you can achieve the same thing using the Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the source code of the page; instead, you only iterate over structured JSON and grab what you want.
Code to integrate:
import os, urllib.request, json  # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()
---------------
'''
[
...
{
"position": 100, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
"source": "pexels.com",
"title": "Close-up of Cat · Free Stock Photo",
"link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
"original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
]
'''
P.S. - I wrote a more in-depth blog post about how to scrape Google Images, and how to reduce the chance of being blocked while web scraping search engines.
Disclaimer: I work for SerpApi.
The best way to solve this problem is by using a headless browser, such as Chrome driven through ChromeDriver, together with a user-simulation library like Selenium for Python. Beautiful Soup alone isn't adequate, because it never executes the page's Javascript.
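A minimal sketch, assuming Selenium 4+ and a matching ChromeDriver on PATH; Google's markup changes often, so treat the selector as illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # plain "--headless" on older Chrome versions

driver = webdriver.Chrome(options=options)
driver.get("https://www.google.com/search?q=baseball+pitcher&tbm=isch")

# Once the page's Javascript has run, the placeholder img tags carry real src values.
for img in driver.find_elements(By.CSS_SELECTOR, "img"):
    src = img.get_attribute("src")
    if src:
        print(src)

driver.quit()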
I'm looking through my GA logs and I see a Google Chrome browser version of 0.A.B.C. Could anybody tell me what this is, exactly? Some kind of spider or bot, or a modified HTTP header?
The full user agent string probably looks something like this:
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
This is most likely a bot, but it could just be someone running an automated script using CasperJS or PhantomJS (or even a shell script using something like lynx) and spoofing the user agent.
The reason it looks like that instead of something that says "My automated test runner v1.0" (or whatever is relevant to the author) is that this user agent string will pass most regular expression checks as "some version of Chrome" and not get filtered out properly by most bot checks that rely on a regular expression to match 'valid' user agent patterns.
In order to avoid it, your site's bot checker would need to blacklist this string, or validate all parts of the Chrome version to make sure they're valid numbers. Even then, you can only do so much checking, since the user agent string is so easy to spoof.
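For instance, the stricter check described above might require every dot-separated version component to be numeric; a sketch, not a production bot filter:

import re

def looks_like_real_chrome(user_agent):
    match = re.search(r"Chrome/(\S+)", user_agent)
    if not match:
        return False
    # Require every dot-separated component of the version to be a plain number.
    return all(part.isdigit() for part in match.group(1).split("."))

print(looks_like_real_chrome(
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/31.0.1650.48 Safari/537.36"))                # True
print(looks_like_real_chrome(
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 "
    "(KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13")) # False - A, B and C aren't numbers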
This seems to happen only in Chrome (latest version 31.0.1650.48 m, but also earlier ones), but since it doesn't always happen, it's hard to say for sure.
When streaming audio stored in Azure Blob storage, Chrome will occasionally play about 30-50% of the track and then stop. It's hard to reproduce, but if I clear the cache and play the file over and over again, it eventually happens. An example file can be found here.
The error is pretty much the same as what's described here, but I've yet to see the problem on any files hosted elsewhere.
Update:
The Azure Blob log only gives AnonymousSuccess messages, no error messages. This is what I get:
1.0;2013-11-14T12:10:10.6629155Z;GetBlob;AnonymousSuccess;200;3002;269;anonymous;;p3urort;blob;"http://p3urort.blob.core.windows.net/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";"/p3urort/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";c377a003-ea6b-4982-9335-15ebfd3cf1b1;0;160.67.18.120:54132;2009-09-19;419;0;318;7663732;0;;;"0x8D09A26E7479CEB";Friday, 18-Oct-13 14:38:53 GMT;;"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.48 Safari/537.36";"http://***.azurewebsites.net/";
Apparently you have to set the content type to audio/mpeg3.
Here's how I do it:
// Upload the audio first, then set the content type so browsers treat the blob as an mp3 stream
CloudBlockBlob blockBlob = container.GetBlockBlobReference(fileName);
blockBlob.UploadFromStream(theStream);
theStream.Close();
blockBlob.Properties.ContentType = "audio/mpeg3";
blockBlob.SetProperties();
From here: https://social.msdn.microsoft.com/Forums/azure/en-US/0139d27a-0325-4be1-ae8d-fbbaf1710629/unable-to-load-audio-in-html5-audio-tag-from-storage?forum=windowsazuredevelopment
[edit] - This didn't actually work for me. I'm trying to troubleshoot, but I don't know what's wrong; going to ask a new question.
This mp3 only plays for 1.5 min and then stops. When downloaded, the file plays fully...
https://orator.blob.core.windows.net/mycontainer/zenhabits.net.unsolved.mp3
Is it possible for a Chrome extension to listen for streaming audio from any of the browser's tabs? I would like to capture the streaming audio data and then analyse it.
Thanks
You could try three ways; none of them provides a 100% guarantee of meeting your needs.
Before going into more detailed descriptions, I must note that Chrome extensions do not provide convenient tools for working at the per-connection level - the sufficiently low level required for stream capturing. This is by design. This is why the first way is:
To look at other browsers, for example Firefox, which provides low-level APIs for connections. They are already known to be used by similar extensions; you may have a look at MediaStealer. If you do not have a specific requirement to build your system on Chrome, you should possibly move to Firefox.
You can develop a Chrome extension which intercepts HTTP requests by means of the webRequest API, analyses their headers and extracts media URLs (such as those with an audio/mpeg MIME type in the HTTP headers). For a quick code example, you may look at the following SO question - How to change response header in Chrome. Having the URL, you may force the media to download as a file. It will land in the default downloads folder and may have an unfriendly name. (I made such an extension, but I did not have requirements for further processing.) If you need to process such files further, it can be a challenge to monitor them in the folder and run additional analysis in a separate program.
You may have a look at NPAPI plugins in general, and their streaming APIs in particular. I can imagine that you create a plugin registered for, again, the audio/mpeg MIME type, which receives the data via the NPP_NewStream, NPP_WriteReady and NPP_Write methods. The plugin can be wrapped into a Chrome extension. Though I have made NPAPI plugins, I never used this API, and I'm not sure it will work as expected. Nevertheless, I'm mentioning this possibility here for completeness. This method requires some coding other than web coding, meaning C/C++. NB: NPAPI plugins are deprecated and have not been supported in Chrome since September 2015.
Taking into account that you have some "fingerprinting service" external to the extension in mind, which sounds like intelligent data processing, you may be interested in building the whole system outside a browser. For example, you could involve an HTTP proxy that saves media from the passing traffic.
If you're writing a Chrome extension, you can use the Chrome tabCapture API to record audio.
chrome.tabCapture.capture({audio: true}, function(stream) {
var recorder = new MediaRecorder(stream);
[...]
});
The rest is left as an exercise to the reader; MDN has more documentation on how to use MediaRecorder.
When this question was asked in 2013, neither chrome.tabCapture nor MediaRecorder existed.
Mac OSX solution using soundflower: http://rogueamoeba.com/freebies/soundflower/
After installing soundflower it should appear as a separate audio device in the sound preferences (apple > system preferences > sound). Divert the computer's audio to the 2ch option (stereo, 16ch is surround), then inside a DAW, such as 'audacity', set the audio input as soundflower. Now the sound should be channeled to your DAW ready for recording.
Note: having diverted the audio from the internal speakers to soundflower you will only be able to hear the audio if the 'soundflowerbed' app is actually open. You know it's open if there's a 8 legged blob in the top right task bar. Clicking this icon gives you the sound flower options.
My privoxy has the following log:
2013-08-28 18:25:27.953 00002f44 Request: api.audioaddict.com/v1/di/listener_sessions.jsonp?_method=POST&callback=_AudioAddict_WP_ListenerSession_create&listener_session%5Bid%5D=null&listener_session%5Bis_premium%5D=false&listener_session%5Bmember_id%5D=null&listener_session%5Bdevice_id%5D=6&listener_session%5Bchannel_id%5D=178&listener_session%5Bstream_set_key%5D=webplayer&_=1377699927926
2013-08-28 18:25:27.969 0000268c Request: api.audioaddict.com/v1/ping.jsonp?callback=_AudioAddict_WP_Ping__ping&_=1377699927928
2013-08-28 18:25:27.985 00002d48 Request: api.audioaddict.com/v1/di/track_history/channel/178.jsonp?callback=_AudioAddict_TrackHistory_Channel&_=1377699927942
2013-08-28 18:25:54.080 00003360 Request: pub7.di.fm/di_progressivepsy_aac?type=.flv
So I got the stream URL and recorded it:
D:\Profiles\user\temp>wget pub7.di.fm/di_progressivepsy_aac?type=.flv
--18:26:32-- http://pub7.di.fm/di_progressivepsy_aac?type=.flv
=> `di_progressivepsy_aac#type=.flv'
Resolving pub7.di.fm... done.
Connecting to pub7.di.fm[67.221.255.50]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [video/x-flv]
[ <=> ] 1,234,151 8.96K/s
I got a file that can be played in any multimedia player.
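The same capture can be done from Python instead of wget. A sketch using requests; the URL is the one from the log above and may no longer be live, and since it is a live stream, the loop caps how much it saves:

import requests

url = "http://pub7.di.fm/di_progressivepsy_aac?type=.flv"

with requests.get(url, stream=True, timeout=10) as response:
    with open("di_progressivepsy_aac.flv", "wb") as f:
        written = 0
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            written += len(chunk)
            if written >= 5 * 1024 * 1024:  # a live stream never ends; stop after ~5 MB
                break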
I came across this code change in Chromium. It says Chromium now supports both handshake versions, which the code seems to confirm. By the second version I mean the one described at Wikipedia (draft-ietf-hybi-thewebsocketprotocol-06).
However, when I connect to my server, the only thing I obtain is the old version, i.e. including these headers:
Sec-WebSocket-Key1: 4 #1 46546xW%0l 1 5
Sec-WebSocket-Key2: 12998 5 Y3 1 .P00
but not the new version which would be a request containing:
Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==
What am I missing here? I downloaded the latest nightly build and it has been included more than two weeks ago, so that cannot be the cause I guess.
How can I make a WebSocket send the new handshake version?
The code link you posted is for the server side of the handshake (there are a few places this will likely be used in Chrome, such as remote debugging and as a proxy for extensions).
If you really want to use the new HyBi-07 protocol version, you can try using this branch of web-socket-js that I made. Once Chrome switches to the new protocol, web-socket-js will switch by default also. In order to make web-socket-js work in a browser that already has WebSockets support, you will need to make some minor tweaks to it to use a different object name instead of WebSocket.
I expect Chrome/WebKit will add the new protocol before long. Note that the API changes to add binary support have only recently been decided, so in Chrome the new protocol may be added before the API fully supports the new functionality it enables.
The only browser I know of that implements the 07 protocol is this build of FF4:
http://www.ducksong.com/misc/websockets-builds/ws-07/
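For reference, the difference between the two handshakes is easiest to see on the server side. In the HyBi drafts (and the final RFC 6455), the server derives Sec-WebSocket-Accept from the single Sec-WebSocket-Key header with a fixed GUID, SHA-1 and base64, instead of the old Key1/Key2 number juggling. A sketch using the key from the question:

import base64
import hashlib

# Fixed GUID defined by the HyBi drafts / RFC 6455.
GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key):
    digest = hashlib.sha1((sec_websocket_key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(websocket_accept("x3JJHMbDL1EzLkh9GBhXDw=="))  # HSmrc0sMlYUkAGmm5OPpG2HaGWk=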