Issue scraping data from stats.nba.com - json

I've been having a bit of trouble scraping data from the stats.nba site. I've done this a few times before, so I'm not sure what's changed, but I wanted to see if anyone else is having the same problem.
I usually just use jsonlite with the request URL, like so:
fromJSON("http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
R just seems to get stuck running the code. Interestingly, I can still scrape from the NBA's D-League website without any problem:
fromJSON("http://stats.nbadleague.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=20&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
Anyone else having this issue?

Try this:
library(httr)
library(rjson)

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="

# stats.nba.com appears to hang for requests without a browser User-Agent,
# so send one explicitly with the request
agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
data = GET(url, user_agent(agent))
fromJSON(content(data, type="text"))

I messed around with this for HOURS. The best guess I can venture is that it has something to do with the "/error" redirect on the nba stats URL, which does not occur on the D-League URL.
The code that worked for me involves reading the JSON as text first using readLines() and then passing the result into fromJSON():
library(jsonlite)
jsonTxt <- readLines("https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=01/01/2017&DateTo=09/30/2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
json <- fromJSON(txt = jsonTxt)
colnames(json$resultSets)

Related

Web scraping - page is not loading after 5-6 requests

I'm trying to scrape a specific website's subpages using requests and bs4. The pages are stored in a list that I loop over. The script works fine with other websites, so I think the problem is with this page itself. I also can't access the page with my browser(s), or only for a limited time (a few seconds). I've tried all of my browsers (Chrome, Firefox, Edge, Internet Explorer), removed every cookie and other browsing data, etc.
I'm using headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
}
and here is the code to request the page:
cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
where "cz_page" is the item in the list that holds the pages I want to parse.
After 5 or 6 pages are loaded the next page won't load.
I've tried "https://downforeveryoneorjustme.com/" to check whether the page is up, and it is: "it's just me."
Is there any way I can access the pages through Python requests even though I'm not able to load the site in my browser(s)?
My next try will be to run the script with a VPN on, but I'm curious whether there is another solution; I can't use a VPN every time I need to run this script.
Thank you!
The solution was to add a delay, but one bigger than 5 seconds. I experimented with it, and it seems that after 5 pages are loaded I get blocked and have to wait at least 10 minutes before trying again.
So I added a counter inside the loop, and after it hits 5 I call time.sleep() for 10 minutes and reset the counter.
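In rough form, the loop now looks something like this (a minimal sketch; cz_pages is a placeholder name for the list of subpage URLs, and headers is the dict shown above):

import time
import requests

counter = 0
for cz_page in cz_pages:  # cz_pages: placeholder for the list of subpage URLs
    cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
    # ... parse cz_link with bs4 here ...
    counter += 1
    if counter == 5:
        time.sleep(600)  # back off for 10 minutes after every 5 requests
        counter = 0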
It is slow, but it works.
Thanks for the suggestions though!

Using beautiful soup to get all mp3 files from a website -- recursively

It's been difficult to find answers to this question here; I've been searching for over an hour now. Many come close, and I've tried bits and pieces from some of them, but the solution still evades me. (Until now; see update.)
I'm trying to pull all MP3 files from https://www.crrow777radio.com/free-episodes/, but they are deeply nested. I thought I could provide that URL and bs would recursively follow all links, and I could filter them to get the specific files I want to download. Apparently each href found must instead be requested and parsed for the links on that page.
I have code that will pull the MP3 from the page that contains it, but doing that for all such pages (from the top down, recursively) is not as easy as the bs docs led me to believe.
Update: With the help of MendelG and others elsewhere I have revised the code. I believe this will do the job, but getting the [large] file content into a variable may need improvement, e.g. a streaming download-and-write scheme to reduce memory impact:
import re
import requests
from bs4 import BeautifulSoup

URL = "https://www.crrow777radio.com/free-episodes/"  # the listing page from the question

def getMP3sOnPageP(session, h, p):
    soup = BeautifulSoup(session.get(p, headers=h).content, "html.parser")
    # Select all the buttons on this page with the text `LISTEN`
    for tag in soup.select("a.button"):
        # Extract the link from the button, in order to request that page
        page = tag["href"]
        soup = BeautifulSoup(session.get(page, headers=h).content, "html.parser")
        # Find the link to the mp3 file
        download_link = soup.select_one("a.btn[download]")["href"]
        file_name = re.search(r'(\d+-Hour-1.mp3)', download_link.split("/")[-1]).group()
        # Request the mp3 file
        print("Downloading ", file_name)
        mp3_file = session.get(download_link).content
        with open(file_name, "wb") as f:
            f.write(mp3_file)

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}

with requests.Session() as session:
    soup = BeautifulSoup(session.get(URL, headers=HEADERS).content, "html.parser")
    # Select all the links on this page with a class of "page-numbers"
    for a in soup.select("a.page-numbers"):
        getMP3sOnPageP(session, HEADERS, a.get("href"))
As you can see, this required 3 calls to Beautiful Soup (bs). My reading of the docs led me to believe bs would operate recursively, such that a single call with the correct filtering / parameters would be sufficient. If it can, I definitely don't know how.
The update I provided solved the question I had. Although user MendelG's reply didn't address the entire problem, it was very helpful; his contribution is embodied in the getMP3sOnPageP function in my update.
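Regarding the memory concern mentioned in the update, one possible refinement (a sketch only, not part of the accepted solution) is to stream the MP3 response and write it to disk in chunks instead of holding the whole file in a variable:

# Streaming variant of the download step inside getMP3sOnPageP
# (assumes session, download_link and file_name as defined above).
with session.get(download_link, stream=True) as resp:
    resp.raise_for_status()
    with open(file_name, "wb") as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)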

Which portion of the HTML code to use for scraping a website table

For the following URL
https://sports.bwin.com/it/sports#leagueIds=42&sportId=4
I'd like to scrape a simple table of betting odds.
The problem is that I don't know which part of the HTML code to use (and also how to use it).
I won't paste the HTML here because it's too long.
Here's my code, but I always get an empty object:
import bs4
import requests
from urllib.request import urlopen as uReq  # assuming the usual alias for urlopen

my_url = "https://sports.bwin.com/it/sports#leagueIds=42&sportId=4"

# opening up the connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page = requests.get(my_url, headers=header)
page_soup = bs4.BeautifulSoup(page.content, "html.parser")
betting = page_soup.findAll("table", {"class": "marketboard-event-without-header__markets-list"})
The data you are trying to get is loaded dynamically. To get it, you need to make requests to the API that the page itself calls for its data, instead of requesting the page.
The reason the table is not in your Beautiful Soup object is that it is not in the original page source; it is loaded after the page is rendered in the client.
If you mimic the requests that load the data, you should be able to get it from those subsequent requests.
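A minimal sketch of that approach (the endpoint below is a placeholder; find the real one in the browser's DevTools Network tab, under XHR, while the odds table loads):

import requests

# Hypothetical endpoint: copy the real URL from the Network tab request
# that returns the betting odds as JSON.
api_url = "https://sports.bwin.com/REPLACE_WITH_REAL_DATA_ENDPOINT"
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}

response = requests.get(api_url, headers=header)
response.raise_for_status()
data = response.json()  # structured odds data instead of an empty HTML table
print(data)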

Puppeteer JS headers for scraping

I'm trying to find out how to add headers in Puppeteer JS. I'm building something for the first time, so I'm not familiar with this.
I've set up my file with these, based on the documentation, but I'm not sure if I'm doing it correctly:
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36');
await page.setExtraHTTPHeaders({Referer: 'https://example.com/'});
Other than that, the tutorials I've searched don't add the headers.
I want to add headers to replicate a real user; in my case, I'd like to replicate my browser as if I were using it manually. What is the correct way to do that?
This works for me; the English language is set by default when Puppeteer visits the site.
await page.setExtraHTTPHeaders({'Cookie': 'language=en'});
In the browser you will find your data (the cookie set on the page).

Extracting images from Google Images using src and BeautifulSoup

I was following this earlier question (Extracting image src based on attribute with BeautifulSoup) to try to extract all the images from a Google Images page. I was getting a "urllib2.HTTPError: HTTP Error 403: Forbidden" error, but was able to get past it using this:
req = urllib2.Request(url, headers={'User-Agent' : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
however, then I got a new error that seems to be telling me that the src attribute does not exist:
Traceback (most recent call last):
  File "Desktop/webscrapev2.py", line 13, in <module>
    print(tag['src'])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 'src'
I was able to get past that error by checking specifically for the 'src' attribute, but most of my images, when extracted, don't have a src attribute. It seems like Google is doing something to obscure my ability to extract even a few images (I know requests are limited, but I thought it was at least 10).
For example printing out the variable tag (see code below) gives me this:
<img alt="Image result for baseball pitcher" class="rg_i" data-src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRZK59XKmZhYbaC8neSzY2KtS-aePhXYYPT2JjIGnW1N25codtr2A" data-sz="f" jsaction="load:str.tbn" name="jxlMHbZd-duNgM:" onload="google.aft&&google.aft(this)"/>
But printing out the variable v gives 'None'. I have no idea why this is happening, nor how to get the actual image from what it's returning. Does anyone know how to get the actual images? I'm especially worried since the data-src URL starts with encrypted... Should I query data-src instead of src to get the image? Any assistance or advice would be super appreciated!
Here is my full code (in Python):
from bs4 import BeautifulSoup
import urllib2

url = "https://www.google.com/search?q=baseball+pitcher&espv=2&biw=980&bih=627&source=lnms&tbm=isch&sa=X&ved=0ahUKEwj5h8-9lfjLAhUE7mMKHdgKD0YQ_AUIBigB"
#'http://www.imdb.com/title/tt%s/' % (id,)
req = urllib2.Request(url, headers={'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.112 Chrome/12.0.742.112 Safari/534.30"})
soup = BeautifulSoup(urllib2.urlopen(req).read(), "lxml")

print "before FOR"
for tag in soup.findAll('img'):
    print "inside FOR"
    v = tag.get('src', tag.get('dfr-src'))  # gets "src", else "dfr-src"; if both are missing - None
    print v
    print tag
    if v is None:
        continue
    print("v is NONE")
    print(tag['src'])
Oh, boy. You picked the wrong site to scrape from. :)
Google's Defenses
First off, Google is (obviously) Google. It knows web crawlers and web scrapers very well - its entire business is founded on them.
So it knows all of the tricks that ordinary people get up to and, more importantly, has a mandate to make sure nobody except end users gets their hands on its images.
Didn't pass a User-Agent header? Now Google knows you're a scraper bot that doesn't bother pretending to be a browser, and forbids you from accessing its content. That's why you got a 403: Forbidden error the first time - the server realised you were a bot and prevented you from accessing material. It's the simplest technique to block automated bots.
Google Builds Pages through Javascript
Don't have Javascript parsing capability (which Python requests, urllib and their ilk don't)? Then you can't view half of the images, because of the way Google Image search results work: if you inspect the Network tab in your Chrome console as Google Images loads, you'll see a few bundled requests made to various content providers, which then systematically add a src attribute to each placeholder img tag through inline, obfuscated Javascript code.
At the very beginning of time, all of your images are essentially blank, with just a custom data-src attribute to coordinate activities. Requests are made to image source providers as soon as the browser begins to parse Javascript (because Google probably makes use of its own CDN, these images are transferred to your computer very quickly), and then page Javascript does the arduous task of chunking the received data, identifying which img placeholder it should go to and then updating src appropriately. These are all time-intensive operations, and I won't even pretend to know how Google can make them happen so fast (although note that messing with network throttling operations in Dev Tools on Chrome 48 can cause Google Images to hang, for some bizarre reason, so there's probably some major network-level code-fu going on over there).
These image source providers appear to begin with https://encrypted..., which doesn't seem to be something to worry about - it probably just means that Google applies a custom encryption scheme to the data as it's being sent over the network, on top of HTTPS, which is then decoded by the browser. Google practises end-to-end encryption beyond just HTTPS - I believe every layer of the stack works only with encrypted data, with encryption and decryption only at the entry and exit points - and I wouldn't be surprised to see the same technology behind, for example, Google Accounts.
(Note: all the above comes from poking around in Chrome Dev Tools for a bit and spending time with de-obfuscators. I am not affiliated with Google, and my understanding is most likely incomplete or even woefully wrong.)
Without a bundled Javascript interpreter, it is safe to say that Google Images is effectively a blank wall.
Google's Final Dirty Trick
But now say you use a scraper that is capable of parsing and executing Javascript to update the page HTML - something like a headless browser (here's a list of such browsers). Can you still expect to be able to get all the images just by visiting the src?
Not exactly. Google Images embeds images in its result pages.
In other words, it does not link to other pages; it copies the images in textual format, literally writing the image down in base64 encoding. This significantly reduces the number of connections needed and improves page-loading time.
You can see this for yourself if you navigate to Google Images, right click on any image, and hit Inspect element. Here's a typical HTML tag for an image on Google Images:
<img data-sz="f" name="m4qsOrXytYY2xM:" class="rg_i" alt="Image result for google images" jsaction="load:str.tbn" onload="google.aft&&google.aft(this)" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAkGBxAPEA4NEBAPEBETDxAQFhEREA8QDxAPFhEWFhURFRgYKCgiGBsnHRcVIzIiJykrLi4vGB8zODMsNygtLisBCgoKDg0OGxAQGy0mHyUrLS4rLy8tNy8tNy0tKy4rLS0tKystLTArLS0tLS0tLy0tLS0tLS0tLS0tKy0tLS0tLf/AABEIAOEA4QMBEQACEQEDEQH/xAAcAAEAAwADAQEAAAAAAAAAAAAABQYHAQMECAL/xAA8EAACAgEBBgMFBQUIAwAAAAAAAQIDBBEFBhITITFBYXEHUYGRsRQiMlKhQmJyc7MVIyUzQ1OSwTXC8P/EABsBAQACAwEBAAAAAAAAAAAAAAAEBQIDBgEH/8QAMxEBAAIBAgQDBQgCAwEAAAAAAAECAwQRBRIhMRNBgRRRYXGRIjIzQqGxwdE08AZS4SP/2gAMAwEAAhEDEQA/ANxAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHDYFR2zvzXXZ9mxKpZd76KMPwa+vivPt5lnh4ZM08TNblr+v0Rb6nry443l1Qo29euKVuJiJ/sKHMmvXv9T2cmgp0rS1vjM7HJnt3tEejpyMPeClccMnGydP2OGNcn6cS0/VGdMvD7ztak1+U7vJpnjtbdG4vtKtpm6c3GcJx/FFJ12LzSl0l80iVbg2PLXn0994ao1d6zteF62Dt3Hz6udjz44p8Mk04yhPTXhkn2fVFLqNPk09+TJHVNpkreN4SRoZgAAAAAAAAAAAAR+2MGd0OGORbQkm3y1DWXlq1qvga8lOaO+zVlpz17zHyZTsnOuyczDxZ3WqFljU+GycXKMYSlpqn014Sq09rXyREzLn9Fa2TNFbTO3zbJCKSSXZLT4Fw6V+j0AAAAAAAAM69pG9EoN4NDevSM+HvKcu1a+a19UveX/CdDXbx8nby/tX6rNMz4dVi3L3ajg0JySlkWJSts8eJ/sL3RRWa7WW1OSZ8o7R8ErDijHXaFjRDbnAFV353T/tNY0Fy6+CzWdzTdyq061w09769e2iJ2i1ttLNpjrvHSPL5y05cUZNt09sfZdOHTDHogoQiu3i34yk/Fv3kTJlvktNrzvLbWsVjaHtMHoAAAAAHGp5zQOT0AAAAB+bfwy9H9Dyezy3aWJ7pP/E9n/zLP6Mym0n4sOa4b/kfVtpdOmcgAAAAAAAdWVcq4TsfaEJTfpFa/wDR7WvNaKx5vJnaN2IbuSeXtXDdnXivldLXxkoys+qR2eujwNFatfdEKfT/AG80TLdDi1yAAAACC3n3ihhQ6Ljta1jHwXgm9PPsvE05cvJ0jun6HQzqJmZnasd5R+Lu9k5UVbnZV8XJa8imSrjWn+zJru//ALqeRjm3W8tl9bjxTy6ekbR5zG8yid4tg5Wz4Sy8PJyJwh96yuc+KcYeM4vtJLxTXbX0Mb4pr1rKbo9bh1FvC1FK9e0xGz37lb6LLksW/hV3C3Ca6RuS7rTwkl18+vuM8WXm6S08T4TOnjxMf3fP4f8Ai422KKcn0SWplkyVx1m1u0KatZtO0INZFuXOVdcuVXHTikustX2ivP8AReZU4r5ddaZ3mtI93eUy9a4IjpvZ3T2BHT7ttyl+ZyjJa+a0+mhvycKwWjpvE+/eWFdXeJ67TCMp2ndjWSqs+9w94t6qUX2lFspfbNTw/N4eSeav8fBOnT49RTnr0WfHujZGM4vWLWqOpxZa5aRevaVTek0tNZ7qrtjea2eTHZ2AoO5uSlbP70KlH8TS8dPr0LfFoq0w+Pn328ojzQrZptfkx+sunae7m0+W7Ktp222pa8uUY1Vzf5YuGnD8dTLDrNNzbXxREfWf1L4cm28XndSNkbeyMnIhi35uXiOcuXxqTlw3a6KE0301fTyZdanS48WLxcWOttuvoh472m3La0w1rZmzvs1Lr5t170bdl1jnOT08+y8kcnmyeJMztEfKNlnFeWuzH9z3/imz/wCZZ/RmUek/Fc9w3/I+rbcm+NcJWTekYrVvyLiZiI3l09KWvaK17yp+LkZO1bLOC2eNi1y4G69FbZLvw8Xppr4dSPWbZevaFvmpi0MRXli15jfr2j0fjbe6E6a534mRkcyEXNwnZxcxJatJrRxl+h7fDtG9ZZaXicWvFM9azWfh2ebcHfZ5Nv2G+XFNxcq7HopS4Vq65e96atP3JjDeZ6WbOLcMphjxcXbzj+WgkhQAAAB49s1ueNkwXeVFsV6uDSNmGYrkrM++P3Y3jeswwjc7NVWfg3SeiVsYvyU4uH/sdxxLH4mlvEe7f6dVNp7cuSJfQZwa7AAADgDIMzaP2nadLl1i8+EdPDgjYoxXySK+J5su8+92XgeDw+Yr35f36tfRYONcTimnFrVNNNPs0+6BE7dXz1tRS2dm2KDaeNlOUPfwKXFFf8Wl8SFMct3f47RqdHHN+av+/q2zbuV/d1adp/f+CSf/AGV3HM01x1pHnP7OP0WPe8z7nG6P+RJ+Lusb9eiX6aErg+3ssfOf3Ya78aflCcLRDVve/H/ybl3UnW/OLTa+j+Zzv/IMUTjrk907LThl/tTT1eTd3PksbOUesqoynH1cJNL5x/Uk/wDHJ56clu3NH6sOLU5bRaPOFL9l+antFcT1dmNZFN93LWMv1SkfRuN4dtNHL2iXMaK3/wBOvm2I5JbMJ9qOKqNo5Dh05ka7lp35klo2vNyjr8Ts+EZebRxzeW8eip1NdsvRt9bk6k5fi5a1/i4epxl9t52Wk/dYnuZ/5TZ/8yz+jMp9J+K53h3+Q0b2j5jrx6oL/Ut0fmoxb0+enyJ+qnauzvuB4YvnmZ8oR25u2eRs+Crx8jIsdt7caq24682WnFN9F00PcNtscbQy4ngi+stzWisdO/y9z2X7P2ptBON1scCiS0ddOk75RfhKfh8DPlvPdHjPpdP+FXnt77dvSP7SO7+5mDgNTppTtX+tY3Ozto9G/wAPw0M4rEImfWZs/wB+3p5LEZIwAAAcMD593z2RLBzbqNGoSk7apeDrk9Uk/fF6r4L3ndcO1MajBEz3jpP+/FT
Z8U0u17cTeOOfjR4mufWlC2Pi34WLylpr66rwOV4jo502WY/LPWP69Fjp80ZK/HzWUr0hGbc2ZZlQVccq/Gj14uTwKc17uJptfDQ3YcsY7c3LE/NhenNG2+zN9+thLZVFN1WRlWynkKpq2zVcPKnLVaaddYov+G6mdXlml6xtEb9I+KDqMMY67xM/V6fY5DnzzMuyU5TrlCuGs5OMIyi3LRdte3U08cnkmuOsbRtu2aOOkzKrbx1zw8++HVSryOfXr04oufMg15eHwZyNq8l930rR5K6rSRHvrtP02bls7NhkVVZFb1hZCM16Na6epOid43cNlxWxXmlu8Ts9J61sG3wxnmbauxavvO3Jrr6eCjXBTl8OGXyI0xvd2OHN7Pw2s277T+szs1feulwpqsj2qai/KDSWvzSK3jmnnJhi8flc9w/JHiTWfN17n5K/vate75kfPolL6I0cC1EctsM9+8NnEsUxMX9FmOiVaC3wtUcda/nT+SZScdtHs8V98wsOGVmc3T3PDuBivkW3yX+dZ91PxritE/i3I38JwTiwbz59TiWWL5eWPLoou8W5+Zs7KWXhQnbTGzm1uuPHZS9deXKC6uPhqvDud9puJ4NRh8LUTtO23w+bncmmvS3NjXDE9oVXKUsjHyKLeHV1yhwptd3Fy06fAq54Teb7YrRavv3/AHbva4iPtxMSitg7AntbL/trLio08UZUU9+OMPwSl+7r182btVqo0uL2TDPX80/Ge+xixzkt4t/Ro9v4Zfwv6FFKXbtLDdzH/imz/wCZZ/RmVGk/Fc5w7/IaJ7UcOU8HnRTbotjY9P8AbacZP4ap/Ass9earvOCZ4x6nafzRsgPZXt2MJ2YU5aK2XMrbfTmaJSh8Uk16M16e232ZWPHtHNojPXy6T/E/w08lOWcgAAAAAAgd792KdpU8qz7lkW5V2payrk/rF9NV5eRL0esvpr81e3nDVlxRkjaWO34e0tiXq2UJw4W0r6050Tj4pv3P8stH9TqK6nS67HyWn0nurpxZMVuaF92F7VcS2KWSpUz06yinOp+ei+8vTR+pT6jgeas74p5o/VKprKz96NliW++zNOL7bQl5yafya1IM8N1UTtyS3e0Y/wDsoXtM3nx9o1UYuDzcqyGQrG66bHBJVzjp21b1kvDTuWvC8FtJecmaYrG23fr3R894yRtTqmvY3snJxqct5NFlPMtrlFTSTlFQab07r4kTi+ox5slZxzvtDdp6TSvVOb7bn17ShGSlysitPgt01TX5Jrxj9ClvTmW2h199Lbp1ie8Klu5lbW2PxY1+DbkY/E2nQ1Zwt9XKDXg34NIwrzU6bLTV20eu+3F+W/xj90/lb05+RB14OzcmFkunNylCquv97TX7xlzWntCBXS6bHO+XLEx7q+bt3H3KjgOeVfNXZdmvFZ14YJvWUY6+9934ntaRDXrtfbUztHSsdoW62uM4yhJKUZJpp9mn3RlasWjaUGJmJ3hS8nd/KxLObif3taeqhxJWw8uvSS/UoM/CsmPJ4unnrC2prseWnJmj1S9O37tNJYWTx+5QfDr6+BNpqtVttbDO/wAJjZEtp8O+9ckbPJfsfIz7IzylyaI9qVJOcl7m10S/X0NddFkz5Iy6nbp2rDZ7VTDSaYe895n+FoqrUYqMUoxikkl0SS7Itlf3eTaO1sfGWt91VX8c0m/Rd2bsWDJlnalZlhbJWv3pZJ7Tds4uZbRfi3Suca5VTjGq5RUdeJSUmkn3afwOm4RTJpq2pmiI3neOsK/VTGSYmnX0SPs/39hTXVhZLSqjpCu7/bj4Rs/d/e8PHp1NXE+FTeZzYu/nH9f090+p5Y5L/Vp+dZNVWSqhzZ8EnGClGPHLTotX0XqcxbfZYW3mvRj+x92NsY2VRlrChJ1SclHn0rXWLi1rr06Nlfj016WiypwaHLivzx3atsq2++qf2vGjQ5ax5XNjfxQa0fFokuvu6k+szMfahbY7X7z0ll283s+y8WyVuDF5FDlxKtS0vp8dFr+JLwa6/U03w+cOq0fGqTXkz/8AkvXsnfTbFaVNmz772uic6LoT+MktGe1m8eTTn0nDrzzVycv6rnsC3aeRKN2XCrDpXVUQ+/dZL9+XaMfJdTZXmnuqtR7NSOXDvaffPb0hZjNCAAAAAA4lFNNNJp+D6pgQ+Vups+18VmHjSfv5UE/0N9dTmr2tP1YTSs94cY+6ez63rDCxk/5UX9RbU5rd7z9SMdY7QlaMeFa0hCEF7oxUV+hpmZnuzdp4AAAAAAAAACsb+7xPBoioPS21yjF/ljFaykvPrFfEseGaP2nL17R3RtTm8OvTvKB3A3bqyqltTMXPstlNwjZrKMIRk46tPvJtPuSeJ6u1LzgxfZrHTp0YabDHLz26zLQK6IRWkYQivcopIpZmZlMZb7Tty6qa57SxoqtKUedVHpBqclHmQXg9WtV266+vR8H4jebxgv190oGqwREc0LT7K82d2zKHNtuErKU33cIS0j8l0+BXcWx1pqrRXz6/VI00zOON1uK1vAAAAAAAAAAAAAAAAAAAAAAAAAAApHtU3fty8eq6iLnbjynLlr8U6ppcaj75fdi9PJltwjWV0+WYv2si6rFOSu8d4Vf2db+VYtf2HK4oQjOXBPRt18T1lXOPfu29fPQseJ8Mtnt42Hrv3j+YaMGo8OOS7RY72bPceL7bjaedsE/k+pRTodTE7eHP0TIz4/8AtCo727e/taD2VsyLv5k4c3I0aoqhGSl+J9+qXw9ek/S4fYrePn6THavnM/w05L+L9inrK7bubIhg41OJB6quOjk+85t6yk/VtlXnzWzZJvbvKTSsVjaEkamQAAAAAAAAAAAAAAAAAAAAAAAAAADQCI2puzg5T478WmyX53BKf/JaM349TmxfctMerCcdbd4eGrcHZUXqsOp/xOc18pPQ224hqZjabz9WMYcceUJ/FxK6oqFUIVxX7MIqMfkiJMzad5lt2dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/9k=" style="width: 167px; height: 167px; margin-left: -7px; margin-right: -6px; margin-top: 0px;">
Note the massive wall of text buried in src. That is quite literally the image itself, written in base 64. When we see an image on our screen, we are actually viewing the result of this very text parsed and rendered by a suitable graphics engine. Modern browsers support decoding and rendering of base64-encoded URIs, so it's not a surprise you can literally copy-paste the relevant text into your address bar, hit Enter and view the image at once.
To get the image back, you can decode this wall of text (after suitably parsing it to remove the data:image/jpeg;base64, prefix) using the base64 module in Python:
import base64

base64_string = ...  # the monster you saw above, minus the "data:image/jpeg;base64," prefix
decoded_string = base64.b64decode(base64_string)  # note: this is actually a bytes object
You must also make sure to parse the image type from the start of the src attribute, write decoded_string out to a file, and save it with the file extension you parsed from the data URI. Phew.
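For instance, a minimal sketch of that last step (variable names follow the snippet above; the jpeg extension comes from the data:image/jpeg;base64, prefix of the example tag):

# decoded_string holds the raw image bytes from base64.b64decode() above
with open("result.jpeg", "wb") as f:
    f.write(decoded_string)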
tl;dr
Don't go after Google Images as your first major scraping project. It's:
- hard. Wikipedia is much easier to get ahold of.
- in violation of their Terms of Service (although what scraping isn't? and note I am not a lawyer and this doesn't constitute legal advice), where they explicitly say: "Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide."
- really hard to improve on by guesswork. I wouldn't be surprised if Google was using additional authentication mechanisms even after you spoof a human browser as much as possible (for instance, a custom HTTP header), and no one except an anonymous rebellious Google employee eager to reduce his/her master to rubble (unlikely) could help you out then.
- significantly easier with Google's provided Custom Search API, which lets you simply ask Google for a set of images programmatically without the hassle of scraping. This API is rate-limited to about a hundred requests a day, which is more than enough for a hobby project. Here are some instructions on how to use it for images. As a rule, use an API before considering scraping (a short sketch follows below).
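A rough sketch of that route, calling the Custom Search JSON API with plain requests (the API key and search engine ID are placeholders you create in the Google Cloud console and the Programmable Search Engine panel):

import requests

API_KEY = "YOUR_API_KEY"        # placeholder: create in the Google Cloud console
CX = "YOUR_SEARCH_ENGINE_ID"    # placeholder: create in the Programmable Search Engine panel

response = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX, "q": "baseball pitcher", "searchType": "image"},
)
response.raise_for_status()
for item in response.json().get("items", []):
    print(item["link"])  # direct URL of each image result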
To scrape Google Images using the requests and beautifulsoup libraries, you need to parse data from the page source code, inside <script> tags, using regular expressions.
If you only need to parse thumbnail-size images, you can do it by passing a content-type query param (solution found from MendelG) in the HTTP request:
import requests
from bs4 import BeautifulSoup

params = {
    "q": "batman wallpaper",
    "tbm": "isch",
    "content-type": "image/png",
}

html = requests.get("https://www.google.com/search", params=params)
soup = BeautifulSoup(html.text, 'html.parser')

for img in soup.select("img"):
    print(img["src"])
To scrape the full-resolution image URLs with requests and beautifulsoup, you need to extract data from the page source code via regex.
Find all <script> tags:
soup.select('script')
Match images data via regex:
matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
Match desired images (full res size) via regex:
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps() it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
matched_images_data_json)
Extract and decode them using bytes() and decode():
for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
If you need to save them, you can do it via urllib.request.urlretrieve(url, filename) (more in-depth):
import urllib.request

# often it will throw a 404 error; to avoid it we need to pass a user-agent
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg')  # you can skip the folder path and it will save to the current working directory
Code and full example in the online IDE:
import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():
    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored into something more compact
    all_script_tags = soup.select('script')

    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps() it will throw an error:
    # "Expecting property name enclosed in double quotes"
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)

    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)

    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')

        # after the first decoding, Unicode characters are still present; after the second iteration, they are decoded
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches
    removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))

    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)

        # ------------------------------------------------
        # Download original images
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')

get_images_data()
-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''
Alternatively, you can achieve the same thing by using the Google Images API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with regex to match and extract the needed data from the source code of the page; instead, you only need to iterate over structured JSON and grab what you want.
Code to integrate:
import os, urllib.request, json  # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()
---------------
'''
[
  ...
  {
    "position": 100, # img number
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
    "source": "pexels.com",
    "title": "Close-up of Cat · Free Stock Photo",
    "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
    "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
    "is_product": false
  }
]
'''
P.S. - I wrote a more in-depth blog post about how to scrape Google Images and how to reduce the chance of being blocked while scraping search engines.
Disclaimer: I work for SerpApi.
The best way to solve this problem is to use a headless browser like Chrome via WebDriver, driven by a browser-automation library such as Selenium for Python. Beautiful Soup alone isn't adequate.
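A minimal sketch of that idea (assumes Selenium 4+ with a local Chrome install, where Selenium Manager fetches the driver automatically; the attribute names mirror the img tags from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get("https://www.google.com/search?q=baseball+pitcher&tbm=isch")
# Once the page's JavaScript has run, the <img> tags carry usable src/data-src values
for img in driver.find_elements(By.TAG_NAME, "img"):
    src = img.get_attribute("src") or img.get_attribute("data-src")
    if src:
        print(src)

driver.quit()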