Puppeteer JS headers for scraping

I'm trying to find out how to add headers in Puppeteer JS. I'm building something like this for the first time, so I'm not familiar with it.
I've set up my file with these calls, based on the documentation, but I'm not sure if I'm doing it correctly:
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36');
await page.setExtraHTTPHeaders({Referer: 'https://example.com/'});
Beyond that, the tutorials I've found don't add headers at all.
I want to add headers to replicate a real user; in my case, I'd like to make requests look as if I were using my browser manually. What is the correct way to do that?

This works for me; the English language is set by default when Puppeteer visits the site:
await page.setExtraHTTPHeaders({'Cookie': 'language=en'});
In the browser's DevTools you will find the cookie set on the page.
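For completeness, here is roughly how both calls fit into a full script (a minimal sketch; the header values and URLs are illustrative, not prescribed):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Use your own browser's UA string to mimic manual browsing
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36');
  // Extra headers are sent with every request made from this page
  await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://example.com/',
  });
  await page.goto('https://example.com/');
  await browser.close();
})();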


Web scraping - page is not loading after 5-6 requests

I'm trying to scrape a specific website's subpages using requests and bs4. I have the pages stored in a list that I loop over. The script works fine with other websites, so I think the problem is with this page itself. I can't access the page with my browsers either, or only for a limited time (a few seconds). I've tried all of my browsers (Chrome, Firefox, Edge, Explorer), removed every cookie and other browsing data, etc.
I'm using headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
and here is the code to request the page:
cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
where "cz_page" is the item in the list that holds the pages I want to parse.
After 5 or 6 pages are loaded, the next page won't load.
I've checked "https://downforeveryoneorjustme.com/" to see whether the page is up, and it is: "it's just me."
Is there any way I can access the pages through Python requests even though I'm not able to load the site in my browsers?
My next idea is to run the script with a VPN on, but I'm curious whether there is another solution, since I can't use a VPN every time I need to run this script.
Thank you!
The solution was to add a delay, but bigger than 5 seconds. I experimented with it, and it seems that after 5 pages are loaded I get blocked and have to wait at least 10 minutes before trying again.
So I added a counter inside the loop, and after it hits 5 I call time.sleep() for 10 minutes and reset the counter.
It is slow, but it works.
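A minimal sketch of what I ended up with (cz_pages stands in for my real URL list):
import time
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"}
cz_pages = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

counter = 0
for cz_page in cz_pages:
    if counter == 5:
        time.sleep(600)  # the block seems to lift after ~10 minutes
        counter = 0
    cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
    counter += 1
    # ...parse cz_link.content with bs4 here...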
Thanks for the suggestions though!

Which portion of HTML code to use for scraping a website table

For the following URL
https://sports.bwin.com/it/sports#leagueIds=42&sportId=4
I'd like to scrape a simple table of betting odds.
The problem is that I don't know which part of the HTML code to use (and also how to use it).
Here is the table example:
I won't paste the HTML code because it is too long.
Here's my code, but I always get an empty object:
import bs4
import requests
from urllib.request import urlopen as uReq

my_url = "https://sports.bwin.com/it/sports#leagueIds=42&sportId=4"

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36'}
page = requests.get(my_url, headers=header)
page_soup = bs4.BeautifulSoup(page.content, "html.parser")
betting = page_soup.findAll("table", {"class": "marketboard-event-without-header__markets-list"})
The data you are trying to get is loaded dynamically. To get it, you need to make requests to the API that the page itself calls for its data, instead of fetching the page HTML.
The reason the table is not in your BeautifulSoup object is that it is not in the original page source; it is rendered in the client after the page loads.
If you mimic the requests that load the data, you should be able to get it from those subsequent requests.
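A rough sketch of that approach (the endpoint below is a placeholder; copy the real XHR URL from your browser's Network tab while the table loads):
import requests

# Placeholder: substitute the actual request URL observed in the Network tab
api_url = "https://sports.bwin.com/placeholder/api/endpoint"

resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
data = resp.json()  # dynamically loaded tables usually arrive as JSON
print(data)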

Issue scraping data from stats.nba.com

I've been having a bit of trouble scraping data from the stats.nba.com site. I've done this a few times before, so I'm not sure what has changed, but I wanted to see if anyone else is having the same problem.
I usually just use jsonlite with the request URL, like so:
fromJSON("http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
R just seems to get stuck running the code. Interestingly, I can still easily scrape from the NBA's D-League website:
fromJSON("http://stats.nbadleague.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=20&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
Anyone else having this issue?
Try this:
library(httr)
library(rjson)

url <- "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Per36&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
# stats.nba.com appears to reject requests without a browser-like User-Agent,
# so send one explicitly via httr
agent <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
data <- GET(url, user_agent(agent))
fromJSON(content(data, type = "text"))
I messed around with this for hours. The best guess I can venture is that it has something to do with the "/error" redirect on the NBA stats URL, which does not occur on the D-League URL.
The code I wrote that works involves reading the JSON as text first using readLines(), then passing the result into fromJSON():
library(jsonlite)
jsonTxt <- readLines("https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=01/01/2017&DateTo=09/30/2017&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=")
json <- fromJSON(txt = jsonTxt)
colnames(json$resultSets)
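From there, the player table can be rebuilt (a sketch assuming the usual stats.nba.com payload shape, where resultSets carries parallel headers and rowSet fields):
# headers and rowSet are parallel: one column name per rowSet column
stats <- as.data.frame(json$resultSets$rowSet[[1]], stringsAsFactors = FALSE)
names(stats) <- json$resultSets$headers[[1]]
head(stats)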

Getting many requests with User Agent [Mozilla/5.0]

When a request arrives at my Java servlet, I check its User-Agent:
protected void service(HttpServletRequest request, HttpServletResponse response) {
    final String UA = request.getHeader("User-Agent");
    eu.bitwalker.useragentutils.Browser browser =
            UserAgent.parseUserAgentString(UA).getBrowser();
}
Most requests have a UA (User-Agent) with information in it, e.g. Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36.
Some requests (about 10%) have only Mozilla/5.0 or Mozilla/4.0.
Does that mean they are bots?
Is it possible that something in front of the servlet strips the relevant part of the UA?
I'm using HaraldWalker's User-Agent-Utils to identify the UA, and it returns Mozilla for those UAs, but an online lookup tool returns unknown.
Can someone please explain?
It looks very likely that these are some sort of bot, as that user agent is not used by any mainstream browser.
It will be worth filtering your logs to extract just these entries and checking whether they follow any obvious bot-like pattern (see the sketch after this list). For instance, you may see:
- A request every X seconds exactly
- That they all happen at a specific time of day
- That they all happen within a very short period of time
- That they request URLs in alphabetical order
- That all the requests come from a single IP address, or a limited range of IPs
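If you want to capture those entries at the source, a minimal sketch of a servlet that flags bare UAs for later log analysis (the class name and logging target are illustrative):
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UaAuditServlet extends HttpServlet {
    @Override
    protected void service(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String ua = request.getHeader("User-Agent");
        // A bare "Mozilla/5.0" or "Mozilla/4.0" with nothing after it is not
        // sent by any mainstream browser, so record it for pattern analysis.
        if (ua != null && ua.trim().matches("Mozilla/[45]\\.0")) {
            System.out.printf("Suspect UA from %s requesting %s%n",
                    request.getRemoteAddr(), request.getRequestURI());
        }
        super.service(request, response);  // continue normal handling
    }
}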

Get Windows Phone Market html code

I want to get the HTML code of Windows Phone Market pages. So far I had not run into any problems, but today the following error is displayed every time I retrieve data:
[...] Your request appears to be from an automated process.
If this is incorrect, notify us by clicking here to be redirected [...].
I tried using a proxy in case too many requests were being made from one IP, but that did not bring any progress. Do you happen to know why this happens, or have any ideas about possible workarounds? Any help would be very much appreciated. The main goal is to somehow get information about Windows Phone apps from the Market.
It seems that they detect the user agent and block the request if it is not valid/known for a device.
I managed to make it work with curl, e.g.:
curl -A 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9' http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9
For ASP.NET, if you use HttpWebRequest to get the HTML content, try the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9";
For PHP you can set your user agent as well, via curl_setopt:
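A short sketch of that (same store URL as above; the options are standard cURL constants):
<?php
$ch = curl_init('http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the HTML instead of printing it
$html = curl_exec($ch);
curl_close($ch);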
I was not able to find out whether there is an IP-based block after several requests.