Web scraping - page is not loading after 5-6 requests loaded - html

I'm trying to scrape a specific website's subpages. I'm using requests and bs4. I have the pages stored in a list that I use for looping. The scripts works fine with other websites, so I think I have some problems with the page itself. I can't access the page with my browser(s), or just for a limited time (few seconds). I've tried all of my browsers(Chrome, Firefox, Edge, Explorer) removed every cookie and other browsing datas, etc...)
I'm using headers:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1", "DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"}
and here is the code to request the page:
cz_link= requests.get(cz_page,timeout=10, verify=False,headers=headers)
where "cz_page" is the item in the list that holds the pages I want to parse.
After 5 or 6 pages are loaded the next page won't load.
I've tried "https://downforeveryoneorjustme.com/" to check if the page is up, and it is, "it's just me."
Is there any way that I can access the pages through python requests regardless I'm not able to load the site in my browser(s)?
My next try will be to run the script with VPN on, but I'm curious if there is an other solution, I'm not able to use VPN all the time when I need to run this script.
Thank you!

The solution was to add a delay, but bigger than 5 sec. I experienced with it and it seems that after 5 page is loaded I got blocked and I had to wait for 10 minutes at least to try again.
So I added a counter inside the loop, and after it hit 5 I used time.sleep() for 10 mins and restarted the counter.
It is slow, but it works.
Thanks for the suggestions though!

Related

Getting many requests with User Agent [Mozilla/5.0]

When a request arrives to my java servlet I'm checking its UserAgent:
protected void service(HttpServletRequest request, HttpServletResponse response){
final String UA = request.getHeader("User-Agent");
eu.bitwalker.useragentutils.Browser browser = UserAgent.parseUserAgentString(UA).getBrowser();}
Most requests has UA (User Agent) with information in it, e.g. Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36.
Some requests (about 10%) has only Mozilla/5.0 or Mozilla/4.0.
Does it means they are bots?
Is it possible that something before the servlet removes the relevant part in the UA?
I'm using HaraldWalker User Agent Utils to identify the UA and it returns Mozilla for those UA's.But this online tool returns unknown.
Can someone please explain?
It looks very likely that these are some sort of bot, as that user agent is not used by any mainstream browser.
It will be worth you filtering your logs to extract just these entries, and checking if they are following any sort of obvious bot-like pattern. For instance, you may see:
A request every X seconds exactly
That they all happen at a specific time of day
That they all happen within a very short period of time
That they request URLs in alphabetical order
That all the requests come from a single IP address, or limited range of IPs

What is causing requests to this url/path?

My company is hosting an ecom shop based on Infinity Shop System. Our logs say that there are HTTP calls to this path which lead to 404 errors since the file does not exist:
http://{domain}/{somePath}/skin/default/images/tb-collectingalarm-red-low_inject.png
However, this reference is not made by us as I cannot find this path in any line of our source code.
The logs also state that only (some?) Firefox users do this call:
User Agent Mozilla/5.0 (Windows NT 6.3; rv:35.0) Gecko/20100101
Firefox/35.0
So, since this does cause quite some 404 errors, does anyone know what could cause these requests?
We already followed the referrer URL which lead to one of our sites but within its html markup we could not find any reference.

Get Windows Phone Market html code

I want to get html code from windows phone market pages. So far I have not run into any problems but today following error is displayed every time I retrieve data.
[...] Your request appears to be from an automated process.
If this is incorrect, notify us by clicking here to be redirected [...].
I tried to use proxy in case to many requests are called from one IP but this does not bring any progression. Do you happen to know why this problem takes place, any ideas about possible way outs? Any help would be very much appreciated. The main goal is to somehow get information about windows phone app from market.
It seems that they detect the user agent and block the request if it is not valid / known for a device.
I managed to make it work with curl with eg.
curl -A 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9' http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9
For asp.net, if you use HttpRequest to get the html content, try the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9";
For PHP you can set your user agent as well via curl_setopt.
I was not able to find out, whether there is an IP-based block after several requests.

Audio hosted on Azure Blob storage occasionally stops playing

This seems to only happen in Chrome (latest version 31.0.1650.48 m, but also earlier), but since it doesn't always happen it's hard to say for sure.
When streaming audio stored in Azure Blob storage, Chrome will occasionally play about 30-50% of the track and then stop. It's hard to reproduce, but if I clear the cache and play the file over and over again, it eventually happens. An example file can be found here.
The error is pretty much the same as what's described here, but I've yet to see the problem on any files hosted elsewhere.
Update:
The Azure Blog log only gives AnonymousSuccess messages, no error messages. This is what I get:
1.0;2013-11-14T12:10:10.6629155Z;GetBlob;AnonymousSuccess;200;3002;269;anonymous;;p3urort;blob;"http://p3urort.blob.core.windows.net/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";"/p3urort/tracks/bd2fd171-b3c5-4e1c-97ba-b5109cf15098";c377a003-ea6b-4982-9335-15ebfd3cf1b1;0;160.67.18.120:54132;2009-09-19;419;0;318;7663732;0;;;"0x8D09A26E7479CEB";Friday, 18-Oct-13 14:38:53 GMT;;"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.48 Safari/537.36";"http://***.azurewebsites.net/";
Apparently you have to set the content type to audio/mpeg3
Here's how I do it:
CloudBlockBlob blockBlob = container.GetBlockBlobReference(fileName);
blockBlob.UploadFromStream(theStream);
theStream.Close();
blockBlob.Properties.ContentType = "audio/mpeg3";
blockBlob.SetProperties();
From here: https://social.msdn.microsoft.com/Forums/azure/en-US/0139d27a-0325-4be1-ae8d-fbbaf1710629/unable-to-load-audio-in-html5-audio-tag-from-storage?forum=windowsazuredevelopment
[edit] - This didn't actually work for me, I'm trying to troubleshoot, but I don't know what's wrong, going to ask a new question.
This mp3 only plays for 1.5 min and then stops. When downloaded, the file plays fully...
https://orator.blob.core.windows.net/mycontainer/zenhabits.net.unsolved.mp3

Jsoup Delay due to Streaming Website

A question regarding Jsoup: I am building a tool that fetches prices from a website. However, this website has streaming content. If I browse manually, I see the prices of 20 mins ago and have to wait about 3 secs to get the current price. Is there any way I can make some kind of delay in Jsoup to be able to obtain the prices in the streaming section? I am using this code:
conn = Jsoup.connect(link).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36");
conn.timeout(5000);
doc = conn.get();
As mentioned in the comments, the site is most likely using some type of scripting that just won't work with Jsoup. Since Jsoup just get the initial HTML response and does not execute any javascript.
I wanted to give you some more guidence though on where to go now. The best bet, in these cases, is to move to another platform for these types of sites. You can migrate to HTMLUnit which is a headless browser, or Selenium which can use HTMLUnit or a real browser like Firefox or Chrome. I would recommend Selenium if you think you will ever need to move past HTMLUnit as HTMLUnit can sometimes be less stable a browser compared to consumer browsers Selenium can support. You can use Selenium with the HTMLUnit driver giving you the option to move to another browser seamlessly later.
You can Use a JavaFX WebView with javascript enabled. After waiting the two seconds, you can extract the contents and pass them to JSoup.
(After loading your url into your WebView using the example above)
String text=view.getEngine() executeScript("document.documentElement.outerHTML");
Document doc = Jsoup.parse(html);