Getting many requests with User Agent [Mozilla/5.0]

When a request arrives at my Java servlet I check its User-Agent header:
protected void service(HttpServletRequest request, HttpServletResponse response) {
    final String UA = request.getHeader("User-Agent");
    eu.bitwalker.useragentutils.Browser browser = UserAgent.parseUserAgentString(UA).getBrowser();
}
Most requests have a User-Agent (UA) header with full information in it, e.g. Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36.
Some requests (about 10%) have only Mozilla/5.0 or Mozilla/4.0.
Does that mean they are bots?
Is it possible that something before the servlet strips the relevant part of the UA?
I'm using Harald Walker's User Agent Utils to identify the UA, and it returns Mozilla for those UAs. But this online tool returns unknown.
Can someone please explain?

It looks very likely that these are some sort of bot, as that user agent is not used by any mainstream browser.
It would be worth filtering your logs to extract just these entries and checking whether they follow any obvious bot-like pattern (see the sketch after this list). For instance, you may see:
A request every X seconds exactly
That they all happen at a specific time of day
That they all happen within a very short period of time
That they request URLs in alphabetical order
That all the requests come from a single IP address, or limited range of IPs
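A minimal sketch of that kind of log check, assuming a combined-format access log named access.log; the file name, the regex, and the one-second tolerance are all illustrative and will need adjusting to your actual log format:
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical combined-log-format line; adjust the regex to your server's format.
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"')

requests_by_ip = defaultdict(list)
with open("access.log") as log:  # hypothetical file name
    for line in log:
        m = LINE.match(line)
        if m and m.group("ua") in ("Mozilla/5.0", "Mozilla/4.0"):
            ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
            requests_by_ip[m.group("ip")].append(ts)

# Flag IPs whose bare-UA requests arrive at suspiciously regular intervals.
for ip, times in requests_by_ip.items():
    times.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) >= 5 and max(gaps) - min(gaps) < 1.0:
        print(f"{ip}: {len(times)} requests, ~{gaps[0]:.0f}s apart, looks automated")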

Related

Web scraping - page stops loading after 5-6 requests

I'm trying to scrape a specific website's subpages using requests and bs4. I have the pages stored in a list that I loop over. The script works fine with other websites, so I think the problem is with this page itself: I can't access it with my browser(s) either, or only for a limited time (a few seconds). I've tried all of my browsers (Chrome, Firefox, Edge, Internet Explorer) and removed every cookie and other browsing data, etc.
I'm using headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
and here is the code to request the page:
cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
where "cz_page" is the item in the list that holds the pages I want to parse.
After 5 or 6 pages are loaded the next page won't load.
I've tried "https://downforeveryoneorjustme.com/" to check if the page is up, and it is, "it's just me."
Is there any way that I can access the pages through Python requests even though I'm not able to load the site in my browser(s)?
My next try will be to run the script with a VPN on, but I'm curious whether there is another solution, since I'm not able to use a VPN every time I need to run this script.
Thank you!
The solution was to add a delay, but bigger than 5 seconds. I experimented with it, and it seems that after 5 pages are loaded I get blocked and have to wait at least 10 minutes before trying again.
So I added a counter inside the loop, and after it hits 5 I call time.sleep() for 10 minutes and reset the counter.
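A minimal sketch of that loop, assuming the page list is called cz_pages and a parse_page() helper does the bs4 work (both names are hypothetical; headers is the dict from the question):
import time
import requests

counter = 0
for cz_page in cz_pages:  # cz_pages: list of URLs to scrape (hypothetical name)
    cz_link = requests.get(cz_page, timeout=10, verify=False, headers=headers)
    parse_page(cz_link.text)  # hypothetical bs4 parsing helper
    counter += 1
    if counter == 5:          # after 5 pages, back off before the block kicks in
        time.sleep(10 * 60)   # wait 10 minutes
        counter = 0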
It is slow, but it works.
Thanks for the suggestions though!

What is Google Chrome Browser Version 0.A.B.C?

I'm looking through my GA logs and I see a Google Chrome browser version 0.A.B.C. Could anybody tell me what this is exactly? Some kind of spider or bot or modified http header?
The full user agent string probably looks something like this:
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
This is most likely a bot, but it could just be someone running an automated script using CasperJS or PhantomJS (or even a shell script using something like lynx) and spoofing the user agent.
The reason it looks like that instead of something that says "My automated test runner v1.0" (or whatever is relevant to the author) is that this user agent string will pass most regular expression checks as "some version of Chrome" and not get filtered out properly by most bot checks that rely on a regular expression to match 'valid' user agent patterns.
In order to catch it, your site's bot checker would need to blacklist this string, or validate all parts of the Chrome version to make sure they're valid numbers. Even then, you can only do so much checking, since the user agent string is so easy to spoof.
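As a rough sketch of that second check (illustrative only, not a complete bot detector), a pattern like this accepts a real Chrome version but rejects Chrome/0.A.B.C, which a naive "Chrome/" substring check would let through:
import re

# Require Chrome/<major>.<minor>.<build>.<patch> with all-numeric components.
CHROME_VERSION = re.compile(r"Chrome/(\d+)\.(\d+)\.(\d+)\.(\d+)")

def looks_like_real_chrome(user_agent: str) -> bool:
    m = CHROME_VERSION.search(user_agent)
    return m is not None and int(m.group(1)) > 0

print(looks_like_real_chrome("... Chrome/0.A.B.C Safari/525.13"))        # False
print(looks_like_real_chrome("... Chrome/43.0.2357.130 Safari/537.36"))  # True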

What is causing requests to this url/path?

My company is hosting an ecom shop based on Infinity Shop System. Our logs say that there are HTTP calls to this path which lead to 404 errors since the file does not exist:
http://{domain}/{somePath}/skin/default/images/tb-collectingalarm-red-low_inject.png
However, this reference is not made by us as I cannot find this path in any line of our source code.
The logs also state that only (some?) Firefox users make this call:
User-Agent: Mozilla/5.0 (Windows NT 6.3; rv:35.0) Gecko/20100101 Firefox/35.0
So, since this causes quite a few 404 errors, does anyone know what could be triggering these requests?
We already followed the referrer URL, which led to one of our own pages, but within its HTML markup we could not find any reference either.

Get Windows Phone Market html code

I want to get the HTML of Windows Phone Market pages. So far I had not run into any problems, but today the following error is displayed every time I retrieve data.
[...] Your request appears to be from an automated process.
If this is incorrect, notify us by clicking here to be redirected [...].
I tried to use a proxy in case too many requests were coming from one IP, but this did not bring any progress. Do you happen to know why this problem occurs, and are there any possible workarounds? Any help would be very much appreciated. The main goal is to somehow get information about Windows Phone apps from the Market.
It seems that they detect the user agent and block the request if it is not valid / known for a device.
I managed to make it work with curl with eg.
curl -A 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9' http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9
For ASP.NET, if you use HttpWebRequest to get the HTML content, try the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9";
For PHP you can set your user agent as well via curl_setopt.
I was not able to find out whether there is an IP-based block after several requests.

How to programmatically extract information from a web page, using Linux command line?

I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.
The www.xe.com website provides a historical lookup tool, and with a parameterized URL one can get the rate table for a specific date without populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies on Oct. 15th, 2012.
Now, assuming I have a list of dates, I can loop through it and change the date part of that URL to get the required page. If I can fetch the rates page, then a simple grep EUR will give me the relevant line (and I can use awk to extract the rate itself).
The question is: how can I fetch the page(s) from the Linux command line? I tried wget but it did not do the job.
If not on the CLI, is there an easy and straightforward way to do this programmatically (i.e., one that takes less time than copy-pasting the dates into the browser's address bar)?
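For the programmatic (non-CLI) route, a minimal Python sketch of that date loop might look like the following; the date list and the crude EUR filter are purely illustrative, and note the terms-of-use issue discussed in UPDATE 2 below:
import requests

dates = ["2012-10-15", "2012-10-16"]  # illustrative list of dates
for date in dates:
    url = f"http://www.xe.com/currencytables/?from=USD&date={date}"
    html = requests.get(url, timeout=10).text
    # Crude equivalent of "grep EUR": print the lines that mention EUR.
    for line in html.splitlines():
        if "EUR" in line:
            print(date, line.strip())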
UPDATE 1:
When running:
$ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
I get a file which contain:
<HTML>
<HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
<BODY>
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
</BODY>
</HTML>
so it seems the server can identify the type of client and blocks wget. Is there any way around this?
UPDATE 2:
After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:
You agree that you shall not:
...
f. use any automatic or manual process to collect, harvest, gather, or extract
information about other visitors to or users of the Services, or otherwise
systematically extract data or data fields, including without limitation any
financial and/or currency data or e-mail addresses;
which, I guess, concludes the efforts on this front.
Now, out of curiosity: if wget generates an ordinary HTTP request, how does the server know that it came from a command-line tool and not a browser?
You need to use -O- to write the page to stdout, and quote the URL so the shell does not treat the & as a background operator:
wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
But it looks like xe.com does not want you to do automated downloads. I would suggest not doing automated downloads at xe.com.
That's because wget sends certain headers that make it easy to detect.
# wget --debug cnet.com 2>&1 | less
[...]
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: www.cnet.com
Connection: Keep-Alive
[...]
Notice the
User-Agent: Wget/1.13.4
I think that if you change it to something like
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
it should work:
# wget --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
That seems to be working fine from here. :D
Did you visit the link in the response?
From http://www.xe.com/errors/noautoextract.htm:
We do offer a number of licensing options which allow you to
incorporate XE.com currency functionality into your software,
websites, and services. For more information, contact us at:
XE.com Licensing
+1 416 214-5606
licensing@xe.com
You will appreciate that the time, effort and expense we put into
creating and maintaining our site is considerable. Our services and
data is proprietary, and the result of many years of hard work.
Unauthorized use of our services, even as a result of a simple mistake
or failure to read the terms of use, is unacceptable.
This sounds like there is an API that you could use but you will have to pay for it. Needless to say, you should respect these terms, not try to get around them.