I'm looking through my GA logs and I see a Google Chrome browser version 0.A.B.C. Could anybody tell me what this is exactly? Some kind of spider or bot or modified http header?
The full user agent string probably looks something like this:
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
This is most likely a bot, but it could just be someone running an automated script using CasperJS or PhantomJS (or even a shell script using something like lynx) and spoofing the user agent.
The reason it looks like that, instead of something that says "My automated test runner v1.0" (or whatever is relevant to the author), is that this user agent string will pass most regular expression checks as "some version of Chrome" and so won't get filtered out by bot checks that rely on a regular expression to match 'valid' user agent patterns.
To catch it, your site's bot checker would need to blacklist this string, or validate all parts of the Chrome version to make sure they're valid numbers. Even then, you can only do so much checking, since the user agent string is so easy to spoof.
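As a rough illustration (a minimal sketch with made-up checks, not any particular bot filter), here is why a naive pattern accepts this string while a stricter numeric check rejects it:

import re

ua = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) "
      "AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13")

# Naive check: anything that says "Chrome/<something>" counts as Chrome.
print(bool(re.search(r'Chrome/\S+', ua)))                 # True - the spoof passes

# Stricter check: every part of the version must be numeric.
print(bool(re.search(r'Chrome/\d+\.\d+\.\d+\.\d+', ua)))  # False - 0.A.B.C fails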
I'm trying to scrape council tax band data from this website and I'm unable to find the API:
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm?action=pageNavigation&p=0
I've gone through previous answers on Stack Overflow and articles including:
http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
and
https://medium.com/@williamyeny/finding-ratemyprofessors-private-api-with-chrome-developer-tools-bd6747dd228d
I've gone into the Network tab - XHR/All - Headers/ Preview/ Response and the only thing I can find is:
/**/jQuery11130005376436794664263_1594893863666({ "html" : "<li class='navbar-text myprofile_salutation'>Welcome Guest!</li><li role='presentation' class=''><a href='https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/citizenportal/login.htm?redirect_url=https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/dashboard.htm'> Sign In / Register <span class='icon-arrow-right17 pull-right visible-xs-block'></span></a></li>" });
As a test I used AB24 4DE to search and couldn't find it anywhere within a json response.
As far as I can tell the data isn't hidden behind a web socket.
I ran a get request for the sake of it and got:
JSONDecodeError: Expecting value: line 10 column 1 (char 19)
What am I missing?
You're doing the right thing looking at the network tools. I find it's best to zoom in on the overview you're given in the Network tab. You can select any portion of an action in the browser and see what is happening and which requests are made. So you could focus on the requests and responses fired off when you click search. That shows two requests being made: one posts information to the server, and one grabs information from a separate URL.
Suggestion
My suggestion, having had a look at the website, is to use selenium, which is a package that mimics browser activity. Below you'll also see my study of the requests. Essentially, the form generates a unique token every time you do a search, and you have to replicate it in order to get the correct response, which is hard to know in advance.
That being said, you can mimic browser activity using selenium by automatically inputting the postcode and automating the clicking of the search button. You can then grab the page source HTML and use BeautifulSoup to parse it. Here is a minimal reproducible example showing this.
Coding Example
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe')
url = 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm'
driver.get(url)
driver.find_element_by_id('postcode').send_keys('AB24 4DE')
driver.find_element_by_xpath('//input[@class="btn btn-primary cap-submit-inline"]').click()
soup = BeautifulSoup(driver.page_source,'html.parser')
There is also scope to make the browser headless, so it doesn't pop up and all you get back is the parsed HTML.
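If you want to try that, here's a minimal sketch of the headless setup (assuming the same chromedriver path as above; the exact options API can vary between selenium versions):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(executable_path=r'c:\users\aaron\chromedriver.exe',
                          options=options)
# The rest of the script (get, find_element..., page_source) is unchanged.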
Explanation of Code
We import webdriver from selenium, which provides the module necessary to load a browser. We then create an instance of that webdriver; in this case I'm using Chrome, but you can use Firefox or other browsers.
You'll need to download chromedriver from here, https://chromedriver.chromium.org/. Webdriver uses this to open a browser.
We use the webdriver's get method to make chromedriver go to the specific page we want.
Webdriver has a list of find_element_by... methods you can use. The simplest here is find_element_by_id. We can find the id of the input box for the postcode in the HTML, which I've done here. send_keys will send whatever text we want, in this case AB24 4DE.
find_element_by_xpath takes an XPath selector. '//' goes through all of the DOM, we select input, and the [@class="..."] part selects the specific input tag by class. We want the submit button. The click() method will click that button.
We then grab the page source once this click is complete. This is necessary because we then feed it into BeautifulSoup, which gives us the parsed HTML for the postcode we want.
Reverse Engineering the HTTP requests
The section below is really for education, unless someone can work out how to get the unique token before sending requests to the server. Here's how the website works in terms of the search form.
Essentially, looking at the process, it's sending cookies, headers, params, and data to the server. The cookies include a session ID, which doesn't seem to change in my tests. The data variable is where you can change the postcode, but importantly the ABCToken changes every single time you want to do a search, and the param is a check on the server to make sure it's not a bot.
As an example of the HTTP POST request, we send this:
cookies = {
    'JSESSIONID': '1DBAC40138879EB418C14AD83C05AD86',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://ecitizen.aberdeencity.gov.uk',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm',
    'Accept-Language': 'en-US,en;q=0.9',
}
params = (
    ('action', 'validateData'),
)
data = {
    'postcode': 'AB24 2RX',
    'address': '',
    'startSearch': 'Search',
    'ABCToken': '35fbfd26-cb4b-4688-ac01-1e35bcb8f63d'
}
To
https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearch.htm
Then it does an HTTP GET request, with the same JSESSIONID and the unique ABCToken, to grab the data you want from bandsearchresults.htm:
'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax/bandsearchresults.htm'
So it creates a JSESSIONID, which seems to be the same for any postcode in my testing. When you then reuse that same JSESSIONID together with the ABCToken it supplied, against the search results URL, you get the correct data back.
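For completeness, here is a rough sketch of that two-step flow using the requests library. It reuses the values captured above, so it is illustrative only: the ABCToken will be stale by the time you replay it, which is exactly why the selenium route is easier.

import requests

base = 'https://ecitizen.aberdeencity.gov.uk/publicaccesslive/selfservice/services/counciltax'

cookies = {'JSESSIONID': '1DBAC40138879EB418C14AD83C05AD86'}  # captured session id
data = {
    'postcode': 'AB24 2RX',
    'address': '',
    'startSearch': 'Search',
    'ABCToken': '35fbfd26-cb4b-4688-ac01-1e35bcb8f63d',  # changes for every search
}

# Step 1: POST the form data for validation.
# (The captured headers above can also be passed via headers=... if needed.)
requests.post(base + '/bandsearch.htm',
              params={'action': 'validateData'},
              cookies=cookies, data=data)

# Step 2: GET the results page with the same session cookie.
results = requests.get(base + '/bandsearchresults.htm', cookies=cookies)
print(results.text)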
I'm writing an application which will run on a microcontroller (Arduino or Raspberry Pi Zero) with wifi and a web server, and which will be configurable by a web browser without any client-side scripts. It will use a string of HTML forms for the purpose of creating a number of small files on the microcontroller, which the microcontroller will interpret to perform its tasks.
I'm writing it initially on a Slackware Linux system but when it gets close to completion, will move it all to a Raspberry Pi running a customised version of Ubuntu Linux for final tuning.
I'm using lighttpd with mod_fastcgi and libfcgi and I am writing forms handler software in C.
Now, ideally, the responses returned to the server by each form would be processed by its own handler daemon started by mod_fcgi; however, I have not been able to figure out how to configure fastcgi to load more than one handler daemon. My fcgi.conf file is pointed at by a link later in this missive.
I could live with this restriction, but another problem arises. Using just one handler, the action="handlerProgram" field at the top of every form has to point at that one handler. Each form is unique and must be handled differently, so how do I tell the formsHandler program which form is being handled? I need to be able to embed another label into each HTML form somewhere, so that the web client will send it back to the server, which will pass its value to the forms handler via the environment - or some such mechanism. Any clues on how to do this? Please?
Peter.
PS. Here's a link to the related config and html data. HTML Problem
Maybe one of these solutions will help:
In the HTML code, add information about the form to handle after the handler program name in the action attribute, like:
action="/cgi-bin/handlerProgram/id/of/form/to/handle"
In your CGI handlerProgram you'll have the PATH_INFO environment variable set to "/id/of/form/to/handle". Use it to know which form to handle.
In the HTML code, add a hidden input field to your form, like:
<input type="hidden" name="form_to_handle" id="form_to_handle" value="form_id"/>
Just use the form_to_handle field's value in your handlerProgram to know which form to handle.
Joe Hect posted an answer which completely solves this question.
The information which needed to be sent for the form called 'index.htm' is the name of the form. I used the action field "ACTION=/formsHandler.fcgi/index.htm" and below is the contents of the environment returned as reported by echo.fcgi (renamed to formsHandler.fcgi to avoid having to change anything else in my config.). If you can decipher the listing after this page has scrambled it, you will see that the required information is now present in a number of places, including PATH_INFO as suggested. Thank you, Joe.
Now all I have to do is figure out how to vote for you properly.
{
Request number 1
CONTENT_LENGTH: 37
DOCUMENT_ROOT: /home/lighttpd/htdocs
GATEWAY_INTERFACE: CGI/1.1
HTTP_ACCEPT: text/html, application/xhtml+xml, */*
HTTP_ACCEPT_ENCODING: gzip, deflate
HTTP_ACCEPT_LANGUAGE: en-AU
HTTP_CACHE_CONTROL: no-cache
HTTP_CONNECTION: Keep-Alive
HTTP_HOST: 192.168.0.16:6666
HTTP_PRAGMA:
HTTP_RANGE:
HTTP_REFERER: http://192.168.0.16:6666/
HTTP_TE:
HTTP_USER_AGENT: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
HTTP_X_FORWARDED_FOR:
PATH:
PATH_INFO: /index.htm
PATH_TRANSLATED: /home/lighttpd/htdocs/index.htm
QUERY_STRING:
CONTENT_LENGTH: 37
CONTENT:
REMOTE_ADDR: 192.168.0.19
REMOTE_HOST:
REMOTE_PORT: 54159
REQUEST_METHOD: POST
REQUEST_ACTION:
ACTION:
REQUEST_URI: /formsHandler.fcgi/index.htm
REDIRECT_URI:
SCRIPT_FILENAME: /home/lighttpd/htdocs/formsHandler.fcgi
SCRIPT_NAME: /formsHandler.fcgi
SERVER_ADDR: 192.168.0.16
SERVER_ADMIN:
SERVER_NAME: 192.168.0.16
SERVER_PORT: 6666
SERVER_PROTOCOL: HTTP/1.1
SERVER_SIGNATURE:
SERVER_SOFTWARE: lighttpd/1.4.41
}
I want to get HTML code from Windows Phone store pages. So far I have not run into any problems, but today the following error is displayed every time I retrieve data.
[...] Your request appears to be from an automated process.
If this is incorrect, notify us by clicking here to be redirected [...].
I tried to use a proxy in case too many requests were coming from one IP, but this did not help. Do you happen to know why this problem takes place, or have any ideas about possible workarounds? Any help would be very much appreciated. The main goal is to somehow get information about a Windows Phone app from the store.
It seems that they detect the user agent and block the request if it is not valid / known for a device.
I managed to make it work with curl, e.g.:
curl -A 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9' http://www.windowsphone.com/en-us/store/app/pinpoint-by-foundbite/ff9fdf41-aabd-4cac-9086-8710bd327da9
For ASP.NET, if you use HttpWebRequest to get the HTML content, try the following:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9";
For PHP you can set your user agent as well via curl_setopt.
I was not able to find out, whether there is an IP-based block after several requests.
A question regarding Jsoup: I am building a tool that fetches prices from a website. However, this website has streaming content. If I browse manually, I see the prices of 20 mins ago and have to wait about 3 secs to get the current price. Is there any way I can make some kind of delay in Jsoup to be able to obtain the prices in the streaming section? I am using this code:
conn = Jsoup.connect(link).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36");
conn.timeout(5000);
doc = conn.get();
As mentioned in the comments, the site is most likely using some type of scripting that just won't work with Jsoup, since Jsoup just gets the initial HTML response and does not execute any JavaScript.
I wanted to give you some more guidance, though, on where to go from here. The best bet in these cases is to move to another platform for these types of sites. You can migrate to HTMLUnit, which is a headless browser, or Selenium, which can drive HTMLUnit or a real browser like Firefox or Chrome. I would recommend Selenium if you think you will ever need to move past HTMLUnit, as HTMLUnit can sometimes be less stable than the consumer browsers Selenium supports. You can use Selenium with the HTMLUnit driver, giving you the option to move to another browser seamlessly later.
You can use a JavaFX WebView with JavaScript enabled. After waiting the two seconds, you can extract the contents and pass them to Jsoup.
(After loading your url into your WebView using the example above)
String html = (String) view.getEngine().executeScript("document.documentElement.outerHTML");
Document doc = Jsoup.parse(html);
I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.
The www.xe.com website gives the historical lookup tool, and using a detailed URL, one can get the rate table for a specific date, w/o populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies on the day of Oct. 15th, 2012.
Now, assume I have a list of dates, I can loop through the list and change the date part of that URL to get the required page. If I can extract the rates list, then simple grep EUR will give me the relevant exchange rate (I can use awk to specifically extract the rate).
The question is, how can I get the page(s) using Linux command line command? I tried wget but it did not do the job.
If not CLI, is there an easy and straight forward way to programmatically do that (i.e., will require less time than do copy-paste of the dates to the browser's address bar)?
UPDATE 1:
When running:
$ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
I get a file which contains:
<HTML>
<HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
<BODY>
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
</BODY>
</HTML>
so it seems like the server can identify the type of query and blocks the wget. Any way around this?
UPDATE 2:
After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:
You agree that you shall not:
...
f. use any automatic or manual process to collect, harvest, gather, or extract
information about other visitors to or users of the Services, or otherwise
systematically extract data or data fields, including without limitation any
financial and/or currency data or e-mail addresses;
which, I guess, concludes the efforts on this front.
Now, out of curiosity, if wget generates an HTTP request, how does the server know that it came from a command-line tool and not a browser?
You need to use -O- to write to STDOUT:
wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
But it looks like xe.com does not want you to do automated downloads. I would suggest not doing automated downloads at xe.com.
That's because wget sends certain headers that make it easy to detect.
# wget --debug cnet.com | less
[...]
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: www.cnet.com
Connection: Keep-Alive
[...]
Notice the
User-Agent: Wget/1.13.4
I think that if you change that to
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
it would work.
# wget --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
That seems to be working fine from here. :D
Did you visit the link in the response?
From http://www.xe.com/errors/noautoextract.htm:
We do offer a number of licensing options which allow you to
incorporate XE.com currency functionality into your software,
websites, and services. For more information, contact us at:
XE.com Licensing
+1 416 214-5606
licensing@xe.com
You will appreciate that the time, effort and expense we put into
creating and maintaining our site is considerable. Our services and
data is proprietary, and the result of many years of hard work.
Unauthorized use of our services, even as a result of a simple mistake
or failure to read the terms of use, is unacceptable.
This sounds like there is an API that you could use but you will have to pay for it. Needless to say, you should respect these terms, not try to get around them.