Web scraping difficulties for accessing HTML

I am trying to scrape this website: https://stock360.hkej.com
I have analyzed all the XHR requests and found that https://stock360.hkej.com/data/getQuotes/00002?t=1583900958659 is the URL that fetches the data, but I have difficulties accessing it through Chrome directly; it returns "unknown host found".
Can anyone explain what is happening with this HTTP GET call? I understand that 00002 is the stock number and t is the time.

It seems the endpoint requires a specific header for it to work.
I found that it requires a Referer header with the value https://stock360.hkej.com/quotePlus/stockFinancialStatements/1.
I verified this with an example request sent from Postman with that header set.
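To reproduce the same request in code, here is a minimal Python sketch using the requests library (the User-Agent value is an assumption; the Referer is the one found above, and t looks like a millisecond timestamp):

import time
import requests

# Minimal sketch: fetch quotes for stock 00002, sending the Referer header
# that the endpoint appears to require.
url = "https://stock360.hkej.com/data/getQuotes/00002"
headers = {
    "Referer": "https://stock360.hkej.com/quotePlus/stockFinancialStatements/1",
    "User-Agent": "Mozilla/5.0",  # assumption: a browser-like UA may also help
}
params = {"t": int(time.time() * 1000)}  # t appears to be a millisecond timestamp

response = requests.get(url, headers=headers, params=params)
print(response.status_code)
print(response.text)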


HTML junk returned when JSON is expected

The following code used to work, but not anymore; I'm now seeing junk HTML returned with a success code of 200.
from urllib.request import urlopen
import json

response = urlopen('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers=' + symbol)
data = json.load(response)
If you open the page in Chrome you will see the JSON format, but when opened in Python I'm now getting:
f1xx.v1xx=v1xx;f1xx[374148]=window;f1xx[647467]=e8NN(f1xx[374148]);f1xx[125983]=n3EE(f1xx[374148]);f1xx[210876]=(function(){var
P6=2;for(;P6 !== 1;){switch(P6){case 2:return {w3:(function(v3){var
v6=2;for(;v6 !== 10;){switch(v6){case 2:var O3=function(W3){var
u6=2;for(;u6 !== 13;){switch(u6){case 2:var o3=[];u6=1;break;case
14:return E3;break;case 8:U3=o3.H8NN(function(){var Z6=2;for(;Z6 !==
1;){switch(Z6){case 2:return 0.5 - B8NN.P8NN();break;}}.....
What should I be doing to adapt to the new backend change so that I can parse the JSON again?
It is bot protection, there to prevent people from doing exactly what you are doing. This API endpoint is supposed to be used only by the website itself, not by a Python script!
If you delete your site data and then freshly access the page in the browser, you'll see it first loads the HTML page that you see which loads some JavaScript, which then executes a POST to another URL with some data. Somewhere in the process a number of cookies get set and finally the code refreshes the page which then loads the JSON data. At this point visiting the URL directly returns the data because the correct cookies are already set.
If you look at those requests, you'll see the server returns a header server: rhino-core-shield. If you google that, you can see that it's part of the Reblaze DDoS Protection Platform.
You may have luck with a headless browser like ghost.py or pyppeteer, but I'm not sure how effective it will be; you'll have to try. The proper way to do this would be to find an official (probably paid) API for getting the information you need instead of relying on non-public endpoints.
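If you do want to try the headless-browser route, a rough pyppeteer sketch might look something like the one below. It is untested against this site, the wait time is a guess, and whether the bot protection tolerates a headless browser at all is uncertain.

import asyncio
from pyppeteer import launch

async def fetch_overview(symbol):
    # Load the URL in a real browser context so the protection script can run,
    # set its cookies and refresh the page, then read whatever ends up displayed.
    browser = await launch()
    page = await browser.newPage()
    url = 'https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers=' + symbol
    await page.goto(url)
    await asyncio.sleep(5)  # crude wait for the JS challenge and refresh to finish
    text = await page.evaluate('() => document.body.innerText')
    await browser.close()
    return text

print(asyncio.get_event_loop().run_until_complete(fetch_overview('AAPL')))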

How does this webpage's data access work?

I'm trying to get data from this site: [1] https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
I found this link where I can get the data in JSON format: [2] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
But there is a problem:
The JSON link doesn't work every time; in fact, sometimes I get a 404 error.
I noticed that if I open the first link [1] before opening the second [2], it works perfectly.
This error is also more frequent when I try to scrape other data on the same site: [3] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio/piu-giocate/u-o-goal?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
In link [3] I try to get all the "u-o-goal" odds, but it works only if (before starting my program to scrape the data) I press the "U/O GOAL" button on the main page [1] -> https://i.stack.imgur.com/Nei5u.png
In my code, I'm using Java and htmlunit to scrape the data.
My question is: how does this webpage work, and why can't I open links [2]/[3] directly? I know that there is some sort of request-and-approval mechanism behind it, but I can't see where.
You cannot directly open these URLs, since the website (and many like it) uses cookies and bot-prevention techniques/session tracking so it can gather data about usage of its website, e.g. it checks the "Referer" header.
I'm not going to code a solution for you but I can at least help you understand what you need to do to get to where you want...
I've attempted to summarise how I'd typically unpick a request like this to recreate it, but in its essence, you need to understand the sequence of HTTP requests being made (this is how the web works - HTTP requests).
First you typically start with no session cookies and you access the site directly (no referer).
Once you access a website, typically the server responds with a session cookie for you to communicate back to the server a unique session ID so it has some sort of record of your browser having already been in contact.
Your browser may make more requests (asynchronously), and in doing so it typically sends the cookies and the referring URL (usually the base URL will work; just don't use anything that doesn't start with "https://www.eurobet.it").
Anything else, you're going to need to figure out. Lots of headers are optional; lots of query params have defaults.
https://stackoverflow.com/a/64671815/7619034 - here's an answer I've given before that answers this type of question which comes up often enough.
So, to explain a bit further for your specific scenario...
When you access https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI, the server responds with HTTP headers:
...
set-cookie: __cfduid=dd38d***********41125; ...
...
The rest doesn't look that relevant.
Going straight to the other request: https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
This HTTP request takes (as input):
cookie: __cfduid=dd38d***********41125; mbox=session#6661556c.....b6e8cc1fa6f03#1608242987; at_check=true; s_ecid=MCMID%***********2021453010; AMCVS_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1; AMCV_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1075005958%7CMCIDTS%7C18614%7CMCMID%7C91883906030825914429183258312021453010%7CMCAID%7CNONE%7CMCOPTOUT-1608248327s%7CNONE%7CvVersion%7C4.4.1; s_cc=true
...
referer: https://www.eurobet.it/it/scommesse/
...
x-eb-accept-language: it_IT
x-eb-marketid: 5
x-eb-platformid: 1
Cookies are set in an initial request (typically) using Set-Cookie header and then are passed back to the server in subsequent requests using the cookie header.
I'm not certain how many of these values are relevant but you'd need to figure out where each came from in the chain of HTTP requests between the initial one and this one and you'd need to replicate them (see url above of my previous answer - warning this can be time consuming).
The other headers can be set statically most likely since they probably aren't due to change.
If you have access to curl on the command line, you can attempt to reconstruct some of these requests by hand. Some will be time sensitive since cookies do expire after an amount of time (see set-cookie header details for exactly when). Once you've reconstructed a working request, you can then start coding it in your application.
If you can work all this out you should be able to re-construct the chain of HTTP GET requests to get the JSON data you want. Good luck!
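As an illustration of that chain (not a complete solution), here is a rough Python sketch using requests.Session; the x-eb-* values are copied from the captured request above, while the User-Agent and exactly which headers are actually required are assumptions you would need to verify:

import requests

session = requests.Session()  # keeps cookies (__cfduid, etc.) between requests

# Step 1: hit the main page first so the server sets its session cookies.
session.get(
    "https://www.eurobet.it/it/scommesse/",
    headers={"User-Agent": "Mozilla/5.0"},  # assumption: a browser-like UA
)

# Step 2: call the JSON endpoint, passing back the referer and the x-eb-*
# headers observed in the browser's request; the cookies go along automatically.
response = session.get(
    "https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio",
    params={
        "prematch": 1,
        "live": 0,
        "temporalFilter": "TEMPORAL_FILTER_OGGI_DOMANI",
    },
    headers={
        "User-Agent": "Mozilla/5.0",
        "referer": "https://www.eurobet.it/it/scommesse/",
        "x-eb-accept-language": "it_IT",
        "x-eb-marketid": "5",
        "x-eb-platformid": "1",
    },
)
print(response.status_code)
print(response.json())

The same sequence can be reproduced with htmlunit or plain HTTP calls in Java; the important part is preserving the cookies and headers between the two requests.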

Accessing data from an HTTP request

I am writing an HTTP handler for a server and I am looking directly at the HTTP requests as they come in from different clients. I can easily deal with plain HTTP requests; the problem occurs when I get a GET or POST request that carries data. I do not know how to access the data from the GET or the POST, therefore I cannot continue. Could someone please point me towards something that explains how to access this data? Thanks in advance.
Answer:
In a GET request the data comes in the URL itself, so just parse the URL from the HTTP request and look for the question mark and the arguments after it.
For a POST request there are two different ways, but the main one is that the arguments are put in the body of the request, like this:
q=hello&v=world
The length of the body is specified in the request as well; if you need it, it is in the Content-Length header.
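As a rough illustration of both cases, here is a small Python sketch that pulls the arguments out of a raw request; the exact framing of the requests your server sees may differ:

from urllib.parse import urlsplit, parse_qs

def extract_params(raw_request: bytes):
    # Split the raw HTTP request into the head (request line + headers) and body.
    head, _, body = raw_request.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, _version = lines[0].split(" ")

    if method == "GET":
        # GET: the data is in the URL after the question mark.
        return parse_qs(urlsplit(target).query)

    if method == "POST":
        # POST (form-encoded): the data is in the body; Content-Length says how long it is.
        headers = dict(line.split(":", 1) for line in lines[1:] if ":" in line)
        length = int(headers.get("Content-Length", "0").strip())
        return parse_qs(body[:length].decode("iso-8859-1"))

    return {}

print(extract_params(b"GET /search?q=hello&v=world HTTP/1.1\r\nHost: example.com\r\n\r\n"))
print(extract_params(
    b"POST /submit HTTP/1.1\r\nHost: example.com\r\nContent-Length: 15\r\n\r\nq=hello&v=world"
))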

Error in open.connection(x,"rb") : HTTP error 406

I am trying to read the contents of a website using read_html in R. However, for some websites like http://benchmarkrealestate.com/, I get this error: Error in open.connection(x,"rb") : HTTP error 406
What does this error mean? It only happens with some websites. I tried to look it up online, but wasn't able to find the exact reason why I get this error.
How do I fix this?
406 Not Acceptable
The requested resource is capable of generating only content not acceptable according to the Accept headers sent in the request.
The sentence above is lifted right off of Wikipedia.
Basically, whenever a Web crawler makes a request to a website, it often identifies itself, its application type and other information by submitting a characteristic identification string to its operating peer, i.e. the web server. In this case, this identification is transmitted in a header field called User-Agent.
One way to have the content of the web page returned to your console is to set your user-agent information to something identifiable with the help of the curl package:
library(xml2)
library(rvest)
library(curl)
web_content <- read_html(curl('http://benchmarkrealestate.com/', handle = new_handle("useragent" = "Mozilla/5.0")))
You may also want to read up on header fields.

Best practice for 404 page for API

I am writing a REST API now and I am facing one problem: what should a 404 page in an API look like?
Of course it would be better to show the user a link to the API documentation.
But what about the page in general?
Should it be a nicely designed page or just simple text, perhaps in JSON or XML format?
If the page has a lot of images/scripts, the network connection is used inefficiently.
We could, of course, return only the headers of the HTTP response, but that seems like a bad solution.
I have looked at several well-known APIs and how they implement their 404 pages.
Both Flickr and Twitter have 404 pages with some images and CSS.
But GitHub has a pretty simple response, just JSON with the following content:
{
"message": "Not Found",
"documentation_url": "https://developer.github.com/v3"
}
I think this is a good example of a response (without images and CSS); if not, please correct me.
So my question is: what is the best practice for a 404 API response? Should it be pretty simple (like GitHub's), or is it better to add some other useful information about the API?
Thanks everyone in advance.
If someone requests an API, they will most likely catch the 404 exception and look at the HttpStatusCode. If it's 404, they don't need more information and will just report that the user (or whatever resource was requested) is not found.
So it doesn't really matter what you do, as 99% of the requests won't even download the response body.
A 404 Not Found response in an API is not required to have any content at all. The client code will probably look at the status code before it looks at the content of the response.
Every REST client I've written so far does not concern itself with the body of a 404 Not Found response. I cannot imagine anything in the body that would be more useful than the fact that the resource was not found.
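For illustration, a minimal client-side sketch of that behaviour in Python (the endpoint URL is made up):

import requests

# Hypothetical endpoint; the point is only that the client branches on the
# status code and never needs to inspect the body of a 404 response.
response = requests.get("https://api.example.com/users/42")

if response.status_code == 404:
    print("User not found")
else:
    response.raise_for_status()  # surface any other HTTP error
    print(response.json())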