How does this webpage's data access work? - html

I'm trying to get data from this site: [1] https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
I found this link where I can get the data in JSON format: [2] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
But there is a problem:
The JSON link doesn't work every time; in fact, sometimes I get a 404 error.
I noticed that if I open the first link [1] before opening the second [2] it works perfectly.
This error is also more frequent when I try to scrape other data on the same site: [3] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio/piu-giocate/u-o-goal?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
In this link [3] I try to get all the "u-o-goal" odds, but it works only if, before starting my program to scrape the data, I press the "U/O GOAL" button on the main page [1] -> https://i.stack.imgur.com/Nei5u.png
In my code, I'm using Java and htmlunit to scrape the data.
My question is: how does this webpage work? Why can't I open links [2]/[3] directly? I know there is some sort of request-and-approval system behind it, but I can't see where.

You cannot directly open these URLs because the website (and many like it) uses cookies, session tracking, and bot-prevention techniques so it can gather data about usage of the site, e.g. it expects a valid "Referer" header.
I'm not going to code a solution for you, but I can at least help you understand what you need to do to get where you want...
I've attempted to summarise how I'd typically unpick a request like this to recreate it, but in its essence, you need to understand the sequence of HTTP requests being made (this is how the web works - HTTP requests).
First you typically start with no session cookies and you access the site directly (no referer).
Once you access a website, the server typically responds with a session cookie: a unique session ID your browser sends back on subsequent requests so the server has a record of your browser having already been in contact.
Your browser may make more requests (asynchronously), and in doing so it typically sends the cookies and the referring URL (usually the base URL will work; just don't use anything that doesn't start with "https://www.eurobet.it").
Anything else, you're going to need to figure out yourself. Lots of headers are optional. Lots of query params have defaults.
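To make that concrete, here's a minimal sketch of the start of that flow in Python with the requests library (untested against this site; the same idea applies with htmlunit in Java). A Session starts with no cookies, like a fresh browser, and stores whatever the server's Set-Cookie headers send back:

import requests

session = requests.Session()  # starts with no cookies, like a fresh browser
response = session.get("https://www.eurobet.it/it/scommesse/")  # first visit: no referer sent

# whatever Set-Cookie headers came back are now stored on the session
print(session.cookies.get_dict())

# any later request made through this session sends those cookies back automatically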
https://stackoverflow.com/a/64671815/7619034 - here's an answer I've given before to this type of question, which comes up often enough.
So, to explain a bit further for your specific scenario...
When you access https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI, the server responds with HTTP headers:
...
set-cookie: __cfduid=dd38d***********41125; ...
...
The rest doesn't look that relevant.
Going straight to the other request: https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
This HTTP request takes (as input):
cookie: __cfduid=dd38d***********41125; mbox=session#6661556c.....b6e8cc1fa6f03#1608242987; at_check=true; s_ecid=MCMID%***********2021453010; AMCVS_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1; AMCV_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1075005958%7CMCIDTS%7C18614%7CMCMID%7C91883906030825914429183258312021453010%7CMCAID%7CNONE%7CMCOPTOUT-1608248327s%7CNONE%7CvVersion%7C4.4.1; s_cc=true
...
referer: https://www.eurobet.it/it/scommesse/
...
x-eb-accept-language: it_IT
x-eb-marketid: 5
x-eb-platformid: 1
Cookies are (typically) set in an initial response using the Set-Cookie header and are then passed back to the server in subsequent requests using the cookie header.
I'm not certain how many of these values are relevant, but you'd need to figure out where each came from in the chain of HTTP requests between the initial one and this one, and you'd need to replicate them (see the URL of my previous answer above; warning: this can be time-consuming).
The other headers can most likely be set statically, since they probably aren't due to change.
If you have access to curl on the command line, you can attempt to reconstruct some of these requests by hand. Some will be time-sensitive, since cookies do expire after some amount of time (see the set-cookie header details for exactly when). Once you've reconstructed a working request, you can then start coding it in your application.
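For example, a sketch of the reconstructed request in Python with requests, using the captured headers above (whether all of them are actually required is something you'd have to verify yourself):

import requests

session = requests.Session()

# hit the main page first so the session picks up the site's cookies
session.get("https://www.eurobet.it/it/scommesse/")

# static headers copied from the captured request; some may turn out to be optional
headers = {
    "referer": "https://www.eurobet.it/it/scommesse/",
    "x-eb-accept-language": "it_IT",
    "x-eb-marketid": "5",
    "x-eb-platformid": "1",
}

url = ("https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio"
       "?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI")
response = session.get(url, headers=headers)
print(response.status_code, response.json())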
If you can work all this out, you should be able to reconstruct the chain of HTTP GET requests to get the JSON data you want. Good luck!

Related

HTML junk returned when JSON is expected

The following code used to work, but not anymore; I'm now seeing junk HTML returned with a success code of 200.
import json
from urllib.request import urlopen

response = urlopen('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers='+symbol)
data = json.load(response)
If you open the URL in Chrome you will see the JSON. But when opened in Python, I'm now getting:
f1xx.v1xx=v1xx;f1xx[374148]=window;f1xx[647467]=e8NN(f1xx[374148]);f1xx[125983]=n3EE(f1xx[374148]);f1xx[210876]=(function(){var
P6=2;for(;P6 !== 1;){switch(P6){case 2:return {w3:(function(v3){var
v6=2;for(;v6 !== 10;){switch(v6){case 2:var O3=function(W3){var
u6=2;for(;u6 !== 13;){switch(u6){case 2:var o3=[];u6=1;break;case
14:return E3;break;case 8:U3=o3.H8NN(function(){var Z6=2;for(;Z6 !==
1;){switch(Z6){case 2:return 0.5 - B8NN.P8NN();break;}}.....
What should I be doing to adapt to the new backend change so that I can parse the JSON again?
It is bot protection, meant to prevent people from doing exactly what you are doing. This API endpoint is supposed to be used only by the website itself, not by some Python script!
If you delete your site data and then freshly access the page in the browser, you'll see that it first loads the HTML page, which loads some JavaScript, which then executes a POST to another URL with some data. Somewhere in that process a number of cookies get set, and finally the code refreshes the page, which then loads the JSON data. At that point, visiting the URL directly returns the data, because the correct cookies are already set.
If you look at those requests, you'll see the server returns a header server: rhino-core-shield. If you google that, you can see that it's part of the Reblaze DDoS Protection Platform.
You may have luck with a headless browser like ghost.py or pyppeteer, but I'm not sure how effective it will be; you'll have to try. The proper way to do this would be to find an official (probably paid) API for getting the information you need instead of relying on non-public endpoints.
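As a rough sketch of the headless-browser route with pyppeteer (untested; the wait condition and the assumption that loading the home page first is enough to set the protection cookies are both guesses):

import asyncio
from pyppeteer import launch

async def fetch_json(symbol):
    browser = await launch()
    page = await browser.newPage()
    # load the normal page first so its JavaScript runs, performs the
    # hidden POST, and sets the protection cookies
    await page.goto('https://www.tipranks.com/', waitUntil='networkidle0')
    # with the cookies set, the API URL should now return real JSON
    await page.goto('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers=' + symbol)
    text = await page.evaluate('() => document.body.innerText')
    await browser.close()
    return text

print(asyncio.run(fetch_json('AAPL')))  # 'AAPL' is just an example ticker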

Mandrill webhooks timeout error

I have been using Mandrill webhooks for a long time, and until now I hadn't encountered this error.
But now I see this error, and I am not sure what has caused it.
Please let me know why this might be happening and what the possible solution might be.
Is it related to my server's handling capacity? I have checked for that: Mandrill doesn't send too many concurrent requests to my Apache server, so according to me that is not the issue, and MySQL also doesn't seem to be the bottleneck, though I have not used any benchmarking tool to confirm this.
Please let me know the solution if you have encountered something like this.
It seems that the URL is not responding to the request. There could be a few reasons:
If the URL points to an internal server, a firewall could be blocking it or a port number (if given).
Once set up, the webhook will send via the POST HTTP verb; however, for testing it sends a HEAD request. Quite often web servers (e.g. IIS) limit which verbs they respond to and will only answer GET and POST requests.
If that's working, your URL should respond with headers only to acknowledge the request (HEAD doesn't allow any page content to be sent), so for a HEAD request it should only do something like this:
<?php header( 'Content-Type:' ); // returning 200 ?>
More details on their site:
http://help.mandrill.com/entries/22024856-Why-can-t-my-webhook-or-inbound-route-URL-be-verified-
You may wish to try the tool below to see what HTTP header result is being returned (if any), or whether another error is returned. Just remember that if the URL is internal, it could be blocked to the outside world.
https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm?hl=en
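If you'd rather check from code, here's a quick sketch in Python (the webhook URL is a hypothetical placeholder; substitute your real endpoint):

import requests

# send the same kind of HEAD request Mandrill uses to verify the webhook URL
resp = requests.head("https://example.com/mandrill-webhook", timeout=10)  # hypothetical URL
print(resp.status_code)  # should be a success code, e.g. 200
print(resp.headers)      # HEAD responses carry headers only, no body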

How to reverse engineer an HTTP API call using REST console

I'm trying to replicate a request I make on a website (i.e. zoominfo.com) with the same HTTP POST parameters using the Chrome REST console, but it fails for some reason. I'm not sure if there is a missing field or if it's not working because the origin of the request isn't valid. Can someone point me in the right direction? Below is a detailed explanation of the experiment:
ORIGINAL CASE
Basically, if I go to zoominfo.com (registered and all), I see a form page that I need to fill in:
If I hit enter, the site makes an AJAX call. If I open the Chrome dev tools and go to the Network tab, I see the details of the AJAX call:
Notice the body of the POST has the name John Becker in it:
{"boardMember":{"value":"Include","isUsed":true},"workHistory":{"value":"CurrentAndPast","isUsed":true},"includePartialProfiles":{"value":true,"isUsed":true},"personName":{"value":"john%20becker","isUsed":true},"lastUpdated":{"value":0,"isUsed":true}}
The response is shown under the response tab:
WHAT I'M TRYING TO DO
Basically, replicate what I've done above using a REST console. (Note: there is nothing illegal here; I'm just replacing a Chrome browser action with a REST client action. I'm not hacking anyone, and I'm not getting information I can't get the normal way, but if someone feels otherwise, please let me know.)
So I plug the same parameters as above into the REST console:
Now, I'm not sure about authentication, but just to be safe I entered the same username and password I have for the site into the REST console:
But then I keep getting an error as the response to my REST console's request:
UPDATE: CORRECT ANSWER:
So, according to JMTyler's answer, I simply had to include criteria in the RAW body and convert it to URL encoding; in addition, I had to explicitly set the encoding in the REST console body.
Looking at the Chrome inspector more closely, it turns out that I simply had to click on "view source":
...to get the URL-encoded value that I needed to put in the RAW body in the REST console:
I also had to set the encoding to gzip,deflate,sdch and things worked fine!
The form is posting all that JSON under the field criteria. You can see this in the screencap of the Chrome dev console you posted.
Just start your raw body in the REST console with criteria= and make sure the JSON has been URL-encoded. That should do it.
No authentication is needed because none is passed through the headers in your screencap. Any cookies you have when you load the page normally will also be loaded through the REST console, so you don't need to worry about explicitly setting them.
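For what it's worth, here's a sketch of the same request in Python (the endpoint URL is a hypothetical placeholder; take the real one from the dev tools capture, and note the captured personName value was already URL-encoded as john%20becker):

import json
import requests

SEARCH_URL = "https://www.zoominfo.com/search"  # hypothetical; copy the real URL from dev tools

criteria = {
    "boardMember": {"value": "Include", "isUsed": True},
    "workHistory": {"value": "CurrentAndPast", "isUsed": True},
    "includePartialProfiles": {"value": True, "isUsed": True},
    "personName": {"value": "john becker", "isUsed": True},
    "lastUpdated": {"value": 0, "isUsed": True},
}

# posting as form data sends criteria=<url-encoded JSON>,
# which is exactly what the raw body "criteria=..." trick does by hand
response = requests.post(SEARCH_URL, data={"criteria": json.dumps(criteria)})
print(response.status_code)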
Reading your problem, I'll make an educated guess:
zoominfo does not provide a RESTful API.
The REST console understands and uses HTTP authentication, which is different from the authentication handler zoominfo implemented.
A possible way to work around may be:
Make a call to the login page via the REST console. You'll get back cookies and a lot more.
In subsequent requests to zoominfo, be sure to include those cookies (likely holding some session information), thereby acting like a browser, as sketched below.
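A sketch of those two steps in Python (the login URL and form field names are hypothetical; the real ones would come from watching the login request in the dev tools):

import requests

session = requests.Session()

# step 1: log in; the session stores whatever cookies come back
session.post(
    "https://www.zoominfo.com/login",                # hypothetical login URL
    data={"username": "you@example.com", "password": "secret"},  # hypothetical field names
)

# step 2: subsequent requests carry those session cookies automatically,
# so the server treats them like requests from the logged-in browser
response = session.post(
    "https://www.zoominfo.com/search",               # hypothetical search endpoint
    data={"criteria": "..."},                        # the URL-encoded criteria from above
)
print(response.status_code)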

cURL from a gamecast

I'm wondering if there is any way to get scores from a gamecast that uses JavaScript or Flash to update the content dynamically. Here's an example: http://www.cstv.com/gametracker/launch/gt_wlacros.html?sport=wlacros&camefrom=&startschool=md&event=952412&school=cs&
How could I pull the scores from the teams out of this page?
Really, what you need to do is sniff the requests being made under the hood. You can use any sort of HTTP sniffer; I use the Live HTTP Headers extension for Firefox. Start capturing, then click the link above and you'll see all sorts of requests. The underlying data seems to be coming from http://origin.livestats.www.cstv.com. I got this request, which has a lot of useful player stats from the game:
http://origin.livestats.www.cstv.com/livestats/data/w-lacros/952412/player_stats.xml?344026907808
http://origin.livestats.www.cstv.com/livestats/data/w-lacros/952412/summary.xml?644493847800
(Note: the second URL throws an XML parse error, but you could still try to parse it manually.)
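For example, a rough sketch in Python that fetches the player-stats feed and dumps the element tree so you can find where the scores live (the element names are unknown here, and the trailing number on the URLs looks like a cache-buster, so it's dropped):

import urllib.request
import xml.etree.ElementTree as ET

url = "http://origin.livestats.www.cstv.com/livestats/data/w-lacros/952412/player_stats.xml"
xml_data = urllib.request.urlopen(url).read()

# walk every element, printing tag, attributes, and text,
# to discover which elements hold the team scores
root = ET.fromstring(xml_data)
for element in root.iter():
    print(element.tag, element.attrib, (element.text or "").strip())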

Determine which webpage an object in a packet capture is associated with

I am currently working on taking a packet capture and working backwards to determine which objects are associated with each page request. For example, if a packet capture contains requests for two different webpages, I want to be able to determine, for each object (TCP stream), which root page it is associated with. Is there an easy way to do this?
I know there are tools that will isolate the TCP streams and pull out the data within them; however, I am not looking to replicate the webpage. I am simply looking to associate each stream with the original page that requested it.
What you are trying to do is reconstruct the "call graph" of a browsing session. For a simple analysis, you can inspect just the HTTP headers; Bro makes this process very convenient. If site A loads site B, A typically shows up in the Referer header of B.
However, if you aim for completeness, this task becomes a daunting challenge: you need to parse the HTTP body payload and even JavaScript to determine all the URLs that are being created at runtime in the client, e.g., via AJAX, iframes, and friends.
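As a minimal sketch of the simple Referer-based grouping (the input pairs are made up for illustration; in practice you'd extract the request URL and Referer header of each stream from the capture, e.g. via Bro's HTTP log):

from collections import defaultdict

# hypothetical (request_url, referer) pairs pulled from a capture
requests_seen = [
    ("http://site-a.example/index.html", None),
    ("http://site-a.example/app.js", "http://site-a.example/index.html"),
    ("http://site-b.example/index.html", None),
    ("http://site-b.example/logo.png", "http://site-b.example/index.html"),
]

# group each object under the page named in its Referer header;
# requests with no Referer are treated as root pages
pages = defaultdict(list)
for url, referer in requests_seen:
    if referer is None:
        pages[url]            # register a root page with no objects yet
    else:
        pages[referer].append(url)

for page, objects in pages.items():
    print(page)
    for obj in objects:
        print("    " + obj)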