HTML junk returned when JSON is expected

The following code used to work, but now it returns junk HTML with a 200 status code.
from urllib.request import urlopen
import json

response = urlopen('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers=' + symbol)
data = json.load(response)
If you open the page in Chrome you will see the JSON. But when the URL is opened in Python, I now get:
f1xx.v1xx=v1xx;f1xx[374148]=window;f1xx[647467]=e8NN(f1xx[374148]);f1xx[125983]=n3EE(f1xx[374148]);f1xx[210876]=(function(){var
P6=2;for(;P6 !== 1;){switch(P6){case 2:return {w3:(function(v3){var
v6=2;for(;v6 !== 10;){switch(v6){case 2:var O3=function(W3){var
u6=2;for(;u6 !== 13;){switch(u6){case 2:var o3=[];u6=1;break;case
14:return E3;break;case 8:U3=o3.H8NN(function(){var Z6=2;for(;Z6 !==
1;){switch(Z6){case 2:return 0.5 - B8NN.P8NN();break;}}.....
What should I do to adapt to the new backend change so that I can parse the JSON again?

It is bot protection, meant to prevent people from doing exactly what you are doing. This API endpoint is supposed to be used only by the website itself, not by some Python script!
If you delete your site data and then freshly access the page in the browser, you'll see that it first loads the HTML page you're seeing, which loads some JavaScript, which in turn executes a POST to another URL with some data. Somewhere in that process a number of cookies get set, and finally the code refreshes the page, which then loads the JSON data. At that point, visiting the URL directly returns the data because the correct cookies are already set.
If you look at those requests, you'll see the server returns a header server: rhino-core-shield. If you google that, you can see that it's part of the Reblaze DDoS Protection Platform.
You may have luck with a headless browser like ghost.py or pyppeteer, but I'm not sure how effective it will be; you'll have to try (see the sketch below). The proper way to do this would be to find an official (probably paid) API for getting the information you need instead of relying on non-public endpoints.
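For example, here's a minimal pyppeteer sketch of that approach (untested against this site, and the protection may well detect headless browsers too, so treat it as a starting point, not a guarantee):

import asyncio
import json
from pyppeteer import launch  # pip install pyppeteer

async def fetch_overview(symbol):
    browser = await launch()
    page = await browser.newPage()
    # Load the real site first so the protection JavaScript runs and sets its cookies
    await page.goto('https://www.tipranks.com/', waitUntil='networkidle2')
    # Then hit the API endpoint from inside the same browser session
    await page.goto('https://www.tipranks.com/api/stocks/stockAnalysisOverview/?tickers=' + symbol)
    body = await page.evaluate('() => document.body.innerText')
    await browser.close()
    return json.loads(body)

data = asyncio.get_event_loop().run_until_complete(fetch_overview('AAPL'))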

Related

How does this webpage's data access work?

I'm trying to get data from this site: [1] https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
I found this link where I can get the data in JSON format: [2] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
But there is a problem:
The JSON link doesn't work every time; in fact, sometimes I get a 404 error.
I noticed that if I open the first link [1] before opening the second [2] it works perfectly.
This error is also more frequent when I try to scrape other data on the same site: [3] https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio/piu-giocate/u-o-goal?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
With link [3] I try to get all the "u-o-goal" odds, but the link works only if, before starting my scraping program, I press the "U/O GOAL" button on the main page [1] -> https://i.stack.imgur.com/Nei5u.png
In my code, I'm using Java and htmlunit to scrape the data.
My question is: how does this webpage work? Why can't I open links [2]/[3] directly? I know there is some sort of request-and-approval system behind it, but I can't see where.
You cannot directly open these URLs, since the website (and many like it) uses cookies and bot-prevention/session-tracking techniques to gather data about usage of the site, e.g. requiring a "Referer" to be set.
I'm not going to code a solution for you but I can at least help you understand what you need to do to get to where you want...
I've attempted to summarise how I'd typically unpick a request like this to recreate it, but in its essence, you need to understand the sequence of HTTP requests being made (this is how the web works - HTTP requests).
First you typically start with no session cookies and you access the site directly (no referer).
Once you access a website, typically the server responds with a session cookie for you to communicate back to the server a unique session ID so it has some sort of record of your browser having already been in contact.
Your browser may make more requests (asynchronously), and in doing so it typically sends the cookies and the referring URL (usually the base URL will work; just don't use a referer that starts with something other than "https://www.eurobet.it").
Anything else, you're going to need to figure out. Lots of headers are optional. Lots of query params have defaults.
https://stackoverflow.com/a/64671815/7619034 - here's an answer I've given before that answers this type of question which comes up often enough.
So, to explain a bit further for your specific scenario...
When you access https://www.eurobet.it/it/scommesse/#!/calcio/?temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI, the server responds with HTTP headers:
...
set-cookie: __cfduid=dd38d***********41125; ...
...
The rest doesn't look that relevant.
Going straight to the other request: https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio?prematch=1&live=0&temporalFilter=TEMPORAL_FILTER_OGGI_DOMANI
This HTTP request takes (as input):
cookie: __cfduid=dd38d***********41125; mbox=session#6661556c.....b6e8cc1fa6f03#1608242987; at_check=true; s_ecid=MCMID%***********2021453010; AMCVS_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1; AMCV_45F10C3A53DAEC9F0A490D4D%40AdobeOrg=1075005958%7CMCIDTS%7C18614%7CMCMID%7C91883906030825914429183258312021453010%7CMCAID%7CNONE%7CMCOPTOUT-1608248327s%7CNONE%7CvVersion%7C4.4.1; s_cc=true
...
referer: https://www.eurobet.it/it/scommesse/
...
x-eb-accept-language: it_IT
x-eb-marketid: 5
x-eb-platformid: 1
Cookies are set in an initial request (typically) using Set-Cookie header and then are passed back to the server in subsequent requests using the cookie header.
I'm not certain how many of these values are relevant, but you'd need to figure out where each came from in the chain of HTTP requests between the initial one and this one, and replicate them (see the URL of my previous answer above; warning: this can be time-consuming).
The other headers can be set statically most likely since they probably aren't due to change.
If you have access to curl on the command line, you can attempt to reconstruct some of these requests by hand. Some will be time sensitive since cookies do expire after an amount of time (see set-cookie header details for exactly when). Once you've reconstructed a working request, you can then start coding it in your application.
If you can work all this out, you should be able to reconstruct the chain of HTTP GET requests to get the JSON data you want. Good luck!
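To make that concrete, here's roughly what the reconstructed chain looks like in Python's requests (the question uses Java/htmlunit, but the sequence is identical; whether these particular headers are sufficient is something you'd have to verify against the live site):

import requests

session = requests.Session()  # collects and replays Set-Cookie values for us

# 1. Hit the main page first so the server issues its cookies (__cfduid etc.)
session.get('https://www.eurobet.it/it/scommesse/')

# 2. Replay the API request with the cookies plus the headers seen in dev tools
resp = session.get(
    'https://www.eurobet.it/detail-service/sport-schedule/services/discipline/calcio',
    params={'prematch': 1, 'live': 0, 'temporalFilter': 'TEMPORAL_FILTER_OGGI_DOMANI'},
    headers={
        'referer': 'https://www.eurobet.it/it/scommesse/',
        'x-eb-accept-language': 'it_IT',
        'x-eb-marketid': '5',
        'x-eb-platformid': '1',
    },
)
print(resp.json())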

Is it possible for "view page source" to return different data to Google Scripts' "UrlFetchApp"?

I am trying to find a specific source of data programmatically on a page:
https://finance.yahoo.com/quote/3DP.AX/financials?p=3DP.AX
When I "view page source" on the page, I only find once instance of:
,"3DP.AX":
after which the data I require occurs. So in my code, I have:
UrlFetchApp.fetch("https://finance.yahoo.com/quote/3DP.AX/financials?p=3DP.AX").getContentText().indexOf(",\"3DP.AX\":")
^^ this however returns -1
I managed to find the data I need in the response of UrlFetchApp and discovered it occurs after:
{"quoteData":{"3DP.AX":
However, I cannot find this string in view page source. I cleared my cache and this didn't change the page source results.
Question: Is it possible for the data on "view page source" to be different from the data returned by UrlFetchApp?
It is possible for there to be a difference between what UrlFetchApp.fetch receives and what a browser request receives. However, you don't have enough information to reach that conclusion based on your current code.
To access the results of a fetch request, you need to call getContentText() on the result; make sure you are searching the text body of the response and not the HTTPResponse object itself, since calling indexOf() on the response object will not work.
Additionally, you can pass a simpler string literal to indexOf() and avoid the escaped quotes: indexOf(',"3DP.AX":')
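As a sanity check on the first point, you can confirm outside Apps Script that a server may vary its response by client. A quick sketch, assuming the page is reachable without cookies:

import requests

url = 'https://finance.yahoo.com/quote/3DP.AX/financials?p=3DP.AX'
plain = requests.get(url).text
browserish = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

# If these differ, the server serves different documents to different clients,
# which would explain why "view page source" and UrlFetchApp disagree.
print(',"3DP.AX":' in plain, ',"3DP.AX":' in browserish)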

How to reverse engineer an HTTP API call using a REST console

I'm trying to replicate a request I make on a website (i.e. zoominfo.com) with the same HTTP POST parameters using the Chrome REST console, but it fails for some reason. I'm not sure if there is a missing field or if it's not working because the origin of the request isn't valid. Can someone point me in the right direction? Below is a detailed explanation of the experiment:
ORIGINAL CASE
Basically, if I go to zoominfo.com (registered and all), I see a form page that I need to fill in:
If I hit enter, the site makes an ajax call. If I open the Chrome dev tools and look at the Network tab, I see the details of the ajax call:
Notice the body of the POST has the name John Becker in it:
{"boardMember":{"value":"Include","isUsed":true},"workHistory":{"value":"CurrentAndPast","isUsed":true},"includePartialProfiles":{"value":true,"isUsed":true},"personName":{"value":"john%20becker","isUsed":true},"lastUpdated":{"value":0,"isUsed":true}}
The response is shown under the Response tab:
WHAT I'M TRYING TO DO
basically, replicate what I've done above using a REST console. (Note: there is nothing illegal here; I'm just replacing a Chrome browser action with a REST client action. I'm not hacking anyone, and I'm not getting information I couldn't get the normal way; but if someone feels otherwise, please let me know.)
So I plug in the same parameters as above into the REST console:
Now, I'm not sure about authentication, but just to be safe I entered the same username and password I have for the site into the REST console:
But then I keep getting an error in response to my REST console request:
UPDATE: CORRECT ANSWER:
So, according to JMTyler's answer, I had to simply include criteria in the RAW body and convert it to URL encoding; in addition to that, I had to explicitly set the encoding in the REST console body.
Looking at the Chrome inspector more closely, it turns out I simply had to click on "view source":
to get the URL-encoded value that I needed to put in the RAW body in the REST console:
I also had to set the encoding to gzip,deflate,sdch and things worked fine!
The form is posting all that JSON under the field criteria. You can see this in the screencap of the chrome dev console you posted.
Just start your raw body in the REST console with criteria= and make sure the JSON has been URL-encoded. That should do it.
No authentication is needed because none is passed through the headers in your screencap. Any cookies you have when you load the page normally will also be loaded through rest console, so you don't need to worry about explicitly setting them.
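If it helps to see the shape of the request outside a REST console, here is a rough Python sketch of the same POST; the endpoint URL is a placeholder (take the real one from the Network tab), and requests form-encodes the criteria field for you:

import json
import requests

criteria = {
    "boardMember": {"value": "Include", "isUsed": True},
    "workHistory": {"value": "CurrentAndPast", "isUsed": True},
    "includePartialProfiles": {"value": True, "isUsed": True},
    "personName": {"value": "john becker", "isUsed": True},
    "lastUpdated": {"value": 0, "isUsed": True},
}

# data={...} sends application/x-www-form-urlencoded,
# i.e. a body of the form criteria=<url-encoded JSON>
resp = requests.post(
    'https://www.zoominfo.com/SEARCH-ENDPOINT-FROM-NETWORK-TAB',  # placeholder URL
    data={'criteria': json.dumps(criteria)},
)
print(resp.status_code)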
Reading your problem, I'll make an educated guess:
zoominfo does not provide a RESTful API.
Rest-Console understands and uses HTTP Authentication, which is different from the authentication handler zoominfo implemented.
A possible way to work around may be:
Make a call to the login page via the REST console. You'll get back cookies and a lot more.
In subsequent requests to zoominfo, be sure to include those cookies (likely holding some session information), thereby acting like a browser; see the sketch below.
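A rough requests sketch of that two-step flow; the login URL and form field names here are hypothetical, so pull the real ones from the login request in dev tools:

import requests

session = requests.Session()  # keeps cookies across requests, like a browser

# Step 1: log in so the server hands back its session cookies
session.post('https://www.zoominfo.com/LOGIN-URL',            # hypothetical URL
             data={'username': 'you', 'password': 'secret'})  # hypothetical field names

# Step 2: the same session now carries those cookies automatically
resp = session.post('https://www.zoominfo.com/SEARCH-ENDPOINT',  # hypothetical URL
                    data={'criteria': '...'})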

jsFiddle: how to get JSON?

Hi,
I'm working on a jsFiddle with an OpenLayers example in it.
http://dev.openlayers.org/releases/OpenLayers-2.11/examples/snapping.html
At the moment it's not working because it gets no response to the HTTP requests that load the data. How do I fix that?
The jsFiddle is here: http://jsfiddle.net/TcuxA/6/
Go to the line "// create three vector layers" in the script.
There are three requests for data. If you type the URLs into your browser you get the JSON, but Firebug shows three errors when I run the jsFiddle.
I tried fixing with jsFiddle echo ( http://doc.jsfiddle.net/use/echo.html ), but that didn't work. I don't know how to change the script to load the data otherwise.
Why can't I get the JSON from these URLs? What are good solutions?
What you are experiencing is an exception being thrown by the XMLHttpRequest object, because you are using AJAX to call a service on a different domain. This is better said, for example, in here:
"The XMLHttpRequest object is prevented from calling web services from outside its own domain. This is sensible given that if you called a script in one place and it, in turn, called a script on another server, it could leave an application open to all sorts of malicious scripts, hacks and exploits."
So the easiest thing to do is to code it locally and call local copies of the files (poly.json, line.json and point.json) served from your own local server. For testing whether everything displays on the map, you could hard-code the data into your code. I am not sure how it could be achieved otherwise.
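For instance, something along these lines would grab local copies and let you serve page and data from one origin (the source paths are guesses; adjust them to wherever the example actually hosts its data):

import urllib.request

# Assumed locations of the example data files; verify against the real example
base = 'http://dev.openlayers.org/releases/OpenLayers-2.11/examples/'
for name in ('poly.json', 'line.json', 'point.json'):
    urllib.request.urlretrieve(base + name, name)

# Now serve this directory (e.g. `python -m http.server 8000`) and point the
# three vector layers at http://localhost:8000/poly.json etc., so the page and
# the data share a single origin.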
Another good solution: use GitHub. You can store your example on GitHub, along with the predefined responses for the XHR requests.

Retrieve URL JSON data in MS Access

There is a web service that allows me to go to a URL, with my API key, and request a page of data. The data is returned as JSON. The JSON is well-formed; I ran it through JSONLint and confirmed it's OK.
What I would like to do is retrieve the JSON data from within MS Access (2003 or 2007), if possible, and build a table from that data (the first time through), then append/update the table on subsequent calls to that URL. I would settle for a "pre-step" where I retrieve the information by some other means. Since I have an API key in the URL, I do not want to do this server-side. I would like to keep it all within Access and run it on my PC at home (it's for personal use anyway).
If I have to use another step before the database load, then JavaScript? But I don't know that very well, and I don't even really know what JSON is, other than what I have read on Wikipedia. The URL looks similar to:
http://www.SomeWebService.com/MyAPIKey?p=1&s=50
where: p = page number
s = records per page
AccessDB is a JavaScript library for MS Access; a quick page search says it plays nicely with JSON, and you can do input/output with it. Boo-ya.
http://www.accessdb.org/
EDIT:
Dead URL; Wayback Machine FTW:
http://web.archive.org/web/20131007143335/http://www.accessdb.org/
Also on SourceForge:
http://sourceforge.net/projects/accessdb/
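If AccessDB doesn't pan out, the "pre-step" mentioned in the question could be a small Python script that pages through the API and writes a CSV for Access to import. A sketch, assuming the response is a JSON array of flat records and that an empty page marks the end of the data:

import csv
import json
import urllib.request

rows = []
page = 1
while True:
    # URL pattern from the question: p = page number, s = records per page
    url = 'http://www.SomeWebService.com/MyAPIKey?p=%d&s=50' % page
    data = json.load(urllib.request.urlopen(url))
    if not data:  # assumption: an empty page signals the end
        break
    rows.extend(data)
    page += 1

# Write a CSV that Access can then import as a table
with open('import_me.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)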