Groovy: CyberNeko | User Agents | Browser Version - html

I'm currently using CyberNeko in an attempt to grab information I want from a website. However, I believe the website checks the user agent/browser version to prevent scripts from simply grabbing the URL content.
I am aware of using htmlunit to change the browser version, but I'm not sure whether I can do the same using CyberNeko.
Does anyone know if it's possible to do such a thing?

I've never used CyberNeko, but I thought it was just an HTML parser, i.e. I didn't think you could use it to issue the HTTP requests and actually download the web page.
The problem could be that the HTTP request issued by CyberNeko is missing various headers, such as the User-Agent header. An easy way to ensure that the HTTP request looks like a request sent from a browser is to use HttpClient instead of CyberNeko to download the web page. There's some example code available here.
Once you've successfully downloaded the page, use CyberNeko to parse out the bits you're interested in.
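The answer above suggests HttpClient on the JVM. As a rough illustration of the same two-step idea (download with browser-like headers, then parse separately), here is a minimal sketch in Python using only the standard library. The User-Agent string, the header set, and the helper names (build_request, fetch, extract_title) are assumptions made for this example, and the title extraction merely stands in for whatever parsing CyberNeko would do.

```python
import urllib.request
from html.parser import HTMLParser

# Assumption: any realistic browser User-Agent string would do here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url):
    """Build (but do not send) a GET request carrying browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Download the page and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collects the text inside <title>, standing in for the parsing step."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Parse already-downloaded HTML and return the page title."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Keeping the download and the parsing in separate functions mirrors the advice in the answer: once the request looks like a browser's, the parser only ever sees the HTML string.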

Related

Where is the host website pulling the navigator.language information from, and does that matter when using Python Requests module?

I'm trying to understand the underlying mechanism behind navigator.language.
I'm using Python3 and Requests on a headless server to scrape some websites. I have 'Accept-Language' set in my headers to the language I want:
'Accept-Language': 'en-US,en;q=0.8'
I see that the website has a function built in to test what my navigator.language is.
The website doesn't seem to be consistently recognizing my default language.
So I'm trying to understand where navigator.language actually gets its information from. Explanations I've seen suggest it is a browser setting, meaning that if I were personally going to the website through my browser, my browser would somehow communicate my preferred language when the website called for navigator.language.
But as I'm using Python and Requests, I'm not actually using a real browser, so there shouldn't be any default browser settings. In this case, does navigator.language get its information from the Requests headers? How do the Requests module and navigator.language interact, if at all?

Http redirect for content on relative paths

Essentially, my use case is this: a 3rd party server only supports POST on a specific integration url, but I want a browser to hit it from a normal html link. (I have no control over either system's code; I can only configure the destination url for the link.)
To solve this I have written a web hosted app (done in Mirth Connect, but the server tech in theory shouldn't matter). The objective of my app is to catch the GET and convert it to a POST.
My systems logic:
My web server receives an HTTP GET from a browser, grabs the query strings.
The server then performs an HTTP POST on a 3rd party server, and grabs the html result
The server then returns the original html and delivers it as the result to the original http request from the browser
This works great, the issue comes in with content hosted on the 3rd party server that is referenced with a relative path (css, js, images, etc).
Because I have "tricked" the browser into thinking it received the html from my system, it looks on my server for the content (which will all 404).
Without having to handle the fetching of all the content myself, is there a way to tell a browser to redirect all further queries to the 3rd party server?
I tried making my HTTP GET return a status 301 or 302 with the location being the base address of the 3rd party server, but this obviously tells the browser to redirect completely.
Figured it out.
I just had to intercept the html and inject a BASE tag pointing at the 3rd party server.
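The fix can be sketched roughly as follows. This is only an illustration in Python, not the Mirth Connect code the poster actually used; the function name inject_base and the example URL are made up for the sketch. It inserts a base element right after the opening head tag so that the browser resolves relative css, js and image paths against the 3rd party server.

```python
import re
from html import escape


def inject_base(html, base_url):
    """Insert <base href=...> right after the opening <head> tag.

    If the document has no <head> tag, the base tag is prepended instead.
    """
    base_tag = '<base href="%s">' % escape(base_url, quote=True)
    # \b keeps this from matching <header>; attributes on <head> are allowed.
    head_open = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
    match = head_open.search(html)
    if match is None:
        return base_tag + html
    return html[:match.end()] + base_tag + html[match.end():]


# Hypothetical 3rd party address, used only for illustration.
page = '<html><head><link href="style.css"></head><body></body></html>'
print(inject_base(page, "http://thirdparty.example/app/"))
```

The base tag has to come before any element that uses a relative URL, which is why it is placed immediately after the opening head tag.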

Where is the Data stored on Website

I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
If you see the company name - Agilent Technologies Inc.
It's neither in the page source nor in any JSON format.
But it does show in the DOM in the Chrome Developer Tools.
I have looked at and analysed almost every request that is sent, but still couldn't find where this data is saved.
By "where the data is saved" I mean: where can I scrape that data from, if I'm using python-requests and BeautifulSoup?
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
I am still learning python, and it would be a very useful information if someone helps me with this.
Thanks in advance.
After the HTML is loaded, JavaScript requests the data through an XMLHttpRequest, and the data is inserted into the page as soon as the response is received on your client. That's why you see the DOM element in the element inspector.
You didn't mention what goal you want to achieve or what tool you are using. Please be specific in your question. If you have no idea about this kind of pattern, search for AngularJS and look at some examples.
I do see an XMLHTTPREQUEST made, not sure what that means, or if that is the clue to my answer.
It means that JavaScript embedded in the page is sending an extra HTTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and the JavaScript in the page is then injecting the text into the DOM in the appropriate place.
Where is the Data stored on Website
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.
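For scraping, the practical consequence of the XMLHttpRequest explanation is that the data would come from the response to that extra request rather than from the page source. The sketch below assumes, purely for illustration, that the response body is JSON with a "results" list of objects carrying a "name" field; the real URL and response shape would have to be read off the browser's network panel, and nothing here is taken from the actual site.

```python
import json


def company_names(payload):
    """Pull company names out of a JSON body shaped like the assumed response."""
    data = json.loads(payload)
    # Entries without a "name" key are skipped rather than raising KeyError.
    return [item["name"] for item in data.get("results", []) if "name" in item]


# Hypothetical response body, standing in for what the XHR might return.
sample = '{"results": [{"name": "Agilent Technologies Inc."}, {"id": 2}]}'
print(company_names(sample))
```

With python-requests, the payload string would be the body of the response to the same request the page's JavaScript makes; BeautifulSoup is not needed if that body is JSON.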

Serving Json from WCF Service with no extension in IIS

I have a WCF service set up to serve data through multiple endpoints (SOAP, JSON and XML). The SOAP and XML endpoints are working perfectly, but when I try to view the JSON I get a prompt to download a file with the JSON results instead of the results being displayed in the browser. This probably won't matter, as the client will most likely be consuming the data from some sort of .NET environment which will be able to handle the response natively, but I wanted to see if there was a way to display the JSON results in the browser just like the XML results.
An example of the url I am using to get the results:
http://localhost/api/Service.svc/json/GetResults?name=Test&test=test
This then prompts me to download a file named "GetResults" with no extension, and the file type is application/json.
If your goal is to view the content of the json response in the browser then change your settings in Firefox or else use another browser. I have tried a similar thing with IE and it showed the json content in the browser without making any changes. Not sure what Chrome will do.
I have a similar situation with a rest call that I post a request to. I was making a mock rest service in Grails and noticed that when I hit the live server or my mock server with Firefox it kept asking me to download the file, but not with IE. The problem I'm dealing with now is that I am trying to hit my mock endpoint with SoapUI and it is also asking me to download the file. If I hit the live server with SoapUI it does not ask me to download the file.
Still trying to figure this issue out.
This is exactly the desired behavior. The return response's content type is "application/json". Most browsers cannot display content with this content type inline (unless manually configured), so they prompt you to download the file.
If you actually save this file and open it with, say, Notepad, you will notice that the pure JSON response is contained in the file.
The browser's inability to handle this content type, and the forced download, is almost never an issue, however. The reason is that the typical consumers of this JSON endpoint are ASP.NET AJAX framework-powered webpages (which automagically make these requests and parse the responses by themselves), scripting environments like Python or Perl (which again would just get the responses and then parse them), or custom JavaScript frameworks.
Hope this helps!
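To illustrate the point about scripting environments, here is a minimal Python sketch of how a client might treat such a response: it checks the declared content type and parses the body itself instead of trying to display it. The helper name parse_if_json and the sample body are assumptions for this example.

```python
import json


def parse_if_json(content_type, body):
    """Return the decoded object when the response is declared as JSON, else None."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/json":
        return None
    return json.loads(body)


# Hypothetical body, standing in for what the JSON endpoint might return.
print(parse_if_json("application/json; charset=utf-8", '{"name": "Test"}'))
```

A script consuming the endpoint never hits the download prompt, because it reads the body directly.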

Identify Webserver & Script of a website

I have got two simple questions:
1. How can I tell what server a website is on? I remember I used to read the HTTP Host Header to identify the type of server. Is there any tool to do it?
2a. A lot of websites have the page extension .html and you just know they are not html. How can I tell what programming language is behind them?
2b. For ASPX, I think IIS can map the extension, so it will show HTML instead of ASPX, right?
Cheers
1.
Yes, you can check the HTTP response header "Server". Examples of responses:
-Microsoft-IIS/6.0
-GFE/1.3
-Server Apache/2.2.11 (Ubuntu) PHP/5.2.6-3ubuntu4.2 with Suhosin-Patch
You can also check "X-Powered-By" on some servers, example:
-PHP/5.2.6-3ubuntu4.2
-ASP.NET
You can do this in Firefox with Firebug, for example. Go to the Net tab, pick a request, select Headers and check under the response headers. You could also do this in Fiddler or any other HTTP sniffer.
2a)
See my first answer
2b)
Yes, you can map .html or anything else as an "asp.net" extension, meaning that the extension will be handled by the web application. A common approach is to have an HttpHandler registered in web.config that catches that extension.
I'm not sure what the end goal of these questions is. If you tell us the purpose, maybe we could answer better.
Look at the HTTP headers. This works as long as the Server admin hasn't disabled them (which he usually doesn't).
Try http://kalender-365.de/ip/get-http-header.php
2a. This actually works with all servers and all extensions. Some interpreters, such as PHP, send a special X-Powered-By HTTP header (which can be disabled, however).
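The header check described in these answers can also be scripted. Below is a minimal Python sketch, assuming the server has not disabled the headers; the function names describe_server and fetch_headers are made up for this example, and the sample values are the ones quoted in the answer above.

```python
import urllib.request


def describe_server(headers):
    """Pick the Server and X-Powered-By values out of response headers.

    Header names are matched case-insensitively; missing headers yield None.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    return {
        "server": lowered.get("server"),
        "powered_by": lowered.get("x-powered-by"),
    }


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers as a dict."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return dict(resp.headers.items())


# Offline example using header values quoted in the answer.
print(describe_server({"Server": "Microsoft-IIS/6.0", "X-Powered-By": "ASP.NET"}))
```

Calling describe_server(fetch_headers(url)) would combine the two steps for a live site. Both headers are optional and can be altered by the administrator, so the result is a hint rather than proof.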