This is regarding retrieving data from the Mr. Robot website.
When I used the browser's inspect tool to investigate the traffic I was getting from the site via the Network panel, here's a sample of the data I got:
Does anyone know where I can find the data that corresponds to the video (TV episode)?
I saw that the xhr type stands for XMLHttpRequest, so is that my browser requesting JSON, HTML, or XML from the web server? (Can someone confirm this as well?)
I am trying to find a request type that corresponds to the video, but I'm having no luck.
I am doing this to enhance my knowledge of web and network engineering.
In the Network tab, if you select a request and open the Headers tab, you can see this information:
Request URL:http://api.massrelevance.com/usadigitalapps/mr-robot-tag-mrrobot.json?limit=5&since_id=1039088770827352555_891624285
Here, you can see it's a request for a JSON file. :D
EDIT: Try going to this URL when the video is not playing in the browser (for some reason, when I had the page loaded this was returning as blank):
http://api.massrelevance.com/usadigitalapps/mr-robot-tag-mrrobot.json?limit=5&since_id=1039263408306586885_20082880
That was the request mine was making. In there, you can find a video URL:
https:\/\/scontent.cdninstagram.com\/hphotos-xaf1\/t50.2886-16\/11765169_875397039210031_1586195986_n.mp4
Remove the backslashes (\) and you'll see the video :D
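If you'd rather grab it with a script, here's a rough Python sketch of the same idea, assuming that endpoint still responds (the Mass Relevance API may well be gone by now): fetch the JSON text, undo the \/ escaping, and pull out any .mp4 links.

```python
# Rough sketch: fetch the JSON feed observed in the Network tab and extract video URLs.
# Assumes the endpoint still responds; adjust the URL to whatever your page requests.
import re
import requests

url = ("http://api.massrelevance.com/usadigitalapps/mr-robot-tag-mrrobot.json"
       "?limit=5&since_id=1039263408306586885_20082880")

text = requests.get(url, timeout=10).text

unescaped = text.replace("\\/", "/")                      # JSON escapes "/" as "\/"
video_urls = re.findall(r"https?://\S+?\.mp4", unescaped)  # grab any .mp4 links
print(video_urls)
```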
I'm not familiar with web development, but I believe the text content of this web page
https://almath123.github.io/semstyle_examples/
is built from the two JSON files mentioned in it (semstyle_results.json and semstyle_results.json), and that those JSON files are held entirely in memory (if that is the correct term for it), because when I disconnect from the internet I can still browse the page and see the text content.
I want to download the semstyle_results.json file. Is that possible? How can I do that?
Technically, if you visit a website you're already "downloading" the content. Your browser sends a request for information and a server responds by sending you the information; you're viewing that information locally. Dynamic sites poll or make further requests as you browse to keep the data updated and relevant, but it's all sent to you.
If you want to easily download any of the content from a website, a simple way is to open the developer tools (Ctrl + Shift + I on Windows for Firefox and Chrome), go to a source file, and click Save As. The Network tab shows you the requests that were made, which include not just files such as JSON but also the details of each request.
Here is a screenshot locating one of the JSON files in a Chromium-based browser (Brave):
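If you just want the file itself, a quick Python sketch like this would also work, assuming the JSON is served from the same directory as the page (copy the exact URL from the Network tab or Sources panel if it differs):

```python
# Minimal sketch: download the JSON file directly. The path below is an assumption
# based on the page URL; verify it in the browser's Network tab first.
import requests

url = "https://almath123.github.io/semstyle_examples/semstyle_results.json"  # assumed path

resp = requests.get(url, timeout=30)
resp.raise_for_status()

with open("semstyle_results.json", "wb") as f:
    f.write(resp.content)

print("saved", len(resp.content), "bytes")
```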
Web pages may not always show that they support returning their data as JSON or XML. For example, if you inspect this SEC EDGAR database page using the method described above, it shows no JSON link, but if you append index.json to the end of the URL it will return the same data in JSON format (or XML, if you prefer).
i.e. the same page, but via its JSON endpoint
So it is always a good idea to check whether the website hosts developer information. For example, SEC EDGAR provides developer documentation which mentions that the directory structure can be accessed via HTML, XML, or JSON.
SEC developer information
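As a small illustration of the index.json trick in Python: the filing directory below is just an example EDGAR directory, and the contact details in the User-Agent header are placeholders (the SEC asks for a descriptive User-Agent on automated requests).

```python
# Sketch: fetch an EDGAR directory listing as JSON by appending index.json.
# The directory URL is an example; the User-Agent contact info is a placeholder.
import requests

directory = "https://www.sec.gov/Archives/edgar/data/320193/"   # example filing directory
headers = {"User-Agent": "your-name your-email@example.com"}    # placeholder contact info

resp = requests.get(directory + "index.json", headers=headers, timeout=30)
resp.raise_for_status()

listing = resp.json()   # the same directory listing as the HTML page, now as structured data
print(listing)
```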
What are the few things I'll have to include in my code to point me in the right direction?
For example, this website:
Open your browser's debugger on the Network tab and observe what requests are made when the site loads dynamic content (i.e. when you click). You'll see it's getting all the data through an API, for example: https://www.bestfightodds.com/api?f=ggd&b=3&m=16001&p=2
You can download all the data by changing the parameters in this URL.
Usually that's enough, but here it's trickier: the data returned by the server is encoded somehow and not easily readable. You'd have to debug the site's JavaScript to find the function that decodes this data before you can parse it.
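For the first step (pulling the raw payload), a Python sketch might look like this. The parameter values are copied from the observed request; what f, b, m and p mean is a guess you'd refine by comparing requests, and the response will still be encoded as noted above.

```python
# Rough sketch: replay the API request seen in the Network tab with chosen parameters.
# Parameter meanings are assumptions; the payload comes back encoded.
import requests

url = "https://www.bestfightodds.com/api"
params = {"f": "ggd", "b": 3, "m": 16001, "p": 2}   # values copied from the observed request

resp = requests.get(url, params=params,
                    headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()

print(resp.text[:500])   # raw (encoded) payload; decoding requires reverse-engineering the site's JS
```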
I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
If you look at the company name, Agilent Technologies Inc.:
It's neither in the page source nor in any JSON response.
But it does show up in the DOM in the Chrome developer tools.
I have looked at and analysed almost every request the page sent, but still couldn't find where this data is saved.
By "where the data is saved" I mean: where can I scrape that data from, if I'm using python-requests and BeautifulSoup?
I do see an XMLHttpRequest being made; I'm not sure what that means, or whether it is the clue to my answer.
I am still learning Python, and it would be very useful if someone could help me with this.
Thanks in advance.
After the HTML is loaded, the page's JavaScript requests the data through an XMLHttpRequest, and the response is inserted into the page as soon as it is received on your client. That's why you see the DOM element when using the element inspector.
You didn't mention what goal you want to achieve or what tool you are using, so please be specific in your question. If you aren't familiar with this kind of pattern, search for AngularJS and look at some examples.
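For example, a minimal sketch with Selenium, assuming you have Chrome and Selenium installed and the page still behaves the same way: driving a real browser lets the XMLHttpRequest fire, so the injected text shows up in the rendered DOM, unlike a plain requests.get().

```python
# Minimal sketch: render the page in a real (headless) browser so JS-injected
# content becomes visible, then inspect the resulting HTML.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")   # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("http://www.zoominfo.com/s/#!search/company/"
           "1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D")
time.sleep(5)                        # crude wait for the XHR-driven content to render

rendered_html = driver.page_source   # includes the injected company names
driver.quit()

print("Agilent Technologies" in rendered_html)
```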
"I do see an XMLHttpRequest made, not sure what that means, or if that is the clue to my answer."
It means that JavaScript embedded in the page is sending an extra HTTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and the JavaScript in the page is then injecting the text into the DOM in the appropriate place.
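So instead of scraping the HTML, you can try replicating that extra request yourself. A rough Python sketch follows; the endpoint URL and parameters are placeholders (not ZoomInfo's real API) that you'd copy from the Network tab of the developer tools.

```python
# Rough sketch: call the XHR endpoint directly instead of parsing the rendered page.
# The URL and parameters below are placeholders; copy the real ones from the Network tab.
import requests

xhr_url = "https://www.example.com/api/company/search"   # placeholder endpoint
params = {"companyName": "a"}                             # placeholder parameters

resp = requests.get(xhr_url, params=params,
                    headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()

data = resp.json()   # the company names should be somewhere in this structure
print(data)
```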
"Where is the data stored on the website?"
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.
I am trying to play a story_html5.html video from Amazon CloudFront in an iframe. "story.html" works fine in the iframe, but when I use story_html5.html, it gives this error:
Unsafe JavaScript attempt to access frame with URL "URL1" from frame with URL "URL2". Domains, protocols and ports must match.
Please let me know about the solution.
Thanks,
Laxmilal Menaria
I believe the HTML5 version of the file may be called once the code determines it's needed. Based on this, your own code may look like it's trying to hijack that process, which causes the exception. Without the exact code you're working with, though, it's hard to say. This is just based on my pulling apart some of the files a while back.
I'm currently using CyberNeko in an attempt to grab information I want from a website. However, I believe the website checks the user agent/browser version to keep people from just grabbing the URL content.
I am aware that HtmlUnit can be used to change the browser version, but I'm not sure whether I can go about this using CyberNeko.
Does anyone know if it's possible to do such a thing?
I've never used CyberNeko, but I thought it was just an HTML parser, i.e. I didn't think you could use it to issue HTTP requests and actually download the web page.
It could be that the HTTP request issued by CyberNeko is missing various headers, such as the User-Agent header. An easy way to ensure that the HTTP request looks like a request sent from a browser is to use HttpClient instead of CyberNeko to download the web page. There's some example code available here.
Once you've successfully downloaded the page, use CyberNeko to parse out the bits you're interested in.
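The linked example code is Java (HttpClient), but purely for illustration, the same pattern in Python looks like this: download the page with an explicit User-Agent header, then hand the HTML to a parser (the parsing step is roughly what CyberNeko does on the Java side). The URL is a placeholder.

```python
# Illustration of the pattern only: fetch with a browser-like User-Agent, then parse.
# The URL is a placeholder; this is not the Java/HttpClient code from the linked example.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # look like a browser
html = requests.get("https://example.com/page", headers=headers, timeout=30).text

soup = BeautifulSoup(html, "html.parser")
print(soup.title)
```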