Get a JSON file that a webpage is made of - HTML

I'm not familiar with web development, but I believe the text content of this web page
https://almath123.github.io/semstyle_examples/
is made of the two JSON files mentioned in it (semstyle_results.json and semstyle_results.json), and that the JSON files are completely present in memory (if that is the correct term for it), because when I disconnect from the internet I can still browse the page and see the text content.
I want to download the semstyle_results.json file. Is that possible? How can I do that?

Technically, if you visit a website you are already "downloading" the content. Your browser sends a request for information and a server responds by sending it to you; you then view that information locally. Dynamic sites poll or make further requests as you browse to keep the data updated and relevant, but it is all sent to you.
If you want to easily download any of the content from a website, a simple way is to open the developer tools (Ctrl + Shift + I on Windows for Firefox and Chrome), go to a source file, and click "Save as". The Network tab shows the requests that were made, which includes not just files such as JSON but also the details of each request.
Here is a screenshot locating one of the JSON files in a Chromium-based browser (Brave).
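If you'd rather script the download than click "Save as", here is a minimal sketch using Python's requests library; the exact file path is an assumption, so verify it against the URL shown in the Network tab:

```python
import requests

# Assumed path: the page appears to load the file from its own directory;
# confirm the exact URL in the Network tab.
url = "https://almath123.github.io/semstyle_examples/semstyle_results.json"

resp = requests.get(url)
resp.raise_for_status()  # fail loudly if the file is not where we guessed

with open("semstyle_results.json", "wb") as f:
    f.write(resp.content)  # save the raw JSON to disk
```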

Web pages do not always advertise that they can return data as JSON or XML. For example, if you inspect this SEC EDGAR database page using the method described above, it shows no JSON link, but if you append index.json to the end of the URL it returns the same data in JSON format (or XML format, if you so please).
i.e., the same page but with a JSON endpoint.
So it is always a good idea to check whether the website hosts developer information. For example, SEC EDGAR provides developer documentation mentioning that the directory structure can be accessed via HTML, XML, or JSON.
SEC developer information
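As an illustration of the index.json trick, a hedged sketch in Python: the directory URL is just an example (Apple's EDGAR filings, CIK 320193), and the key names reflect what EDGAR returned at the time of writing, so check the live response yourself:

```python
import requests

# Example EDGAR directory; appending index.json returns the listing as
# JSON instead of HTML. SEC asks for a descriptive User-Agent on
# automated requests.
base = "https://www.sec.gov/Archives/edgar/data/320193/"
resp = requests.get(base + "index.json",
                    headers={"User-Agent": "example@example.com"})
resp.raise_for_status()

# Key names as observed in EDGAR's response; verify against the live data.
for item in resp.json()["directory"]["item"]:
    print(item["name"])
```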


Is there a way to scrape data from a website that is not available in the page's source?

What are the few things I'll have to include in my code that will point me in the right direction?
For example, this website.
Open your browser's debugger on the Network tab and observe what requests are made when the site loads dynamic content (when you click). You'll see it's getting all the data from an API, for example: https://www.bestfightodds.com/api?f=ggd&b=3&m=16001&p=2
You can download all the data by changing the parameters in this URL.
Usually that's enough, but here it's trickier: the data returned by the server is encoded somehow and not easily readable. You'd have to debug the site's JavaScript to find the function that decodes this data before you can parse it.
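For the first, easy part, a sketch of sweeping the URL parameters with Python's requests; the parameter meanings are guesses from the captured URL, and the decoding problem described above is deliberately left alone:

```python
import requests

# Parameter names come straight from the captured URL; their meanings
# (e.g. m looks like an event/market id) are assumptions from observation.
base = "https://www.bestfightodds.com/api"
for m in range(16000, 16010):  # sweep a small range of ids
    resp = requests.get(base, params={"f": "ggd", "b": 3, "m": m, "p": 2})
    if resp.ok:
        # The body is encoded (see above), so just save it raw for
        # later decoding.
        with open(f"data_{m}.txt", "wb") as f:
            f.write(resp.content)
```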

How to download a document from iPaper SWF

Hi guys, I am trying to download a document from an SWF link in iPaper.
Please guide me on how I can download the book.
Here is the link to the book, which I want to convert to PDF or Word and save:
http://en-gage.kaplan.co.uk/LMS/content/live_content_v2/acca/exam_kits/2014-15/p6_fa2014/iPaper.swf
Your kind guidance in this regard would be appreciated.
Regards,
Muneeb
First, open the book in your browser with network capturing on (in the developer tools). You should open many pages at different locations, with and without zoom, then look at the captured data.
You will see that for each new page you open, the browser asks for a new file (or files). This means there is a file for each page, and from that file your browser creates the image of the page (usually there is one file per page and it is some image format, but I have also encountered a base64-encoded picture and a picture cut into four pieces).
So we want to download and save all the files containing the book's pages.
Now, usually there is a consistent pattern to the addresses of the files, with some incrementing number in them (we can see the difference between consecutive files in the captured data). Knowing the number of pages in the book, we can guess the remaining addresses up to the end of the book (and of course download all the files programmatically in a for loop), and we could stop here.
But sometimes the addresses are a bit difficult to guess, or we want the process to be more automatic. Either way, we want to obtain programmatically the number of pages and all of their addresses. So we have to check how the browser knows that. Usually the browser downloads some files at the beginning, and one of them contains the number of pages in the book (and possibly their addresses). We just have to find that file in the captured data and parse it in our program, as in the sketch below.
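As a sketch of that find-the-index-file step, assuming (as in the concrete case below) the index is a zip archive containing an XML page list; the URL, the file name inside the archive, and the XML element/attribute names are all hypothetical and must be read off the real captured data:

```python
import io
import zipfile
import requests
import xml.etree.ElementTree as ET

# Hypothetical URL and file layout; read the real names off the traffic
# you captured for the site you are working with.
manifest = requests.get("http://example.com/book/manifest.zip").content
with zipfile.ZipFile(io.BytesIO(manifest)) as zf:
    pages_xml = zf.read("pages.xml")

root = ET.fromstring(pages_xml)
# Assume each <page> element carries the address of its large image.
page_urls = [page.get("url") for page in root.iter("page")]
print(len(page_urls), "pages found")
```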
At the end there is the issue of security: some websites try to protect their data one way or another (usually using cookies or HTTP authentication), but if your browser can access the data, you just have to track how it does so and mimic it.
(If it is cookies, the server will respond at some point with a Set-Cookie: header. It could be that you have to log in to view the book, so you have to track that process too; usually it happens via POST messages and cookies. If it is HTTP authentication, you will see something like Authorization: Basic in the request headers.)
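A minimal sketch of mimicking both mechanisms with Python's requests; the login URL and form field names are placeholders for whatever the captured traffic actually shows:

```python
import requests

session = requests.Session()  # keeps cookies (from Set-Cookie) across requests

# Hypothetical login endpoint and form field names; copy the real ones
# from the POST request you see in the captured traffic.
session.post("http://example.com/login",
             data={"username": "me", "password": "secret"})

# Subsequent requests reuse the session cookie automatically.
page = session.get("http://example.com/protected/page1.jpg")

# If the site uses HTTP (Basic) authentication instead, pass auth=:
page = requests.get("http://example.com/protected/page1.jpg",
                    auth=("me", "secret"))
```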
In your case the answer is simple (all file names below are relative to the main file's directory: "http://en-gage.kaplan.co.uk/LMS/content/live_content_v2/acca/exam_kits/2014-15/p6_fa2014/"):
There is a "manifest.zip" file that contains a "pages.xml" file, which holds the number of files and links to them. We can see that for each page there are thumb, small, and large pictures, so we want just the large ones.
You just need a program that loops over those addresses (from Paper/Pages/491287/Zoom.jpg to Paper/Pages/491968/Zoom.jpg).
Finally, you can merge all the JPGs into a PDF.
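Putting the loop and the merge together, a sketch using the requests and img2pdf libraries (the page-id range is the one read out of pages.xml above):

```python
import requests
import img2pdf

base = ("http://en-gage.kaplan.co.uk/LMS/content/live_content_v2/"
        "acca/exam_kits/2014-15/p6_fa2014/")

# Page ids as listed in pages.xml: 491287..491968, one per page.
files = []
for page_id in range(491287, 491969):
    url = f"{base}Paper/Pages/{page_id}/Zoom.jpg"
    name = f"page_{page_id}.jpg"
    with open(name, "wb") as f:
        f.write(requests.get(url).content)
    files.append(name)

# Merge all the page images into a single PDF.
with open("book.pdf", "wb") as f:
    f.write(img2pdf.convert(files))
```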

Where to find the response video data?

This is regarding retrieving data from Mr. Robot.
When I used the inspect-element tool to investigate the traffic I was getting from the site via the Network panel, here's a sample of the data I got.
Does anyone know where I can find the data that corresponds to the video (TV episode)?
I saw that the type xhr stands for XMLHttpRequest; does that mean it is a combination of my browser requesting JSON, HTML, and XML from the web server? (Can someone confirm this as well?)
I am trying to find a type that corresponds to one of these but am having no luck.
I am doing this to enhance my knowledge of web and network engineering.
In the Network tab, if you select the Headers tab, you can see this information:
Request URL: http://api.massrelevance.com/usadigitalapps/mr-robot-tag-mrrobot.json?limit=5&since_id=1039088770827352555_891624285
Here you can see it's a request for a JSON file. :D
EDIT: Try going to this URL when the video is not playing in the browser (for some reason, when I had the page loaded this was returning blank):
http://api.massrelevance.com/usadigitalapps/mr-robot-tag-mrrobot.json?limit=5&since_id=1039263408306586885_20082880
That was the request mine was making. In there, you can find a video URL:
https:\/\/scontent.cdninstagram.com\/hphotos-xaf1\/t50.2886-16\/11765169_875397039210031_1586195986_n.mp4
Remove the \ characters and you'll see the video. :D
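If you want to read that feed programmatically, note that a JSON parser already turns \/ back into /, so no manual cleanup is needed. A small Python sketch, where the key path to the video URL is a guess (the feed appears to aggregate Instagram posts) and must be checked against the real response:

```python
import requests

url = ("http://api.massrelevance.com/usadigitalapps/"
       "mr-robot-tag-mrrobot.json?limit=5")
items = requests.get(url).json()  # JSON parsing turns \/ back into /

# Hypothetical key path; inspect the actual response to find where the
# video URL really lives.
for item in items:
    video = item.get("videos", {}).get("standard_resolution", {}).get("url")
    if video:
        print(video)
```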

Launch network location from browser

I am working on a webpage that provides a download link to a file the user searched for via an input form on the page.
I can use the HTML <a> tag, as in <a href="file://ip/path/filename">link</a>.
But when the file is on a network location that requires a login, I cannot do it.
The following is not working: I tried embedding the login in the link's URL.
The file I need to link to is located on a different network location depending on the user's input to the browser form; the backend Python then searches for the file's location.
Can anybody help me?
Thank you.
Unfortunately, you are trying to do something that protocols and browsers do not support.
The username:password part of a URL is designed to be consumed by a web server. When you insert it into a file URI, there is nothing that will consume it; there is no HTTP server on the other end. Hence, the browser actually strips those credentials before it extracts the file path from the request and passes the file request to the OS.
You need to either make sure that the end users are pre-authenticated to all the network shares you are going to access, or avoid file URIs and set up rudimentary web servers at your file targets.
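A minimal sketch of that second option, assuming a Python backend (Flask here, since the asker mentions backend Python): the server, running as an account already authenticated to the shares, finds the file and streams it back over HTTP, so the browser never touches a file:// URI. The find_file helper stands in for the asker's existing search code:

```python
from flask import Flask, abort, request, send_file

app = Flask(__name__)

def find_file(query):
    """Placeholder for the asker's existing Python search across the
    network shares; the process must already be authenticated to them."""
    ...

@app.route("/download")
def download():
    path = find_file(request.args.get("q", ""))
    if path is None:
        abort(404)
    # Stream the file over HTTP, so no file:// URI (or share login)
    # is needed on the browser's side.
    return send_file(path, as_attachment=True)
```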

HTML5 read files from path

Well, using the HTML5 file handling API we can read files with the help of an input of type file. What about reading files from a path like
/images/myimage.png
etc.?
Any kind of help is appreciated.
Yes, if it is Chrome! Play with the filesystem API and you will be able to do that.
The simple answer is: no. When your HTML/CSS/images/JavaScript are downloaded to the client's end, they break loose from the server.
Simplistic flowchart
1. The user requests a URL in the browser (for example, www.mydomain.com/index.html).
2. The server reads and fetches the required file (www.mydomain.com/index.html).
3. index.html and its linked resources are downloaded to the user's browser.
4. The user's browser renders the HTML page.
5. The user's browser only fetches the files that came with the request (images/someimages.png and the like, such as scripts/jquery.js).
Explanation
The problem you are facing here is that once the HTML is rendered locally it has no link with the server anymore, so asking what files /images/ contains is not logically possible: that directory resides on the server.
Work-around
What you can do, though it partly defeats the point of the question, is write a server-side script in JSP/PHP/ASP/etc. that traverses the directory you want. In PHP you can do this using opendir() (http://php.net/opendir).
With an XHR/AJAX call you can then request that page to return the directory listing. The easiest way to do this is with jQuery's $.post() function in combination with JSON.
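The paragraph above describes the PHP version; for consistency with the rest of this page, here is the same idea as a Python (Flask) endpoint that returns the listing as JSON, which the jQuery $.post() call would consume:

```python
import os
from flask import Flask, jsonify

app = Flask(__name__)
IMAGE_DIR = "images"  # server-side directory to expose (relative to the app)

@app.route("/my_image_dirlist", methods=["GET", "POST"])
def dirlist():
    # Equivalent of PHP's opendir()/readdir(): list the directory on the
    # server and return the names as JSON for the AJAX caller.
    return jsonify(sorted(os.listdir(IMAGE_DIR)))
```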
Caution!
Keep in mind that if you use this work-around, you expose a link that lets everyone see what's in the directory you request (for example, http://www.mydomain.com/my_image_dirlist.php would return a stringified list of everything, or less, based on certain rules in the server-side script, inside http://www.mydomain.com/images/).
Notes
http://www.html5rocks.com/en/tutorials/file/filesystem/ (seems to work only in Chrome, and would still not be exactly what you want).
If you don't need all the files from a folder, but only those that were downloaded to your browser's cache for the current page, you could try searching online for how to access the browser cache (downloaded files) of the currently loaded page, or build something like a DOM walker and CSS reader (regex?) to find all the file references.