When I view a website in my browser (for example https://www.homedepot.ca/en/home/p.725-inch-miter-saw-with-laser.1000748698.html), it contains information that is not in the source code.
For example, the source code of this page doesn't specify a product price:
<span itemprop="price">-</span>
<small>/
each</small>
However, when viewed in a browser, the tag does actually contain a price.
How can I retrieve the product's price from the source code?
Short answer: just by reading the source, you can't. The price is dynamically loaded from their servers (using JavaScript) after the page loads.
Using appropriate tools (such as the Network tab in Chrome's or Firefox's developer console) you can figure out where they retrieve the price from (in this case, a JSON document on their servers). However, even if you used that, there is no guarantee it'll still work tomorrow: they can change the URL or the format of the data at any moment.
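For illustration, here is a minimal Python sketch of that approach; the endpoint URL and the field name below are assumptions you would have to confirm in the network tab yourself:

import requests

# Hypothetical endpoint -- the real URL must be discovered via the
# network tab, and may change or disappear at any moment.
PRICE_URL = "https://www.homedepot.ca/api/products/1000748698/price"

resp = requests.get(PRICE_URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Assumed field name; inspect the actual JSON for the right key.
print(data.get("price"))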
A good place to get started on the technologies they use is reading up on
JavaScript
AJAX
JSON
If you are interested in retrieving information from their page programmatically, a good start would be to contact them and ask whether they have a public interface (API) you can use. These are usually more stable.
Related
I use Microsoft OneNote daily to take notes. I would like to write a script to send myself an email every night with all the new notes I took that day across notebooks so I can review them. This would usually be straightforward in e.g. a Word doc where I can timestamp all saves and take the latest file, diff it with the last file from the previous day and send the diff. Unfortunately OneNote complicates this for at least two reasons:
OneNote autosaves and as far as I can tell does not offer the ability to rename saves or add a timestamp to the filename
Notebooks and pages mean changes are across "documents" instead of a single file that can be diff'd.
So I am looking for a solution that considers the complications above. Thanks.
The basic approach via the Microsoft Graph API:
./me/onenote/pages?$filter=lastModifiedDateTime ge yyyy-MM-ddThh:mm:ssZ&$expand=parentNotebook
will yield JSON data with:
title - Page title
links/oneNoteWebUrl - allows opening the OneNote page in a web browser
links/oneNoteClientUrl - allows opening the OneNote page in the OneNote app
parentNotebook/displayName - Notebook name
self - needed to get page content.
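As a rough Python sketch of that single query (the access token and the cutoff date are placeholders; acquiring an OAuth token for Microsoft Graph is omitted here):

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
HEADERS = {"Authorization": "Bearer ..."}  # assumed: a valid OAuth access token

resp = requests.get(
    GRAPH + "/me/onenote/pages",
    headers=HEADERS,
    params={
        "$filter": "lastModifiedDateTime ge 2024-01-01T00:00:00Z",  # example cutoff
        "$expand": "parentNotebook",
    },
)
resp.raise_for_status()

for page in resp.json()["value"]:
    print(page["title"],
          page["parentNotebook"]["displayName"],
          page["links"]["oneNoteWebUrl"]["href"])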
For small numbers of pages this single query may work, but it is likely to time out with a 504 error on a drive with many pages.
In that case a two-stage approach is required.
./me/onenote/sections?$filter=lastModifiedDateTime ge yyyy-MM-ddThh:mm:ssZ
will return a list of all the sections that have been modified since the defined lastModifiedDateTime.
Next, iterate through the returned JSON data and fetch the pages modified since lastModifiedDateTime via the returned pagesUrls, using the format
./me/onenote/sections/1-xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx/pages?$filter=lastModifiedDateTime ge yyyy-MM-ddThh:mm:ssZ&$expand=parentNotebook
yielding the same data as noted previously.
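A sketch of the two-stage version, under the same assumptions (each returned section carries a pagesUrl pointing at its own pages endpoint):

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
HEADERS = {"Authorization": "Bearer ..."}  # assumed OAuth token, as above
SINCE = "2024-01-01T00:00:00Z"             # example cutoff

# Stage 1: only the sections modified since the cutoff.
sections = requests.get(
    GRAPH + "/me/onenote/sections",
    headers=HEADERS,
    params={"$filter": "lastModifiedDateTime ge " + SINCE},
).json()["value"]

# Stage 2: each section's recently modified pages, via its pagesUrl.
pages = []
for section in sections:
    resp = requests.get(
        section["pagesUrl"],
        headers=HEADERS,
        params={"$filter": "lastModifiedDateTime ge " + SINCE,
                "$expand": "parentNotebook"},
    )
    pages.extend(resp.json()["value"])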
Once you have this data you can generate an email containing a list of the modified notebooks, page names, and page links.
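For example, a minimal sketch using Python's standard library (the addresses and SMTP host are placeholders; pages is the list collected in the previous sketch):

import smtplib
from email.message import EmailMessage

# Build one line per modified page: notebook / title: link.
lines = [page["parentNotebook"]["displayName"] + " / " + page["title"]
         + ": " + page["links"]["oneNoteWebUrl"]["href"]
         for page in pages]

msg = EmailMessage()
msg["Subject"] = "OneNote pages modified today"
msg["From"] = "me@example.com"   # placeholder address
msg["To"] = "me@example.com"     # placeholder address
msg.set_content("\n".join(lines) or "No pages modified today.")

with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder SMTP host
    smtp.send_message(msg)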
If you need the actual page data (content), then you need to call
./me/onenote/pages/1-1c13bcbae2fdd747a95b3e5386caddf1!1-xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx/content?includeIDs=true&includeInkML=true&preAuthenticated=true
which will give you text/html, ink, and links to other resources from each page.
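In the same sketch style, using each page's self link from the data above (again with an assumed token):

import requests

HEADERS = {"Authorization": "Bearer ..."}  # assumed OAuth token, as above

# For each page collected earlier, its "self" URL plus /content
# returns the page body as text/html.
for page in pages:
    resp = requests.get(page["self"] + "/content",
                        headers=HEADERS,
                        params={"includeIDs": "true"})
    html = resp.text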
I've been wondering how to fetch the PlayStation server status. They display it on this page:
https://status.playstation.com/en-us/
But PlayStation is known to use APIs instead of PHP database fetches. After looking around in the source code of the site, I found that they have a separate file called /data.json.
https://status.playstation.com/en-us/data.json
The content of this file is the same as the index file (for some reason). They use things like {{endDateTitle}} and {{message}}, but I can't find where these are defined, or whether they're pulled in via a separate file or just pulled from a database using PHP.
How can I "reverse" this site and see if there's an API I can use to display the status on my site?
Maybe I did not get the question right, but it seems pretty straightforward.
If you are using Firefox, open the developer tools and go to the Network tab, then reload the page.
You can clearly see the requested URL
https://status.playstation.com/data/statuses/region/SCEA.json
It seems that an empty list as a status means "no problems" (since there are currently no problems, I cannot verify this assumption). That's all.
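A minimal Python sketch of that (the empty-list-means-OK reading is still an assumption, and so is the exact JSON shape; verify both against a real capture):

import requests

STATUS_URL = "https://status.playstation.com/data/statuses/region/SCEA.json"

resp = requests.get(STATUS_URL, timeout=10)
resp.raise_for_status()
data = resp.json()

# Assumption from above: an empty status list appears to mean "no
# problems". The response shape is undocumented, so handle both a bare
# list and an object wrapping one.
statuses = data if isinstance(data, list) else data.get("statuses", data)
print("No problems reported." if not statuses else statuses)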
The curly braces {{}} are used by various HTML templating languages, such as Angular, so you'd have to go through the JS code to understand where they get filled in.
I'm new to the JSON world and I'm trying to find out how to view the JSON object of a webpage. Will every webpage have a JSON object, and if so, how do I find it in order to get the data and display it on my site? I vaguely remember something about using Firebug?
Thanks,
B
Will every webpage have a JSON object
No.
Many web sites will not use any JSON; many will be completely static (HTML and CSS only).
It may only apply if there is a "Web API" (for programmatic access to content), but there are non-JSON ways to do APIs (the X in AJAX is for XML).
To determine how to access a site programmatically, look at the site's developer documentation. If there isn't any documentation, then whatever AJAX calls a web debugger (like Firebug) shows may well be internal only and intended solely for the site's own implementation; other uses could well be unwelcome (you could be violating the site's IP).
Adding sensitive JSON to your final HTML page might become a vulnerability. JSON should be loaded like an ingredient into the soup, for example via Ajax on an authenticated page. If it's not sensitive JSON, then for performance reasons you should load it only once it is required... it really depends on your choice. I have built a library to handle these kinds of requests on the web; check it out: https://github.com/alexmano/jsMan
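As a server-side illustration of that idea, here is a hypothetical Flask endpoint that returns the sensitive JSON only to authenticated sessions, so the page can fetch it via Ajax instead of embedding it in the HTML (the route, session key, and payload are all made up for the example):

from flask import Flask, abort, jsonify, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder; use a real secret key

@app.route("/api/account-data")
def account_data():
    # Serve the sensitive JSON only to an authenticated session.
    if "user_id" not in session:
        abort(401)
    return jsonify({"balance": 42})  # placeholder payload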
I am at this website -
http://www.zoominfo.com/s/#!search/company/1.64.eyJjb21wYW55TmFtZSI6xIB2YWx1xIw6ImEiLCJpc1VzZWTEjXRyxJN9fQ%3D%3D
Look at the company name: Agilent Technologies Inc.
It's neither in the page source nor in any JSON format.
But it does show up in the DOM in the Chrome developer tools.
I have looked at and analysed almost every request the page sends, but still couldn't find where this data is stored.
By "where the data is saved" I mean: where can I scrape that data from,
e.g. using python-requests and BeautifulSoup?
I do see an XMLHttpRequest being made; I'm not sure what that means, or whether it is the clue to my answer.
I am still learning Python, and it would be very useful information if someone could help me with this.
Thanks in advance.
After the HTML is loaded, JavaScript requests the data through an XMLHttpRequest, and the response is injected into the page as soon as your client receives it. That's why you see the DOM element when using the element inspector.
You didn't mention what goal you want to achieve or what tool you are using; please be specific in your question. If you are not familiar with this kind of pattern, google AngularJS and look at some examples.
I do see an XMLHttpRequest being made; I'm not sure what that means, or whether it is the clue to my answer.
It means that JavaScript embedded in the page is sending an extra HTTP request to the web server. It is likely that the "Agilent Technologies Inc." text is being returned in the server's response to that request, and that the JavaScript in the page is then injecting the text into the DOM in the appropriate place.
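Since you mentioned python-requests: the practical consequence is that you can often call the XHR endpoint you saw in the network tab directly, instead of parsing the HTML with BeautifulSoup. A sketch, with a purely hypothetical endpoint and parameter:

import requests

# Hypothetical XHR endpoint -- copy the real URL, parameters, and
# headers from your own network-tab capture.
XHR_URL = "http://www.zoominfo.com/s/search/company"

resp = requests.get(XHR_URL, params={"companyName": "a"}, timeout=10)
resp.raise_for_status()

# The "Agilent Technologies Inc." text would appear somewhere in this
# response rather than in the initial HTML.
print(resp.json())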
Where is the Data stored on Website
That is a completely different question ...
(You have already noted that the data (e.g. the company name) gets injected into the page displayed by your browser.)
On the server side, the data could be stored in the web server (or its back-end systems) in a variety of ways. Or it might not be stored at all. There is no way of knowing ... without looking at the server-side code and configurations.
I have a drupal site that is being used strictly as a CMS that produces JSON feeds using services and services_views, which are consumed by a separate site. What I would like to do (and I have a working proof of concept of this) is allow for a "live preview" on the real site, by intercepting the node form preview / submit, encoding the node as JSON, and loading a special page on the live site that consumes that JSON and displays the page accordingly.
The problem is that this JSONized node is different from the JSON being produced by my view (using services_views). My end goal is to produce JSON that is identical for both previewed and non-previewed objects, without having to maintain separate output methods (I could easily hand-customize the JSON, but then whenever my view for the public API changes I'd have to make the same changes to the preview JSON; I'm trying to avoid that).
I'm looking for feedback on this approach. Is what I'm attempting even possible? The ideas I've been able to come up with so far are:
being able to (conditionally) drive my view with data from a non-database source
sneakily inserting data into the view object during one of the stages of execution? Kludgy but I'm not above that :)
saving a "clone" node (or revision?) of the node being previewed and let the view use that to display the preview JSON?
Maybe this is the wrong approach altogether and there's something better? (Trying to intercept and format the services output in my module... maybe avoid services_views altogether?)
If anyone can offer some advice, insight or opinions on how to best proceed here, I'd be really grateful.
In a custom module, you could set up a page that grabs the JSON output from the view page:
// Fetch the view's JSON output so the preview always matches the view.
$JSON = file_get_contents($url);
That way the preview stays bound to the view, even if the view changes.
First of all, what you are trying to achieve is not an easy task, so good luck.
I think you could intercept the node submission data, create a node programmatically, render that node, and then export the rendered node to JSON. Immediately after you get the JSON, delete the node, because the programmatically created node is only for the preview.
This approach could be more CPU-demanding, but consider that previewing content exactly as it will finally look is difficult.
The RSS feeds that your site reads could be filtered with some parameter to exclude the programmatically created (preview) nodes, even though these nodes will only exist for a very short time.