Forgive me if I don't use the proper terminology. I have a webpage that I'm trying to scrape information from. The problem is that when I view the page source, the data I want to scrape is not there. I've encountered this problem before: the main HTTP request triggers other requests, so the information I'm looking for is actually somewhere else, which I find using Google Chrome's Inspect → Network feature. I manually search the various documents and XHR files for the one that has the correct information. This is sometimes long and tedious. I can also use Chrome's Inspect feature to inspect the element that contains the information I want, and that brings up the correct source code, but I can't seem to figure out where or how I can use that to quickly find the corresponding HTTP headers.
Restated in short: can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?
Here's the case study I'm working on:
http://www.flashscore.com/tennis/atp-singles/acapulco/results/
It shows the different matches that took place at a tennis tournament. I'm trying to scrape the match hrefs, but if you view the source of the page you'll see they're not there.
Thanks
Restated in short: can I use the Inspect Element feature of Google Chrome and then ask it to show me the corresponding network event (HTTP request) that produced that code?
No. This isn't something that the browser keeps track of.
In most situations, the HTTP response will pass through a good deal of JavaScript code before eventually being turned into elements on the page. Tracing which HTTP response was "responsible" for a given element would involve a great deal of data-flow analysis, and is impractical for a browser to do.
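One practical workaround, if the goal is just the match hrefs rather than the headers themselves: let a real browser execute the page's JavaScript, then read the rendered DOM instead of the raw source. Below is a minimal Python sketch using Selenium; this is my own suggestion, not part of the original answer, and the "match" substring filter is a guess, so inspect the live page and adjust it.

```python
# Minimal sketch: render the page in Chrome so the JavaScript-generated
# links exist in the DOM, then collect their hrefs.
# Assumes Selenium and chromedriver are installed.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.flashscore.com/tennis/atp-singles/acapulco/results/")
driver.implicitly_wait(10)  # give the client-side rendering time to finish

hrefs = []
for a in driver.find_elements(By.TAG_NAME, "a"):
    href = a.get_attribute("href")
    if href and "match" in href:  # the "match" filter is an assumption; adjust as needed
        hrefs.append(href)

driver.quit()
print(hrefs)
```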
One way:
Open Firefox, install LiveHttpHeaders, then run it, and you will see the expected headers.
There's a similar add-on for Google Chrome, but I haven't tested it.
Related
Please may I have a little help? I'm stuck, unable to Google for a solution because the relevant words are very common.
There is a web page that uses POST to send data to a page on a subdomain when a button is clicked.
I need to recreate a button and send the same information.
My question is: is it possible, just by looking at the page (and the console?) when you click the button, to observe what happens and recreate/implement the same POST method?
Can I say for example: It does this, therefore I need this code to do the same thing?
Or is it not possible to reverse engineer? Will I have to seek help from the web page developer (not really an option in this case)?
It is perfectly possible to inspect the request and reverse engineer this. You can use tools like the developer console in your Chrome/Edge browser (press F12), and tools like Postman to simulate requests. Also inspect the form and any JavaScript events attached to the button.
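Once the DevTools Network tab shows you the captured request, replicating it in code is usually straightforward. Here's a minimal Python sketch assuming a plain form POST; the URL, field names, and headers below are placeholders to be copied from whatever the Network tab actually shows.

```python
# All values here are placeholders -- copy the real URL, form fields,
# and any required headers or cookies from the request in DevTools.
import requests

response = requests.post(
    "https://sub.example.com/submit",                # hypothetical endpoint
    data={"field1": "value1", "field2": "value2"},   # hypothetical form fields
    headers={"User-Agent": "Mozilla/5.0"},           # mimic the browser if the server checks
)
print(response.status_code)
print(response.text[:200])
```

Note that many forms also include hidden fields such as CSRF tokens, which you may need to scrape from the page first and include in the POST data.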
Some images on my webpage are not loading; the reason is a wrong path inside a CSS or JS file, but how do I know which JS or CSS file is trying to load that resource? I've tried a lot in the browser's inspector but cannot figure it out, so I have to search each JS and CSS file for the resource name. Is there any way to know the exact JS or CSS file that is trying to load the failed resource?
Thanks
The Firefox and the Chrome DevTools provide a way to see what initiated the request of a resource within their Network panels.
Chrome DevTools
The Network panel within Chrome's DevTools provides information about the origin of a request within an Initiator column. All entries except the ones saying 'Other' link to the line within the JavaScript, CSS, or document that caused the request.
For JavaScript calls, hovering over the entry shows the stack trace that led to the request.
It even allows you to highlight the initiators and dependencies by holding down Shift and hovering entries within the list.
Firefox DevTools
The Firefox DevTools Network Monitor has a Cause column indicating if a request came from JavaScript, the document, CSS, or some other source.
For JavaScript calls it provides the stack trace within a Stack Trace side bar when selecting the entry.
Unfortunately, for causes other than JavaScript it doesn't provide much useful information or links to the source files yet (as of Firefox 55). Therefore I've filed several enhancement requests to improve this feature.
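If neither panel pins down the file, one fallback is to automate the manual search the question describes. Here's a rough Python sketch, assuming the site's JS/CSS assets are on disk; the static/ directory and the resource name are placeholders, not anything from the original question.

```python
# Grep every local .js/.css file for the name of the failing resource.
# "static" and "missing-image.png" are placeholders -- substitute your own.
from pathlib import Path

resource_name = "missing-image.png"  # the failed resource from the Network tab
for path in Path("static").rglob("*"):
    if path.is_file() and path.suffix in (".js", ".css"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if resource_name in line:
                print(f"{path}:{lineno}: {line.strip()}")
```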
I've noticed that both Firefox and Chrome issue a new HTTP request when you view the source for a web page that you've already loaded. It's particularly annoying when the page itself is slow to load or if it won't load at all.
Why is that? Wouldn't they have the existing source for the originally received page cached already? Is it based on Cache-Control headers?
This has been on my mind for a while (usually, comes up when looking at what's behind slow web apps).
In the context of Chrome, according to this link it is indeed based on Cache-Control headers.
...view-source grabs the html source from the http cache and pretty-prints it, but for the page NOT in http cache, it's 'forced to' make a new request.
To me, this makes sense. You wouldn't want to use what is currently rendered as the source of truth, as the HTML can obviously be manipulated dynamically. If you can't use that, the HTTP cache would be the next likely candidate for the source. If the source is unavailable from cache, a subsequent GET of the source seems to be the only alternative.
This does, however, introduce another interesting dilemma, raised here.
Requesting the URL again doesn't make sense, as there is no guarantee that the source received during the second request will match what was received during the first request.
I would imagine this was a conscious trade-off that was made to ensure that a view-source request is always satisfied in some form or another.
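As a quick sanity check, you can inspect the caching headers yourself: if the server forbids caching, view-source has no choice but to re-request. A minimal Python sketch (the URL is a placeholder):

```python
import requests

response = requests.get("https://example.com/")
# If this prints "no-store" (or the page sets no caching headers at all),
# the browser's HTTP cache can't serve view-source, so a new request is made.
print(response.headers.get("Cache-Control"))
```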
You need to use "Inspect Element" to see the live web page; view-source reloads the page to show the source code without modification.
When I'm viewing the downloaded resources for a page in the Chrome web inspector, I also see the HTML/JS/CSS requested by certain extensions.
In the example above, indicator.html, indicator.js and indicator.css are actually part of the Readability Chrome extension, not part of my app.
This isn't too big a deal in this particular situation, but on a more complex page and with several extensions installed, it can get quite crowded in there!
I was wondering if there was a way to filter out any extension-related resources from this list (i.e. any requests using the chrome-extension:// protocol).
Does anyone know how I could achieve this?
Not quite the solution I was after (I'd have preferred a global setting), but there is now a way to filter out requests from extensions, as mentioned by a commenter on the issue I originally opened.
In the network tab filter box, enter the string -scheme:chrome-extension (as shown below):
This is case-sensitive, so make sure it's lowercase. Doing this will hide all resources which were requested by extensions.
Just enter "-f" in Network field
I had the same question when my extension added a lot of noise in the Network tab.
Some extensions also fire a lot of data requests like data:text/image etc.; you can append more filters with -, like:
-scheme:chrome-extension -scheme:data
Another way to get only the http/https requests is to just use scheme:https without the -, because the resources that extensions request usually come from their local bundle:
scheme:https
An Incognito window can be configured to include or exclude extensions from the extensions page of Chrome's settings.
One alternative is to go to the "Network request blocking" tab and add "chrome-extension:" to the list; extension requests will then be blocked and coloured red, so it's easy to visually filter them out.
You can simply enable this option and requests from extensions will be grouped.
Update: it can only group requests created by extensions that draw an iframe, such as cVim.
How can I intercept the post data a page is sending in FF or Chrome via configuration, extension or code? (Code part makes this programming related. ;)
I currently use Wireshark/Ethereal for this, but it's a bit difficult to use.
You could just use the Chrome Developer Tools, if you only need to track requests.
Activate them with Ctrl+Shift+I and select the Network tab.
This works also when Chrome talks HTTPS with another server (and unless you have the HTTPS private key you cannot use Wireshark to sniff that traffic).
(I copied this answer from this related query.)
With Firefox you can use the Network tab (Ctrl+Shift+E or Command+Option+E). The sub-tab "Params" shows the submitted form data.
Reference: https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor/request_details#Params
Alternatively, in the console (Ctrl+Shift+K or Command+Option+K) right click on the big pane and check "Log Request and Response Bodies". Then when the form is submitted, a line with POST <url> will appear. Click on it; it will open a new window with the form data.
As of the time of originally writing this reply, both methods messed up newlines in textarea fields. The former deleted them, the latter converted them to blanks. I haven't checked with a newer version.
Do you have control of the browser POSTing the data?
If you do, then just use Firebug. It's got a lot of useful features, including this.
For Firefox there is also TamperData, and even more powerful and cross-browser is Fiddler.
Programmatically, you can do this with dBug - it's a small code module you can integrate into any website.
I use it with CodeIgniter and it works perfectly.
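If you'd rather roll your own than integrate a module like dBug, a tiny logging endpoint is enough to see exactly what a form submits. Here's a minimal Python sketch of that idea (my own stand-in, not dBug itself); point the form's action, or a local proxy, at it.

```python
# Minimal local server that logs the path, headers, and body of any POST
# it receives, then replies with an empty 200.
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoggingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(self.path, dict(self.headers.items()))
        print(body.decode("utf-8", errors="replace"))
        self.send_response(200)
        self.end_headers()

# Listen on localhost:8000; both host and port are arbitrary choices.
HTTPServer(("127.0.0.1", 8000), LoggingHandler).serve_forever()
```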
In the Network tab of the Web Developer tools in Firefox, right-click on the PUT, POST, or any other type of request and you will find a "Use as Fetch in Console" option. There you can see the data being passed.