First of all, my problem is different from this one: Difference between cURL and web browser?
I use my Chrome browser to visit http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967 and then view the page source, where I find markup like:
<a class="js-product-title" href="/ip/Tide-Simply-Clean-Fresh-Refreshing-Breeze-Liquid-Laundry-Detergent-138-fl-oz/33963161">
However, I don't find this kind of info in the output of this command:
curl "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967" > local.html
Does anyone know what causes the difference? I am using Python Scrapy selectors to parse the webpage.
Your browser can execute JavaScript, which can in turn change the document. curl will just give you the plain original output and nothing else.
If you turn off JavaScript in the browser and refresh the page, you will see that it looks different.
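A quick way to confirm this from the command line (a sketch, assuming curl and grep are available; the class name is the one from the question) is to search the raw response for the markup the browser shows:

curl -s "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967" | grep -c "js-product-title"
# prints 0 when that markup is only added by client-side JavaScript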
In addition to just executing JS as explained in the other answer, your browser does a lot more work than you may realize to fetch that page, and the server may react based on that.
Open Chrome, Press F12, Go to the "Network" Tab.
Load the page you want to.
Look for the very first thing that got requested (it should have a document icon with the URL below it; you can also sort by 'Timeline' to find it).
Right click on the item, choose 'Copy as cURL'
Paste this into Notepad and compare what your browser sent to fetch that page with the simple curl command you ran.
curl "http://stackoverflow.com/questions/25333342/viewing-page-source-shows-different-html-than-curl" -H "Accept-Encoding: gzip,deflate,sdch" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Referer: http://stackoverflow.com/questions?page=2&sort=newest" -H "Cookie: <cookies redacted because lulz>" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" --compressed
Things like the language header sent and the user agent (more or less what browser and OS you are on), and in some cases even whether the response was requested compressed, can all cause a server to generate the page differently. This can be a normal reaction (like serving browser-specific HTML to only that browser, cough *IE and Opera*) or part of higher-level A/B testing of new designs or functionality. Chances are, the content returned at a given URL may well be different for someone else, or even for you when using a different browser or tool.
I also have to point out that what you SEE on the page isn't what comes up in view source. The source is what was sent to your browser to render. What you actually see on the page is the result after rendering and JavaScript have executed. Most browsers support some sort of "Inspect" function on the right-click menu; I suggest you look at pages through that and compare them with what shows in view source. It will change your perspective on how the web works.
I don't know if you have found your answer or not, but I have a solution. It could be that the server is returning a 301 redirect or similar. The code is straight C, so adapt as needed.
curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0);           // 0 enables the progress meter
curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);             // To see what's happening
curl_easy_setopt(curl, CURLOPT_USERAGENT, curlversion);  // user-agent string held in a variable
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);      // Optional/toggle: follow 301/302 redirects
The last option needs to be tested both with and without it set, to see which setting makes curl's output match the browser's.
Also, check the verbose output by issuing a direct shell command:
:~$ curl -v http://myurl > page.html
See the difference. It should help.
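For what it's worth, the command-line analogue of toggling CURLOPT_FOLLOWLOCATION is curl's -L flag; a minimal sketch, using the same placeholder URL as above:

curl -sI http://myurl               # print only the response headers; a 301/302 status means a redirect
curl -sL http://myurl > page.html   # -L follows redirects, like CURLOPT_FOLLOWLOCATION = 1L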
I am using branch.io for my Android app. I am trying to generate a link via the POST method, and this is the request body:
{"branch_key":"key_test_lerbZ22zyjfpfFtl2auzukafywi220dN", "campaign":"new_product_annoucement", "channel":"email", "tags":["monday", "test123"],
"data":"{\"name\": \"Alex\", \"email\": \"alex#branch.io\", \"user_id\": \"12346\", \"$desktop_url\": \"https://file.town/download/odrqliwc94d440jt08wxngceo\",\"$marketing_title\": \"2\"}"}
In the dashboard, the campaign and channel can be seen, and the generated URL goes to the desired site. But the generated URL does not show up in the Marketing tab of the dashboard, where I could see its statistics with regard to clicks, downloads and installs.
Is there any code I am missing in it?
To use the Branch HTTP API to create a link that shows up in the Marketing section of the Branch dashboard you need to add the parameter "type":2 at the root level of the request. You will also want to use a $marketing_title that is descriptive.
Here is an updated curl request using the parameters you provided:
curl -X POST -H "Content-Type: application/json" \
  -d '{"type":2, "branch_key":"key_test_lerbZ22zyjfpfFtl2auzukafywi220dN", "campaign":"new_product_annoucement", "channel":"email", "tags":["monday", "test123"], "data":"{\"name\": \"Alex\", \"email\": \"alex@branch.io\", \"user_id\": \"12346\", \"$desktop_url\": \"https://file.town/download/odrqliwc94d440jt08wxngceo\",\"$marketing_title\": \"Super Amazing Branch Link\"}"}' \
  https://api.branch.io/v1/url
I'm writing an application which will run on a microcontroller (Arduino or Raspberry Pi Zero) with wifi and a web server, and which will be configurable from a web browser without any client-side scripts. It will use a series of HTML forms to create a number of small files on the microcontroller, which the microcontroller will interpret to perform its tasks.
I'm writing it initially on a Slackware Linux system but when it gets close to completion, will move it all to a Raspberry Pi running a customised version of Ubuntu Linux for final tuning.
I'm using lighttpd with mod_fastcgi and libfcgi and I am writing forms handler software in C.
Now, ideally, the responses returned to the server by each form would be processed by its own handler daemon started by mod_fcgi; however, I have not been able to figure out how to configure FastCGI to load more than one handler daemon. My fcgi.conf file is pointed at by a link later in this missive.
I could live with this restriction, but another problem arises. With just one handler, the action="handlerProgram" field at the top of every form has to point at that one handler. Each form is unique and must be handled differently, so how do I tell the formsHandler program which form is being handled? I need to embed another label into each HTML form somewhere, so that the web client sends it back to the server, which then passes its value to the forms handler via the environment, or some such mechanism. Any clues on how to do this? Please?
Peter.
PS. Here's a link to the related config and html data. HTML Problem
Maybe one of these solutions will help:
In the HTML code, add information about the form to handle after the handler program name in the action attribute, like:
action="/cgi-bin/handlerProgram/id/of/form/to/handle"
In your CGI handlerProgram you'll have the PATH_INFO environment variable set to "/id/of/form/to/handle". Use it to know which form to handle.
In the HTML code, add a hidden input field to your form, like:
<input type="hidden" id="form_to_hanlde" value="form_id"/>
Just use the form_to_handle field's value in your handlerProgram to know which form to handle.
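As a quick sanity check (a sketch; the form field name and value are made up, and the host and port are the ones that appear in the environment listing further down), you can POST to the handler with curl and confirm that the trailing path segment arrives in PATH_INFO:

curl -X POST -d 'somefield=somevalue' http://192.168.0.16:6666/formsHandler.fcgi/index.htm
# inside the handler, getenv("PATH_INFO") should then return "/index.htm"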
Joe Hect posted an answer which completely solves this question.
The information which needed to be sent for the form called 'index.htm' is the name of the form. I used the action field ACTION="/formsHandler.fcgi/index.htm", and below are the contents of the environment as reported by echo.fcgi (renamed to formsHandler.fcgi to avoid having to change anything else in my config). If you can decipher the listing after this page has scrambled it, you will see that the required information is now present in a number of places, including PATH_INFO as suggested. Thank you, Joe.
Now all I have to do is figure out how to vote for you properly.
{
Request number 1
CONTENT_LENGTH: 37
DOCUMENT_ROOT: /home/lighttpd/htdocs
GATEWAY_INTERFACE: CGI/1.1
HTTP_ACCEPT: text/html, application/xhtml+xml, */*
HTTP_ACCEPT_ENCODING: gzip, deflate
HTTP_ACCEPT_LANGUAGE: en-AU
HTTP_CACHE_CONTROL: no-cache
HTTP_CONNECTION: Keep-Alive
HTTP_HOST: 192.168.0.16:6666
HTTP_PRAGMA:
HTTP_RANGE:
HTTP_REFERER: http://192.168.0.16:6666/
HTTP_TE:
HTTP_USER_AGENT: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
HTTP_X_FORWARDED_FOR:
PATH:
PATH_INFO: /index.htm
PATH_TRANSLATED: /home/lighttpd/htdocs/index.htm
QUERY_STRING:
CONTENT_LENGTH: 37
CONTENT:
REMOTE_ADDR: 192.168.0.19
REMOTE_HOST:
REMOTE_PORT: 54159
REQUEST_METHOD: POST
REQUEST_ACTION:
ACTION:
REQUEST_URI: /formsHandler.fcgi/index.htm
REDIRECT_URI:
SCRIPT_FILENAME: /home/lighttpd/htdocs/formsHandler.fcgi
SCRIPT_NAME: /formsHandler.fcgi
SERVER_ADDR: 192.168.0.16
SERVER_ADMIN:
SERVER_NAME: 192.168.0.16
SERVER_PORT: 6666
SERVER_PROTOCOL: HTTP/1.1
SERVER_SIGNATURE:
SERVER_SOFTWARE: lighttpd/1.4.41
}
I'm looking through my GA logs and I see a Google Chrome browser version 0.A.B.C. Could anybody tell me what this is exactly? Some kind of spider or bot, or a modified HTTP header?
The full user agent string probably looks something like this:
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
This is most likely a bot, but it could just be someone running an automated script using CasperJS or PhantomJS (or even a shell script using something like lynx) and spoofing the user agent.
The reason it looks like that, instead of something that says "My automated test runner v1.0" (or whatever is relevant to the author), is that this user agent string will pass most regular-expression checks as "some version of Chrome" and not get filtered out by bot checks that rely on a regular expression to match 'valid' user agent patterns.
To catch it, your site's bot checker would need to blacklist this string, or validate all parts of the Chrome version to make sure they are valid numbers. Even then, you can only do so much checking, since the user agent string is so easy to spoof.
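As a rough sketch of such a check in shell (illustration only; real filtering would live in your analytics pipeline or server config, and the variable below just holds the user agent string quoted above):

ua="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
# require the Chrome version to be four dot-separated numbers; "0.A.B.C" fails this test
if echo "$ua" | grep -Eq 'Chrome/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'; then
    echo "plausible Chrome version"
else
    echo "suspicious user agent"
fi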
I have been looking for a way to alter an XHR request made in my browser and then replay it.
Say I have a complete POST request done in my browser, and the only thing I want to change is a small value and then play it again.
This would be a lot easier and faster to do directly in the browser.
I have googled a bit around, and haven't found a way to do this in Chrome or Firefox.
Is there some way to do it in either one of those browsers, or maybe another one?
Chrome:
In the Network panel of devtools, right-click and select Copy as cURL
Paste / Edit the request, and then send it from a terminal, assuming you have the curl command
Alternatively, in case you need to send the request in the context of a webpage, select "Copy as fetch" and edit and send the content from the JavaScript console panel.
Firefox:
Firefox allows you to edit and resend XHR requests right from the Network panel; this has been available since around Firefox 36.
Chrome now has Copy as fetch in version 67:
Copy as fetch
Right-click a network request then select Copy > Copy As Fetch to copy the fetch()-equivalent code for that request to your clipboard.
https://developers.google.com/web/updates/2018/04/devtools#fetch
Sample output:
fetch("https://stackoverflow.com/posts/validate-body", {
credentials: "include",
headers: {},
referrer: "https://stackoverflow.com/",
referrerPolicy: "origin",
body:
"body=Chrome+now+has+_Copy+as+fetch_+in+version+67%3A%0A%0A%3E+Copy+as+fetch%0ARight-click+a+network+request+then+select+**Copy+%3E+Copy+As+Fetch**+to+copy+the+%60fetch()%60-equivalent+code+for+that+request+to+your+clipboard.%0A%0A&oldBody=&isQuestion=false",
method: "POST",
mode: "cors"
});
The difference is that Copy as cURL will also include all the request headers (such as Cookie and Accept) and is suitable for replaying the request outside of Chrome. The fetch() code is suitable for replaying inside of the same browser.
Updating/completing zszep's answer:
After copying the request as cURL (bash), simply import it into the Postman app.
My two suggestions:
Chrome's Postman plugin + the Postman Interceptor Plugin. More Info: Postman Capturing Requests Docs
If you're on Windows, then Telerik's Fiddler is an option. It has a composer feature to replay HTTP requests, and it's free.
Microsoft Chromium-based Edge supports "Edit and Replay" requests in the Network Tab as an experimental feature:
In order to enable the option you have to go into "Enable Experimental Features" (open DevTools with Control+Shift+I on Windows/Linux or Command+Option+I on macOS) and tick the checkbox next to "Enable Network Console".
More details about how to Enable Experimental Tools and the feature can be found here
For Firefox the problem solved itself. It has the "Edit and Resend" feature implemented.
For Chrome, the Tamper extension seems to do the trick.
Awesome Requestly
Intercept & Modify HTTP Requests
https://chrome.google.com/webstore/detail/requestly-modify-headers/mdnleldcmiljblolnjhpnblkcekpdkpa
https://requestly.io/
Five years have passed, and this essential requirement has not been ignored by the Chrome devs.
While they offer no way to edit the data as Firefox does, they do offer a full XHR replay.
This makes it possible to debug AJAX calls.
"Replay XHR" will repeat the entire transmission.
There are a few ways to do this, as mentioned above, but in my experience the best way to manipulate an XHR request and resend it is to use Chrome dev tools to copy the request as a cURL request (right-click on the request in the Network tab) and then simply import it into the Postman app (giant Import button in the top left).
No need to install 3rd party extensions!
There is a JavaScript snippet which you can add as a browser bookmark and then activate on any site to track and modify requests. For further instructions, review the GitHub page.
I need to extract the exchange rate of USD to another currency (say, EUR) for a long list of historical dates.
The www.xe.com website provides a historical lookup tool, and with a suitably detailed URL one can get the rate table for a specific date without populating the Date: and From: boxes. For example, the URL http://www.xe.com/currencytables/?from=USD&date=2012-10-15 gives the table of conversion rates from USD to other currencies on Oct. 15th, 2012.
Now, assume I have a list of dates. I can loop through the list, change the date part of that URL, and fetch the required page. If I can extract the rates list, then a simple grep EUR will give me the relevant exchange rate (I can use awk to extract the rate itself); a sketch of this loop follows the question.
The question is: how can I get the page(s) using a Linux command-line tool? I tried wget but it did not do the job.
If not from the CLI, is there an easy and straightforward way to do this programmatically (i.e., one that takes less time than copy-pasting the dates into the browser's address bar)?
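For reference only, the loop described above might look like the sketch below. As the updates that follow explain, xe.com's terms prohibit automated extraction, so treat this purely as an illustration of the general approach (the dates.txt file is an assumption, and the exact extraction of the rate column depends on the page markup, so only a plain grep is shown):

# dates.txt is assumed to hold one YYYY-MM-DD date per line
while read -r d; do
    # fetch the rate table for one date and keep only the row(s) mentioning EUR
    curl -s "http://www.xe.com/currencytables/?from=USD&date=$d" | grep 'EUR'
done < dates.txt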
UPDATE 1:
When running:
$ wget 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
I get a file which contain:
<HTML>
<HEAD><TITLE>Autoextraction Prohibited</TITLE></HEAD>
<BODY>
Automated extraction of our content is prohibited. See http://www.xe.com/errors/noautoextract.htm.
</BODY>
</HTML>
so it seems like the server can identify this type of request and block wget. Is there any way around this?
UPDATE 2:
After reading the response from the wget command and the comments/answers, I checked the ToS of the website and found this clause:
You agree that you shall not:
...
f. use any automatic or manual process to collect, harvest, gather, or extract
information about other visitors to or users of the Services, or otherwise
systematically extract data or data fields, including without limitation any
financial and/or currency data or e-mail addresses;
which, I guess, concludes the efforts on this front.
Now, out of curiosity: if wget generates an ordinary HTTP request, how does the server know that it came from a command-line tool and not a browser?
You need to use -O- to write to STDOUT, and quote the URL so the shell does not treat the & as a background operator:
wget -O- 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
But it looks like xe.com does not want you to do automated downloads. I would suggest not doing automated downloads at xe.com
That's because wget sends certain headers that make it easy to detect.
# wget --debug cnet.com 2>&1 | less
[...]
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: www.cnet.com
Connection: Keep-Alive
[...]
Notice the
User-Agent: Wget/1.13.4
I think that if you change that to
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
it would work:
# wget --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14' 'http://www.xe.com/currencytables/?from=USD&date=2012-10-15'
That seems to be working fine from here. :D
Did you visit the link in the response?
From http://www.xe.com/errors/noautoextract.htm:
We do offer a number of licensing options which allow you to
incorporate XE.com currency functionality into your software,
websites, and services. For more information, contact us at:
XE.com Licensing
+1 416 214-5606
licensing@xe.com
You will appreciate that the time, effort and expense we put into
creating and maintaining our site is considerable. Our services and
data is proprietary, and the result of many years of hard work.
Unauthorized use of our services, even as a result of a simple mistake
or failure to read the terms of use, is unacceptable.
This sounds like there is an API that you could use but you will have to pay for it. Needless to say, you should respect these terms, not try to get around them.