Web crawler dealing with "Sign up or log in to read full content" - html

Given a page like this, I am trying to extract all the answer text with a Ruby web crawler.
I am using Nokogiri and search('div[#class="answer_content"]').inner_text to access the answers, but I can't seem to access all the text, even though I am in fact logged in. About 200 words down, I'll get the message "sign up or log in to read full content."
Also, is this div class the correct one to use?

It seems to me that you need to authenticate yourself from the crawler. I did this a few weeks ago. I used a Firefox extension called Tamper Data, which let me see the requests made between the browser and the server. In my case, the authentication was handled by a session id; I just had to retrieve it and pass it along with each request I made to the server.
In your case, the authentication might be handled in a different way; you'll have to check for yourself. Anyway, I can give more detail if this isn't clear enough.
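For illustration, here is a minimal Ruby sketch of that cookie-based approach using Net::HTTP and Nokogiri; the login URL, form field names, and page URL below are assumptions you would need to adapt to the actual site:

require 'net/http'
require 'uri'
require 'nokogiri'

# Hypothetical login endpoint and credentials -- adapt to the real site.
login_uri = URI('https://example.com/login')
login_response = Net::HTTP.post_form(login_uri,
                                     'email'    => 'me@example.com',
                                     'password' => 'secret')

# Keep the session cookie the server sets on a successful login.
session_cookie = login_response['Set-Cookie']

# Send that cookie with every subsequent request so the server
# treats the crawler as a logged-in user.
page_uri = URI('https://example.com/questions/123')
request  = Net::HTTP::Get.new(page_uri)
request['Cookie'] = session_cookie

page = Net::HTTP.start(page_uri.host, page_uri.port, use_ssl: true) do |http|
  http.request(request)
end

doc = Nokogiri::HTML(page.body)
puts doc.search('div.answer_content').inner_text

As for whether that div class is the right one, you'll have to check the page source; if the class really is answer_content, the CSS form div.answer_content is the idiomatic Nokogiri selector for it.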

Related

Post to Node.js Server from Within HTML e-mail

I am writing a simple mailing application; however, I am not yet familiar with the full capabilities of HTML within the mailing world.
I would like to give the website administrator the choice to accept or refuse a reservation by sending him an overview of it. At the bottom of the mail I had two buttons in mind: accept and refuse.
I tried using a form within the HTML e-mail, but almost every mail client blocks this out.
Is there another method to do an HTTP POST to, say, myserver.com/accept or myserver.com/refuse from within an e-mail, without having to open an additional webpage?
If not, what is the best way to achieve such things?
This is a pretty relevant article: https://www.sitepoint.com/forms-in-email/
Basically, he concludes that support is not reliable, so you should not use forms in emails, which I agree with.
Since you say you want to give this choice to a website administrator I think you probably want some sort of authentication. So I could see it working something like this...
1. Send the admin an email containing two links: mysite.com/reservations/:reservation_id/accept and mysite.com/reservations/:reservation_id/refuse.
2. The admin clicks on one of the links.
3. The link opens in the browser and your site (controller -> ReservationService) accepts or refuses based on the id and action in the URL.
You will have a few things to consider, such as authentication (I assume you already have this, since you have the notion of a website admin?), authorization (can this admin accept or deny the reservation?), whether the reservation exists, whether the admin has already accepted or denied it, and so on.
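As a rough illustration of that flow (sketched in Ruby with Sinatra rather than Node.js, since the pattern is the same), where the Reservation model, its decided?/accept!/refuse! methods, and the admin session check are all hypothetical:

require 'sinatra'

enable :sessions

# Hypothetical authentication check -- replace with your real admin auth.
before '/reservations/*' do
  halt 401, 'Please log in' unless session[:admin_id]
end

get '/reservations/:id/accept' do
  reservation = Reservation.find_by_id(params[:id])   # hypothetical model
  halt 404, 'Reservation not found' unless reservation
  halt 409, 'Already handled' if reservation.decided?
  reservation.accept!
  'Reservation accepted'
end

get '/reservations/:id/refuse' do
  reservation = Reservation.find_by_id(params[:id])
  halt 404, 'Reservation not found' unless reservation
  halt 409, 'Already handled' if reservation.decided?
  reservation.refuse!
  'Reservation refused'
end

In Node.js the same routes would map naturally onto Express handlers; the important part is that the email only contains plain GET links and all the checks happen server-side.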

How to make basic REST API calls using a browser

I am trying to get started with REST API calls by seeing how to format the API calls using a browser. Most examples I have found online use SDKs or just return all fields for a request.
For example, I am trying to use the Soundcloud API to view track information.
To start, I've made a simple request in the browser as follows http://api.soundcloud.com/tracks/13158665.json?client_id=31a9f4a3314c219bd5c79393a8a569ec which returns a bunch of info about the track in JSON format
(e.g. {"kind":"track","id":13158665,"created_at":"2011/04/06 15:37:43 ...})
Is it possible to get only the "created_at" value returned using the browser? I apologize if this question is basic, but I don't know what keywords to search for online. Links to basic guides would be nice, although I would prefer to stay away from a specific SDK for the time being.
In fact, it's really hard to answer such a question, since it depends on the Web API. If the API supports returning only a subset of fields, you can; if not, you will receive the full content. From what I saw in the documentation, it's not possible here: the filters only let you get a subset of elements, not control the list of fields returned within each element.
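If you step just outside the browser, you can always fetch the full document and pick out the one field you need yourself; a minimal Ruby sketch, reusing the track URL and client_id from the question, might look like this:

require 'open-uri'
require 'json'

url = 'http://api.soundcloud.com/tracks/13158665.json' \
      '?client_id=31a9f4a3314c219bd5c79393a8a569ec'

# Fetch the full JSON document and extract only the field we care about.
track = JSON.parse(URI.open(url).read)
puts track['created_at']   # e.g. "2011/04/06 15:37:43 ..."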
Note that there is a great Chrome application for executing HTTP (and REST) requests: Postman. It lets you execute all HTTP methods, not only GET, control the headers and the content you send, and see what comes back.
If you use Firefox, Firebug provides a similar thing.
Finally, you could have a look at this link for hints about how Web APIs work and are designed: https://templth.wordpress.com/2014/12/15/designing-a-web-api/.
Hope this helps and answers your question,
Thierry
Straight from the browser address bar you can use REST endpoints that respond to a GET. That is what you are doing when you hit that URI: you are sending an HTTP GET to the server, and it is sending back JSON.
You are not always guaranteed JSON, or anything in particular, when hitting a known REST endpoint; what each endpoint returns for a GET is specific to how it was built. In this case it is built to return JSON, but others may return an HTML page. In my experience, most endpoints that return JSON expect you to process that object programmatically and don't give you many options to request a specific field. Here is a good link on how to process JSON using JavaScript.
You can use REST clients (such as the Advanced REST Client for Chrome) to craft HTTP POST and PUT requests if a specific REST endpoint has the functionality built in to receive data and do something with it. For example, a lot of wiki-style REST endpoints will allow you to create a page with a specifically crafted HTTP POST containing specific header information, URI parameters, or a JSON body.
You can also install the DHC client app in Chrome and send requests such as PUT or GET.

Finding out what http requests were made by the user

I am working on a project involving finding out what http requests were made by the user.
I have all the HTTP request and response headers (but not the data), and I need to find out what content was requested by the user and what content was sent automatically (e.g. ad pages, streaming in the background, and all sorts of irrelevant content).
When recording the network traffic (even for a short period) a lot of content gets generated, and most of it is not relevant.
Since I'm no expert in HTTP, I'd like some direction as to which headers I can safely use (assuming most web pages send them), and which headers might be omitted and so would not be safe to rely on.
My current idea involves:
finding all the HTML files, checking which were the main HTML files (no referrer, or a search-engine referrer), and then recursively marking all the files requested by these HTML files as relevant and discarding the rest.
The problem with this is that I've been told I can't trust the Referer header, and I have no idea how to identify which HTML files were actually clicked on by the user.
Any help will be appreciated; sorry if the post is not formatted well, this is my first question here.
EDIT:
I've been told the question isn't clear enough, so all I'm asking for is some way to determine which requests were triggered by the user and which requests were made automatically.
To determine which requests were sent by the user, you should look at the first request sent through the connection and at its response body.
All external files referenced in this first body, which then get sent to the user one after the other, are most likely requested automatically without the user's interaction.
The time passing between requests could also be a factor worth looking at.
Another thing you already mentioned yourself is looking at the Referer header. As far as RFC 2616 section 14.36 goes, it can be trusted, as the Referer header must not be sent if the request URI comes from user input. Note, though, that there could be automatically sent content which does not have the Referer header set, as it's optional.
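To make that concrete, here is a rough Ruby sketch of such a classification, assuming each captured request has already been parsed into a hash with :url, :referer, and :content_type keys (those field names, and the search-engine pattern, are just placeholders):

# Classify captured requests as user-initiated or automatic.
# A request is treated as user-initiated when it is an HTML page
# with no Referer, or whose Referer is a search engine.
SEARCH_ENGINES = /google\.|bing\.|yahoo\./

def user_initiated?(request)
  return false unless request[:content_type].to_s.include?('text/html')
  referer = request[:referer]
  referer.nil? || referer.empty? || referer =~ SEARCH_ENGINES
end

def relevant_urls(requests)
  # Start from the user-initiated HTML pages...
  roots    = requests.select { |r| user_initiated?(r) }
  relevant = roots.map { |r| r[:url] }
  # ...then recursively mark everything referenced from a relevant page.
  loop do
    added = requests.select { |r| relevant.include?(r[:referer]) }
                    .map    { |r| r[:url] } - relevant
    break if added.empty?
    relevant.concat(added)
  end
  relevant
end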

Is there a generally available HTTPS POST smoke test?

When debugging an HTTP client, one of your first tests is likely to be a Google search, which lets you see whether your client does non-SSL GETs properly. Everyone knows where it is, everyone can use it, and everyone can see whether it succeeded.
My client has a problem with HTTPS POST. I can reproduce it locally with my specially set-up HTTPS server, but I want others to be able to try it as well. Is there a public web page using HTTPS where sending a test POST is not a bad idea?
Edit: In the end, the problem turned out to be that my client was buffering network output line by line when sending over TLS. Obviously, that causes problems for POST but not for GET...
I stumbled across this question while looking for the same thing, but I also found https://posttestserver.com/, which provides such a service for HTTP and HTTPS.
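If it helps, a minimal Ruby client check against a service like that could look like the sketch below; the exact upload path is an assumption, so check the service's front page for the one it actually accepts:

require 'net/http'
require 'uri'

# Hypothetical target path -- the test service documents the real one.
uri = URI('https://posttestserver.com/post.php')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  request = Net::HTTP::Post.new(uri)
  request.set_form_data('hello' => 'world')
  http.request(request)
end

puts response.code   # expect a 2xx status if the POST went through
puts response.body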
Google App Engine has supported HTTPS for a while now. That should give you a simple, easy way to put up test pages for anybody to see and serve them over both HTTP and HTTPS. Give us the link too; it could be useful for our clients if the tests are generic enough.
The simplest public HTTPS post test I can think of would be webmail.
For example, create a dummy Google account, then take that dummy's username and password and see if you can log in using https://www.google.com/accounts/ManageAccount (a simple HTTPS POST form).
Create a Twitter account. Because of the JSON API, checking for a valid POST to Twitter is very simple. For the POST, you can look at the API docs for Status Updates. Once you've made a post, you can check the results at the User Timeline.
The API docs even have simple examples with curl to show you how easy it is. The POST:
curl -u user:password -d "status=testing my HTTP POST request" https://twitter.com/statuses/update.json
And getting the status to check it:
curl https://twitter.com/statuses/user_timeline.json?screen_name=user
Any login form should do.
In short, no. But without further info as to what specific bug you're experiencing, it's hard to search for something that already exists. My suggestion would be to find a free hosting service and put the test page up there, along with a small Google ad for some revenue.
Interesting concept, though, the publicly available test cases for standards. I like.
I'll bet that Google search will accept a search parameter as a POST if you send it that way.
SSL adds a lot of complexity to the transaction, and you actually should break it up into two pieces.
You should do a GET over HTTPS first. When I was smoke testing networking for Netscape/AOL/Mozilla, I used http://www.verisign.com, because that was the home page for the main certificate vendor. I did not test the HTTP/SSL implementation itself, but we figured that while we were sitting around clicking on links in a build, we might as well do SSL versions of the HTTP requests.
I cannot easily think of a good https: URL that uses POST, but I actually think it matters a lot less.
Once you know that SSL is working w/ HTTP at all, failures that are request-specific are going to be pretty limited, based on my recollection. Then again, this area was not assigned directly to me, so take that with a grain of salt.
My more recent thinking about testing is that test groups need to set up their own systems, especially test servers. You would probably get better mileage from a good working set of instructions on how to configure HTTPS with a self-signed cert, and then creating your own internal POST test pages.

Pulling webpages from an adult site -- how to get past the site agreement?

I'm trying to parse a bunch of webpages from an adult website using Ruby:
require 'hpricot'
require 'open-uri'
doc = Hpricot(open('random page on an adult website'))
However, what I end up getting instead is that initial 'Site Agreement' page making sure that you're 18+, etc.
How do I get past the Site Agreement and pull the webpages I want? (If there's a way to do it, any language is fine.)
You're going to have to figure out how the site detects that a visitor has accepted the agreement.
The most obvious choice would be cookies. Likely when a visitor accepts the agreement, a cookie is sent to their browser, which is then passed back to the site on every subsequent request.
You'll have to get your script to act like a visitor by accepting the cookie, and sending it with every subsequent request. This will require programming on your part to request the "accept agreement" page first, find the cookie, and store it for use. It's likely that they don't use a specific cookie for the agreement, but rather store it in a session, in which case you just need to find the session cookie.
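A minimal Ruby sketch of that cookie dance might look like the following; the agreement URL, its parameter, and the page URL are hypothetical and need to be replaced with whatever the real site uses:

require 'net/http'
require 'open-uri'
require 'uri'
require 'hpricot'

# Hypothetical agreement endpoint -- find the real one in the page source
# or with a header-sniffing tool.
agree_uri = URI('http://example-site.com/agree?over18=yes')
agree_response = Net::HTTP.get_response(agree_uri)

# Keep whatever cookie the site sets when the agreement is accepted.
cookie = agree_response['Set-Cookie']

# open-uri accepts extra request headers, so reuse the cookie on every
# page you actually want to scrape.
doc = Hpricot(URI.open('http://example-site.com/random-page', 'Cookie' => cookie))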
The 'Site Agreement' page probably has a link you have to click or a form you have to submit to send back to the server before you can proceed. Read the source of that page to be sure. You could send that response from your application. I don't know how to do that in Ruby, but I've seen similar tasks done using cURL and libcurl, which can probably be used from Ruby.
Install the LiveHTTPHeaders plugin for Firefox and visit the site. Watch the headers and see what happens when you accept the agreement. You'll probably see that the browser sends some request (possibly a POST) and accepts some cookies. Then you'll have to repeat whatever the browser does in your Ruby script.