Web Scraping, data mining, data extraction - html

I am tasked with creating a web scraping software, and I don't know where to even begin. Any help would be appreciated, even just telling me how this data is organized, or what "type" of data layout the website is using would help, because I would be able to Google search that term.
http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952
http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952
Basically, I need to extract the "harmonic values" from this website. Specifically, I need the 9 numbers displayed on the second link. The numbers are not in the page's HTML; they just seem to update automatically every few seconds. I need to be able to extract these values in real time as they update. If that is not possible, I still need to show that such web scraping is impossible. I have not been given any APIs for the back end, and I do not know how their site receives the data.
Overall, ANY help would be appreciated, even if it's just some simple search terms to point me in the right direction. I am currently clueless about web scraping/data mining.

Web Scraping
Parsing HTML from a website is also called screen scraping. It is the process of accessing information on an external website (the information must be public data) and processing it as required. For instance, to get the average rating of the Nokia Lumia 1020 across different websites, we can scrape the rating from each site and calculate the average in our code. In general, whatever you can see as an ordinary user (public data) you can scrape easily with HTML Agility Pack.
Try These :
ASP.NET : HTMLAgilityPack (open source library)
Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET
PHP & CURL : WEB SCRAPING WITH PHP & CURL
Node.js : Screen Scraping with Node.js
YQL & Ajax : Screen scraping using YQL and AJAX
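The rating example above can be sketched in Python with only the standard library. This is a minimal illustration, not how any of the linked libraries work: the `class="rating"` markup and the two stand-in pages are assumptions made up for the example, and a real scraper would download the pages first.

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collects the text of every element with class="rating" (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_rating = False
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if ("class", "rating") in attrs:
            self._in_rating = True

    def handle_endtag(self, tag):
        self._in_rating = False

    def handle_data(self, data):
        if self._in_rating and data.strip():
            self.ratings.append(float(data))


def average_rating(pages):
    """Parse each HTML string and return the mean of all ratings found."""
    parser = RatingParser()
    for html in pages:
        parser.feed(html)
    return sum(parser.ratings) / len(parser.ratings)


# Stand-in pages (assumed markup); real code would fetch these over HTTP.
pages = [
    '<div><span class="rating">4.5</span></div>',
    '<p>Score: <b class="rating">3.5</b></p>',
]
```

Calling `average_rating(pages)` on the two stand-in pages averages 4.5 and 3.5.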

Try http://code.google.com/p/crawler4j/
It is very easy to use; you only have to override one class, Controller.java.
You only need to specify the seeds and it returns the text and the HTML data in two variables for every website crawled.

The second link pulls its information from an API every few seconds. In Google Chrome you can inspect requests like this by opening the developer tools and clicking on "Network". You then see which requests are sent and can easily replicate them by right-clicking a request -> Copy as cURL. That gives you a cURL command that includes all the headers and POST data sent by the request. This is what the second link was calling:
curl 'http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData' -H 'Cookie: ASP.NET_SessionId=oq0qiwuqbb3g3453jvyysvjx' -H 'Origin: http://utilsub.lbs.ubc.ca' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Host: utilsub.lbs.ubc.ca' -H 'Accept-Language: de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36' -H 'Content-Type: application/json; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' --data-binary $'{\'dgm\':\'x-pml:/diagrams/ud/network.dgm\',\'id\':\'75660a13-5145-42d5-b661-a50f328306c7\',\'node\':\'\'}' --compressed
The API returns XML wrapped in JSON.
You might want to use cURL with PHP as codeSpy said; you just have to set all the headers and POST data and replicate the request properly, otherwise the API won't respond to your request.
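If PHP is not a requirement, the same request can be replicated in Python with the standard library. This is a rough sketch based only on the copied cURL command above: the session cookie will have expired and needs a fresh value, and the key `"d"` used to unwrap the XML is an assumption (it is the usual wrapper for ASP.NET page methods, but check the real response in the Network tab).

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

URL = "http://utilsub.lbs.ubc.ca/ion/default.aspx/GetRTxmlData"


def build_request(dgm, unique_id, session_cookie):
    """Recreate the POST the page sends; values come from the copied cURL command."""
    body = json.dumps({"dgm": dgm, "id": unique_id, "node": ""}).encode("utf-8")
    headers = {
        "Content-Type": "application/json; charset=UTF-8",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "X-Requested-With": "XMLHttpRequest",
        "Cookie": "ASP.NET_SessionId=" + session_cookie,
    }
    # Passing data makes urllib send a POST
    return urllib.request.Request(URL, data=body, headers=headers)


def unwrap_xml(response_text, key="d"):
    """Parse the JSON envelope and return the root of the XML inside it.

    Assumption: the XML string sits under the key "d".
    """
    return ET.fromstring(json.loads(response_text)[key])


def fetch_once(request):
    """Send the request and return the parsed XML root (needs network access)."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return unwrap_xml(resp.read().decode("utf-8"))
```

To follow the values in real time, call `fetch_once` in a loop with a `time.sleep` of a few seconds between calls, which is what the page itself does.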

Related

Empty response when using curl for JSON

This is potentially very basic, but I'm trying to access the JSON under this URL: https://fantasy.premierleague.com/drf/bootstrap-static.
You can see data if you visit the page in a browser but when I use
curl https://fantasy.premierleague.com/drf/bootstrap-static
I get a 200 response but no data (at least that I can see).
Is there something I'm missing? Possibly header related?
Thanks
Looks like it was indeed header related. I needed a "User-Agent"
curl -H "User-Agent: Testing" https://fantasy.premierleague.com/drf/bootstrap-static
worked fine.
In my searching I also discovered that if you go into the Network tab of Chrome's developer tools, you can right-click the request and copy it as a curl command. Hope this helps someone.
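The same fix applies when making the request from code. A minimal Python sketch, assuming the endpoint only checks that some User-Agent is present (the value "Testing" is taken from the curl command above):

```python
import urllib.request


def build_request(url, user_agent="Testing"):
    """Attach an explicit User-Agent header to the request."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the body as text (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Without the explicit header, urllib identifies itself as `Python-urllib/<version>`, which some servers also reject.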

How to connect 2 REST api's together

I am working on an online shop using 3dcart, and I want to connect the store to an inventory management system called ChannelGrabber. ChannelGrabber has provided me with a public and private key along with some parts of their API.
$ curl -v -X POST -d "grant_type=client_credentials&client_id=f836e7675c46adbc33d98e32c06dfc6f&client_secret=2f4e72f89bda7f15062a2ba9d107adb5" https://api.orderhub.io/accessToken
> POST /accessToken HTTP/1.1
> User-Agent: curl/7.35.0
> Host: api.orderhub.io
> Accept: */*
> Content-Length: 119
> Content-Type: application/x-www-form-urlencoded
>
< ...response headers...
{
"access_token": "aVSyKhKNPi5XXJqlIMCNfeZwSfvTvasTcWyX2lv2",
"token_type": "Bearer",
"expires_in": 3600
}
$ curl -v -X GET -H "Authorization: Bearer aVSyKhKNPi5XXJqlIMCNfeZwSfvTvasTcWyX2lv2" https://api.orderhub.io/ping
> GET /ping HTTP/1.1
> User-Agent: curl/7.35.0
> Host: api.orderhub.io
> Accept: */*
> Authorization: Bearer aVSyKhKNPi5XXJqlIMCNfeZwSfvTvasTcWyX2lv2
>
< ...response headers...
pong
3dcart has provided the following Git project as an example of how to connect to their client API: https://github.com/3dcart/REST-API-Client/tree/master/3dCartRestAPIClient.
My issue is that I have basically no idea how to go about connecting the two services. I don't know what language to use other than JSON, and I'm not even sure that is possible. I'm still a student and quite new to the world of programming, so I don't want to have to say I can't do this project, and I would quite like to learn how to do it.
Can anyone point me in the right direction?
I was asking myself a similar question some time ago. I am already familiar with making basic requests to REST APIs using Python. I want to connect the APIs of an online sales tool called Pipedrive and a tool for generating invoices and bills called Billomat. Both come with a sophisticated REST API, and I know how to get data from them or create new data in them.
If I write a Python script on my local computer, I can imagine what I'd have to code to pull, e.g., customer data from Pipedrive and create that customer in Billomat. The problem is that this process is completely manual.
To make the process completely automatic, I came to the following conclusion:
Use webhooks in Pipedrive to send out data when certain events happen
The data can only be sent to a URL, which generally should also be a REST API
This URL cannot be Billomat directly, because it won't understand or know what to do with the data
That's why I decided to code a little API myself and host it on my private web server
This API receives data from Pipedrive, processes it (e.g. maps field names from a customer record to the corresponding field names in Billomat), and then sends the prepared data over to Billomat in a format that it expects and understands
I know this does not directly answer the OP's question, but it would be my suggestion for a fully automatic solution in case you cannot alter the behaviour of at least one of the two APIs you'd like to connect.
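A very small sketch of such a "glue" API in Python, using only the standard library. Everything specific here is hypothetical: the field names in `FIELD_MAP` are invented for illustration and are not the real Pipedrive or Billomat field names, and the step that forwards the record to the second API is left as a comment.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping: source field name -> target field name
FIELD_MAP = {
    "name": "client_name",
    "email": "client_email",
    "phone": "client_phone",
}


def to_target_record(source_record, field_map=FIELD_MAP):
    """Rename the known fields and drop everything the target does not expect."""
    return {
        target: source_record[source]
        for source, target in field_map.items()
        if source in source_record
    }


class WebhookHandler(BaseHTTPRequestHandler):
    """Receives a webhook POST, maps the fields, and acknowledges it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        record = to_target_record(payload)
        # Here the prepared record would be sent on to the second API.
        self.send_response(204)
        self.end_headers()
```

To try it locally, serve the handler with `http.server.HTTPServer(("", 8000), WebhookHandler).serve_forever()` and point the webhook at that address.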
REST (Representational State Transfer) is a way of interfacing with your data. The idea is that the action is defined by the HTTP request method (GET, PUT, POST, etc.), while the URL contains no verb/action, only the kind of data (the resource).
JSON is just the way of communication between server and client. It's like 2 people deciding to speak the same language.
Now, in your client, you can make requests to as many services as you need, and interpret the results. This can be achieved in virtually any programming language. You will find a lot of libraries for both handling HTTP request and parsing JSON responses.
As for the right direction: pick the programming language you are most familiar with (if it's hard to decide, I would recommend Python, which is fairly easy to start with) and look for libraries for sending HTTP requests and parsing JSON strings.
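As an illustration, the two curl calls from the question translate to Python roughly as follows. This sketch uses only the standard library and only the URLs and field names visible in the question; the client ID and secret are passed in as parameters rather than hard-coded.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://api.orderhub.io/accessToken"
PING_URL = "https://api.orderhub.io/ping"


def build_token_request(client_id, client_secret):
    """POST the credentials as a form body, like curl -d does."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    # With data set, urllib sends a POST and defaults the
    # Content-Type to application/x-www-form-urlencoded.
    return urllib.request.Request(TOKEN_URL, data=form)


def build_ping_request(access_token):
    """GET /ping with the bearer token in the Authorization header."""
    return urllib.request.Request(
        PING_URL, headers={"Authorization": "Bearer " + access_token}
    )


def ping(client_id, client_secret):
    """Fetch a token, then call /ping with it (needs network access)."""
    with urllib.request.urlopen(build_token_request(client_id, client_secret)) as resp:
        token = json.load(resp)["access_token"]
    with urllib.request.urlopen(build_ping_request(token)) as resp:
        return resp.read().decode("utf-8")
```

Connecting the two services then means writing similar functions for the 3dcart API and passing data from one set of calls to the other.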

BOX-API: Trying to get a shared folder without a token 401 Unauthorized error

I want to query a shared folder without having to log the user in. From reading the documentation, this should be fine to do. If I run the example from my command line:
curl https://api.box.com/2.0/shared_items \
-H "Authorization: BoxAuth api_key=YOUR_API_KEY&shared_link=https%3A%2F%2Fwww.box.com%2Fs%2F8tqjqtoky18sbnoz264c"
Using my API key it works fine, however, within my app or just within a web browser, if I use:
https://api.box.com/2.0/shared_items -H "Authorization: BoxAuth api_key=YOUR_API_KEY&shared_link=https%3A%2F%2Fwww.box.com%2Fs%2F8tqjqtoky18sbnoz264c"
again with my API key, I get 401 Unauthorized error.
What am I doing wrong? Is it an encoding issue? It looks like the end part of the string needs to be encoded while the rest of it doesn't. I have tried to make sure that the C# code I am using does not encode the string, and I don't think it does, but it still fails with 401.
It looks like the shared link from the example that you're using (the one ending with 8tqjqtoky18sbnoz264c) is no longer a valid URL. You should go into the Box web app and create a new shared link to test with, and that should work.
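On the encoding question: in the curl example only the shared link value is percent-encoded, and the whole string is sent as an HTTP header, which a URL pasted into a browser cannot carry. A small Python sketch that builds the same header, assuming the `BoxAuth` header format shown in the question is still what the API expects:

```python
import urllib.request
from urllib.parse import quote

SHARED_ITEMS_URL = "https://api.box.com/2.0/shared_items"


def box_auth_header(api_key, shared_link):
    """Percent-encode only the shared link, as in the curl example."""
    return "BoxAuth api_key={}&shared_link={}".format(
        api_key, quote(shared_link, safe="")
    )


def build_request(api_key, shared_link):
    """Put the value in the Authorization header, not in the URL."""
    return urllib.request.Request(
        SHARED_ITEMS_URL,
        headers={"Authorization": box_auth_header(api_key, shared_link)},
    )
```

With a valid, freshly created shared link, `urllib.request.urlopen(build_request(key, link))` sends the same request as the curl command.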

Do REST POST action from within HTML

Further to https://stackoverflow.com/questions/16726368/apache-user-auth-and-redirection-based-on-remote-user, I need to do some POSTs to the REST API but I'm struggling with how to implement this.
I need to do the equivalent of:
/usr/bin/curl -s -X POST -H "Accept: application/xml" --cacert ca.cer -u user@domain:password -d "<action />" https://server:port/api/id/stop
in just plain HTML.
Any ideas?
No feature of HTML will allow you to construct a request in which you:
Override the browser's Accept header
Specify a certificate to use instead of the browser's library of them
Specify HTTP auth (at least cross browser, some may still accept URIs in the form http://foo:bar@example.com) or
Make a POST request with XML as the body
You could specify some of that using JavaScript/XHR, but not with "just plain HTML".
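For comparison, this is roughly what a non-HTML client has to set to reproduce the curl command. It is a Python sketch, not something that runs in a page; the URL, user name and password are placeholders passed in by the caller, and `ca.cer` is the certificate file name from the question.

```python
import base64
import ssl
import urllib.request


def build_stop_request(url, user, password):
    """Build the POST with an Accept header, HTTP basic auth, and the XML body."""
    token = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"<action />",  # same body as curl -d; Content-Type left at the form default, as curl does
        headers={
            "Accept": "application/xml",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


def send(request, ca_file="ca.cer"):
    """Send the request, trusting the CA in ca_file (needs the file and network access)."""
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context) as resp:
        return resp.status
```

Each of the four points listed above corresponds to one piece here: the `Accept` header, the `cafile` argument, the `Authorization` header, and the raw XML `data`.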

POST: sending a post request in a url itself

I have been given a url .. www.abc.com/details and asked to send my name and phone number on this url using POST. They have told me to set the content-type as application/json and the body as valid JSON with the following keys:
name: name of the user
phone number: phone number of the user
Now I have no clue how to send this request! Will it be something like:
http://www.abc.com/details?method=post&name=john&phonenumber=445566
or do I have to use Java to send it?
Please help
Based on what you provided, what you need to do is pretty simple, and there are a number of ways to go about it. You'll need something that lets you post a body with your request. Almost any programming language can do this, as can command-line tools like cURL.
Once you have your tool decided, you'll need to create your JSON body and submit it to the server.
An example using cURL would be (all in one line, minus the \ at the end of the first line):
curl -v -H "Content-Type: application/json" -X POST \
-d '{"name":"your name","phonenumber":"111-111"}' http://www.example.com/details
The above command will create a request that should look like the following:
POST /details HTTP/1.1
Host: www.example.com
Content-Type: application/json
Content-Length: 44

{"name":"your name","phonenumber":"111-111"}
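The same request can be built in a programming language instead of cURL. A minimal Python sketch with the standard library, using the example URL and key names from the cURL command above:

```python
import json
import urllib.request


def build_details_request(name, phonenumber, url="http://www.example.com/details"):
    """Build a POST whose body is compact JSON, with a JSON content type."""
    body = json.dumps(
        {"name": name, "phonenumber": phonenumber},
        separators=(",", ":"),  # no spaces, so the body matches the cURL example byte for byte
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the response body (needs network access)."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

For the values `"your name"` and `"111-111"` the encoded body is 44 bytes long, which is where the Content-Length of 44 comes from.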
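You can post data to a URL with JavaScript and jQuery, something like this:
// $.ajax is used so the content type can be set to JSON
$.ajax({
url: "http://www.abc.com/details",
type: "POST",
contentType: "application/json",
data: JSON.stringify({name: "John", "phone number": "+410000000"})
});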
It is not possible to send POST parameters in the URL in a straightforward manner. A POST request by definition sends its information in the body.
I found a fairly simple way to do this. Use Postman (an API client, originally distributed as a Chrome app), which allows you to specify the content-type (a header field) as application/json and then provide the name-value pairs in the request body.
You can find clear directions at [2020-09-04: broken link - see comment] http://docs.brightcove.com/en/video-cloud/player-management/guides/postman.html
Just use your URL in the place of theirs.
You can use Postman: select POST as the method and send a JSON object in the request body.
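On Windows the cURL command above did not work for me, because cmd.exe does not treat single quotes as quoting characters. The following command, with double quotes around the body and the inner quotes escaped, worked; I used it to create a session in the couchdb sync gateway for a specific user: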
curl -v -H "Content-Type: application/json" -X POST -d "{ \"name\": \"abc\",\"password\": \"abc123\" }" http://localhost:4984/todo/_session
If you request a URL directly from the browser (for example, to consume a web service) without going through an HTML page, the request is a GET by default, and a GET normally carries no body. To make a POST you need something that issues it, such as an HTML/JSP page containing a form tag with method="post"; the data then travels in the request body instead of the URL. You cannot turn a URL into a POST just by typing it; the method has to be specified through some such medium. For example, in the URL http://example.com/details?name=john&phonenumber=445566 the data (name, phone number) is attached to the URL itself, so the server treats it as GET data: it arrives through the URL, not inside a request body.
In Java you can use GET, which shows the requested data in the URL. POST does not, because a POST carries its data in a body while a GET has no body.
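The difference is easy to see when the two requests are built in code. A short Python sketch with the standard library, using the example URL and values from above:

```python
import urllib.parse
import urllib.request

params = {"name": "john", "phonenumber": "445566"}
query = urllib.parse.urlencode(params)

# GET: the data is part of the URL and there is no body
get_req = urllib.request.Request("http://example.com/details?" + query)

# POST: the URL stays clean and the same data travels in the body
post_req = urllib.request.Request(
    "http://example.com/details", data=query.encode("ascii")
)
```

`get_req.get_method()` and `post_req.get_method()` report which method urllib would use for each request; supplying a body is what switches it from GET to POST.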