Extract HTML Tables from Multiple Webpages Using R

Hi, I have done thorough research and have come this far. All I am trying to do is extract an HTML table that spans many webpages.
I have to query sec.gov's database, and it returns a table with the appropriate number of results (the size and number of pages vary with every query). For example:
Link: http://www.sec.gov/cgi-bin/srch-edgar
Inputs to be given:
Enter a Search string box: form-type=(8-k) AND filing-date=20140523
Start: 2014
End: 2014
How can I do this entirely in R, without even opening a browser?
Here is what I have done so far:
I tried many packages, and the closest I came was with the RCurl package. But with getURL() I had to open the browser, run the query there, and paste the resulting URL into getURL(). It returned a very long character string, which contains the URLs that can be looped over to produce the output I want. All of this information is in the "center" tag of the output.
Now I do not know how to extract those URLs from the middle of that character string.
Also, this is not what I wanted. I wanted to run the web query directly from R and get the varied HTML table outputs directly into R. Is this possible at all?
Thanks
Meena

Yes, it is possible. You will want to use a combination of the RCurl and XML packages. You will need to programmatically generate the query parameters in the URL (based on the HTML form) and then use getURL() or getURLContent(). Sometimes the server expects an HTTP POST, in which case there is postForm().
To parse the result, look up the XPath language, which the XML package supports with getNodeSet(). The XML package also has readHTMLTable(), which parses an HTML table straight into a data.frame.
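For example, here is a minimal sketch of that approach. Treat the query parameter names (text, first, last) and the XPath as assumptions to verify against the actual EDGAR search form:

library(RCurl)
library(XML)

# Build the query URL programmatically; the parameter names mirror the
# EDGAR form above, but verify them against the real form.
query <- URLencode("form-type=(8-k) AND filing-date=20140523", reserved = TRUE)
url   <- paste0("http://www.sec.gov/cgi-bin/srch-edgar",
                "?text=", query, "&first=2014&last=2014")

page <- getURL(url)
doc  <- htmlParse(page, asText = TRUE)

# Pull the result links out with XPath...
links <- xpathSApply(doc, "//a/@href")
# ...and parse any HTML tables straight into data.frames.
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)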

Related

OneNote does not return pages

This is a similar question to "OneNote pages API doesn't return pages in section-groups".
I'm using Get sections and Get section pages (with sections expanded) to get the names of all pages in a notebook. However, even though I use the same requests every time, the sections in section groups sometimes disappear and won't reappear until the group is recreated.
Is this a bug that can somehow be worked around, or is there a better way of polling all the page names from a specific notebook? The pages need to be in order.
If you are trying to get all the page names for a Notebook, a work-around exists by using an OData nested filter. The idea is to query for all pages, $expand the parentNotebook, and then $filter on the id of the parentNotebook. Here is an example URL.
GET ~/pages?$expand=parentNotebook&$filter=parentNotebook/id%20eq%20'{$notebook_id_here}'
Here is another SO question where someone employs a similar pattern: Best way to use One Note API to GET specific pages in specific section in specific notebook?
Update:
You can control the order of the returned pages by using OData's $orderby and specifying properties that exist on the entities in the returned entity set (in this case, the entity is pages). From dev.onenote.com: "The default [order] is lastModifiedTime desc (most recently modified page first)."
Under https://dev.onenote.com/docs#/reference/get-pages there is a section, "Page properties", that shows all the properties returned by this call. Since we are getting all the pages available to a user that exist in a notebook, the only property we can use is unfortunately createdTime.
The query param to add looks like $orderby=createdTime
In full:
GET ~/pages?$expand=parentNotebook&$filter=parentNotebook/id%20eq%20'{$notebook_id_here}'&$orderby=createdTime
I just tested this using Fiddler against my own pages so I think it should work. The nice thing is that it is a single request.
GET https://www.onenote.com/api/v1.0/me/notes/pages?$expand=parentNotebook&$filter=parentNotebook/id%20eq%20'{$YOUR_NOTEBOOK_ID_HERE}'&$orderby=createdTime
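If you want to reproduce this outside Fiddler, the same request can be issued from the command line with curl (assuming you already have an OAuth access token in $TOKEN; the $ signs in the OData parameters are escaped so the shell leaves them alone, and NOTEBOOK_ID is a placeholder):

curl -H "Authorization: Bearer $TOKEN" \
  "https://www.onenote.com/api/v1.0/me/notes/pages?\$expand=parentNotebook&\$filter=parentNotebook/id%20eq%20'{NOTEBOOK_ID}'&\$orderby=createdTime"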

Scan an area of a web page's source code for changes and report them?

This is one heck of a confusing question to ask, so here it goes. Firstly, I'm not asking you to write me any code; I just need help going in the right direction for what I'm trying to achieve. Basically, the task is this: I want to scan a select area of a web page's source code for changes, and if something does change, I want to report it somewhere (like a console). However, I do not want just a notification of the change; I also want to know what the change is/was. I've been looking into things like jsoup, but I am still struggling to even find out what this technique is called.
Any pointers would be insanely appreciated. Thanks, Optimistic.
Here are some steps assuming this is from a node.js project:
Get the URL for the specific script file you're looking for a change in.
Using the request() module, fetch that URL.
Break the data up into lines (probably using .split()).
Find the specific line you are looking for, either by counting line numbers or by searching for some representative text in that line.
Using some sort of search in that line (perhaps a regex), find the current value of the exact item in that line you are looking for.
Save the current value.
Then, at some future time, repeat this whole process and compare what you find to the previous value.
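Putting those steps together, a minimal sketch might look like this (the URL, the marker text "version", and the regex are all placeholders to adapt):

const request = require('request');

let previousValue = null;

function checkForChange() {
  // Placeholder URL: point this at the script file you are watching.
  request('http://example.com/static/app.js', function (err, res, body) {
    if (err) return console.error(err);
    // Split into lines and find the one of interest by representative text.
    const line = body.split('\n').find(function (l) {
      return l.indexOf('version') !== -1;
    });
    if (!line) return;
    // Pull the exact value out of that line with a regex.
    const match = line.match(/version\s*=\s*"([^"]+)"/);
    if (!match) return;
    const current = match[1];
    // Compare to the saved value and report what changed.
    if (previousValue !== null && current !== previousValue) {
      console.log('Changed from "' + previousValue + '" to "' + current + '"');
    }
    previousValue = current;
  });
}

checkForChange();
setInterval(checkForChange, 60 * 1000); // re-check every minute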
If this is being done from a browser instead of node.js, then use an Ajax call to retrieve the file. If the file is on another domain from your web page and that domain does not permit cross-origin requests, then you cannot solve this problem in an automated fashion from a browser in your own web page.
Here is how I would do it with Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect(url).get();
String scriptCssQuery = "script"; // Tune this CSS query to find THE script you need.
Element script = doc.select(scriptCssQuery).first();
if (script != null) {
    String scriptLines = script.html();
    // Store the changing line somewhere and compare it to its previous value...
}

Sequentialising Google Sheets Macro

I'm looking for a way to recursively generate links to next pages on a website with a canonical structure. In essence, I'm trying to generate a link to each next page and then feed that result back into the process to find the following page, ad infinitum. However, I'm having problems automating this, as the macro seems to be trying to generate the result for cells that are empty (i.e. the results for an earlier cell haven't been created/copied yet).
So I'd like to sequentialise the macro to start from A20, generate the result for that cell, copy that result to A21, then begin the macro again for A21, et cetera, without requiring constant human input.
The Google spreadsheet with the error can be seen here in cell C27 and the macro itself can be seen here.
I realise this may be quite a roundabout way to perform this task and am open to any suggestions that may be easier, more intuitive, or faster.
So, two suggestions. One is that with anything that is a continuous scroll, it's very easy to find the JSON source, and either grab all the data you want in one go or easily pick out the "next page"/pagination token...
I personally use importdata() and importxml() more than any other functions, and in Google Sheets I also use regexextract() and regexreplace() when needed.
For example, the JSON you're looking for is here: http://iconosquare.com/controller_nl.php?action=nlGetMethod&method=mediasTag&value=cricket&max_id=1145408330912313787
If you look at the top row, it tells you what the next min and max are, so technically you could just extract that piece to generate your URL.
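As a rough sketch of that extraction in sheet formulas - assuming the JSON exposes the next page id under a key like next_max_id, which you would confirm by inspecting that top row, and remembering that IMPORTDATA splits the response on commas (hence the JOIN):

In A20: =IMPORTDATA("http://iconosquare.com/controller_nl.php?action=nlGetMethod&method=mediasTag&value=cricket&max_id=1145408330912313787")
In A21: =REGEXEXTRACT(JOIN(",", A20:Z20), "next_max_id\D*(\d+)")
In A22: ="http://iconosquare.com/controller_nl.php?action=nlGetMethod&method=mediasTag&value=cricket&max_id=" & A21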
The second option is to just build the query such that it auto-increments the URLs. I can give you an example, but I would like to understand a little more about what you really want in the end result...
Are you just looking for the pagination URLs, or do you want to extract the actual data from them?

Exporting Oracle APEX regions with DBMS_LOB

I'm interested in exporting the contents of a page generated by Oracle APEX to an external file while preserving as much formatting as possible. Eventually, I'd like to export it to a .doc, .xls, or .pdf format. For now, I'm testing with a .doc file.
Currently, I'm attempting to do this by creating a PL/SQL anonymous block "Process" that executes when an "Export" button is pressed. Based on an example I found online, if I use the following code in the process, I can output one of the items on my page to a .doc file:
DECLARE
  test_blob BLOB;
BEGIN
  DBMS_LOB.createtemporary(test_blob, FALSE);
  DBMS_LOB.open(test_blob, DBMS_LOB.lob_readwrite);
  DBMS_LOB.append(test_blob, UTL_RAW.CAST_TO_RAW(:P4016_ITEM_NAME));
  OWA_UTIL.mime_header('application/doc', FALSE);
  HTP.p('Content-Length: ' || DBMS_LOB.getlength(test_blob));
  HTP.p('Content-Disposition: attachment; filename="text.doc"');
  OWA_UTIL.http_header_close;
  WPG_DOCLOAD.download_file(test_blob);
  DBMS_LOB.close(test_blob);
END;
However, I would like to output some regions of my page that include tables, which are not considered items, as far as I know (I'm still very new to APEX). If I include the table name in the DBMS_LOB.APPEND line, I receive an error message. Does anyone know of a simple way to reference these regions?
The only workaround I've found is to replicate the page in my exported file by enclosing the results of the SQL queries used to populate the tables in HTML based on the HTML of my APEX page. In other words, if I wanted to italicize something, I would do the following:
...
dbms_lob.append(test_blob, UTL_RAW.CAST_TO_RAW('<html><i>'));
dbms_lob.append(test_blob, [PARSED SQL QUERY]);
dbms_lob.append(test_blob, UTL_RAW.CAST_TO_RAW('</i></html>'));
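Spelled out a little more, the workaround amounts to a cursor loop over the region's source query; this is only a sketch, and the table and column names (emp, ename, sal) are placeholders:

dbms_lob.append(test_blob, UTL_RAW.CAST_TO_RAW('<html><table>'));
FOR r IN (SELECT ename, sal FROM emp) LOOP  -- placeholder for the region's source query
  -- One HTML row per result row of the region's query.
  dbms_lob.append(test_blob, UTL_RAW.CAST_TO_RAW(
    '<tr><td>' || r.ename || '</td><td>' || r.sal || '</td></tr>'));
END LOOP;
dbms_lob.append(test_blob, UTL_RAW.CAST_TO_RAW('</table></html>'));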
If anyone knows of a simpler way to do this, preferably involving a simple reference to my page regions, I would greatly appreciate it.
There are ways (three of them) to get a PDF - see http://www.oracle.com/technetwork/developer-tools/apex/learnmore/custom-pdf-reports-1953918.pdf.
You can also use an Interactive Report, which has an export option, though by default it only exports to HTML.
It is not exactly what you asked for, but you don't need to write any code.

How to remove a result set from an HTML page

I have a MySQL database and a few Perl scripts with which I generate the webpages.
Links are available on the HTML page.
For example:
Customer_link => (calls customers.pl), query executed: select * from customers
Now there is one more link, say Customer_in_mumbai => it should remove all the customers whose city is not Mumbai.
How do I achieve that?
Do I need to execute the query once again with a WHERE clause, or is there another way to simply remove the customers whose city is not Mumbai?
Also, if I do need to execute the query again, do I need to write one more Perl file? If not, how can I use the same file?
You can use JavaScript to manipulate things on the client side after the page has been loaded/displayed. Without JavaScript, there's nothing you can do on the server side to change a page once it has been downloaded.
The rest of your questions indicate a lack of familiarity with how dynamic web pages are generated. You can have a single page that does all of that, using standard HTTP query variables to modify how the script operates, e.g.:
http://example.com/yourscript.pl?remove=mumbai
then have Perl retrieve that remove value and use it to modify how the database query runs. Showing you all of that is beyond the scope of this site - we're not here to teach you, just to help fix problems.
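That said, for orientation only, here is a minimal sketch of that pattern with CGI and DBI (the database name, credentials, and table/column names are all placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use DBI;

my $q      = CGI->new;
my $remove = $q->param('remove');   # e.g. yourscript.pl?remove=mumbai

# Placeholder connection details.
my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                       { RaiseError => 1 });

# One script serves both links: add a WHERE clause only when the
# query variable is present (keep the named city, drop the rest).
my $sql  = 'SELECT * FROM customers';
my @bind;
if (defined $remove) {
    $sql .= ' WHERE city = ?';
    push @bind, $remove;
}

my $rows = $dbh->selectall_arrayref($sql, undef, @bind);
# ...then render $rows as HTML the same way the existing script does.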
From your question, it seems you only want to remove the rows from the page temporarily for the user. You can achieve that simply by using JavaScript to remove the rows that do not have mumbai as a value. That should save you server-side processing. Use a library like jQuery to do it easily.
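For example, a minimal jQuery sketch - assuming the customers are rendered as rows of a table with id "customers" and the city appears somewhere in each row's text (both assumptions about your markup):

// Hide every customer row whose text does not mention "mumbai".
$('#customers tbody tr').filter(function () {
    return $(this).text().toLowerCase().indexOf('mumbai') === -1;
}).hide();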