I'm attempting to extract the data from this Google Politics Insights webpage, from "Jan-2012 to the Present", for Mitt Romney and Barack Obama, for the following datasets:
Search Trends: based on volume
Google News Mentions: mentions in articles and blog posts
YouTube Video Views: views from candidate channels
For a visual example, here's what I mean:
Using Firebug, I was able to figure out that the data is stored in a format readable by Raphael 2.1.0; I looked at the dataset, and nothing strikes me as a simple way to convert it to CSV.
How do I convert the data, per chart and per presidential candidate, into a CSV that has a table for "Search Trends", "Google News Mentions", and "YouTube Video Views", broken down by the smallest increment of time, with the results measured in the graph normalized to a range of 0.0 to 1.0? (Note: the reason for "0.0 to 1.0" is that the graphs do not appear to give volume information, so the volume is relative to the height of the graph itself.)
Alternatively, if there's another source for all three datasets in CSV, that would work too.
The first thing to do is find out where the data comes from, so I looked at the network traffic in my developer console and found it very quickly: the data is stored as JSON here.
Now you've got plenty of data for each candidate. I don't know exactly how these numbers relate to one another, but they are definitely what the graph is calculated from. I found the relevant spot in main.js on line 392, where the data is transformed with this expression:
Math.log(dataPoints[i][j] * 100.0) / Math.log(logScaleBase);
My guess is: undo the logarithm with a bit of exponentiation and you should get the right results.
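A minimal sketch of that inversion in Python (assumptions: the value of logScaleBase and the JSON layout, taken here to be a list of rows of plotted values, must be read from main.js and the actual response; the file names are placeholders):

    # Sketch: invert y = log(x * 100) / log(logScaleBase) to recover x,
    # then rescale everything to 0.0-1.0 and write it out as CSV.
    import csv
    import json

    LOG_SCALE_BASE = 10.0  # assumption: take the real value from main.js

    def invert(y):
        # x = logScaleBase ** y / 100 undoes the expression above
        return LOG_SCALE_BASE ** y / 100.0

    with open("candidate.json") as f:        # the JSON from the network tab
        rows = [[invert(v) for v in row] for row in json.load(f)]

    peak = max(v for row in rows for v in row)  # rescale so the peak is 1.0
    with open("candidate.csv", "w", newline="") as f:
        csv.writer(f).writerows([v / peak for v in row] for row in rows)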
Related
My company has a web service that we use to keep track of some information, and they have built an API that allows us to get information out of it. My group is trying to do the same. However, the other groups have all used C# to accomplish this, and this team's development level isn't higher than Excel VBA. What I need to do:
1. Go to a known URL: http://service.company.com/Group=1
2. Get the JSON result that appears on the screen into VBA
3. Translate the JSON into something human-readable - this part seems to have been solved here
The rest of the code around what I need to do, I can handle. I'm hoping there is a JSON reader built into VBA that I can leverage for part 3. Any help would be much appreciated.
EDIT:
I figured out how to get the JSON information out of the web page; there was some user/password authentication required in order to get it. So that is parts 1 and 2 done. I'm working on part 3. The JSON information seems to come out in this pattern:
[{"Column1":value1,"Column2":value2},
{"Column1":value1,"Column2":value2}]
I'm looking for a way to recursively generate links to the next pages on a website with a canonical structure. In essence, I'm trying to generate a link to each next page and then feed that result back into the process to find the following page, ad infinitum. However, I'm having problems automating this, as the macro seems to try to generate results for cells that are empty (i.e. the results for an earlier cell haven't been created/copied yet).
So I'd like to sequentialise the macro to start from A20, generate the result for that cell, copy that result to A21, then begin the macro again for A21, et cetera, without requiring constant human input.
The Google spreadsheet with the error can be seen here in cell C27 and the macro itself can be seen here.
I realise this may be quite a roundabout way to perform this task and am open to any suggestions that may be easier, more intuitive, or faster.
So, two suggestions. The first is that with anything that is a continuous scroll, it's very easy to find the JSON source and either grab all the data you want in one go or pick out the "next page"/pagination value...
I personally use importdata() and importxml() more than any other functions, and when in Google Sheets I also use regexextract() and regexreplace() when needed.
For example, the JSON you're looking for is here: http://iconosquare.com/controller_nl.php?action=nlGetMethod&method=mediasTag&value=cricket&max_id=1145408330912313787
If you look at the top row, it tells you what the next min and max are, so technically you could just extract that piece to generate your URL.
The second option is to build the query so that it auto-increments the URLs. I can give you an example, but I would like to understand a little more about what you really want in the end result...
Are you just looking for the pagination urls, or are you wanting to extract the actual data from them?
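To illustrate the first suggestion, here is a rough sketch in Python of following that pagination (the next_max_id field name is an assumption; check the actual top row of the response):

    # Sketch: follow the pagination by reading the "next max id" from
    # each JSON response and plugging it into the next request.
    import json
    import urllib.request

    BASE = ("http://iconosquare.com/controller_nl.php?action=nlGetMethod"
            "&method=mediasTag&value=cricket&max_id={}")

    max_id = "1145408330912313787"
    while max_id:
        with urllib.request.urlopen(BASE.format(max_id)) as resp:
            payload = json.load(resp)
        print("fetched page ending at", max_id)
        max_id = payload.get("next_max_id")  # assumed field name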
Hi, I have done thorough research and have gotten this far. All I am trying to do is extract an HTML table spanning many webpages.
I have to query sec.gov's database, and it returns a table with the appropriate number of results (the size and number of pages vary with every query). For example:
Link: http://www.sec.gov/cgi-bin/srch-edgar
Inputs to be given:
Enter a Search string box: form-type=(8-k) AND filing-date=20140523
Start: 2014
End: 2014
How can I do this totally in R without even opening the browser?
I am sharing what I have done so far.
I tried many packages, and the closest I came was with the RCurl package. But with the getURL function, I had to open the browser, run the query in the browser, and paste the resulting URL into getURL. It returned a very long character string, which contains the URLs that can be looped over to produce the output I want. All of this information is in the "center" tag of the output.
Now I do not know how to get those URLs out of the middle of that character string.
Also, this is not what I wanted. I wanted to run a web query directly from R and get the varied HTML table outputs directly into R. Is this possible at all?
Thanks
Meena
Yes, it is possible. You will want to use a combination of the RCurl and XML packages. You will need to programmatically generate the query parameters in the URL (based on the HTML form) and then use getURL() or getURLContent(). Sometimes, the server will expect an HTTP POST, so there is postForm().
To parse the result, look up the XPath language, which the XML package supports with getNodeSet(). The XML package also provides readHTMLTable() for parsing an HTML table into a data.frame.
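The R specifics are above; purely to illustrate the same workflow (build the query URL programmatically, fetch it, pull nodes out with XPath), here is a rough Python equivalent. The parameter names are guesses based on the sec.gov form above, and lxml is a third-party package:

    # Sketch: build the query URL, fetch it, and use XPath to pull the
    # links out of the "center" tag the question mentions.
    import urllib.parse
    import urllib.request
    from lxml import html  # third-party; pip install lxml

    params = {
        "text": "form-type=(8-k) AND filing-date=20140523",  # guessed names
        "first": "2014",
        "last": "2014",
    }
    url = ("http://www.sec.gov/cgi-bin/srch-edgar?"
           + urllib.parse.urlencode(params))

    with urllib.request.urlopen(url) as resp:
        doc = html.fromstring(resp.read())

    for href in doc.xpath("//center//a/@href"):  # XPath, as with getNodeSet()
        print(href)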
You might want to invest in this book.
So here is my problem. I plan to implement a localized map for my college presenting all the locations, such as the main block, the tech park, etc. Not only do I plan to develop a GUI, I also want to run my own algorithms, such as finding the quickest route from one block to another (note: I will be writing the algorithm myself, since I don't want to treat the shortest route as the quickest but want to add my own parameters as weights). I want to host the map locally (say, on an in-house system), and it should be able to serve real-time requests (displaying the route to the nearest cafeteria) and display current data (such as what event is taking place in which corner of the campus). I know the Google Maps API or the OpenStreetMap/OpenLayers API will enable me to build my own map, but can I run my own algorithms on them? Also, can I add elements that I have created and replace the traditional building/office components with my own?
You can do the following :
1. Export a part of the OpenStreetMap data from their website (go to the Export tab).
2. Use ElementTree in Python to parse the exported XML data.
3. Use networkx to add the parsed data into a graph.
4. Run your algorithms on it.
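A minimal sketch of steps 2-4, assuming the export was saved as map.osm (real OSM data needs more care, and the uniform edge weight is a placeholder for your own parameters):

    # Sketch: parse an OSM export with ElementTree, load it into a
    # networkx graph, and run a shortest-path algorithm on it.
    import xml.etree.ElementTree as ET
    import networkx as nx  # third-party; pip install networkx

    root = ET.parse("map.osm").getroot()   # step 1's export

    G = nx.Graph()
    for node in root.iter("node"):         # OSM nodes become graph nodes
        G.add_node(node.get("id"),
                   lat=float(node.get("lat")),
                   lon=float(node.get("lon")))

    for way in root.iter("way"):           # OSM ways become chains of edges
        refs = [nd.get("ref") for nd in way.iter("nd")]
        for a, b in zip(refs, refs[1:]):
            G.add_edge(a, b, weight=1.0)   # replace with your own weights

    # step 4, e.g. the quickest route once the weights encode your parameters:
    # path = nx.shortest_path(G, source_id, target_id, weight="weight")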
As part of the WP:ASE project, I want to get the list of editors that have edited a given article.
For instance, for the article Szklarka Mielęcka (history) that would be:
Kotbot, AnomieBOT, Xenobot
I could not find anything in the MediaWiki API.
Any better idea than scraping the history web page?
Downloading the history data dumps is not a solution because I don't have the resources to handle 5 terabytes of text.
Scale: I want to do this on about 1000 random articles, twice a year.
I have found:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Sinjhoro&rvprop=user&rvlimit=500
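That query returns the user of each of the last 500 revisions. Building on it, here is a minimal sketch in Python that collects the deduplicated editor list for one article, with continuation handling for articles with more than 500 revisions (format=json and the User-Agent header are my additions to the query above):

    # Sketch: list the editors of an article via the MediaWiki API,
    # following the continuation protocol for long histories.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def editors(title):
        users, cont = set(), {}
        while True:
            params = {"action": "query", "prop": "revisions",
                      "titles": title, "rvprop": "user",
                      "rvlimit": "500", "format": "json", **cont}
            req = urllib.request.Request(
                API + "?" + urllib.parse.urlencode(params),
                headers={"User-Agent": "editor-list-sketch/0.1"})  # be polite
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            for page in data["query"]["pages"].values():
                users.update(r["user"] for r in page.get("revisions", [])
                             if "user" in r)
            cont = data.get("continue")
            if not cont:
                return users

    print(sorted(editors("Sinjhoro")))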