How to connect to Statistics Canada JSON from R

I'm trying to connect to an online database through R, which can be found here:
http://open.canada.ca/data/en/dataset/2270e3a4-447c-45f6-8e63-aea9fe94948f
How would I be able to load the data table into R, and then simply change the table name in my code to access other tables? I'm not particularly concerned with what format I need to use (JSON, JSON-LD, XML).
Thanks in advance!

Assuming you know the URLs for each of the datasets, a similar question can be found here:
Download a file from HTTPS using download.file()
For this it becomes:
library(RCurl)
URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)
URLout <- read.csv(textConnection(x),row.names=NULL)
I obtained the URL by right-clicking the access button and copying the address.
I had to declare row.names=NULL because the number of columns in the first row is not equal to the number of columns elsewhere, so read.csv assumes the first column contains row names, as described here. I'm not sure whether the URL for these datasets changes when they are updated, but either way this isn't a particularly convenient way to get the data. The JSON doesn't seem much better for intuitively switching between datasets.
At least this way you could create a list of URLs and perform the following:
URL <- list(getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"),
getURL("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr72-eng.htm&lan=eng"))
URLout <- lapply(URL,function(x) read.csv(textConnection(x),row.names=NULL,skip=2))
Again, I don't like having to declare row.names=NULL, and when I look at the file I don't see the discrepant number of columns, but this will at least get the file into the R environment for you. It may take some more work to perform the operation over multiple URLs.
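For instance, a minimal sketch that keys the results by table name, so you only ever change one vector (the URL pattern is inferred from the two links above, so double-check it for other tables):
library(RCurl)
# Assumed URL pattern for the CSV export, based on the links above
statcan_url <- function(tbl) {
  paste0("http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=",
         tbl, "-eng.htm&lan=eng")
}
tables <- c("labr71a", "labr72")            # add or change table names here
raw <- lapply(statcan_url(tables), getURL)  # download each CSV as text
URLout <- lapply(raw, function(x) read.csv(textConnection(x), row.names = NULL, skip = 2))
names(URLout) <- tables                     # then access a table with URLout[["labr71a"]]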
In a further effort to obtain useful colnames:
URL <- "http://www.statcan.gc.ca/cgi-bin/sum-som/fl/cstsaveascsv.cgi?filename=labr71a-eng.htm&lan=eng"
x <- getURL(URL)
URLout <- read.csv(textConnection(x),row.names=NULL, skip=2)
The argument skip = 2 will skip the first 2 rows when reading in the CSV and will yield some header names. Because the headers are numbers, read.csv places an X in front of them. Row 2 in this case will have the value "number" in the second column. Unfortunately it appears this data was intended for use within Excel, which is really sad.
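If the X prefixes bother you, a small follow-up tweak (just a sketch) is to strip them after reading:
# URLout comes from the read.csv call above; drop the leading "X" that
# read.csv adds to purely numeric column names
names(URLout) <- sub("^X", "", names(URLout))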

1) You need to download the CSV into some directory that you have access to.
2) Use "read.csv", or "read_csv", or "fread" to read that csv file into R.
yourTableName<-read.csv("C:/..../canadaDataset.csv")
3) You can assign that CSV to whatever object name you want.
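To make step 3 repeatable, a small sketch (the folder path and file naming are placeholders you would adapt to wherever you saved the downloads):
# Hypothetical helper: read any downloaded StatCan CSV by table name
read_statcan <- function(table_name, dir = "C:/data/statcan") {
  read.csv(file.path(dir, paste0(table_name, ".csv")))
}
yourTableName <- read_statcan("canadaDataset")  # swap the name to load a different table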

Related

Scrape a Table-Like Index from HTML in R

I am currently working to scrape the table at this website, which contains variable IDs, question text, variable type, and origin dataset from ICPSR's PATH Survey data. My end goal is to create a spreadsheet inventory matrix of variable IDs and their corresponding question text by scraping this information in R, but I am having trouble getting it to work. In short, I aim to essentially get the table shown at the url above into a spreadsheet.
I've tried using rvest, XML, and a number of other packages/strategies (read.table, htmltab, htmltable, etc.), but the underlying table does not appear to be a table-like object "under the hood", if you will. Therefore, I am struggling to find a resource/previous question that helps scrape a table that may not necessarily be a table in structure, but certainly is a table visually.
Any help would be appreciated on this. Thanks!
I think most of that content is located within a script tag, from which it is pulled dynamically by JavaScript when the browser renders the page.
You can regex out the appropriate JavaScript object and handle it as JSON. However, given the variability within the returned list under response$docs, you are going to need to spend some time studying the JSON and deciding what you want and how you will organise the output, then write a custom function to apply to the list, returning, say, a dataframe of results.
The following shows how to extract the documents list:
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)
s <- read_html('https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#') %>%
html_text()
r <- stringr::str_match(s, 'searchResults : (\\{.*\\}), searchConfig')
data <- jsonlite::parse_json(r[1,2])
docs <- data$response$docs
A sample item from the list (not reproduced here) illustrates how much the fields can vary between items.
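As a rough sketch of that custom function (the fields name and varlabel are guesses; inspect docs[[1]] and substitute the real field names before relying on this):
# Pull a couple of assumed fields from each doc into one row
extract_row <- function(doc) {
  data.frame(
    variable = if (!is.null(doc$name)) doc$name else NA,
    question = if (!is.null(doc$varlabel)) doc$varlabel else NA,
    stringsAsFactors = FALSE
  )
}
inventory <- do.call(rbind, lapply(docs, extract_row))
write.csv(inventory, "variable_inventory.csv", row.names = FALSE)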

Python Loop through variable in URLs

What I want to do here is change a user id within a URL for every URL and then get the output from each URL.
What I did so far:
import urllib
import requests
import json
url="https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
data=requests.get(url).json()
print (data['user'])
(I type in 'user' inside of the print because it gives all the information about a focal user in json format)
My question is that I want to change the user id (which is 12345 in this example URL) to another number (any random number) and then get the output for every URL I request. For example, change it to 5211 and get the result, then change it to 959444 and get the result, and so on. I think I need a loop that iterates by changing just the number within the URL, but I kept failing because I found it difficult to split the original URL and change only the user id inside it. Could anyone help me out?
Thank you so much in advance.
===================== The next question is stated below =====================
Thank you for your previous answer! I built my code on top of it and got it working, but ran into another issue. I can iterate through and fetch each user's information in JSON format. The output gave me single quotes (rather than double quotes) and a weird u' notation in front of every key, but I was able to solve this issue. Anyway, I cleaned it up into a perfectly neat JSON format.
My plan is to convert each json into a csv file but want to stack all the json I scrape to one csv file. For example, the first json format on user1 will be converted into a csv file and user1 will be considered row1 and all the keys in json will be column names and all the corresponding values will be the values for the corresponding columns. And the second json format I scrape will convert into the same csv file but in the second row, and so on.
import pandas as pd
from pandas.io.json import json_normalize

eg_data = [data['user']]
df = pd.DataFrame.from_dict(json_normalize(data['user']))  # flatten any nested keys
print(df)
df.to_csv('C:/Users/todd/Downloads/eg.csv')
So, I found that json_normalize flattens the nested brackets, so it's useful in a real-world example. I also tried to use a pandas DataFrame to present it as a table. Here I have 2 questions: 1. How do I stack each JSON I scrape, one per row, into a single CSV file? (If there's another way to do this without using a pandas DataFrame, that would also be appreciated.) 2. As far as I know, a pandas DataFrame won't give you an output unless every row has the same number of columns, but in my case every JSON I've scraped has either 10 columns or 20 columns, depending on whether it has nested brackets or not. In this case, how do I stack all the rows into one CSV file?
Comments or questions will be greatly appreciated.
You can split the URL into two parts initially and join them together every time you generate a random number:
import random

url1 = "https://api.abc.com/users/"
url2 = "?api_key=5632lkjgdlg&_format=_show"
for i in range(4):
    num = random.randint(1000, 10000)  # you can change the range here for generating a random number
    url = url1 + str(num) + url2
    print(url)
OUTPUT
https://api.abc.com/users/2079?api_key=5632lkjgdlg&_format=_show
https://api.abc.com/users/2472?api_key=5632lkjgdlg&_format=_show
and so on...
But if you want to split at that exact place without knowing what the URL looks like beforehand, you can use a regex, since you know for sure that a ? follows the number.
import re

url = "https://api.abc.com/users/12345?api_key=5632lkjgdlg&_format=_show"
matches = re.split(r'\d+(?=\?)', url)
print(matches)
# ['https://api.abc.com/users/', '?api_key=5632lkjgdlg&_format=_show']
Now just set
url1=matches[0]
url2=matches[1]
And use the for loop.

Editing JSON - Add Attribute

I have a slew of JSON files I'm getting dumps of, with data from the day/period it was pulled. Most of the JSON files I'm dealing with are a lot larger than this, but I figured a smaller one would be easier to work with.
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820}
What I'm essentially trying to do is add a date attribute to each line of data, so that when I combine multiple JSON files to put through an analytical tool, the right row of data is associated with the correct date. My first thought was to write it as such:
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820,"date":072617}
since the "top" and "total" attributes are showing up on each row of data (with the associated values also showing up on each row) when I put it through an analytical tool like Tableau.
Also, I have been editing and saving files through Brackets, and testing things through this converter (https://konklone.io/json/).
In JavaScript:
var m = JSON.parse(json_string);
m["date"] = "20170804";           // add the date attribute
var updated = JSON.stringify(m);  // serialise back to a JSON string
This will work for you and is very simple.

Random selection from CSV file in Jmeter

I have a very large CSV file (8000+ items) of URLs that I'm reading with a CSV Data Set Config element. It is populating the path of an HTTP Request sampler and iterating through with a while controller.
This is fine, except that I want each user (thread) to pick a random URL from the CSV URL list. What I don't want is each thread using the CSV items sequentially.
I was able to achieve this with a Random Order Controller with multiple HTTP Request samplers, however 8000+ HTTP samplers really bogged JMeter down to an unusable state. So this is why I put the HTTP sampler URLs in the CSV file. It doesn't appear that I can use the Random Order Controller with the CSV file data, however. So how can I achieve random CSV data item selection per thread?
There is another way to achieve this:
create a separate thread group
depending on what you want to achieve:
add a (random) loop count -> this will set a start offset for the thread group that does the work
add a loop count or forever and a timer and let it loop while the other thread group is running. This thread group will read a 'pseudo' random line
It's not really random, the file is still read sequentially, but your work thread makes jumps in the file. It worked for me ;-)
There's no random selection function when reading CSV data. The reason is that you would need to read the whole file into memory first to do this, and that's a bad idea with a load test tool (any load test tool).
Other commercial tools solve this problem by automatically re-processing the data. In JMeter you can achieve the same manually by simply sorting the data on an arbitrary field. If you sort by, say, Surname, then the result is an effectively random distribution.
Note: if you keep the default All Threads sharing mode for the CSV Data Set Config, the data will be unique within the scope of the JMeter process.
The new Random CSV Data Set Config from BlazeMeter plugin should perfectly fit your needs.
As other answers have stated, the reason you're not able to select a line at random is because you would have to read the whole file into memory which is inefficient.
Rather than trying to get JMeter to handle this on the fly, why not just randomise the file order itself before you start the test?
A scripting language such as perl makes short work of this:
cat unrandom.csv | perl -MList::Util=shuffle -e 'print shuffle<STDIN>' > random.csv
For my case:
single column
small dataset
Non-changing CSV
I just discarded the CSV and, following https://stackoverflow.com/a/22042337/6463291, used a BeanShell PreProcessor instead, something like this:
String[] query = new String[]{"csv_element1", "csv_element2", "csv_element3"}; // the values that would otherwise live in the CSV
Random random = new Random();
int i = random.nextInt(query.length);  // pick a random index
vars.put("randomOption", query[i]);    // expose it to the test plan as ${randomOption}
Performance seems OK; if you have the same issue you can try this out.
I am not sure if this will work, but I will anyways suggest it.
Why not divide your URLs into 100 different CSV files? Then in each thread you generate a random number and use that number to identify which CSV file to read using the __CSVRead function.
CSVRead">http://jmeter.apache.org/usermanual/functions.html#_CSVRead
The only part I am not sure about is whether the __CSVRead function reopens the file every time or shares the same file handle across the threads.
You may want to try it. Please share your findings.
A much more straightforward solution:
In the CSV file, add another column (say B).
Apply the =RAND() function in the first cell of column B (say B1). This will create a random float number.
Drag the corner of that cell (B1) to apply it to all the corresponding URLs.
Sort by column B.
Your URLs will now be in random order.
Delete column B.

Read a Text File into R

I apologize if this has been asked previously, but I haven't been able to find an example online or elsewhere.
I have a very dirty data file in a text file (it may be JSON). I want to analyze the data in R, and since I am still new to the language, I want to read in the raw data and manipulate it as needed from there.
How would I go about reading in JSON from a text file on my machine? Additionally, if it isn't JSON, how can I read in the raw data as is (not parsed into columns, etc.) so I can go ahead and figure out how to parse it as needed?
Thanks in advance!
Use the rjson package. In particular, look at the fromJSON function in the documentation.
If you want further pointers, then search for rjson at the R Bloggers website.
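For example, a minimal sketch (the file name is a placeholder):
# install.packages("rjson")   # if not already installed
library(rjson)
dat <- fromJSON(file = "mydata.txt")  # parse the whole file as one JSON document
str(dat, max.level = 1)               # inspect the top-level structure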
If you want to use the packages related to JSON in R, there are a number of other posts on SO answering this. I presume you searched on JSON [r] already on this site, plenty of info there.
If you just want to read in the text file line by line and process later on, then you can use either scan() or readLines(). They appear to do the same thing, but there's an important difference between them.
scan() lets you define what kind of objects you want to find, how many, and so on. Read the help file for more info. You can use scan to read in every word/number/sign as an element of a vector using, e.g., scan(filename, ""). You can also use specific delimiters to separate the data. See also the examples in the help files.
To read line by line, you use readLines(filename) or scan(filename,"",sep="\n"). It gives you a vector with the lines of the file as elements. This again allows you to do custom processing of the text. Then again, if you really have to do this often, you might want to consider doing this in Perl.
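For example, a quick sketch of both (the file name is a placeholder):
# One element per line, handy for eyeballing messy raw data
lines <- readLines("mydata.txt")
head(lines)
# Roughly equivalent with scan(): character data, split on newlines
lines2 <- scan("mydata.txt", what = "", sep = "\n")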
Supposing your file is in JSON format, you may try the packages jsonlite, RJSONIO or rjson. All three packages provide the function fromJSON.
To install a package you use the install.packages function. For example:
install.packages("jsonlite")
Then, once the package is installed, you can load it using the library function.
library(jsonlite)
Generally, line-delimited JSON has one object per line, so you need to read it line by line, collecting the objects. For example:
con <- file('myBigJsonFile.json')
open(con)
objects <- list()
index <- 1
# Read one line at a time and parse each line as its own JSON object
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  objects[[index]] <- fromJSON(line)
  index <- index + 1
}
close(con)
After that, you have all the data in the objects variable. With that variable you may extract the information you want.
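If every object flattens to a single row, a quick sketch for stacking them into one data frame (this assumes the objects all share the same fields; otherwise you will need to align the columns first):
# One row per parsed object, then stack the rows
rows <- lapply(objects, function(obj) as.data.frame(obj, stringsAsFactors = FALSE))
all_data <- do.call(rbind, rows)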