Struggling to import NZ companies extract into R (JSON)

The NZ companies register offers a JSON file containing all publicly available business info. This file comes in at a whopping 40 GB, but there is also a smaller JSON file (~250 MB) containing data on unincorporated entities (sole traders etc.). As a warm-up exercise I thought I'd have a go at importing it into R to get an idea of size, scalability and computational requirements.
I'm having a lot of trouble importing the smaller JSON file into R. I've tried jsonlite, RJSONIO and rjson, but it appears that the file is written in an 'unorthodox' JSON format, hence the standard fromJSON commands are falling over. Below is a portion of the file (2 entities) which I've been trying to import into R: test.json
library(jsonlite)
json <- fromJSON("test.json", flatten=TRUE)
Error in parse_con(txt, bigint_as_char) :
parse error: invalid object key (must be a string)
zbn": [{ "entity": [{ { "australianBusinessNumbe
(right here) ------^
NB: JSONLint doesn't seem to think the file is valid JSON.
My thought is that I may need to use stream_in() or readLines(), but I am not very proficient with these functions. Any help or insight greatly appreciated. Cheers
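The error suggests the extract may be newline-delimited JSON (one complete entity per line) rather than a single JSON document, which is exactly what stream_in() is for. A minimal sketch, assuming test.json is NDJSON (the pagesize value is just illustrative):
library(jsonlite)
# stream_in() parses newline-delimited JSON incrementally, a page of
# records at a time, and binds the results into one data frame
entities <- stream_in(file("test.json"), pagesize = 1000)
str(entities, max.level = 1)
If the file really is a single malformed document instead, streaming won't rescue it and the offending bytes have to be fixed first.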

Related

Read graph into NetworkX from JSON file

I have downloaded my Facebook data and got it in the form of JSON files.
Now I am trying to read these JSON files into NetworkX, but I can't find any function to read a graph from a JSON file into NetworkX.
In another post I found information about reading a graph from JSON, but there the JSON file had been created from NetworkX in the first place using json.dump().
In my case, however, the data was downloaded from Facebook. Is there any function to read a graph from a JSON file into NetworkX?
Unlike Pandas tables or NumPy arrays, JSON files have no rigid structure, so one can't write a function that converts any JSON file to a NetworkX graph. If you want to construct a graph from JSON, you have to pick out the needed information yourself. You can load a file with the json.loads function, extract all nodes and edges according to your rules, and then put them into your graph with the add_nodes_from and add_edges_from functions.
For example, for a Facebook JSON file you could write something like this:
import json
import networkx as nx

# load the downloaded Facebook export
with open('fbdata.json') as f:
    json_data = json.loads(f.read())

G = nx.DiGraph()
# add one node per posting user, keyed by id so the nodes
# match the edge endpoints below
G.add_nodes_from(
    elem['from']['id']
    for elem in json_data['data']
)
# connect each user to the items they posted
G.add_edges_from(
    (elem['from']['id'], elem['id'])
    for elem in json_data['data']
)
nx.draw(
    G,
    with_labels=True
)
And you get a drawing of the resulting graph.

R - rjson "Unexpected character: :" (finding the source of the error)

I need to find certain information from a JSON data set that my company acquired. When I try to import it to a variable via the "fromJSON" method, I get the error listed in the title. The data set contains information for over 16,000 files, so searching for the problem manually just isn't an option (especially since it's JSON, so there are tons of colons). Is there a way in R to find the source, or at least line-number, of the problematic character(s)?
Paste the JSON here and validate it; it will tell you where the JSON is invalid:
https://jsonlint.com/
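If the file is too large to paste into a web validator, jsonlite can report the error position from within R. A minimal sketch, assuming the data sits in a file named data.json (a hypothetical name):
library(jsonlite)
# read the raw text and ask the parser where it fails
txt <- paste(readLines("data.json", warn = FALSE), collapse = "\n")
ok <- validate(txt)
if (!ok) {
  # the "err" attribute holds the parse error message, including the
  # offending snippet with a marker at the bad character
  cat(attr(ok, "err"))
}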

App Engine: load data from a static JSON file or load data into the datastore?

I'm new to App Engine and writing a REST API, and I'm wondering if anyone has been in this dilemma before.
The data that I have is not a lot (3 to 4 pages), but it changes annually.
Option 1: Write the data as JSON and parse the JSON file every time a request comes in.
Option 2: Model the data into objects, throw them into the datastore and retrieve them whenever a request comes in.
Does anyone know the pros and cons of each of these methods, or any better solutions?
Of course the answer is it depends.
Here are some of the questions I'd ask myself to make a decision:
do you want a change to the data to depend on a code push?
is there sensitive information in the data that should not be checked in to a VCS?
what other parts of your system depend on this data?
how likely are your assumptions about the data to change, in terms of update frequency and size?
Assuming the data is small (<1MB) and there's no sensitive information in it, I'd start out loading the JSON file as it's the simplest solution.
You don't have to parse the data on each request; you can parse it once at the top level and effectively treat it as a constant.
Something along these lines:
import os
import json

DATA_FILE = os.path.join(os.path.dirname(__file__), 'YOUR_DATA_FILE.json')

with open(DATA_FILE, 'r') as dataFile:
    JSON_DATA = json.loads(dataFile.read())
You can then use JSON_DATA like a dictionary in your code.
awesome_data = JSON_DATA['data']['awesome']
In case you need to access the data in multiple places, you can move this into its own module (e.g. config.py) and import JSON_DATA wherever you need it.
E.g. in main.py:
from config import JSON_DATA
# do something w/ JSON_DATA

Export R list into Julia via JSON

Suppose I have this list in R:
x = list(a=1:3,b=8:20)
and I write it to a JSON file on disk with:
library(jsonlite)
cat(toJSON(x),file="f.json")
how can I use the Julia JSON package to read that? Can I?
# Julia
using JSON
JSON.parse("/Users/florianoswald/f.json")
throws an error - I guess it expects a JSON string.
Any alternatives? I would benefit from being able to pass a list (i.e. a nested structure) rather than tabular data. Thanks!
If you want to do this with the current version of JSON.jl, you can use Julia's readall function to get a string from the file, i.e. JSON.parse(readall("/Users/florianoswald/f.json")).
Pkg.clone("JSON") will get you the latest development version of JSON.jl (as opposed to the latest released version) - it seems parsefile, which reads a file directly, is not in a released version yet.

Parse JSON with R

I am fairly new to R, but the more I use it, the more I see how powerful it really is compared to SAS or SPSS. One of the major benefits, as I see it, is the ability to get and analyze data from the web. I imagine this is possible (and maybe even straightforward), but I am looking to parse JSON data that is publicly available on the web. I am not a programmer by any stretch, so any help and instruction you can provide will be greatly appreciated. Even if you just point me to a basic working example, I can probably work through it.
RJSONIO from Omegahat is another package which provides facilities for reading and writing data in JSON format.
rjson does not use S4/S3 methods and so is not readily extensible, but it is still useful. Unfortunately, it does not use vectorized operations and so is too slow for non-trivial data. Similarly, for reading JSON data into R, it is somewhat slow and so does not scale to large data, should this be an issue.
Update (new Package 2013-12-03):
jsonlite: This package is a fork of the RJSONIO package. It builds on the parser from RJSONIO but implements a different mapping between R objects and JSON strings. The C code in this package is mostly from the RJSONIO Package, the R code has been rewritten from scratch. In addition to drop-in replacements for fromJSON and toJSON, the package has functions to serialize objects. Furthermore, the package contains a lot of unit tests to make sure that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.
The jsonlite package is easy to use and tries to convert json into data frames.
Example:
library(jsonlite)
# url returning badge data from the Stack Exchange API
url <- 'https://api.stackexchange.com/2.2/badges?order=desc&sort=rank&site=stackoverflow'
# read url and convert to data.frame
document <- fromJSON(txt=url)
Here is the equivalent example using rjson:
library(rjson)
url <- 'http://someurl/data.json'
document <- fromJSON(file=url, method='C')
The fromJSON() functions in RJSONIO, rjson and jsonlite don't return a simple 2D data.frame for complex nested JSON objects.
To overcome this you can use tidyjson. It takes in JSON and always returns a data.frame. It was originally not available on CRAN; you could only get it from https://github.com/sailthru/tidyjson
Update: tidyjson is now available on CRAN; you can install it directly using install.packages("tidyjson")
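A minimal sketch of the tidyjson approach (the JSON string and field names are made up for illustration): you declare the structure you expect, and each verb adds columns to a flat data.frame:
library(tidyjson)
library(magrittr)  # for the %>% pipe
people <- '[{"name": "Ann", "age": 30}, {"name": "Bob", "age": 25}]'
people %>%
  gather_array() %>%          # one row per array element
  spread_values(
    name = jstring("name"),   # extract a string field
    age = jnumber("age")      # extract a numeric field
  )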
For the record, rjson and RJSONIO do convert the JSON into R objects, but they don't really parse it into a usable shape per se. For instance, I receive ugly MongoDB data in JSON format, convert it with rjson or RJSONIO, then use unlist and tons of manual correction to actually get it into a usable matrix.
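For what that unlist step can look like, here is a minimal sketch, assuming every record in the (made-up) input shares the same scalar fields:
library(rjson)
# fromJSON() yields a list of records; flatten each record and
# stack them row-wise into a matrix
docs <- fromJSON('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]')
mat <- do.call(rbind, lapply(docs, unlist))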
Try the code below, which uses RJSONIO, in the console:
library(RJSONIO)
library(RCurl)
json_file = getURL("https://raw.githubusercontent.com/isrini/SI_IS607/master/books.json")
json_file2 = RJSONIO::fromJSON(json_file)
head(json_file2)