Selectively Import only Json data in txt file into R. - json

I have three questions, as I am relatively new to both R and the JSON format. I have read quite a bit but still don't quite understand.
1.) Can R parse JSON data when the txt file contains other irrelevant information as well?
Assuming it can't, I loaded the text file into R and did some cleaning up so that the file would be easier to read.
require(plyr)
require(rjson)
small.f.2 <- subset(small.f.1, ! V1 %in% c("Level_Index:", "Feature_Type:", "Goals:", "Move_Count:"))
small.f.3 <- small.f.2[,-1]
This would give me a single column with all the json data in each line.
I then tried to write a new .txt file.
write.table(small.f.3, file="small clean.txt", row.names = FALSE)
json_data <- fromJSON(file="small clean.txt")
The problem was it only converted 'x' (first row) into a character and ignored everything else. I imagined it was the problem with "x" so I took that out from the .txt file and ran it again.
json_data <- fromJSON(file="small clean copy.txt")
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=""))
Both calls worked and I managed to create a list, but it only takes the data from the first row and ignores the rest. This leads to my second question.
I also tried this:
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=","))
Error in fromJSON(paste(readLines("small clean copy.txt"), collapse = ",")) :
unexpected character ','
2.) How can I extract the rest of the rows in the .txt file?
3.) Is it possible for R to read the JSON data from one row, extract only the nested data that I need, and then move on to the next row, like a loop? For example, in each array I am only interested in the Action vectors and the State Feature vectors, not in the rest of the data. If I can extract only the information I need before moving on to the next array, then I can save a lot of memory.
I validated an individual array online. The .txt file as a whole is not valid JSON; only each individual array is. I hope this makes sense. Each row is a nested array.
The data looks something like this. I have about 65 rows (nested arrays) in total.
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],"LightningIndices":[],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.599999964237213,12.0,9.0,3.0,1.0,0.0,11.0,2.0,1.0,0.0,0.0,0.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12213890532609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13055793241076,0.0,0.0,0.0,0.0,0.0,0.231325346416068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.949158357257511,0.0,0.0,0.0,0.0,0.0,0.369666537828737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0851765937900996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223409208023677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.698640447815897,1.69496718435102,0.0,0.0,0.0,0.0,1.42312654023416,0.0,0.38394999584831,0.0,0.0,0.0,0.0,1.0,1.22164326251584,1.30980246401454,1.00411570750454,0.0,0.0,0.0,1.44306759429513,0.0,0.00568191150434618,0.0,0.0,0.0,0.0,0.0,0.0,0.157705869690127,0.0,0.0,0.0,0.0,0.102089274086033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37039305683305,2.64354332879095,0.0,0.456876463171171,0.0,0.0,0.208651305680117,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.46713142511126,2.26785558685153,0.284845692694476,0.29200364444299,0.0,0.562185300773834,1.79134869431988,0.423426746571872,0.0,0.0,0.0,0.0,5.06772310533214,0.0,1.95593334724537,2.08448537685298,1.22045520912269,0.251119892385839,0.0,4.86192274732091,0.0,0.186941346075472,0.0,0.0,0.0,0.0,4.37998688020614,0.0,3.04406665275463,1.0,0.49469909818283,0.0,0.0,1.57589195190525,0.0,0.0,0.0,0.0,0.0,0.0,3.55229001446173]}},......
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,24],"LightningIndices":[[15,16,17,18,19,20,21,22,23]],"SelectedAction":15,"State":{"Features":{"Data":[20.0,53.0,0.0,11.0,10.0,2.0,1.0,0.0,12.0,2.0,1.0,0.0,0.0,1.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110686363475575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13427913742728,0.0,0.0,0.0,0.0,0.0,0.218834141070836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.939443046803111,0.0,0.0,0.0,0.0,0.0,0.357568892126985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889329732996782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22521492930721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700441220022084,1.6762090551226,0.0,0.0,0.0,0.0,1.44526456614638,0.0,0.383689214317325,0.0,0.0,0.0,0.0,1.0,1.22583659574753,1.31795156033445,0.99710368703165,0.0,0.0,0.0,1.44325394830013,0.0,0.00418600599483917,0.0,0.0,0.0,0.0,0.0,0.0,0.157518319482216,0.0,0.0,0.0,0.0,0.110244186273209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369899973785845,2.55505143302811,0.0,0.463342609296841,0.0,0.0,0.226088384842823,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.47842109127488,2.38476342332125,0.0698115810371108,0.276804206873942,0.0,1.53514282355593,1.77391161515718,0.421465101754304,0.0,0.0,0.0,0.0,4.45530484778828,0.0,1.43798302409155,3.46965807176681,0.468528940277049,0.259853183829217,0.0,4.86988325473155,0.0,0.190659677933533,0.0,0.0,0.963116148760181,0.0,4.29930830894124,0.0,2.56201697590845,0.593423384852181,0.46165947868584,0.0,0.0,1.59497392171253,0.0,0.0,0.0,0.0,0.0368838512398189,0.0,4.24538684327048]}},......
I would really appreciate any advice here.
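One way to approach questions 2 and 3 is to treat the file as one JSON document per line rather than as a single JSON document. The following is a minimal sketch of that idea, written in Python rather than R purely for illustration. The helper name `extract_features` and the sample lines are made up, and it assumes every complete row has the `State` and `Actions` layout shown above. Lines that are not JSON objects (such as the `Level_Index:` lines) are skipped instead of being cleaned out beforehand.

```python
import json

def extract_features(lines):
    """Yield (state_data, actions_data) for each line that holds a JSON object."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object line, e.g. "Level_Index: 3"
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but did not parse; skip it
        # Keep only the two nested vectors of interest and drop the rest
        state = record["State"]["Features"]["Data"]
        actions = [action["Features"]["Data"] for action in record["Actions"]]
        yield state, actions

# Made-up, shortened sample in the same shape as the data above
json_line = (
    '{"SelectedAction": 1, "State": {"Features": {"Data": [21.0, 58.0]}}, '
    '"Actions": [{"Features": {"Data": [4.0, 4.0]}}, {"Features": {"Data": [1.0]}}]}'
)
sample = ["Level_Index: 3", json_line]
results = list(extract_features(sample))
```

Because the function is a generator, passing it an open file handle instead of a list means only one row is parsed and held in memory at a time.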

Related

Creating individual JSON files from a CSV file that is already in JSON format

I have JSON data in a CSV file that I need to break apart into separate JSON files. The data looks like this: {"EventMode":"","CalculateTax":"Y",.... There are multiple rows of this and I want each row to be a separate JSON file. I have used code provided by Jatin Grover that parses the CSV into JSON:
lcount = 0
out = json.dumps(row)
jsonoutput = open( 'json_file_path/parsedJSONfile'+str(lcount)+'.json', 'w')
jsonoutput.write(out)
lcount+=1
This does an excellent job. The problem is that it adds "R": " before the {"EventMode... and adds an extra \ between each element, as well as an item at the end.
Each row of the CSV file is already a valid JSON object. I just need to break each row into a separate file with the .json extension.
I hope that makes sense. I am very new to this all.
It's not clear from your picture what your CSV actually looks like.
I mocked up a really small CSV with JSON lines that looks like this:
Request
"{""id"":""1"", ""name"":""alice""}"
"{""id"":""2"", ""name"":""bob""}"
(all the double-quotes are for escaping the quotes that are part of the JSON)
When I run this little script:
import csv
with open('input.csv', newline='') as input_file:
    reader = csv.reader(input_file)
    next(reader)  # discard/skip the first line ("header")
    for i, row in enumerate(reader):
        with open(f'json_file_path/parsedJSONfile{i}.json', 'w') as output_file:
            output_file.write(row[0])
I get two files, json_file_path/parsedJSONfile0.json and json_file_path/parsedJSONfile1.json, that look like this:
{"id":"1", "name":"alice"}
and
{"id":"2", "name":"bob"}
Note that I'm not using json.dumps(...); that only makes sense if you are starting with data inside Python and want to save it as JSON. Your file already contains text that is complete JSON, so you can copy each line as-is to a new file.
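To see why json.dumps(...) produces the extra quotes and backslashes, here is a small illustration using one of the mocked-up lines. The variable names are only for this sketch.

```python
import json

line = '{"id":"1", "name":"alice"}'  # text that is already JSON

# json.dumps treats the text as an ordinary Python string and encodes it
# again: the result is wrapped in quotes and every inner quote is escaped.
double_encoded = json.dumps(line)

# json.loads can still be used to confirm a line is valid JSON before the
# unchanged text is written to its own file.
parsed = json.loads(line)
```

Decoding `double_encoded` once with json.loads gives back the original text, not a dictionary, which is a quick way to recognise JSON that has been encoded twice.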

How can I write certain sections of text from different lines to multiple lines?

So I'm currently trying to use Python to transform large amounts of data from a .txt file into a neat and tidy .csv file. The first stage is trying to get the 8-digit company numbers into one column called 'Company numbers'. I've created the header and just need to put each company number from each line into the column. What I want to know is, how do I tell my script to read the first eight characters of each line in the .txt file (which correspond to the company number) and then write them to the .csv file? This is probably very simple but I'm new to Python!
So far, I have something which looks like this:
with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv','w',newline='') as wf:
        outputDictWriter = csv.DictWriter(wf,['Company number'])
        outputDictWriter.writeheader()
        rf = rf.read(8)
        for line in rf:
            wf.write(line)
My recommendation would be to 1) read the file in, 2) make the relevant transformation, and then 3) write the results to file. I don't have sample data, so I can't verify whether my solution exactly addresses your case.
# 1) read the whole input file
with open('input.txt','r') as file_handle:
    file_content = file_handle.read()
# 2) keep the first eight characters of each line
list_of_IDs = []
for line in file_content.split('\n'):
    print("line = ",line)
    print("first 8 =", line[0:8])
    list_of_IDs.append(line[0:8])
# 3) write the header and one ID per row
with open("output.csv", "w") as file_handle:
    file_handle.write("Company\n")
    for line in list_of_IDs:
        file_handle.write(line+"\n")
The value of separating these steps is to enable debugging.
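The same three steps can also be written as small functions around the csv module, which makes each step testable on its own. This is only a sketch: the function names and the sample lines are invented, and it assumes blank lines should be ignored.

```python
import csv
import io

def company_numbers(lines, width=8):
    """Return the first `width` characters of every non-blank line."""
    return [line[:width] for line in lines if line.strip()]

def write_column(out_file, header, values):
    """Write a one-column CSV: the header first, then one value per row."""
    writer = csv.writer(out_file)
    writer.writerow([header])
    writer.writerows([value] for value in values)

# Made-up input lines standing in for the .txt file
sample = ["12345678 ACME LTD\n", "\n", "87654321 EXAMPLE PLC\n"]
ids = company_numbers(sample)

# An in-memory buffer stands in for the output file in this sketch
buffer = io.StringIO()
write_column(buffer, "Company number", ids)
```

With real files, `company_numbers` can be given the open input file directly, and `write_column` the output file opened with `newline=''`.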

convert json text entries to a dataframe in r

I have a text file with json like structure that contains values for certain variables as below.
[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]
[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]
Every line is a new set of values for the same variables 1 through 5. There can be 1000s of lines in a text file but the variables would always remain the same. I want to extract variable 1 through 5 along with their values and convert into a dataframe. Currently I perform these operations in excel using string manipulation and transpose. Here is what it looks like in excel -
How to do this in R? Much appreciated. Thanks.
J
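There is a package named jsonlite that you can use. Each line of your file is a separate JSON array, so parse the lines one at a time and bind the results together:
library("jsonlite")
# each line parses to a one-row data frame; rbind stacks them
json_lines <- readLines("YourPathToTheFile")
df <- do.call(rbind, lapply(json_lines, fromJSON))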
You can find more info here.
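For comparison, the line-by-line approach can be sketched in Python (not R). It assumes, as in the question, that every line is a JSON array of objects sharing the same keys; `rows_from_lines` is a hypothetical helper name.

```python
import json

def rows_from_lines(lines):
    """Parse each non-blank line as a JSON array of objects and flatten the result."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.extend(json.loads(line))  # each line holds a list of dicts
    return rows

sample = [
    '[{"variable1":"111","variable2":"666","variable3":"11","variable4":"aaa","variable5":"0"}]',
    '[{"variable1":"34","variable2":"12","variable3":"78","variable4":"qqq","variable5":"-9"}]',
]
rows = rows_from_lines(sample)

# Column names come from the first record; every record becomes one table row
columns = list(rows[0])
table = [[row[name] for name in columns] for row in rows]
```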

Read a log file in R

I'm trying to read a log file in R.
It looks like an extract from a JSON file to me, but when trying to read it using jsonlite I get the following error message: "Error: parse error: trailing garbage".
Here is what my log file looks like:
{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},
{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote","action":"Call","identifier":"RunUY"},
{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},
As you can see, the column name is given directly in front of the content on each line (e.g. "date": or "action":).
Some lines skip some columns and add others.
The output I want is a table with 7 columns, each filled with the corresponding data:
date
userId
license
usage
module
action
identifier
Does anyone have a suggestion about how to get there?
Thanks a lot in advance
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Thanks everyone for your answers. Here are some clarifications about my issue:
The data that I gave as an example is an extract of one of my log files. I've got a lot of them that I need to read as one single table.
I haven't added any commas or anything to it.
@r2evans
I've tried the following:
Log3 <-read.table("/Projects/data/analytics.log.agregated.2017-05-11.log")
jsonlite::stream_in(textConnection(gsub(",$","",Log3)))
It returns the following error:
Error: lexical error: invalid char in json text.
c(17, 18, 19, 20, 21, 22, 23, 2
(right here) ------^
I'm not sure how to use sed -e 's/,$//g' infile > outfile and Sys.which("sed"); that's something I'm not familiar with. I'm looking into it, but if you can give me any more details about how to use it, that would be great.
I have saved your example as a file "test.json" and was able to read and parse it like this:
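library(jsonlite)
library(readr)  # read_file() comes from readr, not jsonlite
rf <- read_file("test.json")
# turn every "}," into "}" so each line is a standalone JSON object
rfr <- gsub("\\},", "\\}", rf)
data <- stream_in(textConnection(rfr))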
It parses and simplifies into a neat data frame exactly like you want. What I do is look for "}," rather than ",$", because the very last comma is not (necessarily) followed by a newline character(s).
However, this might not be the best solution for very large files. For those you may need to first look for a way to modify the text file itself by getting rid of the commas. Or, if that's possible, ask the people who exported this file to export it in a normal ndjson format :-)
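For files too large to hold as a single string, the same clean-up can be done one line at a time. Below is a minimal Python sketch of that idea (not the R solution above); `parse_log` and `to_table` are hypothetical names. It assumes one JSON object per line with an optional trailing comma, and it fills columns that are missing from a line with `None`.

```python
import json

def parse_log(lines):
    """Parse lines of the form '{...},' into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip().rstrip(",")  # drop the trailing comma, if any
        if line:
            records.append(json.loads(line))
    return records

def to_table(records):
    """Return (columns, rows); keys missing from a record become None."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen order
    rows = [[record.get(name) for name in columns] for record in records]
    return columns, rows

sample = [
    '{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},',
    '{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},',
]
columns, rows = to_table(parse_log(sample))
```

Stripping the comma only at the end of each line leaves any "}," inside a nested object untouched, which a global replacement would not.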

Python 3 code to read CSV file, manipulate then create new file....works, but looking for improvements

This is my first ever post here. I am trying to learn a bit of Python. Using Python 3 and numpy.
I did a few tutorials, then decided to dive in and try a little project I might find useful at work, as that's a good way for me to learn.
I have written a program that reads in data from a CSV file which has a few rows of headers, I then want to extract certain columns from that file based on the header names, then output that back to a new csv file in a particular format.
The program I have works fine and does what I want, but as I'm a newbie I would like some tips as to how I can improve my code.
My main data file (csv) is about 57 columns long and about 36 rows deep so not big.
It works fine, but looking for advice & improvements.
import csv
import numpy as np
#make some arrays..at least I think thats what this does
A=[]
B=[]
keep_headers=[]
#open the main data csv file 'map.csv'...need to check what 'r' means
input_file = open('map.csv','r')
#read the contents of the file into 'data'
data=csv.reader(input_file, delimiter=',')
#skip the first 2 header rows as they are junk
next(data)
next(data)
#read in the next line as the 'header'
headers = next(data)
#Now read in the numeric data (float) from the main csv file 'map.csv'
A=np.genfromtxt('map.csv',delimiter=',',dtype='float',skiprows=5)
#Get the length of a column in A
Alen=len(A[:,0])
#now read the column header values I want to keep from 'keepheader.csv'
keep_headers=np.genfromtxt('keepheader.csv',delimiter=',',dtype='unicode_')
#Get the length of keep headers....i.e. how many headers I'm keeping.
head_len=len(keep_headers)
#Now loop round extracting all the columns with the keep header titles and
#append them to array B
i=0
while i < head_len:
    #use index to find the appropriate column number.
    item_num=headers.index(keep_headers[i])
    i=i+1
    #append the selected column to array B
    B=np.append(B,A[:,item_num])
#now reshape the B array
B=np.reshape(B,(head_len,36))
#now transpose it as thats the format I want.
B=np.transpose(B)
#save the array B back to a new csv file called 'cmap.csv'
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")
Thanks.
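You can greatly simplify your code by using more of NumPy's capabilities.
import numpy as np
# Load everything as strings, skipping the two junk rows so the header row comes first
A = np.loadtxt('stack.txt',skiprows=2,delimiter=',',dtype=str)
keep_headers=np.loadtxt('keepheader.csv',delimiter=',',dtype=str)
headers = A[0,:]
# Boolean mask of the columns whose header appears in keep_headers
cols_to_keep = np.isin( headers, keep_headers )
# Convert the remaining rows to float (np.float_ was removed in NumPy 2.0)
B = A[1:,cols_to_keep].astype(float)
np.savetxt('cmap.csv',B,fmt='%.3f',delimiter=",")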
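One difference from the loop in the question is worth noting: a boolean mask keeps the columns in the order they appear in the data file, whereas the loop collects them in the order listed in keepheader.csv. If that order matters, looking up an index per name preserves it. The sketch below shows the lookup with only the standard library rather than NumPy; `select_columns` and the tiny sample are made up for illustration.

```python
import csv
import io

def select_columns(rows, keep):
    """Return the data rows reduced to the columns named in `keep`, in that order."""
    header, *body = rows
    indexes = [header.index(name) for name in keep]
    return [[float(row[i]) for i in indexes] for row in body]

# Made-up stand-in for the header row followed by numeric data
text = "a,b,c\n1,2,3\n4,5,6\n"
rows = list(csv.reader(io.StringIO(text)))

# Requesting "c" before "a" returns the columns in that requested order
selected = select_columns(rows, ["c", "a"])
```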