Read a log file in R - json

I'm trying to read a log file in R.
It looks like an extract from a JSON file to me, but when trying to read it using jsonlite I get the following error message: "Error: parse error: trailing garbage".
Here is how my log file look like:
{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},
{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote","action":"Call","identifier":"RunUY"},
{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},
Has you can see, the column name is precised directly in front of the content for each line (e.g: "date": or "action":)
And some line can skip some columns and add some other.
What I want to get as output would be to have 7 columns with the corresponding data filled in each:
date
userId
license
usage
module
action
identifier
Does anyone has a suggestion about how to get there?
Thanks a lot in advance
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Thanks everyone for your answers. Here are some precisions about my issue:
The data that I gave as example in an extract of one of my log files. I've got a lot of them that I need to read as one unique table.
I haven't added any commas or anything to it.
#r2evans
I've tried the following:
Log3 <-read.table("/Projects/data/analytics.log.agregated.2017-05‌​-11.log") jsonlite::stream_in(textConnection(gsub(",$","",Log3)))
It returns the following error:
Error: lexical error: invalid char in json text.
c(17, 18, 19, 20, 21, 22, 23, 2
(right here) ------^
I'm not sure how to use sed -e 's/,$//g' infile > outfile and Sys.which("sed"), that something I'm not familiar with. I'm looking into it, but if you have anymore precisions to give me about the usage of it that would be great.

I have saved your example as a file "test.json" and was able to read and parse it like this:
library(jsonlite)
rf <- read_file("test.json")
rfr <- gsub("\\},", "\\}", rf)
data <- stream_in(textConnection(rfr))
It parses and simplifies into a neat data frame exactly like you want. What I do is look for "}," rather than ",$", because the very last comma is not (necessarily) followed by a newline character(s).
However, this might not be the best solution for very large files.. For those you may need to first look for a way to modify the text file itself by getting rid of the commas. Or, if that's possible, ask the people who exported this file to export it in a normal ndjson format:-)

Related

What is causing a csv load error in weka?

Im receiving the following error when trying to open a CSV file in Weka version 3.8.5
File not recognized as an 'CSV data files' file Reason: wrong number
of values. Read 2, expected 12, read Token [EOL], Line 2 Problem
encountered on Line:2
I have read solutions to similar errors on this site and can't seem to find what is wrong with my particular file. However, as a very newbie weka user, it may just be my misunderstanding of the issue. Can someone take a look at the sample csv data below and let me know if you see what I am not understnding or missing?
LossMonth,LossYear,ClaimNumber,PolicyNumber,ClaimBranch,Agency,LocationCounty,CATCode,CauseCode,IncurredLoss,CurrentReserves,"
City",State,ZIPCODE,"
COLLISIONTYPECD","
CLOSEDDT",DaystoCLose,"
FATALITYCNT","
FATALITYIND","
FAULTRATINGIND","
AUTOGLASSIND","
DEERLOSSIND","
WEATHERRELATEDIND","
POLICYTIERCD",ClaimStatus,AgencyHandled,VEHICLEYEAR,DRIVERRELATIONTOINSUREDDESC,TOTALLOSSIND,INSURANCESCORE,Age
10,2016,4125858,20169200,4,113,73,1,comp,2525,0,PADUCAH,KY,42001,x,42692,18,0,0,0,0,0,1,70,1,0,2004,Other third party,0,703,73
1,2018,4265645,20137828,13,106,37,1,hail,3164,0,BAGDAD,KY,40003,x,43214,88,0,0,0,0,0,0,50,1,0,2010,Named Insured,1,799,63
12,2016,4136759,20322058,5,105,105,1,hail,2547,0,GEORGETOWN,KY,40324,x,42713,2,0,0,0,0,0,0,10,1,0,2010,Named Insured,0,999,68
1,2016,4033032,20175699,13,106,106,1,comp,15327,0,SIMPSONVILLE,KY,40067,x,42469,73,0,0,0,0,0,1,80,1,0,2000,Named Insured,1,999,34
9,2016,4116782,20133146,2,115,115,1,wind,7529,0,SPRINGFIELD,KY,40069,x,42649,8,0,0,0,0,0,0,10,1,0,2003,Named Insured,0,783,47
2,2016,4038442,20170355,7,148,10,1,hail,3631,0,ASHLAND,KY,41101,x,42417,1,0,0,0,0,0,0,50,1,0,2010,Named Insured,0,778,42
2,2016,4039439,20218265,7,45,10,1,hail,3579,0,FLATWOODS,KY,41139,x,42444,25,0,0,0,0,0,0,40,1,0,2013,Named Insured,0,820,52
2,2016,4039440,20218265,7,45,10,1,hail,570,0,FLATWOODS,KY,41139,x,42422,3,0,0,0,0,0,0,40,1,0,2012,Named Insured,0,820,52
3,2018,4275810,20126522,15,40,40,1,hail,3747,0,LANCASTER,KY,40444,x,43216,55,0,0,0,0,0,0,10,1,0,2009,Named Insured,1,999,74
5,2016,4071936,20461965,15,40,40,1,hail,525,0,LANCASTER,KY,40444,x,42521,7,0,0,0,0,0,0,50,1,0,2006,Named Insured,0,999,68
3,2016,4046685,20226270,7,35,35,1,hail,3558,0,FLEMINGSBURG,KY,41041,x,42447,2,0,0,0,0,0,0,80,1,0,2012,Named Insured,0,842,69
4,2016,4055942,20439287,7,35,35,1,hail,2551,0,EWING,KY,41039,x,42475,1,0,0,0,0,0,0,70,1,0,2006,Named Insured,0,867,48
1,2016,4026514,20394097,7,148,10,1,hail,1350,0,ASHLAND,KY,41101,x,42376,3,0,0,0,0,0,0,40,1,0,2007,Named Insured,0,637,65
3,2016,4047152,20212062,15,141,76,1,hail,1739,0,BEREA,KY,40403,x,42473,27,0,0,0,0,0,0,80,1,0,2008,Named Insured,0,777,77
2,2016,4035512,20103029,15,40,40,1,hail,2008,0,LANCASTER,KY,40444,x,42405,1,0,0,0,0,0,0,0,1,0,2000,Named Insured,1,885,72
1,2016,4030456,20385643,15,120,40,1,hail,1497,0,LANCASTER,KY,40444,x,42450,62,0,0,0,0,0,0,20,1,0,2013,Named Insured,0,839,65
4,2016,4053299,20251610,5,69,11,1,hail,1535,0,DANVILLE,KY,40422,x,42514,48,0,0,0,0,0,0,100,1,0,2013,Insured,0,999,64
6,2016,4076264,20337992,17,140,1,1,hail,1799,0,MILLTOWN,KY,42728,x,42529,2,0,0,0,0,0,0,50,1,0,2002,Named Insured,0,999,84
8,2017,4217498,20596983,8,86,86,1,hail,660,0,TOMPKINSVILLE,KY,42167,x,42954,0,0,0,0,0,0,0,100,1,0,2012,Named Insured,0,999,45
1,2016,4026053,20511114,4,113,113,1,hail,1310,0,STURGIS,KY,42459,x,42376,3,0,0,0,0,0,0,100,1,0,2003,Named Insured,0,694,44
1,2016,4026766,20656586,4,113,113,1,hail,2360,0,MORGANFIELD,KY,42437,x,42383,9,0,0,0,0,0,0,20,1,0,2010,Named Insured,0,999,89
1,2016,4027473,20085251,6,42,42,1,hail,1699,0,MAYFIELD,KY,42066,x,42381,5,0,0,0,0,0,0,90,1,0,2008,Named Insured,0,747,50
1,2016,4029284,20167051,17,109,109,1,wind,3133,0,CAMPBELLSVILLE,KY,42718,x,42387,5,0,0,0,0,0,0,10,1,0,1993,Named Insured,0,886,78
1,2016,4031937,20326278,3,81,12,1,comp,3385,0,FOSTER,KY,41043,x,42402,8,0,0,0,0,0,1,40,1,0,2003,Named Insured,0,723,79
1,2016,4027931,20339366,8,107,107,1,wind,5858,0,FRANKLIN,KY,42134,x,42447,70,0,0,0,0,0,0,20,1,0,2014,Named Insured,0,940,80
1,2016,4028456,20453076,15,87,87,1,comp,2056,0,JEFFERSONVILLE,KY,40337,x,42387,7,0,0,0,0,0,1,100,1,0,2013,Named Insured,0,999,51
1,2016,4028597,20051661,4,113,113,1,hail,5320,0,WAVERLY,KY,42462,x,42712,332,0,0,0,0,0,0,20,1,0,2014,Named Insured,0,717,58
3,2016,4046687,20018268,6,42,42,1,hail,2736,0,MAYFIELD,KY,42066,x,42450,5,0,0,0,0,0,0,110,1,0,2012,Named Insured,0,735,73
9,2016,4116499,20128172,3,96,59,1,glss,320,0,TAYLOR MILL,KY,41015,x,42660,20,0,0,0,0,0,1,0,1,0,1997,Spouse,0,923,81
1,2016,4026247,20086164,4,113,113,1,hail,1611,0,MORGANFIELD,KY,42437,x,42376,3,0,0,0,0,0,0,10,1,0,2013,Named Insured,0,902,61
1,2016,4027222,20033936,6,79,79,1,glss,105,0,CALVERT CITY,KY,42029,x,42389,14,0,0,0,0,0,1,110,1,0,2001,Named Insured,0,772,57
1,2016,4028311,20059964,4,75,75,1,comp,1040,0,SACRAMENTO,KY,42372,x,42382,2,0,0,0,0,0,1,10,1,0,1996,Named Insured,0,999,64
1,2016,4029164,20541039,6,42,42,1,wind,1495,0,SEDALIA,KY,42079,x,42382,0,0,0,0,0,0,0,0,1,0,2008,Named Insured,0,756,67
1,2016,4027475,20085251,6,42,42,1,hail,940,0,MAYFIELD,KY,42066,x,42381,5,0,0,0,0,0,0,90,1,0,2013,Named Insured,0,747,50
1,2016,4030356,20007300,4,117,117,1,hail,6550,0,DIXON,KY,42409,x,42436,49,0,0,0,0,0,0,40,1,0,2009,Named Insured,0,864,34
Weka's CSVLoader cannot handle rows that span multiple lines (despite quoting). Once all your rows (header and data) are one per line, you should be fine.
The common-csv (unofficial) Weka package should be able to handle rows spanning multiple lines.

Python String replacement in a file each time gives different result

I have a JSON file and I want to do some replacements in it. I've made a code, it works but it's wonky.
This is where the replacement gets done.
replacements1 = {builtTelefon:'Isim', builtIlce:'Isim', builtAdres:'Isim', builtIsim:'Isim'}
replacements3 = {builtYesterdayTelefon:'Isim', builtYesterdayIlce:'Isim', builtYesterdayAdres:'Isim', builtYesterdayIsim:'Isim'}
with open('veri3.json', encoding='utf-8') as infile, open('veri2.json', 'w') as outfile:
for line in infile:
for src, target in replacements1.items():
line = line.replace(src, target)
for src, target in replacements3.items():
line = line.replace(src, target)
outfile.write(line)
Here's some examples to what builtAdres and builtYesterdayAdres looks like:
01 Temmuz 2018 Pazar.1
30 Haziran 2018 Cumartesi.1
I run this on my data but it results in many different outputs each time. Please do check the screenshot below because I don't know how else I can tell about it.
This is the very same code and I run the same thing everytime but it results in with different outcomes each time.
Here is the original JSON file:
What it should do is testing entire file against 01 Temmuz 2018 Pazar and if it finds just replaces it with string Isim without touching anything else. On a second run checks if anything is 30 Haziran 2018 Cumartesi and replaces them with string Isim too.
What's causing this?
Example files for re-testing:
pastebin - veri3.json
pastebin - code.py
I think you have just one problem: you're trying to use "Isim" as key name multiple times within the same object, and this will botch the JSON.
The reason why you might be "getting different results" might have to do with the client you're using to display the JSON. I think that if you look at the raw data, the JSON should have been fully altered (I ran your script and it seems to be altered). However, the client will not handle very well the repeated key, and will display all objects as well as it can.
In fact, I'm not sure how you get "Isim.1", "Isim.2" as keys, since you actually use "Isim" for all. The client must be trying to cope with the duplicity there.
Try this code, where I use "Isim.1", "Isim.2" etc.:
replacements1 = {builtTelefon:'Isim.3', builtIlce:'Isim.2', builtAdres:'Isim.1', builtIsim:'Isim'}
replacements3 = {builtYesterdayTelefon:'Isim.3', builtYesterdayIlce:'Isim.2', builtYesterdayAdres:'Isim.1', builtYesterdayIsim:'Isim'}
I think you should be able to have all the keys displayed now.
Oh and PS: to use your code with my locale I had to change line 124 to specify 'utf-8' as encoding for the outfile:
with open('veri3.json', encoding='utf-8') as infile, open('veri2.json', 'w', encoding='utf-8') as outfile:

Golang CSV read : extraneous " in field error

I am using a simple program to read CSV file, somehow I noticed when I created a CSV using EXCEL or windows based computer go library fails to read it. even when I use cat command it only shows me last line on the terminal. It always results in this error extraneous " in field.
I researched somewhat than I found it is somewhat related to carriage return differences between OS.
But I really want to ask how to make a generic csv reader. I tried reading the same csv using pandas and it was reading successfully. But i am not been able to achieve this using my Go code.
Also screen shot of correct csv Is here
Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid csv so go does return an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: After your update and a little bit of searching, I have to apoligize, the carriage return does matter and seems to be tha main culprit here, Go seems to be ok handling the \r\n windows variant but not the \r one. In that case what you can do is wrap the bytes.Reader into a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note, that the example is just that, an example, it's not handling all the possible cases where \r might be a legit byte.

Java.io.IOException: wrong number of values (WEKA CSV to ARFF)

Currently working on a Data Mining project using my own dataset I had found using Weka. The only issue is that taking my file from csv format and converting it into arff format is causing issues.
java.io.IOException: wrong number of values. Read 2, expected 5, Read Token[EOL], line 3
This is the error I am getting. I have browsed around online looking for similar issues and have tried removing all quotes and special characters that throw this exception. Every place I looked told me to remove special characters and I believe there are none left. The link to my dataset is here : https://docs.google.com/spreadsheets/d/1xqEe7MZE9SdKB_yvFSgWeSVYuDrq0b31Eu5oECNbGH0/edit#gid=1736568367&vpid=A1
This is the first three lines of my file where the first is the attribute names, file is separated by commas in note
Inequality Adjusted HPI Rank,Sub Region,Inequality Adjusted Life Expectancy,Inquality Adjusted Well being,Footprint
,Inequality adjusted HPI
1,1,73.1,6.9,2.5,48.2
2,6,65.17333333,5.487667631,1.390974448,45.97489063
If you open your file with a text editor, you will see that Footprint has quotes around it. Delete the quotes and you are good to go!
Weka is normally not that good in reading CSV files that include special characters, and ARFF files are normally easier to use. Therefore, in such cases, the easiest way is to convert your CSV file to an ARFF file using R ("RWeka" and "foreign" libraries can handle this conversion).
There is also another possibility. I was creating my CSV file and the header had a different number of elements compared to the rest of the data. So, check the header as well...!

Selectively Import only Json data in txt file into R.

I have 3 questions I would like to ask as I am relatively new to both R and Json format. I read quite a bit of things but I don't quite understand still.
1:) Can R parse Json data when the txt file contains other irrelevant information as well?
Assuming I can't, I uploaded the text file into R and did some cleaning up. So that it will be easier to read the file.
require(plyr)
require(rjson)
small.f.2 <- subset(small.f.1, ! V1 %in% c("Level_Index:", "Feature_Type:", "Goals:", "Move_Count:"))
small.f.3 <- small.f.2[,-1]
This would give me a single column with all the json data in each line.
I tried to write new .txt file .
write.table(small.f.3, file="small clean.txt", row.names = FALSE)
json_data <- fromJSON(file="small.clean")
The problem was it only converted 'x' (first row) into a character and ignored everything else. I imagined it was the problem with "x" so I took that out from the .txt file and ran it again.
json_data <- fromJSON(file="small clean copy.txt")
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=""))
Both time worked and I manage to create a list. But it only takes the data from the first row and ignore the rest. This leads to my second question.
I tried this..
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=","))
Error in fromJSON(paste(readLines("small clean copy.txt"), collapse = ",")) :
unexpected character ','
2.) How can I extract the rest of the rows in the .txt file?
3.) Is it possible for R to read the Json data from one row, and extract only the nested data that I need, and subsequently go on to the next row, like a loop? For example, in each array, I am only interested in the Action vectors and the State Feature vectors, but I am not interested in the rest of the data. If I can somehow extract only the information I need before moving on to the next array, than I can save a lot of memory space.
I validated the array online. But the .txt file is not json formatted. Only within each array. I hope this make sense. Each row is a nested array.
The data looks something like this. I have about 65 rows (nested arrays) in total.
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],"LightningIndices":[],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.599999964237213,12.0,9.0,3.0,1.0,0.0,11.0,2.0,1.0,0.0,0.0,0.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12213890532609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13055793241076,0.0,0.0,0.0,0.0,0.0,0.231325346416068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.949158357257511,0.0,0.0,0.0,0.0,0.0,0.369666537828737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0851765937900996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223409208023677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.698640447815897,1.69496718435102,0.0,0.0,0.0,0.0,1.42312654023416,0.0,0.38394999584831,0.0,0.0,0.0,0.0,1.0,1.22164326251584,1.30980246401454,1.00411570750454,0.0,0.0,0.0,1.44306759429513,0.0,0.00568191150434618,0.0,0.0,0.0,0.0,0.0,0.0,0.157705869690127,0.0,0.0,0.0,0.0,0.102089274086033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37039305683305,2.64354332879095,0.0,0.456876463171171,0.0,0.0,0.208651305680117,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.46713142511126,2.26785558685153,0.284845692694476,0.29200364444299,0.0,0.562185300773834,1.79134869431988,0.423426746571872,0.0,0.0,0.0,0.0,5.06772310533214,0.0,1.95593334724537,2.08448537685298,1.22045520912269,0.251119892385839,0.0,4.86192274732091,0.0,0.186941346075472,0.0,0.0,0.0,0.0,4.37998688020614,0.0,3.04406665275463,1.0,0.49469909818283,0.0,0.0,1.57589195190525,0.0,0.0,0.0,0.0,0.0,0.0,3.55229001446173]}},......
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,24],"LightningIndices":[[15,16,17,18,19,20,21,22,23]],"SelectedAction":15,"State":{"Features":{"Data":[20.0,53.0,0.0,11.0,10.0,2.0,1.0,0.0,12.0,2.0,1.0,0.0,0.0,1.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110686363475575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13427913742728,0.0,0.0,0.0,0.0,0.0,0.218834141070836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.939443046803111,0.0,0.0,0.0,0.0,0.0,0.357568892126985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889329732996782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22521492930721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700441220022084,1.6762090551226,0.0,0.0,0.0,0.0,1.44526456614638,0.0,0.383689214317325,0.0,0.0,0.0,0.0,1.0,1.22583659574753,1.31795156033445,0.99710368703165,0.0,0.0,0.0,1.44325394830013,0.0,0.00418600599483917,0.0,0.0,0.0,0.0,0.0,0.0,0.157518319482216,0.0,0.0,0.0,0.0,0.110244186273209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369899973785845,2.55505143302811,0.0,0.463342609296841,0.0,0.0,0.226088384842823,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.47842109127488,2.38476342332125,0.0698115810371108,0.276804206873942,0.0,1.53514282355593,1.77391161515718,0.421465101754304,0.0,0.0,0.0,0.0,4.45530484778828,0.0,1.43798302409155,3.46965807176681,0.468528940277049,0.259853183829217,0.0,4.86988325473155,0.0,0.190659677933533,0.0,0.0,0.963116148760181,0.0,4.29930830894124,0.0,2.56201697590845,0.593423384852181,0.46165947868584,0.0,0.0,1.59497392171253,0.0,0.0,0.0,0.0,0.0368838512398189,0.0,4.24538684327048]}},......
I would really appreciate any advice here.