Efficiently unpacking many Json files and combining into multiple dataframes - json

I have a massive number (I believe in the tens of thousands) of json files structured as follows:
{"data":
    {
        "1": {"A": 123, "B": 456, "C": 789},
        "2": {"A": 3423, "B": 356, "C": 549},
        ...,
        "4000": {"A": 765, "B": 355, "C": 321}
    },
 "timestamp": timestamp1}
My end goal is three pandas dataframes structured such that each parameter denoted as a letter above gets its own dataframe like this:
dfA =
                                 1       2    ...   4000
timestamp1                     123    3432    ...    765
[timestamp from next json]     [from next json]  ...

dfB =
                                 1       2    ...   4000
timestamp1                     456     356    ...    355
[timestamp from next json]     [from next json]  ...

dfC =
                                 1       2    ...   4000
timestamp1                     789     549    ...    321
[timestamp from next json]     [from next json]  ...
I have code which I believe works and achieves this; however, I'm not very good with pandas and I know it is extremely slow compared to what it could be.
For each json my code looks something like this:
with open(path) as json_file:
    data = json.load(json_file)

df = pd.DataFrame.from_dict(data['data']).T

A_series = df['A']
B_series = df['B']
C_series = df['C']

A_series.name = json_timestamp  # timestamp is known before reading each json, but is also present inside each json
B_series.name = json_timestamp  # adding these lines makes json_timestamp the index in the df
C_series.name = json_timestamp

Adf = Adf.append(A_series.to_dict(), ignore_index=True)
Bdf = Bdf.append(B_series.to_dict(), ignore_index=True)
Cdf = Cdf.append(C_series.to_dict(), ignore_index=True)
This code looks like it's going to take about a week to run. Are there changes I can make to make it more efficient or elegant?

The main bottleneck of your code is the append. pandas returns a full copy of the input dataframe with the new line added, which means the complexity is O(n * n), where n is the number of lines in the final dataframe -- i.e. the execution time is quadratic in the number of json files added. Copying the dataframe for each newly added row is very expensive, since the dataframe takes at least several hundred MiB in memory.
You can solve this by appending the rows to a plain Python list and building each dataframe once at the end with pd.concat. This performs the operation in linear time. For your input I would expect the computation to be roughly three orders of magnitude faster.
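For example, here is a minimal sketch of that approach. The names paths and timestamps below are placeholders for however you currently iterate over your json files and their known timestamps:

import json
import pandas as pd

A_rows, B_rows, C_rows = [], [], []
for path, json_timestamp in zip(paths, timestamps):   # placeholders for your file loop
    with open(path) as json_file:
        data = json.load(json_file)
    df = pd.DataFrame.from_dict(data['data']).T
    # name each Series after the timestamp so it becomes the row label later
    A_rows.append(df['A'].rename(json_timestamp))
    B_rows.append(df['B'].rename(json_timestamp))
    C_rows.append(df['C'].rename(json_timestamp))

# build each dataframe once, after the loop: concat along axis=1 stacks the
# Series as columns, and .T turns them back into one row per timestamp
Adf = pd.concat(A_rows, axis=1).T
Bdf = pd.concat(B_rows, axis=1).T
Cdf = pd.concat(C_rows, axis=1).T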

Related

Generator usage for log analysing

I have working Python code to analyze logs. The logs are at least 10 MB in size and can sometimes reach 250-300 MB, depending on failures and retries.
I use a generator that yields the big file in chunks; the chunk size is configurable and I normally yield 1 or 2 MB of log at a time. So I analyze the log in 1 MB chunks to verify some tests.
My problem is that using a generator brings up some edge cases. When analyzing the log I check for the consecutive appearance of some patterns, as follows: only if all four of the lists below are seen do I keep them for the next verification part of the code. Each of the four patterns can appear in the log once or twice, not more.
listA
listB
listC
listD
If all of these occur consecutively then I keep them all to evaluate in the next step; otherwise I ignore them.
However, there is now a small change: some patterns (lists, since I use the regex findall method to find them) can land in the next chunk and only complete the check there. So in the following example there are three matching cases: chunks 3-4, 5-6 and 7-8 each create the condition I need to take into account.
---- chunk 1 -----
listA
listB
----- chunk 2 -----
nothing
----- chunk 3 -----
listA
listB
----- chunk 4 -----
listC
listD
----- chunk 5 -----
listA
----- chunk 6 -----
listB
listC
listD
---- chunk 7 ------
listA
listB
listC
----- chunk 8 ------
listD
---------------------
Usually it does not happen like this; the patterns B, C and D mostly appear consecutively in the log, while listA can appear 20, or at most 30, rows earlier than the rest. But any scenario like the above can happen.
Please advise a good approach. I'm not sure what to use; I know the next() function can be used to check the next chunk. In that case,
should I use any([listA, listB, listC, listD]), and if any of the patterns occurs, check the remaining ones one by one in the next chunk, like the following?
if any([listA, listB, listC, listD]):
Then check here which of the patterns were not seen, keep them in a notSeen list, and check them one by one in the next chunk?
next_chunk = next(gen_func(chunksize))
isListA = re.findall(pattern, next_chunk)
Or maybe I'm completely missing an easier approach for this little project; please let me know your thoughts, as you might have experienced such a situation before. (A minimal sketch of this carry-over idea appears after the code below.)
I have used next_chunk = next(gen_func(chunksize))
and added the necessary if statements underneath to check only one following log piece, because I arrange the log chunks with the generator accordingly.
I am sharing only part of the code, as the rest is confidential:
import re, os

def __init__(self, logfile):
    self.read = self.ReadLog(logfile)
    self.search = self.SearchData(logfile)
    self.file = os.path.abspath(logfile)
    self.data = self.read.read_in_chunks
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []

def test123(self, r_reason: str, cyc: int, b_r):
    ''' Usage : verify_log.py
        --log ${LOGS_FOLDER}/log --reason r_low 1 <True | False>'''
    ret = False
    r_count = 2*int(cyc) if b_r.lower() == "true" else int(cyc)
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []
    result_pattern = self.search.r_scan_pattern()

    def check_patterns(chunk):
        search_cached = re.findall(self.search.r_search_cached, chunk)
        search_full = re.findall(self.search.r_search_full, chunk)
        scan_complete = re.findall(self.search.r_scan_complete, chunk)
        scan_result = re.findall(result_pattern, chunk)
        r_complete = re.findall(self.search.r_auth_complete, chunk)
        return search_cached, search_full, scan_complete, scan_result, r_complete

    with open(self.file) as rf:
        for idx, piece in enumerate(self.data(rf), start=1):
            is_failed = re.findall(self.search.r_failure, piece)
            if is_failed:
                print(f'general failure received : {is_failed}')
                r_failed.extend(is_failed)
            is_r_search_cached, is_r_search_full, is_scan_complete, is_scan, is_r_complete = check_patterns(piece)
            if (is_r_search_cached or is_r_search_full) and all([is_scan_complete, is_scan, is_r_complete]):
                if is_r_search_cached:
                    r_search_combined.extend(is_r_search_cached)
                if is_r_search_full:
                    r_search_combined.extend(is_r_search_full)
                scan_result.extend(is_scan)
                r_complete.extend(is_r_complete)
            elif (is_r_search_cached or is_r_search_full) and not any([is_scan, is_r_complete]):
                next_piece = next(self.data(rf))
                _, _, _, is_scan_next, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan_next, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan_next)
                    r_complete.extend(is_r_complete_next)
            elif (is_r_search_cached or is_r_search_full) and is_scan and not is_r_complete:
                next_piece = next(self.data(rf))
                _, _, _, _, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan)
                    r_complete.extend(is_r_complete_next)
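For reference, here is a minimal sketch of the notSeen/carry-over idea described above. The pattern list and the chunk iterable are hypothetical placeholders, since the real regexes and generator are confidential:

import re

# Hypothetical pattern list; in the real code these are the regexes behind listA..listD.
PATTERNS = ["listA", "listB", "listC", "listD"]

def find_groups(chunks):
    """Keep a group only when all four patterns appear in one chunk or spill
    over into the immediately following chunk."""
    results = []
    carried = {}                          # partial hits from the previous chunk
    for chunk in chunks:
        hits = {}
        for pattern in PATTERNS:
            found = re.findall(pattern, chunk)
            if found:
                hits[pattern] = found
        combined = {**carried, **hits}
        if all(p in combined for p in PATTERNS):
            results.append(combined)      # complete group: keep it for verification
            carried = {}
        elif hits:
            carried = hits                # incomplete: try to finish in the next chunk
        else:
            carried = {}                  # empty chunk: drop the partial group
    return results

On the chunk layout shown above this reports three groups (chunks 3-4, 5-6 and 7-8) and drops the partial hits from chunks 1-2.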

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I'm trying to get the sentiment of comments with the help of a Hugging Face pretrained sentiment-analysis model. It returns the error Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).
Below I'm attaching the code; please take a look.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all the rows to an empty list
text = []
for index, row in data.iterrows():
text.append(row['Review'])
I'm trying to get the sentiment for all the rows
sent = []
for i in range(len(data)):
sentiment = classifier(data.iloc[i,0])
sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
Some of the sentences in the Review column of your data frame are too long. When these sentences are converted to tokens and sent into the model they exceed the model's 512-token sequence-length limit; the embedding of the model used in the sentiment-analysis task was trained on sequences of at most 512 tokens.
To fix this issue you can filter out the long sentences and keep only the shorter ones (with token length < 512),
or you can truncate the sentences with truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
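For instance, a minimal sketch of the filtering alternative, reusing token, classifier and data from the question (the 'Review' column name comes from the data.head() output above):

# measure each review's tokenized length (token.encode includes the special
# [CLS]/[SEP] tokens, so comparing against 512 matches the model limit)
lengths = data['Review'].apply(lambda text: len(token.encode(text)))

# keep only the reviews that fit, then classify them
short_reviews = data.loc[lengths <= 512, 'Review']
sent = [classifier(text)[0] for text in short_reviews]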
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.

Importing/Conditioning a file.txt with a "kind" of json structure in R

I wanted to import a .txt file into R, but the format is really special: it looks like JSON, but I don't know how to import it. Here is an example of my data:
{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}
To deal with this I used this code:
test1 <- read.csv("C:/Users/test1.txt", header=FALSE)
## Imported as 5 observations (the 5th is all empty) of 1700 variables
## (in fact 40 observations of 11 variables). When I imported the .txt file
## it had one empty line (the 5th obs) and 4 lines of data, with the
## 4 lines of 11 variables each placed next to each other.
# Get the different lines
part1=test1[1:10]
part2=test1[11:20]
part3=test1[21:30]
part4=test1[31:40]
...
## Remove the empty line (there was an empty line after each)
part1=part1[-5,]
part2=part2[-5,]
part3=part3[-5,]
...
## Rename the columns
names(part1)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part2)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
names(part3)=c("Date Time","Subject","Sscore","Smean","Svscore","Sdispersion","Svolume","Sbuzz","Last close","Company name")
...
## Assemble data to have one dataset
data=rbind(part1,part2,part3,part4,part5,part6,part7,part8,part9,part10)
## Format Date Time
times <- as.POSIXct(data$`Date Time`, format='{datetime:%Y-%m-%d %H:%M:%S')
data$`Date Time` <- times
## Keep only the Date
data$Date <- as.Date(times)
## Format data - Remove text
data$Subject <- gsub("subject:", "", data$Subject)
data$Sscore <- gsub("sscore:", "", data$Sscore)
...
So my code works to reconstruct the data, but it is probably overly complicated and long. I know there are better ways to do it, so if you could help me with that I would be very grateful.
There are many packages that read JSON, e.g. rjson, jsonlite, RJSONIO (they will turn up in a Google search) - just pick one and give it a go.
e.g.
library(jsonlite)
json.text <- '{"datetime":"2015-07-08 09:10:00","subject":"MMM","sscore":"-0.2280","smean":"0.2593","svscore":"-0.2795","sdispersion":"0.375","svolume":"8","sbuzz":"0.6026","lastclose":"155.430000000","companyname":"3M Company"},{"datetime":"2015-07-07 09:10:00","subject":"MMM","sscore":"0.2977","smean":"0.2713","svscore":"-0.7436","sdispersion":"0.400","svolume":"5","sbuzz":"0.4895","lastclose":"155.080000000","companyname":"3M Company"},{"datetime":"2015-07-06 09:10:00","subject":"MMM","sscore":"-1.0057","smean":"0.2579","svscore":"-1.3796","sdispersion":"1.000","svolume":"1","sbuzz":"0.4531","lastclose":"155.380000000","companyname":"3M Company"}'
x <- fromJSON(paste0('[', json.text, ']'))
datetime subject sscore smean svscore sdispersion svolume sbuzz lastclose companyname
1 2015-07-08 09:10:00 MMM -0.2280 0.2593 -0.2795 0.375 8 0.6026 155.430000000 3M Company
2 2015-07-07 09:10:00 MMM 0.2977 0.2713 -0.7436 0.400 5 0.4895 155.080000000 3M Company
3 2015-07-06 09:10:00 MMM -1.0057 0.2579 -1.3796 1.000 1 0.4531 155.380000000 3M Company
I pasted '[' and ']' around your JSON because you have multiple JSON elements (the rows in the dataframe above), and for this to be well-formed JSON it needs to be an array, i.e. [ {...}, {...}, {...} ] rather than {...}, {...}, {...}.

Scientific data

I want to import data from a corrupted CSV file. It contains scientific numbers, and it's a big data set with about 300000 rows and 27 columns. When I import it using
Import["data.csv","HeaderLines"->1]
the data comes in as strings, so I convert it to a table format with
StringSplit[ToString[data[[#]]], ";"] & /@
  Range[Dimensions[Import["data.csv"]][[1]]]
and I need to use the first column to analyse the data. But the problem is that this row is
scientific numbers in string type!! I want to change it to numbers. I used this command:
ToExpression[Internal`StringToDouble[fdata[[All, 1]][[#]]]] & /@
  Range[291407];
But it takes hours to do so! Do you have any idea how I can do this without wasting so much time?
You could try the following:
(* read the first 5 rows *)
d = ReadList["data.csv", Table[Number, {27}], 5]
(* read the rows 100 to 150 *)
s = OpenRead["data.csv"];
Skip[s, Record, 99]
d = ReadList[s, Table[Number, {27}], 51]
Close[s]
And d[[All,1]] will get you the first column.

Is it possible to write a table to a file in JSON format in R?

I'm making word frequency tables with R and the preferred output format would be a JSON file, something like:
{
"word" : "dog",
"frequency" : 12
}
Is there any way to save the table directly in this format? I've been using the write.csv() function and converting the output to JSON, but this is very complicated and time-consuming.
set.seed(1)
( tbl <- table(round(runif(100, 1, 5))) )
## 1 2 3 4 5
## 9 24 30 23 14
library(rjson)
sink("json.txt")
cat(toJSON(tbl))
sink()
file.show("json.txt")
## {"1":9,"2":24,"3":30,"4":23,"5":14}
or even better:
set.seed(1)
( tab <- table(letters[round(runif(100, 1, 26))]) )
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 4 3 2 5 4 3 5 3 9 4 7 2 2 2 5 5 5 6 5 3 7 3 2 1
sink("lets.txt")
cat(toJSON(tab))
sink()
file.show("lets.txt")
## {"a":1,"b":2,"c":4,"d":3,"e":2,"f":5,"g":4,"h":3,"i":5,"j":3,"k":9,"l":4,"m":7,"n":2,"o":2,"p":2,"q":5,"r":5,"s":5,"t":6,"u":5,"v":3,"w":7,"x":3,"y":2,"z":1}
Then validate it with http://www.jsonlint.com/ to get pretty formatting. If you have a multidimensional table, you'll have to work it out a bit...
EDIT:
Oh, now I see: you want the dataset characteristics sink-ed to a JSON file. No problem, just give us some sample data and I'll work on the code a bit. Practically, you need to get the data into the desired format and then convert it to JSON; a list should suffice. Give me a sec, I'll update my answer.
EDIT #2:
Well, time is relative... it's common knowledge... Here you go:
( dtf <- structure(list(word = structure(1:3, .Label = c("cat", "dog",
"mouse"), class = "factor"), frequency = c(12, 32, 18)), .Names = c("word",
"frequency"), row.names = c(NA, -3L), class = "data.frame") )
## word frequency
## 1 cat 12
## 2 dog 32
## 3 mouse 18
If dtf is a simple data frame (yes, a data.frame), fine; if it's not, coerce it! Long story short, you can do:
toJSON(as.data.frame(t(dtf)))
## [1] "{\"V1\":{\"word\":\"cat\",\"frequency\":\"12\"},\"V2\":{\"word\":\"dog\",\"frequency\":\"32\"},\"V3\":{\"word\":\"mouse\",\"frequency\":\"18\"}}"
I thought I'd need some melt for this one, but a simple t did the trick. Now you only need to deal with the column names after transposing the data.frame. t coerces data.frames to a matrix, so you need to convert it back to a data.frame. I used as.data.frame, but you can also use toJSON(data.frame(t(dtf))) - you'll get X instead of V as the variable name. Alternatively, you can use a regexp to clean the JSON file (if needed), but that's a lousy practice; try to work it out by preparing the data.frame.
I hope this helped a bit...
These days I would typically use the jsonlite package.
library("jsonlite")
toJSON(mydatatable, pretty = TRUE)
This turns the data table into a JSON array of key/value pair objects directly.
RJSONIO is a package "that allows conversion to and from data in Javascript object notation (JSON) format". You can use it to export your object as a JSON file.
library(RJSONIO)
writeLines(toJSON(anobject), "afile.JSON")