Read and parse a >400 MB .json file in Julia without crashing the kernel

The following is crashing my Julia kernel. Is there a better way to read and parse a large (>400 MB) JSON file?
using JSON
data = JSON.parsefile("file.json")

Unless some effort is invested in making a smarter JSON parser, the following might work: there is a good chance file.json has many lines. In that case, reading the file and parsing a large, repetitive JSON section line-by-line or chunk-by-chunk (for the right chunk length) could do the trick. A possible way to code this would be:
using JSON

f = open("file.json", "r")
discard_lines = 12      # lines up to the repetitive part
important_chunks = 1000 # number of data items
chunk_length = 2        # each data item spans a 2-line JSON chunk
thedata = []            # parsed items are collected here

for i = 1:discard_lines
    l = readline(f)
end
for i = 1:important_chunks
    chunk = join([readline(f) for j = 1:chunk_length])
    push!(thedata, JSON.parse(chunk))
end
close(f)
# use thedata
There is a good chance this could serve as a stopgap solution for your problem; inspect file.json to see whether its structure allows it.

Related

How to split a large CSV file with no code?

I have a CSV file which has nearly 22M records. I want to split this into multiple CSV files so that I can use it further.
I tried to open it using Excel (I tried the Transform Data option as well), Notepad++, and Notepad, but all of them give me an error.
When I explored the options, I found that the file can be split using code, e.g. Java or Python. I am not very familiar with coding and want to know if there is any option to split the file without any coding. Also, since the file has client-sensitive data, I don't want to download/use any external tools.
Any help would be much appreciated.
I know you're concerned about security of sensitive data, and that makes you want to avoid external tools (even a nominally trusted tool like Google Big Query... unless your data is medical in nature).
I know you don't want a custom solution with Python, but I don't understand why that is: this is a big problem, and CSVs can be tricky to handle.
Maybe your CSV is a "simple" one where there are no embedded line breaks and the quoting is minimal. But if it isn't, you're going to want a tool that is meant for CSV.
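As a quick illustration (with made-up data) of why that matters: a quoted field can legally contain a newline, which naive line-by-line splitting will break on, while the csv module handles it correctly. A minimal sketch:
import csv
import io

# Made-up data: the comment field of record 1 spans two physical lines.
raw = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Splitting on newlines sees four "rows"...
print(len(raw.strip().split('\n')))   # 4

# ...but the csv module correctly sees a header plus two records.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))                      # 3
print(rows[1])                        # ['1', 'line one\nline two']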
And because the file is so big, I don't see how you can do it without code. Even if you could load it into a trusted tool, how would you process the 22M records?
I look forward to seeing what else the community has to offer you.
The best I can think of based on my experience is exactly what you said you don't want.
It's a small-ish Python script that uses its CSV library to correctly read in your large file and write out several smaller files. If you don't trust this, or me, maybe find someone you do trust who can read this and assure you it won't compromise your sensitive data.
#!/usr/bin/env python3
import csv

# Maximum number of data rows per output file
MAX_ROWS = 22_000

# The name of your input
INPUT_CSV = 'big.csv'

# The "base name" of all new sub-CSVs; a counter will be added after the '-':
# e.g., new-1.csv, new-2.csv, etc...
NEW_BASE = 'new-'

# This function will be called over-and-over to make a new CSV file
def make_new_csv(i, header=None):
    # New name
    new_name = f'{NEW_BASE}{i}.csv'
    # Create a new file from that name
    new_f = open(new_name, 'w', newline='')
    # Creates a "writer", a dedicated object for writing "rows" of CSV data
    writer = csv.writer(new_f)
    if header:
        writer.writerow(header)
    return new_f, writer

# Open your input CSV
with open(INPUT_CSV, newline='') as in_f:
    # Like the "writer", dedicated to reading CSV data
    reader = csv.reader(in_f)
    your_header = next(reader)  # grab the header row so it can be repeated in every new file

    # Give your new files unique, sequential names: e.g., new-1.csv, new-2.csv, etc...
    new_i = 1

    # Make the first new file and writer
    new_f, writer = make_new_csv(new_i, your_header)

    # Loop over all input rows, and count how many
    # records have been written for each "new file"
    new_rows = 0
    for row in reader:
        if new_rows == MAX_ROWS:
            new_f.close()  # This file is full, close it and...
            new_i += 1
            new_f, writer = make_new_csv(new_i, your_header)  # get a new file and writer
            new_rows = 0   # Reset the row counter
        writer.writerow(row)
        new_rows += 1

# All done reading input rows, close the last file
new_f.close()
There's also a fantastic tool I use daily for processing large CSVs (also with sensitive client contact and personally identifying information): GoCSV.
Its split command is exactly what you need:
Split a CSV into multiple files.
Usage:
gocsv split --max-rows N [--filename-base FILENAME] FILE
I'd recommend downloading it for your platform, unzipping it, putting a sample file with non-sensitive information in that folder and trying it out:
gocsv split --max-rows 1000 --filename-base New sample.csv
would end up creating a number of smaller CSVs, New-1.csv, New-2.csv, etc..., each with a header and no more than 1000 rows.

Fast processing json using Python3.x from s3

I have json files on s3 like:
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
The structure is not an array, just concatenated JSON objects without any newlines. There are thousands of files, and I need only a couple of fields from them. How can I process them fast?
I will use this on AWS Lambda.
The code I am thinking of is somewhat like this:
data_chunk = data_file.read()
recs = data_chunk.split('}')
json_recs = []
# From this point on it becomes inefficient, as I have to iterate over every record
for rec in recs:
    json_recs.append(json.loads(rec + '}'))
# Extract Individual fields
How can this be improved? Will using a Pandas dataframe help? Individual files are small, about 128 MB each.
S3 Select supports this JSON Lines structure. You can query it with a SQL-like language. It's fast and cheap.
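A minimal boto3 sketch of that approach, assuming newline-delimited records as described above; the bucket name, key, and field names below are placeholders:
import boto3

s3 = boto3.client('s3')

# Pull only the two fields we care about from one object.
response = s3.select_object_content(
    Bucket='my-bucket',                 # placeholder
    Key='path/to/file.json',            # placeholder
    ExpressionType='SQL',
    Expression="SELECT s.key1, s.key2 FROM S3Object s",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)

# The result streams back as events; collect the record payloads.
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
Because the filtering happens on the S3 side, only the selected fields come back to the Lambda function.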

Selectively Import only Json data in txt file into R.

I have 3 questions I would like to ask, as I am relatively new to both R and the JSON format. I have read quite a bit but still don't quite understand.
1.) Can R parse JSON data when the txt file contains other irrelevant information as well?
Assuming I can't, I uploaded the text file into R and did some cleaning up. So that it will be easier to read the file.
require(plyr)
require(rjson)
small.f.2 <- subset(small.f.1, ! V1 %in% c("Level_Index:", "Feature_Type:", "Goals:", "Move_Count:"))
small.f.3 <- small.f.2[,-1]
This gives me a single column with the JSON data from each line.
I then tried to write a new .txt file:
write.table(small.f.3, file="small clean.txt", row.names = FALSE)
json_data <- fromJSON(file="small.clean")
The problem was that it only converted 'x' (the first row) into a character and ignored everything else. I imagined the problem was with "x", so I took that out of the .txt file and ran it again.
json_data <- fromJSON(file="small clean copy.txt")
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=""))
Both times it worked and I managed to create a list. But it only takes the data from the first row and ignores the rest. This leads to my second question.
I tried this..
small <- fromJSON(paste(readLines("small clean copy.txt"), collapse=","))
Error in fromJSON(paste(readLines("small clean copy.txt"), collapse = ",")) :
unexpected character ','
2.) How can I extract the rest of the rows in the .txt file?
3.) Is it possible for R to read the JSON data from one row, extract only the nested data that I need, and then go on to the next row, like a loop? For example, in each array I am only interested in the Action vectors and the State Feature vectors, and not in the rest of the data. If I can extract only the information I need before moving on to the next array, then I can save a lot of memory.
I validated the array online, but the .txt file as a whole is not JSON-formatted; only each individual array is. I hope this makes sense. Each row is a nested array.
The data looks something like this. I have about 65 rows (nested arrays) in total.
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],"LightningIndices":[],"SelectedAction":12,"State":{"Features":{"Data":[21.0,58.0,0.599999964237213,12.0,9.0,3.0,1.0,0.0,11.0,2.0,1.0,0.0,0.0,0.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12213890532609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13055793241076,0.0,0.0,0.0,0.0,0.0,0.231325346416068,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.949158357257511,0.0,0.0,0.0,0.0,0.0,0.369666537828737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0851765937900996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223409208023677,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.698640447815897,1.69496718435102,0.0,0.0,0.0,0.0,1.42312654023416,0.0,0.38394999584831,0.0,0.0,0.0,0.0,1.0,1.22164326251584,1.30980246401454,1.00411570750454,0.0,0.0,0.0,1.44306759429513,0.0,0.00568191150434618,0.0,0.0,0.0,0.0,0.0,0.0,0.157705869690127,0.0,0.0,0.0,0.0,0.102089274086033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37039305683305,2.64354332879095,0.0,0.456876463171171,0.0,0.0,0.208651305680117,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.46713142511126,2.26785558685153,0.284845692694476,0.29200364444299,0.0,0.562185300773834,1.79134869431988,0.423426746571872,0.0,0.0,0.0,0.0,5.06772310533214,0.0,1.95593334724537,2.08448537685298,1.22045520912269,0.251119892385839,0.0,4.86192274732091,0.0,0.186941346075472,0.0,0.0,0.0,0.0,4.37998688020614,0.0,3.04406665275463,1.0,0.49469909818283,0.0,0.0,1.57589195190525,0.0,0.0,0.0,0.0,0.0,0.0,3.55229001446173]}},......
{"NonlightningIndices":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,24],"LightningIndices":[[15,16,17,18,19,20,21,22,23]],"SelectedAction":15,"State":{"Features":{"Data":[20.0,53.0,0.0,11.0,10.0,2.0,1.0,0.0,12.0,2.0,1.0,0.0,0.0,1.0,0.0]}},"Actions":[{"Features":{"Data":[4.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110686363475575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.13427913742728,0.0,0.0,0.0,0.0,0.0,0.218834141070836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.939443046803111,0.0,0.0,0.0,0.0,0.0,0.357568892126985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0889329732996782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22521492930721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.700441220022084,1.6762090551226,0.0,0.0,0.0,0.0,1.44526456614638,0.0,0.383689214317325,0.0,0.0,0.0,0.0,1.0,1.22583659574753,1.31795156033445,0.99710368703165,0.0,0.0,0.0,1.44325394830013,0.0,0.00418600599483917,0.0,0.0,0.0,0.0,0.0,0.0,0.157518319482216,0.0,0.0,0.0,0.0,0.110244186273209,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369899973785845,2.55505143302811,0.0,0.463342609296841,0.0,0.0,0.226088384842823,0.0,0.0,0.0,0.0,0.0,2.0,0.0,3.47842109127488,2.38476342332125,0.0698115810371108,0.276804206873942,0.0,1.53514282355593,1.77391161515718,0.421465101754304,0.0,0.0,0.0,0.0,4.45530484778828,0.0,1.43798302409155,3.46965807176681,0.468528940277049,0.259853183829217,0.0,4.86988325473155,0.0,0.190659677933533,0.0,0.0,0.963116148760181,0.0,4.29930830894124,0.0,2.56201697590845,0.593423384852181,0.46165947868584,0.0,0.0,1.59497392171253,0.0,0.0,0.0,0.0,0.0368838512398189,0.0,4.24538684327048]}},......
I would really appreciate any advice here.

Convert io.BytesIO to io.StringIO to parse HTML page

I'm trying to parse an HTML page I retrieved through pycurl, but the pycurl WRITEFUNCTION is returning the page as bytes and not a string, so I'm unable to parse it using BeautifulSoup.
Is there any way to convert io.BytesIO to io.StringIO?
Or Is there any other way to parse the HTML page?
I'm using Python 3.3.2.
The code in the accepted answer actually reads the whole stream in order to decode it. Below is the right way, converting one stream to another, so that the data can be read chunk by chunk.
import io

# Initialize a read buffer
input = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
wrapper = io.TextIOWrapper(input, encoding='utf-8')

# Read from the buffer
print(wrapper.read())
A naive approach:
# assume bytes_io is a `BytesIO` object
byte_str = bytes_io.read()
# Convert to a "unicode" object
text_obj = byte_str.decode('UTF-8') # Or use the encoding you expect
# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need
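Putting it together with the original pycurl context, a minimal end-to-end sketch; the URL is a placeholder and UTF-8 is assumed (adjust to the page's actual charset):
import io
import pycurl
from bs4 import BeautifulSoup

buffer = io.BytesIO()

c = pycurl.Curl()
c.setopt(c.URL, 'http://example.com/')  # placeholder URL
c.setopt(c.WRITEDATA, buffer)           # pycurl writes the raw bytes here
c.perform()
c.close()

# Rewind and wrap the byte stream as a text stream.
buffer.seek(0)
wrapper = io.TextIOWrapper(buffer, encoding='utf-8')

# BeautifulSoup accepts a file-like object.
soup = BeautifulSoup(wrapper, 'html.parser')
print(soup.title)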

pythonic way to access data in file structures

I want to access every value (~10000) in .txt files (~1000) stored in directories (~20) in the most efficient manner possible. When the data is grabbed I would like to place it in an HTML string, in order to display an HTML page with a table for each file. Pseudocode:
fh = open('MyHtmlFile.html', 'w')
fh.write('''<head>Lots of tables</head><body>''')
for eachDirectory in rootFolder:
    for eachFile in eachDirectory:
        concat = ''
        for eachData in eachFile:
            concat = concat + '<tr><td>' + eachData + '</td></tr>'
        table = '''
        <table>%s</table>
        ''' % (concat)
        fh.write(table)
fh.write('''</body>''')
fh.close()
There must be a better way (I imagine this would take forever)! I've checked out set() and read a bit about hash tables, but I'd rather ask the experts before the hole is dug.
Thank you for your time!
/Karl
import os, os.path

# If you're on Python 2.5 or newer, use 'with'
# (needs 'from __future__ import with_statement' on 2.5)
fh = open('MyHtmlFile.html', 'w')
fh.write('<html>\r\n<head><title>Lots of tables</title></head>\r\n<body>\r\n')

# this will recursively descend the tree
for dirpath, dirnames, filenames in os.walk(rootFolder):
    for filename in filenames:
        # again, use 'with' on Python 2.5 or newer
        infile = open(os.path.join(dirpath, filename))
        # this formats each line as a table row, joins the rows,
        # then wraps them in a table
        # If you're on Python 2.6 or newer you could use 'str.format' instead
        fh.write('<table>\r\n%s\r\n</table>' %
                 '\r\n'.join('<tr><td>%s</td></tr>' % line for line in infile))
        infile.close()

fh.write('\r\n</body></html>')
fh.close()
Why do you "imagine it would take forever"? You are reading the file and then printing it out - that's pretty much the only thing you have as a requirement - and that's all you're doing.
You could tweak the script in a couple of ways (read blocks instead of lines, adjust buffers, print out instead of concatenating, etc.), but if you don't know how much time it takes now, how do you know what is better/worse?
Profile first, then determine whether the script is too slow, then find where it is slow, and only then optimise (or ask about optimisation).
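For the "profile first" step, a minimal sketch with the standard-library profiler (build_report is a hypothetical stand-in for whatever function builds your HTML):
import cProfile
import pstats

# Profile the report-generating function and save the stats to a file.
cProfile.run('build_report()', 'report.prof')

# Print the 10 most expensive calls by cumulative time.
stats = pstats.Stats('report.prof')
stats.sort_stats('cumulative').print_stats(10)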