I have json files on s3 like:
{"key1": "value1", "key2": "value2"}{"key1": "value1", "key2": "value2"}{"key1": "value1", "key2": "value2"}
{"key1": "value1", "key2": "value2"}{"key1": "value1", "key2": "value2"}{"key1": "value1", "key2": "value2"}
The structure is not an array; the objects are simply concatenated, without any newlines between them. There are thousands of files, and I need only a couple of fields from each record. How can I process them fast?
I will use this on AWS Lambda.
The code I am thinking of is somewhat like this:
import json

data_chunk = data_file.read()
recs = data_chunk.split('}')
json_recs = []
# From this point onwards it becomes inefficient: every record has to be parsed one by one
for rec in recs:
    if rec:  # split() leaves an empty trailing element
        json_recs.append(json.loads(rec + '}'))
# Extract individual fields
How can this be improved? Would using a Pandas DataFrame help? Individual files are small, about 128 MB.
S3 Select supports this JSON Lines structure. You can query it with a SQL-like language. It's fast and cheap.
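A minimal boto3 sketch of that approach; the bucket, key, and field names are placeholders for illustration:

import boto3

s3 = boto3.client('s3')
resp = s3.select_object_content(
    Bucket='my-bucket',             # placeholder bucket
    Key='data/part-0001.json',      # placeholder key
    ExpressionType='SQL',
    Expression="SELECT s.key1, s.key2 FROM s3object s",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)
for event in resp['Payload']:       # the payload is an event stream
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))

If you would rather parse the concatenated objects in Python on Lambda, json.JSONDecoder.raw_decode avoids the fragile split on '}' (which breaks on nested objects). A sketch:

import json

def iter_concatenated_json(data_chunk):
    # yield each object from a string of concatenated JSON documents
    decoder = json.JSONDecoder()
    idx, n = 0, len(data_chunk)
    while idx < n:
        obj, end = decoder.raw_decode(data_chunk, idx)
        yield obj
        idx = end
        while idx < n and data_chunk[idx].isspace():
            idx += 1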
Related
I have a large number of JSON files that I have to read, flatten, and filter (select specific key-value pairs). Below is the code I am using:
import os
import json

text = []
for js in json_files:  # json_files is the list of JSON files in the directory
    with open(os.path.join(path_to_json, js)) as json_file:  # path_to_json is the JSON directory
        jtext = json.load(json_file)
    json_text = jtext['KEY_TO_FILTER']
    text_json = flatten_data(json_text)  # function I'm using to select specific keys in the JSON
    text.append(text_json)
I would really appreciate a hint on how to use multiprocessing. Any other suggestion on how to speed up the process is very welcome.
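One way to apply multiprocessing here, as a minimal sketch (it assumes the json_files, path_to_json, and flatten_data names from the code above, and that the worker processes inherit them, as they do with the fork start method on Linux):

import os
import json
from multiprocessing import Pool

def process_file(js):
    # read one file and return its flattened, filtered record
    with open(os.path.join(path_to_json, js)) as json_file:
        jtext = json.load(json_file)
    return flatten_data(jtext['KEY_TO_FILTER'])

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker per CPU core
        text = pool.map(process_file, json_files)

For many small files the bottleneck is often disk I/O rather than CPU, so a thread pool (multiprocessing.dummy.Pool) may work just as well with less overhead.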
I have a DataFrame loaded from a CSV file:
df = pd.read_csv(input_file, header=0)
and I want to process it and eventually save it into multiple JSON files (for example, a new file every X rows).
Any suggestions on how to achieve that?
This should work:
import numpy as np

for idx, group in df.groupby(np.arange(len(df)) // 10):
    group.to_json(f'{idx}_name.json', orient='index')  # orient: split, records, index, values, table, columns
Change the 10 to the number of rows you wish to write out for each iteration.
I have a scenario where I am loading and processing 4 TB of data, which is about 15000 .csv files in a folder. Since I have limited resources, I am planning to process them in two batches and then union them. I am trying to understand if I can load only 50% (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and their counts are uneven (some sources produce few, others many). If I process the files in uneven batches using wildcards or regex, I may not get optimized performance.
Is there a way to tell the spark.read.csv reader to pick the first n files, and then to load the remaining files separately?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.
It's easy if you use the Hadoop API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# list the files through the Hadoop FileSystem API
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# spark.read.csv accepts a list of paths, so the batches are just list slices
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
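After processing each batch, the union step mentioned in the question is then just one call, assuming both DataFrames end up with the same schema:

df = df1.union(df2)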
The following is crashing my Julia kernel. Is there a better way to read and parse a large (>400 MB) JSON file?
using JSON
data = JSON.parsefile("file.json")
Unless some effort is invested in a smarter JSON parser, the following might work: there is a good chance file.json has many lines. In that case, reading the file and parsing a big repetitive JSON section line-by-line or chunk-by-chunk (for the right chunk length) could do the trick. A possible way to code this would be:
using JSON

f = open("file.json", "r")
discard_lines = 12       # lines up to the repetitive part
important_chunks = 1000  # number of data items
chunk_length = 2         # each data item has a 2-line JSON chunk

thedata = Any[]          # collect the parsed chunks here
for i = 1:discard_lines
    readline(f)
end
for i = 1:important_chunks
    chunk = join([readline(f) for j = 1:chunk_length])
    push!(thedata, JSON.parse(chunk))
end
close(f)
# use thedata
There is a good chance this works as a stopgap solution for your problem. Inspect file.json to find out.
I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarge instances. I am looking for suggestions on how to create a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HDFS instead, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved CSV file that is only 12 GB, you can copy it onto all of your workers and the driver machine, and then manually split on ",". This may not parse arbitrary RFC 4180 CSV, but it parsed what I had.
Add at least 12 GB of extra disk space per worker when you requisition the cluster.
Use a machine type that has at least 12 GB of RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster idle and can afford the larger charges. Bigger machines mean less disk-file copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
Copy the file to each of your workers, into the same directory on each worker. This should be a physically attached drive, i.e. different physical drives on each machine.
Make sure you have that same file and directory on the driver machine as well.
import re

raw = sc.textFile("/data.csv")
print("Counted %d lines in /data.csv" % raw.count())
raw_fields = raw.first()

# this regular expression strips the surrounding quotes from quoted fields, i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)

def uncsv_line(line):
    return [pmatchre.match(s).group(1) for s in line.split(',')]

fields = uncsv_line(raw_fields)

def raw_to_dict(raw_line):
    return dict(zip(fields, uncsv_line(raw_line)))

parsedData = (raw
              .map(raw_to_dict)
              .cache())

print("Counted %d parsed lines" % parsedData.count())
parsedData will be an RDD of dicts, where the keys are the CSV field names from the first row and the values are the CSV values of the current row. If you don't have a header row in your CSV data, this may not be right for you, but it should be clear that you could override the code that reads the first line and set up the fields manually.
Note that this is not immediately useful for creating DataFrames or registering a Spark SQL table. But for anything else it is fine, and you can further extract and transform it into a better format if you need to dump it into Spark SQL.
I use this on a 7 GB file with no issues, except that I've removed some filter logic for detecting valid data which, as a side effect, also removed the header from the parsed data. You might need to reimplement some filtering.
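For the Spark SQL conversion mentioned above, one possible sketch (the table name is a placeholder; this assumes every line parsed into a dict with the same keys, and uses the sqlContext from the question):

from pyspark.sql import Row

# promote the RDD of dicts to a DataFrame via Row objects
rows = parsedData.map(lambda d: Row(**d))
df = sqlContext.createDataFrame(rows)
df.registerTempTable("flights")  # placeholder table name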