ParseTweets function for more than one JSON file at the same time using R - json

I have 100 JSON files containing approximately 800,000 tweets. How can I parse all of the files at the same time using R in order to clean them?

Related

Dump Pandas Dataframe to multiple json files

I have a Dataframe loaded from a CSV file
df = pd.read_csv(input_file, header=0)
and I want to process it and eventually save it into multiple JSON files (for example a new file every X rows).
Any suggestions on how to achieve that?
This should work:
import numpy as np

for idx, group in df.groupby(np.arange(len(df)) // 10):
    # orient can also be: split, records, values, table, columns
    group.to_json(f'{idx}_name.json', orient='index')
Change the 10 to the number of rows you wish to write out for each iteration.
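As a quick illustration (using a made-up 25-row frame in place of the CSV data), the loop writes one file per chunk of 10 rows, with the last file holding the remainder:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(25)})  # stand-in for the frame loaded from the CSV
for idx, group in df.groupby(np.arange(len(df)) // 10):
    group.to_json(f'{idx}_name.json', orient='index')
# writes 0_name.json (rows 0-9), 1_name.json (rows 10-19), 2_name.json (rows 20-24)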

Fast processing json using Python3.x from s3

I have json files on s3 like:
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}{'key1':value1, 'key2':'value2'}
The structure is not an array; the files are concatenated JSON objects without any newlines. There are thousands of files, from which I need only a couple of fields. How can I process them quickly?
I will use this on AWS Lambda.
The code I am thinking of is somewhat like this:
import json

data_chunk = data_file.read()
recs = data_chunk.split('}')
json_recs = []
# From this point on it becomes inefficient, since I have to iterate over every record
for rec in recs:
    if rec.strip():  # skip the empty piece left after the final '}'
        json_recs.append(json.loads(rec + '}'))
# Extract individual fields
How can this be improved? Would using a Pandas DataFrame help? Individual files are small, about 128 MB each.
S3 Select supports this JSON Lines structure. You can query it with a SQL-like language. It's fast and cheap.
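A minimal sketch of that approach with boto3's select_object_content; the bucket, key, and field names (key1, key2) are placeholders, and the input serialization may need to be 'DOCUMENT' rather than 'LINES' depending on how the objects are laid out in each file:
import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='my-bucket',                # placeholder bucket name
    Key='path/to/data.json',           # placeholder object key
    ExpressionType='SQL',
    Expression="SELECT s.key1, s.key2 FROM S3Object s",
    InputSerialization={'JSON': {'Type': 'LINES'}},
    OutputSerialization={'JSON': {}},
)

# The response is an event stream; the selected fields arrive in 'Records' events
for event in resp['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))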

How to load only first n files in pyspark spark.read.csv from a single directory

I have a scenario where I am loading and processing 4 TB of data, which is about 15,000 .csv files in a folder. Since I have limited resources, I am planning to process them in two batches and then union them.
I am trying to understand whether I can load only 50% of the files (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and their counts are uneven (some sources produce few files, others many). If I process the files in uneven batches using wildcards or a regex, I may not get optimized performance.
Is there a way to tell the spark.read.csv reader to pick the first n files, and then to load the remaining files in a second batch?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20,000 files and I don't want to iterate over them.
It's easy if you use the Hadoop FileSystem API to list the files first and then create DataFrames from chunks of that list. For example:
path = '/path/to/files/'

# List the directory via the Hadoop FileSystem API exposed through the JVM gateway
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]

# First 7500 files in one DataFrame, the rest in another
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
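Since the question plans to union the two batches afterwards, the halves can be combined once each has been processed (assuming they share the same schema):
# Combine the two batches after processing (schemas must match)
df_all = df1.union(df2)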

PySpark Querying Multiple JSON Files

I have uploaded into Spark 2.2.0 many JSONL files (the structure is the same for all of them) contained in a directory, using the following (Python Spark):
df = spark.read.json(mydirectory)
df.createGlobalTempView("MyDatabase")
sqlDF = spark.sql("SELECT count(*) FROM MyDatabase")
sqlDF.show()
The upload works, but when I query sqlDF (sqlDF.show()), it seems that Spark counts the rows of just one file (the first?) rather than those of all of them. I am assuming that "MyDatabase" is a dataframe containing all the files.
What am I missing?
If I upload just one file consisting of only one line of multiple json objects {...}, Spark can properly identify the tabular structure. If I have more than one file, I have to put each {} on a new line to get the same result.
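One thing worth checking (an assumption on my part, not from the original thread): spark.read.json expects one JSON object per line by default, so records that span multiple lines are only picked up when the multiLine option (available since Spark 2.2) is enabled. A minimal sketch, with mydirectory standing in for the real path:
# Default behaviour: each line is parsed as one JSON record (JSON Lines)
df = spark.read.json(mydirectory)

# If a file contains a single JSON document (or array) spread over several lines,
# multiLine tells the reader to treat the whole file as one record
df_multi = spark.read.option("multiLine", True).json(mydirectory)
df_multi.count()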

Apache Spark - Redundant JSON output after Spark SQL query

"I have done the following:
DataFrame input= sqlContext.read().json("/path/t*json");
DataFrame res = sqlContext.sql("SELECT *, Some_Calculation calc FROM input group by user_id");
res.write().json("/path/result.json");
This reads all the JSON files in the directory whose names start with 't'. So far, so good. But the output of this program is not a single JSON file, as I intended; it is a directory called result.json containing as many .crc and output part files as there were input files. Since my calculations are grouped by user_id, the aggregate results check out, but because the aggregates have the same result in every part, many of those output files are identical.
First of all, how can I generate a single JSON file as the output?
Second of all, how can I get rid of the redundancy?
Thank you!
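A common way to address the first point (a general Spark technique, not an answer from this thread) is to reduce the result to a single partition before writing, shown here in PySpark syntax for brevity; Spark will still create a result.json directory, but with a single part file inside:
# Coalesce to one partition so the write produces a single part file
res.coalesce(1).write.json("/path/result.json")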