I have a large text file containing JSON objects on Amazon S3. I am planning to process this data using Spark on Amazon EMR.
Here are my questions:
How do I load the text file containing JSON objects into Spark?
Is it possible to persist the internal RDD representation of this data on S3, after the EMR cluster is turned-off?
If I am able to persist the RDD representation, is it possible to directly load the data in RDD format next time I need to analyze the same data?
This should cover #1, as long as you're using pyspark:
#Configure spark with your S3 access keys
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "MY-ACCESS-KEY")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "MY-SECRET-ACCESS-KEY")
#Retrieve the data
my_data = sc.textFile("s3n://my-bucket-name/my-key")
my_data.count() #Count all rows
my_data.take(20) #Take the first 20 rows
#Parse it
import json
my_data.map(lambda x: json.loads(x)).take(20) #Take the first 20 rows of json-parsed content
Note the s3 address is s3n://, not s3://. This is a legacy thing from hadoop.
Also, my-key can point to a whole S3 directory*. If you're using a spark cluster, importing several medium-sized files is usually faster than a single big one.
For #2 and #3, I'd suggest looking up spark's parquet support. You can also save text back to s3:
my_data.map(lambda x: json.dumps(x)).saveAsTextFile('s3://my-bucket-name/my-new-key')
Not knowing the size of your dataset and the computational complexity of your pipeline, I can't say which way of storing intermediate data to S3 will be the best use of your resources.
*S3 doesn't really have directories, but you know what I mean.
Related
There is a nested json with very deep structure. File is of the format json.gz size 3.5GB. Once this file is uncompressed it is of size 100GB.
This json file is of the format, where Multiline = True (if this condition is used to read the file via spark.read_json then only we get to see the proper json schema).
Also, this file has a single record, in which it has two columns of Struct type array, with multilevel nesting.
How should I read this file and extract information. What kind of cluster / technique to use to extract relevant data from this file.
Structure of the JSON (multiline)
This is a single record. and the entire data is present in 2 columns - in_netxxxx and provider_xxxxx
enter image description here
I was able to achieve this in a bit different way.
Use the utility - Big Text File Splitter -
BigTextFileSplitter - Withdata Softwarehttps://www.withdata.com › big-text-file-splitter ( as the file was huge and multiple level nested) the split record size I kept was 500. This generated around 24 split files of around 3gb each. Entire process took 30 -40 mins.
Processed the _corrupt_record seperately - and populated the required information.
Read the each split file in a using - this option removes the _corrupt_record and also removes the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched form each file, we can merge all the files into a single file, as per standard process.
I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:
%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:
%python
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files, even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.) Is there a way to tell COPY INTO to ignore the initial json files? There's the option modifiedAfter, but that would be cumbersome and doesn't sit well idempotently. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but this seems inefficient, since why should the initial json files be read again? Edit: This method also duplicates the initial load.
Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.
In Scala how do you efficiently (memory consumption + performance) read very large csv file? is it fast enough to just stream it line by line and process each line at each iteration?
What i need to do with CSV data :->
In my application Single line in CSV file is treated as an one single record and all the records of the CSV file are to be converted into XML elements and JSON format and save it into another file in xml and json formats.
So here question is while reading the file from csv is it a good idea to read the file in chunks and provide that chunk to another thread which will convert that CSV records into an xml/json and write that xml/json to file? If yes how?
Data of the CSV can be anything, there is no restriction on the type of the data it can be numeric, big decimal, string or date. Any easy way to handle this different data types before saving it to xml? or we don't need to take care of types?
Many Thanks
If this is not a one time task, create a program that will break this 1GB file to small size files. Then provide those new files as a input to separate futures.
Each future will read one file and resolve in the order of file content. File4 resolves after File3, which resolves after File2, which resolves after File1.
As the file has no key-value pair or hierarchical data structure, so I will suggest, just read as a string.
Hope this helps.
I am using PySpark. I have a list of gziped json files on s3 which I have to access, transform and then export in parquet to s3. Each json file contains around 100k lines so parallelizing it wont make much sense(but i am open to parallelizing it), but there are around 5k files which I have parallelize. My approach is pass the json file list to script -> run parallelize on the list -> run map(? this is where I am getting blocked). how do I access and transform the json create a DF out of the transformed json and dump it as parquet into s3.
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row
def distributedJsonRead(s3Key):
s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=key)
contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
return Row(**contents)
pkeys = sc.parallelize(keyList) #keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
rawtext = sc.read.json(k) # or whichever method you need to use to read in the data
outpath = k[:-4]+'parquet'
rawtext.write.parquet(outpath)
I don't think you will not be able to parallelize these operations if you want a 1:1 mapping of json to parquet files. Spark's read/write functionality is designed to be called by the driver, and needs access to sc and sqlContext. This is another reason why having 1 parquet directory is likely the way to go.
Aim : Build a small ETL framework to take a Huge CSV and dump it into RDB(say MySQL).
The current approach we are thinking about is to load csv using spark into a dataframe and persist it and later use frameworks like apache scoop and and load it into mySQL.
Need recommendations on which format to persist and on the approach itself.
Edit:
CSV will have around 50 million rows with 50-100 columns.
Since our tasks involves lots of transformations before dumping into RDB, we thought using spark was a good idea.
Spark SQL support to writing to RDB directly. You can load your huge CSV as DataFrame, transform it, and call below API to save it to database.
Please refer to below API:
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils
def saveTable(df: DataFrame,
url: String,
table: String,
properties: Properties): Unit
Saves the RDD to the database in a single transaction.
Example Code:
val url: String = "jdbc:oracle:thin:#your_domain:1521/dbname"
val driver: String = "oracle.jdbc.OracleDriver"
val props = new java.util.Properties()
props.setProperty("user", "username")
props.setProperty("password", "userpassword")
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(dataFrame, url, "table_name", props)