Create dynamic frame from S3 bucket AWS Glue - json

Summary:
I've got an S3 bucket which contains a number of JSON files. The bucket contains child folders which are created by date. All the files have a similar structure, and new files are added on a daily basis.
JSON Schema
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("main_data", StructType([
        StructField("action", StringType()),
        StructField("parameters", StructType([
            StructField("project_id", StringType()),
            StructField("integration_id", StringType()),
            StructField("cohort_name", StringType()),
            StructField("cohort_id", StringType()),
            StructField("cohort_description", StringType()),
            StructField("session_id", StringType()),
            StructField("users", StructType([StructField("user_id", StringType())]))
        ]))
    ])),
    StructField("lambda_data", StructType([
        StructField("date", LongType())
    ]))
])
Question
I am trying to create a dynamic frame from options where the source is S3 and the type is JSON. I'm using the following code, however it does not return any value. Where am I going wrong?
Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from functools import reduce
from awsglue.dynamicframe import DynamicFrame
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': 'Location for S3 folder'},
    format='json',
    # formatOptions=$..*
)
print('Total Count:')
df.count()

Can you check if your Glue role has access to the S3 bucket?
Also, in connection_options, add:
"recurse": True

Related

AWS Glue S3 csv to S3 parquet file conversion

I'm trying to convert a file from CSV format to Parquet and read it in Athena.
The Glue script looks like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node Amazon S3
AmazonS3_node1661031713801 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": "'",
        "withHeader": False,
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://data/input/july1_output.csv"]},
    transformation_ctx="AmazonS3_node1661031713801",
)
# Script generated for node Amazon S3
AmazonS3_node1661031823737 = glueContext.getSink(
    path="s3://data/output1/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    compression="gzip",
    enableUpdateCatalog=True,
    transformation_ctx="AmazonS3_node1661031823737",
)
AmazonS3_node1661031823737.setCatalogInfo(
    catalogDatabase="sip", catalogTableName="sipflow"
)
AmazonS3_node1661031823737.setFormat("glueparquet")
AmazonS3_node1661031823737.writeFrame(AmazonS3_node1661031713801)
job.commit()
I'm noticing that the data is getting converted correctly, but the column names in the Parquet file are not coming out as expected. I have set the output schema in the job, but for the target table in Athena the column names are coming through as col0, col1, col2, col3, ...
Any reason why the column names are not coming through correctly in the Parquet files?
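One thing worth checking (a guess, since no answer is recorded here): the source node reads the CSV with "withHeader": False, so the DynamicFrame never picks up real column names and the catalog table falls back to col0, col1, .... A minimal sketch with the header enabled, everything else unchanged from the script above:
AmazonS3_node1661031713801 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": "'",
        "withHeader": True,  # read the first CSV row as column names
        "separator": ",",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://data/input/july1_output.csv"]},
    transformation_ctx="AmazonS3_node1661031713801",
)
If the file has no header row, an ApplyMapping transform between the source and the sink could rename col0, col1, ... to the intended names instead.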

How do I split / chunk Large JSON Files with AWS glueContext before converting them to JSON?

I'm trying to convert a 20 GB gzipped JSON file to Parquet using AWS Glue.
I've set up a job using PySpark with the code below.
I got this log WARN message:
LOG.WARN: Loading one large unsplittable file s3://aws-glue-data.json.gz with only one partition, because the file is compressed by unsplittable compression codec.
I was wondering if there was a way to split / chunk the file? I know I can do it with pandas, but unfortunately that takes far too long (12+ hours).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import pyspark.sql.functions
from pyspark.sql.functions import col, concat, reverse, translate
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
test = glueContext.create_dynamic_frame_from_catalog(
    database="test_db",
    table_name="aws-glue-test_table")
# Create Spark DataFrame, remove timestamp field and re-name other fields
reconfigure = (test.drop_fields(['timestamp'])
                   .rename_field('name', 'FirstName')
                   .rename_field('LName', 'LastName')
                   .rename_field('type', 'record_type'))
# Create pyspark DF
spark_df = reconfigure.toDF()
# Filter and only return 'a' record types
spark_df = spark_df.where("record_type == 'a'")
# Once filtered, remove the record_type column
spark_df = spark_df.drop('record_type')
spark_df = spark_df.withColumn("LastName", translate("LastName", "LName:", ""))
spark_df = spark_df.withColumn("FirstName", reverse("FirstName"))
spark_df.write.parquet("s3a://aws-glue-bucket/parquet/test.parquet")
Spark does not parallelize the read of a single gzip file. However, you can split it into chunks.
Also, Spark is really slow at reading gzip files (since the read is not parallelized). You can do this to speed it up:
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda _: gzip.open(_).readlines())
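The snippet above assumes list_of_files points at paths the executors can open directly with gzip.open (local or mounted storage). A sketch (not from the original answer) of how the same pattern could look once the 20 GB file has been split into parts stored in S3; the bucket and prefix names are hypothetical:
import gzip
import io
import json
import boto3
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

BUCKET = "aws-glue-bucket"  # hypothetical bucket holding the split .gz parts
PREFIX = "json-parts/"      # hypothetical prefix

def list_keys(bucket, prefix):
    # List all object keys under the prefix (runs on the driver).
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def read_gzip_lines(key):
    # Download one gzipped part and yield its decoded lines (runs on the executors).
    s3 = boto3.client("s3")  # create the client inside the task, not on the driver
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    with gzip.GzipFile(fileobj=io.BytesIO(body)) as gz:
        for line in gz:
            yield line.decode("utf-8")

file_names_rdd = sc.parallelize(list_keys(BUCKET, PREFIX), 100)
lines_rdd = file_names_rdd.flatMap(read_gzip_lines)
records_rdd = lines_rdd.map(json.loads)  # assumes one JSON document per line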

How to read jsonl.gz file stored in an s3 bucket using Boto3-Python3

I have a few files in my S3 bucket that are stored as .gz files. I am using boto3 to access those files and I am trying to read their contents.
However, I keep getting this error when I run my code:
OSError: [Errno 9] read() on write-only GzipFile object
Here is my code:
import boto3
import os
import json
from io import BytesIO
import gzip
from gzip import GzipFile
from datetime import datetime
import logging
import botocore
# AWS Bucket Info
BUCKET_NAME = '<my_bucket_name>'
#My bucket's key information to where the .GZ files are stored
key1 = 'my/path/to/file/shar1-jsonl.gz'
# key2 = 'my/path/to/file/shar2-jsonl.gz'
# key3 = 'my/path/to/file/shar3-jsonl.gz'
# Create s3 connection
s3_resource = boto3.resource('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
zip_obj = s3_resource.Object(bucket_name=BUCKET_NAME, key=key1)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = gzip.open(buffer,'wb').read().decode(('utf-8'))
Is there any way to collect the jsonl.gz file and then read its contents using boto3? I am new to boto3 and gzip files, so any ideas or suggestions would help.
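No answer is recorded here, but the traceback points at the likely cause: gzip.open(buffer, 'wb') opens the GzipFile in write mode, so read() fails. A minimal sketch (an assumption, not a posted answer) that opens the object for reading and parses the JSON-lines content:
import gzip
import json
from io import BytesIO
import boto3

BUCKET_NAME = '<my_bucket_name>'
key1 = 'my/path/to/file/shar1-jsonl.gz'

s3_resource = boto3.resource('s3')  # credentials resolved from the environment, or pass keys as above
zip_obj = s3_resource.Object(bucket_name=BUCKET_NAME, key=key1)
buffer = BytesIO(zip_obj.get()["Body"].read())

records = []
with gzip.open(buffer, 'rt', encoding='utf-8') as gz:  # 'rt' = read as text, not 'wb'
    for line in gz:
        line = line.strip()
        if line:
            records.append(json.loads(line))  # one JSON object per line in a .jsonl file

print('Loaded records:', len(records))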

How to convert json file into table structure in redshift using python

How can I convert a JSON file into a table structure in Redshift? I tried the Python code below.
import boto3
import json
import os
import sys
import psycopg2
import csv
from collections import defaultdict
def jsonfile(path):
    session = boto3.Session(
        aws_access_key_id='dfjfkgj',
        aws_secret_access_key='sdfg',
        region_name='us-west-2')
    s3 = session.resource('s3')
    bucket = s3.Bucket('ag-redshift-poc')
    with open(path, 'rb') as data:
        res = json.load(data)
    f = open('data.csv', 'wb')
    output = csv.writer(f)
    output.writerow(res[0].keys())
    for row in res:
        output.writerow(row.values())
    bucket.put_object(Key=('C:\Python27\data.csv'), Body=res)
    print 'success'

def redshift():
    co = psycopg2.connect(dbname='redshiftpoc', host='shdjf',
                          port='5439', user='admin', password='snd')
    curr = co.cursor()
    curr.execute("""copy sample from 's3://ag-redshift-poc/testfile/json.txt'
        CREDENTIALS 'aws_access_key_id=fdfd;aws_secret_access_key=sxhd'
        """)
    co.commit()
    print 'success'
    curr.close()
    co.close()

jsonfile('C:\Python27\json.txt')
redshift()
Redshift can directly ingest JSON via COPY into your table (though this is not very efficient).
In your case, modify the COPY query to:
COPY sample FROM 's3://<bucket_name>/<path_to_json>'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
JSON 'auto' ACCEPTINVCHARS;
Please note the JSON 'auto' option in the query. It maps every column in the table to the matching key in the JSON file.
More details are in the COPY examples in the Redshift documentation.
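Applied to the script in the question, the COPY call inside redshift() might then look like this (a sketch; the bucket, table, and credential strings are the question's placeholders and need real values):
curr.execute("""
    COPY sample
    FROM 's3://ag-redshift-poc/testfile/json.txt'
    CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx'
    JSON 'auto' ACCEPTINVCHARS;
""")
co.commit()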

How to deal with multiple csv.gz files in Spark?

I have a huge dataset with multiple tables. Each table is split into hundreds of csv.gz files, and I need to import them into Spark through PySpark. Any idea on how to import the "csv.gz" files into Spark? Does SparkContext or SparkSession from Spark SQL provide a function to import this type of file?
You can import gzipped csv files natively using spark.read.csv():
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("stackOverflow") \
    .getOrCreate()
fpath1 = "file1.csv.gz"
DF = spark.read.csv(fpath1, header=True)
where DF is a spark DataFrame.
You can read from multiple files by feeding in a list of files:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2], header=True)
You can also create a "temporary view" allowing for SQL queries:
fpath1 = "file1.csv.gz"
fpath2 = "file2.csv.gz"
DF = spark.read.csv([fpath1, fpath2], header=True)
DF.createOrReplaceTempView("table_name")
DFres = spark.sql("SELECT * FROM table_name")
where DFres is a spark DataFrame generated from the query.
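Since each table is split into hundreds of csv.gz files, a glob path also works; a sketch, where the bucket and prefix are hypothetical:
fpath_glob = "s3a://my-bucket/tables/table1/*.csv.gz"  # hypothetical prefix for one table
DF = spark.read.csv(fpath_glob, header=True)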