I have a requirement to read schema files from the same S3 path where the AWS Glue script is stored.
Right now I am hard-coding the bucket name, which I don't want to do.
I want to load a JSON file:
import json

with open('schema.json') as f:
    data = json.load(f)
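A minimal sketch of what I'm aiming for, assuming the bucket and key can be passed in as Glue job parameters instead of being hard-coded (the parameter names --schema_bucket and --schema_key below are placeholders, not part of my current job):
import json
import sys

import boto3
from awsglue.utils import getResolvedOptions

# Hypothetical job parameters; supply them as --schema_bucket / --schema_key
# when starting the job, so nothing is hard-coded in the script.
args = getResolvedOptions(sys.argv, ['schema_bucket', 'schema_key'])

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=args['schema_bucket'], Key=args['schema_key'])
data = json.loads(obj['Body'].read())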
I am trying to write a JSON file into my AWS S3 bucket; however, the object I get after uploading is not a JSON file.
I get my data from a website using requests.get(), format it as JSON, and then run the following script to upload it to my S3 bucket.
r = requests.get(url=id_url, params=params)
data = r.json()

s3_client.put_object(
    Body=json.dumps(data, indent=3),
    Bucket='bucket-name',
    Key=fileName
)
However, I am not sure what type of file actually gets stored; it is supposed to be saved as a JSON file.
Screenshot of my S3 bucket, showing the object has no file type
Screenshot of my download folder, showing the file type cannot be identified
When I open the file with PyCharm, it is just a dictionary of keys and values.
Solved it: I simply added ".JSON" to the filename and that fixed the file format issue. Don't know why I didn't think of this earlier.
Thank you
Ideally you shouldn't rely on file extensions to specify the content type. The put_object method supports specifying ContentType. This means that you can use any file name you like, without needing to append .json.
e.g.
s3_client.put_object(
    Body=json.dumps(data, indent=3),
    Bucket='bucket-name',
    Key=fileName,
    ContentType='application/json'
)
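If you want to double-check what was stored, a head_object call returns the object's metadata, including its ContentType, without downloading it. A quick sketch (the bucket name is a placeholder and fileName is the same key used above):
import boto3

s3_client = boto3.client('s3')

# Inspect the stored metadata without downloading the object
response = s3_client.head_object(Bucket='bucket-name', Key=fileName)
print(response['ContentType'])  # should print: application/json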
I have an external Snowflake stage created on S3 and am using it to unload JSON data from a Snowflake table as a .json file. The files come out encrypted when I open them from S3. When I unload the same data as a .csv file, the file can be opened and there is no encryption on it.
COPY INTO @stage_name/test.json
FROM table
file_format=(type=json)
SINGLE=TRUE;

COPY INTO @stage_name/test.csv
FROM table
file_format=(type=csv)
SINGLE=TRUE;
The stage's STAGE_FILE_FORMAT property has been set to CSV.
Could you please advise why the JSON files are getting encrypted but the CSV files are not?
I have a set of text (.txt) files in Cloud Storage (they are uploaded there every 5 minutes). I want to load them into BigQuery, but BigQuery can't accept text files, so I have to convert them to a format BigQuery accepts. What is the best possible way to do this?
As per this document, BigQuery only supports loading data in the following file formats: CSV, JSON, Avro, and Google Cloud Datastore backups.
Hence, if you upload a text file to BigQuery, BigQuery reads it as a CSV file and will most likely run into an error.
You would have to convert your text file into a CSV file before uploading it to BigQuery.
Alternatively, you may use Cloud Dataprep, as it supports text files as inputs. You can do the transformations on your text file in Dataprep and then export the results to BigQuery.
Here is an Overview of Dataprep and the Quickstart documentation to learn how to use it.
Here is the code snippet:
from google.cloud import storage

def getBlobAsString(bucketName, blobName):
    storageClient = storage.Client()
    bucket = storageClient.get_bucket(bucketName)
    blobFile = bucket.get_blob(blobName)
    blobStr = blobFile.download_as_string()
    return blobStr

def getBlobAsFile(bucketName, blobName, txtStr):
    storageClient = storage.Client()
    csvFileName = blobName.replace('txt', 'csv')
    bucket = storageClient.get_bucket(bucketName)
    blob = bucket.blob(csvFileName)
    blob.upload_from_string(txtStr)
    return csvFileName

txtBucket = "bucket-name"
txtBlob = "blob-name"

# Read text file content as string
txtBlobAsStr = getBlobAsString(txtBucket, txtBlob)
txtStr = str(txtBlobAsStr, 'utf-8')

# Write text file content to CSV file
csvBlob = getBlobAsFile(txtBucket, txtBlob, txtStr)
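From there the CSV object could be loaded into BigQuery directly. A rough sketch, reusing txtBucket and csvBlob from the snippet above and assuming schema auto-detection plus a header row are acceptable (the table ID below is a placeholder):
from google.cloud import bigquery

bq_client = bigquery.Client()

# Placeholder table ID; replace with your own project.dataset.table
table_id = "my-project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # infer the schema from the CSV
    skip_leading_rows=1,   # assumes the first row is a header
)

uri = "gs://{}/{}".format(txtBucket, csvBlob)
load_job = bq_client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish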
I come to you to find out whether you have any pro tips for loading the latest CSV files generated by a Glue job into an S3 bucket, so that I can read them in a Jupyter notebook.
I use this command to load my CSV files from an S3 folder. Is there an option to select only the most recently modified CSV files?
df = sqlContext.read.csv(
    's3://path',
    header=True, sep=","
)
Previously I tended to transform my dynamic frame into a classic DataFrame in order to overwrite the old files generated by my Glue job.
This is not possible when writing out a DynamicFrame.
Thank you
You can use the S3 boto3 API to list the CSV files with their last-modified date, then sort and filter them and pass the result to the Glue or Spark read API (see the sketch below).
Alternatively, you can use AWS S3 Inventory and query it with Athena: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
There is also the Job Bookmark concept in Glue, but it covers newly added files, not modified files.
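A minimal sketch of the boto3 approach, assuming the Glue job writes everything under one prefix (the bucket name and prefix below are placeholders, and sqlContext is the same one used in the question):
import boto3

s3 = boto3.client('s3')
# Note: list_objects_v2 returns at most 1000 keys per call; paginate for larger prefixes
response = s3.list_objects_v2(Bucket='bucket-name', Prefix='glue-output/')

# Keep only CSV objects and sort them by LastModified, newest first
csv_objects = [o for o in response.get('Contents', []) if o['Key'].endswith('.csv')]
csv_objects.sort(key=lambda o: o['LastModified'], reverse=True)

# Take the most recent file(s) and hand the full s3:// paths to Spark
latest_paths = ['s3://bucket-name/' + o['Key'] for o in csv_objects[:1]]
df = sqlContext.read.csv(latest_paths, header=True, sep=",")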
I have another question on this here: Open a CSV file from S3 using Roo on Heroku, but I'm not getting any bites, so here's a reword:
I have a CSV file in an S3 bucket
I want to read it using Roo in a Heroku-based app (i.e. no local file access)
How do I open the CSV file from a stream?
Or is there a better tool for doing this?
I am using Rails 4, Ruby 2. Note I can successfully open the CSV for reading if I post it from a form. How can I adapt this to pull the file from an S3 bucket?
Short answer - don't use Roo.
I ended up using the standard CSV library. When working with small CSV files, you can very simply read the file contents into memory using something like this:
require 'csv'

body = file.read
CSV.parse(body, col_sep: ",", headers: true) do |row|
  row_hash = row.to_hash
  field = row_hash["FieldName"]
end
To read a file passed in from a form, just reference the params:
file = params[:file]
body = file.read
To read from S3 you can use the AWS gem:
s3 = AWS::S3.new(access_key_id: ENV['AWS_ACCESS_KEY_ID'], secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'])
bucket = s3.buckets['BUCKET_NAME']

# check each object in the bucket
bucket.objects.each do |obj|
  import_file = obj.key
  body = obj.read
  # call the same style import code as above...
end
I put some code together based on this:
Make Remote Files Local With Ruby Tempfile
and Roo seems to work OK when handed a temp file. I couldn't get it to work with S3 directly. I don't particularly like the copy approach, but my processing runs on Delayed Job, and I want to keep the Roo features a little more than I dislike the file copy. Plain CSV files work without fishing out the encoding info, but XLS files would not.