I'm looking for pro tips on loading the latest CSV files generated by a Glue job into an S3 bucket, for use in a Jupyter notebook.
I use this command to load my CSVs from an S3 folder. Is there an option to select only the most recently modified CSV files?
df = sqlContext.read.csv(
    's3://path',
    header=True, sep=","
)
Previously, I would convert my DynamicFrame into a regular DataFrame in order to overwrite the old files generated by my Glue job, but that is not possible when writing a DynamicFrame directly.
Thank you
You can use the boto3 S3 API to list the CSV files along with their last-modified dates, then sort and filter them and pass the resulting paths to the Glue or Spark read API.
Alternatively, you can enable Amazon S3 Inventory and query the inventory with Athena: https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
Glue also has a Job Bookmark concept, but it tracks newly added files, not modified ones.
Related
I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has hardly 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim - Target_folder - dev-bucket/final/20000 json files.
I tried writing the code below, but the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        # build the destination key from the source file name
        # ('target_folder' is assumed to be defined, e.g. 'final/')
        final_file = target_folder + obj.key.split('/')[-1]
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from sftp location to base location: %s", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # note: list.append returns None, so don't reassign its result
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
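If you do stick with a Python merge step, the core of it is concatenating each object's body into newline-delimited JSON before writing a single output object. A hedged sketch (the bucket/key names and the commented-out S3 wiring are illustrative, not tested against AWS):

```python
import json

def merge_json_bodies(bodies):
    """Merge JSON documents (each a string holding one object or a list of
    objects) into a single newline-delimited JSON string."""
    records = []
    for body in bodies:
        parsed = json.loads(body)
        if isinstance(parsed, list):
            records.extend(parsed)
        else:
            records.append(parsed)
    return "\n".join(json.dumps(r) for r in records)

# Wiring it to S3 (untested sketch; bucket/prefix are placeholders):
# import boto3
# s3 = boto3.client("s3")
# bodies = []
# for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
#     for obj in page.get("Contents", []):
#         bodies.append(s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode())
# s3.put_object(Bucket=bucket, Key="final/merged.json", Body=merge_json_bodies(bodies))
```

For millions of tiny files this is still one GetObject per file, which is why the Athena CTAS route above usually wins on time.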
I am using spark on Google dataproc cluster. I have created a dictionary in Jupyter notebook which I want to dump in my GCS bucket. However, it seems the usual way of dumping to json using fopen() does not work in case of gcp. So, how can I write my dictionary as .json file to GCS. Or, is there any other way to get the dictionary?
It's funny, I could write spark dataframe to gcs without any hassle, but apparently, I can't load JSON on gcs unless I have it on my local system!
Please help!
Thank you.
The file in GCS is not in your local file system, which is why you cannot call "fopen" on it. You can either save to GCS directly using a GCS client (for example, this tutorial), or treat the GCS location as an HDFS-style destination (for example, saveAsTextFile("gs://...")).
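A hedged sketch of the client approach: serialize the dictionary with the standard json module, then upload the resulting string with the google-cloud-storage client (bucket and blob names here are placeholders, and the upload itself needs credentials):

```python
import json

def dict_to_json(d):
    """Serialize a dictionary to a JSON string ready for upload."""
    return json.dumps(d, indent=2)

def upload_dict_to_gcs(d, bucket_name, blob_name):
    """Upload a dictionary as a .json object to GCS.
    Requires `pip install google-cloud-storage` and application credentials."""
    from google.cloud import storage
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(dict_to_json(d), content_type="application/json")

# upload_dict_to_gcs({"key": "value"}, "my-bucket", "output/my_dict.json")
```

This sidesteps fopen() entirely: the dict never needs to touch the local file system.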
I have a few large zip files on S3. Each of these zip files contains several gz files, which contain data in JSON format. I need to (i) Copy the gz files to HDFS and (ii) Process the files preferably by Apache Spark/Impala/Hive. What is the easiest/best way of going about it?
1) Try distcp for copying files from S3 to HDFS.
2) For processing, use org.apache.spark.sql.hive.HiveContext's read.json to read the JSON data from HDFS and create a dataframe.
Then run any operations you need on it.
Follow this link:
http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
I see we can import JSON files into Firebase.
What I would like to know is whether there is a way to import CSV files (I have files that could have about 50K or even more records, with about 10 columns).
Does it even make sense to have such files in Firebase?
I can't answer whether it makes sense to have such files in Firebase; only you can answer that.
I also had to upload CSV files to Firebase, and I finally transformed my CSV into JSON and used firebase-import to add the JSON to Firebase.
There are a lot of CSV-to-JSON converters (even online ones). You can pick the one you like the most (I personally used node-csvtojson).
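If you'd rather not depend on an external converter, the transformation is a few lines with Python's standard csv and json modules (file names below are placeholders):

```python
import csv
import json

def csv_to_json_records(csv_text):
    """Parse CSV text (first row = headers) into a list of dicts, one per row."""
    return [dict(row) for row in csv.DictReader(csv_text.splitlines())]

def csv_file_to_json_file(csv_path, json_path):
    """Convert a CSV file into a JSON array file, e.g. as input for firebase-import."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        records = [dict(row) for row in csv.DictReader(f)]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

# csv_file_to_json_file("data.csv", "data.json")
```

Note that every value comes out as a string; if some columns are numeric, you may want to convert them before importing.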
I've uploaded many tab-separated files (40MB each) into Firebase.
Here are the steps:
I wrote Java code to translate the TSV into JSON files.
I used firebase-import to upload them. To install it, just type in cmd:
npm install firebase-import
One trick I used on top of all the ones already mentioned is to synchronize a Google spreadsheet with Firebase.
You create a script that uploads directly to the Firebase DB based on rows/columns. It worked quite well and can be more visual for fine-tuning the raw data compared to working with the CSV/JSON format directly.
Ref: https://www.sohamkamani.com/blog/2017/03/09/sync-data-between-google-sheets-and-firebase/
Here is the fastest way to Import your CSV to Firestore:
Create an account in Jet Admin
Connect Firebase as a DataSource
Import CSV to Firestore
Ref:
https://blog.jetadmin.io/how-to-import-csv-to-firestore-database-without-code/
I have an idea of how to extract table data to Cloud Storage using the bq extract command, but I would rather like to know whether there are any options to extract a BigQuery table as newline-delimited JSON to a local machine.
I could extract table data to GCS via the CLI and also download JSON data from the web UI, but I am looking for a solution using the BQ CLI to download table data as JSON to my local machine. I am wondering whether that is even possible.
You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you download the file from GCS to your local machine.
So you first need to export to GCS, then transfer to your local machine.
If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some exports locally, but it has certain limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
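Note that --format=prettyjson emits a single JSON array, not the newline-delimited JSON that many loaders expect. If you specifically need NDJSON locally, a small post-processing step converts it (file names below are placeholders):

```python
import json

def pretty_to_ndjson(pretty_text):
    """Convert a JSON array (as produced by bq --format=prettyjson) into
    newline-delimited JSON, one record per line."""
    rows = json.loads(pretty_text)
    return "\n".join(json.dumps(row) for row in rows)

# with open("export.json", encoding="utf-8") as f:
#     ndjson = pretty_to_ndjson(f.read())
# with open("export.ndjson", "w", encoding="utf-8") as f:
#     f.write(ndjson)
```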
It's possible to extract data without using GCS, directly to your local machine, using BQ CLI.
Please see my other answer for details: BigQuery Table Data Export