I am trying to load a JSON file into DynamoDB on AWS. The JSON file has about 20K rows, but only about 80 rows were uploaded to DynamoDB successfully. Any idea what is going on?
The following is the Lambda upload code:
import boto3
import json

s3_client = boto3.client('s3')
dynamodb1 = boto3.resource('dynamodb')

def lambda_handler(event, context):
    # TODO implement
    bucket = event['Records'][0]['s3']['bucket']['name']
    json_file_name = event['Records'][0]['s3']['object']['key']
    json_object = s3_client.get_object(Bucket=bucket, Key=json_file_name)
    jsonFileReader = json_object['Body'].read()
    jsonDict = json.loads(jsonFileReader)
    table1 = dynamodb1.Table('table88')
    for record in jsonDict:
        table1.put_item(Item=record)
    return 'Hello from Lambda'
Did you try increasing the Lambda execution timeout value? Maybe 20k rows need more time to process than the configured execution timeout allows.
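Beyond raising the timeout, one way to make 20k writes fit comfortably within it (not part of the original answer, just a sketch that reuses the question's table88 table) is DynamoDB's batch_writer, which buffers put_item calls into BatchWriteItem requests and retries unprocessed items:
import json
import boto3

s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    records = json.loads(s3_client.get_object(Bucket=bucket, Key=key)['Body'].read())
    table = dynamodb.Table('table88')  # table name taken from the question
    # batch_writer groups writes into batches and retries unprocessed items,
    # which is much faster than one put_item request per record
    with table.batch_writer() as batch:
        for record in records:
            batch.put_item(Item=record)
    return 'Loaded %d items' % len(records)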
I have an S3 bucket with over 130k JSON files, and I need to compute numbers from the data in those files (for example, counting speakers by gender). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
import json
import boto3

client = boto3.client('s3')
s3 = boto3.resource('s3')

paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers only the read or includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n - look for the n) and inf1 instances (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multiprocessing is orders of magnitude faster than single-threaded processing.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *

ufh = file_utilities()  # instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix pass '' (empty string or function will fail)

# define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example
num_proc = os.cpu_count()  # tells you how many processors your machine has; function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
# utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd

# define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """file handling functions"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys and dates for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
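Once df_new is assembled, the gender count from the question can be a one-liner. This is just a sketch that assumes each JSON record exposes a 'gender' field for the speaker; adjust the column name to your actual schema:
gender_counts = df_new['gender'].value_counts()  # 'gender' is an assumed column name
print(gender_counts)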
OK, so I am a beginner to AWS in general. I am writing a Lambda function that triggers on a file upload event in S3, removes some columns, and writes the result to a new bucket. I've been banging my head for the past two days and I am getting a different error each time. Can someone modify/fix my code? outputlv will be my target bucket. Currently I am getting a "'/outputlv/output.csv' path does not exist" error on the with open('/outputlv/output.csv', 'w') as output_file line. Thanks.
import json
import urllib.parse
import boto3
import csv

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    file_name = s3.get_object(Bucket=bucket, Key=key)
    csv_reader = csv.reader(file_name)
    with open('/outputlv/output.csv', 'w') as output_file:
        wtr = csv.writer(output_file)
        for i in csv_reader:
            wtr.writerow(i[0], i[2], i[3])
    target_bucket = 'outputlv'
    final_file = 'outputlv/output.csv'
    s3.put_object(Bucket=target_bucket, Key=final_file)
Why don't you work with the object's content directly? Is it required to work with local files at all?
response = s3.get_object(Bucket=bucket, Key=key)
# Get the file content and decode it
content = response['Body'].read().decode('utf-8')
# Pass the content to the csv reader (it expects an iterable of text lines)
csv_reader = csv.reader(content.splitlines())
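Building on that idea, here is a minimal end-to-end sketch (not a tested drop-in: it keeps the question's column indices 0, 2, 3 and target bucket outputlv, and skips error handling) that avoids local files entirely by writing the filtered CSV to an in-memory buffer and uploading it with put_object:
import csv
import io
import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    # read and decode the uploaded CSV straight from S3
    content = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    reader = csv.reader(content.splitlines())
    # keep only columns 0, 2 and 3 and build the output in memory
    out_buffer = io.StringIO()
    writer = csv.writer(out_buffer)
    for row in reader:
        writer.writerow([row[0], row[2], row[3]])  # writerow takes a single list
    # upload the result to the target bucket from the question
    s3.put_object(Bucket='outputlv', Key='output.csv', Body=out_buffer.getvalue().encode('utf-8'))
    return {'statusCode': 200}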
I successfully inserted many JSON files (only chosen keys) into a local MongoDB. However, when a collection has a little more than 100 million rows that need to be inserted, my code seems very slow. I hope multiprocessing will help speed up the process, but I can't come up with the correct way of doing it without any conflicts. Here is my code without multiprocessing:
import json
import os
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client[db_name]

# get file list
def log_list(log_folder):
    log_file = list()
    for entry in os.listdir(log_folder):
        if os.path.isfile(os.path.join(log_folder, entry)):
            log_path = os.path.join(log_folder, entry)
            log_file.append(log_path)
    return log_file

def func():
    collection = db[collection_name]
    print('loading folder_name')
    root = folder_path
    nfile = 0
    nrow = 0
    # insert data
    files = log_list(root)
    files.sort()
    for file in files:
        with open(file, 'r') as f:
            nfile += 1
            table = [json.loads(line) for line in f]
            for row in table:
                nrow += 1
                entry = {'timestamp': row['#timestamp'], 'user_id': row['user']['id'], 'action': row['#type']}
                collection.insert_one(entry).inserted_id
    client.close()
    print(nfile, 'file(s) processed.', nrow, 'row(s) loaded.')
Split your input into several chunks and run one copy of your program for each chunk. When writing to the database, use insert_many rather than insert_one so each call writes a batch of documents instead of a single row.
You can use Python multiprocessing to fork multiple parallel jobs.
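A rough sketch combining both suggestions; it reuses log_list, db_name, collection_name, and folder_path from the question, and gives each worker process its own MongoClient, since connections should not be shared across forked processes:
import json
import os
from multiprocessing import Pool
from pymongo import MongoClient

def load_file(path):
    # each worker opens its own connection
    client = MongoClient('localhost', 27017)
    collection = client[db_name][collection_name]  # placeholders from the question
    with open(path, 'r') as f:
        entries = [{'timestamp': row['#timestamp'],
                    'user_id': row['user']['id'],
                    'action': row['#type']}
                   for row in (json.loads(line) for line in f)]
    if entries:
        collection.insert_many(entries, ordered=False)  # one batched round trip per file
    client.close()
    return len(entries)

if __name__ == '__main__':
    files = sorted(log_list(folder_path))  # log_list/folder_path as in the question
    with Pool(processes=os.cpu_count()) as pool:
        counts = pool.map(load_file, files)
    print(len(files), 'file(s) processed.', sum(counts), 'row(s) loaded.')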
We do this in our project: users upload a lot of files for some task, and we handle it with distributed task queues using Celery.
Since this is a similar asynchronous task, Celery can work great here; it is designed to pick up tasks and execute them in separate processes.
Create a task
Set up a broker (like Redis)
Run Celery in another terminal or in the background
Send the task (see task_name.apply_async() or task_name.delay())
https://docs.celeryproject.org/en/latest/index.html
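As a rough illustration of those steps (assuming a local Redis broker and a hypothetical process_upload task, not code from any specific project):
# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumes a local Redis broker

@app.task
def process_upload(file_path):
    # placeholder for the real per-file work
    print('processing', file_path)

# After starting a worker with `celery -A tasks worker`,
# enqueue work from your application code:
# process_upload.delay('/path/to/uploaded/file.json')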
I am trying to load a Firebase Realtime Database backup in json.gz format (744 MB compressed, about 5 GB after unzipping) from Cloud Storage into BigQuery.
I have tried both the BigQuery UI and the Python client, but I'm getting the error below:
Error while reading data, error message: Failed to parse JSON: Closing quote expected in string; Could not parse value; Could not parse value; Could not parse value
Since this is a Firebase daily backup, I am unsure what could be wrong with the data.
Here is the Python (2.7.15) code used to load the data:
import os
from google.cloud import bigquery
credential_path = "path to credentials .json file"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path
client = bigquery.Client()
dataset_id = 'my_new_dataset'
dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
uri = 'gs://URI-PATH'
load_job = client.load_table_from_uri(
    uri,
    dataset_ref.table('hit_snapshot_table'),
    job_config=job_config)  # API request
assert load_job.job_type == 'load'
load_job.result()
Any help/suggestion is greatly appreciated.
I am trying to read a JSON file from Amazon S3, and its file size is about 2 GB. When I use the .read() method, it gives me a MemoryError.
Are there any solutions to this problem? Any help would do, thank you so much!
So, I found a way that worked for me efficiently. I had a 1.60 GB file and needed to load it for processing.
import io
import json
import boto3

s3 = boto3.resource('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
# Now we collect the data in the form of a bytes array.
data_in_bytes = s3.Object(bucket_name, filename).get()['Body'].read()
# Decode it in 'utf-8' format
decoded_data = data_in_bytes.decode('utf-8')
# I used the io module for creating a StringIO object.
stringio_data = io.StringIO(decoded_data)
# Now just read the StringIO obj line by line.
data = stringio_data.readlines()
# It's time to use the json module now.
json_data = list(map(json.loads, data))
So json_data is the content of the file. I know there are lots of variable manipulations, but it worked for me.
Just iterate over the object.
import json
import boto3

s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
fileObj = s3.get_object(Bucket='bucket_name', Key='key')
# stream the body line by line instead of reading it all into memory
for row in fileObj['Body'].iter_lines():
    line = row.decode('utf-8')
    print(json.loads(line))
I just solved the problem. Here's the code. Hope it helps for future use!
s3 = boto3.client('s3', aws_access_key_id=<aws_access_key_id>, aws_secret_access_key=<aws_secret_access_key>)
obj = s3.get_object(Bucket='bucket_name', Key='key')
# build a generator that decodes the body line by line
data = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for row in data:
    print(json.loads(row))
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    # list all s3 buckets
    for bucket in s3.buckets.all():
        print(bucket.name)
    # json_data = s3.Object("vkhan-s3-bucket", "config/sandbox/config.json").get()['Body'].read()
    json_data = json.loads(s3.Object("vkhan-s3-bucket", "config/sandbox/config.json").get()['Body'].read().decode())
    print(json_data)
    return {
        'statusCode': 200,
        'body': json.dumps(json_data)
    }