Trouble getting a dynamodb scan to work with boto - boto

I'm using boto to access a dynamodb table. Everything was going well until I tried to perform a scan operation.
I've tried a couple of syntaxes I've found after repeated searches of The Internet, but no luck:
def scanAssets(self, asset):
results = self.table.scan({('asset', 'EQ', asset)})
-or-
results = self.table.scan(scan_filter={'asset':boto.dynamodb.condition.EQ(asset)})
The attribute I'm scanning for is called 'asset', and asset is a string.
The odd thing is the table.scan call always ends up going through this function:
def dynamize_scan_filter(self, scan_filter):
"""
Convert a layer2 scan_filter parameter into the
structure required by Layer1.
"""
d = None
if scan_filter:
d = {}
for attr_name in scan_filter:
condition = scan_filter[attr_name]
d[attr_name] = condition.to_dict()
return d
I'm not a python expert, but I don't see how this would work. I.e. what kind of structure would scan_filter have to be to get through this code?
Again, maybe I'm just calling it wrong. Any suggestions?

OK, looks like I had an import problem. Simply using:
import boto
and specifying boto.dynamodb.condition doesn't cut it. I had to add:
import dynamodb.condition
to get the condition type to get picked up. My now working code is:
results = self.table.scan(scan_filter={'asset': dynamodb.condition.EQ(asset)})
Not that I completely understand why, but it's working for me now. :-)

Or you can do this
exclusive_start_key = None
while True:
result_set = self.table.scan(
asset__eq=asset, # The scan filter is explicitly given here
max_page_size=100, # Number of entries per page
limit=100,
# You can divide the table by n segments so that processing can be done parallelly and quickly.
total_segments=number_of_segments,
segment=segment, # Specify which segment you want to process
exclusive_start_key=exclusive_start_key # To start for last key seen
)
dynamodb_items = map(lambda item: item, result_set)
# Do something with your item, add it to a list for later processing when you come out of the while loop
exclusive_start_key = result_set._last_key_seen
if not exclusive_start_key:
break
This is applicable for any field.
segmentation: suppose you have above script in test.py
you can run parallelly like
python test.py --segment=0 --total_segments=4
python test.py --segment=1 --total_segments=4
python test.py --segment=2 --total_segments=4
python test.py --segment=3 --total_segments=4
in different screens

Related

pyarrow read_csv - how to fill trailing optional columns with nulls

I can't find an option or workaround for this using pyarrow.csv.read_csv and there are many other reasons why using pandas doesn't work for us.
We have csv files with final columns that are effectively optional, and the source data doesn't always include empty cells for them, for example:
name,date,serial_number,prior_name,comments
A,2021-01-01,1234
B,2021-01-02,1235,A,Name changed for new version
C,2021-01-02,1236,B
This fails with an error like pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 3:
I've got to assume that pyarrow can handle this, but I can't see how. Even the invalid row handler doesn't appear to let me return the "appropriate" value, only to "skip" these rows. That would even be okay if I could save them and append later, but as arrow tables are immutable, it just seems like there should be a more straightforward way to handle these cases.
As Pace noted, this is unfortunately not presently available in pyarrow. We actually process the "csv" files on the fly after extracting from a zip, and so creating intermediate files wasn't an option.
For anyone else needing a quick-ish way to handle this, I was able to get around this (and a couple other issues) by creating a class to wrap the stream and overwrite read(size, *args, **kwargs) to quickly perform the stripping. Even with the middleman class, it's faster than attempting to load in pandas (and there are
several other reasons why we aren't using pandas here).
Here's a template example:
class StreamWrapper():
def __init__(self, obj=None):
self.delegation = obj
self.header = None
def __getattr__(self, *args, **kwargs):
# any other call is delegated to the stream/object
return(self.delegation.__getattribute__(*args, **kwargs))
def read(self, size=None, *args, **kwargs):
bytedata = self.delegation.read(size, *args, **kwargs)
# .. the rest of the logic pre-processes the byte data,
# identifies the header and number of columns (which are retained persistently),
# and then strips out extra columns when encountered
return(bytedata)
This allow our call to be:
df = pyarrow.csv.read_csv(
StreamWrapper(zipfile_stream),
parse_options = ...
)

Tf-slim: ValueError: Variable vgg_19/conv1/conv1_1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope?

I am using tf-slim to extract features from several batches of images. The problem is my code works for the first batch , after that I get the error in the title.My code is something like this:
for i in range(0, num_batches):
#Obtain the starting and ending images number for each batch
batch_start = i*training_batch_size
batch_end = min((i+1)*training_batch_size, read_images_number)
#obtain the images from the batch
images = preprocessed_images[batch_start: batch_end]
with slim.arg_scope(vgg.vgg_arg_scope()) as sc:
_, end_points = vgg.vgg_19(tf.to_float(images), num_classes=1000, is_training=False)
init_fn = slim.assign_from_checkpoint_fn(os.path.join(checkpoints_dir, 'vgg_19.ckpt'),slim.get_model_variables('vgg_19'))
feature_conv_2_2 = end_points['vgg_19/pool5']
So as you can see, in each batch, I select a batch of images and use the vgg-19 model to extract features from the pool5 layer. But after the first iteration I get error in the line where I am trying to obtain the end-points. One solution, as I found on the internet is to reset the graph each time , but I don't want to do that because I have some weights in my graph in later part of the code which I train using these extracted features. I don't want to reset them. Any leads highly appreciated. Thanks!
You should create your graph once, not in a loop. The error message tells you exactly that - you try to build the same graph twice.
So it should be (in pseudocode)
create_graph()
load_checkpoint()
for each batch:
process_data()

How does Ruby JSON.parse differ to OJ.load in terms of allocating memory/Object IDs

This is my first question and I have tried my best to find an answer - I have looked everywhere for an answer but haven't managed to find anything concrete to answer this in both the oj docs and ruby json docs and here.
Oj is a gem that serves to improve serialization/deserialization speeds and can be found at: https://github.com/ohler55/oj
I noticed this difference when I tried to dump and parse a hash with a NaN contained in it, twice, and compared the two, i.e.
# Create json Dump
dump = JSON.dump ({x: Float::NAN})
# Create first JSON load
json_load = JSON.parse(dump, allow_nan: true)
# Create second JSON load
json_load_2 = JSON.parse(dump, allow_nan: true)
# Create first OJ load
oj_load = Oj.load(dump, :mode => :compat)
# Create second OJload
oj_load_2 = Oj.load(dump, :mode => :compat)
json_load == json_load_2 # Returns true
oj_load == oj_load_2 # Returns false
I always thought NaN could not be compared to NaN so this confused me for a while until I realised that json_load and json_load_2 have the same object ID and oj_load and oj_load_2 do not.
Can anyone point me in the direction of where this memory allocation/object ID allocation occurs or how I can control that behaviour with OJ?
Thanks and sorry if this answer is floating somewhere on the internet where I could not find it.
Additional info:
I am running Ruby 1.9.3.
Here's the output from my tests re object IDs:
puts Float::NAN.object_id; puts JSON.parse(%q({"x":NaN}), allow_nan: true)["x"].object_id; puts JSON.parse(%q({"x":NaN}), allow_nan: true)["x"].object_id
70129392082680
70129387898880
70129387898880
puts Float::NAN.object_id; puts Oj.load(%q({"x":NaN}), allow_nan: true)["x"].object_id; puts Oj.load(%q({"x":NaN}), allow_nan: true)["x"].object_id
70255410134280
70255410063100
70255410062620
Perhaps I am doing something wrong?
I believe that is a deep implementation detail. Oj does this:
if (ni->nan) {
rnum = rb_float_new(0.0/0.0);
}
I can't find a Ruby equivalent for that, Float.new doesn't appear to exist, but it does create a new Float object every time (from an actual C's NaN it constructs on-site), hence different object_ids.
Whereas Ruby's JSON module uses (also in C) its own JSON::NaN Float object everywhere:
CNaN = rb_const_get(mJSON, rb_intern("NaN"));
That explains why you get different NaNs' object_ids with Oj and same with Ruby's JSON.
No matter what object_ids the resulting hashes have, the problem is with NaNs. If they have the same object_ids, the enclosing hashes are considered equal. If not, they are not.
According to the docs, Hash#== uses Object#== for values that only outputs true if and only if the argument is the same object (same object_id). This contradicts NaN's property of not being equal to itself.
Spectacular. Inheritance gone haywire.
One could, probably, modify Oj's C code (and even make a pull request with it) to use a constant like Ruby's JSON module does. It's a subtle change, but it's in the spirit of being compat, I guess.

How to get all resource record sets from Route 53 with boto?

I'm writing a script that makes sure that all of our EC2 instances have DNS entries, and all of the DNS entries point to valid EC2 instances.
My approach is to try and get all of the resource records for our domain so that I can iterate through the list when checking for an instance's name.
However, getting all of the resource records doesn't seem to be a very straightforward thing to do! The documentation for GET ListResourceRecordSets seems to suggest it might do what I want and the boto equivalent seems to be get_all_rrsets ... but it doesn't seem to work as I would expect.
For example, if I go:
r53 = boto.connect_route53()
zones = r53.get_zones()
fooA = r53.get_all_rrsets(zones[0][id], name="a")
then I get 100 results. If I then go:
fooB = r53.get_all_rrsets(zones[0][id], name="b")
I get the same 100 results. Have I misunderstood and get_all_rrsets does not map onto ListResourceRecordSets?
Any suggestions on how I can get all of the records out of Route 53?
Update: cli53 (https://github.com/barnybug/cli53/blob/master/cli53/client.py) is able to do this through its feature to export a Route 53 zone in BIND format (cmd_export). However, my Python skills aren't strong enough to allow me to understand how that code works!
Thanks.
get_all_rrsets returns a ResourceRecordSets which derives from Python's list class. By default, 100 records are returned. So if you use the result as a list it will have 100 records. What you instead want to do is something like this:
r53records = r53.get_all_rrsets(zones[0][id])
for record in r53records:
# do something with each record here
Alternatively if you want all of the records in a list:
records = [r for r in r53.get_all_rrsets(zones[0][id]))]
When iterating either with a for loop or list comprehension Boto will fetch the additional records (up to) 100 records at a time as needed.
this blog post from 2018 has a script which allows exporting in bind format:
#!/usr/bin/env python3
# https://blog.jverkamp.com/2018/03/12/generating-zone-files-from-route53/
import boto3
import sys
route53 = boto3.client('route53')
paginate_hosted_zones = route53.get_paginator('list_hosted_zones')
paginate_resource_record_sets = route53.get_paginator('list_resource_record_sets')
domains = [domain.lower().rstrip('.') for domain in sys.argv[1:]]
for zone_page in paginate_hosted_zones.paginate():
for zone in zone_page['HostedZones']:
if domains and not zone['Name'].lower().rstrip('.') in domains:
continue
for record_page in paginate_resource_record_sets.paginate(HostedZoneId = zone['Id']):
for record in record_page['ResourceRecordSets']:
if record.get('ResourceRecords'):
for target in record['ResourceRecords']:
print(record['Name'], record['TTL'], 'IN', record['Type'], target['Value'], sep = '\t')
elif record.get('AliasTarget'):
print(record['Name'], 300, 'IN', record['Type'], record['AliasTarget']['DNSName'], '; ALIAS', sep = '\t')
else:
raise Exception('Unknown record type: {}'.format(record))
usage example:
./export-zone.py mydnszone.aws
mydnszone.aws. 300 IN A server.mydnszone.aws. ; ALIAS
mydnszone.aws. 86400 IN CAA 0 iodef "mailto:hostmaster#mydnszone.aws"
mydnszone.aws. 86400 IN CAA 128 issue "letsencrypt.org"
mydnszone.aws. 86400 IN MX 10 server.mydnszone.aws.
the output can be saved as a file and/or copied to the clipboard:
the Import zone file page allows to paste the data:
at the time of this writing the script was working fine using python 3.9.

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope some one can help me with my requirement.
If possible can someone help me with pseudo code.
My Idea:
Is to write a python script for uploading all JSON files to CouchDB.
Each and every JSON file must be each document and the data present in
JSON must be inserted same into CouchDB
(the specified format with values in a file).
Note:
These JSON files are Transactional, every second 1 file is generated
so I need to read the file upload as same format into CouchDB on
successful uploading archive the file into local system of different folder.
python program to parse the json and insert into CouchDb:
import sys
import glob
import errno,time,os
import couchdb,simplejson
import json
from pprint import pprint
couch = couchdb.Server() # Assuming localhost:5984
#couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']
path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
#dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)
for file1 in files:
#dirs = os.listdir( dirPath )
file2 = glob.glob(file1)
for name in file2: # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
try:
with open(name) as f: # No need to specify 'r': this is the default.
#sys.stdout.write(f.read())
json_data=f
data = json.load(json_data)
db.save(data)
pprint(data)
json_data.close()
#time.sleep(2)
except IOError as exc:
if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
raise # Propagate other kinds of IOError.
I would use CouchDB bulk API, even though you have specified that you need to send them to db one by one. For example, by implementing a simple queue that gets sent out every say 5 - 10 seconds via a bulk doc call will greatly increase performance of your application.
There is obviously a quirk in that and that is you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (it is not entirely true, you can get ranges of docs using bulk operation if the IDs you are using for your docs can be sorted nicely).
From my experience working with CouchDB, I have a hunch that you are dealing with Transactional documents in order to compile them into some sort of sum result and act on that data accordingly (maybe creating next transactional doc in series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get reduce function working properly and is highly dependent on what it is you actually what to achieve and what data you are prepared to emit by the view so I can't really provide you with more detail on that.
So in the end the app logic would go something like that:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got that first part of why you are using transactional docs wrong all that would really change is the part where you getting those transactional docs in my app logic.
Also, before writing your own 'reduce' function, have a look at buil-in ones (they are alot faster then anything outside of db engine can do)
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
EDIT:
Since you are starting, I strongly recommend to have a look at CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well maybe not so much a hidden stone but not an obvious thing to look out for for the new-comer in any case). When you write reduce function make sure that it does not produce too much output for the query without boundaries. This will extremely slow down the entire view even when you provide reduce=false when getting stuff from it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
get new JSON doc from server
send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
You might want to checkout the pycouchdb module for python3. I've used it myself to upload lots of JSON objects into couchdb instance. My project does pretty much the same as you describe so you can take a look at my project Pyro at Github for details.
My class looks like that:
class MyCouch:
""" COMMUNICATES WITH COUCHDB SERVER """
def __init__(self, server, port, user, password, database):
# ESTABLISHING CONNECTION
self.server = pycouchdb.Server("http://" + user + ":" + password + "#" + server + ":" + port + "/")
self.db = self.server.database(database)
def check_doc_rev(self, doc_id):
# CHECKS REVISION OF SUPPLIED DOCUMENT
try:
rev = self.db.get(doc_id)
return rev["_rev"]
except Exception as inst:
return -1
def update(self, all_computers):
# UPDATES DATABASE WITH JSON STRING
try:
result = self.db.save_bulk( all_computers, transaction=False )
sys.stdout.write( " Updating database" )
sys.stdout.flush()
return result
except Exception as ex:
sys.stdout.write( "Updating database" )
sys.stdout.write( "Exception: " )
print( ex )
sys.stdout.flush()
return None
Let me know in case of any questions - I will be more than glad to help if you will find some of my code usable.