how to chunk a csv (dict)reader object in python 3.2?

how to chunk a csv (dict)reader object in python 3.2? - csv

I try to use Pool from the multiprocessing module to speed up reading in large csv files. For this, I adapted an example (from py2k), but it seems like the csv.dictreader object has no length. Does it mean I can only iterate over it? Is there a way to chunk it still?
These questions seemed relevant, but did not really answer my question:
Number of lines in csv.DictReader,
How to chunk a list in Python 3?
My code tried to do this:
source = open('/scratch/data.txt','r')
def csv2nodes(r):
strptime = time.strptime
mktime = time.mktime
l = []
ppl = set()
for row in r:
cell = int(row['cell'])
id = int(row['seq_ei'])
st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
# collect list
l.append([(id,cell,{1:st,2: ed})])
# collect separate sets
ppl.add(id)
return (l,ppl)
def csv2graph(source):
r = csv.DictReader(source,delimiter=',')
MG=nx.MultiGraph()
l = []
ppl = set()
# Remember that I use integers for edge attributes, to save space! Dic above.
# start: 1
# end: 2
p = Pool(processes=4)
node_divisor = len(p._pool)*4
node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
num_chunks = len(node_chunks)
pedgelists = p.map(csv2nodes,
zip(node_chunks))
ll = []
for l in pedgelists:
ll.append(l[0])
ppl.update(l[1])
MG.add_edges_from(ll)
return (MG,ppl)

From the csv.DictReader documentation (and the csv.reader class it subclasses), the class returns an iterator. The code should have thrown a TypeError when you called len().
You can still chunk the data, but you'll have to read it entirely into memory. If you're concerned about memory you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries csv.DictReader creates. To improve readability in csv2nodes(), you can assign constants to address each field's index:
CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5
I also recommend using a different variable than id, since that's a built-in function name.

Related

Count the number of people having a property bounded by two numbers

The following code goes over the 10 pages of JSON returned by GET request to the URL.
and checks how many records satisfy the condition that bloodPressureDiastole is between the specified limits. It does the job, but I was wondering if there was a better or cleaner way to achieve this in python
import urllib.request
import urllib.parse
import json
baseUrl = 'https://jsonmock.hackerrank.com/api/medical_records?page='
count = 0
for i in range(1, 11):
url = baseUrl+str(i)
f = urllib.request.urlopen(url)
response = f.read().decode('utf-8')
response = json.loads(response)
lowerlimit = 110
upperlimit = 120
for elem in response['data']:
bd = elem['vitals']['bloodPressureDiastole']
if bd >= lowerlimit and bd <= upperlimit:
count = count+1
print(count)

There is no access through fields to json content because you get dict object from json.loads (see translation scheme here). It realises access via __getitem__ method (dict[key]) instead of __getattr__ (object.field) as keys may be any hashible objects not only strings. Moreover, even strings cannot serve as fields if they starts with digits or are the same with built-in dictionary methods.
Despite this, you can define your own custom class realising desired behavior with acceptable key names. json.loads has an argument object_hook wherein you may put any callable object (function or class) that take a dict as the sole argument (not only the resulted one but every one in json recursively) & return something as the result. If your jsons match some template, you can define a class with predefined fields for the json content & even with methods in order to get a robust Python-object, a part of your domain logic.
For instance, let's realise the access through fields. I get json content from response.json method from requests but it has the same arguments as in json package. Comments in code contain remarks about how to make your code more pythonic.
from collections import Counter
from requests import get
class CustomJSON(dict):
def __getattr__(self, key):
return self[key]
def __setattr__(self, key, value):
self[key] = value
LOWER_LIMIT = 110 # Constants should be in uppercase.
UPPER_LIMIT = 120
base_url = 'https://jsonmock.hackerrank.com/api/medical_records'
# It is better use special tools for handling URLs
# in order to evade possible exceptions in the future.
# By the way, your option could look clearer with f-strings
# that can put values from variables (not only) in-place:
# url = f'https://jsonmock.hackerrank.com/api/medical_records?page={i}'
counter = Counter(normal_pressure=0)
# It might be left as it was. This option is useful
# in case of additional counting any other information.
for page_number in range(1, 11):
records = get(
base_url, data={"page": page_number}
).json(object_hook=CustomJSON)
# Python has a pile of libraries for handling url requests & responses.
# urllib is a standard library rewritten from scratch for Python 3.
# However, there is a more featured (connection pooling, redirections, proxi,
# SSL verification &c.) & convenient third-party
# (this is the only disadvantage) library: urllib3.
# Based on it, requests provides an easier, more convenient & friendlier way
# to work with url-requests. So I highly recommend using it
# unless you are aiming for complex connections & url-processing.
for elem in records.data:
if LOWER_LIMIT <= elem.vitals.bloodPressureDiastole <= UPPER_LIMIT:
counter["normal_pressure"] += 1
print(counter)

Why must use DataParallel when testing？

Train on the GPU, num_gpus is set to 1：
device_ids = list(range(num_gpus))
model = NestedUNet(opt.num_channel, 2).to(device)
model = nn.DataParallel(model, device_ids=device_ids)
Test on the CPU：
model = NestedUNet_Purn2(opt.num_channel, 2).to(dev)
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
model_old = torch.load(path, map_location=dev)
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
This will get the correct result, but when I delete：
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
the result is wrong.

nn.DataParallel wraps the model, where the actual model is assigned to the module attribute. That also means that the keys in the state dict have a module. prefix.
Let's look at a very simplified version with just one convolution to see the difference:
class NestedUNet(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
model = NestedUNet()
model.state_dict().keys() # => odict_keys(['conv1.weight', 'conv1.bias'])
# Wrap the model in DataParallel
model_dp = nn.DataParallel(model, device_ids=range(num_gpus))
model_dp.state_dict().keys() # => odict_keys(['module.conv1.weight', 'module.conv1.bias'])
The state dict you saved with nn.DataParallel does not line up with the regular model's state. You are merging the current state dict with the loaded state dict, that means that the loaded state is ignored, because the model does not have any attributes that belong to the keys and instead you are left with the randomly initialised model.
To avoid making that mistake, you shouldn't merge the state dicts, but rather directly apply it to the model, in which case there will be an error if the keys don't match.
RuntimeError: Error(s) in loading state_dict for NestedUNet:
Missing key(s) in state_dict: "conv1.weight", "conv1.bias".
Unexpected key(s) in state_dict: "module.conv1.weight", "module.conv1.bias".
To make the state dict that you have saved compatible, you can strip off the module. prefix:
pretrained_dict = {key.replace("module.", ""): value for key, value in pretrained_dict.items()}
model.load_state_dict(pretrained_dict)
You can also avoid this issue in the future by unwrapping the model from nn.DataParallel before saving its state, i.e. saving model.module.state_dict(). So you can always load the model first with its state and then later decide to put it into nn.DataParallel if you wanted to use multiple GPUs.

You trained your model using DataParallel and saved it. So, the model weights were stored with a module. prefix. Now, when you load without DataParallel, you basically are not loading any model weights (the model has random weights). As a result, the model predictions are wrong.
I am giving an example.
model = nn.Linear(2, 4)
model = torch.nn.DataParallel(model, device_ids=device_ids)
model.state_dict().keys() # => odict_keys(['module.weight', 'module.bias'])
On the other hand,
another_model = nn.Linear(2, 4)
another_model.state_dict().keys() # => odict_keys(['weight', 'bias'])
See the difference in the OrderedDict keys.
So, in your code, the following three-line works but no model weights are loaded.
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
Here, model_dict has keys without the module. prefix but pretrained_dict has when you do not use DataParalle. So, essentially pretrained_dict is empty when DataParallel is not used.
Solution: If you want to avoid using DataParallel, or you can load the weights file, create a new OrderedDict without the module prefix, and load it back.
Something like the following would work for your case without using DataParallel.
# original saved file with DataParallel
model_old = torch.load(path, map_location=dev)
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in model_old.items():
name = k[7:] # remove `module.`
new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)

Is it possible to access a solver propertiese through Pycaffe?

Is it possible to read and access the solver properties in Pycaffe?
I need to use some of the information stored in the solver file, but apparently the solver object which is created using
import caffe
solver = caffe.get_solver(solver_path)
is of no use in this case. Is there any other way to get around this problem?

I couldn't find what I was after using solver object and I ended up writing a quick function to get around this issue:
def retrieve_field(solver_path, field=None):
'''
Returns a specific solver parameter value using the specified field
or the whole content of the solver file, when no field is provided.
returns:
a string, as a field value or the whole content as list
'''
lines = []
field_segments = []
with open(solver_path, 'r') as file:
for line in file:
line = line.strip()
lines.append(line)
field_segments = line.split(':')
if (field_segments[0] == field):
#if that line contains # marks (for comments)
if('#' in field_segments[-1]):
idx = field_segments[-1].index('#')
return field_segments[-1][0:idx]
else:
return field_segments[-1]
return lines

EOF Error During Dict Slice

I am trying to compile monthly data in to an existing JSON file that I loaded via import json. Initially, my json data just had one property which is 'name':
json_data['features'][1]['properties']
>>{'name':'John'}
But the end result with the monthly data I want is like this:
json_data['features'][1]['properties']
>>{'name':'John',
'2016-01': {'x1':0, 'x2':0, 'x3':1, 'x4':0},
'2016-02': {'x1':1, 'x2':0, 'x3':1, 'x4':0}, ... }
My monthly data are on separate tsv files. They have this format:
John 0 0 1 0
Jane 1 1 1 0
so I loaded them via import csv and parsed through a list of urls and set about placing them in a collective dictionary like so:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for i in strings:
with open(i) as f:
tsv_object = csv.reader(f, delimiter='\t')
collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
I checked how things turned out by slicing collective_dict like so:
collective_dict['2016-01']['John'][0]
>>'0'
Which is correct; it just needs to be cast into an integer.
For my next feat, I attempted to assign all of the monthly data to the respective json members as part of their external properties:
for i in file_strings:
for j in range(len(json_data['features'])):
json_data['features'][j]['properties'][i[:-4]] = {}
json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
Here I got an arrow pointing at the last few characters:
Syntax Error: unexpected EOF while parsing
It is a pretty complicated slice, I suppose user error is not to be ruled out. However, I did double and triple check things. I also looked up this error. It seems to come up with input() related calls. I'm left a bit confused, I don't see how I made a mistake (although I'm already mentally prepared to accept that).
My only guess was that something somewhere was not a string. When I checked collective_dict and json_data, everything that was supposed to be a string was a string ('John', 'Jane' et all). So, I guess it's something else.
I made the problem as simple as I could while keeping the original structure of the data and for loops and so forth. I'm using Python 3.6.
Question
Why am I getting the EOF error? How can I build my external properties data without encountering such an error?

Here I have rewritten your last code block to:
for i in file_strings:
file_name = i[:-4]
for j in range(len(json_data['features'])):
name = json_data['features'][j]['properties']['name']
file_dict = json_data['features'][j]['properties'][file_name] = {}
for x in range(4):
x_string = 'x{}'.format(x+1)
file_dict[x_string] = int(collective_dict[file_name][name][x])
from:
for i in file_strings:
for j in range(len(json_data['features'])):
json_data['features'][j]['properties'][i[:-4]] = {}
json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
That is just to make it a bit more readable, but that shouldn't change anything.
A thing I noticed in your other part of code is the following:
collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
The thing I refer to is the = rows[0]:rows[1:5] for rows in tsv_object part. In my IDE, that does not work, and I'm not sure if that is a typo in your question or of that is actually in your code, but I imagine you want it to actually be
collective_dict[i[:-4]] = {rows[0]:rows[1:5] for rows in tsv_object}
or something like that. I'm not sure if that could confuse the parser think that there is an error at the end of the file.
The ValueError: Invalid literal for int()
If your tsv-data is
John 0 0 1 0
Jane 1 1 1 0
Then it should be no problem to do int() of the string value. E.g.: int('42') will become an int with value 42. However, if you have an error in one, or several, lines of your files, then use something like this block of code to figure out which file and line it is:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for file_name in file_strings:
print('Reading {}'.format(file_name))
with open(file_name) as f:
tsv_object = csv.reader(f, delimiter='\t')
for line_no, (name, *x_values) in enumerate(tsv_object):
if len(x_values) != 4:
print('On line {}, there is only {} values!'.format(line_no, len(x_values)))
try:
intx = [int(x) for x in x_values]
except ValueError as e:
# Catch "Invalid literal for int()"
print('Line {}: {}'.format(line_no, e))

Python Return multiple Variables from a function

Not so much of a question as a valuable observation for people using python.
Unlike the majority of other programming languages, you can return multiple variable from a function
without dealing with objects, lists etc.
simply put
return ReturnValue1, ReturnValue2, ReturnValue3
to return however many you wish.
and to retrieve them:
ReturnValue1, ReturnValue2, ReturnValue3 = functionName(parameters)
But remember to do it in order just like assigning a parameter for a function.
Cheers

As I am not able to just comment on your "Question", I have to put this into an answer:
To be precise, the return-value will be a tuple. So technically you are not returning multiple values, but an instance of the class tuple, containing those exact values. This provides the opportunity to receive those values in quite a lot of different ways:
def f():
return 1, 2, 3
one, two, three = f() # one = 1, two = 2, three = 3
all_three_values = f() # all_three_values = (1, 2, 3)
a, *b = f() # a = 1, b = [2, 3]
assert isinstance(all_three_values, tuple) # True

This may help you:
lst = ['a','s','d','f','w','e']
def list2variables(lst):
di = {}
for i in range(len(lst)): # capture index to form dict
di[lst[i]]=lst[i]
for k, v in di.items(): # move elements as global variable
globals()[k] = v
return globals() # return so that can be used in future
little change can be done in function to take dict as argument too

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

how to chunk a csv (dict)reader object in python 3.2? - csv

Related

Count the number of people having a property bounded by two numbers

Why must use DataParallel when testing？

Is it possible to access a solver propertiese through Pycaffe?

EOF Error During Dict Slice

Python Return multiple Variables from a function

Categories

Resources