I'm trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program, and the sum of the lengths of all those lines. I made an attempt and got the following code:
def code_metric(file_name):
    with open(file_name) as f:
        lines = f.read().splitlines()
    char_count = sum(map(len, map(str.strip, filter(None, lines))))
    return len(lines), char_count
I'm supposed to use the functionals map, filter, and reduce for this. I had asked the question previously and improved on my answer, but it's still giving the wrong result. Here is the link to the previous version of the question:
Old program code
When I run the file cmtest.py, which has the following content:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
the result should be
(3, 85)
but I keep getting:
(4, 85)
Another file to be tested is collatz.py, for which
the result should be:
(73, 2856)
but I keep getting:
(59, 2796)
Here is a link to the collatz.py file:
Collatz.py file link
Can anyone help me with correcting the code? I'm fairly new to Python, and any help would be great.
Try this:
def code_metric(file_name):
    with open(file_name) as f:
        lines = [line.rstrip() for line in f.readlines()]
    nonblanklines = [line for line in lines if line]
    return len(nonblanklines), sum(len(line) for line in nonblanklines)
Examples:
>>> code_metric('collatz.py')
(73, 2856)
>>> code_metric('cmtest.py')
(3, 85)
Discussion
I was able to achieve the desired result for collatz.py only by removing the trailing newline and trailing blanks off the end of the lines. That is done in this step:
lines = [line.rstrip() for line in f.readlines()]
The next step is to remove the blank lines:
nonblanklines = [line for line in lines if line]
We want to return the number of non-blank lines:
len(nonblanklines)
We also want to return the total number of characters on the non-blank lines:
sum(len(line) for line in nonblanklines)
Alternate Version for Large Files
This version does not require keeping the file in memory all at once:
def code_metric2(file_name):
    with open(file_name) as f:
        # iterate over the file object itself so only one line is in memory at a time
        lengths = [len(line) for line in (line.rstrip() for line in f) if line]
    return len(lengths), sum(lengths)
Alternate Version Using reduce
Python's creator, Guido van Rossum, wrote this about the reduce builtin:
So now reduce(). This is actually the one I've always hated most,
because, apart from a few examples involving + or *, almost every time
I see a reduce() call with a non-trivial function argument, I need to
grab pen and paper to diagram what's actually being fed into that
function before I understand what the reduce() is supposed to do. So
in my mind, the applicability of reduce() is pretty much limited to
associative operators, and in all other cases it's better to write out
the accumulation loop explicitly.
Accordingly, reduce is no longer a builtin in Python 3. For compatibility, though, it remains available in the functools module. The code below shows how reduce can be used for this particular problem:
from functools import reduce

def code_metric3(file_name):
    with open(file_name) as f:
        lengths = [len(line) for line in (line.rstrip() for line in f.readlines()) if line]
    # the initial value 0 keeps reduce from failing on a file with no non-blank lines
    return len(lengths), reduce(lambda x, y: x + y, lengths, 0)
Here is yet another version which makes heavier use of reduce:
from functools import reduce

def code_metric4(file_name):
    def fn(prior, line):
        nlines, length = prior
        line = line.rstrip()
        if line:
            nlines += 1
            length += len(line)
        return nlines, length
    with open(file_name) as f:
        nlines, length = reduce(fn, f.readlines(), (0, 0))
    return nlines, length
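All of these variants should agree with the earlier examples; for instance, with the cmtest.py shown above:
>>> code_metric4('cmtest.py')
(3, 85)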
Related
I have a dataset which is in a deque buffer, and I want to load random batches from this with a DataLoader. The buffer starts empty. Data will be added to the buffer before the buffer is sampled from.
self.buffer = deque([], maxlen=capacity)
self.batch_size = batch_size
self.loader = DataLoader(self.buffer, batch_size=batch_size, shuffle=True, drop_last=True)
However, this causes the following error:
File "env/lib/python3.8/site-packages/torch_geometric/loader/dataloader.py", line 78, in __init__
super().__init__(dataset, batch_size, shuffle,
File "env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 268, in __init__
sampler = RandomSampler(dataset, generator=generator)
File "env/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 102, in __init__
raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
Turns out that the RandomSampler class checks that num_samples is positive when it is initialised, which causes the error.
if not isinstance(self.num_samples, int) or self.num_samples <= 0:
    raise ValueError("num_samples should be a positive integer "
                     "value, but got num_samples={}".format(self.num_samples))
Why does it check for this here, even though RandomSampler does support datasets which change in size at runtime?
One workaround is to use an IterableDataset, but I want to use the shuffle functionality of DataLoader.
Can you think of a nice way to use a DataLoader with a deque? Much appreciated!
The problem here is neither the usage of deque nor the fact that the dataset is dynamically growable. The problem is that you are starting with a Dataset of size zero - which is invalid.
The easiest solution would be to just start with any arbitrary object in the deque and dynamically remove it afterwards.
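A minimal sketch of that idea (the placeholder value and the sizes here are just for illustration):

from collections import deque
from torch.utils.data import DataLoader

capacity, batch_size = 1000, 4
buffer = deque([0], maxlen=capacity)  # seed with a dummy element so len(buffer) > 0
loader = DataLoader(buffer, batch_size=batch_size, shuffle=True, drop_last=True)

# ... once real samples have been appended, drop the placeholder:
# buffer.popleft()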
I am building a multi-class image classifier.
There is a debugging trick of overfitting on a single batch to check whether there are any deeper bugs in the program.
How can the code be designed so that this check can be done in a portable way?
One arduous and not very smart way is to build a holdout train/test folder for a small batch, where the test set consists of two distributions, seen data and unseen data; if the model performs better on the seen data and poorly on the unseen data, then we can conclude that the network doesn't have any deeper structural bug.
But this does not seem like a smart or portable approach, since it has to be redone for every problem.
Currently, I have a dataset class where I am partitioning the data into train/dev/test in the way shown below:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_equal_into_val_test(csv_file=None, stratify_colname='y',
                              frac_train=0.6, frac_val=0.15, frac_test=0.25,
                              ):
    """
    Split a Pandas dataframe into three subsets (train, val, and test).

    Follows the fractional ratios provided by the user, where the val and
    test sets have the same number of each class while the train set has
    the remaining ones.

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val : float
    frac_test : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """
    df = pd.read_csv(csv_file).iloc[:, 1:]
    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df.columns:
        raise ValueError('%s is not a column in the dataframe' %
                         (stratify_colname))
    df_input = df
    no_of_classes = 4
    sfact = int((0.1 * len(df)) / no_of_classes)

    # Shuffling the data frame
    df_input = df_input.sample(frac=1)

    df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
    df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
    df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
    df_temp_4 = df_input[df_input['labels'] == 4][:sfact]

    dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
    dev_test_y = dev_test_df['labels']

    # Split the temp dataframe into val and test dataframes.
    df_val, df_test, dev_Y, test_Y = train_test_split(
        dev_test_df, dev_test_y,
        stratify=dev_test_y,
        test_size=0.5,
    )

    df_train = df[~df['img'].isin(dev_test_df['img'])]
    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)
    return df_train, df_val, df_test
def train_val_to_ids(train, val, test, stratify_columns='labels'):  # noqa
    """
    Convert the stratified dataset into a dictionary of partition['train'] and labels.

    To generate the parallel code according to
    https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_columns : The label column

    Returns
    -------
    partition, labels :
        partition dictionary containing train and validation ids and label dictionary containing ids and their labels  # noqa
    """
    train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list()  # noqa
    partition = {"train_set": train_list,
                 "val_set": val_list,
                 }
    labels = dict(zip(train.img, train.labels))
    labels.update(dict(zip(val.img, val.labels)))
    return partition, labels
P.S. - I know about PyTorch Lightning and that it has an overfitting feature which can be used easily, but I don't want to move to PyTorch Lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modified it from
def __len__(self):
    return len(self.data_list)
to
def __len__(self):
    return 20
It will only output the first 20 elements in the dataset (regardless of shuffle). You only need to change one line of code and the rest should work just fine, so I think it's pretty neat.
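For illustration, here is a minimal sketch of the trick in context (the dataset class and sizes are made up):

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data_list):
        self.data_list = data_list

    def __getitem__(self, index):
        return self.data_list[index]

    def __len__(self):
        # return len(self.data_list)  # normal behaviour
        return 20  # restrict the dataset to its first 20 elements to overfit

loader = DataLoader(MyDataset(list(range(100))), batch_size=4, shuffle=True)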
I have a csv which looks like this.
Jc,TXF,timer,alpha,beta
15,44,55,12,33
18,87,33,111
9,87,61,29,77
Alpha and beta combined make up a city code. I want to add the name of the city to the CSV as a new column:
Jc,TXF,timer,alpha,beta,city
15,44,55,12,33,York
18,87,33,111,London
9,87,61,29,77,Sydney
I have another CSV with only the columns alpha, beta, and city, which looks like this:
alpha,beta,city
12,33,York
33,111,London
29,77,Sydney
How can I achieve this using Apache NiFi? Please suggest the processors and the workflow needed to achieve this.
I see two ways of solving this.
First, by using a CsvLookupService. However, the CsvLookupService only supports a single key, but you have two, alpha and beta. So to use this solution you have to concatenate both keys into a single key, like 12_33.
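For that approach, the mapping file might then look like this (an assumed layout, derived from the data above):

key,city
12_33,York
33_111,London
29_77,Sydney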
Second, by using the ExecuteScript processor. This one is better because you don't have to modify your source data. The strategy:
Split the CSV text into lines
Enrich each line with the city column by looking up the alpha and beta keys in the mapping file
Merge the individual lines into a single CSV file.
Overall flow: GenerateFlowFile -> SplitText -> ExecuteScript -> MergeContent.
For SplitText, set the header line count to 1 to include the header line in the split content. For the ExecuteScript processor, set python as the scripting engine and provide the following script body:
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
import csv

# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # fetch the mapping CSV file
        with open('/home/nifi/mapping.csv', 'r') as mapping:
            # read the mapping file
            mappingContent = csv.reader(mapping, delimiter=',')
            # flowfile content is CSV text with two lines, header and actual content;
            # split by newline to get access to each individual line
            lines = IOUtils.toString(inputStream, StandardCharsets.UTF_8).split('\n')
            # the result will contain the header line with the additional city column
            result = lines[0] + ',city\n'
            # take the second line and split it
            # to get access to the alpha, beta and city values
            lineSplit = lines[1].split(',')
            # Go through the mapping file
            # item[0] -> alpha
            # item[1] -> beta
            # item[2] -> city
            # See if you find alpha and beta on the line content
            matched = False
            for item in mappingContent:
                if item[0] == lineSplit[3] and item[1] == lineSplit[4]:
                    result += lines[1] + ',' + item[2]
                    matched = True
                    break
            # fail the flowfile if no mapping entry matched this line
            if not matched:
                raise Exception('No matching found.')
            outputStream.write(bytearray(result.encode('utf-8')))
# end class

flowFile = session.get()
if flowFile is not None:
    try:
        flowFile = session.write(flowFile, PyStreamCallback())
        session.transfer(flowFile, REL_SUCCESS)
    except Exception as e:
        session.transfer(flowFile, REL_FAILURE)
See the comments for a detailed description of the script. /home/nifi/mapping.csv has to be available on your NiFi instance. If you want to learn more about the ExecuteScript processor, refer to the ExecuteScript Cookbook. Finally, you merge all the lines into a single CSV file with MergeContent:
Set the CSV reader and writer and leave their default properties. Adjust the MergeContent properties to control how many lines should be in each resulting CSV file. The result is the original CSV enriched with the city column, as shown at the top of the question.
I have started a Python class and my book does not seem to help me.
My professor has a program that bombards my code with different inputs, and if any of the inputs do not work then my code is "wrong". I have done many days' worth of editing and am at a complete loss. I have the code working if someone puts in an input of an actual number, but where my code fails the test is when the input is "miles_to_laps(26)"; then it errors out.
I have tried changing the input to int(input()) but that does not fix the issue. I've gone through changing variables and even changing the input method but am still at a loss. I have already tried contacting my teacher, but after 6 days of no response and 3 days of being late, I feel like I'm just going nowhere.
user_miles = int(input())

def miles_to_laps(user_miles):
    x = user_miles
    y = 4
    x2 = x * y
    result = print('%0.2f' % float(x2))
    return result

miles_to_laps(user_miles)
My code works for real number inputs, but my professor wants inputs like miles_to_laps(26) and miles_to_laps(13) to create the same outputs.
For the weird input functionality you can try:
import re

def parse_function_text(s):
    try:
        return re.search(r"miles_to_laps\((.+)\)", s)[1]
    except TypeError:
        return None

def accept_input(user_input):
    desugar = parse_function_text(user_input)
    if desugar is not None:
        user_input = desugar
    try:
        return float(user_input)
    except ValueError:
        raise ValueError("Cannot process input %s" % user_input)

assert accept_input("miles_to_laps(3.5)") == 3.5
I'm trying to keep all the pedantry aside, but what kind of CS/programming teaching is that?
Areas of concern:
separate user input from rest of code
separate output formatting from function output
the code inside miles_to_laps is excessive
Now here is the code to try:
LAPS_PER_MILE = 4

# the only calculation, a "pure" function
def miles_to_laps(miles):
    return LAPS_PER_MILE * miles

# sorting out valid vs invalid input, the "interface"
def accept_input(user_input):
    try:
        return float(user_input)
    except ValueError:
        raise ValueError("Cannot process input %s" % user_input)

if __name__ == "__main__":
    # running the program
    laps = miles_to_laps(accept_input(input()))
    print('%0.2f' % laps)
Hope this is not too overwhelming.
Update: second attempt
MILE = 1609.34  # meters per mile
LAP = 400  # track lap, in meters
LAPS_PER_MILE = MILE / LAP

def miles_to_laps(miles):
    return LAPS_PER_MILE * miles

def laps_coerced(laps):
    return '%0.2f' % laps

def accept_input(user_input):
    try:
        return float(user_input)
    except ValueError:
        raise ValueError("Cannot process input %s" % user_input)

def main(user_input_str):
    miles = accept_input(user_input_str)
    laps = miles_to_laps(miles)
    print(laps_coerced(laps))

if __name__ == "__main__":
    main(input())
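For example, assuming the second attempt is saved as laps.py (the filename is just for illustration), a run would look like:

$ echo 26 | python laps.py
104.61

since 26 * 1609.34 / 400 is about 104.61.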
I'm trying to read a binary file and match number(s) in each of its records.
If the number matches then the record is to be copied to another file.
The number should be present between the 24th and 36th byte of each record.
The script takes the numbers as arguments. Here's the script I'm using:
#!/usr/bin/env python
# search.py
import sys
import re
import glob
import os
import binascii

list = sys.argv[1:]
list.sort()
rec_len = 452
filelist = glob.glob(os.getcwd() + '*.bin')
print('Input File : %s' % filelist)
for file in filelist:
    outfile = file + '.out'
    f = open(file, "rb")
    g = open(outfile, "wb")
    for pattern in list:
        print pattern
        regex_search = re.compile(pattern).search
        while True:
            buf = f.read(rec_len)
            if len(buf) == 0:
                break
            else:
                match = regex_search(buf)
                match2 = buf.find(pattern)
                #print match
                #print match2
                if ((match2 != -1) | (match != None)):
                    g.write(buf)
    f.close()
    g.close()
print("Done")
I'm running it like:
python search.py 1234 56789
I'm using python 2.6.
The code is not matching the number.
I also tried using binascii to convert the number to binary before matching but even then it didn't return any record.
If I give a string it works correctly, but if I give a number as the argument it doesn't match.
Where am I going wrong?
You are depleting the file buffer by reading all the bytes while checking for the first pattern. Hence, the second pattern will never be matched (an attempt will not even be made), because you've already reached the end of the file by reading all the records during the for-loop iteration for the first pattern.
That means that if your first pattern is nowhere to be found in those records, your script will not give you any output.
Consider adding f.seek(0) after the while loop, so the file is rewound before the next pattern is tried, or changing the order of the two loop constructs so that you first read a record from the file and then match the regex for each of the patterns in the argument list, as sketched below.
Also, try not to shadow the Python builtins by using list as the name of an array. It won't give you problems in the code you've shown, but it's definitely something you should be aware of.
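Here is a minimal sketch of that restructuring (it also renames list to patterns; the plain substring check is used for brevity):

import sys
import glob
import os

patterns = sys.argv[1:]  # renamed so the builtin 'list' is not shadowed
patterns.sort()
rec_len = 452

for file in glob.glob(os.getcwd() + '*.bin'):
    f = open(file, "rb")
    g = open(file + '.out', "wb")
    while True:
        buf = f.read(rec_len)
        if len(buf) == 0:
            break
        # check every pattern against this record before reading the next one
        for pattern in patterns:
            if buf.find(pattern) != -1:
                g.write(buf)
                break  # write each matching record only once
    f.close()
    g.close()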