Read and print from a text file N lines at a time using a generator only

Python 3.6.0
textfile = "f:\\mark\\python\\Alice_in_Wonderland.txt"
N = 60

def read_in_lines(file, n):
    with open(file) as fh:
        for i in range(n):
            nlines = fh.readline()
            if nlines:
                yield nlines
            else:
                break

for lines in read_in_lines(textfile, N):
    print(lines)
File is here: https://www.gutenberg.org/files/11/11.txt
My goal is to read in this file N lines at a time, then print the lines,
then read in the next N lines, print, repeat...
If N = 3, output should look like:
line1
line2
line3

line4
line5
line6

line7
line8
line9

line10 <-- assumes this is the last line in the file
The above print pattern should hold for any value of 'N'.
If 'N' = 4:
line1
line2
line3
line4

line5
line6
line7
line8
etc. You get the idea.
NO lists. No built in functions (islice, etc.).
I need to use a generator only.
Each iteration must yield a single string containing up to 'N' lines.
Two issues:
1) The above code returns 'N' lines, then stops. I assume I need to put the whole
thing in a loop, but I am unsure of how to proceed. (Newbie...)
2) The file contains A LOT of blank lines. Every single time I try to use strip()
or any of its variants, regardless of how big I make 'N', it only ever prints one line.
nlines = fh.readline().strip()  <-- adding in .strip()
With N = 6000 I get:
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
Process finished with exit code 0
If I get rid of .strip() I get all the lines, but not in the format I want.
I am on a Win 10 machine. In Notepad++ all of the end-of-line symbols are CRLF.
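A minimal sketch of one way to address both issues: an outer loop so the generator keeps going past the first N lines, and skipping the file's blank lines instead of breaking on them. The function name and chunking approach here are my own, not from the question:

```python
def read_n_lines(path, n):
    """Yield strings of up to n non-blank lines each, using only a generator."""
    with open(path) as fh:
        chunk = ""
        count = 0
        for line in fh:
            if not line.strip():      # skip the file's many blank lines
                continue
            chunk += line
            count += 1
            if count == n:
                yield chunk
                chunk = ""
                count = 0
        if chunk:                     # final partial chunk
            yield chunk

# Hypothetical usage; "11.txt" stands in for the Gutenberg file:
# for block in read_n_lines("11.txt", 3):
#     print(block)
```

No lists or islice are used; each yielded value is a single string of up to N lines, as required.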

Solved:

textfile = "f:\\mark\\python\\test.txt"

def read_n(file, x):
    with open(file, mode='r') as fh:
        while True:
            data = ''.join(fh.readline() for _ in range(x))
            if not data:
                break
            yield data
            print()

for nlines in read_n(textfile, 5):
    print(nlines.rstrip())
Output:
abc
123
def
456
ghi
789
jkl
abc
123
def
456
ghi
789
jkl
abc
123
def
456
ghi
789
jkl
abc
123
def
456
ghi
789
jkl


Sentence similarity

Can anyone explain how this line works?
X_set = {w for w in X_list if not w in sw}
I need to know why the variable w is used three times, and what each w refers to.
I've also posted my code below for further reference.
# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = " Ravi went to the market and buy 4 oranges and 2 apples in total how many fruits did Ravi buy"
Y = " Ram went to the shopping mall and buy 1pant and 5 shirts. how many clothes does Ram buy"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the string
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}
print(X_set)
print(Y_set)

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # print(w)
    if w in X_set:
        l1.append(1)  # create a vector
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)
    else:
        l2.append(0)

c = 0
# cosine formula
for i in range(len(rvector)):
    c += l1[i]*l2[i]
cosine = c / float((sum(l1)*sum(l2))**0.5)
print("similarity: ", cosine)
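As to the question asked above: the three w's in the set comprehension play three different roles, which the equivalent loop form makes explicit. The token and stopword lists below are a made-up toy example, not the NLTK output:

```python
X_list = ["Ravi", "went", "to", "the", "market"]   # toy token list
sw = ["to", "the"]                                 # toy stopword list

# X_set = {w for w in X_list if not w in sw} unrolls to:
X_set = set()
for w in X_list:        # 2nd w: the loop variable, bound to each token in turn
    if not w in sw:     # 3rd w: that same token, tested against the stopwords
        X_set.add(w)    # 1st w: that same token, added to the resulting set

print(X_set)
```

All three occurrences refer to the same loop variable; it is just mentioned once as the output expression, once as the binding, and once in the filter.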

Function does not return the list correctly

I have written code for adding the numbers from two different text files. For very big data (2-3 GB) I get a MemoryError. So I am writing new code using some functions, to avoid loading the whole data into memory.
This code opens an input file 'd.txt' and reads the numbers that follow some header lines in the bigger data, as follows:
SCALAR
ND 3
ST 0
TS 1000
1.0
1.0
1.0
SCALAR
ND 3
ST 0
TS 2000
3.3
3.4
3.5
SCALAR
ND 3
ST 0
TS 3000
1.7
1.8
1.9
and adds them to the numbers read from a smaller text file 'e.txt', as follows:
SCALAR
ND 3
ST 0
TS 0
10.0
10.0
10.0
The result is written in a text file 'output.txt' like this:
SCALAR
ND 3
ST 0
TS 1000
11.0
11.0
11.0
SCALAR
ND 3
ST 0
TS 2000
13.3
13.4
13.5
SCALAR
ND 3
ST 0
TS 3000
11.7
11.8
11.9
The code which I prepared:

def add_list_same(list1, list2):
    """
    list2 has the same size as list1
    """
    c = [a+b for a, b in zip(list1, list2)]
    print(c)
    return c

def list_numbers_after_ts(n, f):
    result = []
    for line in f:
        if line.startswith('TS'):
            for node in range(n):
                result.append(float(next(f)))
    return result

def writing_TS(f1):
    TS = []
    ND = []
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
        if line1.startswith('TS'):
            x = float(line1.split()[-1])
            TS.append(x)
    return TS, ND

with open('d.txt') as depth_dat_file, \
     open('e.txt') as elev_file, \
     open('output.txt', 'w') as out:
    m = writing_TS(depth_dat_file)
    print('number of TS', m[1])
    for j in range(0, int(m[1])-1):
        i = m[1]*j
        out.write('SCALAR\nND {0:2f}\nST 0\nTS {0:2f}\n'.format(m[1], m[0][j]))
        list1 = list_numbers_after_ts(int(m[1]), depth_dat_file)
        list2 = list_numbers_after_ts(int(m[1]), elev_file)
        Eh = add_list_same(list1, list2)
        out.writelines(["%.2f\n" % item for item in Eh])
the output.txt is like this:
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
The addition of lists does not work, although I checked the functions separately and they work. I can't find the error. I have changed it a lot, but it still does not work. Any suggestion? I really appreciate any help you can provide!
You can use grouper to read files by a fixed count of lines. The following code should work if the order of lines within the groups is unchanged.
from itertools import zip_longest

# Split-by-group iterator
# See http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

add_numbers = []
with open("e.txt") as f:
    # Read data by 7 lines
    for lines in grouper(f, 7):
        # Skip the first SCALAR line
        for line in lines[1:]:
            # Add the last number in every line to the array (6 elements)
            add_numbers.append(float(line.split()[-1].strip()))

# Template for every group
template = 'SCALAR\nND {:.2f}\nST {:.2f}\nTS {:.2f}\n{:.2f}\n{:.2f}\n{:.2f}\n'
with open("d.txt") as f, open('output.txt', 'w') as out:
    # As before
    for lines in grouper(f, 7):
        data_numbers = []
        for line in lines[1:]:
            data_numbers.append(float(line.split()[-1].strip()))
        # result_numbers sums the elements of the two arrays pairwise (6 elements)
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        # * unpacks result_numbers as the 6 arguments of format
        out.write(template.format(*result_numbers))
I had to change some small things in the code and now it works, but only for small input files, because many variables are loaded into memory. Can you please tell me how I can work with yield?
from itertools import zip_longest

def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

def writing_ND(f1):
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
            return ND

def writing_TS(f):
    for line2 in f:
        if line2.startswith('TS'):
            x = float(line2.split()[-1])
            TS.append(x)
    return TS

TS = []
ND = []
x = 0.0
n = 0
add_numbers = []

with open("e.txt") as f, open("d.txt") as f1,\
     open('output.txt', 'w') as out:
    ND = writing_ND(f)
    TS = writing_TS(f1)
    n = int(ND) + 4
    f.seek(0)
    for lines in grouper(f, int(n)):
        for item in lines[4:]:
            add_numbers.append(float(item))
    i = 0
    f1.seek(0)  # rewind: writing_TS consumed f1
    for l in grouper(f1, n):
        data_numbers = []
        for line in l[4:]:
            data_numbers.append(float(line.split()[-1].strip()))
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        del data_numbers
        out.write('SCALAR\nND %d\nST 0\nTS %0.2f\n' % (ND, TS[i]))
        i += 1
        for item in result_numbers:
            out.write('%s\n' % item)
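On the follow-up about yield: a sketch (assuming the fixed 7-line SCALAR block layout shown above; the function names are mine) that streams one block at a time, so memory stays proportional to one block rather than the whole file:

```python
def blocks(path, block_len=7):
    """Yield one SCALAR block at a time as a list of lines (O(block) memory)."""
    with open(path) as fh:
        block = []
        for line in fh:
            block.append(line)
            if len(block) == block_len:
                yield block
                block = []

def numbers(block):
    """Extract the three data values that follow the TS header line."""
    return [float(line) for line in block[4:]]

# Hypothetical usage: add the single block of e.txt to every block of d.txt.
# add = numbers(next(blocks("e.txt")))
# with open("output.txt", "w") as out:
#     for b in blocks("d.txt"):
#         out.writelines(b[:4])
#         out.writelines("%.2f\n" % (x + y) for x, y in zip(numbers(b), add))
```

Because blocks() is a generator, each iteration of the consuming loop holds only one 7-line block; nothing accumulates across blocks.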

How to sort .csv file values by combining like strings using python3

I'm still learning, so please bear with me. I've been trying to figure this out for some time now but have not found what I'm looking for.
My Product.csv file looks like this.
111 ; Info1 ; Description 1 ; Remarks1
123 ; Info1 ; Description 1 ; Remarks1
156 ; Info2 ; Description 2 ; Remarks2
124 ; Info3 ; Description 3 ; Remarks3
I would like to combine entries that are similar like this.
111, 123 ; Info1 ; Description 1 ; Remarks1
156 ; Info2 ; Description 2 ; Remarks2
124 ; Info3 ; Description 3 ; Remarks3
From here i can manipulate my csv file in Excel using vba to insert into a quotation.
This is what I would like to achieve using Python. I'm stumped on where to start. I think I need to start by opening the file and then reading the csv file. After that, assign variables to # (i.e. 111), Info, Description, Remarks. Then sort through the variables and combine like #'s. Then write it back to the file. Please let me know if you need me to clarify anything.
That's a task for itertools.groupby.
EDIT: I refactored the first version to improve readability.
# file group_by_trailing_py2.py
import os
import csv
from itertools import groupby

DELIM = ';'
IN_FILENAME = 'My Product.csv'
OUT_FILENAME = 'My Product.grouped.csv'

############ skip this if you run it against productive data ###############
DATA = '''111 ; Info1 ; Description 1 ; Remarks1
123 ; Info1 ; Description 1 ; Remarks1
156 ; Info2 ; Description 2 ; Remarks2
124 ; Info3 ; Description 3 ; Remarks3'''

if os.environ.get('WITH_DATA_GENERATION'):
    open(IN_FILENAME, 'w').write(DATA)
##############################################################################

keyfunc = lambda row: row[1:]
with open(IN_FILENAME) as csv_file:
    rows = sorted(csv.reader(csv_file, delimiter=DELIM), key=keyfunc)
it = map(lambda t: [", ".join(v[0].strip() for v in t[1]) + " "] + t[0],
         groupby(rows, key=keyfunc))
with open(OUT_FILENAME, 'w') as csv_file:
    writer = csv.writer(csv_file, delimiter=DELIM)
    for row in it:
        writer.writerow(row)
If run with
WITH_DATA_GENERATION=1 python3 group_by_trailing_py2.py
it produces My Product.grouped.csv with the content:
111, 123 ; Info1 ; Description 1 ; Remarks1
156 ; Info2 ; Description 2 ; Remarks2
124 ; Info3 ; Description 3 ; Remarks3
Because you have an existing workload, you will not set WITH_DATA_GENERATION, and you should delete the code between and including the '####...' comment lines.
I rewrote the solution by decltype_auto to make it more reusable:
import csv
import io
from itertools import groupby

def drop_first(row):
    """Return all but the first element."""
    return row[1:]

def make_line(group):
    """Create a text line from a group.

    Joins the grouped result with comma and adds the rest of
    the columns.
    """
    return [", ".join(val[0].strip() for val in group[1]) + " "] + group[0]

def open_path_or_fobj(fobj_or_path, mode='r'):
    """Open a file from a path or return the given file object."""
    if isinstance(fobj_or_path, str):
        return open(fobj_or_path, mode)
    return fobj_or_path

def make_combined(in_fobj_or_path, out_path, delim=';'):
    """Combine lines whose content matches in all but the first column into one line."""
    with open_path_or_fobj(in_fobj_or_path) as csv_file:
        rows = sorted(csv.reader(csv_file, delimiter=delim), key=drop_first)
    it = map(make_line, groupby(rows, key=drop_first))
    with open(out_path, 'w') as csv_file:
        writer = csv.writer(csv_file, delimiter=delim)
        for row in it:
            writer.writerow(row)

if __name__ == '__main__':

    def test_with_file():
        """Example for use with existing input file."""
        make_combined('My Product.csv', 'My Product.grouped.csv')

    def test_with_stringio():
        """Test with StringIO object as csv input."""
        data = '''111 ; Info1 ; Description 1 ; Remarks1
123 ; Info1 ; Description 1 ; Remarks1
156 ; Info2 ; Description 2 ; Remarks2
124 ; Info3 ; Description 3 ; Remarks3'''
        fobj_in = io.StringIO(data)
        make_combined(fobj_in, 'result.txt')
        data2 = '''111 # Info1 # Description 1 # Remarks1
123 # Info1 # Description 1 # Remarks1
156 # Info2 # Description 2 # Remarks2'''
        fobj_in = io.StringIO(data2)
        make_combined(fobj_in, 'result_delim.txt', delim='#')

    # Actually run it with the file.
    test_with_file()
I did a few things:
- Turned the lambda functions into normal functions with names and docstrings.
- Used io.StringIO to work with sample data defined in the source file as a string.
- Used if __name__ == '__main__': to allow importing it as a module and at the same time using it as a script.
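For readers new to groupby, the key point both versions rely on is that it only merges *adjacent* rows with equal keys, which is why each solution sorts with the same key function first. A small standalone illustration (the row data is a toy subset of the question's file):

```python
from itertools import groupby

rows = [
    ["111", " Info1 ", " Description 1 ", " Remarks1"],
    ["123", " Info1 ", " Description 1 ", " Remarks1"],
    ["156", " Info2 ", " Description 2 ", " Remarks2"],
]

# groupby only merges adjacent rows with equal keys, so real input must
# already be sorted by the same key function (row[1:] here).
for key, group in groupby(rows, key=lambda row: row[1:]):
    ids = ", ".join(row[0].strip() for row in group)  # "111, 123"
    print([ids + " "] + key)
```

If the Info1 rows were not adjacent, groupby would emit two separate groups for them instead of one.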

Writing to multiple columns (not rows) - Python

I am trying to write two lists to a csv file. I want the lists to feed vertically into the spreadsheet into two columns.
import csv
import os

name = "rr"
newname = name + ".csv"
rs = [1,2,3,4]
dr = [2,3,4,5]
with open(newname, 'w') as output:
    writer = csv.writer(output, lineterminator='\n')
    writer.writerow(rs)
    writer.writerow(dr)
I am getting this:
1 2 3 4
2 3 4 5
I want this:
1 2
2 3
3 4
4 5
Using your example lists you could do:

rs = [1,2,3,4]
dr = [2,3,4,5]
output = ""
for r in zip(rs, dr):
    output += str(r[0]) + " " + str(r[1]) + "\n"
# now write the output etc.

This is not complete (the writing step is missing), but I think what you actually wanted is what the zip built-in provides.
Explanation of what zip() does taken from official documentation:
This function returns a list of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables.
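To tie this back to the original csv-based attempt: a small sketch reusing the question's names that feeds the zipped pairs straight to csv.writer, so each pair becomes one row of two columns:

```python
import csv

name = "rr"
newname = name + ".csv"
rs = [1, 2, 3, 4]
dr = [2, 3, 4, 5]

with open(newname, 'w') as output:
    writer = csv.writer(output, lineterminator='\n')
    # zip pairs up the i-th elements, so writerows emits the lists as columns.
    writer.writerows(zip(rs, dr))
```

writerows accepts any iterable of rows, so the zip object can be passed directly without building an intermediate list.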

How to count instances of string in a tab separated value file?

How to count instances of strings in a tab-separated value (TSV) file?
The TSV file has hundreds of millions of rows, each of the form
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
How to count instances of each unique integer in the entire second column of the file, and ideally add the count as the fifth value in each row?
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2
I prefer a solution using only UNIX command line stream processing programs.
I'm not entirely clear what you want to do. Do you want to add 0/1 depending on the value of the second column as the fifth column or do you want to get the distribution of the values in the second column, total for the entire file?
In the first case, use something like awk -F'\t' '{ if($2 == valueToCheck) { c = 1 } else { c = 0 }; print $0 "\t" c }' < file.
In the second case, use something like awk -F'\t' '{ h[$2] += 1 } END { for(val in h) print val ": " h[val] }' < file.
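The question prefers UNIX stream tools, but for comparison, here is a two-pass Python sketch of the same idea (file names and function name are hypothetical): count each column-2 value with collections.Counter, then append the total to every row:

```python
from collections import Counter

def append_counts(in_path, out_path):
    """First pass counts column-2 values; second pass appends each row's count."""
    counts = Counter()
    with open(in_path) as fh:
        for line in fh:
            counts[line.split("\t")[1]] += 1
    with open(in_path) as fh, open(out_path, "w") as out:
        for line in fh:
            row = line.rstrip("\n")
            out.write("%s\t%d\n" % (row, counts[row.split("\t")[1]]))
```

Like the awk variants, this streams line by line; only the table of distinct column-2 values is held in memory.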
One solution using perl, assuming that the values of the second column are sorted; that is, once the value 2 is found, all lines with the same value are consecutive. The script keeps lines until it finds a different value in the second column, gets the count, prints them, and frees the memory, so it shouldn't cause a problem regardless of how big the input file is:
Content of script.pl:
use warnings;
use strict;

my (%lines, $count);

while ( <> ) {
    ## Remove last '\n'.
    chomp;

    ## Split line on spaces.
    my @f = split;

    ## Assume the line is malformed if it hasn't four fields and omit it.
    next unless @f == 4;

    ## Save lines in a hash until found a different value in second column.
    ## First line is special, because hash will always be empty.
    ## On the last line avoid reading the next one, otherwise I would lose lines
    ## saved in the hash.
    ## The hash will only have one key at a time.
    if ( exists $lines{ $f[1] } or $. == 1 ) {
        push @{ $lines{ $f[1] } }, $_;
        ++$count;
        next if ! eof;
    }

    ## At this point, the second field of the file has changed (or it is the last
    ## line), so I will print previous lines saved in the hash, remove them and
    ## begin saving lines with the new value.

    ## The value of the second column will be the key of the hash, get it now.
    my ($key) = keys %lines;

    ## Read each line of the hash and print it, appending the count of repeated
    ## lines as the last field.
    while ( @{ $lines{ $key } } ) {
        printf qq[%s\t%d\n], shift @{ $lines{ $key } }, $count;
    }

    ## Clear hash.
    %lines = ();

    ## Add current line to hash, initialize counter and repeat the whole process
    ## until end of file.
    push @{ $lines{ $f[1] } }, $_;
    $count = 1;
}
Content of infile:
foobar1 1 xxx yyy
foobar1 2 xxx yyy
foobar2 2 xxx yyy
foobar2 3 xxx yyy
foobar1 3 xxx zzz
Run it like:
perl script.pl infile
With following output:
foobar1 1 xxx yyy 1
foobar1 2 xxx yyy 2
foobar2 2 xxx yyy 2
foobar2 3 xxx yyy 2
foobar1 3 xxx zzz 2