I have a txt file with the following structure:
I also want to add to the end of each long line, the data (after the comma) of the short lines above them, without the description (STN_NO, STN_ID, INST_HT), like this:
Is it possible? Any ideas?
P.S. I am using Python Version 3.3.
Alternatively, you could use a simpler (albeit longer) solution that does not involve regex.
with open('file.txt') as f:
    for line in f:
        line = line.replace('\n', '')
        if 'STN_NO' in line:
            stn_no = line.split(',')[-1]
            print(line)
        elif 'STN_ID' in line:
            stn_id = line.split(',')[-1]
            print(line)
        elif 'INST_HT' in line:
            inst_ht = line.split(',')[-1]
            print(line)
        else:
            # long line: drop its last character and append the stored values
            print(line[:-1] + ',' + stn_no + ',' + stn_id + ',' + inst_ht)
Note that this puts the semicolon from the INST_HT line back at the end of every long line. If not desired, it can be removed with inst_ht[:-1].
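The same idea can be wrapped in a generator that accepts any iterable of lines, which makes it easy to try without a file. This is only a sketch under assumptions: the name append_station_data and the sample data line 'PT_NO, 1, 2.5;' are made up for illustration, and it assumes every long line ends with a semicolon, as the line[:-1] above suggests.

```python
def append_station_data(lines):
    """Yield each line, appending the current header values to data lines."""
    headers = ('STN_NO', 'STN_ID', 'INST_HT')
    values = {}
    for raw in lines:
        line = raw.rstrip('\n')
        key = line.split(',')[0].strip()
        if key in headers:
            # Remember the value after the comma, without spaces or ';'
            values[key] = line.split(',')[-1].strip().rstrip(';')
            yield line
        else:
            # Assumption: data lines end with ';', which is moved to the end
            extra = ','.join(values.get(h, '') for h in headers)
            yield line.rstrip(';') + ',' + extra + ';'


sample = [
    'STN_NO, 41943043\n',
    'STN_ID, KAST\n',
    'INST_HT, 1.01500;\n',
    'PT_NO, 1, 2.5;\n',  # hypothetical long line
]
for out in append_station_data(sample):
    print(out)
```

Keeping the values in a dict means a block with a missing header line produces an empty field instead of a NameError.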
Let's assume this simplified version of the file in your image:
STN_NO, 41943043
STN_ID, KAST
INST_HT, 1.01500;
Line 1
Line 2
Line 3
STN_NO, 41943062
STN_ID, S2
INST_HT, 0.75;
Line 4
Line 5
Line 6
STN_NO, 123456
STN_ID, XXX
INST_HT, 0.99;
Line 7
Line 8
Line 9
You can use a regex to capture the pattern in blocks and combine:
import re
pat = re.compile(r'^STN_NO,\s+([^\n]+)$\s*^STN_ID,\s+([^\n]+)$\s*^INST_HT,\s+([^;]+);\s*(.*?)(?=^STN_NO|\Z)', re.S | re.M)
with open(fn) as f:  # fn holds the path to the input file
    txt = f.read()
for mg in pat.finditer(txt):
    for line in mg.group(4).splitlines():
        print(line + ',' + ','.join([mg.group(1), mg.group(2), mg.group(3)]))
Prints:
Line 1,41943043,KAST,1.01500
Line 2,41943043,KAST,1.01500
Line 3,41943043,KAST,1.01500
Line 4,41943062,S2,0.75
Line 5,41943062,S2,0.75
Line 6,41943062,S2,0.75
Line 7,123456,XXX,0.99
Line 8,123456,XXX,0.99
Line 9,123456,XXX,0.99
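If your file is too big to read into memory at once, use mmap to map it and run the search over the mapping instead. The mapping is bytes-like, so the pattern then has to be compiled from a bytes literal.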
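A rough sketch of the mmap variant follows. The function name iter_merged_lines is made up for this example, the file is assumed to use '\n' line endings and UTF-8 text, and mmap raises ValueError on an empty file, which the sketch does not handle.

```python
import mmap
import os
import re
import tempfile

# Same pattern as above, compiled from bytes so it can search an mmap
pat = re.compile(
    rb'^STN_NO,\s+([^\n]+)$\s*^STN_ID,\s+([^\n]+)$\s*'
    rb'^INST_HT,\s+([^;]+);\s*(.*?)(?=^STN_NO|\Z)',
    re.S | re.M)


def iter_merged_lines(path):
    """Yield each long line of `path` with its block's header values appended."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for mg in pat.finditer(mm):
                suffix = b',' + b','.join(mg.group(1, 2, 3))
                for line in mg.group(4).splitlines():
                    # Assumption: the file is UTF-8 (or plain ASCII) text
                    yield (line + suffix).decode('utf-8')


if __name__ == '__main__':
    data = (b'STN_NO, 41943043\nSTN_ID, KAST\nINST_HT, 1.01500;\n'
            b'Line 1\nLine 2\n')
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        for merged in iter_merged_lines(tmp.name):
            print(merged)
    finally:
        os.remove(tmp.name)
```

Only the pages the regex engine touches are loaded, but each matched block is still copied into memory by mg.group(4), so very long blocks remain costly.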
I am trying to scrape a website for information. I have saved the page I want to scrape as an .html file and opened it in Sublime Text, but some parts cannot be displayed in a prettified way. I have the same problem when I try to use BeautifulSoup; see the picture below (I cannot share the full code since it would disclose private information).
Just feed the HTML as a multiline string to a BeautifulSoup object and use soup.prettify(). That should work. However, prettify() indents by only one space per nesting level, so if you want a custom indent you can write a little wrapper like this:
def indentPrettify(soup, indent=4):
    # where indent is the desired number of spaces per level, as an int
    pretty_soup = str()
    previous_indent = 0
    # iterate over each line of a prettified soup
    for line in soup.prettify().split("\n"):
        # returns the index for the opening html tag '<'
        current_indent = str(line).find("<")
        # which also represents the number of spaces in the line's indentation
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
            # str.find() will equal -1 when no '<' is found. This means the line is some kind
            # of text or script instead of an HTML element and should be treated as a child
            # of the previous line. also, current_indent should never be more than previous + 1.
        previous_indent = current_indent
        pretty_soup += writeOut(line, current_indent, indent)
    return pretty_soup
def writeOut(line, current_indent, desired_indent):
    new_line = ""
    spaces_to_add = (current_indent * desired_indent) - current_indent
    if spaces_to_add > 0:
        for i in range(spaces_to_add):
            new_line += " "
    new_line += str(line) + "\n"
    return new_line
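If the nesting depth of every line can be read straight from its leading spaces, a single regular expression gives a similar result. This is a minimal sketch, not a replacement for the wrapper above: reindent is a name made up here, and it assumes its input is text shaped like the output of soup.prettify(), with one space per nesting level.

```python
import re


def reindent(pretty_text, indent=4):
    """Widen each line's leading run of spaces by a factor of `indent`."""
    # (?m)^( +) matches the leading spaces of every indented line
    return re.sub(r'(?m)^( +)', lambda m: m.group(1) * indent, pretty_text)


sample = '<html>\n <body>\n  <p>\n   text\n  </p>\n </body>\n</html>'
print(reindent(sample))
```

Unlike the wrapper, this also widens leading spaces inside multi-line text or script content, so it suits markup where that does not matter.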
def duty2():
    numbers = []
    while True:
        a = Input('Enter a new number, 0 to end: ')
        if a == 0:
            break
         numbers.append(a)
    if len(numbers)!=0:
        sums = 0
        for i in numbers:
            sums = sums + i
        average = float(sums) / len(numbers)
        print "The average of %s is %.2f" % (numbers, average)
    else:
        print "There is nothing to calculate."
I'm new to coding and couldn't solve the problem, please help.
I am getting this error: "IndentationError: unindent does not match any outer indentation level"
You have an extra space in front of the line that reads numbers.append(a).
When you run the code (I've thrown it into the file tmp.py), it'll tell you exactly which line is causing the issue. For example, when I run your code I get the following:
  File "tmp.py", line 8
    numbers.append(a)
                    ^
IndentationError: unindent does not match any outer indentation level
This tells me there is an indentation error, that it is on line 8, and it even shows the offending line itself.
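For reference, here is a version with consistent four-space indentation, written for Python 3 (the question's code uses Python 2 print statements). It is a sketch, not a drop-in fix: it assumes Input was meant to be the built-in input, converts the entered text with float, returns the message instead of printing it, and takes a hypothetical read parameter so the loop can be exercised without a keyboard.

```python
def duty2(read=input):
    """Read numbers until 0 is entered, then report their average."""
    numbers = []
    while True:
        a = float(read('Enter a new number, 0 to end: '))
        if a == 0:
            break
        numbers.append(a)  # same indentation level as the `if` above
    if numbers:
        average = sum(numbers) / len(numbers)
        return 'The average of %s is %.2f' % (numbers, average)
    return 'There is nothing to calculate.'


# Simulate a user typing 2, 4 and then 0
answers = iter(['2', '4', '0'])
print(duty2(lambda prompt: next(answers)))
# The average of [2.0, 4.0] is 3.00
```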
I have a function that takes a file as input, prints certain statistics, and also copies the file to a file name provided by the user. Here is my current code:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in infile:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
It prints the statistics of the file correctly; however, the copy it makes of the file is empty. If I remove the readlines() part and the statistics part, the function seems to make a copy of the file correctly. How can I correct my code so that it does both? It's a minor problem but I can't seem to get it.
The reason the file is blank is that
slist = infile.readlines()
is reading the entire contents of the file, so when it gets to
for line in infile:
there is nothing left to read and it just closes the newly truncated (mode w) file leaving you with a blank file.
I think the answer here is to change your for line in infile: to for line in slist:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        # slist already holds every line, so write the copy from it
        for line in slist:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))
copy_file('statistics')
Having said all that, consider whether it is worth using your own copy routine rather than shutil.copy. It is always better to delegate the task to your OS, as it will be quicker and probably safer (thanks to NightShadeQueen for the reminder)!
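A possible shape for the shutil.copy route, as a sketch: the function name copy_with_stats and the dictionary keys are invented for this example, and the file names are passed in as arguments rather than prompted for. The statistics follow the same formulas as the code above.

```python
import shutil


def copy_with_stats(src, dst):
    """Copy src to dst with shutil.copy and return line statistics for src."""
    shutil.copy(src, dst)
    with open(src) as infile:
        lines = infile.readlines()
    blank = lines.count('\n')
    total = sum(len(line) for line in lines)
    non_blank = len(lines) - blank
    return {
        'lines': len(lines),
        'blank': blank,
        # Guard against empty files to avoid dividing by zero
        'avg_chars': total / len(lines) if lines else 0.0,
        'avg_chars_non_blank': (total - blank) / non_blank if non_blank else 0.0,
    }
```

Because the copy no longer shares a file object with the statistics, the order of the two steps stops mattering.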
I am trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program and the sum of the lengths of all those lines. Here is my current program:
def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub('\s', '', l)), f))
    return(lines, num_chars)
The result I get if I do:
if __name__=="__main__":
    print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3,85)
Also, is there a better way of finding the sum of the lengths of the lines using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I am kind of new to Python, so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
First line has 18 characters (including white space)
Second line has 29 characters
Third line has 38 characters
[(1, 18), (1, 29), (1, 38)]
The line count is 85 characters including white spaces. I apologize, I mis-read the problem. The length total for each line should include the whitespaces as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1) filtering out blank lines, and summing the length of each line in turn, eg:
def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count
print(code_metric('cmtest.py'))
# (3, 85)
In order to count lines, maybe this code is cleaner:
with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-whitespace characters, then you forgot to remove '\t' and '\n'. If that's the case:
with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separate you can turn them into different functions and reuse them more easily. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.
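Returning to the question's request for map, filter, and reduce, one way to write it is sketched below. It assumes trailing whitespace should not count towards a line's length, which matches the expected (3, 85). It also reads the file only once: a file object is exhausted after the first pass over it, which is why the second pass in the question's code summed nothing.

```python
from functools import reduce


def code_metric(filename):
    """Return (number of non-empty lines, total length of those lines)."""
    with open(filename) as f:
        # Strip trailing whitespace, then drop lines that end up empty
        lines = list(filter(None, map(str.rstrip, f)))
    # reduce adds up the individual line lengths, starting from 0
    total = reduce(lambda acc, n: acc + n, map(len, lines), 0)
    return len(lines), total
```

Here reduce does the same job as sum(map(len, lines)); it is spelled out only because the question asked for it.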
How to import the following csv into a mysql table, using mysql commands?
##
#File name : proj.csv
#line 1 are the field headers
#record 1 starts at line 2, ends at line 583
#from line 2 "<!DOCTYPE" to line 582 "</html>" are actually text blob of
#record 1 's "html" field
##
line 1: "proj_name","proj_id","url","html","proj_dir"
line 2: "Autorun Virus Remover",1,"http://www.softpedia.com/get/Antivirus/Autorun-Virus-Remover.shtml","<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"" ""http://www.w3.org/TR/html4/loose.dtd"">
line 3: <html>
line 4: <head profile=""http://a9.com/-/spec/opensearch/1.1/"">
...
line 582: </html>
line 583: ","Antivirus/Autorun-Virus-Remover"
The trouble is that the target csv file has a text blob field (named "html", which contains text spanning multiple lines), so I can't use '\n' as the record separator, or it will say "Row 1 doesn't contain data for all columns". A thousand thanks!
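The layout described above (line breaks inside a quoted field, literal quotes written as "") is ordinary quoted CSV, so a parser that honours the quotes sees a single record. The Python sketch below only illustrates that structure with a shortened, made-up stand-in for proj.csv; it is not a MySQL import.

```python
import csv
import io

# A shortened stand-in for proj.csv: the "html" field spans several lines
# and uses doubled quotes ("") for literal quote characters.
sample = (
    '"proj_name","proj_id","url","html","proj_dir"\n'
    '"Autorun Virus Remover",1,'
    '"http://www.softpedia.com/get/Antivirus/Autorun-Virus-Remover.shtml",'
    '"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.01 Transitional//EN"">\n'
    '<html>\n'
    '</html>\n'
    '","Antivirus/Autorun-Virus-Remover"\n'
)

# newline='' leaves line breaks inside quoted fields for the csv module
rows = list(csv.reader(io.StringIO(sample, newline='')))
header, record = rows
print(len(rows), len(record))  # 2 5
print(record[3].count('\n'))   # 3
```

The header plus one record come back as two rows of five fields each, with the line breaks preserved inside the html field.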