I would like to parse a multiline text file having a content as
section1:
key1 val1
key2 val2
section2:
val1
val2
val3
section3:
section4:
somevalue
The header of the sections (section1, section2, ...) are defined. The goal is to read the values under the different sections. I'm getting in trouble with using the pyparsing module over several lines (the real problem is much more complex than this simple example).
When I use the following code, the parser expects on every line the full list of defined keywords:
# -*- coding: utf-8 -*-
from pyparsing import Literal, ZeroOrMore, LineEnd, ParseException
FileSyntax = None
def Grammar():
#section1:
section1 = Literal("section1:").suppress() + ZeroOrMore(LineEnd())
#section2:
section2 = Literal("section2:").suppress() + ZeroOrMore(LineEnd())
#section3:
section3 = Literal("section3:").suppress() + ZeroOrMore(LineEnd())
#section4:
section4 = Literal("section4:").suppress() + ZeroOrMore(LineEnd())
return section1 + section2 + section3 + section4
def parseFile(filename : str):
global FileSyntax
print("\nparse results:\n")
try:
TestFile = open(filename)
testdata = "".join( TestFile.readlines())
FileSyntax = Grammar()
FileSyntax.parseString(testdata)
except ParseException as err:
print(err.line)
print(" "*(err.column-1) + "^")
print("* " + str(err))
except Exception as e:
import traceback
traceback.print_exc(e)
parseFile("testdata.txt")
How can I make a stateful parsing (dependent on the different sections)? Thank you.
If you print out the grammar expression itself, you'll get something like:
{{{{Suppress:("section1:") [LineEnd]...} {Suppress:("section2:") [LineEnd]...}} {Suppress:("section3:") [LineEnd]...}} {Suppress:("section4:") [LineEnd]...}}
That is, you are parsing all the section headers, but not the body of the sections. So you are probably failing on the first line after 'section1:'.
Also, there is no need to call readlines() and then join everything back together. Just call TestFile.read(). Or even better, pathlib.Path(test_file_name).read_text()
Related
I was able to take a text file, read each line, create a dictionary per line, update(append) each line and store the json file. The issue is when reading the json file it will not read correctly. the error point to a storing file issue?
The text file looks like:
84.txt; Frankenstein, or the Modern Prometheus; Mary Wollstonecraft (Godwin) Shelley
98.txt; A Tale of Two Cities; Charles Dickens
...
import json
import re
path = "C:\\...\\data\\"
books = {}
books_json = {}
final_book_json ={}
file = open(path + 'books\\set_of_books.txt', 'r')
json_list = file.readlines()
open(path + 'books\\books_json.json', 'w').close() # used to clean each test
json_create = []
i = 0
for line in json_list:
line = line.replace('#', '')
line = line.replace('.txt','')
line = line.replace('\n','')
line = line.split(';', 4)
BookNumber = line[0]
BookTitle = line[1]
AuthorName = line[-1]
file
if BookNumber == ' 2701':
BookNumber = line[0]
BookTitle1 = line[1]
BookTitle2 = line[2]
AuthorName = line[3]
BookTitle = BookTitle1 + ';' + BookTitle2 # needed to combine title into one to fit dict format
books = json.dumps( {'AuthorName': AuthorName, 'BookNumber': BookNumber, 'BookTitle': BookTitle})
books_json = json.loads(books)
final_book_json.update(books_json)
with open(path + 'books\\books_json.json', 'a'
) as out_put:
json.dump(books_json, out_put)
with open(path + 'books\\books_json.json', 'r'
) as out_put:
'books\\books_json.json', 'r')]
print(json.load(out_put))
The reported error is: JSONDecodeError: Extra data: line 1 column 133
(char 132) - adding this is right between the first "}{". Not sure
how json should look in a flat-file format? The output file as seen on
an editor looks like: {"AuthorName": " Mary Wollstonecraft (Godwin)
Shelley", "BookNumber": " 84", "BookTitle": " Frankenstein, or the
Modern Prometheus"}{"AuthorName": " Charles Dickens", "BookNumber": "
98", "BookTitle": " A Tale of Two Cities"}...
I ended up changing the approach and used pandas to read the text and then spliting the single-cell input.
books = pd.read_csv(path + 'books\\set_of_books.txt', sep='\t', names =('r','t', 'a') )
#print(books.head(10))
# Function to clean the 'raw(r)' inoput data
def clean_line(cell):
...
return cell
books['r'] = books['r'].apply(clean_line)
books = books['r'].str.split(';', expand=True)
I am trying to webscrape some website for information. i have saved the page I want to scrape as .html file and have opened it with sublime text but there are some parts that cannot be displayed in a prettified way ; I have the same problem when trying to use beautifulsoup ; see picture below (I cannot really share full code since it's disclosing private info).
Just feed the HTML as a multiline string to BeautifulSoup object and use soup.prettify(). That should work. However beautifulsoup has default indentation to 2 spaces. So if you want custom indent you can writeup a little wrapper like this:
def indentPrettify(soup, indent=4):
# where desired_indent is number of spaces as an int()
pretty_soup = str()
previous_indent = 0
# iterate over each line of a prettified soup
for line in soup.prettify().split("\n"):
# returns the index for the opening html tag '<'
current_indent = str(line).find("<")
# which is also represents the number of spaces in the lines indentation
if current_indent == -1 or current_indent > previous_indent + 2:
current_indent = previous_indent + 1
# str.find() will equal -1 when no '<' is found. This means the line is some kind
# of text or script instead of an HTML element and should be treated as a child
# of the previous line. also, current_indent should never be more than previous + 1.
previous_indent = current_indent
pretty_soup += writeOut(line, current_indent, indent)
return pretty_soup
def writeOut(line, current_indent, desired_indent):
new_line = ""
spaces_to_add = (current_indent * desired_indent) - current_indent
if spaces_to_add > 0:
for i in range(spaces_to_add):
new_line += " "
new_line += str(line) + "\n"
return new_line
Complete Julia newbie here.
I'd like to do some processing on a CSV. Something along the lines of:
using CSV
in_file = CSV.Source('/dir/in.csv')
out_file = CSV.Sink('/dir/out.csv')
for line in CSV.eachline(in_file)
replace!(line, "None", "")
CSV.writeline(out_file, line)
end
This is in pseudocode, those aren't existing functions.
Idiomatically, should I iterate on 1:CSV.countlines(in_file)? Do a while and check something?
If all you want to do is replace a string in the line, you do not need any CSV parsing utilities. All you do is read the file line by line, replace, and write. So:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
out = open(outfile, "w+")
for line in readlines(infile)
newline = replace(line, "a", "b")
write(out, newline)
end
close(out)
This will replicate the pseudocode you have in your question.
If you need to parse and read the csv field by field, use the readcsv function in base.
data=readcsv(infile)
typeof(data) #Array{Any,2}
This will return the data in the file as a 2 dimensional array. You can process this data any way you want, and write it back using the writecsv function.
for i in 1:size(data,1) #iterate by rows
data[i, 1] = "This is " * data[i, 1] # Add text to first column
end
writecsv(outfile, data)
Documentation for these functions:
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.readcsv
http://docs.julialang.org/en/release-0.5/stdlib/io-network/?highlight=readcsv#Base.writecsv
There was an earlier question on this, but the asker was just overwriting their output and solved their own problem.
I'm using a subprocess.popen to read video information and write the output to a json. It works fine on MOST videos, but on others is returning an empty string on others - even though it runs fine from the command line. I tried it several times and am getting the data fine through the command line.
Here's the relevant part of the script:
out_prj.write('[')
for m, i in enumerate(files):
print i
out_prj.write('{"$type":"BatchProcessor.Job, BatchProcessor","Id":0,"Ver":1.02,"CurrentTask":0,"IsSelected":true,"TaskList":[')
f_name = os.path.basename(i[0])
f_json = out_folder + os.sep + "06_Output" + os.sep + os.path.basename(i[0]).split(".")[0] + ".json"
trans_f = out_folder + os.sep + "04_Video" + os.sep + os.path.basename(i[0]).split(".")[0] + "-tr.ts"
trans_f_out = out_folder + os.sep + "06_Output" + os.sep + os.path.basename(i[0]).split(".")[0] + "-tr-out.ts"
ffprobe = 'ffprobe.exe'
command = [ffprobe, '-v', 'quiet', '-print_format', 'json', '-show_format', '-show_streams', i[0]]
p = sp.Popen(command, stdout=sp.PIPE, stderr=sp.PIPE, shell=True)
out, err = p.communicate()
io = cStringIO.StringIO(out)
info = json.load(io)
print info
filea = open(f_json, 'w')
filea.write(json.dumps(info))
filea.close()
f = open(f_json)
b = json.load(f)
print b
#########################
###################
f_format = str(b['streams'][0]['codec_long_name'])
Your code ignores error messages (err variable). print err or don't redirect stderr to see them.
Unrelated: the json handling in your code is insane: most operations are redundant.
To save output of the subprocess to a file:
import os
from subprocess import check_call
f_json = os.path.join(out_folder, "06_Output",
os.path.splitext(f_name)[0] + ".json")
with open(f_json, 'wb', 0) as file:
check_call(command, stdout=file)
Note: shell=True is not necessary here. If subprocess can't find ffprobe.exe then specify the full path e.g. (use the path appropriate for your system):
ffprobe = r'C:\Program Files\Real\RealPlayer\RPDS\Tools\ffmpeg\ffprobe.exe'
Note: r'' -- a raw string literal is used to avoid doubling the backslashes.
I had a question concerning some basic transformations in Haskell.
Basically, I have a written Input file, named Input.md. This contains some markdown text that is read in my project file, and I want to write a few functions to do transformations on the text. After completing these functions under a function called convertToHTML, I have output the file as an .html file in the correct format.
module Main
(
convertToHTML,
main
) where
import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)
process :: String -> String
process s = head $ lines s
convertToHTML :: String -> String
convertToHTML str = do
x <- str
if (x == '#')
then "<h1>"
else return x
--convertToHTML x = map toUpper x
main = do
args <- getArgs -- command line args
let (infile,outfile) = (\(x:y:ys)->(x,y)) args
putStrLn $ "Input file: " ++ infile
putStrLn $ "Output file: " ++ outfile
contents <- readFile infile
writeFile outfile $ convertToHTML contents
So,
How would I read through my input file, and transform any line that starts with a # to an html tag
How would I read through my input file once more and transform any WORD that is surrounded by _word_ (1 underscore) to another html tag
Replace any Character with an html string.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text. Please if anybody has any suggestions. I've been working on this for 2 days straight and have a bunch of failed code to show for a couple of weeks and have a bunch of failed code to show it.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text.
Because they work on appropriate element collection. And they don't really "iterate"; you simply have to feed the appropriate data. Let's tackle the # problem as an example.
Our file is one giant String, and what we'd like is to have it nicely split in lines, so [String]. What could do it for us? I have no idea, so let's just search Hoogle for String -> [String].
Ah, there we go, lines function! Its counterpart, unlines, is also going to be useful. Now we can write our line wrapper:
convertHeader :: String -> String
convertHeader [] = [] -- that prevents us from calling head on an empty line
convertHeader x = if head x == '#' then "<h1>" ++ x ++ "</h1>"
else x
and so:
convertHeaders :: String -> String
convertHeaders = unlines . map convertHeader . lines
-- ^String ^[String] ^[String] ^String
As you can see the function first converts the file to lines, maps convertHeader on each line, and the puts the file back together.
See it live on Ideone
Try now doing the same with words to replace your formatting patterns. As a bonus exercise, change convertHeader to count the number of # in front of the line and output <h1>, <h2>, <h3> and so on accordingly.