How to extract only the child bookmarks from a PDF using Python

I need to extract only the child bookmarks from the following example.
Example:
Entire Return
    Federal
        Mailing Sheet to Taxpayer - Federal
        Client's Copy Cover Sheet
        Two-Year Comparison Worksheet
Here I need to extract only the last three lines, i.e. the deepest child bookmarks.
I have tried the code below to get the bookmarks from the PDF. It works, but it extracts the whole bookmark tree.
import PyPDF2

def show_tree(bookmark_list, indent=0):
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call with increased indentation
            show_tree(item, indent + 4)
        else:
            print(" " * indent + item.title)

reader = PyPDF2.PdfFileReader("sample.pdf")
show_tree(reader.getOutlines())
I expect to extract only the child bookmarks, with their respective page numbers, from the PDF.
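An untested sketch of what that could look like, assuming the same PyPDF2 1.x API as the snippet above: getOutlines() returns a flat list in which a nested list follows its parent item, so an item is a leaf exactly when the next element is not a list, and getDestinationPageNumber() (available in recent PyPDF2 1.x releases) maps a bookmark to its zero-based page index.
import PyPDF2

def leaf_bookmarks(bookmark_list):
    # Yield only the deepest bookmarks: an item is a leaf when it is
    # not immediately followed by a nested list of its children.
    for index, item in enumerate(bookmark_list):
        if isinstance(item, list):
            for leaf in leaf_bookmarks(item):
                yield leaf
        elif index + 1 >= len(bookmark_list) or not isinstance(bookmark_list[index + 1], list):
            yield item

reader = PyPDF2.PdfFileReader("sample.pdf")
for bookmark in leaf_bookmarks(reader.getOutlines()):
    # getDestinationPageNumber() returns a zero-based page index
    print(bookmark.title, reader.getDestinationPageNumber(bookmark))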

Related

How can I write certain sections of text from different lines to multiple lines?

So I'm currently trying to use Python to transform large amounts of data from a .txt file into a neat and tidy .csv file. The first stage is trying to get the 8-digit company numbers into one column called 'Company numbers'. I've created the header and just need to put each company number from each line into the column. What I want to know is: how do I tell my script to read the first eight characters of each line in the .txt file (which correspond to the company number) and then write them to the .csv file? This is probably very simple but I'm only new to Python!
So far, I have something which looks like this:
import csv

with open(r'C:/Users/test1.txt') as rf:
    with open(r'C:/Users/test2.csv', 'w', newline='') as wf:
        outputDictWriter = csv.DictWriter(wf, ['Company number'])
        outputDictWriter.writeheader()
        rf = rf.read(8)
        for line in rf:
            wf.write(line)
My recommendation would be to 1) read the file in, 2) make the relevant transformation, and then 3) write the results to a file. I don't have sample data, so I can't verify whether my solution exactly addresses your case:
with open('input.txt', 'r') as file_handle:
    file_content = file_handle.read()

list_of_IDs = []
for line in file_content.split('\n'):
    print("line =", line)
    print("first 8 =", line[0:8])
    list_of_IDs.append(line[0:8])

with open("output.csv", "w") as file_handle:
    file_handle.write("Company\n")
    for line in list_of_IDs:
        file_handle.write(line + "\n")
The value of separating these steps is to enable debugging.
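If you would rather keep the csv.DictWriter approach from your attempt, a minimal untested sketch (file names are illustrative) could look like this:
import csv

with open('input.txt') as rf, open('output.csv', 'w', newline='') as wf:
    writer = csv.DictWriter(wf, ['Company number'])
    writer.writeheader()
    for line in rf:
        # the first eight characters of each line hold the company number
        writer.writerow({'Company number': line[:8]})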

R save FlexTable as html file in script

I have a FlexTable produced with the ReporteRs package which I would like to export as .html.
When I print the table to the viewer in RStudio I can do this by clicking on 'Export' and selecting 'Save as webpage'.
How would I replicate this action in my script?
I don't want to knit to a html document or produce a report just yet as at present I just want separate files for each of my draft tables which I can share with collaborators (but nicely formatted so they are easy to read).
I have tried the as.html function and that does produce a .html file but all the formatting is missing (it is just plain text).
Here is a MWE:
# load libraries:
library(data.table)
library(ReporteRs)
library(rtable)
# Create dummy table:
mydt <- data.table(id = c(1,2,3), name = c("a", "b", "c"), fruit = c("apple", "orange", "banana"))
# Convert to FlexTable:
myflex <- vanilla.table(mydt)
# Attempt to export to html in script:
sink('MyFlexTable.html')
print(as.html(myflex))
sink()
# Alternately:
sink('MyFlexTable.html')
knit_print(myflex)
sink()
The problem with both methods demonstrated above is that they output the table without any formatting (no borders etc).
However, manually selecting 'export' and 'save as webpage' in RStudio renders the FlexTable to a html file with full formatting. Why is this?
This works for me:
writeLines(as.html(myflex), "MyFlexTable.html")
This presumably works because writeLines() emits the HTML string verbatim, whereas print() on a character vector adds the [1] index prefix and escapes the embedded quotes, which mangles the markup.

How can I make nltk.NaiveBayesClassifier.train() work with my dictionary

I'm currently making a simple spam/ham email filter using Naive Bayes.
So you can understand my algorithm's logic: I have a folder with lots of files, which are examples of spam/ham emails. I also have two other files in this folder, one containing the titles of all my ham examples and another with the titles of all my spam examples. I organized it like this so I can open and read these emails properly.
I'm putting all the words I judge to be important into a dictionary structure, with a label "spam" or "ham" depending on which kind of file I extracted them from.
Then I'm using nltk.NaiveBayesClassifier.train() so I can train my classifier, but I'm getting the error:
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
I don't know why this is happening. When I looked for a solution, I found that lists are not hashable, and I was using a list to do it, so I turned it into a dictionary, which is hashable as far as I know, but I keep getting this error.
Does anyone know how to solve it? Thanks!
All my code is listed below:
import nltk
import re
import random

stopwords = nltk.corpus.stopwords.words('english') #Words I should avoid since they have weak value for classification
my_file = open("spam_files.txt", "r") #my_file now has the name of each file that contains a spam email example
word = {} #a dictionary where I will store all the words and which value they have (spam or ham)

for lines in my_file: #for each name of file (which will be represented by LINES) of my_file
    with open(lines.rsplit('\n')[0]) as email: #I will open the file pointed to by LINES, and then read the email example that is inside this file
        for phrase in email: #After that, I will take every phrase of this email example I just opened
            try: #and I'll try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue #I will ignore non-ascii elements
            for c in tokens: #for each token
                regex = re.compile('[^a-zA-Z]') #I will also exclude numbers
                c = regex.sub('', c)
                if (c): #If there is any element left
                    if (c not in stopwords): #And if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'}) #I put this element in my dictionary. Since I'm analysing spam examples, variable C is labeled "spam".

my_file.close()
email.close()

#The same logic is used for the ham emails. Since my ham emails contain only ascii elements, I don't test them with TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
    with open(lines.rsplit('\n')[0]) as email:
        for phrase in email:
            tokens = nltk.word_tokenize(phrase)
            for c in tokens:
                regex = re.compile('[^a-zA-Z]')
                c = regex.sub('', c)
                if (c):
                    if (c not in stopwords):
                        c.lower()
                        word.update({c: 'ham'})
my_file.close()
email.close()

#And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)
nltk.NaiveBayesClassifier.train() expects “a list of tuples (featureset, label)” (see the documentation of the train() method).
What is not mentioned there is that featureset should be a dict of feature names mapped to feature values.
So, in a typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' or 1/0 or True/False;
the feature names are the occurring words and the values are the number of times each word occurs.
For example, the argument to the train() method might look like this:
[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
({'money': 3}, 'spam'),
...
({'dear': 1, 'meeting': 2}, 'ham'),
...
]
If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
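As a rough, untested sketch of how your single dictionary could be replaced by such a list (reusing the spam_files.txt/ham_files.txt layout from your question; featurize is a made-up helper name), each email becomes its own bag-of-words dict paired with its label:
import re
from collections import Counter

import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))

def featurize(path):
    # Turn one email file into a bag-of-words feature dict.
    with open(path) as email:
        words = re.findall('[a-zA-Z]+', email.read())
    words = [w.lower() for w in words if w.lower() not in stopwords]
    return dict(Counter(words))

labeled_featuresets = []
with open("spam_files.txt") as my_file:
    for line in my_file:
        labeled_featuresets.append((featurize(line.strip()), 'spam'))
with open("ham_files.txt") as my_file:
    for line in my_file:
        labeled_featuresets.append((featurize(line.strip()), 'ham'))

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)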

Reading XML data into R from an HTML source

I'd like to import data into R from a given webpage, say this one.
In the source code (but not on the actual page), the data I'd like to get is stored in a single line of javascript code which starts like this:
chart_Line1.setDataXML("<graph rotateNames (stuff omitted) >
<set value='699.99' name='16.02.2013' />
<set value='731.57' name='18.02.2013' />
<set value='more values' name='more dates' />
...
<trendLines> (now a different command starts, stuff omitted)
</trendLines></graph>")
(Note that I've included line breaks for readability; the data is in one single line in the original file. It would suffice to import only the line which starts with chart_Line1.setDataXML - it's line 56 in the source if you want to have a look yourself)
I can read the whole html file into a string using scan("URLofFile", what="raw"), but how do I extract the data from this?
Can I specify the data format with what="...", keeping in mind that there are no line breaks to separate the data, but several line breaks in the irrelevant prefix and suffix?
Is this something which can be done in a nice way using R tools, or do you suggest that this data acquisition should rather be done with a different script?
With some trial & error, I was able to find the exact line where the data is contained. I read the whole HTML file and then discard all the other lines.
require(zoo)
require(stringr)
# get html data, scrap all lines but the interesting one
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file =theurl, what = "character", sep="\n")
sec <- sec[45]
# extract all strings of the form "value='X'", where X is a 1 to 3 digit number with some separator and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}.[0-9]{2}'")
# dispose of all non-numerical, non-separator values
values <- str_replace_all(unlist(values),"[^0-9/.]","")
# get all dates in the form "name='DD.MM.YYYY"
dates <- str_extract_all(sec, "name='[0-9]{2}.[0-9]{2}.[0-9]{4}'")
# dispose of all non-numerical, non-separator values
dates <- str_replace_all(unlist(dates),"[^0-9/.]","")
# convert dates to canonical format
dates <- as.Date(dates,format="%d.%m.%Y")
# put values and dates into a list of ordered observations, converting the values from characters to numbers first.
MyZoo <- zoo(as.numeric(values),dates)

Insert file (foo.txt) into open file (bar.txt) at caret position

What would be the best method, please, to insert file (foo.txt) into open file (bar.txt) at caret position?
It would be nice to have an open-file dialog to choose anything to be inserted.
The word processing equivalent would be "insert file" here.
Here is a substitute for foo.sublime-snippet, which can be linked to form files elsewhere:
import sublime, sublime_plugin

class InsertFileCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        v = self.view
        template = open('foo.txt').read()
        print template
        v.run_command("insert_snippet", {"contents": template})
From within a text command you can access the current view. You can get the cursor positions using self.view.sel(). I don't know how to do gui stuff in python, but you can do file selection using the quick panel (similar to FuzzyFileNav).
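For example, here is a minimal, untested sketch (the file path is illustrative) that inserts a file's contents at every caret using self.view.sel():
import sublime_plugin

class InsertFileAtCaretsCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        contents = open('/path/to/foo.txt').read()  # illustrative path
        # insert in reverse order so earlier insertions do not shift
        # the offsets of the remaining selections
        for region in reversed(list(self.view.sel())):
            self.view.insert(edit, region.begin(), contents)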
Here is my unofficial modification of https://github.com/mneuhaus/SublimeFileTemplates which permits me to insert-a-file-here using the quick panel. It works on OS X (Mountain Lion).
The only disadvantage I see so far is the inability to translate a double backslash \\ in the form file correctly -- it gets inserted as just a single backslash \. In my LaTeX form files, the double backslash \\ represents a line ending, or a new line if preceded by a ~. The workaround is to insert an extra backslash at each occurrence in the actual form file (i.e., put three backslashes, with the understanding that only two will be inserted when the plugin runs). The form files need LF line endings and UTF-8 encoding -- CR endings are not translated properly. With a slight modification, it is also possible to have multiple form-file directories and/or file types.
import sublime, sublime_plugin
import os

class InsertFileCommand(sublime_plugin.WindowCommand):
    def run(self):
        self.find_templates()
        self.window.show_quick_panel(self.templates, self.template_selected)

    def find_templates(self):
        self.templates = []
        self.template_paths = []
        for root, dirnames, filenames in os.walk('/path_to_forms_directory'):
            for filename in filenames:
                if filename.endswith(".tex"):  # extension of form files
                    self.template_paths.append(os.path.join(root, filename))
                    self.templates.append(os.path.basename(root) + ": " + os.path.splitext(filename)[0])

    def template_selected(self, selected_index):
        if selected_index != -1:
            self.template_path = self.template_paths[selected_index]
            print "\n" * 25
            print "----------------------------------------------------------------------------------------\n"
            print ("Inserting File: " + self.template_path + "\n")
            print "----------------------------------------------------------------------------------------\n"
            template = open(self.template_path).read()
            print template
            view = self.window.run_command("insert_snippet", {'contents': template})
            sublime.status_message("Inserted File: %s" % self.template_path)