I'm new here (and to Python), so any feedback on my post is welcome.
I have some code which asks for an input and then adds it to an entry in various tables, e.g.:
import docx
doc = docx.Document('ABC.docx')
length = len(doc.tables)
name = input("What is your name?")
x = range(0,length)
for r in x:
    doc.tables[r].cell(0, 1).text = name + ": " + doc.tables[r].cell(0, 1).text
doc.save("ABC_.docx")
This takes text like "I love you" and changes it to "Bob: I love you", which is great. However, I'd like "Bob" to appear in bold. How do I do that?
Not sure this is the perfect way to do this, but it works. Basically you store the current cell text in a variable then clear the cell. After that, get the first paragraph of the cell and add formatted runs of text to it, as follows:
import docx
name = input("What is your name?")
doc = docx.Document('ABC.docx')
length = len(doc.tables)
x = range(0,length)
for r in x:
    celltext = doc.tables[r].cell(0, 1).text
    doc.tables[r].cell(0, 1).text = ""
    paragraph = doc.tables[r].cell(0, 1).paragraphs[0]
    paragraph.add_run(name).bold = True
    paragraph.add_run(": " + celltext)
doc.save("ABC_.docx")
Input:
What is your name?NewCoder
Result:
I'm trying to display the contents of the pdf by converting PDF into HTML using Adobe Acrobat 2021, extracting the paragraph structure, and post-processing. I saw a website whose only source is judgments as PDFs from the Supreme Court Website and displays them flawlessly. Does anybody have any idea how it's done?
My current flow is to convert the PDF into HTML to preserve the page layout and then extract the text using BeautifulSoup.
Issues I'm currently facing:
Bullet numbers are somehow calculated dynamically in the PDF and show up as ::before pseudo-elements in the browser, so bs4 doesn't recognize them
Some paragraphs in between are missed because they are detected incorrectly
Tables are detected as tables, but with some imperfections
PDF example : drive link
HTML from Adobe Acrobat : HTML file of the above PDF
This is my goal : Advocatekhoj
This is how accurate I'm expecting it to be.
Could someone please shed light on this? how-to(s) or any suggestions.
Note: I tried various PDF to HTML tools and the Adobe Acrobat was the best in detecting paragraph layout and preserving structure.
from bs4 import BeautifulSoup
from pprint import pprint
from os import listdir
from os.path import isfile, join
mypath = "sup_del_htmls/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
counter = 0
for f in onlyfiles:
    print(counter)
    with open("output_txt/"+f+".txt", 'w',encoding='utf-8') as txtfile:
        with open(mypath+f, encoding='utf-8') as fp:
            soup = BeautifulSoup(fp, "html.parser")
            para_counter = 1
            for li in soup.select("li"):
                if li.find_parent("li"):
                    continue
                full_para = ""
                for para in li.select("p"):
                    for match in para.findAll('span'):
                        match.unwrap()
                    para_txt = para.get_text().replace("¶", "")
                    para_txt = para_txt.strip()
                    if para_txt.endswith(".") or para_txt.endswith(":") or para_txt.endswith(";") or para_txt.endswith(",") or para_txt.endswith('"') or para_txt.endswith("'"):
                        full_para += para_txt + "\n"
                    else:
                        full_para += para_txt + " "
                txtfile.write(full_para)
                txtfile.write("\n" + "--sep--" + "\n")
                if li.find("table"):
                    tables = li.find_all("table")
                    for table in tables:
                        txtfile.write("--table--"+ "\n")
                        txtfile.write(str(table) + "\n")
                        txtfile.write("--sep--" + "\n")
            reversed_end = []
            for p in reversed(soup.select("p")):
                if p.find_parent('li') or p.find_parent('ol'):
                    break
                reversed_end.append(" ".join(p.text.split()))
            if reversed_end!=[]:
                for final_end in reversed(reversed_end):
                    txtfile.write(final_end + "\n")
                    txtfile.write("--sep--" + "\n")
The Result : output.txt
For the numbering with :before in css, you can try to extract the selector/s for the numbered items with a function like this
def getLctrSelectors(stsh):
    stsh = stsh.get_text() if stsh else ''
    ll_ids = list(set([
        l.replace('>li', '> li').split('> li')[0].strip()
        for l in stsh.splitlines() if l.strip()[:1] == '#'
        and '> li' in l.replace('>li', '> li') and
        'counter-increment' in l.split('{')[-1].split(':')[0]
    ]))
    for i, l in enumerate(ll_ids):
        sel = f'{l} > li > *:first-child'
        ll_ids[i] = (sel, 1)
        crl = [
            ll for ll in stsh.splitlines() if ll.strip().startswith(l)
            and 'counter-reset' in ll.split('{')[-1].split(':')[-2:][0]
        ][:1]
        if not crl: continue
        crl = crl[0].split('{')[-1].split('counter-reset')[-1].split(':')[-1]
        crl = [w for w in crl.split(';')[0].split() if w.isdigit()]
        ll_ids[i] = (sel, int(crl[-1]) if crl else 1)
    return ll_ids
(It should take a style tag as input and return a list of selectors and starting counts - like [('#l1 > li > *:first-child', 3)] for your sample html.)
You can use it in your code to insert the numbers into the text in the bs4 tree:
soup = BeautifulSoup(fp, "html.parser")
for sel, ctStart in getLctrSelectors(soup.select_one('style')):
    for i, lif in enumerate(soup.select(sel)):
        lif.insert(0, f'{i + ctStart}. ')
para_counter = 1
### REST OF CODE ###
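To make the stylesheet parsing above easier to follow, here is a smaller regex-based sketch of just the start-value lookup. It is not a replacement for getLctrSelectors: the name counter_starts and the sample CSS are made up for illustration (the CSS is only modeled on the kind of rules the function looks for), and it assumes each list's counter-reset sits in a plain #id { ... } rule.

```python
import re

# Hypothetical stylesheet text, shaped like the rules the function above looks for.
SAMPLE_CSS = """
li {display: block; }
#l1 {padding-left: 0pt; counter-reset: c1 3; }
#l1> li>*:first-child:before {counter-increment: c1; content: counter(c1, decimal)". "; }
#l2 {padding-left: 0pt; counter-reset: d1; }
"""

# '#id { ... counter-reset: <name> [<number>] ... }' with the number optional.
RESET_RULE = re.compile(
    r"(#[\w-]+)\s*\{[^}]*?counter-reset\s*:\s*[\w-]+(?:\s+(-?\d+))?"
)


def counter_starts(css_text):
    """Map each '#id' to the start number from its counter-reset (default 1)."""
    starts = {}
    for match in RESET_RULE.finditer(css_text):
        list_id, number = match.groups()
        # Same fallback as the function above: no explicit number means start at 1.
        starts[list_id] = int(number) if number else 1
    return starts


print(counter_starts(SAMPLE_CSS))
```

Each start value would then be paired with the '#id > li > *:first-child' selector, as the function above does, before inserting the numbers into the tree.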
I'm not sure I can help you with paragraphs and tables issues... Are you sure the site uses the same pdfs as you have? (Or that they use pdfs at all rather than something closer to the original/raw data?) Your pdf itself looked rather different from its corresponding page on the site.
Please help me look into this issue and recommend a library that makes it work. I have tried the showtext library, but it did not help.
Sample Data & Code
category_name total_readers
មនោសញ្ចេតនា 267867
ស្នេហា 239880
ព្រឺព្រួច 222031
អាថ៌កំបាំង 127858
គុននិយម 101888
df %>%
  ggplot(aes(area = total_readers, fill = category_name, label = category_name)) +
  geom_treemap() +
  theme(legend.position = "bottom") +
  geom_treemap_text(fontface = "italic", colour = "white", place = "centre", grow = FALSE)
Below is my web scraping code for a website; it clicks a form which redirects to a page. From that page I need to extract the img src URL and export it to a CSV file as text. I used the code below to extract content from a td tag. When I run the same code on this cell it doesn't work, because the td tag has no text content, only an img tag. Any help will be appreciated. I am new to web scraping. Thanks in advance.
browser.find_element_by_css_selector(".textinput[value='APPLY']").click()
#select_finder = "//tr[contains(text(), 'NB')]//a"
select_finder = "//td[text()='NB']/../td[2]/a"
browser.find_element_by_css_selector(".content a").click()
assert "Application Details" in browser.title
file_data = []
try:
    assert "Application Details" in browser.title
    enlargement = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[3]/td[2]/b").text
    enlargement_answer1 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[2]").text
    enlargement_answer2 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[3]").text
    enlargement_text = enlargement + enlargement_answer1 + enlargement_answer2
    considerations = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[2]/b").text
    considerations_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]").text
    considerations_text = considerations + considerations_answer
    alteration = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[6]/b").text
    alteration_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[7]").text
    alteration_text = alteration + alteration_answer
    units = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[5]/td[3]/b").text
    units_answer = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[5]/td[4]").text
    units_text = units + units_answer
    occupancy = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[3]/b").text
    occupancy_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[4]").text
    occupancy_text = occupancy + occupancy_answer
    coo = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[3]/b").text
    coo_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[4]").text
    coo_text = coo + coo_answer
    floors = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[3]/b").text
    floors_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[4]").text
    floors_text = floors + floors_answer
except (NoSuchElementException, AssertionError) as e:
    # These are strings, so assign placeholders instead of calling .append()
    floors_text = "No Zoning Characteristics Present"
    coo_text = "n/a"
    occupancy_text = "n/a"
    units_text = "n/a"
    alteration_text = "n/a"
    considerations_text = "n/a"
    enlargement_text = "n/a"
with open('DOB.csv', 'a') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerow((block_number, lot_number, houseno, street, condo_text,
                 vacant_text, city_owned_text, file_data, floors_text, coo_text, occupancy_text, units_text, alteration_text,
                 considerations_text, enlargement_text ))
browser.close()
As you stated you are new to web scraping I encourage you to read up a bit: http://selenium-python.readthedocs.io/locating-elements.html
You are using XPath exclusively and in ways that are not recommended.
From the docs: "You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute."
Try using other locators to get your image, for example:
driver.find_element_by_css_selector("img[src='images/box_check.gif']")
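Once the cell is located, the image URL comes from the src attribute rather than from .text, which is empty for a cell that holds only an img. A minimal sketch, assuming Selenium's find_elements(by, value) and get_attribute methods on the element. The helper name cell_value and the "n/a" fallback are illustrative, and "tag name" is the plain-string value behind By.TAG_NAME, used here so the snippet needs no imports:

```python
def cell_value(cell):
    """Return the cell's visible text, or the src of its first <img> if it has none."""
    text = cell.text.strip()
    if text:
        return text
    # An image-only <td> has empty text, so look for <img> children instead.
    images = cell.find_elements("tag name", "img")
    if images:
        return images[0].get_attribute("src")
    return "n/a"  # illustrative fallback, like the placeholders in the question
```

It could be called on any located cell, for example cell_value(browser.find_element("css selector", "td.some-cell")), where td.some-cell is a placeholder selector. The returned string can then go straight into the csv row.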
I am trying to automate my business's blog. I want to create a dynamic HTML string to use as a WordPress blog description. I am pulling text data from email bodies in my Gmail account to use as information. I parse the email body using the first function below.
I have everything working properly except for the for loop (in the second code block) that creates the description of the post. I have searched for hours and tried dozens of different techniques, but I can't figure it out for the life of me.
Here is how I am reading the text values into an array:
function getMatches(string, regex, index) {
  index || (index = 1); // default to the first capturing group
  var matches = [];
  var match;
  while (match = regex.exec(string)) {
    matches.push(match[index]);
  }
  return matches;
}
This is how I am trying to dynamically output the text arrays to create a basic HTML blogpost description (which I pass to xmlrpc to post):
var1 = getMatches(string, regex expression, 1);
var2 = getMatches(string, regex expression, 1);
var3 = getMatches(string, regex expression, 1);
var4 = getMatches(string, regex expression, 1);
var fulldesc = "<center>";
var text = "";
for (var k=0; k<var1.length; k++) {
  text = "<u><b>Var 1:</b></u> " + var1[k] + ", <u><b>Var 2:</b></u> " + var2[k] + ", <u><b>Var 3:</b></u> " + var3[k] + ", <u><b>Var 4:</b></u> " + var4[k] + ", <br><br>";
  fulldesc += text;
}
fulldesc += "</center>";
Lastly here is the blog post description code (using GAS XMLRPC library):
var fullBlog = "<b><u>Headline:</u> " + sub + "</b><br><br>" + fulldesc + "<br><br>General Description: " + desc;
var blogPost = {
  post_type: 'post',
  post_status: 'publish', // Set to draft or publish
  title: 'Ticker: ' + sub, //sub is from gmail subject and works fine
  categories: cat, //cat is defined elsewhere and works fine
  date_created_gmt: pubdate2, //defined elsewhere (not working but thats another topic)
  mt_allow_comments: 'closed',
  description: fullBlog
};
request.addParam(blogPost);
If there's only one value in the var1,2,3,4 arrays all works as it should. But any more than 1 value and I get no output at all from the "fulldesc" var. All other text variables work as they should and the blog still gets posted (just minus some very important information). I'm pretty sure the problem lies in my for loop which adds the HTML description to text var.
Any suggestions would be greatly appreciated, I'm burned out trying to get the answer! I am a self taught programmer (just from reading this forum) so please go easy on me if I missed something stupid :)
Figured it out: it wasn't the HTML/text loop at all. My blog post title had to be either a variable or a string literal, but not a concatenation of both.
Not working:
title: 'Ticker: ' + sub, //sub is from gmail subject and works fine
Working:
var test = 'Ticker: ' + sub;
//
title:test,
I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that will read the name of a noble House of Westeros from a cell (e.g. Stark), and output the Words of that House by looking it up on the relevant page on A Wiki of Ice and Fire. It should work like this:
Here is the pseudocode:
Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.
My problem is with the second line: I don't know how to read an HTML file. How should I do this in LibreOffice Basic?
There are two main issues with this.
1. Performance
Your UDF will need to fetch the HTTP resource for every cell in which it is used.
2. HTML
Unfortunately there is no HTML parser in OpenOffice or LibreOffice. There is only an XML parser. That's why we cannot parse HTML directly with the UDF.
This will work, but it is slow and not very general:
Public Function FETCHHOUSE(sHouse as String) as String
    sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse
    oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
    oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
    on error goto falseHouseName
    oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sUrl))
    on error goto 0
    dim delimiters() as long
    sContent = oInpDataStream.readString(delimiters(), false)
    lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body" )
    if lStartPos = 0 then
        FETCHHOUSE = "no infobox on page"
        exit function
    end if
    lEndPos = instr(lStartPos, sContent, "</table>")
    sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)
    lStartPos = instr(1, sTable, "Words" )
    if lStartPos = 0 then
        FETCHHOUSE = "no Words on page"
        exit function
    end if
    lEndPos = instr(lStartPos, sTable, "</tr>")
    sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)
    oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
    oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
    oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
    oOptions.searchString = "<td[^<]*>"
    oTextSearch.setOptions(oOptions)
    oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
    If oFound.subRegExpressions = 0 then
        FETCHHOUSE = "Words header but no Words content on page"
        exit function
    end if
    lStartPos = oFound.endOffset(0) + 1
    lEndPos = instr(lStartPos, sRow, "</td>")
    sWords = mid(sRow, lStartPos, lEndPos-lStartPos)
    FETCHHOUSE = sWords
    exit function
falseHouseName:
    FETCHHOUSE = "House name does not exist"
End Function
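For comparison, the same string-search steps from the macro above can be sketched in Python, which LibreOffice can also run as a macro language. This covers only the extraction from HTML that has already been downloaded. The function name house_words and the sample markup are assumptions for illustration, not the wiki's actual page source, and unlike the Basic macro it also strips tags nested inside the cell.

```python
import re

# Hypothetical markup, shaped like the infobox the macro above searches for.
SAMPLE_HTML = (
    '<html><body><table class="infobox infobox-body">'
    "<tr><th>Words</th><td><i>Winter is Coming</i></td></tr>"
    "</table></body></html>"
)


def house_words(html):
    """Return the text of the first <td> after the 'Words' label in the infobox."""
    start = html.find('<table class="infobox infobox-body"')
    if start == -1:
        return "no infobox on page"
    end = html.find("</table>", start)
    table = html[start:end] if end != -1 else html[start:]

    pos = table.find("Words")
    if pos == -1:
        return "no Words on page"
    row_end = table.find("</tr>", pos)
    row = table[pos:row_end] if row_end != -1 else table[pos:]

    cell = re.search(r"<td[^>]*>(.*?)</td>", row, re.S)
    if cell is None:
        return "Words header but no Words content on page"
    # Drop nested tags (links, italics) and surrounding whitespace.
    return re.sub(r"<[^>]+>", "", cell.group(1)).strip()


print(house_words(SAMPLE_HTML))  # prints: Winter is Coming
```

The error strings are the same ones the Basic function returns, so either version could feed the same spreadsheet column.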
A better way would be to get the needed information from a web API offered by the wiki. Do you know the people behind the wiki? If so, you could suggest this to them.
Greetings
Axel