I need to turn the name of each file in the folder into clickable text. As of now, the file name is on one line and the link on another.
What is this called? Which keywords should I use to search for it?
import os

html = '<html><body>'
subset = []
lastFile = None
for file in os.listdir():
    if file.endswith(".html"):
        subset.append(file)
for r in subset:
    if not lastFile:
        html += '<h3>%s</h3>' % r
        html += '<a href="%s">link</a>' % r
You can just wrap the <h3> tag in an anchor tag. Using your code, do something like this:
html = '<html><body>'
subset = []
lastFile = None
for file in os.listdir():
    if file.endswith(".html"):
        subset.append(file)
for r in subset:
    if not lastFile:
        html += '<a href="%s">' % r
        html += '<h3>%s</h3></a>' % r
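For completeness, a minimal end-to-end sketch of the whole page build (the index.html output name is just an assumption):

import os

html = '<html><body>'
for name in sorted(os.listdir()):
    # Skip the page we are generating so it doesn't list itself.
    if name.endswith('.html') and name != 'index.html':
        # The file name itself is now the clickable text.
        html += '<a href="%s"><h3>%s</h3></a>' % (name, name)
html += '</body></html>'

with open('index.html', 'w') as f:  # hypothetical output file
    f.write(html)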
I'm trying to display the contents of a PDF by converting it into HTML using Adobe Acrobat 2021, extracting the paragraph structure, and post-processing. I saw a website whose only source is judgment PDFs from the Supreme Court website, and it displays them flawlessly. Does anybody have any idea how it's done?
My current flow is to convert the PDF into HTML to preserve the page layout and then extract the text using BeautifulSoup.
Issues I'm currently facing:
Bullet numbers are somehow dynamically calculated in the PDF and are tagged as ::before in the browser; bs4 won't recognize them.
Some paragraphs in between are missed, as they are detected incorrectly.
Tables are detected as tables, but with some imperfections.
PDF example: drive link
HTML from Adobe Acrobat: HTML file of the above PDF
This is my goal: Advocatekhoj
This is how accurate I expect it to be.
Could someone please shed light on this? Any how-to(s) or suggestions would help.
Note: I tried various PDF-to-HTML tools, and Adobe Acrobat was the best at detecting paragraph layout and preserving structure.
from bs4 import BeautifulSoup
from pprint import pprint
from os import listdir
from os.path import isfile, join

mypath = "sup_del_htmls/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
counter = 0
for f in onlyfiles:
    print(counter)
    counter += 1
    with open("output_txt/" + f + ".txt", 'w', encoding='utf-8') as txtfile:
        with open(mypath + f, encoding='utf-8') as fp:
            soup = BeautifulSoup(fp, "html.parser")
            para_counter = 1
            for li in soup.select("li"):
                if li.find_parent("li"):
                    continue
                full_para = ""
                for para in li.select("p"):
                    for match in para.findAll('span'):
                        match.unwrap()
                    para_txt = para.get_text().replace("¶", "")
                    para_txt = para_txt.strip()
                    if para_txt.endswith((".", ":", ";", ",", '"', "'")):
                        full_para += para_txt + "\n"
                    else:
                        full_para += para_txt + " "
                txtfile.write(full_para)
                txtfile.write("\n" + "--sep--" + "\n")
                if li.find("table"):
                    tables = li.find_all("table")
                    for table in tables:
                        txtfile.write("--table--" + "\n")
                        txtfile.write(str(table) + "\n")
                        txtfile.write("--sep--" + "\n")
            reversed_end = []
            for p in reversed(soup.select("p")):
                if p.find_parent('li') or p.find_parent('ol'):
                    break
                reversed_end.append(" ".join(p.text.split()))
            if reversed_end != []:
                for final_end in reversed(reversed_end):
                    txtfile.write(final_end + "\n")
                txtfile.write("--sep--" + "\n")
The result: output.txt
For the numbering with ::before in CSS, you can try to extract the selector(s) for the numbered items with a function like this:
def getLctrSelectors(stsh):
    stsh = stsh.get_text() if stsh else ''
    ll_ids = list(set([
        l.replace('>li', '> li').split('> li')[0].strip()
        for l in stsh.splitlines() if l.strip()[:1] == '#'
        and '> li' in l.replace('>li', '> li') and
        'counter-increment' in l.split('{')[-1].split(':')[0]
    ]))
    for i, l in enumerate(ll_ids):
        sel = f'{l} > li > *:first-child'
        ll_ids[i] = (sel, 1)
        crl = [
            ll for ll in stsh.splitlines() if ll.strip().startswith(l)
            and 'counter-reset' in ll.split('{')[-1].split(':')[-2:][0]
        ][:1]
        if not crl: continue
        crl = crl[0].split('{')[-1].split('counter-reset')[-1].split(':')[-1]
        crl = [w for w in crl.split(';')[0].split() if w.isdigit()]
        ll_ids[i] = (sel, int(crl[-1]) if crl else 1)
    return ll_ids
(It should take a style tag as input and return a list of selectors and starting counts - like [('#l1 > li > *:first-child', 3)] for your sample html.)
You can use it in your code to insert the numbers into the text in the bs4 tree:
soup = BeautifulSoup(fp, "html.parser")
for sel, ctStart in getLctrSelectors(soup.select_one('style')):
    for i, lif in enumerate(soup.select(sel)):
        lif.insert(0, f'{i + ctStart}. ')
para_counter = 1
### REST OF CODE ###
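To sanity-check the function, here is a quick demo on a made-up style block (not from the actual Acrobat export, just the same counter pattern):

from bs4 import BeautifulSoup

demo = BeautifulSoup("""
<style>
#l1 { counter-reset: c1 3; }
#l1 > li { counter-increment: c1; }
</style>
<ol id="l1"><li><p>first</p></li><li><p>second</p></li></ol>
""", "html.parser")

print(getLctrSelectors(demo.select_one('style')))
# [('#l1 > li > *:first-child', 3)]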
I'm not sure I can help you with the paragraph and table issues... Are you sure the site uses the same PDFs as you have? (Or that they use PDFs at all, rather than something closer to the original/raw data?) Your PDF itself looked rather different from its corresponding page on the site.
Is there an available package with which we can convert Slack Block Kit to HTML?
Or, if someone has a function for the same, can you please help?
If anyone is looking for something similar - here's the solution
function slackMarkdownToHtml(markdown) {
    // Replace asterisks with bold tags
    let html = markdown.replace(/\*(.+?)\*/g, '<b>$1</b>');
    // Replace underscores with italic tags
    html = html.replace(/_(.+?)_/g, '<i>$1</i>');
    // Replace tildes with strike-through tags
    html = html.replace(/~(.+?)~/g, '<s>$1</s>');
    // Wrap each run of dashed lines in an unordered list
    html = html.replace(/(?:^|\n)- .*(?:\n- .*)*/g, run =>
        '\n<ul>' + run.replace(/(?:^|\n)- (.*)/g, '<li>$1</li>') + '</ul>');
    // Wrap each run of numbered lines in an ordered list
    html = html.replace(/(?:^|\n)\d+\. .*(?:\n\d+\. .*)*/g, run =>
        '\n<ol>' + run.replace(/(?:^|\n)\d+\. (.*)/g, '<li>$1</li>') + '</ol>');
    // Replace markdown-style links with anchor tags
    html = html.replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');
    return html;
}
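A quick sanity check of the version above (made-up input):

console.log(slackMarkdownToHtml('*bold* and _italic_\n- one\n- two'));
// -> "<b>bold</b> and <i>italic</i>\n<ul><li>one</li><li>two</li></ul>"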
Also, the reverse - HTML to Slack Markdown
function htmlToSlackMarkdown(html) {
    // Replace newline characters with a blank line
    let markdown = html.replace(/\n/g, '\n\n');
    // Replace bold tags with asterisks
    markdown = markdown.replace(/<b>/g, '*').replace(/<\/b>/g, '*');
    // Replace italic tags with underscores
    markdown = markdown.replace(/<i>/g, '_').replace(/<\/i>/g, '_');
    // Replace strike-through tags with tildes
    markdown = markdown.replace(/<s>/g, '~').replace(/<\/s>/g, '~');
    // Number the items of each ordered list sequentially
    markdown = markdown.replace(/<ol>([\s\S]*?)<\/ol>/g, (m, body) => {
        let n = 0;
        return body.replace(/<li>/g, () => ++n + '. ').replace(/<\/li>/g, '');
    });
    // Replace remaining (unordered) list items with dashes
    markdown = markdown.replace(/<li>/g, '- ').replace(/<\/li>/g, '');
    markdown = markdown.replace(/<\/?ul>/g, '');
    // Replace anchor tags with markdown-style links
    markdown = markdown.replace(/<a href="(.+?)">(.+?)<\/a>/g, '[$2]($1)');
    return markdown;
}
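And the other direction, again with the version above:

console.log(htmlToSlackMarkdown('<b>bold</b>\n<ol>\n<li>one</li>\n<li>two</li>\n</ol>'));
// -> "*bold*" followed by "1. one" and "2. two", each on its own blank-line-separated line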
I am using R Markdown to create an html file for regression results tables, which are produced by stargazer and lfe in a code chunk.
library(lfe); library(stargazer)
data <- data.frame(x = 1:10, y = rnorm(10), z = rnorm(10))
result <- stargazer(felm(y ~ x + z, data = data), type = 'html')
I create an HTML file with inline code r result after the chunk above. However, a bunch of commas appear at the top of the table.
When I check the html code, I see almost every </tr> is followed by a comma.
How can I delete these commas?
Maybe not what you are looking for exactly, but I am a huge fan of modelsummary. I knit to HTML to see how it looks and then usually knit to PDF. The modelsummary equivalent would look something like this:
library(lfe)
library(modelsummary)
data = data.frame(x = 1:10, y = rnorm(10), z = rnorm(10))
results = felm(y ~ x + z, data = data)
modelsummary(results)
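If you want the HTML itself, modelsummary can also write straight to a file via its output argument (the file name here is arbitrary):

modelsummary(results, output = "table.html")  # format inferred from the extension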
There are a lot of ways to customize it through kableExtra and other packages. The documentation is really good. Here is kind of a silly example
library(kableExtra)
modelsummary(results,
             coef_map = c("x" = "Cool Treatment",
                          "z" = "Confounder",
                          "(Intercept)" = "(Intercept)")) %>%
  row_spec(1, background = "#F5ABEA")
Below is my web-scraping code for a website; it clicks a form which redirects to a page. From that page I need to extract the img src URL and export it into a CSV as text. I used the code below to extract content from a td tag, but when I run the same code here it doesn't work, because the td tag has no text content, only an img tag. Any help will be appreciated. I am new to web scraping. Thanks in advance.
import csv
from selenium.common.exceptions import NoSuchElementException

browser.find_element_by_css_selector(".textinput[value='APPLY']").click()
# select_finder = "//tr[contains(text(), 'NB')]//a"
select_finder = "//td[text()='NB']/../td[2]/a"
browser.find_element_by_css_selector(".content a").click()
assert "Application Details" in browser.title
file_data = []
try:
    assert "Application Details" in browser.title
    enlargement = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[3]/td[2]/b").text
    enlargement_answer1 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[2]").text
    enlargement_answer2 = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[4]/td[3]").text
    enlargement_text = enlargement + enlargement_answer1 + enlargement_answer2
    considerations = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[2]/b").text
    considerations_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[3]").text
    considerations_text = considerations + considerations_answer
    alteration = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[6]/b").text
    alteration_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[4]/td[7]").text
    alteration_text = alteration + alteration_answer
    units = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[5]/td[3]/b").text
    units_answer = browser.find_element_by_xpath("/html/body/center/table[15]/tbody/tr[5]/td[4]").text
    units_text = units + units_answer
    occupancy = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[3]/b").text
    occupancy_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[6]/td[4]").text
    occupancy_text = occupancy + occupancy_answer
    coo = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[3]/b").text
    coo_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[7]/td[4]").text
    coo_text = coo + coo_answer
    floors = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[3]/b").text
    floors_answer = browser.find_element_by_xpath("/html/body/center/table[16]/tbody/tr[8]/td[4]").text
    floors_text = floors + floors_answer
except (NoSuchElementException, AssertionError) as e:
    # These are plain strings, so assign fallbacks rather than calling .append()
    floors_text = "No Zoning Characteristics Present"
    coo_text = "n/a"
    occupancy_text = "n/a"
    units_text = "n/a"
    alteration_text = "n/a"
    considerations_text = "n/a"
    enlargement_text = "n/a"
with open('DOB.csv', 'a') as f:
    wr = csv.writer(f, dialect='excel')
    wr.writerow((block_number, lot_number, houseno, street, condo_text,
                 vacant_text, city_owned_text, file_data, floors_text, coo_text, occupancy_text, units_text, alteration_text,
                 considerations_text, enlargement_text))
browser.close()
As you stated you are new to web scraping, I encourage you to read up a bit: http://selenium-python.readthedocs.io/locating-elements.html
You are using XPath exclusively and in ways that are not recommended.
From the docs: "You can use XPath to either locate the element in absolute terms (not advised), or relative to an element that does have an id or name attribute."
Try using other locators to get your image.
for example: driver.find_element_by_css_selector("img[src='images/box_check.gif']")
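Once the row is located, read the URL off the img rather than the td's text. A minimal sketch (the selector here is a placeholder for whichever cell you need):

img = browser.find_element_by_css_selector("td img")  # placeholder selector
file_data.append(img.get_attribute("src"))            # store the src URL as text for the CSV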
Windows "viewers" (like Windows Live Photo Gallery or Windows Photo Viewer) have not supported GIF animation since the days of Windows XP. The handiest way I know now to view animation of a GIF is to open it with MSIE -- but THAT, unlike Windows Photo Viewer, does not let me "scroll" through a directory to view other image files. It occurred to me that I could create a scripted HTML document that would perform that "scrolling" through the directory, but I don't know of a way to set it up so that by right-clicking an animated GIF file in my "Recent Items" (or elsewhere), and selecting "Open with...", that one of the options in that group would be the HTML doc I had created, to be opened in MSIE and given the name of the file I had right-clicked on (in the location.search property, for example), so that it would display THAT animated GIF initially, but then, by my script in the HMTL document, would let me scroll through the directory to view other image files as well. Also, I would want this option to be available for any type of image file, so that I could initially view, say, a JPEG file, but then subsequent "directory scrolling" could include GIFs or BMPs, etc. IS there a way to do that?
As the saying goes, "Don't get me started!" :)
I hadn't actually planned on having the batch write to the HTML file, but given that approach, I decided to put my javascript into a JS file, and have the batch write code that would reference it, thus:
@echo ^<html^>^<body onkeydown='kdn(event.keyCode)'^>^<span id='im'^>^<img style='display:none' src=%1^>^</span^>^<script src='c:/wind/misc/peruse.js'^>^</script^> > c:\wind\misc\peruse.htm
@start c:\wind\misc\peruse.htm
I found that the only way to handle the backslashes in %1 was to store it directly to an img src, as you did; however, I wanted more detailed code for the img than I wanted to write at this stage, so I set it to be invisible and placed it inside an id'd span for later elaboration by my script. It's nice to know about %~p1 but I don't really need it here.
And here is a rudimentary script (in peruse.js) for folder navigation that it calls up:
document.bgColor = 'black';
f = ('' + document.images[0].src).substr(8);
document.getElementById('im').innerHTML = '<table height=100% width=100% cellspacing=0 cellpadding=0><tr><td align="center" valign="middle"><img src="' + f + '" onMouseDown="self.focus()"></td></tr></table>';
fso = new ActiveXObject("Scripting.FileSystemObject");
d = fso.GetFolder(r = f.substr(0, (b = f.lastIndexOf('/')) + (b < 3)));
if (b > 2) r += '/';
b = (document.title = f.substr(++b)).toLowerCase();
for (n = new Enumerator(d.files), s = [], k = -1, x = '..jpg.jpeg.gif.bmp.png.art.wmf.'; !n.atEnd(); n.moveNext()) {
    if (x.indexOf((p = n.item().name).substr(p.lastIndexOf('.') + 512 & 511).toLowerCase() + '.') > 0) {
        s[++k] = p.toLowerCase() + '*' + p
    }
}
for (s.sort(), i = 0, j = k++, status = k + ' file' + (k > 1 ? 's' : ''), z = 100; (x = s[n = (i + j) >> 1].substr(0, s[n].indexOf('*'))) != b; ) {
    x < b ? i = (i == n) + n : j = n
}
document.title = (n + 1) + ': ' + document.title;
function kdn(e, x) {
    if (k > 1 && ((x = e % 63) == 37 || x == 39)) {
        document.images[0].src = r + (x = s[n = (n + x - 38 + k) % k].substr(s[n].indexOf("*") + 1));
        e = 12;
        document.title = (n + 1) + ': ' + x;
        setTimeout("status+=''", 150)
    };
    if (e == 12 || e == 101 || e == 107 || e == 109) {
        document.images[0].style.zoom = (z = e < 107 ? 100 : e == 107 ? z * 1.2 : z / 1.2) + '%'
    }
}
self.focus()
It sets the page background to black,
recovers the path-and-filename into f (with the problematical backslashes converted to forward slashes),
sets up table code so the image appears in the center of the window,
accesses the filesystemobject, and, with the path portion extracted from f into r,
sets the page title to just the filename (with the lowercase name stored to b),
and iterates the folder, checking for any image file,
creates an array s of all those files, with names in lowercase followed by their original case-format,
sorts the array case-blind, and binary-searches the array for the original file (as b) so it knows where to proceed from,
and prefixes the number-within-folder to the page title;
then the keydown function uses the left and right arrows to move backward and forward in the folder, with wraparound,
and uses the numpad+ and - to enlarge or shrink the image, and numpad-5 to reset the size (which also occurs for every new image).
It still remains, though, that I'd like to know of a way to simply pass the original %1 info to an HTML file, without writing a file in the process. I might expect there to be a way to have it "appended to the web address", as is done with info following a ?, which gets placed in location.search. I don't know if the command line for iexplore.exe has a parameter for passing info to location.search.
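One untested idea, assuming MSIE keeps query strings on file: URLs: pass %1 after a ? so nothing has to be written to disk:

@start "" "file:///c:/wind/misc/peruse.htm?%~1"

and then, in peruse.js, recover the name from location.search instead of the hidden img src:

// Untested: read the file name from the query string (backslashes converted as before)
f = decodeURIComponent(location.search.substr(1)).replace(/\\/g, '/');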