pandas dataframe to_html formatters - can't display image - html

I'm trying to add up and down arrows to a pandas DataFrame with to_html for an email report.
I'm using a lambda function to append an up or down arrow to column values in my DataFrame. I know the image HTML works, because when I put it directly in the body of the email it renders fine, but when I use this function with the pandas formatter it outputs like in the below image (P.S. I know the CID is different from what's in my function; I was just testing something).
Does anyone have any idea why? Or does anyone have a better way to do it?
call:
worst_20_accounts.to_html(index=False, formatters={'Last Run Rank Difference': lambda x: check_html_val_for_arrow(x)})
function:
def check_html_val_for_arrow(x):
    try:
        if x > 0:
            return str(x) + ' <img src="cid:image7">'
        elif x < 0:
            return str(x) + ' <img src="cid:image8">'
        else:
            return str(x)
    except:
        return str(x)

escape=False
By default, the pandas.DataFrame.to_html method escapes any HTML in the DataFrame's values, so your <img> tags arrive as literal text. Pass escape=False to keep them intact.
my_img_snippet = (
"<img src='https://www.pnglot.com/pngfile/detail/"
"208-2086079_right-green-arrow-right-green-png-arrow.png'>"
)
df = pd.DataFrame([[my_img_snippet]])
Then
from IPython.display import HTML
HTML(df.to_html(escape=False))
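Putting the two pieces together for the original use case, a minimal sketch (the column name and cid values are taken from the question; the sample data is made up):

```python
import pandas as pd

def check_html_val_for_arrow(x):
    # Append a hypothetical inline-image tag for positive/negative values.
    try:
        if x > 0:
            return str(x) + ' <img src="cid:image7">'
        elif x < 0:
            return str(x) + ' <img src="cid:image8">'
        return str(x)
    except TypeError:
        return str(x)

df = pd.DataFrame({'Last Run Rank Difference': [3, -2, 0]})
html = df.to_html(
    index=False,
    escape=False,  # without this, the <img> tags are escaped to &lt;img ...&gt;
    formatters={'Last Run Rank Difference': check_html_val_for_arrow},
)
```

The resulting html string can then go straight into the email body alongside the CID-attached images.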
style
Alternatively, you can let the Styler object handle the rendering:
df.style

Related

Extracting contents from a PDF to display on web pages

I'm trying to display the contents of a PDF by converting the PDF into HTML using Adobe Acrobat 2021, extracting the paragraph structure, and post-processing. I saw a website whose only source is judgment PDFs from the Supreme Court website, and it displays them flawlessly. Does anybody have any idea how it's done?
My current flow is to convert the PDF into HTML to preserve the page layout and extract the text using Beautifulsoup.
Issues I'm currently facing:
Bulletin numbers are somehow dynamically calculated in the PDF and appear as ::before pseudo-elements in the browser, so bs4 won't recognize them.
Some paragraphs in between are missed because they are detected incorrectly.
Tables are detected as tables, but with some imperfections.
PDF example: drive link
HTML from Adobe Acrobat: HTML file of the above PDF
This is my goal: Advocatekhoj
This is how accurate I'm expecting it to be.
Could someone please shed light on this? How-to(s) or any suggestions are welcome.
Note: I tried various PDF to HTML tools and the Adobe Acrobat was the best in detecting paragraph layout and preserving structure.
from bs4 import BeautifulSoup
from pprint import pprint
from os import listdir
from os.path import isfile, join

mypath = "sup_del_htmls/"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
counter = 0
for f in onlyfiles:
    print(counter)
    with open("output_txt/" + f + ".txt", 'w', encoding='utf-8') as txtfile:
        with open(mypath + f, encoding='utf-8') as fp:
            soup = BeautifulSoup(fp, "html.parser")
            para_counter = 1
            for li in soup.select("li"):
                if li.find_parent("li"):
                    continue
                full_para = ""
                for para in li.select("p"):
                    for match in para.findAll('span'):
                        match.unwrap()
                    para_txt = para.get_text().replace("¶", "")
                    para_txt = para_txt.strip()
                    if para_txt.endswith((".", ":", ";", ",", '"', "'")):
                        full_para += para_txt + "\n"
                    else:
                        full_para += para_txt + " "
                txtfile.write(full_para)
                txtfile.write("\n" + "--sep--" + "\n")
                if li.find("table"):
                    tables = li.find_all("table")
                    for table in tables:
                        txtfile.write("--table--" + "\n")
                        txtfile.write(str(table) + "\n")
                        txtfile.write("--sep--" + "\n")
            reversed_end = []
            for p in reversed(soup.select("p")):
                if p.find_parent('li') or p.find_parent('ol'):
                    break
                reversed_end.append(" ".join(p.text.split()))
            if reversed_end != []:
                for final_end in reversed(reversed_end):
                    txtfile.write(final_end + "\n")
                    txtfile.write("--sep--" + "\n")
The result: output.txt
For the numbering generated with ::before in CSS, you can try to extract the selector(s) for the numbered items with a function like this:
def getLctrSelectors(stsh):
    stsh = stsh.get_text() if stsh else ''
    ll_ids = list(set([
        l.replace('>li', '> li').split('> li')[0].strip()
        for l in stsh.splitlines() if l.strip()[:1] == '#'
        and '> li' in l.replace('>li', '> li') and
        'counter-increment' in l.split('{')[-1].split(':')[0]
    ]))
    for i, l in enumerate(ll_ids):
        sel = f'{l} > li > *:first-child'
        ll_ids[i] = (sel, 1)
        crl = [
            ll for ll in stsh.splitlines() if ll.strip().startswith(l)
            and 'counter-reset' in ll.split('{')[-1].split(':')[-2:][0]
        ][:1]
        if not crl:
            continue
        crl = crl[0].split('{')[-1].split('counter-reset')[-1].split(':')[-1]
        crl = [w for w in crl.split(';')[0].split() if w.isdigit()]
        ll_ids[i] = (sel, int(crl[-1]) if crl else 1)
    return ll_ids
(It should take a style tag as input and return a list of selectors and starting counts - like [('#l1 > li > *:first-child', 3)] for your sample html.)
You can use it in your code to insert the numbers into the text in the bs4 tree:
soup = BeautifulSoup(fp, "html.parser")
for sel, ctStart in getLctrSelectors(soup.select_one('style')):
    for i, lif in enumerate(soup.select(sel)):
        lif.insert(0, f'{i + ctStart}. ')
para_counter = 1
### REST OF CODE ###
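To see the counter trick end to end, here is a minimal self-contained sketch (the toy HTML, id and selector are my own, not from the Acrobat export):

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the Acrobat export: the list numbers exist only as CSS
# ::before counters, so BeautifulSoup's get_text() never sees them.
html = """
<style>#l1 > li > *:first-child::before {counter-increment: c1; content: counter(c1)". "}</style>
<ol id="l1">
<li><p>first judgment paragraph</p></li>
<li><p>second judgment paragraph</p></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# Insert the numbers directly into the tree so text extraction picks them up.
for i, p in enumerate(soup.select("#l1 > li > *:first-child")):
    p.insert(0, f"{i + 1}. ")

texts = [p.get_text() for p in soup.select("li p")]
print(texts)  # ['1. first judgment paragraph', '2. second judgment paragraph']
```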
I'm not sure I can help you with the paragraph and table issues... Are you sure the site uses the same PDFs as you have? (Or that they use PDFs at all, rather than something closer to the original/raw data?) Your PDF itself looked rather different from its corresponding page on the site.

Weird KeyError (Python)

So, I have to work with this JSON (from a URL):
{'player': {'racing': 25260.154000000017, 'player': 259114.57700000296}, 'farming': {'fishing': 33783.390999999414, 'mining': 29048.60500000002, 'farming': 25334.504000000023}, 'piloting': {'piloting': 25570.18800000001, 'cargos': 3080.713000000036, 'heli': 10433.977000000004}, 'physical': {'strength': 198358.86700000675}, 'business': {'business': 50922.88500000005}, 'trucking': {'mechanic': 2724.5620000000004, 'garbage': 755.642999999997, 'trucking': 223784.99700000713, 'postop': 1411.4190000000006}, 'train': {'bus': 669.1940000000001, 'train': 1363.805999999999}, 'ems': {'fire': 25449.43400000001, 'ems': 13844.628000000012}, 'hunting': {'skill': 4179.033000000316}, 'casino': {'casino': 18545.526000000027}}
It is indeed one line. I am trying to make it so that I can get, for example, racing, which is the first value you see. For this, you need to go into player first, and then you can get to racing. How do I do this?
My current code:
import requests

def allthethings():
    # Grab all the skills
    geturl = "http://server.tycoon.community:30120/status/data/" + str(setting_playerid)
    print(geturl)
    a = requests.get(geturl, headers={"X-Tycoon-Key": setting_apikeyTT}).json()
    jsonconverted = a["data"]["gaptitudes_v"]
    print(jsonconverted)
    # Convert JSON into many, many variables
    Raw_RACR = jsonconverted['player.racing']
    print(Raw_RACR)
I believe this is all the code that is needed.
Also, this is the error:
KeyError: 'player.racing'
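The reason is that 'player.racing' is treated as a single literal key, which does not exist; nested dictionaries are indexed one level at a time. A minimal sketch with a trimmed version of the response above:

```python
# Trimmed version of the JSON shown above.
jsonconverted = {'player': {'racing': 25260.154, 'player': 259114.577}}

# Chain the lookups, one key per nesting level:
Raw_RACR = jsonconverted['player']['racing']
print(Raw_RACR)  # 25260.154

# Or use .get() to avoid a KeyError when a key might be absent:
safe = jsonconverted.get('player', {}).get('racing')
```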

Understanding DetectedBreak in google OCR full text annotations

I am trying to convert the full-text annotations of a Google Vision OCR result to line level and word level, which are in a Block, Paragraph, Word and Symbol hierarchy.
However, when assembling symbols into word text and words into line text, I need to understand the DetectedBreak property.
I went through this documentation, but I did not understand a few of them.
Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.
EOL_SURE_SPACE
HYPHEN
LINE_BREAK
SPACE
SURE_SPACE
UNKNOWN
Can they be replaced by either a newline char or space ?
The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see on the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image saved in GCS and prints all detected breaks except the ones you have no trouble understanding (LINE_BREAK and SPACE), along with the word immediately preceding them, to enable comparison.
import sys
import os

from google.cloud import storage
from google.cloud import vision

def detect_breaks(gcs_image):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()
    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
    image_source = vision.types.ImageSource(image_uri=gcs_image)
    image = vision.types.Image(source=image_source)
    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)
    annotation = client.annotate_image(request).full_text_annotation
    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
                        if symbol.property.detected_break.type:
                            if symbol.property.detected_break.type in (breaks.SPACE, breaks.LINE_BREAK):
                                word_text = ""
                            else:
                                print(word_text, symbol.property.detected_break)
                                word_text = ""

if __name__ == '__main__':
    detect_breaks(sys.argv[1])
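As for replacing them with a newline or a space: a reasonable approximation is to treat the space-like breaks as ' ' and the line-ending ones as '\n'. The enum names below come from the documentation; the character choices are my own suggestion, not an official mapping:

```python
# Suggested plain-text substitutions for each BreakType (an assumption,
# not an official mapping from the Vision API docs).
BREAK_CHARS = {
    "UNKNOWN": "",
    "SPACE": " ",
    "SURE_SPACE": " ",       # a very wide space, e.g. between columns
    "EOL_SURE_SPACE": "\n",  # a wide space that also ends the line
    "HYPHEN": "-\n",         # end-of-line hyphen, not present in symbol.text
    "LINE_BREAK": "\n",      # a break that ends a paragraph line
}

def join_symbols(symbols):
    """symbols: iterable of (text, break_name) pairs, e.g. from the OCR walk."""
    return "".join(text + BREAK_CHARS.get(brk, "") for text, brk in symbols)

line = join_symbols([("w", ""), ("o", ""), ("r", ""), ("d", "SPACE"),
                     ("o", ""), ("k", "LINE_BREAK")])
print(repr(line))  # 'word ok\n'
```

Note that HYPHEN is the one case where a character (the hyphen itself) must be re-added, since it is omitted from symbol.text.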

Python Full Web Parsing

As of right now I'm attempting to make a simple music player app that streams music or video directly from a YouTube URL, and in order to do that I need the full download of the search results page used to search for videos. But I'm having some problems with urlopen in Python 3, which is what I'm using to make the command-line application. It won't load the ytd-app tag on YouTube, which is where a good deal of the video and playlist references are put when you first load the search. Does anyone know what's going on, or know some type of workaround for it? Thanks!
My code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

BASICURL = "https://www.youtube.com/results?"
query = query.split()
ret = ""
stufffound = {}
for x in query:
    ret = ret + x + "+"
ret = ret[:len(ret) - 1]

# URL BUILDER
if filtercriteria:
    URL = BASICURL + "sp={0}".format(filtercriteria) + "&search_query={0}".format(ret)
else:
    URL = BASICURL + "search_query={0}".format(ret)

passdict = {}

def findvideosonpage(url, dictToAddTo):
    # .read() belongs to the urlopen response, not to the soup
    page = BS(urlopen(url).read(), "html.parser")
    for i, x in enumerate(page.findAll(attrs={'class': 'yt-simple-endpoint style-scope ytd-video-renderer'})):
        dictToAddTo[i] = x['href']
        print(x)
    # Dictionary is meant to be converted into a list later to order the results
    return [x for _, x in sorted(zip(dictToAddTo.values(), dictToAddTo.keys()))]
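As a side note, the manual "+"-joining of query terms can be replaced with urllib.parse.urlencode, which also handles special characters safely. A small sketch (function name and sample query are mine):

```python
from urllib.parse import urlencode

def build_search_url(query, filtercriteria=None):
    # urlencode escapes spaces and special characters for us,
    # so there is no need to join words with "+" by hand.
    params = {"search_query": query}
    if filtercriteria:
        params["sp"] = filtercriteria
    return "https://www.youtube.com/results?" + urlencode(params)

print(build_search_url("lo fi beats"))
# https://www.youtube.com/results?search_query=lo+fi+beats
```

This does not fix the ytd-app issue itself: that content is rendered by JavaScript in the browser, so urlopen alone will never see it.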

Converting a while loop into a function? Python

In order to condense my code, I am trying to turn one of my while loops into a function. I have tried numerous times and have yet to get the same result as I do when I just leave the while loop in place.
Here's the while loop:
while True:
    i = find_lowest_i(logs)
    if i == -1:
        break
    print "i=", i
    tpl = logs[i].pop(0)
    print tpl
    out.append(tpl)
    print out
And here's what I have so far for my function:
def mergesort(list_of_logs):
    i = find_lowest_i(logs)
    out = []
    while True:
        if i == -1:
            break
        print "i=", i
        tpl = logs[i].pop(0)
        print tpl
        out.append(tpl)
        print out
    return out
Thanks in advance. This place is a safe haven for a beginner programmer.
It looks like the parameter to your function is list_of_logs, but you're still using logs inside the function's body. The simplest fix is probably to rename mergesort's parameter from list_of_logs to logs. There is one more difference from your original loop: the function calls find_lowest_i only once, before the while loop, so i never changes between iterations; move that call back inside the loop as in your original version.
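A sketch with both fixes applied, in Python 3. Since find_lowest_i wasn't posted, the version here is a stand-in I wrote for the demo (it returns the index of the list with the smallest head, or -1 when all lists are empty):

```python
def find_lowest_i(logs):
    # Hypothetical stand-in for the undefined helper: index of the non-empty
    # sub-list whose first element is smallest, or -1 when all are exhausted.
    candidates = [(lst[0], i) for i, lst in enumerate(logs) if lst]
    return min(candidates)[1] if candidates else -1

def mergesort(logs):
    # Same loop as the original, with the parameter renamed to match its use
    # and find_lowest_i re-evaluated on every iteration.
    out = []
    while True:
        i = find_lowest_i(logs)
        if i == -1:
            break
        out.append(logs[i].pop(0))
    return out

print(mergesort([[1, 4], [2, 3]]))  # [1, 2, 3, 4]
```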