HTML to readable text in Swift - html

I need to write a reader function in my Swift app. I will receive text (from a JSON request) as HTML like this:
<h5 align="LEFT" class="western" style="font-weight: normal;"> </h5>
<font size="3"><h5 align="LEFT" class="western" style="font-weight: normal;">
Albert Einstein publie en 1905, une nouvelle théorie connue sous le nom de relativité restreinte. </h5>
<h5 align="LEFT" class="western" style="font-weight: normal;"></h5>
<h5 align="LEFT" style="font-weight: normal;"><font color="#339933">►</font>
<font size="3">Postulat 1 :</font>
</h5><h5 align="LEFT" class="western" style="font-weight: normal;">
</h5>
I want to display it as readable text with all of its attributes; I don't want to lose the styling. For that I can use HTMLReader.
Do you know a way to do this in Swift?

Since the styles are defined in the HTML itself, you have to use a web view:
objweb.delegate = self
var path = NSBundle.mainBundle().bundlePath
var baseUrl = NSURL.fileURLWithPath("\(path)")
let bundle = NSBundle.mainBundle()
let pathhtml = bundle.pathForResource("Armory", ofType: "html")
let content = NSString.stringWithContentsOfFile(pathhtml) as String
objweb.loadHTMLString(content, baseURL: baseUrl)
self.view.addSubview(objweb)

Related

how to parse a link from a text in python?

I am trying to parse the href in the anchor tag from a text. I tried the following code:
from flask import Flask,render_template
import requests
import re
app = Flask(__name__)
@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ= getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt =str(htmlcode)
    return render_template('productdisp.html',
        prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ =='__main__':
    app.run(debug=True)
and htmlcodetxt contains the following text:
<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong>Specification Sheet<br></strong><br></p><p><br></p><p style="text-align: center;"><strong>Photometric Data<br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>
One way would be to use the HTMLParser module like this to parse the href link from the htmlcodetxt string.
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value
# Declare it and feed it your HTML content that you want parsed for the href tag.
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
I'm not sure how your app handler works, but perhaps you could try something like this?
from flask import Flask,render_template
import requests
import re
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value
app = Flask(__name__)
@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ= getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt =str(htmlcode)
    parser = MyHTMLParser()
    parser.feed(htmlcodetxt)
    return render_template('productdisp.html',
        prod=jsonobj, prd=prd, htmlcode=htmlcode)
if __name__ =='__main__':
    app.run(debug=True)
For example, without Flask, and using the HTML sample that you posted, the following works and returns the expected output.
#!/usr/bin/python
content = '<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong>Specification Sheet<br></strong><br></p><p><br></p><p style="text-align: center;"><strong>Photometric Data<br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>'
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value
parser = MyHTMLParser()
parser.feed(content)
Example output:
$ ./html_parse.py
href = https://dl.dropbox.com/s/saa.pdf?dl=1
href = https://dl.dropbox.com/s/ds.png?dl=1
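The snippets above use Python 2 syntax (the HTMLParser module and the print statement). On Python 3 the same class lives in html.parser. The following is a minimal sketch of that variant, which collects the links in a list instead of printing them. The names LinkCollector and extract_links and the sample markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in anchor tags, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup: one real link and one anchor without href.
sample = ('<p><a href="https://example.com/spec.pdf">Specification Sheet</a> '
          '<a name="top">no link</a></p>')
print(extract_links(sample))
```

Returning a list keeps the parsing separate from the Flask view, so the view can pass the links to render_template instead of writing them to the console.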

How to extract different types of bold text and the text in between them using BeautifulSoup?

I have to parse HTML documents that use bold text as section identifiers, but the bold text comes in different forms; some examples are shown below.
Using Beautiful Soup I am able to parse them, but I have to write a lot of if/else branches to handle the different kinds of bold. Is there a better way to find such bold text, and the text in between, without so many if/else branches?
<div style="line-height:120%;padding-bottom:12px;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1. Work</font>
</div>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">List 1.</font>
</div>
</td>
<td style="vertical-align:top;padding-left:2px;padding-top:2px;padding-bottom:2px;padding-right:2px;">
<div style="text-align:left;font-size:10pt;">
<font style="font-family:inherit;font-size:10pt;font-weight:bold;">Work.</font>
</div>
</td>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="de42901_List_1._Work"> </a>
<a name="toc_de42901_2"> </a>
</font>
<font size="2"><b> List 1. Work <br> </b></font>
</p>
<p style="font-family:times;text-align:justify">
<font size="2">
<a name="da18101_List_1._Work"> </a>
<a name="toc_da18101_3"> </a>
</font>
<font size="2"><b> List 1. </b></font>
<font size="2"><b><i>Work <br> </i></b></font>
</p>
Use the split and join functions to remove the unwanted \n, \t and &nbsp; characters:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('b')
for i in data:
    final = ' '.join([x for x in i.text.split()])
    print(final)
This puts all of the bold text into the same format. Hope it resolves your query.
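The answer above only looks at b tags, while the samples in the question also mark bold text with an inline font-weight:bold style. As a rough sketch of treating both forms with one test, here is a standard-library version (html.parser rather than BeautifulSoup). It assumes well-nested markup and does not recognise numeric weights such as font-weight:700. The names is_bold, BoldTextCollector and VOID_TAGS are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def is_bold(tag, attrs):
    """True for <b>/<strong> or any tag with an inline font-weight:bold style."""
    style = (dict(attrs).get("style") or "").replace(" ", "").lower()
    return tag in ("b", "strong") or "font-weight:bold" in style


class BoldTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # one flag per open element: is that element bold?
        self.chunks = []    # text pieces of the bold run currently being read
        self.headings = []  # finished bold runs, whitespace-normalised

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(is_bold(tag, attrs))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        was_bold = self.stack.pop()
        # The bold run ends when the outermost bold element closes.
        if was_bold and not any(self.stack):
            text = " ".join("".join(self.chunks).split())
            if text:
                self.headings.append(text)
            self.chunks = []

    def handle_data(self, data):
        if any(self.stack):
            self.chunks.append(data)


sample = (
    '<div><font style="font-family:inherit;font-size:10pt;font-weight:bold;">'
    'List 1. Work</font></div>'
    '<p><font size="2"><b> List 1. <br> </b></font>'
    '<font size="2"><b><i>Work <br> </i></b></font></p>'
)
collector = BoldTextCollector()
collector.feed(sample)
print(collector.headings)
```

Keeping the bold test in a single predicate means a new variant of bold only needs one more condition in is_bold, not another if/else branch in the traversal.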

BeautifulSoup4 and HTML

I want to extract the following information from the HTML below, using Python and bs4:
h2 class placename value,
span class value,
div class="aithousaspec" value
<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>
The code that I'm using doesn't seem efficient:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
mydivs = soup.select('div.results-list')
for info in mydivs:
    time= info.select('div.aithousaspec')
    print time
    listCinemas = info.select("a[href*=hall.aspx]")
    print listCinemas
    print len(listCinemas)
    for times in time:
        proj= times.find('div.aithousaspec')
        print proj
    for names in listCinemas:
        theater = names.find('h2', class_='placename')
        print(names.find('h2').find(text=True).strip())
        print (names.find('h2').contents[1].text.strip())
Is there any better way to get the mentioned info?
data = '''<div class="results-list">
<div class="piatsaname">city center</div>
<table>
<tr class="trspacer-up">
<td>
<a href="hall.aspx?id=1001173">
<h2 class="placename">ARENA
<span class="boldelement"><img src="/images/sun.png" height="16" valign="bottom" style="padding:0px 3px 0px 10px" >Θερινός<br>
25 Richmond Avenue st, Leeds</span>
</h2>
<p>
+4497XXXXXXX<br>
STEREO SOUND
</p>
Every Monday 2 tickets 8,00 pounds
</a>
</td>
</tr>
<tr class="trspacer-down">
<td>
<p class="coloredelement">Italian Job</p>
<div class="aithousaspec">
<b></b> Thu.-Wed.: 20.50/ 23.00
<b></b>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('h2.placename')[0].contents[0].strip())
print(re.sub(r'\s{2,}', ' ', soup.select('span.boldelement')[0].text.strip()))
print(soup.select('div.aithousaspec')[0].text.strip())
This will print:
ARENA
Θερινός 25 Richmond Avenue st, Leeds
Thu.-Wed.: 20.50/ 23.00
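One detail of the snippet above: the pattern \s{2,} only replaces runs of two or more whitespace characters, so it relies on the line break inside the span being followed by indentation. If the two parts are separated by a single newline, that newline stays in the result. A small sketch of a more forgiving variant using \s+ follows; the helper name collapse_whitespace and the sample string are hypothetical.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (even a single newline) with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical span text whose two parts are separated by a lone newline.
span_text = "Θερινός\n25 Richmond Avenue st, Leeds"

# With \s{2,} the single newline is not matched, so it is left in place.
print(repr(re.sub(r"\s{2,}", " ", span_text.strip())))
# With \s+ the newline becomes a space.
print(repr(collapse_whitespace(span_text)))
```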

Parse Classes with similar names using Beautiful Soup in python

I have an HTML page from which I want to extract the td element whose class attribute is exactly bold. Instead, other td elements also show up, such as those with class dark bold.
When I use the findAll method in BeautifulSoup,
scores= soup.findAll(lambda tag: tag.name == 'td', { "class" : "bold"})
I get all of these elements:
<td class="dark bold">
<span class="hide-for-tablet">Sebastian</span>
<span class="hide-for-mobile">Vettel</span>
<span class="uppercase hide-for-desktop">VET</span>
</td>
<td class="bold hide-for-mobile">78</td>
<td class="dark bold">1:44:44.340</td>
<td class="bold">25</td>
Whereas all I really want is
<td class="bold">25</td>
How do I narrow down my results?
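Try this. BeautifulSoup exposes class as a list of values, so comparing the whole list against ['bold'] matches only tags whose sole class is bold: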
scores= soup.findAll(lambda tag: tag.name == 'td' and tag.get('class') == ['bold'])
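To illustrate why the exact comparison narrows the result, here is a standard-library sketch of the same idea (html.parser, not BeautifulSoup): the class attribute is split into a list and compared as a whole, so "dark bold" and "bold hide-for-mobile" no longer match. The class name ExactClassCells is hypothetical, and the sketch assumes td elements are not nested.

```python
from html.parser import HTMLParser


class ExactClassCells(HTMLParser):
    """Collect the text of <td> cells whose class list equals `wanted` exactly."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.capturing = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # "dark bold" -> ["dark", "bold"]; a missing class gives [].
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and classes == self.wanted:
            self.capturing = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.cells[-1] += data


html_rows = '''
<td class="dark bold">
<span class="hide-for-tablet">Sebastian</span>
<span class="hide-for-mobile">Vettel</span>
<span class="uppercase hide-for-desktop">VET</span>
</td>
<td class="bold hide-for-mobile">78</td>
<td class="dark bold">1:44:44.340</td>
<td class="bold">25</td>
'''
parser = ExactClassCells(["bold"])
parser.feed(html_rows)
print(parser.cells)
```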

Use RazorEngine inside tinymce

I'm writing an application that uses a given email template to generate multiple messages.
The e-mail parser works fine. I'm using RazorEngine to create the e-mail template.
The problem is that I need to generate a table using the following construct (a simple foreach):
<table>
<tbody>
<tr><th>Pedido</th><th>NF</th><th>Boleto</th><th>Vencimento</th><th>Valor</th></tr>
@foreach (dynamic item in Model.PagamentosEmAtraso) {
<tr>
<td valign="top" width="76">
<p align="center"><span style="font-size: small;">@item.NumeroPedido</span></p>
</td>
<td valign="top" width="60">
<p align="center"><span style="font-size: small;">@item.NumeroNotaFiscal</span></p>
</td>
<td valign="top" width="88">
<p align="center"><span style="font-size: small;">@item.NumeroBoleto</span></p>
</td>
<td valign="top" width="128">
<p align="center"><span style="font-size: small;">@item.DataVencimento.ToString("dd/MM/yyyy")</span></p>
</td>
<td valign="top" width="119">
<p align="center"><span style="font-size: small;">@item.ValorLiquido.ToString("C2") </span></p>
</td>
</tr>
}
</tbody>
</table>
When I exit the HTML editor, TinyMCE messes up my code, "fixing" it like this:
@foreach (dynamic item in Model.PagamentosEmAtraso) {}
<table>
This issue happens on newer versions of TinyMCE; it used to accept this kind of markup.
Is there any viable way to make TinyMCE accept possibly broken HTML without trying to fix it?
My tinymce configuration is:
function initializeTinyMce() {
    $('textarea.tinymce').tinymce({
        // Location of TinyMCE script
        script_url: '/Scripts/tinymce/tiny_mce.js',
        // General options
        theme: "advanced",
        plugins: "pagebreak,legacyoutput,style,layer,table,save,advimage,advlink,emotions,iespell,inlinepopups,preview,media,searchreplace,print,contextmenu,paste,directionality,fullscreen,noneditable,visualchars,nonbreaking,xhtmlxtras,template",
        width: "960",
        height: "500",
        entity_encoding: "raw",
        // Theme options
        theme_advanced_buttons1: "bold,italic,underline,strikethrough,sub,sup,|,justifyleft,justifycenter,justifyright,justifyfull,styleselect,formatselect,fontselect,fontsizeselect",
        theme_advanced_buttons2: "cut,copy,paste,pastetext,pasteword,|,bullist,numlist,|,outdent,indent,|,undo,redo,|,link,unlink,image,cleanup,help,code,|,insertdate,inserttime,preview,|,forecolor,backcolor",
        theme_advanced_buttons3: "tablecontrols,|,hr,removeformat,visualaid,||,fullscreen",
        theme_advanced_toolbar_location: "top",
        theme_advanced_toolbar_align: "left",
        theme_advanced_statusbar_location: "bottom",
        theme_advanced_resizing: true,
        // Example content CSS (should be your site CSS)
        //content_css: "/Content/site.css",
        // Drop lists for link/image/media/template dialogs
        template_external_list_url: "lists/template_list.js",
        external_link_list_url: "lists/link_list.js",
        external_image_list_url: "lists/image_list.js",
        media_external_list_url: "lists/media_list.js",
        // Replace values for the template plugin
        template_replace_values: {
            username: "Some User",
            staffid: "991234"
        }
    });
}
Since version 3.4 it is no longer possible to turn off the TinyMCE validator with a config setting.
The HTML needs to be valid, but you can define what the TinyMCE validator accepts as valid and what it does not. Have a closer look at the TinyMCE config parameters valid_elements and valid_children.