How to omit specific class from URL while extracting text using python - html

I am extracting the title and contents from a URL using the code below:
import urllib.request
from bs4 import BeautifulSoup
def extract_title_text(url):
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, 'lxml')
    # Join the text of every <p> tag into one string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text
URL = 'https://www.bbc.co.uk/news/business-45482461'
titletext, text = extract_title_text(URL)
I would like to omit the contents of span class="off-screen" while extracting the text. Can I get some pointers on how to set up the filter, please?

A very simple solution is to filter out the tags you don't want, i.e.:
text = ' '.join(p.text for p in soup.find_all('p') if "off-screen" not in p.get("class", []))
For a more generic solution, soup.find_all() (as well as soup.find()) can take a function as its argument, so you can also do this:
def is_content_para(tag):
    # Keep <p> tags that do not carry the "off-screen" class
    return tag.name == "p" and "off-screen" not in tag.get("class", [])
text = ' '.join(p.text for p in soup.find_all(is_content_para))
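Since the class in the question sits on a span inside the paragraphs rather than on the p tags themselves, another option is to remove those spans from the tree before joining the text. This is only a minimal sketch, assuming BeautifulSoup 4 with the built-in html.parser backend; the sample markup is made up for illustration and is not the BBC page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption, not the real page)
html_doc = """
<html><head><title>Demo</title></head><body>
<p>Share price <span class="off-screen">hidden label</span>rose.</p>
<p>Second paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove every <span class="off-screen"> from the tree before reading any text
for span in soup.find_all("span", class_="off-screen"):
    span.decompose()

text = " ".join(p.get_text() for p in soup.find_all("p"))
print(text)
```

Because decompose() changes the parsed tree in place, any text read afterwards no longer contains the removed spans.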

Related

Libgdx: How to show HTML text in a label?

I have a string like this:
"noun<br> an expression of greeting <br>- every morning they exchanged polite hellos<br> <font color=dodgerblue> ••</font> Syn: hullo, hi, howdy, how-do-you-do<be>"
I want to show it in a label as rich text. For example, instead of showing <br> tags, the text must go to the next line.
In Android we can do that with:
Html.fromHtml(myHtmlString)
but I don't know how to do it in libGDX.
I tried Jsoup, but it removes all tags and does not move to the next line for a <br> tag, for example:
Jsoup.parse(myHtmlString).text()
Jsoup.parse returns a Document containing many elements of strings, not a single string, so you are only seeing the first bit. You can assemble the complete string yourself by going through the elements, or try:
Document doc = Jsoup.parse(yourHtmlInput);
String htmlString = doc.toString();
String htmlText = "<p>This is an <strong>Example</strong></p>";
//this will convert your HTML text into normal text
String normalText = Jsoup.parse(htmlText).text();
In Kotlin I use this code:
var definition = "my html string"
definition = definition.replace("<br>", "\n")
definition = definition.replace("<[^>]*>".toRegex(), "")
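The same two-step idea (turn line breaks into newlines first, then drop the remaining tags) can be sketched in Python purely for illustration. The function name html_to_plain_text and the shortened sample string are assumptions, and a regex like this is only suitable for simple, trusted markup.

```python
import html
import re

def html_to_plain_text(markup):
    # Turn <br>, <br/> and <br /> into newlines before removing other tags
    with_breaks = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    # Drop every remaining tag, keeping the text between them
    without_tags = re.sub(r"<[^>]*>", "", with_breaks)
    # Decode entities such as &amp; into their characters
    return html.unescape(without_tags)

sample = ("noun<br> an expression of greeting <br>- every morning<br>"
          "<font color=dodgerblue>Syn:</font> hullo, hi")
plain = html_to_plain_text(sample)
print(plain)
```

The order matters: if the tags were stripped first, the <br> markers would be gone before they could be turned into newlines.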

How to extract the hyperlink text from a <a> html tag?

Given a string containing 'blabla <a href="...">text</a> blabla', I want to extract 'text' from it.
The regexp documentation suggests the expression '<(\w+).*>.*</\1>', but it extracts the whole <a> ... </a> element.
Of course I can continue with strfind, like this:
line = 'blabla <a href="...">text</a> blabla';
atag = regexp(line,'<(\w+).*>.*</\1>','match', 'once');
from = strfind(atag, '>');
to = strfind(atag, '<');
text = atag((from(1)+1):(to(2)-1))
But can I use a different expression to get the text in one step?
You can use the extractHTMLText function in MATLAB (it requires Text Analytics Toolbox); see its documentation for details.
An example that gives the desired output:
line = 'blabla <a href="...">text</a> blabla';
l = split(extractHTMLText(line), ' ');
l{2}
If you don't want to use a built-in function, you could use a regex, as Nick suggested:
line = 'blabla <a href="...">text</a> blabla';
[atag,tok] = regexp(line,'<(\w+).*>(.*?)</\1>','match','tokens');
% tok{1} holds the tokens of the first match: {tag name, inner text}
t = tok{1};
t{2}
and you'll get the desired output.
You can simply use a capture group.
Your updated pattern would look something like this:
<(\w+).*>(.*)<\/\1>
and this one matches any tags:
<.*>(.*)<.*>
Regex101
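For illustration only, here is the capture-group idea as a small Python sketch. The example.com href is a placeholder assumption, and the pattern uses [^>]* and a lazy (.*?) rather than the greedy .* shown above, which keeps the match inside a single tag pair.

```python
import re

# Placeholder input; the href value is an assumption for the demo
line = 'blabla <a href="https://example.com">text</a> blabla'

# (\w+) captures the tag name, [^>]* skips its attributes,
# (.*?) lazily captures the inner text, and \1 requires the matching closing tag
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

match = pattern.search(line)
link_text = match.group(2) if match else None
print(link_text)
```

Group 1 holds the tag name and group 2 the text between the tags, so no second pass with string searching is needed.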
If you are using jQuery, try this; no regex is required. Note that it might hurt performance if the markup is hefty.
// Wrap the string in a <div> so that find() can reach the top-level <a>
var $wrapper = $('<div>').html(line);
var text = $wrapper.find('a').text();

How do I stop receiving hashtags as links from Twitter?

I wanted a Twitter-to-Telegram forwarder.
I found this one: https://github.com/franciscod/telegram-twitter-forwarder-bot
The problem now is that if a tweet contains a hashtag before a link, Telegram shows me the link to the hashtag.
I tried different things and searched for a solution, but I don't know how to receive only plain text from Twitter.
Also, I don't get the short t.co link if the tweet is too long; it's just a long link.
for tweet in tweets:
    self.logger.debug("- Got tweet: {}".format(tweet.text))
    # Check if tweet contains media, else check if it contains a link to an image
    extensions = ('.jpg', '.jpeg', '.png', '.gif')
    pattern = '[(%s)]$' % ')('.join(extensions)
    photo_url = ''
    tweet_text = html.unescape(tweet.text)
    if 'media' in tweet.entities:
        photo_url = tweet.entities['media'][0]['media_url_https']
    else:
        for url_entity in tweet.entities['urls']:
            expanded_url = url_entity['expanded_url']
            if re.search(pattern, expanded_url):
                photo_url = expanded_url
                break
    if photo_url:
        self.logger.debug("- - Found media URL in tweet: " + photo_url)
    for url_entity in tweet.entities['urls']:
        expanded_url = url_entity['expanded_url']
        indices = url_entity['indices']
        display_url = tweet.text[indices[0]:indices[1]]
        tweet_text = tweet_text.replace(display_url, expanded_url)
    tw_data = {
        'tw_id': tweet.id,
        'text': tweet_text,
        'created_at': tweet.created_at,
        'twitter_user': tw_user,
        'photo_url': photo_url,
    }
    try:
        t = Tweet.get(Tweet.tw_id == tweet.id)
        self.logger.warning("Got duplicated tw_id on this tweet:")
        self.logger.warning(str(tw_data))
    except Tweet.DoesNotExist:
        tweet_rows.append(tw_data)
    if len(tweet_rows) >= self.TWEET_BATCH_INSERT_COUNT:
        Tweet.insert_many(tweet_rows).execute()
        tweet_rows = []
Just disable the markdown_twitter_hashtags() function: make it return the text without performing that replacement.
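One way to read that suggestion is sketched below. The real body of markdown_twitter_hashtags() is not shown in this thread, so this is a hypothetical stand-in that keeps the name and a single text parameter (both assumptions about the bot's code) and simply skips the replacement.

```python
def markdown_twitter_hashtags(text):
    # Hypothetical no-op stand-in: hand the text back untouched so that
    # "#hashtag" stays plain text instead of being rewritten as a link
    return text

sample = "New post #python https://example.com/article"
forwarded = markdown_twitter_hashtags(sample)
print(forwarded)
```

Keeping the function in place, rather than deleting it, means existing call sites keep working without further changes.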

How to scrape and parse nested div with scrapy

I am trying to follow this GitHub page in order to learn how to crawl nested divs on Facebook: https://github.com/talhashraf/major-scrapy-spiders/blob/master/mss/spiders/facebook_profile.py
parse_info_text_only and parse_info_has_image in that file work fine for getting the span information.
I have a similar page where I am trying to get the result_id from a nested div; however, result_id is in the div itself.
From what I understand, the div I am trying to scrape is in the 2nd row, so I tried something like:
def parse_profile(self, response):
    item["BrowseResultsContainer"] = self.parse_info_has_id(response.css('#BrowseResultsContainer'))
    return item
def parse_info_has_id(self, css_path):
    text = css_path.xpath('div/div').extract()
    text = [t.strip() for t in text]
    text = [t for t in text if re.search('result_id', t)]
    return "\n".join(text)
How can I get the data-xt from the nested div above?
With CSS:
import json
...
def parse_info_has_id(self, css_path):
    # ::attr() is CSS selector syntax, so it must go through .css(), not .xpath()
    text = css_path.css('div::attr(data-gt)').extract_first()
    d = json.loads(text)
    return d['result_id']
I think, if you want all data-xt, then:
def parse_info_has_id(self, css_path):
    # Select every div that has a non-empty data-xt attribute
    text = css_path.xpath('//div[@data-xt != ""]').extract()
    text = [t.strip() for t in text]
    text = [t for t in text if re.search('result_id', t)]
    return "\n".join(text)
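The core step in both answers is reading a data-* attribute and, when it holds JSON, decoding it to pick out result_id. The sketch below shows that step without Scrapy, swapping in the standard library's html.parser so it runs on its own. The DataAttrCollector class, the sample markup and the id values are all made up for illustration.

```python
import json
from html.parser import HTMLParser

class DataAttrCollector(HTMLParser):
    """Collect the decoded JSON from one data-* attribute on every <div>."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        value = dict(attrs).get(self.attr_name)
        if value:
            # The attribute value is a JSON string, so decode it into a dict
            self.payloads.append(json.loads(value))

# Made-up markup standing in for the nested result divs
sample = (
    "<div id='BrowseResultsContainer'><div>"
    "<div data-xt='{\"result_id\": 12345}'>first</div>"
    "<div data-xt='{\"result_id\": 67890}'>second</div>"
    "</div></div>"
)

collector = DataAttrCollector("data-xt")
collector.feed(sample)
result_ids = [payload["result_id"] for payload in collector.payloads]
print(result_ids)
```

Inside a spider the attribute string would come from the selector instead, but the json.loads step on that string is the same.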

replace keyword within html string

I am looking for a way to replace keywords within an HTML string with a variable. At the moment I am using the following:
returnString = Replace(message, "[CustomerName]", customerName, CompareMethod.Text)
The above works fine if the HTML formatting wraps the whole keyword,
e.g.
<b>[CustomerName]</b>
However, if the formatting splits the keyword, the string is not found and thus not replaced,
e.g.
<b>[Customer</b>Name]
The formatting of the string is out of my control and isn't foolproof. With this in mind, what is the best approach to finding a keyword within an HTML string?
Try using a regular expression. You can build and test your expressions here; I used it and it works well.
http://regex-test.com/validate/javascript/js_match
If you're using JavaScript to access the content, use the textContent property instead of innerHTML. That strips all tags from the content and gives you back a clean text representation of the customer's name.
For example, if the content looks like this:
<div id="name">
<b>[Customer</b>Name]
</div>
then reading its textContent property gives:
var name = document.getElementById("name").textContent.trim();
// sets name to "[CustomerName]" without the tags
which should be easy to process. Do a regex search now if you need to.
Edit: Since you're doing this processing on the server side, process the XML recursively and collect the text elements of each node. Since I'm not big on VB.NET, here's some pseudocode:
getNodeText(node) {
    text = ""
    for each node.children as child {
        if child.type == TextNode {
            text += child.text
        }
        else {
            text += getNodeText(child);
        }
    }
    return text
}
myXml = xml.load(<html>);
print getNodeText(myXml);
And then replace or whatever there is to be done!
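The pseudocode above can be sketched as runnable Python, purely to illustrate the recursion. This assumes the fragment is well-formed XML (real-world HTML often is not) and uses the standard library's xml.etree.ElementTree; the get_node_text name and the "Jane Doe" value are placeholders.

```python
import xml.etree.ElementTree as ET

def get_node_text(node):
    """Recursively concatenate the text of a node and all its descendants."""
    text = node.text or ""
    for child in node:
        text += get_node_text(child)
        # Text that follows the child but still belongs to `node`
        text += child.tail or ""
    return text

# The split keyword from the question, wrapped in a well-formed fragment
fragment = '<div id="name"><b>[Customer</b>Name]</div>'
root = ET.fromstring(fragment)

plain = get_node_text(root)
result = plain.replace("[CustomerName]", "Jane Doe")
print(result)
```

Note that the replacement happens on the flattened text, so the original formatting tags are not preserved in the result.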
I have found what I believe is a solution to this issue; at least it works in my scenario.
The HTML input has been tweaked to place each custom field or keyword within a div with a set id. I loop through all of the elements in the HTML string using mshtml and set the inner text to the correct value when a match is found,
e.g.
Function ReplaceDetails(ByVal message As String, ByVal customerName As String) As String
    Dim returnString As String = String.Empty
    Dim doc As IHTMLDocument2 = New HTMLDocument
    ' Load the HTML string into an in-memory document
    doc.write(message)
    doc.close()
    For Each el As IHTMLElement In doc.body.all
        If (el.id = "Date") Then
            el.innerText = Now.ToShortDateString
        End If
        If (el.id = "CustomerName") Then
            el.innerText = customerName
        End If
    Next
    returnString = doc.body.innerHTML
    Return returnString
End Function
Thanks for all of the input. I'm glad to have a solution to the problem.