How to get readable text from an HTML block

import requests
import urllib.request
from bs4 import BeautifulSoup
from PIL import Image

region = input("Insert region of search ").lower()
# Get the region of the world, which picks the Wikipedia language subdomain
first_name, last_name = input("").split()
# Get the first and last name of the person/object of interest
res = requests.get("https://" + region + ".wikipedia.org/wiki/" + first_name + "_" + last_name)
# Get the Wiki page of the person/object
soup = BeautifulSoup(res.text, "html.parser")
# Parse the HTML
infobox = soup.select("img")[:4]
# Get the first 4 images
content = soup.select("p")[0]
# Get the first block of text
for info in range(len(infobox)):
    link = infobox[info].get("src")
    if first_name in link:
        urllib.request.urlretrieve("http:" + link, "sample.png")
        img = Image.open("sample.png")
        img.show()
        break
# Loop through the images until it finds the one about the person/object
So I made this small program that basically brings back a picture of what you search for. I'm sure it can be improved, so if you have any feedback/tips I'd appreciate them. I also want to get the first block of text from the wiki article. I am able to get the block of HTML text, but how do I remove the <p> and <b> tags?

Found it: you just have to call .text on the content.
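For example, a minimal sketch building on the snippet above (content is the first <p> tag selected earlier; .get_text() is the equivalent BeautifulSoup call):
plain = content.text
# or, with a separator and surrounding whitespace stripped:
plain = content.get_text(" ", strip=True)
print(plain)  # the paragraph without the <p>/<b> markup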

Related

How to omit specific class from URL while extracting text using python

I am extracting the title & contents from a URL using the code below:
import urllib.request
from bs4 import BeautifulSoup

def extract_title_text(url):
    page = urllib.request.urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, 'lxml')
    # Join the text of every <p> tag into one string
    text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
    return soup.title.text, text
URL = 'https://www.bbc.co.uk/news/business-45482461'
titletext, text = extract_title_text(URL)
I would like to omit the contents of span class="off-screen" while extracting the text. Can I get some pointers on how to set up the filter, please?
A very simple solution is to filter out the unwanted tags, i.e.:
text = ' '.join(p.text for p in soup.find_all('p') if "off-screen" not in p.get("class", []))
For a more generic solution, soup.find_all() (as well as soup.find()) can take a function as argument, so you can also do this:
def is_content_para(tag):
    # Keep only <p> tags that do not carry the "off-screen" class
    return tag.name == "p" and "off-screen" not in tag.get("class", [])

text = ' '.join(p.text for p in soup.find_all(is_content_para))
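If the off-screen class sits on <span> tags inside the paragraphs (as the question suggests) rather than on the paragraphs themselves, a possible alternative, offered here only as a hedged sketch, is to drop those spans from the tree before joining the text:
# Remove every <span class="off-screen"> from the parsed tree, then extract the text
for span in soup.find_all('span', class_='off-screen'):
    span.decompose()
text = ' '.join(p.text for p in soup.find_all('p'))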

Dynamic text based on URL parameters

I'm building a website on Instapage and I have a form on the homepage with two fields: State & Practice Area. When a visitor submits his selections, he is redirected to another page with "area" and "state" as URL parameters.
All good so far, but now I want to write a paragraph on the page, let's just say "[practice area] in [state]", where the brackets would contain the corresponding information from the URL parameters. I'm guessing it would be a simple piece of HTML that replaces a parameter's name within the paragraph I'm writing with its value from the URL.
I also want to have a default text if there are no parameters in the URL.
Just to be clear, I'm trying to do all of this within the same paragraph, so most of the paragraph stays the same, and just a couple of words within it would change based on the parameters (or have a default value if there are no parameters).
Any help would be appreciated. Thank you!
If your URL is like this:
[protocol]://[hostname]/[state]/[practice area]
You can get the state and practice like this:
let state = window.location.pathname.split('/')[1] || 'your default value for state'
let practiceArea = window.location.pathname.split('/')[2] || 'your default value for practice area'
let p = document.getElementById('text')
p.innerText = `${state} in ${practiceArea}`
<p id='text'></p>
Update
If your URL is like this:
"[protocol]://[hostname]/?state=statevalue&practiceArea=practiceValue"
You can get the state and practice like this:
var urlParams = new URLSearchParams(window.location.search)
let state = urlParams.get('state') || 'your default value for state'
let practiceArea = urlParams.get('practiceArea') || 'your default value for practice area'
let p = document.getElementById('text')
p.innerText = `${state} in ${practiceArea}`
<p id='text'></p>

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4 and requests to scrape some information from supremenewyork.com UK. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping its information automatically, so they scramble the text, which I think makes it unusable.
My question: is there a way to get the text without using .text, and/or a way to make the script read the text while skipping over special characters like #, or, when it sees &, skipping until it sees ;?
Basically, this is how the website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they only use numbers and special keys to scramble, not letters).
This is the kind of thing that gets highlighted in a box automatically when you inspect the element on the UK supreme website through a VPN, and it is different from the visible text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine, but only because the text isn't scrambled on my local site, and I want to pull this info from the UK website. Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0

# use a new proxy every so often for testing (will add something that pulls proxies and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + ' --- ' + name + ' --- ' + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So what I did was un-comment the lines that are commented out above and comment out the similar lines next to them, and then I got some kind of str error, so I tried print(str(name)). I can print the alt fine (in every version of the script the alt is not scrambled), but when it comes to printing the name and style, all that gets printed is None under every alt.
I have been working on this for days and have come up with no solutions. Can anyone help me solve it?
I solved this myself using the following approach:
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')

for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)
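A related defensive pattern, not part of the original answer but a hedged sketch, is to check the result of find() for None before calling .text, so a missing itemprop tag prints a placeholder instead of raising AttributeError:
# Hypothetical helper: fall back to a placeholder when the tag is missing
def safe_text(parent, tag_name, **attrs):
    found = parent.find(tag_name, **attrs)
    return found.text if found is not None else 'N/A'

name = safe_text(item_soup, 'h1', itemprop='name')
style = safe_text(item_soup, 'p', itemprop='model')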

How do I stop receiving hashtags as links from Twitter?

I wanted a Twitter forwarder to Telegram.
I found this one: https://github.com/franciscod/telegram-twitter-forwarder-bot
The problem is that if a tweet contains a hashtag before a link, Telegram shows me the link to the hashtag.
I tried different things and searched around, but I don't know how to receive only plain text from Twitter.
Also, I don't get the short t.co link if the tweet is too long; it's just a long link.
for tweet in tweets:
    self.logger.debug("- Got tweet: {}".format(tweet.text))
    # Check if tweet contains media, else check if it contains a link to an image
    extensions = ('.jpg', '.jpeg', '.png', '.gif')
    pattern = '[(%s)]$' % ')('.join(extensions)
    photo_url = ''
    tweet_text = html.unescape(tweet.text)
    if 'media' in tweet.entities:
        photo_url = tweet.entities['media'][0]['media_url_https']
    else:
        for url_entity in tweet.entities['urls']:
            expanded_url = url_entity['expanded_url']
            if re.search(pattern, expanded_url):
                photo_url = expanded_url
                break
    if photo_url:
        self.logger.debug("- - Found media URL in tweet: " + photo_url)
    for url_entity in tweet.entities['urls']:
        expanded_url = url_entity['expanded_url']
        indices = url_entity['indices']
        display_url = tweet.text[indices[0]:indices[1]]
        tweet_text = tweet_text.replace(display_url, expanded_url)
    tw_data = {
        'tw_id': tweet.id,
        'text': tweet_text,
        'created_at': tweet.created_at,
        'twitter_user': tw_user,
        'photo_url': photo_url,
    }
    try:
        t = Tweet.get(Tweet.tw_id == tweet.id)
        self.logger.warning("Got duplicated tw_id on this tweet:")
        self.logger.warning(str(tw_data))
    except Tweet.DoesNotExist:
        tweet_rows.append(tw_data)
    if len(tweet_rows) >= self.TWEET_BATCH_INSERT_COUNT:
        Tweet.insert_many(tweet_rows).execute()
        tweet_rows = []
Just disable the markdown_twitter_hashtags() function: make it return the text without doing the replacement.
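As a rough illustration only (the actual signature in that repository may differ, so treat this as a hedged sketch), a no-op version of the function could look like this:
# Hedged sketch: turn the hashtag-markdown helper into a no-op
# so hashtags are forwarded as plain text instead of Telegram links
def markdown_twitter_hashtags(text):
    return text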

Putting hyperlinks into an HTML table in R

I am a biologist trying to do computer science for research, so I may be a bit naïve. I would like to make a table containing information from a data frame, with a hyperlink in one of the columns. I imagine this needs to be an HTML document(?). I found this post describing how to put a hyperlink into a data frame and write it as an HTML file using googleVis. I would like to use this approach (it is the only one I know, and it seems to work well), except I would like to replace the actual URL with a description. The real motivation is that I would like to include many of these hyperlinks, and the links have long addresses, which makes them difficult to read.
To be verbose, I essentially want to do what I did here, where we read 'here' but 'here' points to
http://stackoverflow.com/questions/8030208/exporting-table-in-r-to-html-with-hyperlinks
From your previous question, you can have another list which contains the titles of the URLs:
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x = gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
Building on Jack's answer but consolidating from different threads:
library(googleVis)
library(R2HTML)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a=c(1,2,3), b=c(4,5,6), url=url)
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)