Scrape Bolded Text with rvest - html

I am scraping in R with rvest. I am trying to extract a price rating shown as dollar signs: either $, $$, or $$$. The site always displays $$$, but bolds 1, 2, or 3 of the symbols in dark black. With my current code, I get $$$ for every single page.
This is an example of the HTML for the site section (this one has one $ bolded in dark black). I am trying to extract one "$" for this example:
<div class = "container">
<div class = "value">
<strong>
" $"
<span>$$</span>
</strong>
</div>
<div class ="label">Price</div>
I am currently using code something like this, looping through thousands of similar pages. All of them pull $$$:
html <- read_html(website) # save html from page
text <- html_elements(html, ".value") %>%
  html_text(trim = TRUE) # save list of text in this section of the page
price <- text[length(text)] # save the price; it is the last text section in the list
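For what it's worth, html_text() returns all descendant text, so the un-bolded <span>$$</span> is concatenated onto the bold " $" and every page comes back as $$$. A minimal sketch of a fix, assuming the bolded symbols are always the direct text node of <strong> and the grey ones sit in the nested <span>:
library(rvest)
library(xml2)

html <- read_html(website)
# keep only the text node that is a direct child of <strong>,
# skipping the un-bolded <span> nested inside it
bold <- html %>%
  html_elements(".value strong") %>%
  xml_find_first("./text()") %>%
  xml_text(trim = TRUE)
price <- bold[length(bold)] # as before, the price is the last match
For the example above this yields "$".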

Related

Python - Beautifulsoup, differentiate parsed text inside of an html element by using internal tags

So, I'm working on an HTML parser to extract some text data from a list of elements and format it before giving an output. I have a title that I need to set as bold, and a description which I'll leave as it is. I've found myself stuck in this situation:
<div class ="Content">
<Strong>Title:</strong>
description
</div>
As you can see, the strings are actually already formatted, but I can't seem to find a way to get the tags and the text out together.
What my script does kinda looks like this:
article = "" #this is where I normally store all the formatted text, it's necessary that I get all the formatted text as one loooong string before I Output
temp1=""
temp2""
result = soup.findAll("div", {"class": "Content"})
if(result!=none):
x=0
for(i in result.find("strong")):
if(x==0):
temp1 = "<strong>" + i.text + "</strong>"
article += temp1
x=1
else:
temp2 = i.nextSibling #I know this is wrong
article += temp2
x = 0
print(article)
It actually throws an AttributeError, but with a misleading message: "Did you call find_all() when you meant to call find()?".
I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...
What I need to get is: "Title: description"
Thanks in advance for any response.
I'm sorry if I couldn't explain very well what I'm trying to accomplish, but it's somewhat convoluted. I actually need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, but I need the text to be formatted in a certain way before uploading it. In this case I would need to get the element inside the tags and format it in a certain way, then do the same with the description and print them one after the other. For example, a request could look like:
http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>
So that the result is:
Title1: description
So what I need to do is to differentiate between the element inside the tag and the one outside of it, using the tag itself as a reference.
EDIT
To select the <strong> use:
strong = soup.select_one('div.Content strong')
and then to select its next sibling:
strong.nextSibling
You may need to strip it to get rid of whitespace:
strong.nextSibling.strip()
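Putting the two snippets together, a minimal sketch (using next_sibling, the current bs4 spelling of nextSibling) prints the title and description in one line:
from bs4 import BeautifulSoup

html = '''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
strong = soup.select_one('div.Content strong')
# the text node right after </strong> holds the description
print(strong.text + ' ' + strong.next_sibling.strip())
# -> Title: description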
Just in case
You can use ANSI escape sequences to print something bold, but I am not sure why you would do that. That is something that should be clarified in your question.
Example
from bs4 import BeautifulSoup
html='''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# get_text(strip=True) returns "Title:description", so split on the colon
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
# \033[1m switches the terminal to bold, \033[0m resets it
print('\033[1m' + text[0] + ': \033[0m' + text[1])
Output
Title: description
You may want to use htql for this. Example:
text="""<div class ="Content">
<Strong>Title:</strong>
description
</div>"""
import htql
ret = htql.query(text, "<div>{ col1=<strong>:tx; col2=<strong>:xx &trim }")
# ret=[('Title:', 'description')]

Use R to extract sections of HTML document using <b> to indicate section header

I have a few thousand large documents saved locally as HTML files. Each document is about 300 pages long and has some sections whose titles are in bold letters. My goal is to do a text search in these files and, when I find the given phrase, extract the whole section that contains it. My idea was to parse the HTML text into a list of paragraphs, find the location of the phrase, and then extract everything from the bold letters just before it (the title of this section) to the bold letters just after it (the title of the next section).
I tried a number of different ways, but none of them does what I want. The following was promising:
library(XML)
myhtmlfile = "I:/myfolder/myfile.html"
myhtmltxt2 = htmlTreeParse(myhtmlfile, useInternal = TRUE)
But while I can display the object "myhtmltxt2", and it looks like HTML with tags (which is what I need so that I can look for "<b>"), it is an external pointer. So I am not able to run the command below, because grep does not work on pointers.
test2 <- grep("myphrase", myhtmltxt2, ignore.case = TRUE)
Alternatively, I did this:
doc.text = unlist(xpathApply(myhtmltxt2, '//p', xmlValue))
test3 <- grep("myphrase", doc.text, ignore.case = TRUE)
But in this case I lost the HTML tags in doc.text, so I no longer have the "<b>" markers I was going to use to delimit the sections to extract. Is there a way of doing this?
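One possible workaround for the pointer complaint (a sketch; saveXML() from the XML package serializes the parsed tree back into a single character string that ordinary pattern matching can search):
library(XML)
myhtmltxt2 <- htmlTreeParse(myhtmlfile, useInternal = TRUE)
# saveXML() turns the external pointer back into one long string
html_string <- saveXML(myhtmltxt2)
grepl("myphrase", html_string, ignore.case = TRUE) # TRUE if the phrase occurs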
I managed this with the following:
singleString <- paste(readLines(myhtmlfile), collapse = " ") # read the whole file as one string
data11 <- strsplit(singleString, "<p><b>", fixed = TRUE)     # split at each bold section title
test2 <- unlist(data11)
myindex <- regexpr("Myphrase </b>", test2)                   # locate the phrase within each chunk
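Since each element of test2 now starts with a section title, one short follow-up (a sketch, assuming the phrase only occurs in the section you want) pulls out the whole section containing the phrase:
# keep the chunk whose text contains the phrase; each chunk runs
# from one bold section title to just before the next one
mysection <- test2[grepl("Myphrase", test2, ignore.case = TRUE)]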

how to remove the content together with the html tag using python if it contains some strings

I am currently processing some data in HTML format. The files look more or less like this (bear in mind that I have already deleted most of the content for the sake of keeping the example short):
<HTML><HEAD>
<TITLE>some header here</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<P>Some contents that I don't want</P>
<PRE> THE HITCHER
A film review by Mark R. Leeper
Copyright 1987 Mark R. Leeper</PRE>
<P>// some body paragraphs that I need</P>
<P> //some body paragraphs that I need</P>
<PRE>tags that I don't want</PRE>
<HR><P CLASS=flush><SMALL>tags that I don't want</SMALL></P>
<P ALIGN=CENTER>tags that I don't want</A></P>
</P></BODY></HTML>
I only want to get the parts marked as <P> //some body paragraphs that I need</P> and read them into my Python program as strings. Yet I find it hard to do. Can anyone help me with it?
If doing so is not easy, at least tell me how to get rid of an entire tag, from the beginning of the tag to the end of the tag including the content, if it contains some substring. In this case that means this block
<PRE> THE HITCHER
A film review by Mark R. Leeper
Copyright 1987 Mark R. Leeper</PRE>
because it contains the keyword "Copyright".
For whom it may concern: the data is from the IMDb database, downloaded from Cornell University's website.
You will need two libraries to pull this off: one to get the page contents from the internet, requests, and another to parse the HTML content, BeautifulSoup. The code below goes out to an example website with basic HTML.
from bs4 import BeautifulSoup
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, 'html.parser')
print "Formatted HTML"
print "*****************"
print soup.prettify()
print "*****************"
p_list = soup.find_all('p')
print p_list
For Python 3 you will need to change the print statements to the Python 3 format, e.g.
print p_list to print(p_list).
Per the question in the comment below.
You can read data from local HTML files without using requests. This is done by simply opening the file and reading the data into a variable for Beautiful Soup to play with. Remember to change the file name inside the open function.
with open('test.html', 'r') as f:
    read_data = f.read()
soup = BeautifulSoup(read_data, 'html.parser')
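For the original question of removing a tag together with its content when it contains a keyword, a hedged sketch using Beautiful Soup's decompose() (assuming, as in the sample, that the unwanted blocks are <pre> tags) would be:
# drop every <pre> that mentions the keyword, tag and content together
for pre in soup.find_all('pre'):
    if 'Copyright' in pre.get_text():
        pre.decompose()
# the body paragraphs that are left can then be read as strings
paragraphs = [p.get_text() for p in soup.find_all('p')]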

Parsing HTML to text with link-tags remaining in R

My problem
I am trying to parse an HTML file (downloaded via the Google Drive API as text/html) into a list in R.
The HTML looks like this (sorry for the German content):
<p style='padding:0;margin:0;color:#000000;font-size:11pt;font-family:"Arial";line-height:1.15;orphans:2;widows:2;text-align:left'>
  <span>text: Das </span>
  <span style="color:#1155cc;text-decoration:underline"><a href="https://www.google.com/url?q=http://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2011/10/rs20111012_2bvr023608.html&sa=D&ust=1503574789125000&usg=AFQjCNE4Ij3mvMX-QttYQYqspAaMxaZaeg" style="color:inherit;text-decoration:inherit">Verfassungsgericht urteilt</a></span>
  <span style='color:#000000;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:11pt;font-family:"Arial";font-style:normal'>, dass eindeutig private Kommunikation von der Überwachung ausgenommen sein muss</span>
</p>
It works well when I just try to extract the text from the xmlValues (XML-library) by using something like:
doc <- htmlParse(html, asText = TRUE)
text <- xpathSApply(doc, "//text()", xmlValue)
But in my case, I need to retain the links (<a>-tags) in the HTML file, and delete the https://www.google.com/url?q=-part. So I want to get rid of all styling and keep only the text + the link-tags.
What I tried so far
I tried to get both of the nodes by using //(p | a) in the XPath, but it didn't work.
I prefer to use the rvest package instead of XML.
In this code I use the rvest package to parse the HTML and extract the links from the page. Then, using the stringr package, I split each link at the ?q= part and return the back half of the original link.
library(rvest)
library(stringr)

# read the html file,
page <- read_html("sample.txt")
# then find the link nodes and extract the attribute text (i.e. the links)
link <- page %>% html_nodes("a") %>% html_attr("href")
# return the second string of the first list element
# (use sapply if there is more than 1 link in the document)
desiredlink <- str_split(link, "\\?q=")[[1]][2]
# find the text in all of the span nodes
span_text <- page %>% html_nodes("span") %>% html_text()
# or this for the text under the p nodes
p_text <- page %>% html_nodes("p") %>% html_text()
I have your sample html code from above saved to the file: "sample.txt"
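The question also asks to keep the <a> tags themselves while dropping the Google-redirect prefix. A possible extension (my own sketch, assuming every href goes through the ?q= redirect) rewrites each href in place with xml2 and then serializes the paragraphs back to HTML:
library(rvest)
library(xml2)
library(stringr)

page <- read_html("sample.txt")
for (a in html_nodes(page, "a")) {
  # replace the redirect URL with the real target after "?q="
  xml_attr(a, "href") <- str_split(xml_attr(a, "href"), "\\?q=")[[1]][2]
}
# serialize the paragraphs back to HTML, link tags kept, now direct
as.character(html_nodes(page, "p"))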

Loop over all the <dd> tags and extract specific information via Mechanize/Nokogiri

I know the basics of accessing a website and so on (I just started learning yesterday); however, now I want to extract data. I checked out many Mechanize/Nokogiri tutorials, but each of them did things a different way, which confused me. I want one direct, clear way of doing this:
I have this website: http://openie.allenai.org/sentences/rel=contains&arg2=antioxidant&title=Green+tea
and I want to extract certain things in a structured way. If I inspect an element of this webpage and go to the body, I see many <dd>..</dd> elements under the <dl class="dl-horizontal">. Each one of them has an <a> part which contains an href. I would like to extract this href and the bold parts of the text, e.g. <b>green tea</b>.
I created a simple structure:
info = Struct.new(:ObjectID, :SourceID), so that from each of these <dd> elements I can add the bold text as the object ID and the href as the source ID.
This is the start of the code I have; just retrieval, no extraction yet:
agent = Mechanize.new { |agent| agent.user_agent_alias = "Windows Chrome" }
html = agent.get('http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green+tea').body
html_doc = Nokogiri::HTML(html)
The other thing is that I am confused about whether to use Nokogiri directly or through Mechanize. The problem is that there isn't enough documentation for Mechanize, so I was thinking of using Nokogiri separately.
For now I would like to know how to loop through these and extract the info.
Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:
require 'nokogiri'
require 'open-uri'
url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
doc = Nokogiri::HTML(open(url))
doc.xpath('//dd/*/a').each do |a|
  text = a.xpath('.//b').map { |b| b.text.gsub(/\s+/, ' ').strip }
  href = a['href']
  puts "OK: text=#{text.inspect}, href=#{href.inspect}"
end
# OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"
In a nutshell, this solution uses XPath in two places:
Initially to find every a element underneath each dd element.
Then to find each b element inside the a elements from #1 above.
The final trick is cleaning up the text within the b elements into something presentable; of course, you might want it to look different somehow.
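To tie this back to the Struct from the question, a sketch along these lines could collect the records (URI.open is my assumption for current Rubies; older versions use the bare open from open-uri, as above):
require 'nokogiri'
require 'open-uri'

Info = Struct.new(:ObjectID, :SourceID)

url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
doc = Nokogiri::HTML(URI.open(url))

records = doc.xpath('//dd/*/a').map do |a|
  bold = a.xpath('.//b').map { |b| b.text.gsub(/\s+/, ' ').strip }
  Info.new(bold, a['href']) # bold text -> ObjectID, href -> SourceID
end
records.each { |r| puts "#{r.ObjectID.inspect} <- #{r.SourceID}" }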