Trying to grab a string from HTML with Nokogiri - html

I am a student working on my first CLI project with Ruby, and there is a website I am trying to scrape with Nokogiri. The contents of the website are not strictly organized into their own classes/id, but there is some information that I simply cannot figure out how to scrape.
This is what it looks like:
<p>
<strong> First Aired:</strong>
"2017 | "
<strong> Episodes:</strong>
" 24"
<br>
I want to know if there is a way to scrape the string that comes after each "Episodes:" element. The code I tried was:
doc = Nokogiri::HTML(URI.open('https://www.techradar.com/best/best-anime'))
doc.css('p strong')[1].text # <= and that got me "Episodes:"
Then I tried:
doc.css('p strong')[1].next_element # <= and it skipped the string and gave me "<br>"
I also tried the .children method, but that also returned "Episodes:". I think I am confusing a lot of terms, since these methods have no effect on the string. Is it even possible to grab that string with CSS? Lastly, if it is possible, is there a way to grab only the strings after "Episodes:"?
I appreciate any help. I tried to do some research on Nokogiri and CSS, but I think I am confusing a lot of things.

HTML is hierarchical, so for all the elements you pasted, p is the parent, and the others are its children. This is especially apparent if the HTML is properly formatted and indented.
This means that you will find the " 24" under p, like this:
html = <<~STR
<p>
<strong> First Aired:</strong>
"2017 | "
<strong> Episodes:</strong>
" 24"
<br>
STR
html_doc = Nokogiri::HTML.parse(html)
p_element = html_doc.css('p')
p_element.children.map(&:name)
# => ["text", "strong", "text", "strong", "text", "br", "text"]
p_element.children.map(&:to_s)
# => [
# "\n ",
# "<strong> First Aired:</strong>",
# "\n \"2017 | \"\n ",
# "<strong> Episodes:</strong>",
# "\n \" 24\"\n ", <------------ this is what you wanted
# "<br>",
# "\n"
# ]
p_element.children[4]
# => #(Text "\n \" 24\"\n ")
If you want the sibling node immediately after the one that has "Episodes:" in it, one way is to do this:
consecutive_pairs = p_element.children.each_cons(2)
_before, after = consecutive_pairs.detect do |before, _after|
  before.text.include?("Episodes")
end
after
# => #(Text "\n \" 24\"\n ")

Related

Cannot prettify html code with sublime text nor beautiful soup

I am trying to scrape a website for information. I have saved the page I want to scrape as an .html file and opened it with Sublime Text, but some parts cannot be displayed in a prettified way. I have the same problem when trying to use BeautifulSoup; see picture below (I cannot really share the full code since it discloses private info).
Just feed the HTML as a multiline string to a BeautifulSoup object and use soup.prettify(). That should work. However, prettify() indents by only one space per level by default, so if you want a custom indent you can write a little wrapper like this:
def indentPrettify(soup, indent=4):
    # where indent is the desired number of spaces per level, as an int
    pretty_soup = str()
    previous_indent = 0
    # iterate over each line of a prettified soup
    for line in soup.prettify().split("\n"):
        # returns the index of the opening html tag '<',
        # which also represents the number of spaces in the line's indentation
        current_indent = str(line).find("<")
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
            # str.find() will equal -1 when no '<' is found. This means the line is some kind
            # of text or script instead of an HTML element and should be treated as a child
            # of the previous line. Also, current_indent should never be more than previous + 1.
        previous_indent = current_indent
        pretty_soup += writeOut(line, current_indent, indent)
    return pretty_soup


def writeOut(line, current_indent, desired_indent):
    new_line = ""
    spaces_to_add = (current_indent * desired_indent) - current_indent
    if spaces_to_add > 0:
        for i in range(spaces_to_add):
            new_line += " "
    new_line += str(line) + "\n"
    return new_line

How to parse all the text content from the HTML using Beautiful Soup

I want to extract the content of an email message. The body is HTML, and I used BeautifulSoup while fetching the From, To and Subject fields. When fetching the body content, it returns only the first line and leaves out the remaining lines and paragraphs.
What am I missing? How do I read all the lines/paragraphs?
CODE:
email_message = mail.getEmail(unreadId)
print (email_message['From'])
print (email_message['Subject'])
if email_message.is_multipart():
    for payload in email_message.get_payload():
        bodytext = email_message.get_payload()[0].get_payload()
        if type(bodytext) is list:
            bodytext = ','.join(str(v) for v in bodytext)
else:
    bodytext = email_message.get_payload()[0].get_payload()
    if type(bodytext) is list:
        bodytext = ','.join(str(v) for v in bodytext)
print (bodytext)
parsedContent = BeautifulSoup(bodytext)
body = parsedContent.findAll('p').getText()
print body
Console:
body = parsedContent.findAll('p').getText()
AttributeError: 'list' object has no attribute 'getText'
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the remaining lines.
Added:
After getting all the lines from the HTML tags, I get an = symbol at the end of each line, and &nbsp; and &lt; are also displayed. How do I overcome those?
Extracted text:
Dear first,All of us at GenWatt are glad to have xyz as a
customer. I would like to introduce myself as your Account
Manager. Should you = have any questions, please feel free to
call me at or email me at ash= wis@xyz.com. You
can also contact GenWatt on the following numbers: Main:
810-543-1100Sales: 810-545-1222Customer Service & Support:
810-542-1233Fax: 810-545-1001I am confident GenWatt will serve you
well and hope to see our relationship=
Let's inspect the result of parsedContent.findAll('p'):
python -i test.py
----------
import requests
from bs4 import BeautifulSoup
bodytext = requests.get("https://en.wikipedia.org/wiki/Earth").text
parsedContent = BeautifulSoup(bodytext, 'html.parser')
paragraphs = parsedContent.findAll('p')
----------
>>> type(paragraphs)
<class 'bs4.element.ResultSet'>
>>> issubclass(type(paragraphs), list)
True # It's a list
Can you see? It's a list of all paragraphs. If you want to access their content, you need to iterate over the list or access an element by index, as with a normal list.
>>> # You can print all content with a for-loop
>>> for p in paragraphs:
...     print(p.getText())
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
...
>>> # Or you can join all content
>>> content = []
>>> for p in paragraphs:
...     content.append(p.getText())
...
>>> all_content = "\n".join(content)
>>> print(all_content)
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
Using a list comprehension, your code will look like:
parsedContent = BeautifulSoup(bodytext)
body = '\n'.join([p.getText() for p in parsedContent.findAll('p')])
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the
remaining lines.
Calling parsedContent.find('p') is exactly the same as calling parsedContent.findAll('p')[0]:
>>> parsedContent.findAll('p')[0].getText() == parsedContent.find('p').getText()
True

Haskell print the first line into Browser Tab [duplicate]

This question already has an answer here:
Return the first line of a String in Haskell
(1 answer)
Closed 8 years ago.
Just a simple question; my code is otherwise complete. It takes an input file, breaks it into lines, reads it line by line, and does the conversions, which in this case means turning certain things into HTML (e.g. the line #This is a line becomes a header wrapped in H1 tags). The only thing I have left is to take the first line of the file and print it into the browser tab. The body, or tail, must be printed into the window, not the tab. So the first line of my .txt file is The Title!, which I want to show in the tab of the web browser. Here is what I have for that:
formatToHTML :: String -> String
formatToHTML [] = []
formatToHTML x
| head x == --any char = "<title>" ++ head ++ "</title>"
| tail x == --rest of file = "<body>" ++ tail ++ "</tail>"
| otherwise = null
or
formatToHTML :: [String] -> String
formatToHTML = unlines. map (show) "<title>" ++ head ++ </title>" $ lines
I don't want to, or I think even need to, use guards here, but I can't think of a shorter way to do my task.
I would call this from my main function before I output my file to HTML.
Also, I know it's an amateur Haskell question, but how would I represent any char? Say, if the head of x exists, print the head with the title tags and print the tail with body tags. Help? Thank you.
My guess of what you want is:
formatHtml :: [String] -> String
formatHtml [] = ""
formatHtml (x:xs) = unlines theLines
  where theLines = [ "<title>" ++ ...convert x to html... ++ "</title>",
                     "<body>" ] ++ map toHtml xs ++ [ "</body>" ]

toHtml :: String -> String
toHtml str = ...converts str to HTML...
Example:
formatHtml [ "the title", "body line 1", "body line 2" ]
results in:
<title>the title</title>
<body>
body line 1
body line 2
</body>
You still have to define the toHtml function and decide how to convert the first line to the inner html of the tag.

Opening multiple html files & outputting to .txt with Nokogiri

Just wondering if these two functions are to be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"
doc = Nokogiri.parse(open("example.html"))
doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end
doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm just trying to parse a directory of .html documents, get the information from the meta html tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()) maybe it works with ::HTML or ::XML)
Question 2: But more important, is it possible to output all of those meta content outputs into a text file to replace the puts command?
Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / xpath / Ruby.
Thanks.
I have similar code. Please refer to:
require 'csv'
require 'nokogiri'

module MyParser
  HTML_FILE_DIR = 'your html file dir'

  def self.run(options = {})
    # Skip dotfiles as well as the "." and ".." entries.
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }
    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  # Despite the name, this returns an array: one CSV row per document.
  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your select conditions').first.content
    # ... add your selector code (CSS or XPath)
    array
  end

  def self.write_csv(result)
    ::CSV.open('your output file name', 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end

MyParser.run
You can output to a file like so:
File.open('results.txt','w') do |file|
file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")
# Combine both lists before joining so every entry ends up on its own line.
results = (authors.map { |x| "Author: #{x}" } +
           keywrds.map { |x| "Keywords: #{x}" }).join("\n")
File.open('results.txt', 'w') { |f| f << results }

Web Scraping with Nokogiri::HTML and Ruby - Output to CSV issue

I have a script that scrapes HTML article pages of a webshop. I'm testing with a set of 22 pages of which 5 article pages have a product description and the others don't.
This code puts the right info on screen:
if doc.at_css('.product_description')
  doc.css('div > .product_description > p').each do |description|
    puts description
  end
else
  puts "no description"
end
But now I'm stuck on how to get this correctly to output the found product descriptions to an array from where I'm writing them to a CSV file.
Tried several options, but none of them works so far.
If I replace the puts description with @description << description.content, then all the descriptions of the articles end up in the upper lines of the CSV, although they do not belong to the articles in those lines.
When I also replace "no description" with @description = "no description", then the first 14 lines in my CSV receive 1 letter of "no description" each. Looks funny, but it is not exactly what I need.
If more code is needed, just shout!
This is the CSV code I use in the script:
CSV.open("artinfo.csv", "wb") do |row|
  row << ["category", "sub-category", "sub-sub-category", "price", "serial number", "title", "description"]
  (0..@prices.length - 1).each do |index|
    row << [
      @categories[index],
      @subcategories[index],
      @subsubcategories[index],
      @prices[index],
      @serial_numbers[index],
      @title[index],
      @description[index]]
  end
end
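It sounds like your data isn't lined up properly: every array needs exactly one entry per article, in the same order. If it were, you should be able to do the following (transpose raises an IndexError when the arrays differ in length, which makes the mismatch easy to spot):
CSV.open("artinfo.csv", "w") do |csv|
  csv << ["category", "sub-category", "sub-sub-category", "price", "serial number", "title", "description"]
  [@categories, @subcategories, @subsubcategories, @prices, @serial_numbers, @title, @description].transpose.each do |row|
    csv << row
  end
end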
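A minimal sketch of one way to keep the columns aligned: build exactly one row per article page, with a placeholder when the page has no description. The pages array is hypothetical stand-in data for the scraped results, and CSV.generate is used instead of CSV.open only so the sketch runs without touching the file system. It also suggests why the earlier attempts misbehaved: appending every p adds several entries for some pages and none for others, and assigning the string "no description" to @description makes @description[index] return a single character.

```ruby
require 'csv'

# Hypothetical stand-in for the scraped pages. In the real script the
# :descriptions value would come from something like
#   doc.css('div > .product_description > p').map(&:text)
pages = [
  { title: 'Lamp',  price: '19.99',  descriptions: ['Warm light.', 'Dimmable.'] },
  { title: 'Chair', price: '49.00',  descriptions: [] },
  { title: 'Desk',  price: '120.00', descriptions: ['Solid oak.'] }
]

# One row per page: all paragraphs of a page collapse into a single
# description cell, and pages without one get a placeholder.
rows = pages.map do |page|
  description =
    if page[:descriptions].empty?
      'no description'
    else
      page[:descriptions].join(' ')
    end
  [page[:title], page[:price], description]
end

csv_text = CSV.generate do |csv|
  csv << %w[title price description]
  rows.each { |row| csv << row }
end
```

Building whole rows as each page is scraped avoids the parallel arrays altogether, so a missing value can never shift the entries that follow it.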