doc.xpath('//img')                        # this returns some results
doc.xpath('//img[@class="productImage"]') # but this returns nothing at all
doc.xpath('//div[@id="someID"]')          # and this won't work either
I don't know what went wrong here. I double-checked the HTML source, and there are plenty of img tags that contain the attribute class="productImage".
It's as if the attribute selector just won't work.
Here is the URL the HTML source comes from:
http://www.amazon.cn/s/ref=nb_sb_ss_i_0_1?__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&url=search-alias%3Daps&field-keywords=%E4%B8%93%E5%85%AB&x=0&y=0&sprefix=%E4%B8%93
Please do me a favor if you have some spare time: parse the HTML content like I do and see if you can solve this one.
The weird thing is that if you use open-uri on that page, you get a different result than with something like curl or wget.
However, when you change the User-Agent, you get the page you are probably looking for:
Analysis:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'
URL = 'http://www.amazon.cn/...'
def analyze_html(file)
  doc = Nokogiri.HTML(file)
  pp doc.xpath('//img').map { |i| i[:class] }.compact.reject(&:empty?)
  puts doc.xpath('//div').map { |i| i[:class] }.grep(/productImage/).count
  puts doc.xpath('//div[@class="productImage"]//img').count
  pp doc.xpath('//div[@class="productImage"]//img').map { |i| i[:src] }
end
puts "Attempt 1:"
analyze_html(open(URL))
puts "Attempt 2:"
analyze_html(open(URL, "User-Agent" => "Wget/1.10.2"))
Output:
Attempt 1:
["default navSprite"]
0
0
[]
Attempt 2:
["default navSprite", "srSprite spr_kindOfSortBtn"]
16
16
["http://ec4.images-amazon.com/images/I/51fOb3ujSjL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/513UQ1xiaSL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/41zKxWXb8HL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51bj6XXAouL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/516GBhDTGCL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51ADd3HSE6L._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51CbB-7kotL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51%2Bw40Mk51L._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/519Gny1LckL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51Dv6DUF-WL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51uuy8yHeoL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51T0KEjznqL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/419WTi%2BdjzL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51QTg4ZmMmL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51l--Pxw9TL._AA115_.jpg",
"http://ec4.images-amazon.com/images/I/51gehW2qUZL._AA115_.jpg"]
Solution:
Use User-Agent: Wget/1.10.2
Use xpath('//div[@class="productImage"]//img')
Trying to display second level information about characters from this Futurama API.
Currently using this code to get information:
def self.character
  uri = URI.parse(URL)
  response = Net::HTTP.get_response(uri)
  data = JSON.parse(response.body)
  data.each do |c|
    Character.new(c["name"], c["gender"], c["species"], c["homePlanet"], c["occupation"], c["info"], c["sayings"])
  end
end
I'm then stuck on returning gender and species from the nested hash (for characters with id > 8) versus the original top-level hash (for characters with id < 8) when using this code:
def character_details(character)
  puts "Name: #{character.name["first"]} #{character.name["middle"]} #{character.name["last"]}"
  puts "Species: #{character.info["species"]}"
  puts "Occupation: #{character.homePlanet}"
  puts "Gender: #{character.info["gender"]}"
  puts "Quotes:"
  character.sayings.each_with_index do |s, i|
    puts "#{i + 1}. #{s} "
  end
end
Not sure where or what logic to use to get the correct information to display.
Maybe you have a problem when saving c["info"] in Character.new(c["name"], c["gender"], c["species"], c["homePlanet"], c["occupation"], c["info"], c["sayings"]).
I ran your code, and I see that info does not exist in the API response; gender should be accessed as character.gender:
irb(main):037:0> character.gender
=> "Male"
irb(main):039:0> character.species
=> "Human"
I don't understand this comment: "(if character id > 8) or the original hash (character id < 8)". Can you explain what you need to do?
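That said, if some records really do nest these fields in a second-level hash while others keep them at the top level, one way to cope is a small fallback accessor. This is a hedged sketch: the key names and both record shapes are assumptions taken from the question, not verified against the API.

```ruby
# Assumed record shapes: either flat ({"gender" => "Male"}) or
# nested under "info" ({"info" => {"gender" => "Male"}}).
def field(record, key)
  record[key] || record.dig("info", key)
end

flat   = { "gender" => "Male",   "species" => "Human" }
nested = { "info" => { "gender" => "Female", "species" => "Robot" } }

field(flat, "gender")    # => "Male"
field(nested, "species") # => "Robot"
```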
I searched online and read through the documentation, but I have not been able to find an answer. I am fairly new, and as part of learning Ruby I wanted to write the script below.
The script essentially does a carrier lookup on a list of numbers provided through a CSV file. The CSV file has just one column, with the header "number".
Everything runs fine UNTIL the API gives me an output that is different from the others. In this example, it tells me that one of the numbers in my file is not a valid US number, which causes my script to stop running.
I am looking for a way to either ignore the error (I read about begin/rescue/end but was not able to get it to work) or, ideally, write those errors to a separate file or put the data into the main file.
Any help would be much appreciated. Thank you.
Ruby Code:
require 'csv'
require 'uri'
require 'net/http'
require 'json'
number = 0
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  uri = URI("https://api.message360.com/api/v3/carrier/lookup.json?PhoneNumber=#{number}")
  req = Net::HTTP::Post.new(uri)
  req.basic_auth 'XXX', 'XXX'
  res = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => true) { |http|
    http.request(req)
  }
  json = JSON.parse(res.body)
  new = json["Message360"]["Carrier"].values
  CSV.open("new.csv", "ab") do |csv|
    csv << new
  end
end
File Data:
number
5556667777
9998887777
Good Response example in JSON:
{"Message360"=>{"ResponseStatus"=>1, "Carrier"=>{"ApiVersion"=>"3", "CarrierSid"=>"XXX", "AccountSid"=>"XXX", "PhoneNumber"=>"+19495554444", "Network"=>"Cellco Partnership dba Verizon Wireless - CA", "Wireless"=>"true", "ZipCode"=>"92604", "City"=>"Irvine", "Price"=>0.0003, "Status"=>"success", "DateCreated"=>"2018-05-15 23:05:15"}}}
The response that causes Script to stop:
{
"Message360": {
"ResponseStatus": 0,
"Errors": {
"Error": [
{
"Code": "ER-M360-CAR-111",
"Message": "Allowed Only Valid E164 North American Numbers.",
"MoreInfo": []
}
]
}
}
}
It would appear you can just check json["Message360"]["ResponseStatus"] first for a 0 or 1 to indicate failure or success.
I'd probably add a rescue to help catch any other errors (malformed JSON, network issue, etc.)
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  ...
  json = JSON.parse(res.body)
  if json["Message360"]["ResponseStatus"] == 1
    new = json["Message360"]["Carrier"].values
    CSV.open("new.csv", "ab") do |csv|
      csv << new
    end
  else
    # handle bad response
  end
rescue StandardError => e
  # request failed for some reason; log e and the number?
end
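To also get the separate error file the question asks about, the branching can be pulled into a helper along these lines. The file names and the error-hash path are assumptions based on the two sample responses shown above:

```ruby
require 'csv'

# Append successful lookups to new.csv and failed ones to errors.csv
# (both file names are just examples). The hash paths follow the
# sample success and failure responses in the question.
def record_lookup(number, json)
  if json["Message360"]["ResponseStatus"] == 1
    CSV.open("new.csv", "ab") { |csv| csv << json["Message360"]["Carrier"].values }
  else
    message = json.dig("Message360", "Errors", "Error", 0, "Message")
    CSV.open("errors.csv", "ab") { |csv| csv << [number, message] }
  end
end
```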
I'm wondering if I could get JRuby-internal Java objects (e.g. org.jruby.RubyString, org.jruby.RubyTime) in Ruby code, and call their Java methods from Ruby. Does anyone know how to do it?
str = "foobar"
rubystring_str = str.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyString.html#getEncoding()
rubystring_str.getEncoding() # Java::org.jcodings.Encoding
# http://jruby.org/apidocs/org/jruby/RubyString.html#getBytes()
rubystring_str.getBytes() # [Java::byte]
time = Time.now
rubytime_time = time.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyTime.html#getDateTime()
rubytime_time.getDateTime() # Java::org.joda.time.DateTime
I know I can do this using Java code as below, but here I'd like to do it purely in Ruby.
public org.joda.time.DateTime getJodaDateTime(RubyTime rubytime) {
    return rubytime.getDateTime();
}
Ah, I found the answer through trial and error. The following works:
"foobar".to_java(Java::org.jruby.RubyString).getEncoding()
Time.now.to_java(Java::org.jruby.RubyTime).getDateTime()
Just wondering whether these two tasks should be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"
doc = Nokogiri.parse(open("example.html"))
doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end
doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm trying to parse a directory of .html documents, get the information from the meta tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()); maybe it works with ::HTML or ::XML).
Question 2: More importantly, is it possible to output all of those meta content values into a text file, replacing the puts command?
Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / xpath / Ruby.
Thanks.
I have some similar code. Please refer to:
module MyParser
  HTML_FILE_DIR = 'your html file dir'

  def self.run(options = {})
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }
    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your select conditions').first.content
    # ... add your selector code (css or xpath)
    array
  end

  def self.write_csv(result)
    ::CSV.open('your output file name', 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end

MyParser.run
You can output to a file like so:
File.open('results.txt','w') do |file|
file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")
results = authors.map { |x| "Author: #{x}" }.join("\n") +
          keywrds.map { |x| "Keywords: #{x}" }.join("\n")
File.open('results.txt', 'w') { |f| f << results }
With Ruby 1.8, FeedTools is able to get and parse rss/atom feed links given a non-feed link. For eg:
ruby-1.8.7-p174 > f = FeedTools::Feed.open("http://techcrunch.com/")
=> #<FeedTools::Feed:0xc99cf8 URL:http://feeds.feedburner.com/TechCrunch>
ruby-1.8.7-p174 > f.title
=> "TechCrunch"
Whereas, with JRuby 1.5.2, FeedTools is unable to get and parse rss/atom feed links given a non-feed link. For eg:
jruby-1.5.2 > f = FeedTools::Feed.open("http://techcrunch.com/")
=> #<FeedTools::Feed:0x1206 URL:http://techcrunch.com/>
jruby-1.5.2 > f.title
=> nil
At times, it also gives the following error:
FeedTools::FeedAccessError: [URL] does not appear to be a feed.
Any ideas on how I can get FeedTools to work with JRuby?
There seems to be a bug in the feedtools gem. In the method that locates feed links with a given MIME type, replacing lambda with Proc.new makes the method return from inside the proc when the feed link is found.
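The reason this fixes it: return inside a Proc.new block returns from the method enclosing it, whereas inside a lambda it only returns from the lambda itself. A minimal illustration (method names are made up):

```ruby
def find_with_lambda
  search = lambda do |items|
    items.each { |i| return i if i > 2 } # returns only from the lambda
    nil
  end
  search.call([1, 2, 3, 4]) # evaluates to 3, but the method keeps going
  :reached_end_of_method
end

def find_with_proc
  search = Proc.new do |items|
    items.each { |i| return i if i > 2 } # returns from find_with_proc itself
    nil
  end
  search.call([1, 2, 3, 4])
  :reached_end_of_method # never reached
end

find_with_lambda # => :reached_end_of_method
find_with_proc   # => 3
```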
--- a/feedtools-0.2.29/lib/feed_tools/helpers/html_helper.rb
+++ b/feedtools-0.2.29/lib/feed_tools/helpers/html_helper.rb
@@ -620,7 +620,7 @@
end
end
get_link_nodes.call(document.root)
- process_link_nodes = lambda do |links|
+ process_link_nodes = Proc.new do |links|
for link in links
next unless link.kind_of?(REXML::Element)
if link.attributes['type'].to_s.strip.downcase ==