Nokogiri truncating HTML for some pages, not others


We are using Nokogiri to parse data from iTunes. On some pages, it works. On others, it fails and truncates the page mysteriously.
Our code:
# Get iTunes HTML for app bundle
itunes_url = 'https://itunes.apple.com/us/app-bundle/id918236019'
uri = URI.parse itunes_url
http = Net::HTTP.new uri.host, uri.port
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
req = Net::HTTP::Get.new uri.request_uri
resp = http.request req
# Convert HTML into XML for parsing
bundle_xml = Nokogiri.XML resp.body
bundle_xml.remove_namespaces!
#puts "ERRORS: #{bundle_xml.errors}"
puts "ORIGINAL\n=============\n#{resp.body}\n\n\n\n============="
puts "NOKO\n=============\n#{bundle_xml}"
For other iTunes pages, Nokogiri reports errors but still parses the page properly. For this page, most of the elements after a certain element are missing from Nokogiri's output.
Resp.body output: https://gist.github.com/anonymous/33ecfe82e3d22a39375a
Nokogiri output: https://gist.github.com/anonymous/7622ef92bf430889b9f4
i18n (0.6.11, 0.6.9, 0.6.5, 0.6.4, 0.6.1)
io-console (0.3)
journey (1.0.4)
jquery-rails (3.1.1, 3.1.0, 3.0.4, 2.2.1)
json (1.8.1, 1.8.0, 1.7.7, 1.5.5)
kgio (2.8.1, 2.8.0)
mail (2.4.4)
mime (0.4.0, 0.2.0, 0.1)
mime-types (1.25.1, 1.25, 1.24, 1.23, 1.21)
mini_portile (0.6.0, 0.5.3, 0.5.2, 0.5.1)
minitest (2.5.1)
mongo (1.10.0, 1.9.2, 1.9.1)
mongo_mapper (0.12.0)
mongoid (3.1.6)
moped (1.5.2, 1.5.1)
multi_json (1.10.1, 1.9.2, 1.9.0, 1.8.4, 1.8.2, 1.8.0, 1.7.9, 1.7.7, 1.6.1)
mysql2 (0.3.16, 0.3.15)
newrelic_rpm (3.9.0.229, 3.7.3.204, 3.7.2.192, 3.6.6.147)
nokogiri (1.6.1, 1.6.0)

By trying this myself, I think I see the problem. You're parsing an HTML response.
Change
bundle_xml = Nokogiri.XML resp.body
to:
bundle_xml = Nokogiri.HTML resp.body
and see if this works for you.
The HTML parser is much more lenient and handles missing closing tags, etc.

The HTML has invalid markup. Nokogiri says:
>> @doc.errors
[
[0] #<Nokogiri::XML::SyntaxError: htmlParseEntityRef: no name>,
[1] #<Nokogiri::XML::SyntaxError: htmlParseEntityRef: no name>,
[2] #<Nokogiri::XML::SyntaxError: Tag nav invalid>
]
That can make content disappear as Nokogiri attempts to fix up the HTML to make it usable.

Related

Issue with generating a PDF file using pdfkit in ruby rails application

I am trying to generate a basic PDF file with some data in it.
html = ApplicationController.new.render_to_string("invoices/invoice", layout: false)
kit = PDFKit.new(html, :page_size => 'Letter')
pdf = kit.to_file("#{Rails.root}/")
How can I solve the error below? Or is there a better example that I can use as a reference?
2.7.4 :123 > html = ApplicationController.new.render_to_string("invoices/invoice", layout: false)
Rendering invoices/invoice.html.erb
Rendered invoices/invoice.html.erb (Duration: 0.0ms | Allocations: 4)
=> "<div style=\"width:100%;\">\n <p> Testing PDF Generation</p>\n</div>\n"
2.7.4 :124 > kit = PDFKit.new(html, :page_size => 'Letter')
=> #<PDFKit:0x0000000107ed4290 @source=#<PDFKit::Source:0x0000000107ed4268 @source="<div style=\"width:100%;\">\n <p> Testin...
2.7.4 :125 > pdf = kit.to_file("#{Rails.root}/")
/private/var/nameee/Desktop/test/test_application/bin/wkhtmltopdf: /private/var/nameee/Desktop/test/test_application/bin/wkhtmltopdf: cannot execute binary file
Traceback (most recent call last):
1: from (irb):125
PDFKit::ImproperWkhtmltopdfExitStatus (Command failed (exitstatus=126): /private/var/nameee/Desktop/test/test_application/bin/wkhtmltopdf --quiet --page-size Letter --margin-top 0.75in --margin-right 0.75in --margin-bottom 0.75in --margin-left 0.75in --encoding UTF-8 - /private/var/nameee/Desktop/test/test_application/)
More Info:
ruby 2.7.4p191
Rails 6.1.6.1
gem 'pdfkit', '~> 0.8.2'

Getting JRuby-internal Java object from Ruby code

I'm wondering if I could get JRuby-internal Java objects (e.g. org.jruby.RubyString, org.jruby.RubyTime) in Ruby code, and call their Java methods from Ruby. Does anyone know how to do it?
str = "foobar"
rubystring_str = str.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyString.html#getEncoding()
rubystring_str.getEncoding() # Java::org.jcodings.Encoding
# http://jruby.org/apidocs/org/jruby/RubyString.html#getBytes()
rubystring_str.getBytes() # [Java::byte]
time = Time.now
rubytime_time = time.toSomethingConversion # <== What I want
# http://jruby.org/apidocs/org/jruby/RubyTime.html#getDateTime()
rubytime_time.getDateTime() # Java::org.joda.time.DateTime
I know I can do this with Java code, as below, but here I'd like to do it purely in Ruby.
public org.joda.time.DateTime getJodaDateTime(RubyTime rubytime) {
return rubytime.getDateTime();
}

Ah, I found the answer through trial and error.
The following works.
"foobar".to_java(Java::org.jruby.RubyString).getEncoding()
Time.now.to_java(Java::org.jruby.RubyTime).getDateTime()

Grabbing specific values from JSON

Here is what I'm trying to do. I'm building a simple Ruby file that will ask the user for input (a city) and then return weather results for that city. I've never written Ruby, nor have I ever used APIs, but here is my attempt.
The API response below:
> {"coord"=>{"lon"=>-85.68, "lat"=>40.11}, "weather"=>[{"id"=>501,
> "main"=>"Rain", "description"=>"moderate rain", "icon"=>"10d"}],
> "base"=>"stations", "main"=>{"temp"=>57.78, "pressure"=>1009,
> "humidity"=>100, "temp_min"=>57, "temp_max"=>60.01},
> "wind"=>{"speed"=>5.17, "deg"=>116.005}, "rain"=>{"1h"=>1.02},
> "clouds"=>{"all"=>92}, "dt"=>1475075671, "sys"=>{"type"=>3,
> "id"=>187822, "message"=>0.1645, "country"=>"US",
> "sunrise"=>1475062634, "sunset"=>1475105280}, "id"=>4917592,
> "name"=>"Anderson", "cod"=>200} [Finished in 2.0s]
The Ruby file below:
require 'net/http'
require 'json'
url = 'http://api.openweathermap.org/data/2.5/weather?q=anderson&APPID=5c89010425b4d730b7558f57234ea3c8&units=imperial'
uri = URI(url)
response = Net::HTTP.get(uri)
parsed = JSON.parse(response)
puts parsed #Print this so I can see results
puts temp = JSON.parse(response)['main']['temp']
puts desc = JSON.parse(response)['weather']['description']
puts humid = JSON.parse(response)['main']['humidity']
puts wind = JSON.parse(response)['wind']['speed']
What I was trying to do was pull out only a few items: temperature, description, humidity, and wind. But I can't seem to get it right. I keep getting undefined-method errors with each attempt.
(I want to complete this without using gems or anything that isn't already built into Ruby. I have not written the parts for user input yet.)

Your problem is that parsed['weather'] is an array, so you can't access ['weather']['description'] directly. Index into the array first, for example ['weather'][0]['description'].
2.3.0 :020 > puts parsed['weather'][0]['description']
moderate rain
2.3.0 :021 > puts parsed['main']['humidity']
100
2.3.0 :022 > puts parsed['wind']['speed']
5.17
2.3.0 :025 > puts parsed['main']['temp']
58.8
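Putting that together, here is a small sketch that parses the response once and picks out the four values. It uses only the standard library. The JSON string is a trimmed copy of the response in the question, and weather_summary is a hypothetical helper name chosen for illustration.

```ruby
require 'json'

# Trimmed copy of the API response from the question, as raw JSON text.
body = <<~JSON
  {"weather":[{"id":501,"main":"Rain","description":"moderate rain","icon":"10d"}],
   "main":{"temp":57.78,"humidity":100},
   "wind":{"speed":5.17,"deg":116.005},
   "name":"Anderson"}
JSON

# Hypothetical helper: parse once, then read each field.
# "weather" is an array, so index 0 is needed before "description".
def weather_summary(json_text)
  data = JSON.parse(json_text)
  {
    temp:        data.dig('main', 'temp'),
    description: data.dig('weather', 0, 'description'),
    humidity:    data.dig('main', 'humidity'),
    wind:        data.dig('wind', 'speed')
  }
end

summary = weather_summary(body)
summary.each { |name, value| puts "#{name}: #{value}" }
```

Hash#dig returns nil instead of raising when a key is missing, which is handy if the API omits a field for some cities.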

How to enable automatic code reloading in Rails

Is there a way to do 'hot code reloading' with a Rails application in the development environment?
For example: I'm working on a Rails application, I add a few lines of css in a stylesheet, I look at the browser to see the modified styling. As of right now I have to refresh the page with cmd-r or by clicking the refresh button.
Is there a way to get the page to reload automatically when changes are made?
This works nicely in the Phoenix web framework (and I'm sure Phoenix isn't the only framework with this feature). How could a feature like this be enabled in Ruby on Rails?

I use the setup below; it reloads on changes to assets, JS, CSS, and Ruby files.
In the Gemfile:
group :development, :test do
gem 'guard-livereload', '~> 2.5', require: false
end
group :development do
gem 'listen'
gem 'guard'
gem 'guard-zeus'
gem 'rack-livereload'
end
Insert this in your development.rb:
config.middleware.insert_after ActionDispatch::Static, Rack::LiveReload
I have this in my Guardfile:
# A sample Guardfile
# More info at https://github.com/guard/guard#readme
## Uncomment and set this to only include directories you want to watch
# directories %w(app lib config test spec features) \
# .select{|d| Dir.exists?(d) ? d : UI.warning("Directory #{d} does not exist")}
## Note: if you are using the `directories` clause above and you are not
## watching the project directory ('.'), then you will want to move
## the Guardfile to a watched dir and symlink it back, e.g.
#
# $ mkdir config
# $ mv Guardfile config/
# $ ln -s config/Guardfile .
#
# and, you'll have to watch "config/Guardfile" instead of "Guardfile"
guard 'livereload' do
extensions = {
css: :css,
scss: :css,
sass: :css,
js: :js,
coffee: :js,
html: :html,
png: :png,
gif: :gif,
jpg: :jpg,
jpeg: :jpeg,
# less: :less, # uncomment if you want LESS stylesheets done in browser
}
rails_view_exts = %w(erb haml slim)
# file types LiveReload may optimize refresh for
compiled_exts = extensions.values.uniq
watch(%r{public/.+\.(#{compiled_exts * '|'})})
extensions.each do |ext, type|
watch(%r{
(?:app|vendor)
(?:/assets/\w+/(?<path>[^.]+) # path+base without extension
(?<ext>\.#{ext})) # matching extension (must be first encountered)
(?:\.\w+|$) # other extensions
}x) do |m|
path = m[1]
"/assets/#{path}.#{type}"
end
end
# file needing a full reload of the page anyway
watch(%r{app/views/.+\.(#{rails_view_exts * '|'})$})
watch(%r{app/helpers/.+\.rb})
watch(%r{config/locales/.+\.yml})
end
guard 'zeus' do
require 'ostruct'
rspec = OpenStruct.new
# rspec.spec_dir = 'spec'
# rspec.spec = ->(m) { "#{rspec.spec_dir}/#{m}_spec.rb" }
# rspec.spec_helper = "#{rspec.spec_dir}/spec_helper.rb"
# matchers
# rspec.spec_files = /^#{rspec.spec_dir}\/.+_spec\.rb$/
# Ruby apps
ruby = OpenStruct.new
ruby.lib_files = /^(lib\/.+)\.rb$/
# watch(rspec.spec_files)
# watch(rspec.spec_helper) { rspec.spec_dir }
# watch(ruby.lib_files) { |m| rspec.spec.call(m[1]) }
# Rails example
rails = OpenStruct.new
rails.app_files = /^app\/(.+)\.rb$/
rails.views_n_layouts = /^app\/(.+(?:\.erb|\.haml|\.slim))$/
rails.controllers = %r{^app/controllers/(.+)_controller\.rb$}
# watch(rails.app_files) { |m| rspec.spec.call(m[1]) }
# watch(rails.views_n_layouts) { |m| rspec.spec.call(m[1]) }
# watch(rails.controllers) do |m|
# [
# rspec.spec.call("routing/#{m[1]}_routing"),
# rspec.spec.call("controllers/#{m[1]}_controller"),
# rspec.spec.call("acceptance/#{m[1]}")
# ]
# end
end
I am using zeus instead of spring in this setup.
Run guard.
Open localhost:3000 and you are good to go.
This should answer your question, and the reload times are blazing fast, better than browserify.
I commented out the lines that make guard watch the test directories; uncomment them if you are doing TDD.
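To see what the asset watch block in the Guardfile above does with a changed file, here is a small sketch in plain Ruby. compiled_asset_path is a hypothetical helper, not part of Guard; it reuses the same regex with a subset of the extensions map and returns the URL that the browser would be asked to refresh.

```ruby
# Subset of the extensions map from the Guardfile: source type => compiled type.
EXTENSIONS = { css: :css, scss: :css, sass: :css, js: :js, coffee: :js }.freeze

# Hypothetical helper (not part of Guard): map a changed source file to the
# compiled asset URL, or nil if the file is not a watched asset.
def compiled_asset_path(changed_file)
  EXTENSIONS.each do |ext, type|
    pattern = %r{
      (?:app|vendor)
      (?:/assets/\w+/(?<path>[^.]+)  # path and base name without extension
      (?<ext>\.#{ext}))              # first extension after the base name
      (?:\.\w+|$)                    # any further extensions
    }x
    match = pattern.match(changed_file)
    return "/assets/#{match[:path]}.#{type}" if match
  end
  nil
end

puts compiled_asset_path('app/assets/stylesheets/application.scss')
puts compiled_asset_path('app/assets/javascripts/cart.js.coffee')
```

Because the compiled type is sent instead of the source type, an edit to a .scss file can be applied as a stylesheet swap without a full page reload.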

CSS hot swapping and auto-reload when HTML/JS changes can be achieved with guard in combination with livereload: https://github.com/guard/guard-livereload

This gem auto-reloads when you make changes to JS elements (not CSS or Ruby files):
https://github.com/rmosolgo/react-rails-hot-loader
I have never seen CSS hot code reloading on the Rails platform.

FeedTools behaves differently with JRuby

With Ruby 1.8, FeedTools is able to find and parse RSS/Atom feed links when given a non-feed link. For example:
ruby-1.8.7-p174 > f = FeedTools::Feed.open("http://techcrunch.com/")
=> #<FeedTools::Feed:0xc99cf8 URL:http://feeds.feedburner.com/TechCrunch>
ruby-1.8.7-p174 > f.title
=> "TechCrunch"
With JRuby 1.5.2, however, FeedTools is unable to find and parse RSS/Atom feed links when given a non-feed link. For example:
jruby-1.5.2 > f = FeedTools::Feed.open("http://techcrunch.com/")
=> #<FeedTools::Feed:0x1206 URL:http://techcrunch.com/>
jruby-1.5.2 > f.title
=> nil
At times, it also gives the following error:
FeedTools::FeedAccessError: [URL] does not appear to be a feed.
Any ideas on how I can get FeedTools to work with JRuby?

There seems to be a bug in the feedtools gem. In the method that locates feed links with a given MIME type, replace 'lambda' with 'Proc.new' so that the 'return' inside the block returns from the enclosing method once the feed link is found.
--- a/feedtools-0.2.29/lib/feed_tools/helpers/html_helper.rb
+++ b/feedtools-0.2.29/lib/feed_tools/helpers/html_helper.rb
@@ -620,7 +620,7 @@
end
end
get_link_nodes.call(document.root)
- process_link_nodes = lambda do |links|
+ process_link_nodes = Proc.new do |links|
for link in links
next unless link.kind_of?(REXML::Element)
if link.attributes['type'].to_s.strip.downcase ==
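The reason the patch matters is how 'return' behaves in each kind of block. The sketch below is not FeedTools code; find_with_lambda and find_with_proc are hypothetical stand-ins that scan a list and try to return the first entry that looks like a feed link.

```ruby
# Hypothetical stand-ins for the FeedTools method, for illustration only.

def find_with_lambda(links)
  check = lambda do |list|
    list.each { |link| return link if link.end_with?('.xml') }
  end
  check.call(links) # `return` only leaves the lambda; its value is discarded
  nil               # so the method always falls through to here
end

def find_with_proc(links)
  check = Proc.new do |list|
    list.each { |link| return link if link.end_with?('.xml') }
  end
  check.call(links) # `return` leaves find_with_proc itself with the link
  nil               # reached only when no link matched
end

links = ['http://example.com/index.html', 'http://example.com/feed.xml']
puts find_with_lambda(links).inspect
puts find_with_proc(links).inspect
```

With the lambda version the match is found but thrown away, which is consistent with the feed URL never being picked up in the question.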
