Ruby - unable to write using nokogiri - html

I am searching for a <div> element by its class name and want to insert it next to another <div> element. The following code does not write the data I get from doc1.search.
require 'nokogiri'
doc1 = Nokogiri::HTML(File.open("overview.html"))
affixButtons = doc1.search('div.margin-0-top-lg.margin-10-bottom-lg.text-center')
doc1.at('div.leftnav-btn').add_next_sibling(affixButtons)
Can someone suggest what I'm missing?

Your code works just fine. If you would like to write the edited data to a file, use File.open as follows:
require 'nokogiri'
doc1 = Nokogiri::HTML(File.open("overview.html"))
affixButtons = doc1.search('div.margin-0-top-lg.margin-10-bottom-lg.text-center')
doc1.at('div.leftnav-btn').add_next_sibling(affixButtons)
File.open('output.html', 'w') {|f| f.write(doc1.to_html)}

You're not saving the resulting HTML to a file; you can do it like so:
File.open("result.html", "w"){|f| f.write(doc1.to_html)}

Related

Ruby Nokogiri take all the content

I'm working on a scraping project but I've hit a problem:
I want to get all the data from https://coinmarketcap.com/all/views/all/ with Nokogiri,
but I only get 20 crypto names out of the 200 that the page loads.
The code:
require 'nokogiri'
require 'open-uri'
require 'rubygems'

def scrapper
  return doc = Nokogiri::HTML(URI.open('https://coinmarketcap.com/all/views/all/'))
end

def fusiontab(tab1, tab2)
  return Hash[tab1.zip(tab2)]
end

def crypto(page)
  array_name = []
  array_value = []
  name_of_crypto = page.xpath('//tr//td[3]')
  value_of_crypto = page.xpath('//tr//td[5]')
  hash = {}
  name_of_crypto.each { |name|
    array_name << name.text
  }
  value_of_crypto.each { |price|
    array_value << price.text
  }
  hash = fusiontab(array_name, array_value)
  return hash
end

puts crypto(scrapper)
Can you help me get all the cryptocurrencies?
The URL you're using does not generate all the data as HTML; a lot of it is rendered after the page has been loaded.
Looking at the source code for the page, it appears that the data is rendered from a JSON script, embedded in the page.
It took quite some time to find the objects and work out which part of the JSON data has the contents that you want to work with:
The JSON object within the HTML, as a String object
page.css('script[type="application/json"]').first.inner_html
The JSON String converted to a real JSON Hash
JSON.parse(page.css('script[type="application/json"]').first.inner_html)
The position inside the JSON of the Array of crypto Hashes
my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
Pretty-print the first "crypto"
2.7.2 :142 > pp cryptos.first
{"id"=>1,
"name"=>"Bitcoin",
"symbol"=>"BTC",
"slug"=>"bitcoin",
"tags"=>
["mineable",
"pow",
"sha-256",
"store-of-value",
"state-channel",
"coinbase-ventures-portfolio",
"three-arrows-capital-portfolio",
"polychain-capital-portfolio",
"binance-labs-portfolio",
"blockchain-capital-portfolio",
"boostvc-portfolio",
"cms-holdings-portfolio",
"dcg-portfolio",
"dragonfly-capital-portfolio",
"electric-capital-portfolio",
"fabric-ventures-portfolio",
"framework-ventures-portfolio",
"galaxy-digital-portfolio",
"huobi-capital-portfolio",
"alameda-research-portfolio",
"a16z-portfolio",
"1confirmation-portfolio",
"winklevoss-capital-portfolio",
"usv-portfolio",
"placeholder-ventures-portfolio",
"pantera-capital-portfolio",
"multicoin-capital-portfolio",
"paradigm-portfolio"],
"cmcRank"=>1,
"marketPairCount"=>9158,
"circulatingSupply"=>18960043,
"selfReportedCirculatingSupply"=>0,
"totalSupply"=>18960043,
"maxSupply"=>21000000,
"isActive"=>1,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"dateAdded"=>"2013-04-28T00:00:00.000Z",
"quotes"=>
[{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}],
"isAudited"=>false,
"rank"=>1,
"hasFilters"=>false,
"quote"=>
{"USD"=>
{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}}
}
The price of the first "crypto"
cryptos.first["quote"]["USD"]["price"]
The key that you use in your Hash for the first "crypto"
cryptos.first["symbol"]
Put it all together and you get the following code (looping through each "crypto" with each_with_object):
require 'json'
require 'nokogiri'
require 'open-uri'

...

def crypto(page)
  my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)
  cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
  hash = cryptos.each_with_object({}) do |crypto, hsh|
    hsh[crypto["name"]] = crypto["quote"]["USD"]["price"]
  end
  return hash
end

puts crypto(scrapper)

How to loop through all items in xpath

I'm new to both XPath and HTML, so I'm probably missing something fundamental here. I have an HTML page from which I want to extract all the items shown below. (I'm using Scrapy to do my requests; I just need the proper XPath to get the data.)
[screenshot of the HTML list items omitted]
I just want to loop through all these items and get some data from inside each one.
for item in response.xpath("//ul[@class='feedArticleList XSText']/li[@class='item']"):
    yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
The problem is that this get only gives me the first item on every loop. If I instead use .getall(), I get all the items on every loop (which in my view shouldn't work, since I thought I had selected only one item at a time in each iteration). Thanks in advance!
It seems you're missing a . in your XPath expression (to "indicate" you're working from a context node).
Replace:
yield {'name': item.xpath("//div[@class='intro lhNormal']").get()}
with:
yield {'name': item.xpath(".//div[@class='intro lhNormal']").get()}
You are indeed missing something fundamental: Python by default does not have an xpath() function.
You could use the bs4 or lxml libraries instead.
See an example with lxml:
import lxml.html
from itertools import islice

doc = lxml.html.parse('http://www.websters-online-dictionary.org')
table = []
trs = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
for tr in islice(trs, 3):
    for td in tr.xpath('td'):
        table += td.xpath("./b/text() | ./text()")
buffer = ''
for i in range(len(table)):
    buffer += table[i]

Convert json to array using Perl

I have a chunk of json that has the following format:
{"page":{"size":7,"number":1,"totalPages":1,"totalElements":7,"resultSetId":null,"duration":0},"content":[{"id":"787edc99-e94f-4132-b596-d04fc56596f9","name":"Verification","attributes":{"ruleExecutionClass":"VerificationRule"},"userTags":[],"links":[{"rel":"self","href":"/endpoint/787edc99-e94f-4132-b596-d04fc56596f9","id":"787edc99-e94f-...
Basically the size attribute (in this case) tells me that there are 7 parts to the content section. How do I convert this chunk of JSON to an array in Perl, and can I do it using the size attribute? Or is there a simpler way, like just using decode_json()?
Here is what I have so far:
my $resources = get_that_json_chunk(); # function returns exactly the json you see, except all 7 resources in the content section
my @decoded_json = @$resources;
foreach my $resource (@decoded_json) {
I've also tried something like this:
my $deserialize = from_json( $resources );
my @decoded_json = (@{$deserialize});
I want to iterate over the array and handle the data. I've tried a few different ways because I read a little about array refs, but I keep getting "Not an ARRAY reference" errors and "Can't use string ("{"page":{"size":7,"number":1,"to"...) as an ARRAY ref while "strict refs" in use"
Thank you to Matt Jacob:
my $deserialized = decode_json($resources);
print "$_->{id}\n" for @{$deserialized->{content}};

Download hidden json array in HTML using R

I'm trying to scrape data from transfermarkt, mainly using the XML and httr packages.
page.doc <- content(GET("http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889"))
After downloading, there is a hidden array named 'series':
'series':[{'type':'line','name':'Valor de mercado','data':[{'y':600000,'verein':'CF América','age':21,'mw':'600 miles €','datum_mw':'02/12/2011','x':1322780400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/3631.png?lm=1403472558)'}},{'y':850000,'verein':'Jaguares de Chiapas','age':21,'mw':'850 miles €','datum_mw':'02/06/2012','x':1338588000000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'03/12/2012','x':1354489200000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1000000,'verein':'Jaguares de Chiapas','age':22,'mw':'1,00 mill. €','datum_mw':'29/05/2013','x':1369778400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4774_1441956822.png?lm=1441956822)'}},{'y':1250000,'verein':'Querétaro FC','age':23,'mw':'1,25 mill. €','datum_mw':'27/12/2013','x':1388098800000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1500000,'verein':'Querétaro FC','age':24,'mw':'1,50 mill. €','datum_mw':'01/09/2014','x':1409522400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}},{'y':1800000,'verein':'Querétaro FC','age':25,'mw':'1,80 mill. €','datum_mw':'01/10/2015','x':1443650400000,'marker':{'symbol':'url(http://akacdn.transfermarkt.de/images/wappen/verysmall/4961.png?lm=1409989898)'}}]}]
Is there a way to download directly? I want to scrape 600+ pages.
So far, I have tried:
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']")
page.doc.2 <- xpathSApply(page.doc, "//*/div[@class='eight columns']", xmlAttrs)
No, there is no way to download just the JSON data: the JSON array you’re interested in is embedded inside the page’s source code, as part of a script.
You can then use conventional XPath or CSS selectors to find the script elements. However, finding and extracting just the JSON part is harder without a library that evaluates the JavaScript code. A better option would definitely be to use an official API, should one exist.
library(rvest)  # Better suited for web scraping than httr & XML.
library(rjson)

doc = read_html('http://www.transfermarkt.es/george-corral/marktwertverlauf/spieler/103889')

script = doc %>%
    html_nodes('script') %>%
    html_text() %>%
    grep(pattern = "'series':", value = TRUE)

# Replace JavaScript quotes with JSON quotes
json_content = gsub("'", '"', gsub("^.*'series':", '', script))

# Truncate characters from the end until the result is parseable as valid JSON …
while (nchar(json_content) > 0) {
    json = try(fromJSON(json_content), silent = TRUE)
    if (! inherits(json, 'try-error'))
        break
    json_content = substr(json_content, 1, nchar(json_content) - 1)
}
However, there’s no guarantee that the above will always work: it is JavaScript after all, not JSON; the two are similar but not every valid JavaScript array is valid JSON.
It could be possible to evaluate the JavaScript fragment instead but that gets much more complicated. As a start, take a look at the V8 interface for R.

Strip html from string Ruby on Rails

I'm working with Ruby on Rails. Is there a way to strip HTML from a string using sanitize or an equivalent method, keeping only the text inside the value attribute of an input tag?
If we want to use this in a model:
ActionView::Base.full_sanitizer.sanitize(html_string)
which is the code in the strip_tags method.
There's a strip_tags method in ActionView::Helpers::SanitizeHelper:
http://api.rubyonrails.org/classes/ActionView/Helpers/SanitizeHelper.html#method-i-strip_tags
Edit: for getting the text inside the value attribute, you could use something like Nokogiri with an Xpath expression to get that out of the string.
Yes, call this: sanitize(html_string, tags:[])
ActionView::Base.full_sanitizer.sanitize(html_string)
A whitelist of tags and attributes can be specified as below:
ActionView::Base.full_sanitizer.sanitize(html_string, :tags => %w(img br p), :attributes => %w(src style))
The above statement allows the tags img, br, and p, and the attributes src and style.
I've used the Loofah library, as it is suitable for both HTML and XML (documents as well as string fragments). It is the engine behind the rails-html-sanitizer gem. I'm simply pasting the code example to show how simple it is to use.
Loofah Gem
unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
doc = Loofah.fragment(unsafe_html).scrub!(:strip)
doc.to_s # => "ohai! <div>div is safe</div> "
doc.text # => "ohai! div is safe "
How about this?
white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
WHITELIST = ['p','b','h1','h2','h3','h4','h5','h6','li','ul','ol','small','i','u']

[Your, Models, Here].each do |klass|
  klass.all.each do |ob|
    klass.attribute_names.each do |attrs|
      if ob.send(attrs).is_a? String
        ob.send("#{attrs}=", white_list_sanitizer.sanitize(ob.send(attrs), tags: WHITELIST, attributes: %w(id style)).gsub(/<p>\s*<\/p>\r\n/im, ''))
        ob.save
      end
    end
  end
end
If you want to remove all HTML tags, you can use:
htm.gsub(/<[^>]*>/,'')
This is working for me in Rails 6.1.3:
.errors-description
= sanitize(message, tags: %w[div span strong], attributes: %w[class])
You can do .to_plain_text:
@my_string = "<p>My HTML String</p>"
@my_string.to_plain_text
# => My HTML String