Ruby Nokogiri take all the content - html

i'm working on a scrapping project but i got a problem:
I wanna get all the data of https://coinmarketcap.com/all/views/all/ with nokigiri
but i only get 20 crypto name on the 200 loaded with nokogiri
the code:
ruby
require 'nokogiri'
require 'open-uri'
require 'rubygems'
def scrapper
return doc = Nokogiri::HTML(URI.open('https://coinmarketcap.com/all/views/all/'))
end
def fusiontab(tab1,tab2)
return Hash[tab1.zip(tab2)]
end
def crypto(page)
array_name=[]
array_value=[]
name_of_crypto=page.xpath('//tr//td[3]')
value_of_crypto=page.xpath('//tr//td[5]')
hash={}
name_of_crypto.each{ |name|
array_name<<name.text
}
value_of_crypto.each{|price|
array_value << price.text
}
hash=fusiontab(array_name,array_value)
return hash
end
puts crypto(scrapper)
can you help me to get all the cryptocurrencies ?

The URL you're using does not generate all the data as HTML; a lot of it is rendered after the page has been loaded.
Looking at the source code for the page, it appears that the data is rendered from a JSON script, embedded in the page.
it took quite some time to find the objects in order to work out what part of the JSON data has the contents that you want to work with:
The JSON object within the HTML, as a String object
page.css('script[type="application/json"]').first.inner_html
The JSON String converted to a real JSON Hash
JSON.parse(page.css('script[type="application/json"]').first.inner_html)
the position inside the JSON or the Array of Crypto Hashes
my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
pretty print the first "crypto"
2.7.2 :142 > pp cryptos.first
{"id"=>1,
"name"=>"Bitcoin",
"symbol"=>"BTC",
"slug"=>"bitcoin",
"tags"=>
["mineable",
"pow",
"sha-256",
"store-of-value",
"state-channel",
"coinbase-ventures-portfolio",
"three-arrows-capital-portfolio",
"polychain-capital-portfolio",
"binance-labs-portfolio",
"blockchain-capital-portfolio",
"boostvc-portfolio",
"cms-holdings-portfolio",
"dcg-portfolio",
"dragonfly-capital-portfolio",
"electric-capital-portfolio",
"fabric-ventures-portfolio",
"framework-ventures-portfolio",
"galaxy-digital-portfolio",
"huobi-capital-portfolio",
"alameda-research-portfolio",
"a16z-portfolio",
"1confirmation-portfolio",
"winklevoss-capital-portfolio",
"usv-portfolio",
"placeholder-ventures-portfolio",
"pantera-capital-portfolio",
"multicoin-capital-portfolio",
"paradigm-portfolio"],
"cmcRank"=>1,
"marketPairCount"=>9158,
"circulatingSupply"=>18960043,
"selfReportedCirculatingSupply"=>0,
"totalSupply"=>18960043,
"maxSupply"=>21000000,
"isActive"=>1,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"dateAdded"=>"2013-04-28T00:00:00.000Z",
"quotes"=>
[{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}],
"isAudited"=>false,
"rank"=>1,
"hasFilters"=>false,
"quote"=>
{"USD"=>
{"name"=>"USD",
"price"=>43646.858047604175,
"volume24h"=>20633664171.70021,
"marketCap"=>827546305397.4712,
"percentChange1h"=>-0.86544168,
"percentChange24h"=>-1.6482985,
"percentChange7d"=>-0.73945082,
"lastUpdated"=>"2022-02-16T14:26:00.000Z",
"percentChange30d"=>2.18336134,
"percentChange60d"=>-6.84146969,
"percentChange90d"=>-26.08073361,
"fullyDilluttedMarketCap"=>916584018999.69,
"marketCapByTotalSupply"=>827546305397.4712,
"dominance"=>42.1276,
"turnover"=>0.02493355,
"ytdPriceChangePercentage"=>-8.4718}}
}
the value of the first "crypto"
crypto.first["quote"]["USD"]["price"]
the key that you use in your Hash for the first "crypto"
crypto.first["symbol"]
put it all together and you get the following code (looping through each "crypto" with each_with_object)
require `json`
require 'nokogiri'
require 'open-uri'
...
def crypto(page)
my_json = JSON.parse(page.css('script[type="application/json"]').first.inner_html)
cryptos = my_json["props"]["initialState"]["cryptocurrency"]["listingLatest"]["data"]
hash = cryptos.each_with_object({}) do |crypto, hsh|
hsh[crypto["name"]] = crypto["quote"]["USD"]["price"]
end
return hash
end
puts crypto(scrapper);

Related

How to serialize/deserialize ruby hashes/structs with objects as keys to json

I would like to dump a nested datastructure in ruby to json (I am aware of the Marshal module but I need a standard format) and be able to load/parse the datastructure again. Catch: I use structs (or easier for the example: hashes) as keys of hashes. Example:
require 'json'
h = {{hello: 123} => 123}
JSON.parse(JSON.generate(h)) #=> {"{:hello=>123}"=>123}
So the problem is, that JSON.generate(h) serialises the key {:hello=>123} as a string and when I parse the result again, it remains a string.
How can I solve this and regain the original structure after generate/parse?
JSON only allows strings as object keys. For this reason to_s is called for all keys.
You'll have the following options to solve your issue:
The best option is changing the data structure so it can properly be serialized to JSON.
You'll have to handle the stringified key yourself. An Hash produces a perfectly valid Ruby syntax when converted to a string that can be converted using Kernel#eval like Andrey Deineko suggested in the comments.
result = json.transform_keys { |key| eval(key) }
# json.transform_keys(&method(:eval)) is the same as the above.
The Hash#transform_keys method is relatively new (available since Ruby 2.5.0) and might currently not be in you development environment. You can replace this with a simple Enumerable#map if needed.
result = json.map { |k, v| [eval(k), v] }.to_h
Note: If the incoming JSON contains any user generated content I highly sugest you stay away from using eval since you might allow the user to execute code on your server.
I need a standard format
YAML is a standard format that would suffice here:
▶ h = {{hello: 123} => 123}
#⇒ {{:hello=>123}=>123}
▶ YAML.dump h
#⇒ "---\n? :hello: 123\n: 123\n"
▶ YAML.load _
#⇒ {{:hello=>123}=>123}
As already pointed by mudasobwa, YAML is a good tool: allows you to store also custom class objects:
require 'yaml'
class MyCaptain
attr_accessor :name, :ship
def initialize(name, ship)
#name = name
#ship = ship
end
end
kirk = MyCaptain.new('James T. Kirk', 'USS Enterprise NCC-1701')
picard = MyCaptain.new('Jean-Luc Picard', 'Enterprise NCC-1701D')
captains = [kirk, picard]
File.open("my_captains.yml","w") do |file|
file.write captains.to_yaml
end
p YAML.load_file('my_captains.yml')
#=> [#<MyCaptain:0x007f889d0973b0 #name="James T. Kirk", #ship="USS Enterprise NCC-1701">, #<MyCaptain:0x007f889d096b40 #name="Jean-Luc Picard", #ship="Enterprise NCC-1701D">]

Extract Single Data From JSON URL for Dashing Dashboard

I am trying to show the "sgv" value on a Dashing / Smashing dashboard widget. Ultimately I would also like to show the "direction" value as well. I am running into problems pulling that precise value down which changes every 3 to 5 minutes. I have already been able to mirror the exact string using the following:
require 'net/http'
require 'rest-client'
require 'json'
url = "https://dnarnianbg.herokuapp.com/api/v1/entries/current.json"
response = RestClient.get(url)
JSON.parse(response)
# :first_in sets how long it takes before the job is first run. In this case, it is run immediately
current_nightscout = 0
SCHEDULER.every '5m' do
last_nightscout = current_nightscout
current_nightscout = response
send_event('nightscout', { current: current_nightscout, last: last_nightscout })
end
I have also searched the archives several times. I don't wish to write this to a file like this one shows and the duplicate question has been deleted or moved.
I realize that the JSON.parse(response) is just going to parse out whatever I tell it the response equals, but I don't know how to get that response to equal SGV. Maybe the solution isn't in the RestClient, but that is where I am lost.
Here is the JSON URL: http://dnarnianbg.herokuapp.com/api/v1/entries/current.json
EDIT: The output of that link is something like this:
[{"_id":"5ba295ddb8a1ee0aede71822","sgv":87,"date":1537381813000,"dateString":"2018-09-19T18:30:13.000Z","trend":4,"direction":"Flat","device":"share2","type":"sgv"}]
You need something like response[0]["sgv"] which should return 52 if you end up with many items in the list you will need to iterate over them.
The best thing you can do is to break your problem down into easier parts to debug. As you are having problems accessing some JSON via an API you should make a simple script which only does the function you want in order to test it and see where the problem is.
Here is a short example you can put into a .rb file and run;
#!/usr/bin/ruby
require 'open-uri'
require 'json'
test = JSON.parse(open("https://dnarnianbg.herokuapp.com/api/v1/entries/current.json", :read_timeout => 4).read)
puts test[0]["sgv"]
That should return the value from sgv
I realise that short sweet example may be little use as a learner so here is a more verbose version with some comments;
#!/usr/bin/ruby
require 'open-uri'
require 'json'
# Open the URL and read the result. Time out if this takes longer then 4 sec.
get_data = open("https://dnarnianbg.herokuapp.com/api/v1/entries/current.json", :read_timeout => 4).read
# Parse the response (get_data) to JSON and put in variable output
output = JSON.parse(get_data)
# Put the output to get the 'sgv figure'
p output[0]["sgv"]
It always pays to manually examine the data you get back, in your case the data looks like this (when make pretty)
[
{
"_id": "5ba41a0fb8a1ee0aedf6eb2c",
"sgv": 144,
"date": 1537481109000,
"dateString": "2018-09-20T22:05:09.000Z",
"trend": 4,
"direction": "Flat",
"device": "share2",
"type": "sgv"
}
]
What you actually have is an Array. Your server returns only 1 result, numbered '0' hence you need [0] in your p statement. Once you have accessed the array id then you can simply use the object you need as [sgv]
If your app ever returns more than one record then you will need to change your code to read all of the results and iterate over them in order to get all the values you need.
Here is the final code that made it work
require 'net/http'
require 'json'
require 'rest-client'
# :first_in sets how long it takes before the job is first run. In this case, it is run immediately
current_nightscout = 0
SCHEDULER.every '1m' do
test = JSON.parse(open("https://dnarnianbg.herokuapp.com/api/v1/entries/current.json", :read_timeout => 4).read)
last_nightscout = current_nightscout
current_nightscout = p test[0]["sgv"]
send_event('nightscout', { current: current_nightscout, last: last_nightscout })
end
I can probably eliminate require 'rest-client' since that is no longer being used, but it works right now and that is all that matters.

Crystal handle json file of known format but dynamic keys

So I have a JSON file of a somewhat known format { String => JSON::Type, ... }. So it is basically of type Hash(String, JSON::Type). But when I try and read it from file to memory like so: JSON.parse(File.read(#cache_file)).as(Hash(String, JSON::Type)) I always get an exception: can't cast JSON::Any to Hash(String, JSON::Type)
I'm not sure how I am supposed to handle the data if I can't cast it.
What I basically want to do is the following:
save JSON::Type data under a String key
replace JSON::Type data with other JSON::Type data under a String key
And of course read from / write to file...
Here's the whole thing I've got so far:
class Cache
def initialize(#cache_file = "/tmp/cache_file.tmp")
end
def cache(cache_key : (String | Symbol))
mutable_cache_data = data
value = mutable_cache_data[cache_key.to_s] ||= yield.as(JSON::Type)
File.write #cache_file, mutable_cache_data
value
end
def clear
File.delete #cache_file
end
def data
unless File.exists? #cache_file
File.write #cache_file, {} of String => JSON::Type
end
JSON.parse(File.read(#cache_file)).as(Hash(String, JSON::Type))
end
end
puts Cache.new.cache(:something) { 10 } # => 10
puts Cache.new.cache(:something) { 'a' } # => 10
TL;DR I want to read a JSON file into a Hash(String => i_dont_care), replace a value under a given key name and serialize it to file again. How do I do that?
JSON.parse returns an JSON::Any, not a Hash so you can't cast it. You can however access the underlying raw value as JSON.parse(file).raw and cast this as hash.
Then your code is basically working (I've fixed a few error): https://carc.in/#/r/28c1
You can use use Hash(String, JSON::Type).from_json(File.read(#cache_file)). Hopefully you can restrict the type of JSON::Type down to something more sensible too. JSON::Any and JSON.parse_raw are very much a last resort compared to simply representing your schema using Hash, Array and custom types using JSON.mapping.

How to traverse li tags and collect their values with Nokogiri

I wrote a tiny script to pull usernames from Github. I can get the first username's details out, but I don't understand how I can iterate through a list of elements with the same CSS selector class to put a list of the usernames together:
page = agent.get('https://github.com/angular/angular/stargazers')
html_results = Nokogiri::HTML(page.body)
first_username = html_results.at_css('.follow-list-name').text
first_username_location = html_results.at_css('.follow-list-info').text
Can you help me understand how I can iterate through all of the follow-list-... elements in the page.body and store the values in an array somewhere?
Nokogiri at_css returns a single (first) match. Use css instead to get an array of matching results:
require 'nokogiri'
require 'open-uri'
require 'pp'
html = Nokogiri::HTML(open('https://github.com/angular/angular/stargazers').read)
usernames = html.css('.follow-list-name').map(&:text)
locations = html.css('.follow-list-info').map(&:text)
pp usernames
pp locations
Output:
["Jeff Arese Vilar",
"Yaroslav Dusaniuk",
"Matthieu Le brazidec",
... ]
[" #Wallapop ",
" Ukraine, Vinnytsia",
" Joined on Jul 4, 2014",
... ]
Just note that to parse the rest of the members you will need to handle pagination. I.e. fetching the data from all the other pages with:
http://github.com/.../stargazers?page=NN
...where NN is the page number.
Using the Github API
A much more robust way is to use the Github Stargazers List API:
https://developer.github.com/v3/activity/starring/#list-stargazers

Convert json to array using Perl

I have a chunk of json that has the following format:
{"page":{"size":7,"number":1,"totalPages":1,"totalElements":7,"resultSetId":null,"duration":0},"content":[{"id":"787edc99-e94f-4132-b596-d04fc56596f9","name":"Verification","attributes":{"ruleExecutionClass":"VerificationRule"},"userTags":[],"links":[{"rel":"self","href":"/endpoint/787edc99-e94f-4132-b596-d04fc56596f9","id":"787edc99-e94f-...
Basically the size attribute (in this case) tells me that there are 7 parts to the content section. How do I convert this chunk of json to an array in Perl, and can I do it using the size attribute? Or is there a simpler way like just using decode_json()?
Here is what I have so far:
my $resources = get_that_json_chunk(); # function returns exactly the json you see, except all 7 resources in the content section
my #decoded_json = #$resources;
foreach my $resource (#decoded_json) {
I've also tried something like this:
my $deserialize = from_json( $resources );
my #decoded_json = (#{$deserialize});
I want to iterate over the array and handle the data. I've tried a few different ways because I read a little about array refs, but I keep getting "Not an ARRAY reference" errors and "Can't use string ("{"page":{"size":7,"number":1,"to"...) as an ARRAY ref while "strict refs" in use"
Thank you to Matt Jacob:
my $deserialized = decode_json($resources);
print "$_->{id}\n" for #{$deserialized->{content}};