Nokogiri parsing table with no html element - html

I have code that fetches a URL and parses the 'li' elements into an array. However, I run into a problem when trying to parse anything that is not inside a 'b' tag.
Code:
url = '(some URL)'
page = Nokogiri::HTML(open(url))
csv = CSV.open("/tmp/output.csv", 'w')
page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end
HTML:
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
I would like to parse all of the elements so that the output would be something like this:
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]

require 'nokogiri'
def main
  output = []
  page = File.open("parse.html") { |f| Nokogiri::HTML(f) }
  page.search("//li[not(@id) and not(@class)]").each do |row|
    arr = []
    result = row.text
    result.each_line { |l|
      if l.strip.length > 0
        arr << l.strip
      end
    }
    output << arr
  end
  print output
end
if __FILE__ == $PROGRAM_NAME
  main()
end

I ended up finding the solution to my own question, so if anyone is interested: I simply changed
row.search('b').each do |cell|
into:
row.search('text()').each do |cell|
I also changed
arr << cell.text
into:
arr << cell.text.gsub("\n", '').gsub("\r", '')
in order to remove all the \n and \r characters that were present in the output.

Based on your HTML I'd do it like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ol>
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
</ol>
EOT
doc.search('li').map{ |li|
text = li.text.split("\n").map(&:strip)
}
# => [["The Company Name",
# "The Street",
# "The City,",
# "The State",
# "The Zipcode"],
# ["The Company Name",
# "The Street",
# "The City,",
# "The State",
# "The Zipcode"]]
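Note that "The City," keeps its trailing comma in the output above. As a minimal sketch of one way to tidy the fields after extraction, the snippet below works on a plain string standing in for li.text; the helper name clean_lines and the sample string are assumptions for illustration, not part of the original answer.

```ruby
# Hypothetical helper: turns the text of one <li> into clean fields.
def clean_lines(text)
  text.each_line
      .map { |line| line.strip.chomp(',') } # trim whitespace, drop one trailing comma
      .reject(&:empty?)                     # skip blank lines
end

# Stand-in for li.text from the HTML above (assumed shape).
li_text = "The Company Name\nThe Street\nThe City,\nThe State\nThe Zipcode\n\n"

p clean_lines(li_text)
```

With Nokogiri this could be called as doc.search('li').map { |li| clean_lines(li.text) }.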

ruby with csv file add next line automatically

def supplier_registeration
print "Registration form :"
print "\nName :"
name = gets.chomp.downcase
print "User Name :"
user_name = gets.chomp.downcase
print "Password :"
password = gets.chomp.downcase
print "contact :"
contact = gets.chomp.to_i
print "address :"
address = gets.chomp.downcase
CSV.open("source/supplier.csv", "wb") do |csv|
csv << ["name", "user name", "password", "contact", "address"]
csv << [name, user_name, password, contact, address]
end
print "Registration successfully..!\n"
supplier
end
This method writes to the CSV file, but on the next registration it overwrites the existing data. How do I append each new entry on the next line automatically?
Open the file in append mode ('a+') instead of 'wb', which truncates it on every call, and write the header row only when the file is still empty:
headers = ['name','user name','password','contact','address']
CSV.open('source/consumer_list.csv', 'a+', force_quotes: true) do |csv|
  csv << headers if csv.count.eql? 0
  csv << [name, user_name, password, contact, address]
end
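As a rough sketch of the same idea, the header check can also be done on the file itself before opening it, which avoids reading the rows back. The names append_row and HEADERS are hypothetical, and the path passed in is whatever CSV file the registration method uses.

```ruby
require 'csv'

HEADERS = ['name', 'user name', 'password', 'contact', 'address'].freeze

# Hypothetical helper: appends one row, writing the header only
# when the file does not exist yet or is empty.
def append_row(path, row)
  needs_header = !File.exist?(path) || File.zero?(path)
  CSV.open(path, 'a', force_quotes: true) do |csv|
    csv << HEADERS if needs_header
    csv << row
  end
end
```

Inside supplier_registeration this would be called as append_row("source/supplier.csv", [name, user_name, password, contact, address]).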

compare input to fields in a json file in ruby

I am trying to create a function that takes an input, in this case a tracking code, looks that tracking code up in a JSON file, and returns the tracking code as output. The JSON file is as follows:
[
{
"tracking_number": "IN175417577",
"status": "IN_TRANSIT",
"address": "237 Pentonville Road, N1 9NG"
},
{
"tracking_number": "IN175417578",
"status": "NOT_DISPATCHED",
"address": "Holly House, Dale Road, Coalbrookdale, TF8 7DT"
},
{
"tracking_number": "IN175417579",
"status": "DELIVERED",
"address": "Number 10 Downing Street, London, SW1A 2AA"
}
]
I have started using this function:
def compare_content(tracking_number)
File.open("pages/tracking_number.json", "r") do |file|
file.print()
end
I'm not sure how to compare the input to the JSON file. Any help would be much appreciated.
You can use the built-in JSON module.
require 'json'
def compare_content(tracking_number)
  # Loads ENTIRE file into string. Will not be effective on very large files
  json_string = File.read("pages/tracking_number.json")
  # Uses the JSON module to create an array from the JSON string
  array_from_json = JSON.parse(json_string)
  # Iterates through the array of hashes
  array_from_json.each do |tracking_hash|
    if tracking_number == tracking_hash["tracking_number"]
      # If this code runs, tracking_hash has the data for the number you are looking up
    end
  end
end
This will parse the JSON supplied into an array of hashes which you can then compare to the number you are looking up.
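If you only need the first matching record, the loop can be written with Array#find, which returns the match or nil. The following is a minimal sketch that uses an inline sample instead of pages/tracking_number.json; the helper name find_tracking is an assumption for illustration.

```ruby
require 'json'

# Inline sample standing in for pages/tracking_number.json.
json_string = <<~SAMPLE
  [
    {"tracking_number": "IN175417577", "status": "IN_TRANSIT"},
    {"tracking_number": "IN175417578", "status": "NOT_DISPATCHED"}
  ]
SAMPLE

# Hypothetical helper: returns the matching hash, or nil if there is none.
def find_tracking(records, tracking_number)
  records.find { |record| record['tracking_number'] == tracking_number }
end

records = JSON.parse(json_string)
p find_tracking(records, 'IN175417578') # the record for that number
p find_tracking(records, 'UNKNOWN')     # nil, because nothing matches
```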
If you are the one generating the JSON file and this method will be called a lot, consider mapping the tracking numbers directly to their data for this method to potentially run much faster. For example,
{
"IN175417577": {
"status": "IN_TRANSIT",
"address": "237 Pentonville Road, N1 9NG"
},
"IN175417578": {
"status": "NOT_DISPATCHED",
"address": "Holly House, Dale Road, Coalbrookdale, TF8 7DT"
},
"IN175417579": {
"status": "DELIVERED",
"address": "Number 10 Downing Street, London, SW1A 2AA"
}
}
That would parse into a hash, where you could much more easily grab the data:
require 'json'
def compare_content(tracking_number)
  json_string = File.read("pages/tracking_number.json")
  hash_from_json = JSON.parse(json_string)
  if hash_from_json.key?(tracking_number)
    tracking_hash = hash_from_json[tracking_number]
  else
    # Tracking number does not exist
  end
end
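If you cannot change the file layout, a similar lookup table can be built once in memory from the original array form. This is only a sketch: index_by_tracking_number is a hypothetical helper name, the inline JSON is a shortened sample, and the block form of Array#to_h assumes Ruby 2.6 or later.

```ruby
require 'json'

# Hypothetical helper: indexes the array-of-hashes layout by tracking number.
def index_by_tracking_number(records)
  records.to_h { |record| [record['tracking_number'], record] }
end

records = JSON.parse('[{"tracking_number": "IN175417577", "status": "IN_TRANSIT"},
                       {"tracking_number": "IN175417579", "status": "DELIVERED"}]')
index = index_by_tracking_number(records)

p index.dig('IN175417579', 'status') # status of a known tracking number
p index['UNKNOWN']                   # nil for a number that is not in the file
```

Building the index costs one pass over the array, so it pays off mainly when many lookups are made against the same data.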

How to format json in a file using Groovy

I have a question about formatting a file so that it displays JSON output in the correct format.
At the moment the code below writes JSON into a file, but when I open the file it displays the JSON on a single line (word wrap unticked), like so:
{"products":[{"type":null,"information":{"description":"Hotel Parque La Paz (One Bedroom apartment) (Half Board) [23/05/2017 00:00:00] 7 nights","items":{"provider Company":"Juniper","provider Hotel ID":"245","provider Hotel Room ID":"200"}},"costGroups":[{"name":null,"costLines":[{"name":"Hotel Cost","search":null,"quote":234.43,"quotePerAdult":null,"quotePerChild":null}
I want to format the JSON in the file so that it is properly indented, like so:
{
"products": [
{
"type": null,
"information": {
"description": "Hotel Parque La Paz (One Bedroom apartment) (Half Board) [23/05/2017 00:00:00] 7 nights",
"items": {
"provider Company": "Juniper",
"provider Hotel ID": "245",
"provider Hotel Room ID": "200"
}
},
"costGroups": [
{
"name": null,
"costLines": [
{
"name": "Hotel Cost",
"search": null,
"quote": 234.43,
"quotePerAdult": null,
"quotePerChild": null
}
Essentially, each key gets its own line for its value.
What is the best way to get this JSON formatting in the file?
Below is the code:
def groovyUtils = new com.eviware.soapui.support.GroovyUtils(context)
def dataFolder = groovyUtils.projectPath +"//Log Data//"
def response = testRunner.testCase.getTestStepByName("GET_Pricing{id}").getProperty("Response").getValue();
def jsonFormat = (response).toString()
def fileName = "Logged At - D" +date+ " T" +time+ ".txt"
def logFile = new File(dataFolder + fileName)
// checks if a current log file exists if not then prints to logfile
if(logFile.exists())
{
log.info("Error: a file named " + fileName + " already exists")
}
else
{
logFile.write "Date Stamp: " +date+ " " + time + "\n" + jsonFormat //response
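If you have a modern version of Groovy, you can pretty-print the string with groovy.json.JsonOutput before writing it to the file:
import groovy.json.JsonOutput
def prettyJson = JsonOutput.prettyPrint(jsonFormat)
logFile.write "Date Stamp: " + date + " " + time + "\n" + prettyJson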

Ruby Parsing json array with OpenStruct

I'm trying to parse a JSON file with OpenStruct. The JSON file has an array for Skills. When I parse it, I get some extra "garbage" in the output. How do I get rid of it?
json
{
"Job": "My Job 1",
"Skills": [{ "Name": "Name 1", "ClusterName": "Cluster Name 1 Skills"},{ "Name": "Name 2", "ClusterName": "Cluster Name 2 Skills"}]
}
require 'ostruct'
require 'json'
json = File.read('1.json')
job = JSON.parse(json, object_class: OpenStruct)
puts job.Skills
#<OpenStruct Name="Name 1", ClusterName="Cluster Name 1 Skills">
#<OpenStruct Name="Name 2", ClusterName="Cluster Name 2 Skills">
If by garbage you mean #<OpenStruct and ">, that is just the way Ruby represents objects when they are printed with puts. It is useful for development and debugging, and it makes it easier to tell the difference between a String, an Array, a Hash and an OpenStruct.
If you just want to display the name and cluster name, and nothing else:
puts job.Job
job.Skills.each do |skill|
puts skill.Name
puts skill.ClusterName
end
It prints:
My Job 1
Name 1
Cluster Name 1 Skills
Name 2
Cluster Name 2 Skills
EDIT:
When you use job = JSON.parse(json, object_class: OpenStruct), your job variable becomes an OpenStruct Ruby object, which has been created from a json file.
It doesn't have anything to do with json though: it is not a json object anymore, so you cannot just write it back to a .json file and expect it to have the correct syntax.
OpenStruct doesn't seem to work well with to_json, so it might be better to remove object_class: OpenStruct, and just work with hashes and arrays.
This code reads 1.json, converts it to a Ruby object, adds a skill, modifies the job name, writes the object to 2.json, and reads it again as JSON to check that everything worked.
require 'json'
json = File.read('1.json')
job = JSON.parse(json)
job["Skills"] << {"Name" => "Name 3", "ClusterName" => "Cluster Name 3 Skills"}
job["Job"] += " (modified version)"
# job[:Fa] = 'blah'
File.open('2.json', 'w'){|out|
out.puts job.to_json
}
require 'pp'
pp JSON.parse(File.read('2.json'))
# {"Job"=>"My Job 1 (modified version)",
# "Skills"=>
# [{"Name"=>"Name 1", "ClusterName"=>"Cluster Name 1 Skills"},
# {"Name"=>"Name 2", "ClusterName"=>"Cluster Name 2 Skills"},
# {"Name"=>"Name 3", "ClusterName"=>"Cluster Name 3 Skills"}]}
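If the written file should also be easy to read, JSON.pretty_generate can be used in place of to_json. The sketch below uses an inline hash as a stand-in for the parsed job, so the data shown is an assumption rather than the contents of 1.json.

```ruby
require 'json'

# Inline stand-in for the hash obtained from JSON.parse(File.read('1.json')).
job = {
  'Job' => 'My Job 1',
  'Skills' => [{ 'Name' => 'Name 1', 'ClusterName' => 'Cluster Name 1 Skills' }]
}

# pretty_generate adds newlines and indentation; to_json emits a single line.
pretty = JSON.pretty_generate(job)
puts pretty

# The formatted text still parses back to the same structure.
puts JSON.parse(pretty) == job
```

To write it to a file, File.write('2.json', pretty) would do the same job as the File.open block above.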

Ruby: Extract from deeply nested JSON structure based on multiple criteria

I want to select the marketId of every market with marketName == 'Moneyline', but only for events where countryCode is 'US' or 'GB', or where eventName.include?(' # ') (with a space before and after the #). I tried different combinations of map and select, but some nodes don't have a countryCode, which complicates things. This is a sample of what the source might look like:
{"currencyCode"=>"GBP",
"eventTypes"=>[
{"eventTypeId"=>7522,
"eventNodes"=>[
{"eventId"=>28024331,
"event"=>
{"eventName"=>"EWE Baskets Oldenburg v PAOK Thessaloniki BC"
},
"marketNodes"=>[
{"marketId"=>"1.128376755",
"description"=>
{"marketName"=>"Moneyline"}
},
{"marketId"=>"1.128377853",
"description"=>
{"marketName"=>"Start Lublin +7.5"}
}}}]},
{"eventId"=>28023434,
"event"=>
{"eventName"=>"Asseco Gdynia v Start Lublin",
"countryCode"=>"PL",
},
"marketNodes"=>
[{"marketId"=>"1.128377853", ETC...
Based on this previous answer, you just need to add a select on eventNodes:
require 'json'
json = File.read('data.json')
hash = JSON.parse(json)
moneyline_market_ids = hash["eventTypes"].map{|type|
  type["eventNodes"].select{|event_node|
    ['US', 'GB'].include?(event_node["event"]["countryCode"]) || event_node["event"]["eventName"].include?(' # ')
  }.map{|event|
    event["marketNodes"].select{|market|
      market["description"]["marketName"] == 'Moneyline'
    }.map{|market|
      market["marketId"]
    }
  }
}.flatten
puts moneyline_market_ids.join(', ')
#=> 1.128255531, 1.128272164, 1.128255516, 1.128272159, 1.128278718, 1.128272176, 1.128272174, 1.128272169, 1.128272148, 1.128272146, 1.128255464, 1.128255448, 1.128272157, 1.128272155, 1.128255499, 1.128272153, 1.128255484, 1.128272150, 1.128255748, 1.128272185, 1.128278720, 1.128272183, 1.128272178, 1.128255729, 1.128360712, 1.128255371, 1.128255433, 1.128255418, 1.128255403, 1.128255387
If you want to keep the country code and name information with the id:
moneyline_market_ids = hash["eventTypes"].map{|type|
  type["eventNodes"].map{|event_node|
    [event_node, event_node["event"]["countryCode"], event_node["event"]["eventName"]]
  }.select{|_, country, event_name|
    ['US', 'GB'].include?(country) || event_name.include?(' # ')
  }.map{|event, country, event_name|
    event["marketNodes"].select{|market|
      market["description"]["marketName"] == 'Moneyline'
    }.map{|market|
      [market["marketId"],country,event_name]
    }
  }
}.flatten(2)
require 'pp'
pp moneyline_market_ids
#=> [["1.128255531", "US", "Philadelphia # Seattle"],
# ["1.128272164", "US", "Arkansas # Mississippi State"],
# ["1.128255516", "US", "New England # San Francisco"],
# ["1.128272159", "US", "Indiana # Michigan"],
# ["1.128278718", "CA", "Edmonton # Ottawa"],
# ["1.128272176", "US", "Arizona State # Washington"],
# ["1.128272174", "US", "Alabama A&M # Auburn"],
# ...
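The nested map and flatten calls can also be written as a flat pipeline with flat_map, using dig and to_s so that missing keys do not raise. This is a sketch on a small made-up sample in the same shape as the data above; the helper name wanted_event? and the sample values are assumptions for illustration.

```ruby
# Small made-up sample in the same shape as the parsed JSON.
hash = {
  'eventTypes' => [
    { 'eventNodes' => [
      { 'event' => { 'eventName' => 'Indiana # Michigan', 'countryCode' => 'US' },
        'marketNodes' => [
          { 'marketId' => '1.1', 'description' => { 'marketName' => 'Moneyline' } },
          { 'marketId' => '1.2', 'description' => { 'marketName' => 'Total Points' } }
        ] },
      { 'event' => { 'eventName' => 'Edmonton # Ottawa' }, # no countryCode key
        'marketNodes' => [
          { 'marketId' => '1.3', 'description' => { 'marketName' => 'Moneyline' } }
        ] },
      { 'event' => { 'eventName' => 'Asseco Gdynia v Start Lublin', 'countryCode' => 'PL' },
        'marketNodes' => [
          { 'marketId' => '1.4', 'description' => { 'marketName' => 'Moneyline' } }
        ] }
    ] }
  ]
}

# Hypothetical helper: true when the event matches the country or the name rule.
def wanted_event?(event)
  %w[US GB].include?(event['countryCode']) || event['eventName'].to_s.include?(' # ')
end

moneyline_market_ids =
  hash['eventTypes']
    .flat_map { |type| type['eventNodes'] }
    .select { |node| wanted_event?(node['event']) }
    .flat_map { |node| node['marketNodes'] }
    .select { |market| market.dig('description', 'marketName') == 'Moneyline' }
    .map { |market| market['marketId'] }

p moneyline_market_ids
```

Here the first two events pass the filter (one by country, one by name), and only their Moneyline markets are kept.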