Web Scraping with Nokogiri::HTML and Ruby - Output to CSV issue

I have a script that scrapes HTML article pages of a webshop. I'm testing with a set of 22 pages of which 5 article pages have a product description and the others don't.
This code puts the right info on screen:
if doc.at_css('.product_description')
  doc.css('div > .product_description > p').each do |description|
    puts description
  end
else
  puts "no description"
end
But now I'm stuck on how to correctly output the found product descriptions to an array, from which I'm writing them to a CSV file.
I've tried several options, but none of them works so far.
If I replace puts description with @description << description.content, then all the descriptions end up in the top lines of the CSV, even though they do not belong to the articles in those lines.
When I also replace puts "no description" with @description = "no description", then the first 14 lines in my CSV receive one letter of "no description" each. Looks funny, but it is not exactly what I need.
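(What seems to be going on there: @description = "no description" makes the variable a String, and indexing a String returns single characters, while indexing an Array returns whole entries:)
@description = "no description"
@description[0]   # => "n"

@description = ["no description"]
@description[0]   # => "no description"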
If more code is needed, just shout!
This is the CSV code I use in the script:
CSV.open("artinfo.csv", "wb") do |row|
row << ["category", "sub-category", "sub-sub-category", "price", "serial number", "title", "description"]
(0..#prices.length - 1).each do |index|
row << [
#categories[index],
#subcategories[index],
#subsubcategories[index],
#prices[index],
#serial_numbers[index],
#title[index],
#description[index]]
end
end

It sounds like your data isn't lined up properly. If it were, you should be able to do:
CSV.open("artinfo.csv", "w") do |csv|
csv << ["category", "sub-category", "sub-sub-category", "price", "serial number", "title", "description"]
[#categories, #subcategories, #subsubcategories, #prices, #serial_numbers, #title, #description].transpose.each do |row|
csv << row
end
end
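The misalignment usually comes from @description not getting exactly one entry per article. A minimal sketch of the scraping side, assuming doc is the parsed Nokogiri document for one article page and this runs inside the same per-article loop that fills the other arrays:
@description ||= []

if doc.at_css('.product_description')
  # one cell per article: join this article's paragraphs into a single string
  @description << doc.css('div > .product_description > p').map(&:text).join("\n")
else
  @description << "no description"
end
That way @description[index] always refers to the same article as @prices[index], and the transpose approach above works.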

Ruby: Handling different JSON response that is not what is expected

I searched online and read through the documentation, but have not been able to find an answer. I am fairly new to Ruby, and as part of learning it I wanted to make the script below.
The script essentially does a carrier lookup on a list of numbers provided through a CSV file. The CSV file has a single column with the header "number".
Everything runs fine UNTIL the API gives me an output that is different from the others. In this example, it tells me that one of the numbers in my file is not a valid US number. This then causes my script to stop running.
I am looking for a way to either ignore it (I read about begin/end blocks, but was not able to get them to work), or ideally write those errors to a separate file, or just put the data into the main file.
Any help would be much appreciated. Thank you.
Ruby Code:
require 'csv'
require 'uri'
require 'net/http'
require 'json'

number = 0
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  uri = URI("https://api.message360.com/api/v3/carrier/lookup.json?PhoneNumber=#{number}")
  req = Net::HTTP::Post.new(uri)
  req.basic_auth 'XXX', 'XXX'
  res = Net::HTTP.start(uri.hostname, uri.port, :use_ssl => true) { |http|
    http.request(req)
  }
  json = JSON.parse(res.body)
  new = json["Message360"]["Carrier"].values
  CSV.open("new.csv", "ab") do |csv|
    csv << new
  end
end
File Data:
number
5556667777
9998887777
Good Response example in JSON:
{"Message360"=>{"ResponseStatus"=>1, "Carrier"=>{"ApiVersion"=>"3", "CarrierSid"=>"XXX", "AccountSid"=>"XXX", "PhoneNumber"=>"+19495554444", "Network"=>"Cellco Partnership dba Verizon Wireless - CA", "Wireless"=>"true", "ZipCode"=>"92604", "City"=>"Irvine", "Price"=>0.0003, "Status"=>"success", "DateCreated"=>"2018-05-15 23:05:15"}}}
The response that causes the script to stop:
{
"Message360": {
"ResponseStatus": 0,
"Errors": {
"Error": [
{
"Code": "ER-M360-CAR-111",
"Message": "Allowed Only Valid E164 North American Numbers.",
"MoreInfo": []
}
]
}
}
}
It would appear you can just check json["Message360"]["ResponseStatus"] first for a 0 or 1 to indicate failure or success.
I'd probably add a rescue to help catch any other errors (malformed JSON, network issue, etc.)
CSV.foreach('data1.csv', headers: true) do |row|
  number = row['number'].to_i
  ...
  json = JSON.parse(res.body)

  if json["Message360"]["ResponseStatus"] == 1
    new = json["Message360"]["Carrier"].values
    CSV.open("new.csv", "ab") do |csv|
      csv << new
    end
  else
    # handle bad response
  end
rescue StandardError => e
  # request failed for some reason, log e and the number?
  # (a rescue directly inside a do/end block like this needs Ruby 2.6+)
end
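Since a separate file for the failures was also mentioned, here is a sketch of what that else branch could look like, using the json and number variables from the loop above; errors.csv is an assumed file name, and the dig path follows the failure payload shown in the question:
if json["Message360"]["ResponseStatus"] == 1
  CSV.open("new.csv", "ab") do |csv|
    csv << json["Message360"]["Carrier"].values
  end
else
  # pull the first error message out of the failure payload
  message = json.dig("Message360", "Errors", "Error", 0, "Message")
  CSV.open("errors.csv", "ab") do |csv|
    csv << [number, message]   # e.g. 5556667777, "Allowed Only Valid E164 North American Numbers."
  end
end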

JSON to CSV write data to csv file [closed]

I have a small script:
require "csv"
require "json"
puts "JSON file name (include extension) ->"
jsonfile = gets.chomp
json = JSON.parse(File.open(jsonfile).read)
#puts json.first.collect {|k,v| k}.join(',')
puts json.collect {|node| "#{node.collect{|k,v| v}.join(',')}\n"}.join
CSV.open("generated.csv", "wb") do |csv|
csv << json.collect {|node| "#{node.collect{|k,v| v}.join(',')}\n"}.join
end
In the terminal it shows like this:
Missing: [User],{"error"=>[{"Something"=>"", "errno"=>"somthing", "de"=>"smoehting", "pe"=>"error", "errorMessage"=>"Missing "}], "data"=>nil}
Missing: [User],{"error"=>[{"Something"=>"", "errno"=>"somthing", "de"=>"smoehting", "pe"=>"error", "errorMessage"=>"Missing "}], "data"=>nil}
I need to output each row into a separate row in a CSV file. The above is what I'm trying to do to write it to CSV, but it does not work.
The first mistake in your code is that you call << only one time. Each << creates one line, so you have to call << n times, where n is the number of lines.
The second mistake is that you concatenate the array elements and pass a string as the argument to <<. Each argument to << should be an array.
Summarizing, to create a CSV file containing two lines:
# my.csv
1,2,3
4,5,6
you should write:
CSV.open("my.csv", "wb") do |csv|
csv << [1, 2, 3]
csv << [4, 5, 6]
end
Similarly, to achieve your desired effect, try to rewrite your code as:
CSV.open("generated.csv", "wb") do |csv|
json.each do |node|
csv << node.collect { |k,v| v }
end
end
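If a header row is also wanted, the keys of the first object can be written before the loop. This assumes json is an array of hashes that all share the same keys:
CSV.open("generated.csv", "wb") do |csv|
  csv << json.first.keys      # header row taken from the first object's keys
  json.each do |node|
    csv << node.values        # one CSV row per JSON object
  end
end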

CSV record with spaces is not saving to the database

I am trying to save CSV records like the one below into a DB:
9,Lambert,Kent D,Senator
But it is not being saved in the DB; the transaction is being rolled back with this error:
{"state_senate_district_id"=>"9", "last_name"=>"Lambert", "first_name"=>"Kent D", "tag"=>"Senator"}
(0.2ms) BEGIN
(0.1ms) ROLLBACK
["First name should contain only alphabets"]
So there is a space in first_name = "Kent D"; the validation is not allowing spaces, so the record is not saved to the DB.
Below is the code to parse the CSV:
hash = {}
CSV.foreach('Senator.csv', {:headers=>:first_row}) do |line|
  hash['state_senate_district_id'] = line[0]
  hash['last_name'] = line[1]
  hash['first_name'] = line[2]
  hash['tag'] = line[3]
  puts hash

  senator = Senator.new(hash)
  unless senator.save
    err = senator.errors.full_messages
    p err
    File.open("errors", "a") do |csv|
      err.each do |c|
        csv << "\n"
        csv << "||||||"
        csv << [c]
      end
    end
  end
end
You probably have a validation rule in the Senator model that is preventing the first_name field from having a space. Remove that validation or change it so that it allows spaces.
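The error message suggests a format validation along these lines in app/models/senator.rb (the exact regex and class name are assumptions); adding a space to the character class lets names like "Kent D" through:
class Senator < ApplicationRecord
  # hypothetical original rule: letters only, so "Kent D" fails
  # validates :first_name, format: { with: /\A[A-Za-z]+\z/,
  #                                  message: "should contain only alphabets" }

  # relaxed rule: letters and spaces allowed
  validates :first_name, format: { with: /\A[A-Za-z ]+\z/,
                                   message: "should contain only alphabets" }
end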

Opening multiple html files & outputting to .txt with Nokogiri

Just wondering whether these two tasks should be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"
doc = Nokogiri.parse(open("example.html"))

doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end

doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm just trying to parse a directory of .html documents, get the information from the meta HTML tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()); maybe it works with ::HTML or ::XML).
Question 2: More importantly, is it possible to output all of that meta content into a text file, replacing the puts command?
Also, forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / XPath / Ruby.
Thanks.
I have some similar code. Please refer to:
require 'csv'
require 'nokogiri'

module MyParser
  HTML_FILE_DIR = 'your html file dir'

  def self.run(options = {})
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }
    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your selector conditions').first.content
    ... # add your selector code, CSS or XPath
    array
  end

  def self.write_csv(result)
    ::CSV.open('your output file name', 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end

MyParser.run
You can output to a file like so:
File.open('results.txt', 'w') do |file|
  file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")

results = authors.map { |x| "Author: #{x}" }.join("\n") +
          keywrds.map { |x| "Keywords: #{x}" }.join("\n")

File.open('results.txt', 'w') { |f| f << results }
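Putting the two questions together, here is a minimal sketch that globs every .html file in the current directory and appends the meta information to one text file (the file names are assumptions):
require 'nokogiri'

File.open('results.txt', 'w') do |out|
  Dir.glob('*.html').each do |path|
    doc = Nokogiri::HTML(File.read(path))

    doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
      out.puts "#{path} Author: #{metaauth}"
    end
    doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
      out.puts "#{path} Keywords: #{metakey}"
    end
  end
end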

Netlogo Export/Tableau issues

Currently I'm playing around with exporting my data from NetLogo to a CSV file and then loading it into Tableau, using the following code:
to write-result-to-file
  ; if nothing to write then stop
  if empty? Result-File [stop]
  ; open file
  file-open Result-File
  ; write into the file
  file-print (word Days-passed "," num-susceptible "," num-infected "," num-recovered)
  ; close file
  file-close
end
Where I'm running into trouble is that when I load the data into Tableau, it isn't properly picking up the measures/dimensions. Is there a way in NetLogo to specify the headers of my rows/columns before they are exported to the CSV file?
This question was asked and answered over on NetLogo Users. James Steiner's answer is copied below, with a few typos in the code corrected. It's really quite elegant.
You can print the headers to your results-file during setup!
You might want to make a subroutine to handle all writing to the file, so you don't have to repeat code:
to write-csv [ #filename #items ]
  ;; #items is a list of the data (or headers!) to write.
  if is-list? #items and not empty? #items
  [ file-open #filename
    ;; quote non-numeric items
    set #items map quote #items
    ;; print the items
    ;; if only one item, print it.
    ifelse length #items = 1 [ file-print first #items ]
    [ file-print reduce [ (word ?1 "," ?2) ] #items ]
    ;; close up
    file-close
  ]
end

to-report quote [ #thing ]
  ifelse is-number? #thing
    [ report #thing ]
    [ report (word "\"" #thing "\"") ]
end
You would call it with
write-csv "myfilename.csv" ["label1" "label2" "label3"]
to write the column headers in your setup routine, and then
write-csv "myfilename.csv" [10.0 "sometext" 20.3]
to write a row of data - in this case a number, a string, and another number.