batch convert HTML to Markdown

I have a whole lot of HTML files that live in one folder, and I need to convert them to Markdown. I found a couple of gems out there that do this great, one file at a time.
My question is:
How can I loop through each file in the folder and run the conversion command so the .md files end up in a separate folder?
UPDATE
#!/usr/bin/ruby
root = 'C:/Doc'
inDir = File.join(root, '/input')
outDir = File.join(root, '/output')
extension = nil
fileName = nil
Dir.foreach(inDir) do |file|
  # Dir.foreach will always show current and parent directories
  if file == '.' or file == '..' then
    next
  end
  # makes sure the current iteration is not a sub directory
  if not File.directory?(File.join(inDir, file)) then
    extension = File.extname(file)
    fileName = File.basename(file, extension)
  end
  # strips off the last character if it is a period
  if fileName[fileName.length - 1] == "." then
    fileName = fileName[0..-2]
  end
  # this is where I got stuck
  reverse_markdown File.join(inDir, fileName, '.html') > File.join(outDir, fileName, '.md')
end
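One way past that last line is to shell out to the gem's command-line tool instead of calling it like a Ruby method. A sketch, assuming the reverse_markdown gem's CLI is on the PATH (note that File.join inserts path separators, so the extension has to be concatenated rather than joined):
system("reverse_markdown #{File.join(inDir, file)} > #{File.join(outDir, fileName + '.md')}")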

Dir.glob(directory) { |f| ... } will loop through all files inside a directory. For example, using the Redcarpet library you could do something like this (note that this example runs in the opposite direction, Markdown to HTML; swap in your HTML-to-Markdown gem for the render step):
require 'redcarpet'

markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML, :autolink => true)

Dir.glob('*.md') do |in_filename|
  out_filename = File.join(File.dirname(in_filename), "#{File.basename(in_filename, '.*')}.html")
  File.open(in_filename, 'r') do |in_file|
    File.open(out_filename, 'w') do |out_file|
      out_file.write markdown.render(in_file.read)
    end
  end
end
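The same glob pattern can also be turned around to run in the question's direction, HTML in and Markdown out. A sketch, assuming the reverse_markdown gem (its ReverseMarkdown.convert takes an HTML string):

require 'reverse_markdown'

Dir.glob('C:/Doc/input/*.html') do |in_filename|
  out_filename = File.join('C:/Doc/output', "#{File.basename(in_filename, '.*')}.md")
  File.write(out_filename, ReverseMarkdown.convert(File.read(in_filename)))
end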

Related

Splitting a csv file into multiple files

I have a csv file of 150500 rows and I want to split it into multiple files containing 500 rows (entries) each.
I'm using Jupyter and I know how to open and read the file. However, I don't know how to specify an output_path to receive the newly created files from splitting the big one.
I have found this code online, but once again, since I don't know what my output_path is, I don't know how to use it. Moreover, for this block of code I don't understand how we specify the input file.
import os
import csv

def split(filehandler, delimiter=',', row_limit=1000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)  # reader.next() was Python 2; next(reader) works in Python 3
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
My file name is DataSet2.csv and it's in the same folder as the .ipynb notebook I'm running in Jupyter.
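Since the function takes an open file object rather than a path, a call for this case could look like the following sketch (output_path='.' drops the pieces next to the notebook; row_limit=500 gives the 500-row chunks):

with open('DataSet2.csv', 'r') as f:
    split(f, row_limit=500, output_path='.')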
number_of_small_files = 301
lines_per_small_file = 500

largeFile = open('large.csv', 'r')
header = largeFile.readline()
for i in range(number_of_small_files):
    smallFile = open(str(i) + '_small.csv', 'w')
    smallFile.write(header)  # This line copies the header to all small files
    for x in range(lines_per_small_file):
        line = largeFile.readline()
        smallFile.write(line)
    smallFile.close()
largeFile.close()
This will create many small files in the same directory: 301 of them, named 0_small.csv through 300_small.csv.
Using standard unix utilities:
cat DataSet2.csv | tail -n +2 | split -l 500 --additional-suffix=.csv output_
This pipeline takes the original file, strips off the first line with tail -n +2, and then splits the rest into 500-line chunks that go into files whose names start with output_ and end with .csv. Note that, unlike the Python versions above, this drops the header row entirely rather than copying it into each chunk.

Prevent jekyll from cleaning up generated JSON file?

I've written a simple plugin that generates a small JSON file
module Jekyll
  require 'pathname'
  require 'json'

  class SearchFileGenerator < Generator
    safe true

    def generate(site)
      output = [{"title" => "Test"}]
      path = Pathname.new(site.dest) + "search.json"
      FileUtils.mkdir_p(File.dirname(path))
      File.open(path, 'w') do |f|
        f.write("---\nlayout: null\n---\n")
        f.write(output.to_json)
      end
      # 1/0
    end
  end
end
But the generated JSON file gets deleted every time Jekyll runs to completion. If I uncomment the division by zero line and cause it to error out, I can see that the search.json file is being generated, but it's getting subsequently deleted. How do I prevent this?
I found the following issue, which suggested adding the file to keep_files, and that worked: https://github.com/jekyll/jekyll/issues/5162
The new code keeps search.json from getting deleted:
module Jekyll
  require 'pathname'
  require 'json'

  class SearchFileGenerator < Generator
    safe true

    def generate(site)
      output = [{"title" => "Test"}]
      path = Pathname.new(site.dest) + "search.json"
      FileUtils.mkdir_p(File.dirname(path))
      File.open(path, 'w') do |f|
        f.write("---\nlayout: null\n---\n")
        f.write(output.to_json)
      end
      site.keep_files << "search.json"
    end
  end
end
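For a file that isn't produced by a plugin, the same list can be populated from the site configuration instead. A minimal sketch, assuming a standard Jekyll setup where keep_files is read from _config.yml:

# _config.yml
keep_files: [search.json]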
Add your new page to site.pages:
module Jekyll
  class SearchFileGenerator < Generator
    def generate(site)
      @site = site
      search = PageWithoutAFile.new(@site, site.source, "/", "search.json")
      search.data["layout"] = nil
      search.content = [{"title" => "Test 32"}].to_json
      @site.pages << search
    end
  end
end
Inspired by jekyll-feed code.

ruby sketchup scene serialization

I am very new to SketchUp and Ruby. I have worked with Java and C#, but this is my first time with Ruby.
Now I have a problem: I need to serialize the whole scene into one JSON file (scene hierarchy, object name, object material and position, per object). How can I do this?
I have already done this for Unity3D (C#) without a problem.
I tried this:
def main
  avr_entities = Sketchup.active_model.entities # all objects
  ambiens_dictionary = {}
  ambiens_list = []
  avr_entities.each do |root|
    if root.is_a?(Sketchup::Group) || root.is_a?(Sketchup::ComponentInstance)
      if root.name == ""
        UI.messagebox("this is a group #{root.definition.name}")
        if root.entities.count > 0
          root.entities.each do |leaf|
            if leaf.is_a?(Sketchup::Group) || leaf.is_a?(Sketchup::ComponentInstance)
              UI.messagebox("this is a leaf #{leaf.definition.name}")
            end
          end
        end
      else
        # UI.messagebox("this is a leaf #{root.name}")
      end
    end
  end
end
Have you tried the JSON library?
require 'json'

source = { a: [ { b: "hello" }, 1, "world" ], c: 'hi' }
source.to_json # => "{\"a\":[{\"b\":\"hello\"},1,\"world\"],\"c\":\"hi\"}"
I used the code below to answer a question here, but it might also work for this one.
The code can run outside of SketchUp for testing in the terminal. Just make sure to follow these steps:
Copy the code below and paste it into a Ruby file (example: file.rb).
Run the script in the terminal: ruby file.rb.
The script will write data to a JSON file and also read the content of that JSON file back.
The path to the JSON file is relative to the Ruby file created in step one. If the script can't find the file, it will create the JSON file for you.
module DeveloperName
  module PluginName
    require 'json'
    require 'fileutils'

    class Main
      def initialize
        path = File.dirname(__FILE__)
        @json = File.join(path, 'file.json')
        @content = { 'hello' => 'hello world' }.to_json
        json_create(@content)
        json_read(@json)
      end

      def json_create(content)
        File.open(@json, 'w') { |f| f.write(content) }
      end

      def json_read(json)
        if File.exist?(json)
          file = File.read(json)
          data_hash = JSON.parse(file)
          puts "Json content: #{data_hash}"
        else
          msg = 'JSON file not found'
          UI.messagebox(msg, MB_OK)
        end
      end
      # # #
    end
    DeveloperName::PluginName::Main.new
  end
end
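To get at what the question actually asks for (hierarchy, name, material and position in a single JSON), a recursive walk over groups and component instances could be bolted onto the same JSON plumbing. This is a rough sketch rather than a complete exporter: serialize_entity is a hypothetical helper, and it assumes the standard SketchUp Ruby API (Drawingelement#material, Transformation#origin) plus write access to the current working directory:

require 'json'

# Recursively collect name, material and origin for one group/component.
def serialize_entity(entity)
  children = entity.is_a?(Sketchup::Group) ? entity.entities : entity.definition.entities
  {
    'name'     => entity.name.empty? ? entity.definition.name : entity.name,
    'material' => entity.material ? entity.material.name : nil,
    'position' => entity.transformation.origin.to_a,
    'children' => children.select { |c| c.is_a?(Sketchup::Group) || c.is_a?(Sketchup::ComponentInstance) }
                          .map { |c| serialize_entity(c) }
  }
end

roots = Sketchup.active_model.entities.select do |e|
  e.is_a?(Sketchup::Group) || e.is_a?(Sketchup::ComponentInstance)
end
File.write('scene.json', roots.map { |e| serialize_entity(e) }.to_json)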

blank file while copying a file in python

I have a function that takes a file as input, prints certain statistics, and also copies the file into a file name provided by the user. Here is my current code:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in infile:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))

copy_file('statistics')
It prints the statistics of the file correctly; however, the copy it makes of the file is empty. If I remove the readlines() part and the statistics part, the function seems to make a copy of the file correctly. How can I correct my code so that it does both? It's a minor problem, but I can't seem to get it.
The reason the file is blank is that
slist = infile.readlines()
is reading the entire contents of the file, so when it gets to
for line in infile:
there is nothing left to read and it just closes the newly truncated (mode w) file, leaving you with a blank file.
I think the answer here is to change your for line in infile: to for line in slist:
def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    infile = open(infile_name, 'r')
    outfile_name = input("Please enter the name of the new copy: ")
    outfile = open(outfile_name, 'w')
    slist = infile.readlines()
    if option == 'statistics':
        for line in slist:
            outfile.write(line)
        infile.close()
        outfile.close()
        result = []
        blank_count = slist.count('\n')
        for item in slist:
            result.append(len(item))
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(result)/len(slist), (sum(result)-blank_count)/(len(slist)-blank_count)))

copy_file('statistics')
Having said all that, consider whether it's worth using your own copy routine rather than shutil.copy; it's always better to delegate the task to your OS, as it will be quicker and probably safer (thanks to NightShadeQueen for the reminder)!
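As a sketch of that suggestion (not the answer's code), the copy and the statistics can be separated so shutil handles the file duplication:

import shutil

def copy_file(option):
    infile_name = input("Please enter the name of the file to copy: ")
    outfile_name = input("Please enter the name of the new copy: ")
    shutil.copy(infile_name, outfile_name)  # delegate the copy to the OS
    if option == 'statistics':
        with open(infile_name) as infile:
            slist = infile.readlines()
        blank_count = slist.count('\n')
        lengths = [len(line) for line in slist]
        print('\n{0:<5d} lines in the list\n{1:>5d} empty lines\n{2:>7.1f} average character per line\n{3:>7.1f} average character per non-empty line'.format(
            len(slist), blank_count, sum(lengths)/len(slist), (sum(lengths)-blank_count)/(len(slist)-blank_count)))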

Opening multiple html files & outputting to .txt with Nokogiri

Just wondering whether these two tasks should be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"

doc = Nokogiri.parse(open("example.html"))

doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end

doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm just trying to parse a directory of .html documents, get the information from the meta HTML tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()); maybe it works with ::HTML or ::XML).
Question 2: But more importantly, is it possible to send all of those meta content outputs into a text file, replacing the puts command?
Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / XPath / Ruby.
Thanks.
I have some similar code.
Please refer to:
module MyParser
  require 'csv'

  HTML_FILE_DIR = `your html file dir`

  def self.run(options = {})
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }
    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  def self.parse_to_hash(doc)
    array = []
    array << doc.css(`your select conditions`).first.content
    ... # add your selector code, css or xpath
    array
  end

  def self.write_csv(result)
    ::CSV.open("`your out put file name`", 'w') do |csv|
      result.each { |row| csv << row }
    end
  end
end
MyParser.run
You can output to a file like so:
File.open('results.txt', 'w') do |file|
  file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")

results = authors.map { |x| "Author: #{x}" }.join("\n") + "\n" +
          keywrds.map { |x| "Keywords: #{x}" }.join("\n")

File.open('results.txt', 'w') { |f| f << results }
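Putting both questions together, one loop can collect every file's metadata into a single text file. A sketch, assuming the .html files sit in the current directory and results.txt is the desired output name:

require 'nokogiri'

File.open('results.txt', 'w') do |out|
  Dir.glob('*.html') do |path|
    doc = Nokogiri::HTML(File.read(path))
    doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |a|
      out.puts "#{path} Author: #{a}"
    end
    doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |k|
      out.puts "#{path} Keywords: #{k}"
    end
  end
end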