HTML tidy/cleaning in Ruby 1.9 - html

I'm currently using the RubyTidy Ruby bindings for HTML tidy to make sure HTML I receive is well-formed. Currently this library is the only thing holding me back from getting a Rails application on Ruby 1.9. Are there any alternative libraries out there that will tidy up chunks of HTML on Ruby 1.9?

tidy_ffi (http://github.com/libc/tidy_ffi/blob/master/README.rdoc) works with Ruby 1.9 (latest version).
If you are working on Windows, you need to set the library_path, e.g.:
require 'tidy_ffi'
TidyFFI.library_path = 'lib\\tidy\\bin\\tidy.dll'
tidy = TidyFFI::Tidy.new('test')
puts tidy.clean
(It uses the same DLL as tidy.) The link above gives more usage examples.

I am using Nokogiri to fix invalid html:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Here is a nice example of how to make your html look better using tidy:
require 'tidy'
Tidy.path = '/opt/local/lib/libtidy.dylib' # or where ever your tidylib resides
nice_html = ""
Tidy.open(:show_warnings=>true) do |tidy|
tidy.options.output_xhtml = true
tidy.options.wrap = 0
tidy.options.indent = 'auto'
tidy.options.indent_attributes = false
tidy.options.indent_spaces = 4
tidy.options.vertical_space = false
tidy.options.char_encoding = 'utf8'
nice_html = tidy.clean(my_nasty_html_string)
end
# remove excess newlines
nice_html = nice_html.strip.gsub(/\n+/, "\n")
puts nice_html
For more tidy options, check out the man page.

"Currently this library is the only thing holding me back from getting a Rails application on Ruby 1.9."
Watch out, the Ruby Tidy bindings have some nasty memory leaks. They are currently unusable in long-running processes. (For the record, I'm using http://github.com/ak47/tidy.)
I just had to remove it from a production Rails 2.3 application because it was leaking about 1MB/min.

Related

Quartz 2.6.2 and .NET Core? - Error "Could Not Initialize DataSource"

I'm using an older version of Quartz.NET (v2.6.2) with .NET Core (or possibly .NET5). I'm getting an error when attempting to use the StdSchedulerFactory.GetScheduler. All my configuration settings are within my appsettings.json where I populate a NameValueCollection with these values and inject them into my classes with DI.
["quartz.scheduler.instanceId"] = "instance_one",
["quartz.threadPool.type"] = "Quartz.Simpl.SimpleThreadPool, Quartz",
["quartz.threadPool.threadCount"] = "5",
["quartz.jobStore.misfireThreshold"] = "60000",
["quartz.jobStore.type"] = "Quartz.Impl.AdoJobStore.JobStoreTX, Quartz",
["quartz.jobStore.useProperties"] = "false",
["quartz.jobStore.dataSource"] = "default",
["quartz.jobStore.tablePrefix"] = "QRTZ_",
["quartz.dataSource.default.provider"] = "SqlServer-20",
["quartz.dataSource.default.connectionString"] = quartzConn
I am using the StdSchedulerFactory like this, where Settings.Properties is that NameValueCollection which contains all the config settings:
var factory = new StdSchedulerFactory(Settings.Properties);
var scheduler = factory.GetScheduler();
On the GetScheduler method, the error, "Could Not Initialize Datasource: default" is thrown.
The odd thing is that this code works fine in a .NET Framework 4.x project that uses a regular web.config to supply the configuration settings. The code above also works fine, with the configuration in appsettings.json, when I switch to Quartz 3.x. It seems that mixing Quartz 2.6.2 with .NET Core is causing an issue where Quartz doesn't know how to retrieve some value?
Is there a way to manually build my scheduler and not use the factory?
Thanks!
I've had to go back to Framework 4 and Quartz 2.6 to get them to play nicely together. I can only get Quartz 3.x to work with .NET Core/5. Stepping through the source code with dotPeek, Quartz 2.6 is using ConfigurationManager to pull web.config details that don't exist in Core/5. At this point I don't remember if I tried to add my own web.config file to this project or not, but I've since moved on.

Convert xml to html using XSLT stylesheet in node.js

Has anyone tried to convert an XML file into an HTML webpage using an XSLT stylesheet in Node.js? My background is in Java, where I normally use Saxon to convert XML into HTML webpages. I am new to Node.js. I have tried to implement this using a few libraries such as node_xslt and libxsltjs, but was not successful. If anyone has used other libraries that work with XSLT stylesheets, please post a link. Any help would be appreciated.
If you want to use Saxon from a Node.js application, you basically have three choices, none of them ideal:
(a) call out to Java, using a variety of mechanisms.
(b) use the port of Saxon/C to Node.js being constructed here: https://github.com/rimmartin/saxon-node This is bleeding-edge stuff and I don't know how far the project has got.
(c) wait for Saxon-JS to arrive any time soon. See http://dev.saxonica.com/blog/mike/2016/02/introducing-saxon-js.html
At time of writing this works for me...
install saxon...
> npm install saxon-js (see https://www.npmjs.com/package/saxon-js)
write a little test program
const saxon = require('saxon-js');
const env = saxon.getPlatform();
const doc = env.parseXmlFromString(env.readFile("styles/listview.xsl"));
doc._saxonBaseUri = "dummy";
const sef = saxon.compile(doc);
let xml = "<EMPLOYEE_ID>107</EMPLOYEE_ID><FIRST_NAME>Summer</FIRST_NAME><LAST_NAME>Payne</LAST_NAME>summer.payne#example.com515.123.8181<HIRE_DATE>2016-06-07</HIRE_DATE><MANAGER_ID>106</MANAGER_ID><JOB_TITLE>Public Accountant</JOB_TITLE>";
let html = saxon.transform({
stylesheetInternal:sef,
sourceType: "xml",
sourceText:xml,
destination: "serialized"}, "async"
).then( output => {
console.log(output.principalResult);
} );
run the test program from the command line...
> node test.js
The output should be the transformed XML.
Luck.

How to render html file as haml

I'm fairly new to ruby and working on building a front-end styleguide that has html snippets I'd like to render as haml into a pre tag. I'm building a helper for middleman and have figured out how to read an HTML file and output its contents. Now I'd like to convert the html to haml and output that.
Looking around it seems like the html2haml gem is what I want to use, though the doc on that gem seems to only cover using it on the command line, whereas I'm trying to add this functionality to a helper.
Here is what I have so far for a helper
helpers do
  def render_snippet(page)
    p1 = "<pre><code>".html_safe
    p2 = File.read("source/#{page}")
    p3 = "</code></pre>".html_safe
    # p0 was never defined, so only the three defined parts are joined
    p1 + p2 + p3
  end
end
Here is how I'm using the helper
= render_snippet "partials/examples/typography/elements.html"
To answer your question, this is how you can write a helper that uses the html2haml gem from within your application rather than from the terminal:
# some_view.html.erb
<%= render html_2_haml("home/my_partial.html") %>
# app/helpers/application_helper.rb
module ApplicationHelper
def html_2_haml(path)
file_name = path.split("/").last
path_with_underscore = path.gsub(file_name, "_#{file_name}")
system "html2haml app/views/#{path_with_underscore} app/views/#{path_with_underscore}.haml"
"#{path}.haml"
end
end
I should say that this definitely will not work in production (it dynamically creates a new file, and hosting services like Heroku just won't allow that), but if you're just making yourself a development helper, then perhaps this could be useful to you.
I ended up working on this some more and ended up with the following:
def render_html2haml(file)
templateSource = preserve(File.read("source/"+"#{file}"))
haml = Html2haml::HTML.new(templateSource, {:erb => nil})
content_tag(:pre, content_tag(:code, haml.render))
end
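For readers who only need the "show markup inside a pre tag" part, here is a minimal sketch that uses nothing but Ruby's standard library. It assumes the snippet is already a string; the helper name wrap_snippet is hypothetical and not part of Middleman or html2haml. The point it illustrates is that the markup has to be escaped before it goes inside the code element, otherwise the browser renders it instead of displaying it.

```ruby
require 'erb'

# Hypothetical helper (not part of Middleman or html2haml):
# escape the snippet so the browser shows the markup as text,
# then wrap it in <pre><code>.
def wrap_snippet(source)
  "<pre><code>#{ERB::Util.html_escape(source)}</code></pre>"
end

puts wrap_snippet('<p class="lead">Hello</p>')
```

In the helper above, content_tag is expected to take care of this escaping for you, so the sketch is mainly useful outside a framework.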

General approach to reading lnk files

Several frameworks and languages seem to have lnk file parsers (C#, Java, Python, certainly countless others), to get to their targets, properties, etc. I'd like to know what is the general approach to reading lnk files, if I want to parse the lnk in another language that does not have said feature. Is there a Windows API for this?
As far as I know, there is no official document from Microsoft describing the lnk file format, but there are some documents that describe it. Here is one of them: Shortcut File Format (.lnk)
As for the API you can use IShellLink Interface
Simply use lnk file parser at J.A.F.A.T. Archive of Forensics Analysis Tools project.
See lnk-parse-1.0.pl at http://jafat.sourceforge.net
It seems to have no dependencies. The syntax is simple: the link file is printed as plain text on standard output, and the script is usable on Linux.
This is an old post, but here is my C# implementation for lnk processing that handles the entire spec, more info and command line tool on this blogspot page.
Using WSH-related components seems the most convenient option to read .lnk files in most languages on a post-XP windows system. You just need access to the COM environment and instantiate the WScript.Shell Component. (remember that on win, references to the Shell usually refer to explorer.exe)
For example, the following snippet does this in PHP (PHP 5, using the COM object):
<?php
$wsh=new COM('WScript.Shell'); // the wsh object
// please note $wsh->CreateShortcut method actually
// DOES THE READING, if the file already exists.
$lnk=$wsh->CreateShortcut('./Shortcut.lnk');
echo $lnk->TargetPath,"\n";
This other one, instead, does the same on VBScript:
set sh = WScript.CreateObject("WScript.Shell")
set lnk = sh.CreateShortcut("./Shortcut.lnk")
MsgBox lnk.TargetPath
Most examples in the field are written in VB/VBS, but they translate well to the whole range of languages that support COM and WSH interaction in one form or another.
This simple tutorial may come in handy, as it lists and illustrates some of the most interesting properties of a .lnk file besides the most important one, TargetPath. Those are:
WindowStyle,
Hotkey,
IconLocation,
Description,
WorkingDirectory
Here's some C# code using the Shell32 API, from my "ClearRecentLinks" project at https://github.com/jmaton/ClearRecentLinks
To use this, your C# project has to reference c:\windows\system32\shell32.dll
// verbatim string literal, so the backslashes are not treated as escape sequences
string linksPath = @"c:\some\folder";
Type shell32Type = Type.GetTypeFromProgID("Shell.Application");
Object shell = Activator.CreateInstance(shell32Type);
Shell32.Folder s32Folder = (Shell32.Folder)shell32Type.InvokeMember("NameSpace", System.Reflection.BindingFlags.InvokeMethod, null, shell, new object[] { linksPath });
foreach (Shell32.FolderItem2 item in s32Folder.Items())
{
if (item.IsLink)
{
var link = (Shell32.ShellLinkObject)item.GetLink;
if (link != null && !String.IsNullOrEmpty(link.Target.Path))
{
string linkTarget = link.Target.Path.ToLower();
// do something...
}
}
}
@Giorgi: Actually, there is an official specification of lnk files, or at least it is claimed that there is.
However, for some reason, the link seems to be dead, and after downloading the whole (45 MB) doc package (Application_Services_and_NET_Framework.zip), it appears that it does not include the file MS-SHLLINK.pdf.
But is this really surprising?
Once you have the file format, it shouldn't be too hard to write code to read it.
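To give an idea of what "reading it yourself" looks like, here is a minimal Ruby sketch that parses only the fixed-size header at the start of a .lnk file. It assumes the layout given in the MS-SHLLINK document: a 76-byte header that starts with the header size, followed by the link CLSID, the link flags and the file attributes. The names parse_lnk_header and LINK_FLAGS are made up for this example, and the sketch does not decode the target path, which lives in the variable-length sections after the header.

```ruby
# Sketch only: reads the fixed 76-byte header of a .lnk file.
LNK_HEADER_SIZE = 0x4C
# On-disk bytes of CLSID 00021401-0000-0000-C000-000000000046
LNK_CLSID = [0x01, 0x14, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00,
             0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46].pack('C*')

# Low bits of the LinkFlags field (names chosen for this example).
LINK_FLAGS = {
  has_target_id_list: 0x01,
  has_link_info:      0x02,
  has_name:           0x04,
  has_relative_path:  0x08,
  has_working_dir:    0x10,
  has_arguments:      0x20,
  has_icon_location:  0x40,
  is_unicode:         0x80
}.freeze

def parse_lnk_header(bytes)
  bytes = bytes.b # treat the input as raw bytes
  raise ArgumentError, 'too short for a .lnk header' if bytes.bytesize < LNK_HEADER_SIZE

  header_size = bytes[0, 4].unpack('V').first
  raise ArgumentError, 'unexpected header size' unless header_size == LNK_HEADER_SIZE
  raise ArgumentError, 'CLSID mismatch, not a shell link' unless bytes[4, 16] == LNK_CLSID

  # All integers are little-endian; the icon index is signed.
  link_flags, file_attributes = bytes[20, 8].unpack('VV')
  file_size, icon_index, show_command, hot_key = bytes[52, 14].unpack('Vl<Vv')

  {
    flags: LINK_FLAGS.select { |_name, bit| link_flags & bit != 0 }.keys,
    file_attributes: file_attributes,
    file_size: file_size,
    icon_index: icon_index,
    show_command: show_command,
    hot_key: hot_key
  }
end

# Usage (path is a placeholder):
# p parse_lnk_header(File.binread('Shortcut.lnk'))
```

The flags tell you which optional sections follow the header, so a full parser would walk those sections in order to reach the target path, working directory and so on.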

Using Tidy in JRuby

After spending some hours with the Ruby Debugger I finally learned that I need to clean up some malformed HTML pages before I can feed those to Hpricot. The best solution I found so far is the Tidy Ruby interface.
Tidy works great from the command line and also the Ruby interface works. However, it requires dl/import, which fails to load in JRuby:
$ jirb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'tidy'
LoadError: no such file to load -- dl/import
Is this library available for JRuby? A web search revealed that it wasn't available last year.
Alternatively, can someone suggest other ways to clean up malformed HTML in JRuby?
Update
Following Markus' suggestion I now use Tidy via popen instead of libtidy. I posted the code which pipes the document data through tidy for future reference. Hopefully, this is robust and portable.
def clean(data)
  cleaned = nil
  tidy = IO.popen('tidy -f "log/tidy.log" --force-output yes -wrap 0 -utf8', 'w+')
  begin
    tidy.write(data)
    tidy.close_write
    cleaned = tidy.read
    tidy.close_read
  rescue Errno::EPIPE
    # interpolate the exception; String + Exception would raise a TypeError
    $stderr.print "Running 'tidy' failed: #{$!}"
    tidy.close
  end
  return cleaned if cleaned and cleaned != ""
  return data
end
You could use it from the command line from within JRuby with %x{...} or backticks. You may also want to consider popen (and pipe things through it).
Not elegant perhaps, but more likely to get you going with minimal hassle than trying to mess with unsupported libraries.
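As a variation on the piping approach, here is a minimal sketch using Open3 from the standard library, assuming Open3 is available on your runtime and that a tidy executable is on the PATH. The helper name pipe_through and the constant TIDY_COMMAND are made up for this example; the tidy options are the ones from the question plus -q for quiet output.

```ruby
require 'open3'

# Pipe +data+ through an external filter command (given as an array of
# program and arguments) and return what it writes to stdout.
# Falls back to the unmodified input if the command cannot be started
# or produces no output.
def pipe_through(command, data)
  output, _errors, _status = Open3.capture3(*command, stdin_data: data)
  output.empty? ? data : output
rescue SystemCallError => e
  warn "Running #{command.first} failed: #{e.message}"
  data
end

# Assumed command line, based on the options used in the question.
TIDY_COMMAND = %w[tidy -q --force-output yes -wrap 0 -utf8].freeze

# Usage:
# cleaned = pipe_through(TIDY_COMMAND, html)
```

Passing the command as an array avoids going through a shell, and capture3 takes care of writing the input and reading the output without the manual close_write and close_read calls.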