I've tracked down a bug in the rubyrep replication library that results from Ruby's collation not being locale aware. It thinks that '-' comes before 'a' when sorting, which is not correct, at least for the en_US.UTF-8 locale (and the C locale).
Right now the database is sorting these strings in a proper locale-aware way, but Ruby is not.
What's the easiest way for a JRuby novice to get locale-aware string comparisons working so I can patch this code? I'm fine with hardcoding the locale I want into the code if that's necessary.
(If there is no easy way I will abandon JRuby and use this lib, but I'm hoping there is a JRuby way so I can keep the speed advantage.)
Pardon my asking, but how did you figure that '-' should sort after 'a' in UTF-8? In this ASCII-compatible block, at least, I expect '-' to come before 'a'.
JRuby strives to be compatible with MRI, so however MRI behaves here, JRuby will behave the same way.
Also, JRuby has FFI built in, so you are free to use the library you mention.
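If you want locale-aware collation inside JRuby itself, another option is to lean on the JVM's java.text.Collator rather than an external library. A minimal sketch, with the locale hardcoded and made-up sample strings:

require 'java'

java_import 'java.text.Collator'
java_import 'java.util.Locale'

# Collator for a hardcoded locale (en_US here).
collator = Collator.getInstance(Locale.new('en', 'US'))

# Made-up sample strings; sort them using the locale's collation rules.
words = ['a-c', 'ab', 'aa']
puts words.sort { |x, y| collator.compare(x, y) }.inspect

Collator also has strength settings (primary/secondary/tertiary) if you need to control how accents and case affect the ordering.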
I have a couple of questions about adding options/switches (with and without parameters) to procedures/commands. I see that tcllib has cmdline, and Ashok Nadkarni's book on Tcl recommends the parse_args package, stating that using Tcl to handle the arguments is much slower than this package's C implementation. The Nov. 2016 paper on parse_args states that Tcl script methods are, or can be, 50 times slower.
Are Tcl methods really significantly slower? Is there some minimum threshold number of options to be reached before using a package?
Is there any reason to use parse_args (not in tcllib) over cmdline (in tcllib)?
Can both be easily included in a starkit?
Is this included in 8.7a now? (I'd like to use 8.7a but I'm using Manjaro Linux and am afraid that adding it outside the package manager will cause issues that I won't know how to resolve or even just "undo").
Thank you for considering my questions.
Are Tcl methods really significantly slower? Is there some minimum threshold number of options to be reached before using a package?
Potentially. Procedures have overhead to do with managing the stack frame and so on, and code implemented in C can avoid a number of overheads due to the way values are managed in current Tcl implementations. The difference is much more profound for numeric code than for string-based code, as the cost of boxing and unboxing numeric values is quite significant (strings are always boxed in all languages).
As for which is the one to use, it really depends on the details, as you are trading off flexibility for speed. I've never known it to be a problem for command-line parsing.
(If you ask me, fifty options isn't really that many, except that it's quite a lot to pass on an actual command line. It might be easier to design a configuration file format — perhaps a simple Tcl script! — and then to just pass the name of that in as the actual argument.)
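For example, a config file that is itself a tiny Tcl script can be evaluated in a safe interpreter so that it can do nothing more dangerous than set variables; every name below is made up for illustration:

# settings.tcl might contain nothing but plain variable assignments:
#     set threads   8
#     set verbose   1
#     set outputDir /tmp/results

set cfg [interp create -safe]
$cfg invokehidden source settings.tcl
foreach name {threads verbose outputDir} {
    set $name [$cfg eval [list set $name]]
}
interp delete $cfg
puts "threads=$threads verbose=$verbose outputDir=$outputDir"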
Is there any reason to use parse_args (not in tcllib) over cmdline (in tcllib)?
Performance? Details of how you describe things to the parser?
Can both be easily included in a starkit?
As long as any C code is built with Tcl stubs enabled (typically not much more than defining USE_TCL_STUBS and linking against the stub library) then it can go in a starkit as a loadable library. Using the stubbed build means that the compiled code doesn't assume exactly which version of the Tcl library is present or what its path is; those are assumptions that are usually wrong with a starkit.
Tcl-implemented packages can always go in a starkit. Hybrid packages need a little care for their C parts, but are otherwise pretty easy.
Many packages either always build in stubbed mode or have a build configuration option to do so.
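For a simple hand-rolled build (outside the usual TEA autoconf machinery), the stub-enabled compile is roughly the following; the paths and the 8.6 stub library name are assumptions you'll need to adjust for your system:

cc -shared -fPIC -DUSE_TCL_STUBS \
    -I/usr/include/tcl8.6 \
    mypkg.c -o libmypkg.so \
    -L/usr/lib -ltclstub8.6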
Is this included in 8.7a now? (I'd like to use 8.7a but I'm using Manjaro Linux and am afraid that adding it outside the package manager will cause issues that I won't know how to resolve or even just "undo").
We think we're about a month from the feature freeze for 8.7, and builds seem stable in automated testing so the beta phase will probably be fairly short. The list of what's in can be found here (filter for 8.7 and Final). However, bear in mind that we tend to feel that if code can be done in an extension then there's usually no desperate need for it to be in Tcl itself.
What's the difference between the Perl JSON modules below?
I have come across JSON::PP and JSON::XS. The documentation of JSON::PP says it is compatible with JSON::XS. What does that mean?
I am not sure what the differences between them are, let alone which of them to use. Can someone clarify?
Perl modules sometimes have different implementations. The ::PP suffix is for the Pure Perl implementation (i.e. for portability), the ::XS suffix is for the C-based implementation (i.e. for speed), and JSON is just the top-level module itself (i.e. the one you actually use).
As noted by @Quentin, this site has a good description of them. To quote:
JSON
JSON.pm is a wrapper around JSON::PP and JSON::XS - it also does a bunch of moderately crazy things for compatibility reasons, including extra shim code for very old perls [...]
JSON::PP
This is the standard pure perl implementation, and if you're not performance dependent, there's nothing wrong with using it directly [...]
JSON::XS
Ridiculously fast JSON implementation in C. Absolutely wonderful [...]
As you can see, just installing the top-level JSON module should do it for you. The part about compatibility just means that they both do the same thing, i.e. you should get the same output from both.
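A quick way to see that compatibility in practice (the sample data is made up; requires both modules to be installed):

use JSON::PP ();
use JSON::XS ();

my $data = { name => 'example', tags => [ 1, 2, 3 ] };

# canonical() sorts hash keys, so the two encoders produce identical text.
print JSON::PP->new->canonical->encode($data), "\n";
print JSON::XS->new->canonical->encode($data), "\n";
# Both print: {"name":"example","tags":[1,2,3]}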
I installed the Perl JSON module a few years ago on a RHEL server I managed and it was a really straightforward process: just install (or build) the module from the CPAN site and you're done.
Installing should be a simple case of either using the OS package manager (if in GNU/Linux), using the cpan utility, or building from source. The OS package manager is recommended, as it helps keep things updated automatically.
To verify that it's installed, just try the following command from the terminal (assuming GNU/Linux):
$ perl -e 'use JSON;'
If it doesn't complain, then you should be good to go. If you get errors, then you should get ready to go on an adventure.
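You can also ask the JSON wrapper which implementation it ended up loading (this is the backend method from the JSON distribution, as far as I recall; check perldoc JSON if it's not available in your version):

$ perl -MJSON -e 'print JSON->backend, "\n"'
# prints JSON::XS if the XS backend is installed, otherwise JSON::PP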
You can install the JSON module with cpan JSON and then use it like this (the $json string here is just made-up sample input):

use JSON;

my $json   = '{"field":"value"}';   # made-up sample JSON text
my $result = from_json($json);

if ($result->{field}) {
    # YOUR CODE
}
Some friends and I have been working on a set of scripts that make it easier to do work on the machines at uni. One of these tools currently uses Nokogiri, but in order for these tools to run on all machines with as little setup as possible we've been trying to find a 'native' html parser, instead of requiring users to install RVM and custom gems (due to disk space limitations for most users).
Are we pretty much restricted to Nokogiri/Hpricot/? Should we look at just writing our own custom parser that fits our needs?
Cheers.
EDIT: If there's posts on here that I've missed in my searches, let me know! S.O. is sometimes just too large to find things effectively...
There is no HTML parser in the Ruby stdlib; HTML parsers have to be more forgiving of bad markup than XML parsers. You could run the HTML through tidy (http://tidy.sourceforge.net) to clean it up and produce valid markup, which can then be read with REXML :-) which is in the stdlib. Bear in mind that REXML is much slower than Nokogiri (last checked in 2009), though Sam Ruby had been working on making it faster.
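A rough sketch of that tidy-plus-REXML approach (assumes the tidy binary is installed and on PATH; the input file name and the link-listing query are made up):

require 'rexml/document'

html = File.read('page.html')

# Pipe the HTML through tidy to get well-formed XHTML.
xml = IO.popen('tidy -quiet -asxhtml -numeric', 'r+') do |io|
  io.write(html)
  io.close_write
  io.read
end

doc = REXML::Document.new(xml)
# tidy's XHTML output puts elements in the XHTML namespace, so map it for XPath.
REXML::XPath.each(doc, '//x:a', 'x' => 'http://www.w3.org/1999/xhtml') do |a|
  puts a.attributes['href']
end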
A better way would be to have a better deployment. Take a look at http://gembundler.com/bundle_package.html and at using Capistrano (or some such) to provision servers.
When attempting to use pandoc to convert JSON-based files (.ipynb) from IPython notebook (0.12), I receive an error stating "bad decodeArgs" for the JSON. I suspect that it may be due to the Ubuntu-provided version of pandoc that I am using (1.8.1.1). It seems that getting the latest pandoc version requires setting up the Haskell Platform, which I was not successful in doing because of dependency challenges (and really don't want to). I don't want to spend any more time trying to install Haskell if this is not my problem.
Is there a way to get the latest pandoc binaries for Ubuntu without rebuilding it?
Given that iPython notebook is new (and very cool!!), it would be nice to hear about experiences related to translating the JSON to other formats. Perhaps there is a different way to accomplish this other than pandoc.
Regarding your "keeping up to date with Pandoc", I'm afraid you do need Haskell installed. The best way to do this is via the Haskell Platform ("HP") package, and then, just like with Ruby, it's a lot more consistent to use that environment's package manager for dependencies than your OS's. I've had no trouble getting it working, even in Windoze...
I'm sure questions to the Haskell mailing list would result in quick help for a platform as mainstream as Debian/Ubuntu, but you might need to manually install a newer version of HP than what's available through the OS package manager.
Once you get HP up and running, the dev Pandoc is dead easy to compile, and git will keep you up to date with the latest - specific instructions here, currently maintained:
https://github.com/jgm/pandoc/wiki/Installing-the-development-version-of-pandoc-1.9
Note that v1.9 has now been officially released if you really don't want to go to the trouble of keeping up to date with the dev cycle, but of course again you won't get it in your OS package manager for quite some time after that (I assume anyway).
==========================
Regarding your attempts to treat JSON as a document syntax:
The best syntax inputs for Pandoc at this point are its native markdown+extensions, and reST (especially for Python people/environments), basically maintained as functionally equivalent, although there may be features available in the former that aren't represented in the latter, since John can just add extensions anytime he wants. AFAIK Pandoc hasn't begun to support the Sphinx extensions (yet?)
The JSON format used internally within Pandoc isn't documented (yet?) but it's the native Haskell data type. As Thomas K notes, there may be some similarity between how the two tools represent data, but probably not enough to treat either as "just another markup format".
However, if you're working on this, it's easy enough to see what Pandoc looks for in the way of JSON input.
pandoc -t json
compare this to
pandoc -t native
and it's easy to see the specs created by Text.Pandoc.Definition and Text.JSON.Generic
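For example, feed the same scrap of markdown to both and diff the results; the JSON is essentially a serialisation of the same tree the native output shows:

$ echo 'hello *world*' | pandoc -t json
$ echo 'hello *world*' | pandoc -t native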
Using Pandoc's internal data representation as input would obviously be more stable than a marked-up text stream. Others have expressed a desire for documentation on this, and it would be a great contribution to the community.
Please do inform the Pandoc mail list of any work done in this area. The crew there is very responsive, including getting quick feedback from John M (the lead developer) himself directly.
I doubt pandoc or any other tool knows what to do with ipynb files yet (at the time of writing, the IPython notebook was released less than a month ago). JSON is just a generic data structure like XML, not a document format.
We're (IPython) working on tools to export notebooks to other formats, but they're not ready for a proper release yet. If you want to help develop that, see this mailing list thread. Hopefully it will be part of the next IPython release.
How to convert from chinese characters to hanyu pinyin?
E.g.
你 --> Nǐ
马 --> Mǎ
More Info:
Either accents or numerical forms of hanyu pinyin are acceptable, the numerical form being my preference.
A Java library is preferred, however, a library in another language that can be put in a wrapper is also OK.
I would like anyone who has personally used such a library before to recommend or comment on it, in terms of its quality/reliability.
The problem of converting hanzi to pinyin is a fairly difficult one. There are many hanzi characters which have multiple pinyin representations, depending on context. Compare 长大 (pinyin: zhang da) to 长城 (pinyin: chang cheng). For this reason, single-character conversion is often actually useless, unless you have a system that outputs multiple possibilities. There is also the issue of word segmentation, which can affect the pinyin representation as well. Though perhaps you already knew this, I thought it was important to mention.
That said, the Adso Package contains both a segmenter and a probabilistic pinyin annotator, based on the excellent Adso library. It takes a while to get used to though, and may be much larger than you are looking for (I have found in the past that it was a bit too bulky for my needs). Additionally, there doesn't appear to be a public API anywhere, and it's C++ ...
For a recent project, because I was working with place names, I simply used the Google Translate API (specifically, the unofficial Java port), which, for common nouns at least, usually does a good job of translating to pinyin. The problem is commonly used alternative transliteration systems, such as "HongKong" for what should be "XiangGang". Given all of this, Google Translate is pretty limited, but it offers a start. I hadn't heard of pinyin4j before, but after playing with it just now, I have found that it is less than optimal: while it outputs a list of potential candidate pinyin romanizations, it makes no attempt to statistically determine their likelihood. There is a method to return a single representation, but it will soon be phased out, as it currently only returns the first romanization, not the most likely. Where the program seems to do well is with conversion between romanizations and general configurability.
In short then, the answer may be either any one of these, depending on what you need. Idiosyncratic proper nouns? Google Translate. In need of statistics? Adso. Willing to accept candidate lists without context information? Pinyin4j.
In Python, try:
from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('C')
cjk.getReadingForCharacter(u'北', 'Pinyin')
You would get
['běi', 'bèi']
Disclaimer: I'm the author of that library.
For Java, I'd try the pinyin4j library.
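A minimal sketch of pinyin4j's basic call (by default it returns numbered-tone candidates per character; check the project docs if the API has moved on since):

import net.sourceforge.pinyin4j.PinyinHelper;

public class PinyinDemo {
    public static void main(String[] args) {
        // Each character can map to several candidate readings.
        for (char c : "你马".toCharArray()) {
            String[] readings = PinyinHelper.toHanyuPinyinStringArray(c);
            System.out.println(c + " -> " + String.join(", ", readings));
        }
    }
}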
As mentioned in other answers, the conversion is fuzzy and even Google Translate apparently gets a certain percentage of character combinations wrong.
A reasonable result which will not be 100% accurate can be achieved with open-source libraries available for some programming languages.
The simplest code to do the conversion in Python uses the pypinyin library (to install it, run pip3 install pypinyin):
from pypinyin import pinyin

def to_pinyin(chin):
    return ' '.join([seg[0] for seg in pinyin(chin)])

print(to_pinyin('好久不见'))
# OUTPUT: hǎo jiǔ bú jiàn
NOTE: The pinyin method from the module returns a list of possible candidate segments, and the to_pinyin method takes the first variant whenever more than one conversion is available. For tricky corner cases this is likely to produce incorrect results, but generally you'll probably get at least a ~90..95% success rate.
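Since you mentioned the numbered form is your preference, pypinyin can also emit that, and it can return every candidate reading per character (a small sketch; the exact ordering of candidates may differ on your install):

from pypinyin import pinyin, Style

# Numbered tones instead of diacritics.
print(pinyin('你马', style=Style.TONE3))   # e.g. [['ni3'], ['ma3']]

# All candidate readings for each character.
print(pinyin('长', heteronym=True))        # e.g. [['zhǎng', 'cháng']]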
There are a few other python libraries for pinyin conversion but in my tests they proved to have a higher error rate than pypinyin. Also, they don't appear to be actively maintained.
If you need better accuracy then you'll need a more complex approach that will rely on bigger datasets and possibly some machine learning.