What should we use instead of nltk.Text.generate()? - nltk

It seems that nltk.Text.generate() is not available in NLTK 3.0 (see this answer). How should we be generating sentences instead? Thanks.

Unfortunately the generate() function relied on a buggy implementation of ngram models. It has been removed from NLTK 3.0 until someone can get around to fixing it, as you can see here (search for the words "removed ngram model package"). No replacement for this functionality has been provided.
The package nltk.model is still present in the NLTK 3.0 source tree, but it is not part of the distribution. So in principle you could download the source and get it to work, but given the bugs that led to its removal, it's probably a better idea to do without it, or to roll your own. Random text generation is not very interesting unless you control the generation algorithm, anyway.

Related

The use of packages to parse command arguments employing options/switches?

I have a couple questions about adding options/switches (with and without parameters) to procedures/commands. I see that tcllib has cmdline and Ashok Nadkarni's book on Tcl recommends the parse_args package and states that using Tcl to handle the arguments is much slower than this package using C. The Nov. 2016 paper on parse_args states that Tcl script methods are or can be 50 times slower.
Are Tcl methods really signicantly slower? Is there some minimum threshold number of options to be reached before using a package?
Is there any reason to use parse_args (not in tcllib) over cmdline (in tcllib)?
Can both be easily included in a starkit?
Is this included in 8.7a now? (I'd like to use 8.7a but I'm using Manjaro Linux and am afraid that adding it outside the package manager will cause issues that I won't know how to resolve or even just "undo").
Thank you for considering my questions.
Are Tcl methods really signicantly slower? Is there some minimum threshold number of options to be reached before using a package?
Potentially. Procedures have overhead to do with managing the stack frame and so on, and code implemented in C can avoid a number of overheads due to the way values are managed in current Tcl implementations. The difference is much more profound for numeric code than for string-based code, as the cost of boxing and unboxing numeric values is quite significant (strings are always boxed in all languages).
As for which is the one to use, it really depends on the details as you are trading off flexibility for speed. I've never known it be a problem for command line parsing.
(If you ask me, fifty options isn't really that many, except that it's quite a lot to pass on an actual command line. It might be easier to design a configuration file format — perhaps a simple Tcl script! — and then to just pass the name of that in as the actual argument.)
Is there any reason to use parse_args (not in tcllib) over cmdline (in tcllib)?
Performance? Details of how you describe things to the parser?
Can both be easily included in a starkit?
As long as any C code is built with Tcl stubs enabled (typically not much more than define USE_TCL_STUBS and link against the stub library) then it can go in a starkit as a loadable library. Using the stubbed build means that the compiled code doesn't assume exactly which version of the Tcl library is present or what its path is; those are assumptions that are usually wrong with a starkit.
Tcl-implemented packages can always go in a starkit. Hybrid packages need a little care for their C parts, but are otherwise pretty easy.
Many packages either always build in stubbed mode or have a build configuration option to do so.
Is this included in 8.7a now? (I'd like to use 8.7a but I'm using Manjaro Linux and am afraid that adding it outside the package manager will cause issues that I won't know how to resolve or even just "undo").
We think we're about a month from the feature freeze for 8.7, and builds seem stable in automated testing so the beta phase will probably be fairly short. The list of what's in can be found here (filter for 8.7 and Final). However, bear in mind that we tend to feel that if code can be done in an extension then there's usually no desperate need for it to be in Tcl itself.

Always importing too many classes... I think

I have a basic problem with knowing which classes to import for a given application, renderer, AS package, mxml component, etc. There seems to be hundreds of classes (both mx and flash) and I'm never sure which one(s) to import... so I just keep adding import statements until the errors go away. Is there a reference somewhere that I don't know about? Or does this just come with experience? Also... does importing a load of classes actually make the file size larger or does Flex only import the classes used nregardless of what I specify? If it only uses what is needed, why wouldn't everyone just do: import mx.*;
I would suggest that if you find yourself bringing in tons of imports, you should ask yourself: Does this class do to much?
It is less of a technical issue, and more of problem of object-oriented design -- maintainability, testability and stability.
I do my best to limit my external dependencies. I try to conform to SOLID principles that tell me that classes should exist for one reason. If a class does too much, it is a "code smell" and an indication that you should split it up.
How much is too much? It is tough to have a specific litmus test or limit... I just ask myself "What does this class do"? If my answer contains an "and" in it, then I consider splitting it up.
I think your problem is a not a real problem if you use any half decent IDE. If you're not using one, you probably should (even if it's not stricly necessary and you can write and compile with notepad and the command line).
If you are using Flex/Flash Builder, it will add the imports automatically (and remove the unneeded ones as well). Also, you can use Ctrl + SPACE to prompt autocomplete, which should add the necessary imports.
Flash Develop also manages this for you (the shortcut was Ctrl + Shift + 1 if I recall correctly, but I haven't used FD for a while).
There are other IDEs out there that I haven't personally used but also have this very basic feature.
If you're using the Flash IDE, well, it really sucks for writting code, so you should probably consider writting your code in some other less brain-dead editor if you plan to do anything more than a couple of lines of code here and there (again, you can write code in the Flash IDE but why not taking advantage of better tools when they're available?).
When you get an error, look at the API Reference for the class, and then either import the whole package or just the class you want. Highlighting the class and hitting F1 should also work (but I never search help this way).
As for file size, see my answer on Is it possible to dynamically create an instance of user-defined Class in Action Script 3?
As Juan pointed out, use FlashDevelop, it is a great (and free) IDE.
If you're using FlashDevelop with the Flex Compiler, you can compile straight from FlashDevelop, and use the refactoring tools they offer to slim down your imports.
Aside from that though, if you're not referencing them, they don't get compiled, so it's not like your compiled swf is any bigger.

Convert chinese characters to hanyu pinyin

How to convert from chinese characters to hanyu pinyin?
E.g.
你 --> Nǐ
马 --> Mǎ
More Info:
Either accents or numerical forms of hanyu pinyin are acceptable, the numerical form being my preference.
A Java library is preferred, however, a library in another language that can be put in a wrapper is also OK.
I would like anyone who has personally used such a library before to recommend or comment on it, in terms of its quality/ reliabilitty.
The problem of converting hanzi to pinyin is a fairly difficult one. There are many hanzi characters which have multiple pinyin representations, depending on context. Compare 长大 (pinyin: zhang da) to 长城 (pinyin: chang cheng). For this reason, single-character conversion is often actually useless, unless you have a system that outputs multiple possibilities. There is also the issue of word segmentation, which can affect the pinyin representation as well. Though perhaps you already knew this, I thought it was important to say this.
That said, the Adso Package contains both a segmenter and a probabilistic pinyin annotator, based on the excellent Adso library. It takes a while to get used to though, and may be much larger than you are looking for (I have found in the past that it was a bit too bulky for my needs). Additionally, there doesn't appear to be a public API anywhere, and its C++ ...
For a recent project, because I was working with place names, I simply used the Google Translate API (specifically, the unofficial java port, which, for common nouns at least, usually does a good job of translating to pinyin. The problem is commonly-used alternative transliteration systems, such as "HongKong" for what should be "XiangGang". Given all of this, Google Translate is pretty limited, but it offers a start. I hadn't heard of pinyin4j before, but after playing with it just now, I have found that it is less than optimal--while it outputs a list of potential candidate pinyin romanizations it makes no attempt to statistically determine their likelihood. There is a method to return a single representation, but it will soon be phased out, as it currently only returns the first romanization, not the most likely. Where the program seems to do well is with conversion between romanizations and general configurability.
In short then, the answer may be either any one of these, depending on what you need. Idiosyncratic proper nouns? Google Translate. In need of statistics? Adso. Willing to accept candidate lists without context information? Pinyin4j.
In Python try
from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('C')
cjk.getReadingForCharacter(u'北', 'Pinyin')
You would get
['běi', 'bèi']
Disclaimer: I'm the author of that library.
For Java, I'd try the pinyin4j library
As mentioned in other answers the conversion is fuzzy and even google translate apparently gets a certain percentage of character combinations wrong.
A reasonable result which will not be 100% accurate can be achieved with open-source libraries available for some programming languages.
The simplest code to do the conversion with python with the pypinyin library (to install it use pip3 install pypinyin):
from pypinyin import pinyin
def to_pinyin(chin):
return ' '.join([seg[0] for seg in pinyin(chin)])
print(to_pinyin('好久不见'))
# OUTPUT: hǎo jiǔ bú jiàn
NOTE: The pinyin method from the module returns a list of possible candidate segments, and the to_pinyin method takes the first variant whenever more than one conversion is available. For tricky corner cases this is likely to produce incorrect results, but generally you'll probably get at least a ~90..95% success rate.
There are a few other python libraries for pinyin conversion but in my tests they proved to have a higher error rate than pypinyin. Also, they don't appear to be actively maintained.
If you need better accuracy then you'll need a more complex approach that will rely on bigger datasets and possibly some machine learning.

Parsing language for both binary and character files

The problem:
You have some data and your program needs specified input. For example strings which are numbers. You are searching for a way to transform the original data in a format you need.
And the problem is: The source can be anything. It can be XML, property lists, binary which
contains the needed data deeply embedded in binary junk. And your output format may vary
also: It can be number strings, float, doubles....
You don't want to program. You want routines which gives you commands capable to transform the data in a form you wish. Surely it contains regular expressions, but it is very good designed and it offers capabilities which are sometimes much more easier and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If it can't, they are doomed or use programs like business
intelligence. That is NOT the problem.
I am talking of a tool for a developer who knows what is he doing, but who is also dissatisfied to write every time routines in a regular language. A professional data manipulation tool, something like a hex editor, regex, vi, grep, parser melted together
accessible by routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allows to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read xml strings embedded in a binary file with variable length records. Your
tool reads the record length and deletes the junk surrounding your text. Now it splits open
the xml and extracts the strings. Being Indian number glyphs and containing decimal commas instead of decimal points, your tool transforms it into ASCII and replaces commas with points. Now the results must be stored into matrices of variable length....etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like or even, if it does not fulfill the conditions, wouldn't you want to miss ?
EDIT: The question is if a solution for the problem exists and if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort is available. You DO NOT invent your own text parsing
method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions or at least papers describing the problem and giving suggestions. And there are people who may have worked and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and I should work out and implement it myself without background
knowledge seems for me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve in the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who will give it a try to implement a good parsing language, the smallest set of operators to directly transform any input data to any output data if (!) they were powerful enough seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups
and supergroups (datastructures, lists, tables etc.)
Transform
Substituition: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap,merge etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and if it would generally meet their needs, without investing much time and learn the system in earnest.
Also, I realize that the scope and typical uses of of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonnable size])
merge sources
Filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of functionality of SnapLogic (for the described purpose of the OP) is with regards to parsing. Standard components will only read generic file formats (XML, RSS, CSV, Fixed Len, DBMSes...) therefore structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such are unlikely to ever be a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways, with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parse the file at "record" level (or line level or block level whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the gap in feature concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regards to creating various processing pipelines.
I also edged my suggestion, with an expression like "inspire onself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per-se, nor a critique of business-oriented use of open-source, but rather a warning that business/commercial pressures may shape the offering in various direction)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python which is a very swell language for coding various connectors, convertion logic etc.
In reply to Paul Nathan you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfect we wrote it. So my opinion is that writing throwaway code is pretty much ok, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your solution: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and storing it in some specific structure) or b) use some general purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might as well be wrong, so please if you find a perfect tool, drop here a link (I myself am doing lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Beside Perl which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.

Studying standard library sources

How does one study open-source libraries code, particularly standard libraries?
The code base is often vast and hard to navigate. How to find some function or class definition?
Do I search through downloaded source files?
Do I need cvs/svn for that?
Maybe web-search?
Should I just know the structure of the standard library?
Is there any reference on it?
Or do some IDEs have such features? Or some other tools?
How to do it effectively without one?
What are the best practices of doing this in any open-source libraries?
Is there any convention of how are sources manipulated on Linux/Unix systems?
What are the differences for specific programming languages?
Broad presentation of the subject is highly encouraged.
I mark this 'community wiki' so everyone can rephrase and expand my awkward formulations!
Update: Probably didn't express the problem clear enough. What I want to, is to view just the source code of some specific library class or function. And the problem is mostly about work organization and usability - how do I navigate in the huge pile of sources to find the thing, maybe there are specific tools or approaches? It feels like there should've long existed some solution(s) for that.
One thing to note is that standard libraries are sometimes (often?) optimized more than is good for most production code.
Because they are widely used, they have to perform well over a wide variety of conditions, and may be full of clever tricks and special logic for corner cases.
Maybe they are not the best thing to study as a beginner.
Just a thought.
Well, I think that it's insane to just site down and read a library's code. My approach is to search whenever I come across the need to implement something by myself and then study the way that it's implemented in those libraries.
And there's also allot of projects/libraries with excellent documentation, which I find more important to read than the code. In Unix based systems you often find valuable information in the man pages.
Wow, that's a big question.
The short answer: it depends.
The long answer:
Some libraries provide documentation while others don't. Standard libraries are usually pretty well documented, whether your chosen implementation of the library includes documentation or not. For instance you may have found an implementation of the c standard library without documentation but the c standard has been around long enough that there are hundreds of good reference books available. Documentation with hyperlinks is a very useful way to learn a new API. In any case the first place I would look is the library's main website
For less well known libraries lacking documentation I find two different approaches very helpful.
First is a doc generator. Nearly every language I know of has one. It basically parses an source tree and creates documentation (usually as html or xml) which can be used to learn a library. Some use specially formatted comments in the code to create more complete documentation. JavaDoc is one good example of this. Doc generators for many other languages borrow from JavaDoc.
Second an IDE with a class browser. These act as a sort of on the fly documentation. Some display just the library's interface. Other's include description comments from the library's source.
Both of these will require access to the libraries source (which will come in handy if you intend actually use a library).
Many of these tools and techniques work equally well for closed/proprietary libraries.
The standard Java libraries' source code is available. For a beginning Java programmer these can be a great read. Especially the Collections framework is a good place to start. Take for instance the implementation of ArrayList and learn how you can implement a resizeable array in Java. Most of the source has even useful comments.
The best parts to read are probably whose purpose you can understand immediately. Start with the easy pieces and try to follow all the steps that are hidden behind that single call you make from your own code.
Something I do from time to time :
apt-get source foo
Then new C++ project (or whatever) in Eclipse and import.
=> Wow ! Browsable ! (use F3)