Speed up Stanford Tagger in KNIME

I am using the Stanford Tagger in KNIME and the performance is quite slow.
The set-up is:
It is a German hgc tagger, and the memory policy is: Keep only small tables in memory.
How can I speed this up?

As this node is streamable, use the streaming option/extension available in KNIME. The String to Document and Tag Filter nodes are also streamable. Here is a blog post explaining streaming in KNIME in detail. It is older but still valid.

Running Fairseq in memory and pre-load language models

I'm running Fairseq from the command line. Fairseq loads language models on the fly and does the translation. It works fine, but it takes time to load the models and do the translation. I'm thinking that if we run Fairseq as an in-memory service and pre-load all language models, it will be quick to run the service and do the translations.
My questions are:
Will it be more efficient if we run Fairseq as an in-memory service and pre-load the language models?
How much of an efficiency increase can we expect?
How easy will it be to implement such an in-memory Fairseq service?
Thank you very much for helping out.
There is an issue about preloading models:
https://github.com/pytorch/fairseq/issues/1694
For a custom model, the code below shows how to preload a fairseq model in memory. It is an official example and can be found at: https://github.com/pytorch/fairseq/tree/master/examples/translation#example-usage-torchhub
from fairseq.models.transformer import TransformerModel
zh2en = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt17_zh_en_full',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt17_zh_en_full/zh.code'
)
zh2en.translate('你好 世界')
# 'Hello World'
You can go through the source code to find more details about the method from_pretrained: https://github.com/pytorch/fairseq/blob/579a48f4be3876082ea646880061a98c94357af1/fairseq/models/fairseq_model.py#L237
Once the model is preloaded, you can use it repeatedly without going through the command line.
If you want to use a GPU, remember to execute: model.to('cuda').
Preloading can certainly be more efficient. A large model takes seconds to load into memory, so loading it once avoids paying that cost on every run.
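To illustrate the "load once, reuse many times" idea without depending on fairseq itself, here is a minimal sketch. FakeModel and slow_load are hypothetical stand-ins for an expensive loader such as the from_pretrained call above; they are not part of fairseq. The only real library call is functools.lru_cache, which keeps each loaded model in memory for later calls.

```python
import time
from functools import lru_cache

# Records every real load so the caching effect can be observed.
LOAD_CALLS = []


class FakeModel:
    """Hypothetical stand-in for a loaded translation model."""

    def __init__(self, name):
        self.name = name

    def translate(self, text):
        return f"[{self.name}] {text}"


def slow_load(name):
    """Hypothetical expensive loader (think: reading a checkpoint from disk)."""
    LOAD_CALLS.append(name)
    time.sleep(0.01)  # simulated load time
    return FakeModel(name)


@lru_cache(maxsize=None)
def get_model(name):
    # The first call per name pays the load cost; later calls hit the cache.
    return slow_load(name)


def translate(name, text):
    return get_model(name).translate(text)


first = translate("zh2en", "hello")
second = translate("zh2en", "world")
print(first, second, LOAD_CALLS)
```

In a long-running process (for example a small web service), the same pattern means the load cost is paid once at start-up or on first use rather than on every request. How much time this saves depends on the model size and on how many requests are served per load.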

How to convert a large JSON file into XML?

I have a large JSON file; its size is 5.09 GB. I want to convert it to an XML file. I tried online converters, but the file is too large for them. Does anyone know how to do that?
The typical way to process XML as well as JSON files is to load these files completely into memory. You then have a so-called DOM, which allows various kinds of data processing. But neither XML nor JSON is really designed for storing as much data as you have here. In my experience you will typically run into memory problems as soon as you exceed about 200 MByte. This is because a DOM is composed of individual objects, which results in a huge memory overhead that far exceeds the amount of data you want to process.
The only way for you to process files like that is basically to take a streaming approach. The basic idea: instead of parsing the whole file and loading it into memory, you parse and process the file "on the fly". As data is read, it is parsed and events are triggered to which your software can react and perform actions as needed. (Have a look at the SAX API to understand this concept in more detail.)
As you stated, you are processing JSON, not XML. Stream APIs for JSON should be available in the wild as well. In any case, you could implement one fairly easily yourself: JSON is a pretty simple data format.
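As a rough illustration of the streaming idea, here is a minimal sketch using only the Python standard library. It assumes the file is one top-level JSON array whose individual elements are small enough to hold in memory, and that object keys are valid XML element names; none of this is known about the asker's data. The function names, the "record"/"item" tags and the JSON-to-XML mapping are all hypothetical choices, and the reader does not fully validate the JSON.

```python
import io
import json
from xml.sax.saxutils import escape


def iter_array_items(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time."""
    decoder = json.JSONDecoder()
    buf, eof, started = "", False, False
    while True:
        buf = buf.lstrip()
        if buf and not started:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buf, started = buf[1:], True
            continue
        if buf and buf[0] == "]":
            return  # end of the array
        if buf and buf[0] == ",":
            buf = buf[1:]  # separator between elements
            continue
        if buf:
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                item, end = None, -1  # element is probably incomplete
            # Accept a value only once its trailing ',' or ']' is visible,
            # so a number split across two chunks is not cut short.
            rest = buf[end:].lstrip() if end >= 0 else ""
            if rest[:1] in (",", "]"):
                yield item
                buf = rest
                continue
        if eof:
            raise ValueError("truncated or malformed JSON array")
        chunk = fp.read(chunk_size)
        eof = chunk == ""
        buf += chunk


def to_xml(value, tag):
    """Render one decoded JSON value as an XML fragment (illustrative mapping)."""
    if isinstance(value, dict):
        inner = "".join(to_xml(v, k) for k, v in value.items())
    elif isinstance(value, list):
        inner = "".join(to_xml(v, "item") for v in value)
    elif isinstance(value, bool):
        inner = "true" if value else "false"
    elif value is None:
        inner = ""
    else:
        inner = escape(str(value))
    return f"<{tag}>{inner}</{tag}>"


def convert(src, dst, root="root", record="record", chunk_size=65536):
    """Write one XML element per array element, never holding the whole file."""
    dst.write(f"<{root}>\n")
    for item in iter_array_items(src, chunk_size):
        dst.write(to_xml(item, record) + "\n")
    dst.write(f"</{root}>\n")


# Tiny in-memory demo; a very small chunk size exercises the chunk boundaries.
demo_out = io.StringIO()
convert(io.StringIO('[{"a": 1, "b": "x<y"}, [true, null], 2.5]'), demo_out, chunk_size=3)
print(demo_out.getvalue())
```

For a real file, src and dst would be files opened in text mode, for example open("in.json", encoding="utf-8") and open("out.xml", "w", encoding="utf-8"). The same iterator could also be used to split the input into many smaller files instead of writing XML directly.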
Nevertheless, such an approach is not optimal. It typically results in very slow data processing because of the millions of method invocations involved: for every item encountered you typically need to call a method to perform some processing task. This, together with additional checks on what kind of information you have currently encountered in the stream, slows down data processing considerably.
You really should consider using a different approach. First split your file into many small ones, then process them. This might not seem very elegant, but it keeps your task much simpler, and the main advantage is that your software will be much easier to debug. Unfortunately you are not very specific about your problem, so I can only guess, but large files typically imply that the data model is pretty complex. Therefore you will probably be much better off with many small files instead of a single huge one. Later this also lets you dig into individual aspects of your data and of the processing as needed. You will probably fail to get any detailed insight while having a single large file of 5 GByte to process. When errors occur, you will have trouble identifying which part of the huge file is causing the problem.
As I already stated you unfortunately are very unspecific about your problem. Sorry, but because of having no more details about your problem (and your data in particular) I can only give you these general recommendations about data processing. I do not know any details about your data, so I can not give you any recommendation about which approach will work best in your case.

Loading json file into titan graph database

I have been given the task of loading a JSON file into Titan DB with DynamoDB as the back end. Is there a Java tutorial for this, or could someone please post some sample Java code?
Thanks.
Titan is an abstraction layer, so whether you use Cassandra, DynamoDB, HBase, etc., you merely need to find Titan data loading instructions. They are a bit dated, but you might want to start with these blog posts:
http://thinkaurelius.com/2014/05/29/powers-of-ten-part-i/
http://thinkaurelius.com/2014/06/02/powers-of-ten-part-ii/
The code examples work with an older version of Titan (the schema portion) but the concepts still apply.
You will find that the strategy for data loading with Titan has a lot to do with the size of your graph. You said you are loading "a JSON file", so I imagine you have a smaller graph, in the millions of edges. In this case, a simple Groovy script will likely suffice. Write a script to parse your JSON and write the data to Titan.

Mashery IODocs - Latency issue due to heavy json config file

Mashery IODocs is a really great tool for documenting an API.
I'm using it for quite a big project with more than 50 methods and complex structures sent to this API, so my JSON config file is more than 4000 lines long.
I self-host IODocs on a VPS along with other stuff, and the docs are awfully slow because of my long JSON file.
Any ideas for coping with this latency, other than the obvious one of splitting my JSON config file into several files?
I have a fork of IODocs with some performance improvements which may help. In this instance they involve stripping out json-minify (which is only used to allow comments in the source specifications), server-side caching of the specifications, and not having to load the specification via a synchronous AJAX call on the client.

Weka: Limitations on what one can output as source?

I was consulting several references to discover how I may output trained Weka models into Java source code so that I may use the classifiers I am training in actual code for research applications I have been developing.
As I was playing with Weka 3.7, I noticed that while it does output Java code to its main text buffer when I use simpler classification methods (supervised, in my case this time) such as the J48 decision tree, it removes the option to output Java code for RandomTree and RandomForest (or rather, it disables it by greying out the text and making the checkbox unselectable). These are the classifiers that give me the best performance in my situation.
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or, if it does and just doesn't put it in the output buffer (since RF is multiple decision trees, I imagine it doesn't want to waste buffer space), how does one go digging up where in the file system Weka outputs Java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is Serialization of the output *.model files my only hope when it comes to RF and RandomTree?
Thanks in advance to those who provide help.
NOTE: (As an addendum to the answer provided below) If you run across a similar situation (requiring you to use your trained classifier/ML model in your code), I recommend following the links posted in the answer that was provided in response to my question. If you do not specifically need the Java code for the RandomForest, as an example, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model/hardened algorithm meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure of the reasoning, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately, I think the easiest route will be serialization, although you could maybe try implementing "Sourceable" for other classifiers on your own.
Another, perhaps inconvenient, solution would be to use Weka to build the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starter's guide to building classifiers in your own Java code: http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
I solved the problem for myself by turning the output of WEKA's -printTrees option of the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too many resources
The final code will consist of three classes only: the class with the generated model + two classes to make the classification work.