Running Fairseq in memory and pre-loading language models - fairseq

I'm running Fairseq from the command line. Fairseq loads the language models on the fly and does the translation. It works fine, but it takes time to load the models and do the translation. I'm thinking that if we run Fairseq as an in-memory service with all language models pre-loaded, it will be quick to call the service and get translations.
My questions are,
Will it be more efficient to run Fairseq as an in-memory service with the language models pre-loaded?
How much of an efficiency increase can we expect?
How easy will it be to implement such an in-memory Fairseq service?
Thank you very much for helping out.

There is an issue about preloading models:
https://github.com/pytorch/fairseq/issues/1694
For a custom model, the code below shows how to preload a fairseq model in memory. It is taken from an official example: https://github.com/pytorch/fairseq/tree/master/examples/translation#example-usage-torchhub
from fairseq.models.transformer import TransformerModel
zh2en = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt17_zh_en_full',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt17_zh_en_full/zh.code'
)
zh2en.translate('你好 世界')
# 'Hello World'
You can go through the source code to find more details about the method from_pretrained: https://github.com/pytorch/fairseq/blob/579a48f4be3876082ea646880061a98c94357af1/fairseq/models/fairseq_model.py#L237
Once preloaded, the model can be used repeatedly without going back through the command line.
If you want to use a GPU, remember to execute model.to('cuda') (zh2en.to('cuda') in the example above).
It is certainly more efficient if you preload: a reasonably large model takes seconds to load into memory, and with preloading you pay that cost only once instead of on every translation.
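If you then want to expose the preloaded model as an in-memory service, one option is to wrap it in a small HTTP server. Below is a minimal sketch using Flask; the /translate endpoint, the JSON field names and the checkpoint paths are my own placeholders for illustration, not something fairseq itself provides.

from flask import Flask, request, jsonify
from fairseq.models.transformer import TransformerModel

app = Flask(__name__)

# Load the model once at startup; every request reuses the in-memory copy.
zh2en = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt17_zh_en_full',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt17_zh_en_full/zh.code'
)
# zh2en.to('cuda')  # uncomment if a GPU is available

@app.route('/translate', methods=['POST'])
def translate():
    text = request.get_json()['text']
    return jsonify({'translation': zh2en.translate(text)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

With this approach the model-loading cost is paid once when the process starts, and each subsequent request only pays for inference.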

Related

Dash client-side callback vs dcc.store

I have a Dash app connected to an AWS RDS database. I have a live-updated graph whose callback is triggered every 5 minutes by an n_intervals counter to query the database and do some expensive formatting. I store the transformed data (~500 data points) in a dcc.Store, from which another 6 graphs and a DataTable use this data (no further processing required). My question is: to further improve the efficiency of the dashboard, should I use client-side callbacks instead of dcc.Store? From what I've read, client-side callbacks only use the client browser and don't need to communicate back to the Dash server on callback. Thank you. (I'm secretly hoping it makes little difference, as I don't want to learn JavaScript.)
The answer depends a lot on your architecture, how complex the graphs are, and how many server-side callbacks are currently used.
You have already mentioned the main advantage of clientside callbacks: since the data is on the client side, they save time and network traffic by updating the components directly in the browser. The major inconvenience is that these are synchronous calculations that block the app's main thread (Dash does not support promises or asynchronous clientside callbacks for now), which can lead to a bad UX if those functions take too much time.
For very simple plots and tables, I can guarantee that there is a significant improvement in using clientside callbacks, especially because it avoids the Plotly Python API, which can be much slower than just defining a JSON object with traces and layout.
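For illustration, here is a minimal sketch of a clientside callback that builds a figure in the browser from data held in a dcc.Store (assuming Dash 2.x; the component ids and the shape of the stored data are made up for the example):

from dash import Dash, dcc, html, Input, Output

app = Dash(__name__)
app.layout = html.Div([
    dcc.Store(id='shared-data'),   # filled elsewhere by the expensive server-side callback
    dcc.Graph(id='quick-graph'),
])

# The callback body is plain JavaScript executed in the browser; no round trip to the server.
app.clientside_callback(
    """
    function(data) {
        if (!data) { return window.dash_clientside.no_update; }
        return {
            data: [{x: data.x, y: data.y, type: 'scatter'}],
            layout: {title: 'Built in the browser'}
        };
    }
    """,
    Output('quick-graph', 'figure'),
    Input('shared-data', 'data'),
)

Because the callback body runs in the browser, the figure is assembled client-side and no request goes back to the Dash server when the stored data changes.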

Informatica PowerCenter pipelines to Azure Data Factory

I am trying to move my Informatica pipelines in PowerCenter 10.1 to Azure Data Factory/Synapse pipelines. Other than rewriting them from scratch, is there a way to migrate them? I have not found any tools to achieve this either. Has anyone faced this problem? Any leads on how to proceed?
Thanks
There are no out-of-the-box solutions available to complete this migration. Unfortunately, you will have to author the pipelines again.
Informatica PowerCenter pipelines are a physical implementation of an Extract Transform Load (ETL) process. Each provider has different approaches to the implementations and they do not necessarily map well from one to another. Core Azure Data Factory (ADF) is actually more suited to Extract, Load and Transform (ELT), unless of course you use Data Flows.
So what you have to do is:
map out physically what your current pipeline is doing, if you don't have that documentation already. A simple spreadsheet template mapping out the components of the existing pipeline, tracking source, target plus any transformations will suffice
logically map out what the pipeline is doing; i.e. without using PowerCenter-specific terminology, lay out what the "as is" pipeline is doing. A data flow diagram is a great way to do this
logically map out what the "to be" pipeline should do; i.e. without using any ADF-specific terminology, attempt to refine the "as is" pipeline to its simplest form
using expert knowledge of the ADF components (e.g. Copy, Lookup, Notebook and Stored Procedure activities, to name but a few), map from the logical "to be" to the physical (in the loosest sense of the word, it's all cloud now, right? :) - e.g. move data from place to place with the Copy activity, transform data in a SQL database using the Stored Procedure activity, implement a repeated activity with a ForEach loop (bear in mind these execute in parallel), and do sophisticated transformations or processing with Databricks notebooks if required. If you require a low-code approach, consider Data Flows.
So you can see it's just a few simple steps. Good luck!

Weka: Limitations on what one can output as source?

I was consulting several references to discover how I may output trained Weka models into Java source code so that I may use the classifiers I am training in actual code for research applications I have been developing.
While playing with Weka 3.7, I noticed that it does output Java code to its main text buffer when I use simpler (supervised, in my case) classification methods such as the J48 decision tree, but it removes the option (rather, it disables it by greying out the checkbox and its text) to output Java code for RandomTree and RandomForest (which are the ones that give me the best performance in my situation).
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or, if it does and just doesn't put it in the output buffer (since a RandomForest is multiple decision trees, I imagine Weka doesn't want to waste buffer space on it), how does one go digging up where in the file system Weka outputs the Java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is Serialization of the output *.model files my only hope when it comes to RF and RandomTree?
Thanks in advance to those who provide help.
NOTE: (As an addendum to the answer provided below) If you run across a similar situation (requiring you to use your trained classifier/ML model in your code), I recommend following the links posted in the answer that was provided in response to my question. If you do not specifically need the Java code for the RandomForest, as an example, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model/hardened algorithm meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure of the reason why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately, I think the easiest route will be Serialization, although you could try implementing "Sourceable" for other classifiers on your own.
Another, perhaps less convenient, solution would be to use Weka to build the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starter's guide to building classifiers in your own Java code: http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
I solved the problem for myself by turning the output of WEKA's -printTrees option for the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too many resources
The final code consists of only three classes: the class with the generated model plus two classes that make the classification work.

Classifying an input from Java Code while Loading model only once

I'm developing an app which uses RapidMiner for classification. I train the classifier from time to time (e.g. daily), but I use it at a very high rate (250 classifications per second).
For this purpose, I created two Processes using RM GUI. First one trains the classifier and saves it into a model file while the second one uses it for classification.
In the second process I load the model file which the first process creates. This made it very slow, since it seems that the process loads the model file every time I want to classify an input.
You can see the second process in the following picture:
[screenshot of the RapidMiner process, originally hosted at shiaupload.ir]
What's the more smart way of doing this?
P.S. I think a solution would be to create another process which loads the created classifier only once and then gives it to the ApplyModel subprocess as another input. But I didn't find a way to do so in Java code.
Already discussed and solved here.

What is instrumentation?

I've heard this term used a lot in the same context as logging, but I can't seem to find a clear definition of what it actually is.
Is it simply a more general class of logging/monitoring tools and activities?
Please provide sample code/scenarios when/how instrumentation should be used.
I write tools that perform instrumentation. So here is what I think it is.
DLL rewriting. This is what tools like Purify and Quantify do. A previous reply to this question said that they instrument post-compile/link. That is not correct. Purify and Quantify instrument the DLL the first time it is executed after a compile/link cycle, then cache the result so that it can be used more quickly next time around. For large applications, profiling the DLLs can be very time consuming. It is also problematic - at a company I worked at between 1998-2000 we had a large 2 million line app that would take 4 hours to instrument, and 2 of the DLLs would randomly crash during instrumentation and if either failed you would have do delete both of them, then start over.
In-place instrumentation. This is similar to DLL rewriting, except that the DLL is not modified and the image on disk remains untouched. The DLL functions are hooked appropriately to the task required when the DLL is first loaded (either during startup or after a call to LoadLibrary(Ex)). You can see techniques similar to this in the Microsoft Detours library.
On-the-fly instrumentation. Similar to in-place but only actually instruments a method the first time the method is executed. This is more complex than in-place and delays the instrumentation penalty until the first time the method is encountered. Depending on what you are doing, that could be a good thing or a bad thing.
Intermediate language instrumentation. This is what is often done with Java and .Net languages (C#, VB.Net, F#, etc.). The language is compiled to an intermediate language which is then executed by a virtual machine. The virtual machine provides an interface (JVMTI for Java, ICorProfiler(2) for .Net) which allows you to monitor what the virtual machine is doing. Some of these options allow you to modify the intermediate language just before it gets compiled to executable instructions.
Intermediate language instrumentation via reflection. Java and .Net both provide reflection APIs that allow the discovery of metadata about methods. Using this data you can create new methods on the fly and instrument existing methods just as with the previously mentioned Intermediate language instrumentation.
Compile time instrumentation. This technique inserts appropriate instructions into the application during compilation. It is not often used; a profiling option in Visual Studio provides this feature. It requires a full rebuild and link.
Source code instrumentation. This technique modifies the source code to insert appropriate code (usually conditionally compiled so you can turn it off). A minimal source-level example is sketched at the end of this answer.
Link time instrumentation. This technique is only really useful for replacing the default memory allocators with tracing allocators. An early example of this was the Sentinel memory leak detector on Solaris/HP in the early 1990s.
The various in-place and on-the-fly instrumentation methods are fraught with danger, as it is very hard to stop all threads safely and modify the code without running the risk of making an API call that wants to acquire a lock held by a thread you've just paused - you don't want to do that, you'll get a deadlock. You also have to check whether any of the other threads are executing that method, because if they are you can't modify it.
The virtual machine based instrumentation methods are much easier to use as the virtual machine guarantees that you can safely modify the code at that point.
(EDIT - this item added later) IAT hooking instrumentation. This involves modifying the import address table for functions linked against in other DLLs/shared libraries. This type of instrumentation is probably the simplest to get working: you do not need to know how to disassemble and modify existing binaries, or do the same with virtual machine opcodes. You just patch the import table with your own function's address and call the real function from your hook. It is used in many commercial and open source tools.
I think I've covered them all, hope that helps.
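To make the source code instrumentation item above concrete, here is a minimal Python sketch of my own (not taken from any of the tools mentioned): a decorator wraps a function with timing and call-counting code, and an environment variable plays the role of the conditional-compilation switch that turns it off.

import functools
import os
import time

INSTRUMENT = os.environ.get('INSTRUMENT', '0') == '1'  # the "turn it off" switch

def instrumented(func):
    """Wrap func with timing/call-count instrumentation when enabled."""
    if not INSTRUMENT:
        return func  # zero overhead when instrumentation is disabled

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            wrapper.calls += 1
            print(f"{func.__name__}: call #{wrapper.calls}, {elapsed * 1e3:.3f} ms")

    wrapper.calls = 0
    return wrapper

@instrumented
def busy_work(n):
    return sum(i * i for i in range(n))

busy_work(100_000)  # prints a timing line only when INSTRUMENT=1 is set

Run it with INSTRUMENT=1 to see the measurements; without it, the decorator hands back the original function untouched.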
Instrumentation is usually used in dynamic code analysis.
It differs from logging in that instrumentation is usually inserted automatically by software, while logging needs human intelligence to insert the logging code.
It's a general term for doing something to your code that is necessary for some further analysis.
Especially for languages like C or C++, there are tools like Purify or Quantify that profile memory usage, performance statistics, and the like. To make those profiling programs work correctly, an "instrumenting" step is necessary to insert the counters, array-boundary checks, etc. that are used by the profiling programs. Note that in the Purify/Quantify scenario, the instrumentation is done automatically as a post-compilation step (actually, it's an added step to the linking process) and you don't touch your source code.
Some of that is less necessary with dynamic or VM code (e.g. profiling tools like OptimizeIt are available for Java and do a lot of what Quantify does without any special linking), but that doesn't negate the concept.
An excerpt from the Wikipedia article:
In the context of computer programming, instrumentation refers to an ability to monitor or measure the level of a product's performance, to diagnose errors and to write trace information. Programmers implement instrumentation in the form of code instructions that monitor specific components in a system (for example, instructions may output logging information to appear on screen). When an application contains instrumentation code, it can be managed using a management tool. Instrumentation is necessary to review the performance of the application. Instrumentation approaches can be of two types: source instrumentation and binary instrumentation.
Whatever Wikipedia says, there is no standard, widely agreed definition of code instrumentation in the IT industry.
Please consider that instrumentation is a noun derived from instrument, which has a very broad meaning.
"Code" is also everything in IT - I mean data, services, everything.
Hence, code instrumentation covers a set of applications so wide that it is hardly worth giving it a separate name ;-).
That's probably why this Wikipedia article is only a stub.