How to use Stanford NER in Apache Spark - nltk

We are thinking of using Stanford NER for entity extraction in our domain, so we need to retrain the classifier. However, we have an Apache Spark environment. Can anybody suggest how to use Stanford NER on Spark? I am using Python 2.7 + NLTK.
Any response would be greatly appreciated.

Databricks (founded by the creators of Spark) has written some code for running Stanford CoreNLP annotations on Spark.
The GitHub project is here: https://github.com/databricks/spark-corenlp
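That package is Scala-oriented, though. Since the question mentions Python 2.7 + NLTK, another route is to drive NLTK's StanfordNERTagger wrapper from PySpark directly. Below is a minimal sketch, not a definitive recipe: it assumes the Stanford NER jar, your retrained classifier, and NLTK's 'punkt' data are installed on every worker node, and all paths are placeholders.

    # Tag sentences in parallel with NLTK's Stanford NER wrapper (Python 2.7).
    from pyspark import SparkContext
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    # Placeholder paths: the jar and the retrained classifier must exist on
    # every worker node; word_tokenize also needs NLTK's 'punkt' data there.
    STANFORD_JAR = "/opt/stanford-ner/stanford-ner.jar"
    CLASSIFIER = "/opt/stanford-ner/my-domain-classifier.ser.gz"

    def tag_partition(sentences):
        # Create one tagger per partition: StanfordNERTagger shells out to
        # the JVM, so per-record construction would be prohibitively slow.
        tagger = StanfordNERTagger(CLASSIFIER, STANFORD_JAR)
        for sentence in sentences:
            yield tagger.tag(word_tokenize(sentence))

    sc = SparkContext(appName="stanford-ner")
    sentences = sc.textFile("hdfs:///data/sentences.txt")  # one sentence per line
    tagged = sentences.mapPartitions(tag_partition)
    tagged.saveAsTextFile("hdfs:///data/tagged")

Using mapPartitions rather than map keeps the JVM startup cost to one tagger per partition instead of one per sentence.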

Related

Autodesk Forge Data Management DotNet Package

I wonder if there is a Forge Data Management library for .NET like the Design Automation package.
https://github.com/Autodesk-Forge/forge-api-dotnet-design.automation
That way I could use Data Management to store the files used in Design Automation in an easy way.
I found some samples, but nothing compact like a DataManagementClient class.
I created my own package with a simple implementation for OSS (the Object Storage Service).
https://www.nuget.org/packages/ricaun.Autodesk.Forge.Oss
https://github.com/ricaun-io/forge-api-dotnet-oss
We have a .NET SDK that handles the Data Management API (among other services).
You can refer to NuGet and GitHub.
In this SDK we have classes for each specific Data Management entity, such as:
Hub
Project
Folder
Item
Version
The most recent version (1.9.7) also covers a simple way to handle binary transfers, as described in this blog post.
We are also looking for early adopters of the alpha version of a new SDK for the APS OSS service (refer here).

Json library for Cloud Dataproc

I need to find a json library for Google Cloud Dataproc.
I'm not sure where I can find a list of supported JSON libraries.
Or, if I write my own, which dependencies can be brought into Dataproc?
Any data on this topic will be highly appreciated.
Best Regards,
Oleg
If you are talking about reading/parsing JSON objects, then you can use the Gson library, which is part of the Hadoop distribution on Dataproc.
You can also use a JSON library of your choice and any other dependencies, but then you should create an uber jar for your job and include all of those libraries/dependencies in it.
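As an aside, if the job happens to be PySpark rather than Java, no extra library is required at all: Python's standard json module works out of the box. A minimal sketch, where the bucket path and field name are hypothetical:

    # PySpark job on Dataproc: parse newline-delimited JSON with the stdlib
    # json module; the preinstalled GCS connector handles gs:// paths.
    import json
    from pyspark import SparkContext

    sc = SparkContext()
    lines = sc.textFile("gs://my-bucket/events/*.json")  # hypothetical path

    # Parse each line into a dict and pull out one field.
    records = lines.map(json.loads)
    user_ids = records.map(lambda r: r.get("user_id"))  # hypothetical field

    print(user_ids.take(10))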
If you are talking about the Google JSON API Client libraries, then Dataproc by default deploys version 1.20.0 as part of the GCS and BigQuery connectors. You can still use a newer JSON API Client library version if you relocate it inside your job's uber jar to avoid conflicts with the version deployed on Dataproc.
See a more detailed answer on managing conflicting dependencies in Dataproc here.

How to convert a Caffe-trained model and parameters to be used directly for inference in Caffe2?

I have a Caffe model trained on a desktop CPU. I want to port it to a mobile platform to do inference using Caffe2. Any insights into how I should go about it? Do the scripts provided by Caffe2 allow for conversion of the model and reuse of the weights? Any help would be appreciated! Thank you!!
You can follow the steps mentioned in the below link:
https://caffe2.ai/docs/caffe-migration.html#caffe-to-caffe2
It's pretty clear. Make sure your protobuf version is up to date before running the caffe_translator script.
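For reference, here is a minimal inference sketch in Python. It assumes the translator has already been run as documented on that page and has produced init_net.pb and predict_net.pb; the file names and the input shape below are placeholders.

    # Run inference in Caffe2 on a model translated from Caffe.
    # The nets come from the documented translator invocation:
    #   python -m caffe2.python.caffe_translator deploy.prototxt pretrained.caffemodel
    import numpy as np
    from caffe2.python import workspace

    with open("init_net.pb", "rb") as f:
        init_net = f.read()
    with open("predict_net.pb", "rb") as f:
        predict_net = f.read()

    # Predictor wires the weights (init_net) into the graph (predict_net).
    p = workspace.Predictor(init_net, predict_net)

    # Dummy NCHW input; replace with your preprocessed image batch.
    img = np.random.rand(1, 3, 227, 227).astype(np.float32)
    results = p.run([img])
    print(results[0].shape)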
Good Luck !!

JSONiq: Java implementation as library?

I am looking at implementations of the JSONiq specification (www.jsoniq.org).
Most of them are standalone deployments (e.g., Zorba, VXQuery) designed to query JSON-based databases or process large JSON documents.
I am surprised to find that all implementations try to solve this problem without modularizing the JSONiq execution engine as a library. It should have been much like Apache Lucene (the library) to Apache Solr (search server + REST API) and other indexing solutions.
Is there a Java library available (similar to Saxon for XQuery) that can be embedded into Java apps and can execute JSONiq specs defined as functions in .xq or .xquery files?
Or how can Saxon be extended to parse and execute the JSONiq specification?
JSONiq is an XQuery-like language for processing JSON. Most of its good ideas were incorporated into XQuery 3.1, but in a way that integrated the XML and JSON data models. I don't believe JSONiq offers any functionality that's not in XQuery 3.1, and it's not an open standard, so there would be little point in implementing it in Saxon.
There are currently two released JSONiq implementations in Java, and they can both read data from HDFS or the local filesystem and process large amounts of data in parallel on multiple cores/machines:
Rumble (Spark) -- supports the JSONiq core language (JSON-friendly syntax), can also read JSON-like formats (Parquet, Avro, CSV, ROOT, ...) from any file system supported by Spark (S3, HDFS, local file system, ...). Rumble also exposes its functionality via a Java API and is available as a Maven dependency.
VXQuery (Hyracks) -- supports the JSONiq extension to XQuery

Has anyone any experience implementing the R package XGBoost within the Azure ML Studio environment?

I was hoping that someone had tried it or had success implementing it, and could share any pitfalls in using it.
You need to zip the package's Windows binaries, load the zip as a dataset, and import it into the R environment.
You can follow the instructions over here. I couldn't import it for the latest version, so I simply downloaded the xgboost version from this experiment and loaded it into my saved datasets.
This works for any generic package that is not preloaded in the environment.
The following is a list of experiments to publish R models as a web service
Hope this helps!
Edit: You can also simply change the R version to Microsoft R Open (current version 3.2.2), and then you can import xgboost like any common library.
Here you can find an example. It shows, for instance, that you need to import external libraries individually for both training and scoring.