Hierarchical Dirichlet Process - Inferring the Truncation Level

I am using the HDP implementation in Gensim to infer the topics of a dataset, but I have a question about the truncation level.
Is there a way to infer the most appropriate truncation level? I have noticed that the final number of topics depends on the truncation level selected.
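For reference, a minimal sketch of how the truncation level is set in Gensim's HdpModel, plus one common heuristic check: set T generously, then count how many topics actually receive non-negligible weight. The toy corpus and the 0.01 mass cutoff below are my own placeholder assumptions.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy corpus purely for illustration; substitute your own documents.
texts = [["topic", "model", "inference"],
         ["dirichlet", "process", "truncation"],
         ["topic", "truncation", "level"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# T is the top-level truncation (an upper bound on how many topics the
# variational approximation may use); K is the per-document truncation.
hdp = HdpModel(corpus=corpus, id2word=dictionary, T=150, K=15)

# hdp_to_lda() returns the topic weights (alpha) of the closest LDA model;
# topics whose global weight is negligible are effectively unused.
alpha, beta = hdp.hdp_to_lda()
effective = (alpha > 0.01 * alpha.sum()).sum()  # 0.01 is an arbitrary cutoff
print(effective, "effective topics out of", len(alpha))
```

If the effective-topic count sits well below T, the truncation level is probably not the binding constraint; if it presses up against T, raise T and refit.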

Related

Evaluating the performance of variational autoencoder on unlabeled data

I've designed a variational autoencoder (VAE) that clusters sequential time series data.
To evaluate the performance of the VAE on labeled data, I first run KMeans on the raw data and compare the generated labels with the true labels using the Adjusted Mutual Info Score (AMI). Then, after the model is trained, I pass validation data to it, run KMeans on the latent vectors, and compare the generated labels with the true labels of the validation data using AMI. Finally, I compare the two AMI scores to see whether KMeans performs better on the latent vectors than on the raw data.
My question is this: How can we evaluate the performance of VAE when the data is unlabeled?
I know we can run KMeans on the raw data and generate labels for it, but in this case, since we consider the generated labels as true labels, how can we compare the performance of KMeans on the raw data with KMeans on the latent vectors?
Note: The model is totally unsupervised. Labels (if they exist) are not used in the training process; they are used only for evaluation.
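For concreteness, a minimal sketch of the labeled-data evaluation described above, using scikit-learn's KMeans and adjusted_mutual_info_score. The random arrays are stand-ins for the real validation data, the true labels, and the VAE's latent vectors:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
X_val = rng.normal(size=(200, 30))    # stand-in for raw validation data
Z_val = rng.normal(size=(200, 8))     # stand-in for vae.encode(X_val)
y_val = rng.integers(0, 5, size=200)  # stand-in for the true labels

def ami(features, y_true, k=5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return adjusted_mutual_info_score(y_true, labels)

# A higher AMI on the latent vectors suggests the VAE improved clusterability.
print("raw:", ami(X_val, y_val), "latent:", ami(Z_val, y_val))
```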
In unsupervised learning you evaluate the performance of a model either by using labelled data or by visual analysis. In your case you do not have labelled data, so you need to do analysis. One way to do this is by looking at the predictions: if you know how the raw data should be labelled, you can qualitatively evaluate the accuracy. Another method, since you are using KMeans, is to visualize the clusters. If the clusters are spread apart in distinct groups, that is usually a good sign; if they are close together and overlapping, the labelling of vectors in those areas may be less accurate. Alternatively, you can use an internal clustering metric that requires no labels, such as the silhouette score, or come up with your own.
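As one example of such a label-free metric, a sketch using the silhouette score, again with random stand-ins for the real raw and latent features. Note that silhouette values computed in different feature spaces are only a rough signal, not strictly comparable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 30))    # stand-in for raw data
Z_latent = rng.normal(size=(200, 8))  # stand-in for VAE latent vectors

def cluster_quality(features, k=5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    return silhouette_score(features, labels)

print("raw:", cluster_quality(X_raw), "latent:", cluster_quality(Z_latent))
```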

Azure DocumentDB Data Modeling, Performance & Price

I'm fairly new to NoSQL type databases, including Azure's DocumentDB. I've read through the documentation and understand the basics.
The documentation left me with some questions about data modeling, particularly in how it relates to pricing.
Microsoft charges fees on a "per collection" basis, with a collection being a list of JSON objects with no particular schema, if I understand it correctly.
Now, since there is no requirement for a uniform schema, is the expectation that your "collection" is analogous to a "database", in that the collection itself might contain different types of objects? Or is the expectation that each "collection" is analogous to a "table", in that it contains only objects of a similar type (allowing for some variance in the object properties, perhaps)?
Does query performance dictate one way or another here?
Thanks for any insight!
The normal pattern under DocumentDB is to store lots of different types of objects in the same "collection". You distinguish them either with a field like type = "MyType" or with a flag like isMyType = true. The latter allows for subclassing and mixin behavior.
As for performance, DocumentDB gives you guaranteed 10ms read/15ms write latency for your chosen throughput. For your production system, put everything in one big "partitioned collection" and slide the size and throughput levers over time as your space needs and load demands. You'll get essentially infinite scalability and DocumentDB will take care of allocating (and deallocating) resources (secondaries, partitions, etc.) as you increase (or decrease) your throughput and size levers.
A collection is analogous to a database, more so than a relational table. Normally, you would store a type property within documents to distinguish between types, and add the AND type='MyType' filter to each of your queries if restricting to a certain type.
Query performance will not be significantly different if you store different types of documents within the same collection vs. different collections, because you're just adding another filter against an indexed property (type). You might, however, benefit from pooling throughput into a single collection rather than spreading small amounts of throughput across one collection per type.
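To make the type-discriminator pattern concrete, here is a minimal sketch. DocumentDB has since been folded into Azure Cosmos DB, so this uses the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders:

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key; substitute your account's values.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("docs")

# One container, many document types: `type` is just another indexed property,
# so restricting to one type is an ordinary filter.
invoices = container.query_items(
    query="SELECT * FROM c WHERE c.type = @type",
    parameters=[{"name": "@type", "value": "Invoice"}],
    enable_cross_partition_query=True,
)
for doc in invoices:
    print(doc["id"])
```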

Generative model and inference

I was looking at the hLDA model here:
https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
I have questions about how the generative model works. What is the output of the generative model, and how is it used in the inference (Gibbs sampling) stage? I am getting the generative model and the inference part mixed up and am not able to distinguish between them.
I am new to this area, so any reference articles or papers that clarify the concept would be much appreciated.
To get a handle on how this type of Bayesian model works, I'd recommend David Blei's original 2003 LDA paper (google scholar "Latent Dirichlet Allocation" and it'll appear near the top). They used variational inference (as opposed to Gibbs sampling) to estimate the "posterior" (which you could call the "best fit solution"), but the principles behind using a generative model are well explained.
In a nutshell, Bayesian topic models work like this: you presume that your data is created by some "generative model". This model describes a probabilistic process for generating data, and has a few unspecified "latent" variables. In a topic model, these variables are the "topics" that you're trying to find. The idea is to find the most probable values for the "topics" given the data at hand.
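As a concrete illustration, here is a sketch of plain LDA's generative process in NumPy (hLDA replaces the flat topic set with a tree drawn from the nested Chinese restaurant process, which is omitted here; all sizes and hyperparameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 1000, 10, 5, 50  # vocabulary size, topics, documents, words/doc
alpha, eta = 0.1, 0.01        # Dirichlet hyperparameters

# Latent variables: a word distribution per topic. These are the "topics"
# that inference (Gibbs sampling or variational) tries to recover.
beta = rng.dirichlet(np.full(V, eta), size=K)            # shape (K, V)

docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))             # per-doc topic mixture
    z = rng.choice(K, size=N, p=theta)                   # topic of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed word ids
    docs.append(w)
# Only `docs` is observed; inference runs this process "in reverse" to
# estimate the most probable beta and theta given the observed words.
```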
In Bayesian inference these most probable values for latent variables are known as the "posterior". Strictly speaking, the posterior is actually a probability distribution over possible values for the latent variables, but a common approach is to use the most probable set of values, called "maximum a posteriori" or MAP estimation.
Note that for topic models, what you get is an estimate for the true MAP values. Many of the latent values, perhaps especially those close to zero, are essentially noise, and cannot be taken seriously (except for being close to zero). It's the larger values that are more meaningful.

Is JSON type in PostgreSQL part of transactions?

I just want to know whether the JSON type is also covered by transactions. For example, if I have started a transaction that inserts data into both JSON-typed columns and other columns, and something goes wrong, will it roll back the JSON data as well?
Everything is transactional and crash-safe in PostgreSQL unless explicitly documented not to be.
PostgreSQL's transactions operate on tuples, not individual fields. The data type is irrelevant. It isn't really possible to implement a data type that is not transactional in PostgreSQL. (The SERIAL "data type" is just a wrapper for the integer type with a DEFAULT, and is a bit of a special case).
Only a few things have special behaviour regarding transactions - sequences, advisory locks, etc - and they're pretty clearly documented where that's the case.
Note that this imposes some limitations you may not immediately expect. Most importantly, because PostgreSQL relies on MVCC for concurrency control it must copy a value when that value is modified (or, sometimes, when other values in the same tuple are modified). It cannot change fields in-place. So if you have a 5MB json document in a field and you change a single integer value, the whole json document must be copied and written out with the changed value. PostgreSQL will then come along later and mark the old copy as free space that can be re-used.
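To see the atomicity in practice, a small sketch using psycopg2; the connection string is a placeholder, and `with conn:` commits on a clean exit or rolls back on an exception:

```python
import psycopg2

# Placeholder connection string; adjust for your environment.
conn = psycopg2.connect("dbname=test user=postgres")

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS demo "
                "(id serial PRIMARY KEY, payload json, note text)")
conn.commit()

try:
    with conn:  # transaction scope: commit on success, rollback on error
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO demo (payload, note) VALUES (%s, %s)",
                ('{"a": 1}', "first"),
            )
            raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM demo")
    print(cur.fetchone()[0])  # 0 - the json insert was rolled back with the rest
```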

How to import and use feature vectors in MALLET's topic modelling?

I am using MALLET's topic modelling.
I have a set of keywords, each with a weight, for a set of documents. I want to train a model on these and then use it to infer topics for new documents.
Note: each keyword of a document has a weight assigned to it, similar to a tf-idf score.
Based on what I can infer from the documentation, MALLET's topic modelling supports only sequence data and not vector data.
I want to use the weight assigned to each keyword of the document in the analysis. If I don't, every keyword is treated equally and, as a result, I lose important information during the analysis.
Any suggestions on how I can use MALLET's topic modelling for my data?
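One possible workaround (a sketch of a preprocessing trick, not an official MALLET feature): since MALLET consumes token sequences, you can approximate per-keyword weights by repeating each token in proportion to its weight before import. The scale factor and rounding below are arbitrary choices, and rounding does discard some precision:

```python
def weights_to_sequence(keyword_weights, scale=10):
    """Turn {keyword: weight} (e.g. tf-idf scores) into a token sequence
    whose repeat counts roughly encode the weights."""
    tokens = []
    for word, weight in keyword_weights.items():
        # Repeat each keyword proportionally to its weight, at least once.
        tokens.extend([word] * max(1, round(weight * scale)))
    return " ".join(tokens)

doc = {"inference": 0.8, "dirichlet": 0.5, "gibbs": 0.1}
print(weights_to_sequence(doc))
# The resulting text can then be fed to MALLET's usual text import.
```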