IR Offline Metric Precision@k in Topic Modeling - lda

Good day everyone! I'm currently learning LDA and I'm curious about how to validate the results of a topic model.
I've read this statement in the following paper (p. 10): https://arxiv.org/pdf/2107.02173.pdf
To the extent that our experimentation accurately represents current practice, our results do suggest that topic model evaluation—both automated and human—is overdue for a careful reconsideration. In this, we agree with Doogan and Buntine (2021), who write that “coherence measures designed for older models [. . . ] may be incompatible with newer models” and instead argue for evaluation paradigms centered on corpus exploration and labeling. The right starting point for this reassessment is the recognition that both automated and human evaluations are abstractions of a real-world problem. The familiar use of precision-at-10 in information retrieval, for example, corresponds to a user who is only willing to consider the top ten retrieved documents.
Suppose you have a corpus of Play Store app user reviews, and after a topic model processes this corpus it generates, say, K=10 topics. How would we use the precision@10 offline metric to evaluate the result (the 10 topics)?
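For a concrete (and hedged) reading of that metric, the minimal sketch below treats a topic's top-10 words as the "retrieved" items and a human-labeled set of relevant words as the ground truth. This is just one possible way to adapt precision@10 to topics, not the paper's own procedure; the word lists and the precision_at_k helper are hypothetical.

```python
# A minimal sketch (not from the paper): one reading of precision@10 for a topic
# is to treat the topic's top-10 words as the "retrieved" items and a
# human-labeled set of relevant words as the ground truth. The names below
# (top_words, relevant_words) are hypothetical.

def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that a human judged relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Example: top words from a hypothetical "app crashes" topic in the reviews corpus
top_words = ["crash", "freeze", "bug", "update", "slow", "login",
             "error", "fix", "battery", "ads"]
relevant_words = {"crash", "freeze", "bug", "error", "fix", "slow"}

print(precision_at_k(top_words, relevant_words, k=10))  # 0.6
```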

Related

Is it worth it to release a single-language model of Google's BERT for Italian?

I am currently working on my thesis, which is about automated question answering using a translation of Stanford's SQuAD dataset into Italian. I am going to use Google's BERT https://github.com/google-research/bert as it currently gives the best results on the SQuAD challenge. Google provides a multilingual pre-trained model covering many languages, including Italian.
Is it worth it to release a single-language BERT model only for Italian? My assumption is that a single-language model means a smaller network, which means less training time and a smaller size.

Human trace data for evaluating a reinforcement learning agent playing Atari?

In recent reinforcement learning research on Atari games, agent performance is evaluated using human starts.
[1507.04296] Massively Parallel Methods for Deep Reinforcement Learning
[1509.06461] Deep Reinforcement Learning with Double Q-learning
[1511.05952] Prioritized Experience Replay
In the human-start evaluation, learned agents begin episodes from points randomly sampled from a human professional's game-play.
My question is:
Where can I get this human professional's game-play trace data?
For a fair comparison, the trace data should be the same across studies, but I could not find the data.
I'm not aware of that data being publicly available anywhere. Indeed, as far as I know all the papers that use such human-start evaluations were written by the same lab/organization (DeepMind), so it's quite possible that DeepMind has kept the data internal and hasn't shared it with external researchers.
Note that the paper Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents proposes a different (arguably better) approach for introducing the desired stochasticity into the environment to disincentivize an algorithm from simply memorizing strong sequences of actions. Their approach, referred to as sticky actions, is described in Section 5.2 of that paper. In Section 5.3 they also describe numerous disadvantages of other approaches, including the human-starts approach.
In addition to arguably being a better approach, sticky actions have the advantage that they can very easily be implemented and used by all researchers, allowing for fair comparisons. So I'd strongly recommend simply using sticky actions instead of human starts. The obvious disadvantage is that you can no longer easily compare your results to those reported in the DeepMind papers with human starts, but those evaluations have numerous flaws anyway, as described in the paper linked above (human starts can be considered one flaw, but they also often have others, such as reporting the result of the best run instead of the average over multiple runs).
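To make the sticky-actions mechanism concrete, here is a minimal sketch assuming a Gym-style environment exposing reset() and step(action): with some probability the environment repeats the agent's previous action instead of the newly chosen one. The wrapper class and names are illustrative, not an official implementation.

```python
import random

class StickyActionsWrapper:
    """Minimal sketch of sticky actions (Machado et al., Section 5.2):
    with probability `stickiness`, the environment repeats the agent's
    previous action instead of the newly chosen one. Assumes a Gym-style
    env exposing reset() and step(action)."""

    def __init__(self, env, stickiness=0.25):
        self.env = env
        self.stickiness = stickiness
        self.last_action = 0  # NOOP by convention

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.stickiness:
            action = self.last_action  # repeat the previous action
        self.last_action = action
        return self.env.step(action)
```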

If I wanted to predict future purchases in online shopping using historical data, do I need data science, data analysis, or big data?

I want to learn to predict future events: for example, predicting the number of plane crashes in 2018 using the past two decades of plane-crash data, or predicting how many T-shirts with Justin Bieber's face on them will be sold by 2018 based on previous fan-base data, or how many iPhone 8s and Samsung S9s will be sold if they launch on the exact same date; roughly, making somewhat accurate wholesale-market predictions. Please suggest a book. I really love the Head First series; is Head First Data Analysis right for me? I don't know whether I can ask questions other than programming questions here, but here I am. By the way, does big data have anything to do with this?
It all falls into the category of data science (which encompasses big data and data analysis). What you need for predictions is some machine learning approach applied to the data you have, or can access, about the things you want to predict.
I'd recommend this recent series of articles: https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12
Apart from a really nice intro, you'll find lots of resources for further learning there.
I also highly recommend deeplearning.ai and the Stanford machine learning course you can find on Coursera.
Cheers!
I think most of the scenarios you have described are cases of supervised learning, a type of machine learning in which you train a model on historical data containing both input and output values; once the model is trained, you feed it new input values and it returns outputs, which are your predictions.
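As a toy illustration of that train-then-predict workflow (not a serious forecasting model), here is a sketch using scikit-learn linear regression; the yearly counts are made-up placeholders, not real statistics.

```python
# Toy sketch of the supervised train-then-predict workflow described above,
# using scikit-learn linear regression. The yearly counts are made-up
# placeholders, not real data.
from sklearn.linear_model import LinearRegression

# Historical data: inputs (year) and outputs (observed count for that year)
years = [[2010], [2011], [2012], [2013], [2014], [2015], [2016], [2017]]
counts = [12, 15, 11, 14, 13, 16, 12, 15]  # placeholder values

model = LinearRegression()
model.fit(years, counts)              # train on historical input/output pairs

prediction = model.predict([[2018]])  # feed a new input, get a prediction
print(prediction)
```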
I would highly recommend the following machine learning course by Andrew Ng on Coursera, which covers all the basics of ML, including supervised and unsupervised learning.
https://www.coursera.org/learn/machine-learning
As for books, the following link from Analytics Vidhya is a great place to start; the books listed there can give you a good grounding in statistics and data science.
https://www.analyticsvidhya.com/blog/2015/10/read-books-for-beginners-machine-learning-artificial-intelligence/
As for the differences between data science, data analytics, and big data: data science and data analytics are similar in that both try to find patterns in data and derive insights from those patterns.
Big data, on the other hand, refers to data of very large size distributed across multiple machines, so that large amounts of data can be stored and processed in parallel.
So how are big data and machine learning related? The answer lies in training: prediction accuracy depends to a large extent on the amount of data you train on. More training data generally means better predictions, and big data supplies that quantity, hence the relation.

Topic Modeling tool for large data set (30GB)

I'm looking for a topic modeling tool that can handle a large dataset.
My current training dataset is 30 GB. I tried MALLET topic modeling, but I always get an OutOfMemoryError.
If you have any tips, please let me know.
There are many options available to you, and this response is agnostic as to how they compare.
I think that the important thing with such a large dataset is the method of approximate posterior inference used, and not necessarily the software implementation. According to this paper, online Variational Bayes inference is much more efficient, in terms of time and space, than Gibbs sampling. Though I've never used it, the gensim package looks good. It's in python, and there are in-depth tutorials at the project's webpage.
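As a rough sketch of how gensim's online LDA can be run over a corpus streamed from disk (so the full 30 GB never has to fit in memory), assuming one preprocessed document per line in a hypothetical reviews.txt and a placeholder whitespace tokenizer:

```python
# Minimal sketch of gensim's online LDA on a corpus streamed from disk.
# The file path and the whitespace tokenization are placeholders for your
# own preprocessing.
from gensim import corpora, models

CORPUS_PATH = "reviews.txt"  # hypothetical: one preprocessed document per line

def stream_tokens(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()

# Build the vocabulary in one streaming pass
dictionary = corpora.Dictionary(stream_tokens(CORPUS_PATH))
dictionary.filter_extremes(no_below=5, no_above=0.5)

class BowCorpus:
    """Streams bag-of-words vectors one document at a time."""
    def __iter__(self):
        for tokens in stream_tokens(CORPUS_PATH):
            yield dictionary.doc2bow(tokens)

lda = models.LdaModel(BowCorpus(), id2word=dictionary,
                      num_topics=100, chunksize=10000, passes=1)
print(lda.print_topics(num_topics=5, num_words=10))
```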
For code that comes straight from the source, see the webpage of David Blei, one of the authors of the LDA model, here. He links to more than a few implementations, in a variety of languages (R, Java, C++).
I suggest using a "big data" tool such as graphlab, which supports topic modeling: http://docs.graphlab.org/topic_modeling.html
The GraphLab Create topic model toolkit (with Python API bindings) should be able to handle a dataset that large.

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they collect the user's location? Do they share or sell data with third parties? And so on.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy collects the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic; for example, the presence of the word "location" alone might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their associated confidence values, to the point where I could categorize all privacy policies with a high degree of confidence. An analogy can be made to email spam-filtering systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
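To make the cue idea a bit more concrete, here is a minimal sketch of what I have in mind, with invented cue phrases and confidence weights, naively combining matched cues as if they were independent:

```python
# Minimal sketch of the cue idea: each cue phrase carries a hand-assigned
# confidence that the characteristic is present, and a policy's score is the
# probability that at least one matched cue is right (naively assuming the
# cues are independent). The cues and weights below are invented placeholders.

LOCATION_CUES = {
    "we will take your location": 1.00,
    "location data": 0.60,
    "location": 0.25,
}

def characteristic_score(policy_text, cues):
    text = policy_text.lower()
    prob_none_right = 1.0
    for phrase, confidence in cues.items():
        if phrase in text:
            prob_none_right *= (1.0 - confidence)
    return 1.0 - prob_none_right

sample_policy = "We may collect location data to improve our services."
print(characteristic_score(sample_policy, LOCATION_CUES))  # 0.7
```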
I wanted to ask whether you think this is a good approach to the problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project that touches on artificial intelligence, specifically machine learning and NLP.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
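As a minimal sketch of that setup, assuming a small hand-labeled set (the documents and labels below are placeholders), one could combine tf-idf features, a linear SVM, and one-vs.-rest in scikit-learn:

```python
# Minimal sketch of the multilabel setup described above: tf-idf features,
# a linear SVM, and one-vs.-rest to handle multiple labels per policy.
# The example documents and labels are placeholders for a real labeled set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = [
    "We collect your location and share it with advertising partners.",
    "We never sell personal data to third parties.",
    "Location information may be stored to improve the service.",
    "Your data may be shared with third-party analytics providers.",
]
labels = [{"location", "third_party"}, set(), {"location"}, {"third_party"}]

binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)  # one binary column per characteristic

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    OneVsRestClassifier(LinearSVC()))
clf.fit(docs, y)

pred = clf.predict(["We may share your location with partners."])
print(binarizer.inverse_transform(pred))
```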
A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics that are present in all documents; I suspect what will come up is material about licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
I would approach this as a machine learning problem where you are trying to classify documents in multiple ways, e.g., wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to detect (location, SSN), and then for each document record whether it uses that information or not. Choose your features, train on your data, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to things like SSN or location would finish it nicely.
Use the machine learning algorithm of your choice; Naive Bayes is very easy to implement and use, and would work OK as a first stab at the problem.