I'm trying to get the text of article Y on Wikipedia. I don't necessarily want the latest version of the article, but rather the last version made by user X.
I do know that user X made some edits to the article.
Right now I query the API by title Y to get information about recent changes to the article:
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=user|ids&rvlimit=500&titles=Y
I process that information to get the first revid associated with user X, and then put that into this request to get the full article text as of when user X last vetted the article:
https://en.wikipedia.org/w/api.php?action=parse&section=0&format=json&prop=text&oldid=REVID
This is working now, but it has a major flaw: if user X didn't edit the page within the last Z edits, they're not going to appear in the revisions query. Is there any way I can more directly get the text of article Y as last vetted by user X?
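For reference, here is a minimal Python sketch of my current two-step workflow using the requests library, plus the rvuser filter I'm wondering about; rvuser is an assumption to verify against the MediaWiki API docs, since it should restrict the revision list to edits by a given user and sidestep the paging problem entirely.

import requests

API = "https://en.wikipedia.org/w/api.php"

# Step 1: find the most recent revision of article Y made by user X.
# rvuser (assumed here; check the API docs) filters revisions by author,
# so there is no need to page through unrelated edits.
resp = requests.get(API, params={
    "action": "query", "format": "json", "prop": "revisions",
    "titles": "Y", "rvprop": "user|ids", "rvlimit": 1, "rvuser": "X",
}).json()
page = next(iter(resp["query"]["pages"].values()))
revid = page["revisions"][0]["revid"]

# Step 2: fetch the parsed text of that revision.
text = requests.get(API, params={
    "action": "parse", "format": "json", "prop": "text",
    "section": 0, "oldid": revid,
}).json()["parse"]["text"]["*"]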
Is it reasonable to use topic modelling on a single document? To be more precise, is it mathematically okay to use the LDA-Gibbs method on a single document? If so, what should the values of k and seed be?
Also, what is the role of k and seed for a single document as well as for a large set of documents?
k and seed are arguments of the LDA function (in RStudio).
Also, let me know if I am wrong anywhere in this question.
To give some background on my project: I am trying to find the main topics that can be used to represent the content of a single document.
I have already tried k = 4, 7, and 10. Part of my question is also which value of k would be better.
It really depends on the document. A document could be a 700-page book or a single sentence. Your k (I think you mean the number of topics?) is also going to depend on the document. If your document is the entire Wikipedia corpus, 1500 topics might be appropriate; if your document is a list of comments about movies, then 20 topics might be appropriate. Optimizing that number can be done using the elbow method (check out 17).
The seed can be pretty much anything; it's just a lever so your results can be replicated, and it runs even if you leave it blank. I would say try it and check your coherence, eyeball your topics, and if it looks right then sure, you can train an LDA model on one document. A single document should process pretty fast.
Here is an example in Python of using the seed parameter. My data set is 1,048,575 rows; note that the seed is much larger:
import gensim

# mallet_path, bow_corpus, and dictionary are defined earlier in the script
ldamallet = gensim.models.wrappers.LdaMallet(
    mallet_path, corpus=bow_corpus, num_topics=20, alpha=0.1,
    id2word=dictionary, iterations=1000, random_seed=569356958)
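To make the elbow/coherence suggestion concrete, here is a minimal sketch that compares coherence across a few values of k. It assumes tokenized documents in texts plus the same dictionary and bow_corpus as above, and uses gensim's plain LdaModel rather than the Mallet wrapper:

import gensim
from gensim.models import CoherenceModel

# texts: the tokenized documents behind bow_corpus
for k in (4, 7, 10, 20):
    lda = gensim.models.LdaModel(bow_corpus, num_topics=k,
                                 id2word=dictionary, random_state=100)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())

Plot the scores against k and look for the elbow; random_state here is just an arbitrary seed so runs are reproducible.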
Referring to the original paper on CycleGAN, I am confused about these lines:
The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ.
I understand there are two sets of images and there is no pairing between them. So when the generator takes an image, say x from set X, as input and tries to translate it into an image similar to the images in set Y, my question is: there are many images present in set Y, so which y will our x be translated into? There are so many options available in set Y. Is that what is pointed out in the lines of the paper quoted above? And is this the reason we use the cyclic loss: to overcome this problem and to create some type of pairing between two images by converting x to y and then converting y back to x?
The image x won't be translated into a concrete image y but rather into the "style" of the domain Y. The input is fed to the generator, which tries to produce a sample from the desired distribution (the other domain); the generated image then goes to the discriminator, which tries to predict whether the sample comes from the actual distribution or was produced by the generator. This is just the normal GAN workflow.
If I understand it correctly, in the lines you quoted the authors explain the problems that arise with the adversarial loss alone. They say it again here:
Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, an adversarial loss alone cannot guarantee that the learned function can map an individual input x_i to a desired output y_i.
This is one of the reasons for introducing cycle-consistency: to produce meaningful mappings and to reduce the space of possible mapping functions (it can be viewed as a form of regularization). The idea is not to create a pairing between two random images that are already in the dataset (the dataset stays unpaired), but to make sure that if you map a real image from domain X to domain Y and then back again, you get the original image back.
Cycle consistency encourages the generators to avoid unnecessary changes and thus to generate images that share structural similarity with their inputs; it also prevents the generators from excessive hallucination and mode collapse.
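As a concrete illustration, here is a minimal PyTorch sketch of the cycle-consistency term; the generators G and F are hypothetical modules, not the authors' code, though the L1 penalty and the weight λ = 10 follow the paper:

import torch

def cycle_consistency_loss(G, F, real_x, real_y, lam=10.0):
    # The X -> Y -> X and Y -> X -> Y round trips should reconstruct the inputs.
    rec_x = F(G(real_x))  # F(G(x)) should recover x
    rec_y = G(F(real_y))  # G(F(y)) should recover y
    return lam * (torch.mean(torch.abs(rec_x - real_x)) +
                  torch.mean(torch.abs(rec_y - real_y)))

This term is added to the two adversarial losses, so the generators are pushed both to match the target distribution and to keep each round trip close to the identity.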
I hope that answers your questions.
As the title says: why, and in what situations, do we use the ATTR function in Tableau?
I understand what it is in its simplest form.
Thank you.
I know you say you understand what it is in its simplest form, but for folks who stumble upon this question, I'd also offer this blog post, which gives an example and a "plain English" breakdown of the Tableau Knowledge Base article "When to Use the Attribute (ATTR) Function". The same article is also linked in an older SO question in a similar vein.
In a phrase from the blog post:
It returns a value if it is unique, else it returns *
As another example, this thread discusses ATTR in the context of calculating an average where date is less than the value from another data source.
The typical case is a tooltip where, most of the time, there is only one value for a field among all the data rows at the level of detail of the viz, but there are a few exceptions. In that case, you may not want to create new marks by treating the field as a dimension, which would increase the level of detail of the viz. This is so common that Tableau automatically uses ATTR() when you put dimensions on the Tooltip shelf.
Another case is when you are putting a dimension on the Detail or Rows shelf and want to "debug" the level of detail of your viz. Say you expect that using A, B, and C as dimensions will cause each mark to have the same value for D. You can use ATTR(D) to concisely show which combinations of A, B, and C differ from your expectation.
ATTR is also useful with calculated fields in some cases, say in an IF condition when defining an aggregate calc.
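For example, here is a minimal sketch of such an aggregate calc in Tableau's formula language; the field names [Category] and [Sales] are made up for illustration:

// Only total sales for marks where Category is unambiguous;
// ATTR([Category]) returns * when several categories share a mark.
IF ATTR([Category]) = "Furniture" THEN SUM([Sales]) ELSE 0 END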
I have created an assessment using Google Forms. I have a few questions for which there is no wrong answer. For example, Question 1 has options A, B, and C. The user can pick any option and still be right. However, I want to assign a different point value to each option, so picking A would give them 1 point, B 2 points, and C 3 points. Is there a way to do that using Forms? When assigning points in the answer key, it looks like I can only assign one point value that is used for all of options A, B, and C.
You could definitely do that with Forms + Apps Script, but it would be difficult to do with Forms alone unless you're using lots of plug-ins.
You could code a scoring system based on the FormResponse and email the results or store them in a sheet, as sketched below.
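Here is a minimal Apps Script sketch of that idea, intended to be installed as an onFormSubmit trigger on the form; the option-to-points mapping and 'SHEET_ID' are hypothetical placeholders:

function scoreResponse(e) {
  // Hypothetical mapping from option text to points.
  var points = {'A': 1, 'B': 2, 'C': 3};
  var total = 0;
  e.response.getItemResponses().forEach(function (itemResponse) {
    var answer = itemResponse.getResponse();
    if (typeof answer === 'string' && points.hasOwnProperty(answer)) {
      total += points[answer];
    }
  });
  // Store the score; 'SHEET_ID' is a placeholder for your own spreadsheet.
  SpreadsheetApp.openById('SHEET_ID').getActiveSheet()
      .appendRow([new Date(), total]);
}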
When you search for a file name, on the left it gives you numbers ranging from 0-999. What do these numbers represent? It seems like a search ranking, but I'm not sure.
They are a measure of the likelihood that the result matches your search query. This kind of algorithm runs under the hood in most predictive or autocomplete searches (like Google's or the Mac's Spotlight), but the ST2 team decided it would be neat to show you the numeric result.
It takes a few items into consideration. Each one of these criteria adds more value to that result:
Number of matching characters
How frequently the file has been used
Proximity to the top folder
Whether the letters are in sequence or dispersed through the filename
Whether the filename starts with the matched letters, or the matched letters are somewhere in between.
In the example below, you can see the values go up as "index.html" gradually becomes more accurate. As expected, buried files or files that are used less frequently get a lower value.
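To make the criteria above concrete, here is a toy Python sketch of this style of fuzzy scoring. It is not Sublime Text's actual algorithm (and it ignores usage frequency and folder depth), just an illustration of how such bonuses could combine into a 0-999 value:

def fuzzy_score(query, filename):
    # Toy fuzzy-match score, not ST2's real implementation.
    query, filename = query.lower(), filename.lower()
    score, last_idx, pos = 0, -1, 0
    for ch in query:
        idx = filename.find(ch, pos)
        if idx == -1:
            return 0              # a query character didn't match at all
        score += 10               # matched character
        if idx == last_idx + 1:
            score += 5            # consecutive-match bonus
        if idx == 0:
            score += 10           # filename starts with the match
        last_idx, pos = idx, idx + 1
    return min(score, 999)

With this toy scoring, fuzzy_score("index", "index.html") beats a query whose letters are scattered deep inside a longer path, which matches the behavior described above.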