How to predict topics for a batch of documents with MALLET LDA

I am using MALLET from a Scala project. After training the topic model and getting the inferencer file, I tried to assign topics to new texts. The problem is that I get different results with different calling methods. Here are the things I tried:
Creating a new InstanceList, ingesting just one document, and getting the topic results from that InstanceList:
somecontentList.map(text => getTopics(text, model))

def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
  val testing = new InstanceList(pipe)
  testing.addThruPipe(new Instance(text, null, "test instance", null))
  inferencer.getSampledDistribution(testing.get(0), iter, 1, burnIn)
}
Putting everything into one InstanceList and predicting the topics together:
val testing = new InstanceList(pipe)
somecontentList.foreach(text =>
  testing.addThruPipe(new Instance(text, null, "test instance", null))
)
(0 until testing.size).map(i =>
  ldaModel.getSampledDistribution(testing.get(i), 100, 1, 50))
These two methods produce very different results except for the first instance. What is the right way of using the inferencer?
Additional information:
I checked the instance data.
0: topic (0)
1: beaten (1)
2: death (2)
3: examples (3)
4: forum (4)
5: wanted (5)
6: contributing (6)
I assume the number in parentheses is the index of the word used in prediction. When I put all the texts into the InstanceList, the indices are different because the collection contains more text. I am not sure how exactly that information is taken into account in the model's prediction process.

Remember that the new instances must be imported with the pipe from the original training data, as recorded in the Inferencer, in order for the alphabets to match. It's not clear where pipe comes from in the Scala code, but the fact that the listed words have what look like ids counting up from 0 suggests that this is a new alphabet.

I found a similar issue too, although with the R plugin. We ended up calling the Inferencer for each row/document separately.
However, there will be some differences in the inferences even when you call it on the same row, because of the stochasticity in the sampling and the inferencer. I agree, though, that the differences should be small.

What is the right way to generate long sequence using PyTorch-Transformers?

I am trying to generate a long sequence of text using PyTorch-Transformers from a sample text. I am following this tutorial for this purpose. Because the original article only predicts one word from a given text, I modified that script to generate a long sequence instead of a single word. This is the modified part of the code:
# Encode a text input
text = """An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a structure,
like a bridge, to see if it is safe. A doctor may conduct"""
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
seq_len = tokens_tensor.shape[1]
tokens_tensor = tokens_tensor.to('cuda')

with torch.no_grad():
    for i in range(50):
        outputs = model(tokens_tensor[:, -seq_len:])
        predictions = outputs[0]
        # Greedily pick the most likely next token and append it to the sequence
        predicted_index = torch.argmax(predictions[0, -1, :])
        tokens_tensor = torch.cat((tokens_tensor, predicted_index.reshape(1, 1)), 1)

pred = tokens_tensor.detach().cpu().numpy().tolist()
predicted_text = tokenizer.decode(pred[0])
print(predicted_text)
Output
An examination can be defined as a detailed inspection or analysis
of an object or person. For example, an engineer will examine a
structure, like a bridge, to see if it is safe. A doctor may conduct
an examination of a patient's body to see if it is safe.
The doctor may also examine a patient's body to see if it is safe. A
doctor may conduct an examination of a patient's body to see if it is
safe.
As you can see, the generated text does not contain any unique text sequence; it generates the same sentence over and over again with minor changes.
How should we generate a long sequence using PyTorch-Transformers?
There is usually no such thing as generating a complete sentence or a complete text at once. There have been some research approaches to that, but almost all state-of-the-art models generate text word by word. The word generated at time t-1 is then used as input (together with the other already generated or given words) when generating the next word at time t. So it is normal that generation proceeds word by word; I do not understand what you mean by this.
Which model are you using?
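As a side note, the repetition in the output above is typical of greedy decoding (always taking the argmax). Purely as an illustration, not something stated in the original answer, here is a minimal sketch of the same word-by-word loop that samples the next token from the predicted distribution instead; it assumes the GPT-2 model and tokenizer from the tutorial, the import name depends on the library version (pytorch_transformers here, transformers in newer releases), and the temperature value is an arbitrary choice:
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

text = "An examination can be defined as a detailed inspection or analysis"
tokens_tensor = torch.tensor([tokenizer.encode(text)])

temperature = 0.8  # illustrative value; lower means more conservative choices
with torch.no_grad():
    for _ in range(50):
        logits = model(tokens_tensor)[0]              # shape (1, seq_len, vocab)
        next_logits = logits[0, -1, :] / temperature
        probs = torch.softmax(next_logits, dim=-1)
        # Sample the next token instead of taking the argmax
        next_token = torch.multinomial(probs, num_samples=1)
        tokens_tensor = torch.cat((tokens_tensor, next_token.unsqueeze(0)), dim=1)

print(tokenizer.decode(tokens_tensor[0].tolist()))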

Splitting a feature collection by system index in Google Earth Engine?

I am trying to export a large feature collection from GEE. I realize that the Python API allows for this more easily than the JavaScript API does, but given a time constraint on my research, I'd like to see if I can extract the feature collection in pieces and then append the separate CSV files once exported.
I tried to use a filtering function to perform the task, one that I've seen used before with image collections. Here is a mini example of what I am trying to do.
Given a feature collection of 10 spatial points called "points", I tried to create a new feature collection that includes only the first five points:
var points_chunk1 = points.filter(ee.Filter.rangeContains('system:index', 0, 5));
When I execute this function, I receive the following error: "An internal server error has occurred"
I am not sure why this code is not executing as expected. If you know more than I do about this issue, please advise on alternative approaches to splitting my sample, or on where the error in my code lurks.
Many thanks!
system:index is actually an ID given by GEE to each feature and it's not supposed to be used like an index into an array. I think the JS API should be enough to export a large FeatureCollection, but there is a way to do what you want without relying on system:index, since it might not be consistent.
First, it would be a good idea to know the number of features you are dealing with, because calling size().getInfo() on a very large feature collection can freeze the UI and sometimes make the tab unresponsive. Here I have defined chunk and collectionSize. They have to be defined client-side because we want to call Export within the loop, which is not possible in server-side loops. Within the loop, you simply create a subset of features starting at different offsets by converting the collection to a list and converting the subset back to a feature collection.
var chunk = 1000;
var collectionSize = 10000;
for (var i = 0; i < collectionSize; i = i + chunk) {
  var subset = ee.FeatureCollection(fc.toList(chunk, i));
  Export.table.toAsset(subset, "description", "/asset/id");
}
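Since the question mentions the Python API, roughly the same chunked export can be sketched there as well; the asset paths, description, and chunk size below are placeholders:
import ee
ee.Initialize()

fc = ee.FeatureCollection('users/someuser/points')  # placeholder asset ID
chunk = 1000
collection_size = fc.size().getInfo()  # client-side count; can be slow on huge collections

for i in range(0, collection_size, chunk):
    subset = ee.FeatureCollection(fc.toList(chunk, i))
    task = ee.batch.Export.table.toAsset(
        collection=subset,
        description='points_chunk_{0}'.format(i),
        assetId='users/someuser/points_chunk_{0}'.format(i))
    task.start()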

I need a basic concrete example on how to use TDD along with Design by Contract

I have seen many questions like this and this. Some people see an overlap between TDD and Design by Contract, and others say they are complementary; I lean toward the second view, so I need a very basic, correct, and complete example, in any language or even in pseudocode, of how to use them together.
This is a slightly tricky question, because both "test driven development" (TDD) and "design by contract" (DbC) imply something about your development process (generally, that the tests/contracts are written before the code).
Since you're asking about code examples, though, you are more interested in what it would look like to use tests and contracts together. Here is an example:
from typing import List

def sort_numbers(nums: List[int]) -> List[int]:
    '''
    Tests:
    >>> sort_numbers([4, 1, 2])
    [1, 2, 4]
    >>> sort_numbers([])
    []
    Contracts:
    post: len(__return__) == len(nums)
    post: __return__[0] <= __return__[-1]
    '''
    return sorted(nums)
Tests
We use tests to check how specific inputs affect the output. For example, sorting the numbers [4, 1, 2] produces the list [1, 2, 4]. Furthermore, sorting the empty list produces the empty list.
(these tests are written using doctest and can be checked with python -m doctest <file>)
Contracts
We use contracts to ensure that some properties hold, no matter what the inputs are. In this example, we assert that:
The returned list has the same length as the input list.
The first item returned is always less than or equal to the last item returned.
(these contracts are written in PEP-316 syntax and can be checked with CrossHair)
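Not part of the original answer, but if you want the same post-conditions exercised at runtime without a separate contract checker, they can also be written as plain asserts; a minimal sketch:
from typing import List

def sort_numbers_checked(nums: List[int]) -> List[int]:
    result = sorted(nums)
    # Post-conditions as runtime asserts
    assert len(result) == len(nums)
    assert not result or result[0] <= result[-1]
    return result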

Overwrite function only for a particular instance in LUA

I'm basically not looking for an answer on how to do something; I already found how to do it, but I want more information. I hope this kind of question is OK here.
Since I just discovered this in the code of a game I'm modding, I don't have any idea what I should google for.
In Lua, I can have for example:
Account = {balance = 0}

function Account:withdraw(v)
  self.balance = self.balance - v
end
And I can have (in another Lua file):
function Account:withdrawBetter(v)
  if self.balance > v then
    self.balance = self.balance - v
  end
end
...
-- somewhere in some function, with an Account instance:
a1.withdraw = a1.withdrawBetter
What's the name for this "technique", so I can find some more information about it (possible pitfalls, performance considerations vs. override/overwrite, etc.)? Note that I'm only changing withdraw for the particular instance (a1), not for every Account instance.
Bonus question: Any other oo programming languages with such facility?
Thanks
OO in Lua
First of all, it should be pointed out that Lua does not implement Object Oriented Programming; it has no concept of objects, classes, inheritance, etc.
If you want OOP in Lua, you have to implement it yourself. Usually this is done by creating a table that acts as a "class", storing the "instance methods", which are really just functions that accept the instance as its first argument.
Inheritance is then achieved by having the "constructor" (also just a function) create a new table and set its metatable to one with an __index field pointing to the class table. When indexing the "instance" with a key it doesn't have, it will then search for that key in the class instead.
In other words, an "instance" table may have no functions at all, but indexing it with, for example, "withdraw" will just try indexing the class instead.
Now, if we take a single "instance" table and add a withdraw field to it, Lua will see that it has that field and not bother looking it up in the class. You could say that this value shadows the one in the class table.
What's the name for this "technique"
It doesn't really have one, but you should definitely look into metatables.
In languages that do support this sort of thing, like Ruby (see below), this is often done with singleton classes, meaning classes that have only a single instance.
Performance considerations
Indexing tables, including metatables takes some time. If Lua finds a method in the instance table, then that's a single table lookup; if it doesn't, it then needs to first get the metatable and index that instead, and if that doesn't have it either and has its own metatable, the chain goes on like that.
So, in other words, putting the function directly on the instance is actually faster than going through the metatable. It does use up some more space, but not really that much (technically it could be quite a lot, but you really shouldn't worry about that; nonetheless, here's where you can read up on it if you want to).
Any other oo programming languages with such facility?
Yes, lots of 'em. Ruby is a good example, where you can do something like
array1 = [1, 2, 3]
array2 = [4, 5, 6]
def array1.foo
puts 'bar'
end
array1.foo # prints 'bar'
array2.foo # raises `NoMethodError`
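Python is another language where this works: a function can be bound to just one instance with types.MethodType. The Account class below is only a re-creation of the question's example for illustration:
import types

class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def withdraw(self, v):
        self.balance = self.balance - v

def withdraw_better(self, v):
    if self.balance > v:
        self.balance = self.balance - v

a1 = Account(100)
a2 = Account(100)

# Attach the replacement to a1 only; a2 keeps the class's withdraw
a1.withdraw = types.MethodType(withdraw_better, a1)

a1.withdraw(200)   # ignored: a1.balance stays 100
a2.withdraw(200)   # class method runs: a2.balance becomes -100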

Django Query Natural Sort

Let's say I have this Django model:
class Question(models.Model):
    question_code = models.CharField(max_length=10)
and I have 15k questions in the database.
I want to sort it by question_code, which is alphanumeric. This is quite a classical problem and has been talked about in:
http://blog.codinghorror.com/sorting-for-humans-natural-sort-order/
Does Python have a built in function for string natural sort?
I tried the code in the 2nd link (which is copied below, changed a bit), and noticed it takes up to 3 seconds to sort the data. To check the function's performance on its own, I wrote a test which creates a list of 100k random alphanumeric strings. It takes only 0.76 s to sort that list. So what's happening?
This is what I think: the function needs to get the question_code of each question for comparison, so calling this function to sort 15k values means querying MySQL 15k separate times. And this is the reason why it takes so long. Any idea? And is there any solution for natural sort in Django in general? Thanks a lot!
import re

def natural_sort(l, ascending, key=lambda s: s):
    def get_alphanum_key_func(key):
        convert = lambda text: int(text) if text.isdigit() else text
        return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
    sort_key = get_alphanum_key_func(key)
    return sorted(l, key=sort_key, reverse=ascending)
As far as I'm aware there isn't a generic Django solution to this. You can reduce your memory usage and limit your db queries by building an id/question_code lookup structure:
from natsort import natsorted

question_code_lookup = Question.objects.values('id', 'question_code')
ordered_question_codes = natsorted(question_code_lookup, key=lambda i: i['question_code'])
Assuming you want to page the results, you can then slice up ordered_question_codes, perform another query to retrieve all the questions you need, and order them according to their position in that slice:
# get the first 20 questions
ordered_question_codes = ordered_question_codes[:20]
question_ids = [q['id'] for q in ordered_question_codes]
questions = Question.objects.filter(id__in=question_ids)

# put them back into question code order
id_to_pos = dict(zip(question_ids, range(len(question_ids))))
questions = sorted(questions, key=lambda x: id_to_pos[x.id])
If the lookup structure still uses too much memory, or takes too long to sort, then you'll have to come up with something more advanced. This certainly wouldn't scale well to a huge dataset.
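One such more advanced option, shown here only as an illustration and not part of the original answer, is to precompute a normalized sort key when a question is saved (for example, zero-padding the numeric runs in question_code) so the database itself can do the ordering; the extra field name, padding width, and max_length below are arbitrary choices:
import re

from django.db import models

def natural_key(code, pad=6):
    # Zero-pad numeric runs so plain lexicographic order matches natural order,
    # assuming the numbers embedded in question_code never exceed `pad` digits.
    return re.sub(r'\d+', lambda m: m.group(0).zfill(pad), code)

class Question(models.Model):
    question_code = models.CharField(max_length=10)
    question_code_sort = models.CharField(max_length=30, editable=False, db_index=True)

    def save(self, *args, **kwargs):
        self.question_code_sort = natural_key(self.question_code)
        super().save(*args, **kwargs)

# ordering now happens in the database
questions = Question.objects.order_by('question_code_sort')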