How can I use "Interpolated Absolute Discounting" for a bigram model in language modeling? - nltk

I want to compare two smoothing methods for a bigram model:
Add-one smoothing
Interpolated Absolute Discounting
For the first method, I found some codes.
def calculate_bigram_probabilty(self, previous_word, word):
bigram_word_probability_numerator = self.bigram_frequencies.get((previous_word, word), 0)
bigram_word_probability_denominator = self.unigram_frequencies.get(previous_word, 0)
if self.smoothing:
bigram_word_probability_numerator += 1
bigram_word_probability_denominator += self.unique__bigram_words
return 0.0 if bigram_word_probability_numerator == 0 or bigram_word_probability_denominator == 0 else float(
bigram_word_probability_numerator) / float(bigram_word_probability_denominator)
However, I found nothing for the second method except for some references for 'KneserNeyProbDist'. However, this is for trigrams!
How can I change my code above to calculate it? The parameters of this method must be estimated from a development-set.

In this answer I just clear up a few things that I just found about your problem, but I can't provide a coded solution.
with KneserNeyProbDist you seem to refer to a python implementation of that problem: https://kite.com/python/docs/nltk.probability.KneserNeyProbDist
There exists an article about Kneser–Ney smoothing on wikipedia: https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing
The article above links this tutorial: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf but this has a small fault on the most important page 29, the clear text is this:
Modified Kneser-Ney
Chen and Goodman introduced modified Kneser-Ney:
Interpolation is used instead of backoff. Uses a separate discount for one- and two-counts instead of a single discount for all counts. Estimates discounts on held-out data instead of using a formula
based on training counts.
Experiments show all three modifications improve performance.
Modified Kneser-Ney consistently had best performance.
Regrettable the modified Version is not explained in that document.
The original documentation by Chen & Goodman luckily is available, the Modified Kneser–Ney smoothing is explained on page 370 of this document: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf.
I copy the most important text and formula here as screenshot:
So the Modified Kneser–Ney smoothing now is known and seems being the best solution, just translating the description beside formula in running code is still one step to do.
It might be helpful that below the shown text (above in screenshot) in the original linked document is still some explanation that might help to understand the raw description.

Related

In R package Formattable, how to apply digits and conditional formatting at the same time?

I have the object TABLE_LIST which is a list that has tables (I can't provide the contents for privacy policies, sorry).
I first created the object TABLE_LIST (It is a list of data.frames 2x12)
TABLE_LIST=lapply(1:4, function(x) data.frame(rbind(total.ratio4[[x]][-(1)], total.ratio2[[x]][-(1)]), row.names=row))
The following code gives me red and green font colors based on the value on the cell, and it works like a charm:
formattable(TABLE_LIST[[1]], list(area(,-(c(5,10)))~formatter("span", style=x~style(color=ifelse(x>1,"red","green"))),area(,(c(5,10)))~formatter("span", style=x~style(color=ifelse(x>1,"green","red")))))
However, I need COLOR AND comma separated numbers. My failed attempt is:
formattable(TABLE_LIST[[1]], list(area(,-(c(5,10)))~formatter("span", style=x~style(color=ifelse(x>1,"red","green"))),area(,(c(5,10)))~formatter("span", style=x~style(color=ifelse(x>1,"green","red"),digits(x,2))),
area(1:2,1:10)~formatter("span",x~ style(digits(x,2)))))
This code works well, but erases the formatting of the color. I do not know what else to do.
I have to mention I cannot change the original data.frame without messing everything up. So I gotta make the changes on table_list or formattable. Thank you.
I think I solved it. So I will share this small knowledge to people who may have the same problems as me:
formattable(TABLE_LIST[[1]],
list(
area(,-(c(5,10)))~formatter("span",
style=x~style(color=ifelse(x>1,"red","green")),
x~style(digits(x,4))),
area(,(c(5,10)))~formatter("span",
style=x~style(color=ifelse(x>1,"green","red")),
x~style(digits(x,4)))))
Basically, inside the same formatter, on the level of style, add a comma and x~style.

How edge indexing works in graphhopper?

I'm here with a new question.
I'm making a custom algorithm that need precomputed data for the graph edges. I use the AllEdgesIterator like this :
AllEdgesIterator it = graph.getAllEdges();
int nbEdges = it.getCount();
int count = 0;
int[] myData = new int[nbEdges];
while (it.next())
{
count++;
...
}
The first weird thing is that nbEdges is equal to 15565 edges but count is only equal to 14417. How is it possible ?
The second weird thing is when I run my custom A* : I simply browse nodes using the outEdgeExplorer but I get an IndexOutOfBound at index 15569 on myData array. I thought that the edge indexes were included in [0 ; N-1] where N is the number of edges, is it really the case ?
What could be happening here ? By the way, I have disabled graph contraction hierarchies.
Thank you for answering so fast every time !
The first weird thing is that nbEdges is equal to 15565 edges
but count is only equal to 14417. How is it possible ?
This is because of the 'compaction' where unreachable subnetworks are removed, but currently only nodes are removed from the graph the edges are just disconnected and stay in the edges-'array' marked as deleted. So iter.getCount is just an upper limit but the AllEdgeIterator excludes such unused edges correctly when iterating and has the correct count. But using iter.getCount to allocate your custom data array is the correct thing to do.
Regarding the second question: that is probably because the QueryGraph introduces new virtual edges with a bigger edgeId as iter.getCount. Depending on the exact scenario there are different solutions like just excluding or using the original edge instead etc

Simulink Matlab function block deleting rows from a vector

That i want to do is to delete certain rows (or columns doesn't really mater...) from a given vector.
By going through Simulink's components found out that there is nothing performing such an operation,there are blocks help one add elements but nothing clearly for removing,so ended up trying to delete them by using a function block and following the online examples that demonstrate the usage of "[]".Lets say that i want to delete the second column of the vector u,i do u(:, 2) = [];.
That works absolutely fine in a separate m file or function but unfortunately not in a function block returning:
"Simulink does not have enough information to determine output sizes for
this block. If you think the errors below are inaccurate, try specifying
types for the block inputs and/or sizes for the block outputs."
and:
Size mismatch (size [4 x 4] ~= size [4 x 3]).
The size to the left is the size of the left-hand side of the assignment.
Function 'MATLAB Function' (#107.41.42), line 4, column 1:
"u"
Launch diagnostic report.
Is there any alternative you can suggest to remove several elements in a given vector in Simulink?
Thanks in advance
George
Finally,managed to do it without function block.There is a much easier way,by using Pad,and defining the output vector to be shorter than the input resulting in truncation.

MDX Children of Several Members

The children functions returns the set of the member.
But I need the children of several members.
The problem is, that I can't use Union to make it work like that:
Union([Geography].[Geography].[USA].children,[Geography].[Geography].[Canada].children)
I don't know how many member it will be... So I actually would need all children of a set of members.
like:
([Geography].[Geography].[USA],[Geography].[Geography].[Canada],[Geography].[Geography].[GB]).children
Is there a function like that?
I couldn't answer my question and so I just edit it. With the help of DHN's answer and some brain work I found a solution I could use:
Except(DRILLDOWNLEVEL( {[Geography].[Geography].[USA],[Geography].[Geography].[Canada]},,0 ),
{[Geography].[Geography].[USA],[Geography].[Geography].[Canada]})
That does work for me.
Explanation: I drilldown the elements the tool provides me, which returns children plus parents and then I use DHN's idea and except the parents so clean the list up a bit.
Hopefully it is understandable.
You can use the Descendants method (the fourth form of the description linked uses a set as its first argument. Thus,
Descendants( {
[Geography].[Geography].[USA],
[Geography].[Geography].[Canada],
[Geography].[Geography].[GB]
},
1,
SELF
)
should deliver exactly what you want.
Well actually, you could use a Crossjoin to get the set you want.
Something like
[Geography].[Geography].[USA] * [Geography].[Geography].[Canada] * [Geography].[Geography].[GB]
But this is only a proper solution, if you have only a few different search criteria.
Alternatively, you could use Except to remove those criteria you're not interested in. E.g.
Except([Geography].[Geography].children, [Geography].[Geography].[Germany])
This would give you the whole content of the [Geography] dimension, except the one of [Germany].
Hope this helps a bit.
Edit after comment of TO
Ok, this wasn't part of your question, but I think what you need is the MemberToStr() function. Please find the doc here.
I think something like this should do the trick.
with member [Measures].[Cities]
as membertostr([Geography].[Geography].members.children)
select [Measures].[Cities] on 0
from [WhatEverYourCubeNameIs]
where (
[Geography].[Geography].[USA],
[Geography].[Geography].[Canada]
)
Please note that this query is totally untested. I also may have lost some of my skills, because it's been a while, since I used mdx. You will also have to create the query dynamically, since the selection seems to be user dependant. But I'm sure that you're aware of it. ;)

more minimaler cubism.js horizon chart from json example

Following up on a previous question... I've got my minimal horizon chart example much more minimaler than before ( minimal cubism.js horizon chart example (TypeError: callback is not a function) )
<body>
<div class="mag"></div>
<script type="text/javascript">
var myContext = cubism.context();
var myMetr = myContext.metric(function(start, stop, step, callback) {
d3.json("../json/600s.json.php?t0=" + start/1000 + "&t1=" + stop/1000 + "&ss=" + step/1000, function(er, dt) {
if (!dt) return callback(new Error("unable to load data, or has NaNs"));
callback(null, dt.val);
});
});
var myHoriz = myContext.horizon()
.metric(myMetr);
d3.select(".mag")
.call(myHoriz);
</script>
</body>
The d3.json() bit calls a server side .php that I've written that returns a .json version of my measurements. The .php takes the start, stop, step (which cubism's context.metric() uses) as the t0, t1, and ss items in its http query string and sends back a .json file. The divides by 1000 are because I made my .php expect parameters in s, not ms. And the dt.val is because the actual array of my measurements is in the "val" member of the json output, e.g.
{
"other":"unused members...",
"n":5,
"val":[
22292.078125,
22292.03515625,
22292.005859375,
22292.02734375,
22292.021484375
]
}
The problem is, now that I've got it pared down to (I think) the bare minimum, AND I actually understand all of it instead of just pasting from other examples and hoping for the best (in which scenario, most things I try to change just break things instead of improving them), I need to start adding parameters and functions back to make it visually more useful.
Two problems first of all are, this measurement hovers all day around 22,300, and only varies +/- 10 maybe all day, so the graph is just a solid green rectangle, AND the label just says constantly "22k".
I've fixed the label with .format(d3.format(".3f")) (versus the default .2s which uses SI metric prefixes, thus the "22k" above).
What I can't figure out is how to use either axis, scale, extent, or what, so that this only shows a range of numbers that are relevant to the viewer. I don't actually care about the positive-green and negative-blue and darkening colours aspects of the horizon chart. I just used it as proof-of-concept to get the constantly-shifting window of measurements from my .json data source, but the part I really need to keep is the serverDelay, step, size, and such features of cubism.js that intelligently grab the initial window of data, and incrementally grab more via the .json requests.
So how do I keep the cubism bits I need, but usefully change my all-22300s graph to show the important +/- 10 units?
update re Scott Cameron's suggestion of horizon.extent([22315, 22320])... yes I had tried that and it had zero effect. Other things I've changed so far from "minimal" above...
var myHoriz = myContext.horizon()
.metric(myMetr)
.format(d3.format(".2f"))
.height(100)
.title("base1 (m): ")
.colors(["#08519c", "#006d2c"])
// .extent([22315, 22320]) // no effect with or without this line
;
I was able to improve the graph by using metric.subtract inserting it above the myHoriz line like so: (but it made the numerical label useless now):
var myMetr2 = myMetr.subtract(22315);
var myHoriz = myContext.horizon()
.metric(myMetr2)
.format...(continue as above)
All the examples seem so concise and expressive and work fine verbatim but so many of the tweaks I try to make to them seem to backfire, I'm not sure why that is. And similarly when I refer to the API wiki... maybe 4 out of 5 things I use from the API work immediately, but then I always seem to hit one that seems to have no effect, or breaks the chart completely. I'm not sure I've wrapped my head around how so many of the parameters being passed around are actually functions, for one thing.
Next hurdles after this scale/extent question, will be getting the horizontal time axis back (after having chopped it out to make things more minimal and easier to understand), and switching this from an area-looking graph to more of a line graph.
Anyway, all direction and suggestion appreciated.
Here's the one with the better vertical scale, but now the numerical label isn't what I want:
Have you tried horizon.extent? It lets you specify the [min, max] value for the horizon chart. By default, a linear scale will be created to map values within the extent to the pixels within the chart's height (specified with `horizon.height or default to 30 pixels).