Inconsistent results with AllenNLP Reading Comprehension - allennlp

We're seeing strange behavior with Reading Comprehension. Given the following people (names redacted) and their roles and contact numbers:
Asking RC for PersonAfname's number results in 4444, which is correct. It also correctly returns 3333 for PersonBfname's number. But when we ask for PersonCfname's number, it returns 4444 instead of 5555. What's up?
https://demo.allennlp.org/reading-comprehension/bidaf-elmo/s/what-is-perscfnames-number/Z8Z6J4A9D9

Reading comprehension models aren't perfect, but when I change the query from PersCfname's to PersonCfname's, it returns 5555.

jq: groupby and nested json arrays

Let's say I have: [[1,2], [3,9], [4,2], [], []]
I would like to know the scripts to get:
The number of nested lists which are/are not non-empty, i.e. I want to get: [3,2]
The number of nested lists which do/do not contain the number 3, i.e. I want to get: [1,4]
The number of nested lists for which the sum of the elements is/isn't less than 4, i.e. I want to get: [3,2]
i.e. basic examples of partitioning nested data.
Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.
Let's begin by refining the question about the counts of the lists "which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].
Solution using built-in filters
The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:
group_by(length==0) | map(length)
This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions,
we see that by design group_by does indeed sort by the grouping value.
Since in jq, false < true, we could fix our first attempt by writing:
group_by(length > 0) | map(length)
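For readers more at home in Python, the sort-then-group behaviour described above can be mimicked roughly as follows. This is only a sketch of the idea, not how jq implements group_by; the helper name group_by_counts is made up for this illustration.

```python
from itertools import groupby


def group_by_counts(items, key):
    """Sort items by key, group runs of equal keys, and return the group sizes."""
    # sorted() is stable, and in Python (as in jq) False sorts before True.
    ordered = sorted(items, key=key)
    return [len(list(group)) for _, group in groupby(ordered, key=key)]


data = [[1, 2], [3, 9], [4, 2], [], []]

# Grouping on "is empty": the False group (non-empty lists) comes first.
by_is_empty = group_by_counts(data, lambda xs: len(xs) == 0)

# Grouping on "is non-empty": the False group (empty lists) comes first.
by_is_nonempty = group_by_counts(data, lambda xs: len(xs) > 0)

print(by_is_empty, by_is_nonempty)
```

The two calls correspond to the two jq expressions above: the order of the counts follows the sorted order of the grouping value, not the order of the input.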
That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
An efficient solution
At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:
# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);
An efficient solution can now be written down in just two lines:
tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]
QED
p.s.
The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.
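The counting idea carries over to other languages. Below is a rough Python analogue, assuming collections.Counter as a stand-in for the jq tabulate filter; the helper partition_counts and the three predicates are illustrative only. The last two predicates go beyond what the answer above covers and are included just to show that the same pattern applies to the other partitions in the question.

```python
from collections import Counter


def tabulate(stream):
    """Count the occurrences of each distinct value in an iterable."""
    return Counter(stream)


def partition_counts(lists, predicate):
    """Return [count where predicate holds, count where it does not]."""
    counts = tabulate(bool(predicate(xs)) for xs in lists)
    # A Counter returns 0 for a key it has never seen.
    return [counts[True], counts[False]]


data = [[1, 2], [3, 9], [4, 2], [], []]

empty_vs_rest = partition_counts(data, lambda xs: len(xs) == 0)
has_three_vs_rest = partition_counts(data, lambda xs: 3 in xs)
small_sum_vs_rest = partition_counts(data, lambda xs: sum(xs) < 4)

print(empty_vs_rest, has_three_vs_rest, small_sum_vs_rest)
```

Unlike the group_by approach, nothing is sorted here: each list is visited once and only two counters are kept.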

How to create RegEx with SubMatches of the same Match that capture 2 different types of output?

I'm trying to get my Jira data into Excel via the JSON REST API, i.e. using VBA, and I'm parsing the JSON output using RegEx. There are plenty of useful tutorials on the web, and after a couple of days I have a more or less working solution that I'm happy with, except for one minor obstacle. Long story short:
Among many issue fields I need the friendly Assignee name, but some issues in my projects may be Unassigned, which results in TWO VERY different kinds of JSON output:
Unassigned issue:
..."assignee":null,"updated"...
Assigned issue:
"assignee":{
"self":...
<Lots of NOT needed fields here>
...
},
"displayName":"Doe, John", <-- That's what I need, name only part
"active":...
<Lots of NOT needed fields here>
...
},
"updated"...
Well, I suppose that something like:
"assignee".*?"displayName":"(.*?)"|"assignee":(.*?),"updated"
will handle the job by producing TWO possible Matches, but... is there a way to create a RegEx where EITHER output option results in SubMatches of ONE Match?
I'm a total newbie to RegEx, so sorry if the wording of my question is silly due to incorrectly used terms. Anyway, I hope the sample part is more or less clear, and I'll be extremely grateful for useful suggestions.
After an hour of tryouts on regex101 I ended up with the following RegEx:
"assignee":(null|.*?"displayName":"(.*?)","active")
It's probably ugly and could be improved, but it DOES the job, and it does NOT disturb the indexes of subsequent Matches in the collection, so the rest of the code keeps working as it does now.
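To check how the two capture groups behave, here is a small sketch that runs the same pattern through Python's re module. Python is used purely for illustration; the VBA RegExp engine may differ in details (for example, it reports an unmatched SubMatch as an empty value rather than None). The sample JSON strings and the helper name assignee_name are made up for this example.

```python
import re

# The pattern from above: group 1 is either "null" or the whole assignee
# fragment, group 2 is the display name (only set for assigned issues).
ASSIGNEE_RE = re.compile(r'"assignee":(null|.*?"displayName":"(.*?)","active")')


def assignee_name(issue_json):
    """Return the assignee's display name, or None for an unassigned issue."""
    match = ASSIGNEE_RE.search(issue_json)
    if match is None:
        return None
    # group(2) is None when the "null" alternative matched.
    return match.group(2)


# Hypothetical single-line fragments shaped like the two cases above.
unassigned = '{"assignee":null,"updated":"2015-01-01"}'
assigned = (
    '{"assignee":{"self":"https://jira.invalid/user",'
    '"displayName":"Doe, John","active":true},"updated":"2015-01-01"}'
)

print(assignee_name(unassigned), assignee_name(assigned))
```

Because "null" is the first alternative, an unassigned issue matches immediately and cannot run on into a displayName that belongs to a later field or issue.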

How do I use the modulo operators in Socrata SoQL?

The documentation at https://dev.socrata.com/docs/datatypes/number.html# says that % and ^ can be used to get the modulo of one number divided by another number. I cannot get them to work and cannot find examples.
When I try ^ I appear to get exponentiation. Example:
http://data.cityofchicago.org/resource/pubx-yq2d.json?$select=streetnumberto,streetnumberfrom,(streetnumberto-streetnumberfrom)^100%20as%20address_length_in_blocks
When I try % itself, I get a "malformed" error, not so surprisingly. When I try the %25 code for %, I still get a "malformed" error but one that seems to suggest that it correctly inserted the % but does not know what it means. (I am restricted from posting more than two links but just replace the ^ above with % and %25.)
Can anyone help me get this working?
By the way, at the risk of mixing topics, I would ideally like to use an int or round sort of function but they do not seem to exist in SoQL so I was trying to back into getting the integer portion of dividing by 100.
Thank you.
Great question. I just tried it myself and I'm getting a malformed exception too, and it's definitely getting through to the query optimizer:
https://data.cityofchicago.org/resource/erhc-fkv9.json?totalfees=4160%20%25%2010.0
Note I'm using the SODA 2.1 version of that API, which I recommend you migrate to:
https://dev.socrata.com/foundry/#/data.cityofchicago.org/erhc-fkv9
I'll check with our engineering team and see what might be going on. I'll pass on the feature request for round, ceil, floor, etc as well.
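Until the server-side operator works, one possible stopgap is to do the arithmetic on the client after fetching the rows. The sketch below assumes the rows have already been downloaded as JSON objects with the streetnumberto and streetnumberfrom fields from the query above, and that the values may arrive as strings; the sample rows and the function name are invented for illustration.

```python
def address_length_in_blocks(record):
    """Return (whole blocks, leftover numbers) for one row, 100 numbers per block."""
    # Values are assumed to be numeric strings or numbers; go through float
    # so that "1200" and "1200.0" are both accepted.
    number_to = int(float(record["streetnumberto"]))
    number_from = int(float(record["streetnumberfrom"]))
    # divmod gives the integer quotient and the modulo in one step.
    return divmod(number_to - number_from, 100)


# Hypothetical rows shaped like the API response.
rows = [
    {"streetnumberto": "1550", "streetnumberfrom": "1200"},
    {"streetnumberto": "99", "streetnumberfrom": "1"},
]

lengths = [address_length_in_blocks(row) for row in rows]
print(lengths)
```

Note that Python's divmod floors toward negative infinity, so a row where streetnumberto is smaller than streetnumberfrom would need separate handling.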

How can I go about querying a database for non-similar, but almost matching items

How can I go about querying a database for items that are not only identical to a sample, but also those that are nearly identical? Much as search engines work, but only for a small project, preferably in Java. For example:
String sample = "Sample";
I would like to retrieve all the following whenever I query sample:
String exactMatch = "Sample";
String nonExactMatch = "S amp le";
String nonExactMatch_2 = "ampls";
You need to define what "similar" means in terms that your database can understand.
One possibility is Levenshtein distance.
In your example, sample matches...
..."Sample", if you search without case sensitivity.
..."S amp le", if you remove a set of ignored characters (here space only) from both the query string and the target string. You can store the new value in the database:
ActualValue      SearchFor
John Q. Smith    johnqsmith%
When someone searches for "John Q. Smith, Esq." you can boil it down to johnqsmithesq and run
WHERE 'johnqsmithesq' LIKE SearchFor
"ampls" is more tricky. Why is it that 'ampls' is matched by 'sample'? A common substring? A number of shared letters? Does their order count (i.e. are anagrams valid)? Many approaches are possible, but it is you who must decide. You might use Levenshtein distance, or maybe store a string such as "100020010003..." where every digit encodes the number of letters you have, up to 9 (so 3 C's and 2 B's but no A's would give "023...") and then run the Levenshtein distance between this syndrome and the one from each term in the DB:
ActualValue      Search1        Rhymes    abcdefghij_Contains    anagramOf
John Q. Smith    johnqsmith%    ith       00000002110011...      hhijmnoqst
...and so on.
One approach is to ask oneself, how must I transform both searched value and value searched for, so that they match?, and then proceed and implement that in code.
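The transformations discussed above can be sketched in a few lines. This is an illustration in Python rather than Java, and the function names (normalize, levenshtein, letter_counts, anagram_key) are chosen for this example only; which of them you precompute and store in the database depends on the definition of "similar" you settle on.

```python
import re


def normalize(text):
    """Lower-case the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def letter_counts(text):
    """One digit per letter a..z giving how often it occurs, capped at 9."""
    letters = normalize(text)
    return "".join(str(min(letters.count(c), 9)) for c in "abcdefghijklmnopqrstuvwxyz")


def anagram_key(text):
    """Sorted characters of the normalized text; equal keys mean anagrams."""
    return "".join(sorted(normalize(text)))


print(normalize("S amp le"), levenshtein("sample", "ampls"))
print(letter_counts("John Q. Smith"), anagram_key("John Q. Smith"))
```

With these helpers, "Sample" and "S amp le" normalize to the same string, while "ampls" ends up at a small edit distance from "sample", so a threshold on the distance decides whether it counts as a match.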
You can use MATCH ... AGAINST on columns that have a MyISAM full-text index.

Scrape html Twitter followers using R

I have a recurring task that I think can be automated using R.
Using the twitteR package I have extracted a list of tweets. These have been categorized into positive (and neutral) and negative tweets. This has been a manual task, but I am looking into doing some machine learning on it.
My problem is the reach part. I want to know not only the number of positive and negative tweets but also the number of people who have potentially been exposed to each tweet.
There is a way to do this using the twitteR package, but it is slow, as it requires the machine to sleep between each and every search. With thousands of tweets this is not a workable approach for me.
My thought was therefore whether it is possible to fetch the HTML source code of the Twitter page using html <- webpage <- getURL("http://www.twitter.com/AngelHaze") and extract the number of followers from it.
On top of this, I want to be able to do this for a vector of URLs (such as "http://www.twitter.com/AngelHaze") and then combine the results into a data frame with the ScreenName (AngelHaze) and the number of followers. I am from Denmark, so the source code containing the number of followers looks like this:
<a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" title="196.262 følgere" data-nav="followers"
href="/AngelHaze/followers">
Where "196.262 følgere" is the relevant part.
Is this possible? And if so, can anyone help me get going?
Best, Sander Ehmsen.
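As a starting point, the extraction step itself might look like the sketch below. It is written in Python rather than R, works on HTML that has already been downloaded, and assumes the markup looks exactly like the fragment quoted above (a title attribute directly followed by data-nav="followers", with "." as the thousands separator). Twitter's markup may differ or change, and the function names are invented for this illustration.

```python
import re

# Capture the number at the start of the title attribute of the followers link.
FOLLOWERS_RE = re.compile(r'title="([\d.]+)[^"]*"\s+data-nav="followers"')


def follower_count(html):
    """Return the follower count as an int, or None if the pattern is not found."""
    match = FOLLOWERS_RE.search(html)
    if match is None:
        return None
    # Danish formatting uses "." to separate thousands: "196.262" -> 196262.
    return int(match.group(1).replace(".", ""))


def followers_table(pages):
    """Map {screen_name: html} to a list of (screen_name, follower_count) rows."""
    return [(name, follower_count(html)) for name, html in pages.items()]


# Fragment shaped like the one quoted in the question.
sample_html = (
    '<a class="ProfileNav-stat ProfileNav-stat--link js-nav" '
    'title="196.262 følgere" data-nav="followers" href="/AngelHaze/followers">'
)

table = followers_table({"AngelHaze": sample_html})
print(table)
```

The resulting rows have the shape the question asks for: one screen name and one follower count per page, ready to be turned into a data frame.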