How can I harness Wikidata to build a Siri-like service? - json

I'd like to discuss the first part of this Siri-like service.
Ideally, I'd like to be able to query for things like:
"the social network"
"beethoven"
"bad blood taylor swift"
And get results like this:
{type:"film"}
{type:"composer"}
{type:"song"}
I care about nothing else; I find descriptions, images, and general information utterly useless outside Wikipedia. I see Wikidata as a metadata service that can provide me with the semantics of the text I search for.
Do all data structures have "types" or some kind of a property that has to do with its meaning? Is there a list of all the types? Is there a suggestions feature for entities that have double meaning like "apple"? Finally, how can I send a text query and read the "type" of the response data structure?
I know I'm not providing any code, but I really can't wrap my head around Wikidata's API. I've searched everywhere and all I can find are some crippled fetch examples and messed up Objective-C HTML parsers. I can't even get their "example query" page to work because of some error I don't understand.
It's really not newbie-friendly, and it's full of heavy terminology.

The problem with Wikidata's API is that it does not have a query interface. All it does is return information for a specific data item, if you already know the ID. We have simply not been able to build a query interface yet that is powerful enough and able to scale. There is an early beta of a SPARQL endpoint though: https://tools.wmflabs.org/ppp-sparql/.
Once that is up and running, we hope to provide easier to use services on top of this, like Magnus' WDQ http://magnusmanske.de/wordpress/?p=72.
(Edit to answer the concrete questions about the API:)
I've searched everywhere and all I can find are some crippled fetch examples
Documentation could be nicer, but https://www.wikidata.org/wiki/Wikidata:Data_access is a good start. Also note that https://www.wikidata.org/w/api.php is self-documenting. In particular, have a look at https://www.wikidata.org/w/api.php?action=help&modules=wbgetentities and https://www.wikidata.org/w/api.php?action=help&modules=wbsearchentities
Do all data structures have "types" or some kind of a property that has to do with its meaning?
All statements about a data item have to do with its meaning. Many have a statement about the "instance of" (P31) or "subclass of" (P279) property, which is pretty close to what you want, I suppose.
Is there a list of all the types?
No. Wikidata doesn't use a closed, pre-defined ontology to describe the world. It's a platform to describe the world collaboratively, in a machine readable way; from that, a fluid ontology emerges, which is never quite complete or consistent.
Any data item can serve as the class or super-class of another item. An item can be an instance or subclass of multiple classes. The relationships are quite complex.
Is there a suggestions feature for entities that have double meaning like "apple"?
There is a search interface that can list all matching data items for a given term. It's called wbsearchentities, for instance https://www.wikidata.org/w/api.php?action=wbsearchentities&search=apple&language=en (add format=json for machine readable JSON).
However, the ranking in the result is very naive. And without the semantic context of the original sentence, there is no way to find which word sense is meant. This is an interesting area of research called "word sense disambiguation".
Finally, how can I send a text query and read the "type" of the response data structure?
At the moment, you will have to do two API calls: one to wbsearchentities to get the ID of the entity you are interested in, and one to wbgetentities to get the instance-of statement for that entity. It would be nice to combine this in a single call; there's a ticket open for this: https://phabricator.wikimedia.org/T90693
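The two-call flow described above could be sketched roughly as follows. This is a minimal illustration, not an official client: the helper names (search_url, entity_url, first_match_id, instance_of_ids, types_for) and the User-Agent string are invented for this example, and the response shapes are assumed to be the JSON that wbsearchentities and wbgetentities return (a "search" list of matches, and an "entities" map whose "claims" hold the P31 statements).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://www.wikidata.org/w/api.php"
# Placeholder value; identify your own tool here.
USER_AGENT = "type-lookup-sketch/0.1 (example only)"


def search_url(term, language="en"):
    """URL for call 1: list candidate items for a text query."""
    params = {"action": "wbsearchentities", "search": term,
              "language": language, "format": "json"}
    return API + "?" + urlencode(params)


def entity_url(entity_id):
    """URL for call 2: fetch only the statements (claims) of one item."""
    params = {"action": "wbgetentities", "ids": entity_id,
              "props": "claims", "format": "json"}
    return API + "?" + urlencode(params)


def first_match_id(search_response):
    """ID of the top-ranked match, or None if nothing matched."""
    matches = search_response.get("search", [])
    return matches[0]["id"] if matches else None


def instance_of_ids(entity_response, entity_id):
    """Item IDs named by the "instance of" (P31) statements of an item."""
    claims = entity_response["entities"][entity_id].get("claims", {})
    ids = []
    for statement in claims.get("P31", []):
        snak = statement["mainsnak"]
        if snak.get("snaktype") != "value":
            continue  # "novalue" / "somevalue" statements carry no item
        value = snak["datavalue"]["value"]
        # Assumption: fall back to "numeric-id" if "id" is absent.
        ids.append(value.get("id") or "Q{}".format(value["numeric-id"]))
    return ids


def fetch_json(url):
    """Perform the HTTP request (needs network access)."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return json.load(response)


def types_for(term):
    """Both calls combined: text query in, list of P31 item IDs out."""
    entity_id = first_match_id(fetch_json(search_url(term)))
    if entity_id is None:
        return []
    return instance_of_ids(fetch_json(entity_url(entity_id)), entity_id)
```

Calling types_for("beethoven") would return item IDs rather than readable names, so turning them into labels such as "composer" takes a further lookup. Because the sketch blindly takes the first search hit, it inherits the naive ranking mentioned above.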
As to Siri-like services: an early prototype called "wiri" by Magnus Manske has been around for a long time. It uses very simple patterns though: https://tools.wmflabs.org/magnus-toolserver/thetalkpage/
Bene* has been working on a more advanced approach for natural language question answering, see the Platypus Demo: https://projetpp.github.io/demo.html
Just yesterday, he presented a new prototype he has been developing together with Tpt, which generates SPARQL queries from natural language input: https://tools.wmflabs.org/ppp-sparql/
All of these projects are open source, and were created by enthusiastic volunteers. Look at the code and talk to them. :)

Related

How can I handle relational data in Meteor?

I am learning Meteor using the Discover Meteor book.
I come from a PHP and MySQL background, and the application I am thinking of doing as a side-project is a real-time Backgammon web application. While Meteor's reactivity is a very, very big plus, I am stumped on how I can handle relational data (e.g. games, users, tournaments, friends, teams, etc).
I have read a lot of answers (ranging from old to very old) on StackOverflow on how one can use MySQL with Meteor. My search has led me to numtel/meteor-mysql. However, when I look at the examples provided in that repository, it is nowhere near as clean as Meteor's own implementation of MongoDB.
My options, as I understand them, are the following:
Use MongoDB, and rewrite a lot of the features present in an RDBMS in JavaScript
Use an RDBMS that is not as well-supported in Meteor as MongoDB
IMO, option two is much less work, and I think it might lead to fewer problems in the future. Take the problem in the epilogue of Why You Should Never Use MongoDB, for example.
We could also model this data as a set of nested hashes. The set of information about a particular TV show is one big nested key/value data structure. Inside a TV show, there’s an array of seasons, each of which is also a hash. Within each season, an array of episodes, each of which is a hash, and so on. This is how MongoDB models the data. Each TV show is a document that contains all the information we need for one show.
But then, how would you query for the TV shows that someone has starred in?
Back to my original question: is there something I'm missing here? Handling relational data is something that a lot of applications will need to do, but I can't seem to find a clean solution.
It will be much less work if you go with option 1 in my opinion.
It won't be difficult to learn to use MongoDB, and since MongoDB uses JSON objects and is supported natively by Meteor and all its packages, it will be much less work.
I advise having a look at the aldeed packages collection2 and simple-schema to structure your collections. I also advise using the collection-helpers package to help with joins.
If you have a posts collection with name, authorId and content fields, then to get the author of a post, you'd write Meteor.users.findOne(post.authorId).
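The idea behind that lookup is language-neutral: each post stores only the ID of its author, and the "join" is a second lookup by that ID. Here is a minimal sketch in plain Python (this is not Meteor code; the users and posts lists, the helper names, and the sample records are all invented for illustration):

```python
# Two "collections", each a list of documents (plain dicts).
users = [
    {"_id": "u1", "username": "alice"},
    {"_id": "u2", "username": "bob"},
]
posts = [
    {"_id": "p1", "name": "First game", "authorId": "u2", "content": "..."},
    {"_id": "p2", "name": "Rematch", "authorId": "u1", "content": "..."},
]

# Index users by ID once so each lookup is a single dictionary access.
users_by_id = {user["_id"]: user for user in users}


def author_of(post):
    """Resolve a post's author from the ID reference stored on the post."""
    return users_by_id.get(post["authorId"])


def posts_by(user_id):
    """The reverse direction: every post that references a given user."""
    return [post for post in posts if post["authorId"] == user_id]


print(author_of(posts[0])["username"])      # bob
print([p["name"] for p in posts_by("u1")])  # ['Rematch']
```

The reverse lookup is the same shape of query as "which TV shows has this person starred in": it works as long as the relationship is stored as an ID reference rather than buried inside one nested document.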
Hope that clears things up a bit and gets you on your way.

Extracting 'useful' information out of sentences?

I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is identify what the problem was (in this case, the set-top box) and whether the action that was taken resolved it (in this case, yes, because restarting fixed the problem). If all the sentences were of this form, my life would be easier, but because this is natural language, the sentences could also be of the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine
So in this case, the problem was with the car. The action taken did not resolve the problem because of the presence of the word suspect. And the potential problem could be with the engine.
I am not looking for an absolute answer, as I suspect this is very complex. What I am looking for is rather a high-level overview that will point me in the right direction. If there is an easier or alternate way to do this, that is welcome as well.
Really, the best you can hope for is a Naive Bayesian Classifier with a sufficiently large training set (probably larger than the one you have), and you must be willing to tolerate a fair rate of false determinations.
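To make the classifier idea concrete, here is a small hand-rolled sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not NLTK's implementation, and the class name, the "resolved"/"unresolved" labels, and the four training sentences are invented for illustration; a usable model would need far more labelled data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [w for w in words if w]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def classify(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in tokenize(text):
                if word in self.vocabulary:  # words never seen are ignored
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data, only to show the mechanics.
training = [
    ("Restarting the set-top box solved the problem", "resolved"),
    ("Replacing the cable fixed the issue", "resolved"),
    ("I suspect there is something wrong with the engine", "unresolved"),
    ("The problem is still there after the restart", "unresolved"),
]
model = NaiveBayes()
model.train(training)
print(model.classify("Rebooting the router fixed the problem"))  # resolved
```

With so little data the decision hinges on a handful of words such as "fixed" or "suspect", which is exactly why a fair rate of false determinations has to be expected.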
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.
If the sentences are well-formed, I would probably experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence, and you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2). That could help you to extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try a chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It's usually used to extract instances of places, organizations or people's names, but it could work in your case as well.
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them out to ease the job of your domain expert (too many phrases could overwhelm a judge). You may carry out a frequency analysis on your phrases, remove very frequent ones that are not usually related to the problem domain, or compile a white-list and keep the phrases that contain a pre-defined set of words, etc.
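The filtering step above can be sketched with nothing but the standard library. The function names, the 0.3 threshold, and the sample phrases are assumptions made for this illustration; the phrases would really come from whatever chunker or parser you settle on.

```python
from collections import Counter


def drop_very_frequent(phrases, max_share=0.3):
    """Remove phrases that make up more than max_share of all occurrences."""
    counts = Counter(phrases)
    total = len(phrases)
    return [p for p in phrases if counts[p] / total <= max_share]


def keep_whitelisted(phrases, whitelist):
    """Keep only phrases that contain at least one white-listed word."""
    allowed = {word.lower() for word in whitelist}
    return [p for p in phrases if allowed & set(p.lower().split())]


# Invented sample of extracted noun phrases.
phrases = ["the problem", "set-top box", "the problem", "the television",
           "the problem", "engine", "the problem", "set-top box"]

less_noise = drop_very_frequent(phrases)
print(less_noise)
# ['set-top box', 'the television', 'engine', 'set-top box']

print(keep_whitelisted(less_noise, {"box", "engine"}))
# ['set-top box', 'engine', 'set-top box']
```

Both filters are deliberately crude; the threshold and the white-list are the knobs a domain expert would tune.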

First write code using the API, then the actual API: does this approach have a name, and is it valid for the API design process?

The standard way of working on a new API (library, class, whatever) usually looks like this:
you think about what methods the API user would need
you implement the API that you suspect the user will need
So basically you are trying to guess what your API should look like. This very often leads to over-engineering: huge APIs that you think the user will need, with a good chance that a large part of your code won't be used at all.
Some time ago, maybe even a few years, I read an article that promoted writing client code first. I don't remember where I found it, but the author pointed out several advantages, like a better understanding of how the API will be used, what it should provide, and what is basically obsolete. I think the idea was that it goes along with the Scrum methodology and user stories, but at the implementation level.
Just out of curiosity, for my latest private project I started not with the actual API (some kind of toolkit library) but with client code that would use this API. Of course my code is all in red because the classes, methods, and properties do not exist, and I can forget about help from IntelliSense, but what I noticed is that after a few days of coding my application "has" all its basic functionality and my library API "is" a lot smaller than I imagined when starting the project.
I'm not saying that if somebody took my library and started using it, it wouldn't lack some features, but I think it helped me realize that my idea of this API was somewhat flawed, because I usually try to cover all bases and provide methods "just in case". And sometimes that bites me badly, because I make some stupid mistake in the basic functions while being more focused on code that somebody might need.
So what I would like to ask is: have you ever tried this approach when you needed to create a new API, and did it help you? Is it a recognized technique that has a name?
So basically you're trying to guess what your API should look like.
And that's the biggest problem with designing anything this way: there should be no (well, minimal) guesswork in software design. Designing an API based on assumptions rather than actual information is dangerous, for several reasons:
It's directly counter to the principle of YAGNI: in order to get anything done, you have to assume what the user is going to need, with no information to back up those assumptions.
When you're done, and you finally get around to using your API, you'll invariably find that it sucks to use (poor user experience), because you weren't thinking about how the library is used (UX), you were thinking about what the library must do (features).
An API, by definition, is an interface for users (i.e., developers). Designing it as anything else just makes for a bad design, without fail.
Writing sample code is like designing a GUI before writing the backend: a Good Thing. It forces you to think about user experience and practical effects of design decisions without getting bogged down in useless theorising and assumption.
And contrary to Gabriel's answer, this is not bottom-up design: it's top-down. Rather than design the concrete backend of your library and then force an abstract interface on top of it, you first design the interface and then worry about the implementation.
Generally speaking, the idea of designing the concrete first and abstracting from that afterwards is called bottom-up design. Test Driven Development uses a principle similar to what you describe to support better design. First you write a test, which is a use of the code you are going to write afterwards. It is important to proceed stepwise, because you have to prove the API is implementable. An important part of each step is refactoring: this allows you to design a more concise API and reuse parts of your code.
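As a small illustration of writing the usage first, the sketch below puts the client code ahead of the class it calls. The Playlist API is hypothetical and exists only for this example; the point is that the implementation contains exactly what the client code asked for and nothing "just in case".

```python
# Step 1: client code, written first. It records how we *want* to call
# the (hypothetical) Playlist API before any of it exists.
def client_code():
    playlist = Playlist("Road trip")
    playlist.add("Song A", seconds=200)
    playlist.add("Song B", seconds=100)
    return playlist.total_seconds(), playlist.titles()


# Step 2: the smallest implementation that makes the client code work.
class Playlist:
    def __init__(self, name):
        self.name = name
        self._tracks = []

    def add(self, title, seconds):
        self._tracks.append((title, seconds))

    def total_seconds(self):
        return sum(seconds for _, seconds in self._tracks)

    def titles(self):
        return [title for title, _ in self._tracks]


# Step 3: the client code doubles as the first test.
total, titles = client_code()
assert total == 300
assert titles == ["Song A", "Song B"]
```

In a real project step 1 would usually live in a test file, and steps 1 to 3 would repeat in small increments with refactoring in between.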

How do I find methods?

Here's a somewhat general computer question. I've always been able to follow the LOGIC of programming, but when I go to code something, I always find that I don't know some method or another to get what I need to get done. When I see it, I always think, "OF COURSE!".
How do you go about finding relevant methods for your programming needs that are "built-in?" I don't enjoy re-inventing the wheel, but I find it difficult to find what I need to do what I want to do.
First try Google:
You can use Google to search for the method you need. For example, if I want to search for a value in an array in PHP, I go to Google and type "Search values in array in PHP". The function I need is the first result.
Then try Standard Documentation:
Try the standard documentation to search for the method you need. For example, if my problem is related to strings in PHP, I go to the String Functions documentation and find the required function.
Finally try Stackoverflow:
Otherwise you can ask about your problem on Stackoverflow to find the methods and libraries you need. You will always get the shortest way.
What you are asking here is for the best way to do research. Well, that's a hard skill to explain, even more so to teach.
Nevertheless here are some tips:
1. Go to a search engine. It makes no sense to start in a place like MSDN, since all of its content is indexed by the search engines anyway. Phrase your question several different ways.
2. As you learn more about the issue you will learn new vocabulary about it. Use that new vocabulary to do even more searches.
3. If the searches turn out empty, switch to browsing a specific section of the official documentation that you think is the most related to what you are doing. If nothing else, it will expand your horizons around the issue and give you more vocabulary to do more searches.
4. Finally, if all else fails ask a question on StackOverflow explaining what you want to do as clearly as possible.
Note that if there's a simple API that does what you need, you will rarely reach step 4.
You say:
It's very frustrating to suddenly find an "easy" button mid-way through.
Try to see it differently. Think of these moments as blessings. You've just learned something. You invested a lot of effort - and instead of seeing that effort as wasted, see it as critical to proper learning. You - better than the guy who just happened across the magic method - really understand what it's for and something about how it works. And you really, really, understand why you need it, and you properly appreciate its value. You're never going to forget that method.
So it was costly, but you learned something important. Celebrate, and move on.
It is usually included in some form of documentation. Most IDEs support the documentation format and give you auto-complete functionality.
If you are using MVS, MSDN is really good for this.
In addition to the answers above, Google's basic and advanced search tips prove very helpful.
In addition to the above, changing the order of the keywords in a search also changes the order of the results.
In essence I believe that searching is still an art rather than a science, and is best learnt - quoting from David Reis' answer above: "2. As you learn more about the issue you will learn new vocabulary about it. Use that new vocabulary to do even more searches."
Search in the API documentation. But the best way, I have found, is to search the internet for multiple solutions and then choose the one that you think is best. Make your search as narrow as possible. For example, if you want to implement a random number generation function, search like this: "How to generate random numbers in Java?".
Namespaces, naming conventions, Autocomplete/IntelliSense
I assume that you are trying to find some kind of object-oriented API. I use .NET in my example.
First, try to find a class that might be responsible for the method you are looking for.
Example: If you want to "Make a new Directory in the Filesystem" you must know (or learn) that (in .NET) these classes are in the namespace System.IO.
This namespace contains sub-namespaces like Compression and classes like File, Path, Directory, ...
Second, you should know the naming conventions. There are common naming prefixes for methods like Get, Set, Insert, Create. In the documentation for the class Directory you will find a CreateDirectory method.
If you have an intelligent editor that knows your programming language and its classes and namespaces, learning is much easier. In the .NET world this feature is called Autocomplete/IntelliSense.
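The same "guess the namespace, then scan the names" routine works without an IDE in languages that support introspection. The sketch below is a Python analogue of the .NET example above, not part of it: find_members and summary are helper names made up here, while dir, getattr, callable, and inspect.getdoc are standard Python.

```python
import inspect
import os


def find_members(module_or_class, keyword):
    """Public callables whose name contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(
        name for name in dir(module_or_class)
        if not name.startswith("_")
        and keyword in name.lower()
        and callable(getattr(module_or_class, name))
    )


def summary(obj):
    """First line of an object's docstring, or "" if it has none."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else ""


# "I want to make a new directory": the os module is a reasonable guess,
# so list everything in it that mentions "dir".
for name in find_members(os, "dir"):
    print(name, "-", summary(getattr(os, name)))
```

The listing includes mkdir and makedirs, which is the "of course!" moment the question describes, reached by browsing a likely namespace rather than by already knowing the name.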

Creative Terminology

I seem to use bland words such as node, property, children (etc) too often, and I fear that someone else would have difficulty understanding my code simply because the parts' names are vague, common words.
How do you find creative names for classes and components to make them more memorable?
I am particularly having trouble with generic tools which have no real description except their rather generic functional purpose. I would like to know if others have found creative ways to name things rather than simply naming them by their utility, such as AnonymousFunctionWrapperCallerExecutorFactory.
It's hard to answer. I find them just because they seem to 'fit'.
What I do know, however, is that I find it basically impossible to move on writing code unless something is named correctly, and it 'feels' good. If it isn't named right, I find it hard to use, and the code is generally confusing.
I'm not too concerned about something being 'memorable', only 'accurate'.
I have been known to sit around thinking out loud about what to name something. Take your time, and make sure you are really happy with the name. Don't be afraid of using common/simple words.
I don't really have an answer, but three things for you to think about.
The late Phil Karlton famously said: "There are only two hard problems in computer science. Cache Invalidation and Naming Things." So, the fact that you are having trouble coming up with good names is entirely normal and even expected.
OTOH, having trouble naming things can also be a sign of bad design. (And yes, I am perfectly aware, that #1 and #2 contradict each other. Or maybe one should think of it more like balancing each other.) E.g., if a thing has too many responsibilities, it is pretty much impossible to come up with a good name. (Witness all the "Service", "Util", "Model" and "Manager" classes in bad OO designs. Here's an example Google Code Search for "ManagerFactoryFactory".)
Also, your names should map to the domain jargon used by subject matter experts. If you can't find a subject matter expert, that's a sign that you are currently worrying about code that you're not supposed to worry about. (Basically, code that implements your core business domain should be implemented and designed well, code in ancillary domains should be implemented and designed so-so, and all other code should not be implemented or designed at all, but bought from a vendor, where what you are buying is their core business domain. [Please interpret "buy" and "vendor" liberally. Community-developed Free Software is just fine.])
Regarding #3 above, you mentioned in another comment that you are currently working on implementing a tree data structure. Unless your company is in the business of selling tree data structures, that is not a part of your core domain. And the reason that you have trouble finding good names could be that you are working outside your core domain. Now, "selling tree data structures" may sound stupid, but there are actually companies that do that. For example, the BCL team inside Microsoft's developer division: they actually sell (well, for certain definitions of "sell", anyway) the .NET framework's Base Class Libraries, which include, among others, tree data structures. But note that for example Microsoft's C++ compiler team actually (literally) buys their STL from a third-party vendor – they figure that their core domain is writing compilers, and they leave the writing of libraries to a company who considers writing STLs their core domain. (And indeed, AFAIK, that company does nothing but write and sell STL implementations. That's their sole product.)
If, however, selling tree data structures is your core domain, then the names you listed are just fine. They are the names that subject matter experts (programmers, in this case) use when talking about the domain of tree data structures.
Using 'metaphors' is a common theme in agile (and pattern) literature.
'Children' (in your question) is an example of a metaphor that is extensively used and for good reasons.
So, I'd encourage the use of metaphors, provided they are applicable and not a stretch of the imagination.
Metaphors are everywhere in computing. From files to bugs to pointers to streams... you can't avoid them.
I believe that for the purpose of standardization and communication, it's good to use a common vocabulary, as is the case for design patterns. I have a problem with a programmer who keeps 'inventing' his own terms, and I have trouble understanding him. (He kept using the term 'events orchestrating' instead of 'scripting' or 'FCFS process'. Kudos for creativity though!)
That common vocabulary describes things we are used to. A node is a point, somewhere in a graph, in a tree, or what-not. One way is to be specific to the domain. If we are doing a mapping problem, instead of 'node', we can use 'location'. That helps in a sense, at least for me. So I find there is a need to balance being able to communicate with other programmers against keeping the descriptor specific enough to help me remember what it does.
I think node, children, and property are great names. I can already guess the following about your classes, just by their "bland" names:
Node - this class is part of a graph of objects
children - this variable holds a list of nodes belonging to the containing node.
I don't think "node" is either vague or common, and if you're coding a generic data structure, it's probably ok to have generic names! (With that being said, if you are coding up a tree, you could use something like TreeNode to emphasize that the node is part of a tree.) One way you can make the life of developers who will use your API easier is to follow the naming conventions of your platform's built in libraries. If everyone calls a node a node, and an iterator an iterator, it makes life easy.
Names that reflect the purpose of the class, method or property are more memorable than creative ones. Modern IDEs make it easier to use longer names, so feel free to be descriptive. Getting creative won't help as much as getting accurate.
I recommend picking nouns from a specific application domain. E.g. if you are putting cars in a tree, call the node class Car; the fact that it is also a node should be apparent from the API. Also, don't try to be too generic in your implementation: don't put all attributes of the car into a hashtable named properties, but create separate attributes for make, color, etc.
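A minimal sketch of that advice, written in Python for brevity. The Car fields beyond make and color, and the add_child helper, are assumptions made for the example: the class is named after the domain object, its tree role shows only through parent and children, and make and color are real attributes rather than entries in a generic properties table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Car:
    """A domain object that also happens to be a node in a tree."""
    make: str
    color: str
    # Tree plumbing: excluded from repr and comparison so that printing
    # or comparing cars does not walk the whole tree.
    parent: Optional["Car"] = field(default=None, repr=False, compare=False)
    children: List["Car"] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, child: "Car") -> "Car":
        """Attach a car below this one and return it for convenience."""
        child.parent = self
        self.children.append(child)
        return child


root = Car(make="Volvo", color="red")
leaf = root.add_child(Car(make="Saab", color="blue"))
print(leaf.parent.make)                 # Volvo
print([c.color for c in root.children]) # ['blue']
```

Compare root.color with the generic alternative root.properties["color"]: the first can be auto-completed and checked by tools, the second only fails at run time when a key is misspelled.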
A lot of languages and coding styles like to use all sorts of descriptive prefixes. In PHP there are no clear types, so this may help greatly. Instead of doing
$isAvailable = true;
try
$bool_isAvailable = true;
It is admittedly a pain, but usually well worth the time.
I also like to use long names to describe things. It may seem strange, but it is usually easier to remember, especially when I go back to refactor my code:
$leftNode->properties < $leftTreeNode->arrayOfNodeProperties;
And if all else fails, why not fall back on a solid Star Wars themed program?
$luke->lightsaber($darth[$ewoks]);
And lastly, in college I named my classes after my professor, and my class methods after all the things I wanted to do to that jerk.
$Kube->canEat($myShorts, $withKetchup);