Coding a domain specific text generator - generator

A friend of mine is in the real estate business and after being showed the art of writing copy for real estate ads, I realized that it is very formulaic. Especially when advertising online as there are predefined fields you fill in.
Naturally, I thought about creating a generator that pretty much automates writing the ads. i don't expect it to generate outstanding or even very good copy, just that it can put together words and sentences like a human would.
I have a skeleton/template that defines an ad and I've also put together a set of phrases and words that can be randomly selected, but I am interested in more general aspects of coding such a generator? Any suggestions, tips or literature that I can read to better understand this little project better?

using metadata about the listing would be one way.
Say for a given house, you have these attributes:
(type: bungalo, sq feet: <= 1400) You could use the phrase "cozy cottage".
bedrooms: obvious, same thing with bathrooms. Assume using the word Large, medium, etc.
garage spots: if > 2 then "Can park many vehicles", etc.
You could go even further with this given the lat/lon for the address, there are web services that you can find the amount of parks nearby, crime in the neighborhood, etc.
Rick

I'd say there are three basic approaches you could take to a problem like this, depending on how flexible you want the system to be and on how much work you want to put into it. The simplest is to treat it as a report generation problem, along the lines of Rick's suggestion. That's probably the way I'd go to produce a first draft of a listing. The results would be pure boilerplate, but each listing could be quickly punched up by the copywriter.
If you wanted to get fancy, though, you could come at it as a natural language generation problem. You'd start with some kind of a knowledge representation describing the meaning of the listing and set of rules (finite state transducers, say) for mapping meanings to linguistic forms. There's a sizable academic literature on that kind of stuff, though it's kind of out of fashion these days. Places to start might be Blackburn & Bos's book or the NLTK suite (especially some of the projects in the contrib package).
The third way of doing it would be to treat it as a translation problem, essentially "translating" database entries into ad copy. You'd start with a large collection of listings and the corresponding human-written ads and construct a statistical model of the relationship between the two. Moses/Giza++ is a general purpose tool for building and applying such models.

Related

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify core characteristics of them. For example, do they take the user's location?, do they share/sell with third parties?, etc.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line: "We will take your location.", that line could be a cue with 100% confidence that that privacy policy includes taking of the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic.. For example, the presence of the word "location" might increase the likelihood that the user's location is store by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using. Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
A very interesting problem indeed!
On a higher level, what you want is summarization- a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents- I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
I would approach this as a machine learning problem where you are trying to classify things in multiple ways- ie wants location, wants ssn, etc.
You'll need to enumerate the characteristics you want to use (location, ssn), and then for each document say whether that document uses that info or not. Choose your features, train your data and then classify and test.
I think simple features like words and n-grams would probably get your pretty far, and a dictionary of words related to stuff like ssn or location would finish it nicely.
Use the machine learning algorithm of your choice- Naive Bayes is very easy to implement and use and would work ok as a first stab at the problem.

Coding Competition, language agnostic guidelines?

I might be doing a coding competition soon, I was wondering if anyone made one and what where the guidelines/ process.
I'd like to make the competition appealing to all devs, and I m trying to come up with ideas as to how.
the scenario is: There is an event running and we(of the coding competition) will have a room that we can use (either to code or for questions, etc), however, ideally the task for the competition should be assignet and they should eb able to go and do other things, if they are so inclined.
what i wonder is what kind of challenges to give, and most importantly, what is the criteria to "win" teaching and learning good coding standards takes a looong time, and I d like to think that if you ve been coding for longer you ll do things right and quick... but in a competition, you would be cutting corners...
I would really appreciate your input on this
A competition that's appealing to all devs? That sounds... difficult. But if you want to make the competition about problem solving and algorithms, then I am a big fan of the Sphere Online Judge. Basically this is a repository of programming puzzles but you can also become a problem setter and create problems or contests on the site.
It supports a huge number of programming languages, from the "popular" ones to more obscure ones. Programs will typically read from standard in, and write to standard out. The standard judge program will simply diff a program's output with the expected output, but more elaborate judges are possible. You also set a time limit for the execution of submissions, which usually requires programmers to be more clever than brute force.
Winner is whoever solved the most problems. Ties are broken by time of correct submissions, with some time penalty for wrong submissions.
Guidelines
Limit the languages which can be submitted. If you don't, you may get proprietary languages which require a special purchased compiler or some other inconvenience.
Correctness
This is easy. Provide an easy-to-read unit test in all languages you will accept. This will allow simple, automatic testing of submissions, and will guide the interface of the solution.
Challenges
Create a theme. Make it focused, but not too specific as to require certain paradigms or language features. Then develop challenges based around that theme.
Assign points to each challenge. Give more points to more difficult problems. Be sure to review each challenge carefully and have a team attempt them before giving points so you can make a more accurate decision.
As #miorel mentioned in his answer, time limits and memory limits are wonderful. Set a time limit per test per challenge, or at least monitor them and have these metrics contribute to the points given for solving the problem.
You should look at the ACM competition. Each year they have collegiate programming competitions. These are language agnostic. The archive is located here.
http://www.ntnu.edu.tw/acm/
To start improve yourself, you need a project you can work on, a problem you want to solve, something you want to achieve. Without any context and a destination you want to end, you won't be able to learn all the necessary methods and all the connections of a language.
There is a competition, taking place 2 times a year, it is called ludum dare.
It also doesn't matter which language you are writing, you just have to create a game within 48h ( compo, just one person and all assets created by yourself ) and 72h ( jam, a team working together, can purchase assets ). After the competition when every uploaded his game, the voting starts. This will take like 20 days where everybody can upvote your game or you can upvote other peoples' games. There are taking part approximately 3000 people.
Every time the competition starts, there is a voting on 5 days in sequence. Everyday you vote on a set of themes which can be possibly the theme you will have to create a game for. My last competition had the theme "unconventional weapon". After the voting ended, the competition starts and you have to think about a game with (in my case ) an unconventional weapon and start coding a game you like.
This is not about being the best, you should start looking at other peoples projects after the competition ended. You can learn a lot of other people, ways they solve their problem and I am sure you would improve your self every time taking place in such a contest.
It's gonna be hard to designed a coding competition suitable for a wide array of languages, since languages typically serve different purposes. I'd suggest that what you're looking for doesn't exist.

Estimating Flash project hours

I'm trying to estimate the hours required to build a group of 5 simple children's games in Flash. They will include such things as having kids drag and drop healthy food items into a basket; choosing the healthy and unhealthy food items by marking them in some way; etc.
I have no experience building games in Flash, but I have programmed in Flex and Actionscript. How many hours do you estimate for this project?
While your ActionScript background will help, I find Flash to be a VERY different experience from Flex and that proficiency in one environment does not translate well.
Is there a compelling reason not to use Flex? I think you would likely be much more efficient.
That aside, the mechanics of a simple drag and drop game could be put together fairly quickly. There are some good examples of basic drag and drop around. It can be a little tricky to get the mouse coordinates right if it is your first time.
That aside, there are other hidden costs you need to remember. Connecting infrastructure for example. Are the games connected in some way? Is there a running score or persistence that might imply authentication? Is there a story?
Also, If your forte is programming, don't underestimate that challenges of obtaining or creating art and sound assets.
Before you can estimate the time you'll need to break down what the games do. In other words, you'll need to write up very clear and definite requirements. You may even need to write up specifications. Once you've analyzed what the software should do, the estimate will also take a while - one part for example is figuring out whether there's already software that does what you want.
In my opinion, the best way you can possibly estimate a programming project, especially one in a technology you don't understand, would be to apply the Use Case Points methodology. Basically you break the project up into use-cases (what the users are trying to do) and actors (the user types and the system itself) and then list a few team and environment factors (how big an issue is code re-use, how familiar are you with the lanaguage, etc.) Studies have shown that it's more accurate for inexperienced developers than estimating based on features alone.
A google search for "use case points estimate" reveals many useful links. This explanation of the methodology seems to do a good job explaining how it works, though I've not read the entire thing. This worksheet will help when you're ready to start listing points.

How do I explain APIs to a non-technical audience?

A little background: I have the opportunity to present the idea of a public API to the management of a large car sharing company in my country. Currently, the only options to book a car are a very slow web interface and a hard to reach call center. So I'm excited of the possiblity of writing my own search interface, integrating this functionality into other products and applications etc.
The problem: Due to the special nature of this company, I'll first have to get my proposal trough a comission, which is entirely made up of non-technical and rather conservative people. How do I explain the concept of an API to such an audience?
Don't explain technical details like an API. State the business problem and your solution to the business problem - and how it would impact their bottom line.
For years, sales people have based pitches on two things: Features and Benefit. Each feature should have an associated benefit (to somebody, and preferably everybody). In this case, you're apparently planning to break what's basically a monolithic application into (at least) two pieces: a front end and a back end. The obvious benefits are that 1) each works independently, so development of each is easier. 2) different people can develop the different pieces, 3) it's easier to increase capacity by simply buying more hardware.
Though you haven't said it explicitly, I'd guess one intent is to publicly document the API. This allows outside developers to take over (at least some) development of the front-end code (often for free, no less) while you retain control over the parts that are crucial to your business process. You can more easily [allow others to] add new front-end code to address new market segments while retaining security/certainty that the underlying business process won't be disturbed in the process.
HardCode's answer is correct in that you should really should concentrate on the business issues and benefits.
However, if you really feel you need to explain something you could use the medical receptionist analogue.
A medical practice has it's own patient database and appointment scheduling system used by it's admin and medical staff. This might be pretty complex internally.
However when you want to book an appointment as a patient you talk to the receptionist with a simple set of commands - 'I want an appointment', 'I want to see doctor X', 'I feel sick' and they interface to their systems based on your medical history, the symptoms presented and resource availability to give you an appointment - '4:30pm tomorrow' - in simple language.
So, roughly speaking using the receptionist is analogous to an exterior program using an API. It allows you to interact with a complex system to get the information you need without having to deal with the internal complexities.
They'll be able to understand the benefit of having a mobile phone app that can interact with the booking system, and an API is a necessary component of that. The second benefit of the API being public is that you won't necessarily have to write that app, someone else will be able to (whether or not they actually do is another question, of course).
You should explain which use cases will be improved by your project proposal. An what benefits they can expect, like customer satisfaction.

Seating plan software recommendations (does such a beast even exist?)

I'm getting married soon and am busy with the seating plan, and am running into the usual issues of who sits where: X and Y must sit together, but A and B cannot stand each other etc.
The numbers I'm dealing with aren't huge (so the manual option will work just fine), but being of the geeky persuasion, I was wondering if there was any software available to do this for me?
Failing an exact match, what should I look for (the problem space, books, reference code) to tweak for my purposes?
I am the developer of PerfectTablePlan. I post here as well as Joel's Business of Software . ;0)
Combinatorial problems, such as seat assignment, are quite nasty algorithmically. NP-hard in fact. The number of ways to seat 60 guests in 60 seats is 60! (60 factorial) and that is more than the number of atoms in the known universe.
PerfectTablePlan allows you to specify that A must sit next to B, but nowhere near C. It uses a genetic algorithm to automatically the assign seats. This works pretty well in practice - it will usually find a decent solution for 100 guests in a few seconds. You might need to make a coffee for 1000+ guests. In practice some drag and drop fine-tuning is also usually required to cope with the vagaries of local customs and family politics (Uncle Bob is a bit deaf, we had better put him nearer the top table).
You can find out a bit more about the genetic algorithm here.
Ps/ The automatic seat assignment is only a small part of creating a good seating plan. See the PerfectTablePlan tour and the tips page for more details.
http://www.perfecttableplan.com/
I believe this is from a guy that usually posts at Joel On Software.
Never tried it though.
Hope it helps.
Try modeling this using GLPK. Integer linear programming is amenable to introducing constraints into graph-based problems with multiple possible outcomes.
My personal favorite is to not assign seating: allow folks to sit wherever they want.
That might lead to [un]intentional cliquishness, but it means you're not having to worry about it.
I expect this isn't a great answer, but you could research flocking behavior
If you take away the random jitters at each step, the flock eventually settles where each member has found its optimum position in relation to the rest of the group.
This sounds like a constraint satisfaction problem. You should probably check out logic programming systems that are also equipped with constraints-solvers. They're usually like prolog, only they are actually declarative for problems that are soluble by their solvers.
Hopefully there is one that has an easy interface from your favourite language, to get the data in and out.
I used a program a while ago that would fit this perfectly... it was a Java App, you could define rules, and it would create test cases that satisfied the rules. The file extension was .als
fact GateRules {
all g:Gate | one g.loc // Gates have 1 Location
I'll keep wracking my brain for the program name.
EDIT: It was Alloy
Now that I think about it, it may not be ideal - the notion of "seats in a fixed configuration" would be a little difficult to model. I used it differently: defining rules (an airport gate is in one location, only one plane is on a runway), and testing pre and post conditions for functions (after I land a plane, can i even have more than one plane on a runway?).