Interesting NLP/machine-learning style project -- analyzing privacy policies - language-agnostic

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they collect the user's location? Do they share or sell data with third parties? And so on.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy involves collecting the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the mere presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues and their associated confidence levels to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email spam-filtering systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project that touches on artificial intelligence, specifically machine learning and NLP.

The idea would be to keep developing these cues and their associated confidence levels to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email spam-filtering systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
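A minimal sketch of that pipeline in Python with scikit-learn (the documents, labels, and library choice here are illustrative assumptions, not part of the original answer):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    # Hand-labeled policies: each document gets the set of characteristics it exhibits.
    docs = [
        "We will take your location and share it with partners.",
        "We never sell personal data to third parties.",
        "Your location may be stored to improve the service.",
    ]
    labels = [{"collects_location", "shares_third_party"},
              set(),
              {"collects_location"}]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)              # binary indicator matrix, one column per label

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features, tf-idf weighted
        OneVsRestClassifier(LinearSVC()),      # one binary SVM per label
    )
    clf.fit(docs, Y)

    print(mlb.inverse_transform(clf.predict(["We may sell your information to advertisers."])))

With only three toy documents the predictions are meaningless, but the structure is the same once you have a few hundred hand-labeled policies.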

A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
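MALLET is a Java toolkit; if you would rather stay in Python, gensim's LdaModel is a comparable starting point. A minimal sketch (the toy documents and parameter choices below are assumptions for illustration):

    from gensim import corpora, models

    policies = [
        "we collect your location and share it with advertising partners",
        "we do not sell personal information to third parties",
        "location data is stored to improve the service",
    ]
    texts = [p.split() for p in policies]      # for real data: tokenize properly and drop stopwords

    dictionary = corpora.Dictionary(texts)     # maps each word to an integer id
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)

You would then inspect the topics by hand to see which ones correspond to location, third-party sharing, and so on.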

I would approach this as a machine learning problem where you are trying to classify things in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether it uses that info or not. Choose your features, train on your labeled data, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to things like SSNs or location would finish it nicely.
Use the machine learning algorithm of your choice; Naive Bayes is very easy to implement and use and would work OK as a first stab at the problem.
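As a rough first stab along those lines, NLTK's built-in Naive Bayes classifier with a handful of dictionary-style features could look something like this (one binary classifier per characteristic; the feature words and training examples are invented):

    import re
    import nltk

    LOCATION_WORDS = {"location", "gps", "geolocation", "whereabouts"}

    def features(text):
        words = set(re.findall(r"[a-z]+", text.lower()))
        return {
            "has_location_word": bool(words & LOCATION_WORDS),
            "mentions_share_or_sell": "share" in words or "sell" in words,
        }

    # (document, does-it-collect-location) pairs, labeled by hand
    train = [
        ("We will take your location.", True),
        ("We may share your gps position with partners.", True),
        ("We only store your email address.", False),
    ]

    classifier = nltk.NaiveBayesClassifier.train(
        [(features(doc), label) for doc, label in train])

    print(classifier.classify(features("Your location is stored on our servers.")))
    classifier.show_most_informative_features()

You would train one such classifier per characteristic (location, SSN, third-party sharing, ...) and run each document through all of them.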

Related

Coding a domain specific text generator

A friend of mine is in the real estate business, and after being shown the art of writing copy for real estate ads, I realized that it is very formulaic, especially when advertising online, as there are predefined fields you fill in.
Naturally, I thought about creating a generator that pretty much automates writing the ads. I don't expect it to generate outstanding or even very good copy, just that it can put together words and sentences like a human would.
I have a skeleton/template that defines an ad, and I've also put together a set of phrases and words that can be randomly selected, but I am interested in the more general aspects of coding such a generator. Any suggestions, tips, or literature I can read to understand this little project better?
Using metadata about the listing would be one way.
Say for a given house, you have these attributes:
(type: bungalow, sq feet: <= 1400) you could use the phrase "cozy cottage".
bedrooms: obvious; same thing with bathrooms. Assume using words like "large", "medium", etc.
garage spots: if > 2, then "can park many vehicles", etc.
You could go even further with this: given the lat/lon for the address, there are web services that can tell you the number of parks nearby, crime in the neighborhood, etc.
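A minimal sketch of this rule-based approach (the attribute names and phrase rules are made up for illustration):

    def describe(listing):
        phrases = []
        # rule: small bungalows get the "cozy cottage" treatment
        if listing["type"] == "bungalow" and listing["sq_feet"] <= 1400:
            phrases.append("Cozy cottage")
        else:
            phrases.append(f"Spacious {listing['sq_feet']} sq ft {listing['type']}")
        phrases.append(f"{listing['bedrooms']} bedrooms, {listing['bathrooms']} bathrooms")
        # rule: lots of garage spots becomes a selling point
        if listing.get("garage_spots", 0) > 2:
            phrases.append("Room to park many vehicles")
        return ". ".join(phrases) + "."

    print(describe({"type": "bungalow", "sq_feet": 1200,
                    "bedrooms": 2, "bathrooms": 1, "garage_spots": 3}))

Each new attribute or web-service lookup just becomes another rule that appends a phrase.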
Rick
I'd say there are three basic approaches you could take to a problem like this, depending on how flexible you want the system to be and on how much work you want to put into it. The simplest is to treat it as a report generation problem, along the lines of Rick's suggestion. That's probably the way I'd go to produce a first draft of a listing. The results would be pure boilerplate, but each listing could be quickly punched up by the copywriter.
If you wanted to get fancy, though, you could come at it as a natural language generation problem. You'd start with some kind of knowledge representation describing the meaning of the listing and a set of rules (finite state transducers, say) for mapping meanings to linguistic forms. There's a sizable academic literature on that kind of stuff, though it's kind of out of fashion these days. Places to start might be Blackburn & Bos's book or the NLTK suite (especially some of the projects in the contrib package).
The third way of doing it would be to treat it as a translation problem, essentially "translating" database entries into ad copy. You'd start with a large collection of listings and the corresponding human-written ads and construct a statistical model of the relationship between the two. Moses/Giza++ is a general purpose tool for building and applying such models.

Nielsen's usability scale

Just wondering if anyone out there knows of a standard survey (preferably based on Jakob Nielsen's work on usability) that web admins can administer to test groups for usability?
I could just make up my own, but I feel there has got to be some solid research out there on the sorts of judgments about tasks I should be asking for.
For example
Q: Ask the user to find the profile page
Do I ...
A.) Present them with a standard Likert scale after each question
B.) Present them with the Likert scale after all the questions
..
Then, what should that Likert scale be? I know Nielsen's usability judgments are based on Learnability, Efficiency of Use, Memorability, Error Rate, and Satisfaction, but I can only imagine designing a Likert scale that would effectively measure satisfaction... how am I supposed to ask a user to rank the memorability of a site, after one use, on a 1-5 scale? Surely someone has devised a good way to pose the question?
A few recommendations:
Don't determine your standard exclusively by listening to the users and waiting for their feedback. Nielsen says that rule #1 in usability is "Don't listen to users"; it's more important to watch them work.
Here is an FAQ regarding development of Likert questionnaires. I would err on the side of simplicity and brevity if you are going to ask users a list of questions after every task. There are advantages and disadvantages to both of the options you are considering. If you make a user wait until they have finished all of their tasks before they fill out a survey, they may not remember their initial difficulties with the interface as they adjust to its learning curve. On the other hand, if you ask them questions after each task, they may start rushing through the questionnaire as they get toward the end of the list of tasks. An extra option, depending on how many tasks you have, may be to have the user fill out a survey after every few tasks.
The University of Maryland HCI Laboratory maintains a Questionnaire for User Interaction Satisfaction, which is available for download and now on version 7.0. You may be able to use their survey, or at least tailor it for your use.
The short and easy System Usability Scale (SUS) has been found by Tullis and Stetson (2004) to psychometrically outperform other subjective scales including the renowned QUIS. Most SUS items seem related to learnability or memorability, along with a couple for efficiency. However, I wouldn’t try to break it into subscales; all items are highly intercorrelated suggesting this scale measures a single underlying construct.
I would doubt you can get a scale to measure each of Nielsen’s dimensions separately. A user can tell you if a product is “hard” to use, but it’s much more difficult for them to break it down further. They know it took a lot of work to do something, but was it because they couldn’t figure out an easier way (learnability)? Or maybe they had learned a better way on a previous task, but forgot it (memorability)? Or is that just the way it has to be (efficiency)? Users are not going to have sufficient information to make the distinction.
If you are specifically interested in each of Nielsen’s dimensions separately, then assess them separately and directly. You can measure learnability crudely through recording the number of errors or time between clicks, and precisely by how many trials it takes for users to learn the normative interaction sequence. For efficiency, after you train users to do the normative interaction sequence, record how long it takes them to do it. You can also get a pretty good answer analytically using something like GOMS-KLM. For memorability, bring the same users in a week or so later and compare their performance to that of the efficiency-measuring trial.
Like nearly all subjective scales, the SUS is primarily useful for comparing the overall subjective experience of different products. It’s hard to know what to make out of a single score without something to compare it to. These scales won’t tell what specific problems a product has or why it has them (e.g., to help you determine improvements). For that, qualitative observation and debriefing your test participants is best.

So was that Data Structures & Algorithms course really useful after all?

I remember when I was in DSA I was like "WTF is O(n)?" and wondered where I would use it other than in grad school, or if you're not a PhD like Bloch. Somehow uses for it do pop up in business analysis, so I was wondering: when have you guys had to call on your Big-O skills to see how to write an algorithm, which data structure did you use to fit the problem, or did you have to actually create a new data structure (like your own implementation of a splay tree or trie)?
Understanding Data Structures has been fundamental to many of the projects I've worked on, and that goes beyond the ten minute song 'n dance one does when asked such a question in an interview situation.
Granted, modern environments with all sorts of collection classes can make light work of storing and accessing large amounts of data, but having an understanding that a particular problem is best solved with a particular data structure can be a great timesaver. And by "timesaver" I mean "the difference between something working and not working".
Honestly, being able to answer that stuff is my biggest criterion for taking interviewees seriously in an interview. Knowing how basic data structures work, basic O(n) analysis, and some light theory is really crucial to being able to write large applications successfully.
It's important in the interview because it's important in the job. I've worked with techs in the past that were self taught, without taking the data structures course or reading a data structures book, and their code is occasionally bad in ways they should have seen coming.
If you don't know that n² is going to run slowly compared to n log n, you've got more to learn.
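For a concrete (if contrived) Python illustration of that gap, compare two ways of checking a list for duplicates; the function names and numbers here are just an example:

    def has_duplicate_quadratic(items):
        # O(n^2): compare every pair
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if items[i] == items[j]:
                    return True
        return False

    def has_duplicate_sorting(items):
        # O(n log n): sort once, then one linear scan of adjacent elements
        s = sorted(items)
        return any(s[i] == s[i + 1] for i in range(len(s) - 1))

On a million distinct items the first version performs roughly 5 × 10^11 comparisons, while the second gets by with a single sort and a scan.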
As for the latter half of the data structures course, it isn't generally applicable to most tech jobs, but if you ever do wind up needing it, you'll wish you had paid more attention.
Big-O notation is one of the basic notations used when describing algorithms implemented by a particular library. For example, all documentation on STL that I've seen describes various operations in terms of big-O, so naturally you have to e.g. understand the difference between O(1), O(log n) and O(n) to understand the implications of your choice of STL containers and algorithms. MSDN also does that for .NET classes, and IIRC Java documentation does that for standard Java classes. So, I'd say that knowing the notation is pretty much a requirement for understanding documentation of most popular frameworks out there.
Sure (even though I'm a humble MS in EE -- no PhD, no CS, unlike my colleague Joshua Bloch), I write a lot of stuff that needs to be highly scalable (or components that may need to be reused in highly scalable apps), so big-O considerations are almost always at work in my design (and it's not hard to take them into account). The data structures I use are almost always from Python's simple but rich supply (which I did lend a hand in developing ;-)); rarely is a totally custom one needed (rather than building on top of list, dict, etc.), but when it does happen (e.g. the bitvectors in my open source project gmpy), it's no big deal.
I was able to use B-trees right when I learned about them in my algorithms class (that was about 15 years ago, when there were far fewer open-source implementations available). And even later the knowledge about the differences between e.g. container classes came in handy...
Absolutely: even though stacks, queues, etc. are pretty straightforward, it helps to have been introduced to them in a disciplined fashion.
B-trees and the more advanced sorting algorithms are a bit more difficult, so learning them early was a big benefit, and I have indeed had to implement each of them at various points.
Finally, I created an algorithm for single-connected components a few years back that was significantly better than the one our signal-processing team was using, but I couldn't convince them that it was better until I could show that it was O(n) complexity rather than O(n log n).
...just to name a few examples.
Of course, if you are content to remain a CRUD-system hacker with no real desire to do more than collect a paycheck, then it may not be necessary...
I found my knowledge of data structures very useful when I needed to implement a customizable event-driven system about ten years ago. That's the biggie, but I use that sort of knowledge fairly frequently in lesser ways.
For me, knowing the exact algorithms has been... nice as background knowledge. However, the thing that's been the most useful is the more general background of having to pay attention to how different pieces of an algorithm interact. For instance, there can be places in code where moving one piece of code (e.g., outside a loop) can make a huge difference in both time and space.
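A tiny, self-contained Python example of that kind of move (the computation itself is made up; only the hoisting pattern matters):

    import math

    records = range(10_000)

    # Inside the loop: a loop-invariant value is recomputed on every iteration.
    total = 0
    for r in records:
        scale = sum(math.sqrt(i) for i in range(1000))   # invariant, computed n times
        total += r * scale

    # Hoisted out: same result, the invariant is computed exactly once.
    scale = sum(math.sqrt(i) for i in range(1000))
    total = sum(r * scale for r in records)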
It's less the specific knowledge the course taught and more that it acted like several years of experience. The course took all the variations of something that might take years to encounter (and have drilled into you) through pure "real world experience" and condensed them.
The title of your question asks about data structures and algorithms, but the body of your question focuses on complexity analysis, so I'll focus on that too:
There are lots of programming jobs where being able to do complexity analysis is at least occasionally useful. See What career can I hope for if I like algorithms? for some examples of these.
I can think of several instances in my career where either I or a co-worker have discovered a piece of code where the (usually time, sometimes space) complexity was higher than it should have been, e.g. something that was quadratic or cubic when it could have been linear or O(n log n). Such code would work fine when given small inputs, but on larger inputs would quickly become really slow or consume all available memory. Knowing alternative algorithms and data structures, their complexities, and also how to analyze the complexity to build new algorithms is vital in being able to correct these problems (or avoid them in the first place).
Networking is the only place I've used it: in an implementation of the traveling salesman problem.
Unfortunately I do a lot of "line of business" and "forms over data" apps, so most problems I work on can be solved by hammering together arrays, linked lists, and hash tables. However, I've had the chance to work my data structures magic here and there:
Due to weird complex business rules, I worked on an application which used a custom thread pool implemented as a leftist-heap.
My dev team struggled to write a complex multithreaded app. It was plagued with race conditions, deadlocks, and lousy performance due to very fine-grained locking. We re-worked the code so that threads no longer shared state, opting instead to write a very lightweight wrapper to facilitate message passing. Once we had converted our linked lists and hash tables to immutable stacks and immutable red-black trees, we had no more problems with thread safety or performance. The resulting code was immaculate and surprisingly readable.
Frequently, a business rules engine requires you to roll your own state machine, which is very naturally modelled as a graph where vertexes are states and edges are transitions between states.
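A toy sketch of that modelling in Python (the states, events, and business rules are invented for illustration):

    # Vertexes are states, edges are the allowed, labelled transitions.
    transitions = {
        "new":       {"approve": "approved", "reject": "rejected"},
        "approved":  {"ship": "shipped"},
        "shipped":   {"deliver": "delivered"},
        "rejected":  {},
        "delivered": {},
    }

    def step(state, event):
        """Follow the edge labelled `event` out of `state`, or fail loudly."""
        try:
            return transitions[state][event]
        except KeyError:
            raise ValueError(f"illegal transition {event!r} from state {state!r}")

    state = "new"
    for event in ["approve", "ship", "deliver"]:
        state = step(state, event)
    print(state)   # delivered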
If for no other reason, I'm glad I took the time to read about data structures and algorithms, simply to be able to picture novel problems a little differently, especially combinatorial problems and graph problems. Graph theory is no longer a synonym for "scary".

How to measure usability to get hard data?

There are a few posts on usability but none of them was useful to me.
I need a quantitative measure of usability of some part of an application.
I need to estimate it in hard numbers to be able to compare it with future versions (e.g. for reporting purposes). The simplest way is to count clicks and keystrokes, but this seems too simple (for example, is the cost of filling a text field a simple sum of typing all the letters? I guess it is more complicated).
I need some mathematical model for that so I can estimate the numbers.
Does anyone know anything about this?
P.S. I don't need links to resources about designing user interfaces. I already have them. What I need is a mathematical apparatus to measure the usability of an existing application's interface in hard numbers.
Thanks in advance.
http://www.techsmith.com/morae.asp
This is what Microsoft used in part when they spent millions redesigning Office 2007 with the ribbon toolbar.
Here is how Office 2007 was analyzed:
http://cs.winona.edu/CSConference/2007proceedings/caty.pdf
Be sure to check out the references at the end of the PDF too, there's a ton of good stuff there. Look up how Microsoft did Office 2007 (regardless of how you feel about it), they spent a ton of money on this stuff.
Your main ideas to approach in this are Effectiveness and Efficiency (and, in some cases, Efficacy). The basic points to remember are outlined on this webpage.
What you really want to look at doing is 'inspection' methods of measuring usability. These are typically more expensive to set up (both in terms of time, and finance), but can yield significant results if done properly. These methods include things like heuristic evaluation, which is simply comparing the system interface, and the usage of the system interface, with your usability heuristics (though, from what you've said above, this probably isn't what you're after).
More suited to your use, however, will be 'testing' methods, whereby you observe users performing tasks on your system. This is partially related to the point of effectiveness and efficiency, but can include various things, such as the "Think Aloud" concept (which works really well in certain circumstances, depending on the software being tested).
Jakob Nielsen has a decent (short) article on his website. There's another one, but it's more related to how to test in order to be representative, rather than how to perform the testing itself.
Consider measuring the time to perform critical tasks (using a new user and an experienced user) and the number of data entry errors for performing those tasks.
First you want to define goals: for example increasing the percentage of users who can complete a certain set of tasks, and reducing the time they need for it.
Then get two cameras and a few users (5-10), give them a list of tasks to complete, and ask them to think out loud. Half of the users should use the "old" system, the rest should use the new one.
Review the tapes, measure the time it took, measure success rates, discuss endlessly about interpretations.
Alternatively, you can develop a system for bucket-testing -- it works the same way, though it makes it far more difficult to find out something new. On the other hand, it's much cheaper, so you can do many more iterations. Of course that's limited to sites you can open to public testing.
That obviously implies you're trying to get comparative data between two designs. I can't think of a way of expressing usability as a value.
You might want to look into the GOMS model (Goals, Operators, Methods, and Selection rules). It is a very difficult research tool to use in my opinion, but it does provide a "mathematical" basis to measure performance in a strictly controlled environment. It is best used with "expert" users. See this very interesting case study of Project Ernestine for New England Telephone operators.
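The keystroke-level variant (KLM-GOMS) is the easiest to turn into numbers. A rough Python sketch using the textbook operator times (the task sequences below are hypothetical):

    # KLM operator times in seconds (Card, Moran & Newell; rough textbook averages)
    KLM = {
        "K": 0.28,   # keystroke, average typist
        "P": 1.10,   # point at a target with the mouse
        "B": 0.10,   # mouse button press or release
        "H": 0.40,   # move hands between keyboard and mouse
        "M": 1.35,   # mental preparation
    }

    def klm_time(sequence):
        """Sum operator times for a sequence such as 'MHPBB'."""
        return sum(KLM[op] for op in sequence)

    # Hypothetical comparison: pick a menu item with the mouse vs. a keyboard shortcut.
    print(klm_time("MHPBBPBB"))   # mouse route
    print(klm_time("MKKK"))       # shortcut route

The absolute numbers only hold for error-free expert performance, but the comparison between two candidate interaction sequences is often what you need.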
Measuring usability quantitatively is an extremely hard problem. I tackled this as a part of my doctoral work. The short answer is, yes, you can measure it; no, you can't use the results in a vacuum. You have to understand why something took longer or shorter; simply comparing numbers is worse than useless, because it's misleading.
For comparing alternate interfaces it works okay. In a longitudinal study, where users are bringing their past expertise with version 1 into their use of version 2, it's not going to be as useful. You will also need to take into account time to learn the interface, including time to re-understand the interface if the user's been away from it. Finally, if the task is of variable difficulty (and this is the usual case in the real world) then your numbers will be all over the map unless you have some way to factor out this difficulty.
GOMS (mentioned above) is a good method to use during the design phase to get an intuition about whether interface A is better than B at doing a specific task. However, it only addresses error-free performance by expert users, and only measures low-level task execution time. If the user figures out a more efficient way to do their work that you haven't thought of, you won't have a GOMS estimate for it and will have to draft one up.
Some specific measures that you could look into:
Measuring clock time for a standard task is good if you want to know what takes a long time. However, lab tests generally involve test subjects working much harder and concentrating much more than they do in everyday work, so comparing results from the lab to real users is going to be misleading.
Error rate: how often the user makes mistakes or backtracks. Especially if you notice the same sort of error occurring over and over again.
Appearance of workarounds: if your users are working around a feature, or taking a bunch of steps that you think are dumb, it may be a sign that your interface doesn't give them the tools to figure out how to solve their problems.
Don't underestimate simply asking users how well they thought things went. Subjective usability is finicky but can be revealing.

Seating plan software recommendations (does such a beast even exist?)

I'm getting married soon and am busy with the seating plan, and am running into the usual issues of who sits where: X and Y must sit together, but A and B cannot stand each other etc.
The numbers I'm dealing with aren't huge (so the manual option will work just fine), but being of the geeky persuasion, I was wondering if there was any software available to do this for me?
Failing an exact match, what should I look for (the problem space, books, reference code) to tweak for my purposes?
I am the developer of PerfectTablePlan. I post here as well as at Joel's Business of Software. ;0)
Combinatorial problems, such as seat assignment, are quite nasty algorithmically. NP-hard in fact. The number of ways to seat 60 guests in 60 seats is 60! (60 factorial) and that is more than the number of atoms in the known universe.
PerfectTablePlan allows you to specify that A must sit next to B, but nowhere near C. It uses a genetic algorithm to automatically assign the seats. This works pretty well in practice - it will usually find a decent solution for 100 guests in a few seconds. You might need to make a coffee for 1000+ guests. In practice some drag-and-drop fine-tuning is also usually required to cope with the vagaries of local customs and family politics (Uncle Bob is a bit deaf, we had better put him nearer the top table).
You can find out a bit more about the genetic algorithm here.
Ps/ The automatic seat assignment is only a small part of creating a good seating plan. See the PerfectTablePlan tour and the tips page for more details.
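For readers who want the flavour of such a search without the full genetic machinery, here is a toy Python sketch that scores a plan by counted constraint violations and improves it by random swaps (the guests, constraints, and simple hill-climbing strategy are illustrative simplifications, not PerfectTablePlan's actual algorithm):

    import random

    guests   = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace", "Heidi"]
    together = {("Alice", "Bob")}    # must sit at the same table
    apart    = {("Carol", "Dave")}   # must not sit at the same table
    tables   = 2                     # guests are split evenly across tables

    def score(plan):
        """Penalty: +1 for every violated 'together' or 'apart' constraint."""
        penalty = 0
        for a, b in together:
            penalty += plan[a] != plan[b]
        for a, b in apart:
            penalty += plan[a] == plan[b]
        return penalty

    def random_plan():
        seats = [i % tables for i in range(len(guests))]
        random.shuffle(seats)
        return dict(zip(guests, seats))

    # Crude stochastic search: mutate by swapping two guests, keep non-worsening plans.
    best = random_plan()
    for _ in range(10_000):
        candidate = dict(best)
        a, b = random.sample(guests, 2)
        candidate[a], candidate[b] = candidate[b], candidate[a]
        if score(candidate) <= score(best):
            best = candidate

    print(best, "penalty:", score(best))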
http://www.perfecttableplan.com/
I believe this is from a guy that usually posts at Joel On Software.
Never tried it though.
Hope it helps.
Try modeling this using GLPK. Integer linear programming is amenable to introducing constraints into graph-based problems with multiple possible outcomes.
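GLPK has its own MathProg modelling language; if you prefer to drive it from Python, PuLP can formulate the model and hand it to GLPK (or its bundled default solver). A small sketch with invented guests and constraints:

    from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

    guests   = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]
    tables   = [0, 1]
    capacity = 3

    # x[(g, t)] == 1 means guest g sits at table t
    x = LpVariable.dicts("seat", [(g, t) for g in guests for t in tables], cat=LpBinary)

    prob = LpProblem("seating", LpMinimize)
    prob += lpSum(x.values())                 # dummy objective: this is really a feasibility problem

    for g in guests:                          # every guest sits at exactly one table
        prob += lpSum(x[(g, t)] for t in tables) == 1
    for t in tables:                          # respect table capacity
        prob += lpSum(x[(g, t)] for g in guests) <= capacity
    for t in tables:                          # Alice and Bob must share a table
        prob += x[("Alice", t)] == x[("Bob", t)]
    for t in tables:                          # Carol and Dave must not share a table
        prob += x[("Carol", t)] + x[("Dave", t)] <= 1

    prob.solve()
    print({g: t for g in guests for t in tables if x[(g, t)].value() == 1})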
My personal favorite is to not assign seating: allow folks to sit wherever they want.
That might lead to [un]intentional cliquishness, but it means you're not having to worry about it.
I expect this isn't a great answer, but you could research flocking behavior.
If you take away the random jitters at each step, the flock eventually settles where each member has found its optimum position in relation to the rest of the group.
This sounds like a constraint satisfaction problem. You should probably check out logic programming systems that are also equipped with constraint solvers. They're usually like Prolog, only they are actually declarative for problems that are soluble by their solvers.
Hopefully there is one that has an easy interface from your favourite language, to get the data in and out.
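If Prolog feels like too big a jump, constraint solvers are also available as ordinary libraries; here is a sketch with the python-constraint package (the guest names, table numbers, and capacity are invented):

    from constraint import Problem

    guests = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]
    tables = [0, 1]

    problem = Problem()
    problem.addVariables(guests, tables)     # each guest is assigned a table number

    problem.addConstraint(lambda a, b: a == b, ("Alice", "Bob"))    # must sit together
    problem.addConstraint(lambda c, d: c != d, ("Carol", "Dave"))   # cannot sit together

    def capacity_ok(*assignment):            # at most 3 guests per table
        return all(assignment.count(t) <= 3 for t in tables)
    problem.addConstraint(capacity_ok, guests)

    print(problem.getSolutions()[0])

The solver does the search for you; you only state the rules, which is exactly the shape of the wedding problem.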
I used a program a while ago that would fit this perfectly... it was a Java app; you could define rules, and it would create test cases that satisfied the rules. The file extension was .als.
fact GateRules {
  all g: Gate | one g.loc  // Gates have 1 Location
}
I'll keep wracking my brain for the program name.
EDIT: It was Alloy
Now that I think about it, it may not be ideal - the notion of "seats in a fixed configuration" would be a little difficult to model. I used it differently: defining rules (an airport gate is in one location, only one plane is on a runway), and testing pre- and post-conditions for functions (after I land a plane, can I ever have more than one plane on a runway?).