Item matching with domain knowledge - language-agnostic

I have various product items that I need to decide whether they are the same. A quick example:
Microsoft RS400 mouse with middle button should match Microsoft Red Style 400 three buttoned mouse, but not Microsoft Red Style 500 mouse.
There isn't anything else useful I can match on apart from the name, and simply using the ratio of matching words isn't good enough (the error rate is far too high).
I do know about the domain, so I can (for example) hand-write the fact that a three-buttoned mouse is probably the same as a mouse with a middle button. I also know the manufacturers (or can take a very good guess at them).
The only thought I have had so far is to use hand-written rules to reduce the size of the string and then check the matching words, but I wonder whether anyone has ideas on a better way to do this matching, with better accuracy and precision (or where to start looking), and whether anyone knows of work that has been done in this area (papers, examples, etc.).

"I do know about the domain..."
How much exactly do you know about the domain? If you know everything about it, then you might be better off building an index of all your manufacturers' products (essentially the description of each product from the manufacturer's webpage). Then, instead of trying to match your descriptions to each other, match them against your index of products.
Advantages to this approach:
presumably all words used in the description of the product have been used somewhere in the promotional literature
if, when building the index, you are able to weight some of the information (such as product codes), you may have more success
Disadvantages:
may take a long time to create the index (especially if done by hand)
If you don't know everything about your domain, then you might consider down-ranking words that are very common (you can get lists of common words off the internet) and up-ranking numbers and words that aren't in a dictionary (you can get word lists off the internet too; most Linux/Unix distributions ship them for spell-checking purposes).
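To illustrate the weighting idea, here is a small Python sketch; the word lists are just illustrative stand-ins for a real common-words list and dictionary file:

# Weighted word overlap: common words score low, numbers and
# out-of-dictionary tokens score high. Word lists are illustrative.
COMMON_WORDS = {"with", "the", "and", "mouse"}
DICTIONARY = {"red", "style", "three", "buttoned", "mouse", "middle", "button", "with"}

def weight(token):
    t = token.lower()
    if t in COMMON_WORDS:
        return 0.1                       # down-rank very common words
    if any(ch.isdigit() for ch in t):
        return 3.0                       # up-rank numbers (likely product codes)
    if t not in DICTIONARY:
        return 2.0                       # up-rank non-dictionary words (brands, models)
    return 1.0

def similarity(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    shared = sum(weight(t) for t in ta & tb)
    total = sum(weight(t) for t in ta | tb)
    return shared / total if total else 0.0

print(similarity("Microsoft RS400 mouse with middle button",
                 "Microsoft Red Style 400 three buttoned mouse"))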
I don't know how much you know about search, but in the past I've found the book "Search Engines: Information Retrieval in Practice" by W. Bruce Croft, Donald Metzler, and Trevor Strohman useful. There are some sample chapters on the publisher's website which will tell you whether the book is for you or not: pearsonhighered.com
Hope that helps.

In addition to hand-written rules, you may try supervised learning with feature extraction.
Let the features be the words in the description, then look at descriptions as feature vectors.
While training the algorithm, have it show you two vectors that look similar by the word ratio, and if they are the same item, let the algorithm increase the weights for those pairs of words.
For example, each pair of words may carry a bigger weight than the simple ratio you have now:
[3-button] [middle]
[wheel] [button]
[mouse] [mouse]
By your current algorithm, this gives a similarity ratio of 1/3 (only [mouse]/[mouse] matches). When you mark the pair as "same item", the algorithm should add more weight to those pairs of words when it reaches them next time.
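A crude Python sketch of that feedback loop (the update rule and the numbers are my own illustration; a real learner would also down-weight pairs that show up in "different item" feedback):

from collections import defaultdict
from itertools import product

# weight for each (word_a, word_b) pair; identical words always count 1.0,
# everything else counts 0.0 until feedback teaches us otherwise
pair_weight = defaultdict(float)

def similarity(a, b):
    ta, tb = a.lower().split(), b.lower().split()
    score = sum(1.0 if wa == wb else pair_weight[(wa, wb)]
                for wa, wb in product(ta, tb))
    return score / max(len(ta), len(tb))

def confirm_same_item(a, b, boost=0.1):
    # the user said these describe the same item: strengthen the
    # non-identical word pairs so they contribute next time
    for wa, wb in product(a.lower().split(), b.lower().split()):
        if wa != wb:
            pair_weight[(wa, wb)] += boost

a, b = "3-button wheel mouse", "middle button mouse"
print(similarity(a, b))   # only "mouse" matches: 1/3
confirm_same_item(a, b)
print(similarity(a, b))   # the learned pairs now raise the score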

Just tokenize (you should separate numbers from letters in that step as well, so not just a whitespace tokenizer), stem, and filter stopwords and uninteresting words like "mouse". Perhaps you should also keep a list of producers, and shorten everything that is not a producer or a number to its first letter. (If you do that, you have to separate capital letters in the tokenizer as well.)
Microsoft RS400 mouse with middle button -> Microsoft R S 400
Microsoft Red Style 400 three buttoned mouse -> Microsoft R S 400
Microsoft Red Style 500 mouse -> Microsoft R S 500
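A rough Python version of that pipeline (the producer and stopword lists are illustrative stand-ins; a real stemmer would replace the hand-made filtering):

import re

PRODUCERS = {"microsoft"}
STOPWORDS = {"with", "mouse", "three", "buttoned", "middle", "button"}

def normalize(name):
    # separate digits from letters and split runs of capital letters,
    # so "RS400" becomes "R S 400"
    s = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])|(?<=[A-Z])(?=[A-Z])",
               " ", name)
    out = []
    for tok in s.split():
        low = tok.lower()
        if low in STOPWORDS:
            continue                    # drop uninteresting words
        if low in PRODUCERS or low.isdigit():
            out.append(tok)             # keep producers and numbers whole
        else:
            out.append(tok[0].upper())  # shorten everything else to its first letter
    return " ".join(out)

print(normalize("Microsoft RS400 mouse with middle button"))      # Microsoft R S 400
print(normalize("Microsoft Red Style 400 three buttoned mouse"))  # Microsoft R S 400
print(normalize("Microsoft Red Style 500 mouse"))                 # Microsoft R S 500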
If you want a better solution, a VSM (vector space model), as used in plagiarism detection, would be nice. (Every word gets a weight according to its discriminative value, and those weights are projected into a multidimensional space. After that, you just measure the angle between the two texts.)
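For illustration, a bare-bones VSM comparison in Python; here the discriminative value is an IDF-style weight over your own product names, which is my simplification of what the plagiarism-detection approaches do:

import math
from collections import Counter

def cosine(a, b, corpus):
    def idf(term):  # rarer across the corpus = more discriminative
        df = sum(term in doc.lower().split() for doc in corpus)
        return math.log(len(corpus) / (1 + df)) + 1.0
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    terms = set(va) | set(vb)
    dot = sum(va[t] * vb[t] * idf(t) ** 2 for t in terms)
    na = math.sqrt(sum((va[t] * idf(t)) ** 2 for t in terms))
    nb = math.sqrt(sum((vb[t] * idf(t)) ** 2 for t in terms))
    return dot / (na * nb) if na and nb else 0.0

corpus = ["microsoft rs400 mouse with middle button",
          "microsoft red style 400 three buttoned mouse",
          "microsoft red style 500 mouse"]
print(cosine(corpus[0], corpus[1], corpus))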

I would suggest something a lot more generally applicable. As I understand it, you want some NLP processing that will deal with things you recognize as synonyms. I think that's a pretty simple implementation right there.
If I were you, I would make a keyword object that has a list of synonyms as a parameter, then write a script that scrapes whatever text you have for words that appear only occasionally (with some capped frequency at which a keyword is actually considered applicable), and add to each keyword a list of the keywords that are its synonyms. If you were willing to go a step further, I would set weights on the synonym list showing how similar they are.
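One possible shape for that keyword object, as a Python sketch (all names, the frequency cap, and the weights are hypothetical):

from collections import Counter

class Keyword:
    def __init__(self, word):
        self.word = word
        self.synonyms = {}               # synonym word -> similarity weight

    def add_synonym(self, other, weight=1.0):
        self.synonyms[other] = weight

    def match(self, token):
        if token == self.word:
            return 1.0
        return self.synonyms.get(token, 0.0)

def occasional_words(text, max_count=2):
    # candidate keywords: words that appear only occasionally in the scraped text
    counts = Counter(text.lower().split())
    return {w for w, c in counts.items() if c <= max_count}

middle = Keyword("middle")
middle.add_synonym("three", 0.9)         # "three(-buttoned)" ~ "(with) middle (button)"
print(middle.match("three"))             # 0.9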
With this kind of NLP problem the chance that you will get to 100% accuracy is zero, but you could well get above 90%. I would suggest adding an element by which you can adjust the weights in an automated way. I have to be fairly vague here, but in my last job I was tasked with a similar problem and was able to get accuracy in the high 90s. My implementation was probably more complicated than what you need, but even a simple implementation should give you a pretty good return; if you aren't dealing with a fairly large data set (hundreds of items or more), though, it's probably not worth scripting.
Quick example: in your case the difference can be distilled pretty accurately to just saying that "middle" and "three" are synonyms. You can get more complex if you need to, but that alone would match a lot.

Do I always need labeled data for an ANN?

I put a small request on Upwork asking for help with a topic that is currently outside my skill zone.
The problem is fitting small rectangles into a big rectangle via an ANN.
The trouble is that the first freelancer baffled me a little bit with a comment.
My thinking was that, because the solution is easily verified and rewardable, you can simply throw an ANN at this problem and with enough time it will perform better and better.
The freelancer requested labeled data before he could tackle the problem (that's the comment which confuses me).
I was thinking that unlabeled random input data would be enough for a start.
Am I thinking wrong?
Here is the link to the job post:
https://www.upwork.com/jobs/~01e040711c31ac0979
Edit: here is the original job description directly:
I want Python code for training an ANN and using it in a production environment.
The problem it needs to solve is a rectangle fitting problem.
Input:
1000 small rectangles (groupid, width, height, orientation (free, restricted, hor or ver), value) -- sRect
1 big rectangle (width, height) -- bRect
Layout (bool, bool, bool, xpos, ypos, orientation (hor or ver)) -- Layout
Output:
Layout
The bRect will be duplicated into 3 rectangles, into which the sRects need to be fitted.
The worth of a solution is determined by the sum of the values of the sRects placed inside the bRects.
Furthermore, the value is decreased if an sRect is placed in the second or third bRect:
sum(value(sRect)) * 0.98^n for the nth bRect
Not all sRects need to be placed.
Layout is structured so that the three bools at the start represent in which bRect the sRect is placed. If an sRect is placed in one of the bRects, then the solution Layout must stay the same for this sRect.
Restricted orientation means all sRects in the same group need to be oriented the same way. Hor means the sRect is not turned; ver means the sRect is turned by 90 degrees.
Other than that, normal rules apply: all sRects need to be inside the bRect, with no overlap between sRects.
Looking forward to replies, and I am available for further explanations.
Edit: example picture. Important: I don't want to optimise for maximum plate usage, because a smaller sRect can have a higher value than a bigger sRect.
[image: example fitting problem]
Without expected output for each input you cannot use the most standard training methodology, supervised learning. If you only have a way to verify a solution (e.g. in a game of chess you can tell me if I won, but you can't tell me how to win), then the most standard approach is reinforcement learning. That said, it is a much more complex problem, not something that, say, a newcomer to the field of ML will be capable of doing (while supervised learning is something one can do essentially by following basic tutorials online).
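To make that concrete: for reinforcement learning all you need is a verifier that scores candidate layouts, which then doubles as the reward function. A minimal Python sketch of such a scorer (the data layout, the zero-based plate indexing for the 0.98 decay, and the omission of the group-orientation rules are all my assumptions):

def score(placements, brect, srects):
    """placements: list of (srect_index, plate_index, x, y, rotated)"""
    bw, bh = brect
    placed = []                                  # kept placements for overlap checks
    total = 0.0
    for idx, plate, x, y, rotated in placements:
        w, h, value = srects[idx]
        if rotated:
            w, h = h, w                          # "ver": turned by 90 degrees
        if x < 0 or y < 0 or x + w > bw or y + h > bh:
            return None                          # sRect sticks out of the bRect: invalid
        for px, py, pw, ph, pplate in placed:
            if pplate == plate and x < px + pw and px < x + w \
                    and y < py + ph and py < y + h:
                return None                      # overlap on the same plate: invalid
        placed.append((x, y, w, h, plate))
        total += value * 0.98 ** plate           # later plates are worth less
    return total

srects = [(3, 2, 10.0), (2, 2, 7.0)]             # (width, height, value)
print(score([(0, 0, 0, 0, False), (1, 0, 3, 0, False)], (6, 4), srects))  # 17.0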

Randomly Generate Directed Graph on a grid

I am trying to randomly generate a directed graph for the purpose of making a puzzle game similar to the ice-sliding puzzles from Pokémon.
This is essentially what I want to be able to randomly generate: http://bulbanews.bulbagarden.net/wiki/Crunching_the_numbers:_Graph_theory
I need to be able to limit the size of the graph in an x and y dimension. In the example in the link, it would be restricted to an 8x4 grid.
The problem I am running into is not randomly generating the graph, but randomly generating a graph that I can properly map out in 2D space, since I need something (like a rock) on the opposite side of a node to make it visually make sense when you stop sliding. The problem with this is that sometimes the rock ends up in the path between two other nodes, or possibly on another node itself, which breaks the entire graph.
After discussing the problem with a few people I know, we came to a couple of conclusions that may lead to a solution: include the obstacles in the grid as part of the graph when constructing it, or start out with a fully filled grid, draw a random path, and delete the blocks needed to make that path work; the problem then becomes figuring out which ones to delete so that you don't accidentally introduce an additional, shorter path. We were also thinking a dynamic programming algorithm might be beneficial, though none of us is skilled at creating dynamic programming algorithms from scratch. Any ideas, or references about what this problem is officially called (if it's an official graph problem), would be most helpful.
I wouldn't look at it as a graph problem, since, as you say, the representation is incomplete. To generate a puzzle I would work directly on the grid, and work backwards: first fix the destination spot, then place rocks in some way that makes it reachable from one or more spots, and iteratively add rocks to make those other spots reachable in turn, with the constraint that you never add a rock which breaks all the paths to the destination.
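A minimal Python sketch of that constraint-driven generation (simplified: rocks are placed at random and kept only if the goal stays reachable by sliding; stopping exactly on the goal counts as reaching it):

import random
from collections import deque

def slide(grid, r, c, dr, dc):
    # slide from (r, c) until the next cell is a rock or the edge of the grid
    rows, cols = len(grid), len(grid[0])
    while 0 <= r + dr < rows and 0 <= c + dc < cols and grid[r + dr][c + dc] != '#':
        r, c = r + dr, c + dc
    return r, c

def reachable(grid, start, goal):
    # BFS over the stopping positions of the sliding movement
    seen, queue = {start}, deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            return True
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = slide(grid, pos[0], pos[1], d[0], d[1])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def generate(rows, cols, start, goal, rocks=10, tries=200):
    grid = [['.'] * cols for _ in range(rows)]
    placed = 0
    for _ in range(tries):
        if placed == rocks:
            break
        r, c = random.randrange(rows), random.randrange(cols)
        if (r, c) in (start, goal) or grid[r][c] == '#':
            continue
        grid[r][c] = '#'
        if reachable(grid, start, goal):
            placed += 1                 # keep the rock: a path still exists
        else:
            grid[r][c] = '.'            # this rock broke every path: undo
    return grid

for row in generate(4, 8, (0, 0), (3, 7)):   # an 8x4 grid as in the question
    print(''.join(row))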
You might want to generate a planar graph, which means that the edges of the graph will not overlap each other in two-dimensional space. One characterization of planar graphs is that a graph is planar if and only if it contains no subgraph that is a subdivision of K_{3,3} (the complete bipartite graph on six nodes) or of K_5 (the complete graph on five nodes).
There's a paper on the fast generation of planar graphs.

Document image processing

I am working on an application for processing document images (mainly invoices). Basically, I'd like to convert certain regions of interest into an XML structure and then classify the document based on that data. Currently I am using ImageJ for analyzing the document image and Asprise/Tesseract for OCR.
Now I am looking for something to make developing easier. Specifically, I am looking for something to automatically deskew a document image and analyze the document structure (e.g. converting an image into a quadtree structure for easier processing). Although I prefer Java and ImageJ I am interested in any libraries/code/papers regardless of the programming language it's written in.
While the system I am working on should process data automatically as far as possible, the user should oversee the results and, if necessary, correct the classification suggested by the system. Therefore I am interested in using machine learning techniques to achieve more reliable results. When similar documents are processed, e.g. invoices of a specific company, their structure is usually the same. When the user has previously corrected data in documents from a company, these corrections should be taken into account in the future. I have only limited knowledge of machine learning techniques and would like to know how I could realize my idea.
The following prototype in Mathematica finds the coordinates of blocks of text and performs OCR within each block. You may need to adapt the parameter values to fit the dimensions of your actual images. I do not address the machine learning part of the question; perhaps you would not even need it for this application.
Import the picture, create a binary mask for the printed parts, and enlarge these parts using a horizontal closing (dilation followed by erosion).
Query for each blob's orientation, cluster the orientations, and determine the overall rotation by averaging the orientations of the largest cluster.
Use the previous angle to straighten the image. At this time OCR is possible, but you would lose the spatial information for the blocks of text, which will make the post-processing much more difficult than it needs to be. Instead, find blobs of text by horizontal closing.
For each connected component, query for the bounding box position and the centroid position. Use the bounding box positions to extract the corresponding image patch and perform OCR on the patch.
At this point, you have a list of strings and their spatial positions. That's not XML yet, but it sounds like a good starting point to be tailored straightforwardly to your needs.
This is the code. Again, the parameters (structuring elements) of the morphological functions may need to change based on the scale of your actual images; also, if the invoice is too tilted, you may need to "rotate" the structuring elements roughly in order to still achieve good deskewing.
(* import and binarize; the mask joins the printed parts with a horizontal closing *)
img = ColorConvert[Import@"http://www.team-bhp.com/forum/attachments/test-drives-initial-ownership-reports/490952d1296308008-laura-tsi-initial-ownership-experience-img023.jpg", "Grayscale"];
b = ColorNegate@Binarize[img];
mask = Closing[b, BoxMatrix[{2, 20}]]
(* estimate the global rotation from the largest cluster of blob orientations *)
orientations = ComponentMeasurements[mask, "Orientation"];
angles = FindClusters@orientations[[All, 2]]
\[Theta] = Mean[angles[[1]]]
(* straighten the image, then find the blocks of text by another closing *)
straight = ColorNegate@Binarize[ImageRotate[img, \[Pi] - \[Theta], Background -> 1]]
TextRecognize[straight]
boxes = Closing[straight, BoxMatrix[{1, 20}]]
comp = MorphologicalComponents[boxes];
measurements = ComponentMeasurements[{comp, straight}, {"BoundingBox", "Centroid"}];
(* OCR each block separately and keep the non-empty results with their positions *)
texts = TextRecognize@ImageTrim[straight, #] & /@ measurements[[All, 2, 1]];
Cases[Thread[measurements[[All, 2, 2]] -> texts], (_ -> t_) /; StringLength[t] > 0] // TableForm
The paper we use for skew angle detection is "Skew detection and text line position determination in digitized documents" by Gatos et al. The only limitation of this approach is that it can detect skew between -5 and +5 degrees only. Beyond that, we need something to slap the user with a message! :)
In your case, where there are primarily invoice scans, you may beautifully use "Multiresolution Analysis in Extraction of Reference Lines from Documents with Gray Level Background" by Tag et al.
We wrote the code in MATLAB; if you need help, let me know!
I worked on a similar project once, and being a long-time user of OpenCV I ended up using it once again. OpenCV is a popular cross-platform computer vision library that offers programming interfaces for C and C++.
I found an interesting blog that had a post on how to detect the skew angle of text using OpenCV, and then another on how to deskew it.
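Roughly, the skew-detection approach boils down to OpenCV's minAreaRect trick. A Python sketch (file names are placeholders, and the angle normalisation assumes the pre-4.5 OpenCV angle convention, so treat it as a starting point):

import cv2
import numpy as np

img = cv2.imread("invoice.png", cv2.IMREAD_GRAYSCALE)
# invert so the text pixels are the foreground
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]        # angle of the tightest box around the text
if angle < -45:
    angle = -(90 + angle)                  # normalise minAreaRect's [-90, 0) range
else:
    angle = -angle
h, w = img.shape[:2]
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed.png", deskewed)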
To retrieve the text of the document and be able to pass a smaller image to tesseract, I suggest taking a look at the bounding box technique.
I don't know if the image acquisition procedure is your responsibility, but if it is you might want to take a look at how to do camera calibration with OpenCV to fix the distortion in the image caused by some camera lenses.

Vector graphics flood fill algorithms?

I am working on a simple drawing application, and I need an algorithm for flood fills.
The user workflow will look like this (similar to Flash CS, just simpler):
the user draws straight lines on the workspace. These are treated as vectors and can be selected and moved after they are drawn.
the user selects the fill tool and clicks on the drawing area. If the area is surrounded by lines in every direction, a fill is applied to the area.
if the lines are moved after the fill is applied, the area of the fill changes accordingly.
Does anyone have a nice idea how to implement such an algorithm? The main task is basically to determine the line segments surrounding a point (and to store this information somehow, in case the lines are moved).
EDIT: an explanation image (there can of course be other lines on the canvas that do not matter for the fill algorithm).
EDIT2: a more difficult situation:
EDIT3: I have found a way to fill polygons with holes (http://alienryderflex.com/polygon_fill/); now the main question is: how do I find my polygons?
You're looking for a point location algorithm. It's not overly complex, but it's not simple enough to explain here. There's a good chapter on it in this book: http://www.cs.uu.nl/geobook/
When I get home I'll get my copy of the book and see if I can try anyway; there are just a lot of details you need to know about. It all boils down to building a DCEL (doubly connected edge list) of the input and maintaining that data structure as lines are added or removed. Any query with a mouse coordinate will simply return an inner half-edge of the component, and those in particular contain pointers to all of the inner components, which is exactly what you're asking for.
One thing, though: you need to know the intersections in the input (because you cannot build the trapezoidal map if you have intersecting lines), and if you can get away with it (i.e. the input has few enough segments) I strongly suggest that you just use the naive O(n²) algorithm (simple, codeable, and testable in less than an hour). The O(n log n) algorithm takes a few days to code and uses a clever and very non-trivial data structure for the status. It is, however, also covered in the book, so if you feel up to the task you have two reasons to buy it. It is a really good book on geometric problems in general, so for that reason alone any programmer with an interest in algorithms and data structures should have a copy.
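For reference, the naive O(n²) pass is just a pairwise test over all segments. A Python sketch using the standard orientation test (collinear and touching cases are deliberately ignored here; computing the actual intersection points takes a few more lines):

from itertools import combinations

def ccw(a, b, c):
    # cross product sign: which side of line a-b the point c lies on
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def intersects(p1, p2, p3, p4):
    d1, d2 = ccw(p3, p4, p1), ccw(p3, p4, p2)
    d3, d4 = ccw(p1, p2, p3), ccw(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0     # proper crossing only

segments = [((0, 0), (4, 4)), ((0, 4), (4, 0)), ((5, 5), (6, 6))]
print([(s, t) for s, t in combinations(segments, 2) if intersects(*s, *t)])
# only the first two segments cross (at (2, 2))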
Try this:
http://keith-hair.net/blog/2008/08/04/find-intersection-point-of-two-lines-in-as3/
The function returns the intersection (if any) between two lines in ActionScript. You'll need to loop through all your lines against each other to get all of them.
Of course the order of the points will be significant if you're planning on filling them - that could be harder!
With ActionScript you can use beginFill and endFill, e.g.
pen_mc.beginFill(0x000000, 100);
pen_mc.moveTo(300, 100);   // starting corner (the original snippet relies on the pen already being here)
pen_mc.lineTo(400, 100);
pen_mc.lineTo(400, 200);
pen_mc.lineTo(300, 200);
pen_mc.lineTo(300, 100);
pen_mc.endFill();
http://www.actionscript.org/resources/articles/212/1/Dynamic-Drawing-Using-ActionScript/Page1.html
Flash CS4 also introduces support for paths:
http://www.flashandmath.com/basic/drawpathCS4/index.html
If you want to get crazy and code your own flood fill then Wikipedia has a decent primer, but I think that would be reinventing the atom for these purposes.

How to implement text selecting?

My question is not language-based or OS-based. I guess every system offers some sort of TextOut(text, x, y) method. I am looking for some guidelines or articles on how I should implement selection of the output text. I could not find any info about this.
The only thing which comes to my mind is this:
When the user clicks some point on the text canvas, I know the coordinates of that point. I need to calculate where exactly it falls in my text buffer. So I traverse the buffer from the beginning and apply to each character (or block of text) its style (if it has any); after this I know what size each letter has under its style, and I add its width and height to the previously calculated X, Y coordinates. In this way I traverse the buffer until the calculated position reaches the point clicked by the user. Once I reach that point, within some offset, I have the starting point for the selection.
This is the basic idea. I don't know if this is good; I would like to know how this is done for real, for example in Firefox. I know I could browse the sources, and if I have no other choice I'll do it. But first I am trying to find some article about it...
Selecting text is inherently specific to the control which contains it and the way it stores that text.
A very simple (though admittedly inefficient) approach is to run the text flow algorithm you are using when the user clicks on a point, and stop the algorithm when you have reached what is closest to that point. More advanced controls might cache the text layout to make selections or drawing their content more efficient. Depending on how much you value CPU time or memory, there are ways to use caches and special cases to make this "hit test" cheaper.
If you can make some assumptions (only one font in the control, so every line has the same height), then it is possible to make these tests cheaper by indexing the font layout by lines and doing simple arithmetic to find out which line was clicked on. If your text control also uses a monospace font (every character occupies the same width as well as height), then you are in even more luck, as you can jump straight to the character information via a lookup table and two simple divisions, as sketched below.
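In code, the monospace shortcut is literally two divisions; the metrics here are made-up values you would normally query from your font API:

def hit_test(x, y, char_width, line_height, lines):
    # map a click at pixel (x, y) to a (line, column) caret position
    row = min(y // line_height, len(lines) - 1)
    col = min(x // char_width, len(lines[row]))
    return row, col

lines = ["Hello world", "second line"]
print(hit_test(37, 18, char_width=8, line_height=16, lines=lines))  # (1, 4)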
Keep in mind that writing a text control from scratch is obscenely difficult. For best practice, you should keep the content of the document separate from the display information. The reason for this is because the text itself will need to be edited quite often, so algorithms such as Ropes or Gap Buffers may be employed on the data side to provide faster insertion around the caret. Every time text is edited it must also be rendered, which involves taking that data and running it through some kind of formatting / flow algorithms to determine how it needs to be displayed to the user. Both of these sides require a lot of algorithms that may be annoying to get right.
Unfortunately using the native TextOut functions will not help you. You will need to use methods which give you the text extents for individual characters, and more advanced (multiline for example) controls often must do their own rendering of characters using this information. Functions like TextOut are not built to deal with blinking insertion carets for example, or performing incremental updates on text layouts. While some TextOut style functions may support word wrap and alignment for you, they also require re-rendering the entire string which becomes more undesirable in proportion to the amount of text you need to work with in your control.
You are thinking at a much lower level than necessary (not an insult; you are thinking you need to do much more work than you actually do). Most (if not all) languages with GUI support will also have some form of selectionRange that gives you either the string that was selected or the start and stop indices in the string.
With a modern language, you should never have to calculate pixels and character widths.
For text selection in Javascript, see this question: Understanding what goes on with textarea selection with JavaScript