I'm re-asking a question that was previously deleted here on SO for not being a "programming question". Hopefully, this is a bit more "programming" than the last post.
First, a few definitions:
model - 2011 Nissan Sentra
trim - 2011 Nissan Sentra LX
Generally, a particular vehicle would have a list of, say, available colors or equipment options. So a 2011 Nissan Sentra may be available in the following colors:
Black
White
Red
Then, the manufacturer may have made a special color only available to the 2011 Nissan Sentra LX trim:
Pink with Yellow Polka Dots
If I were building a car website wherein I wanted to capture this information, which of the following should I do:
Associate the colors to the model?
Associate the colors to the trim?
Associate the colors to the model and trim?
My gut feeling is that associating the colors with the model would be sufficient. Associating them with the trim would mean duplicates (e.g. the 2011 Nissan Sentra LX and 2011 Nissan Sentra SE would both have "Black" as a color). Trying to associate colors with both model and trim might be overkill.
Suggestions?
If there are special cases, as you say, where a manufacturer has made a special color available only to a specific trim, like "Pink with Yellow Polka Dots" for the 2011 Nissan Sentra LX trim, and you want those special cases stored, you should choose the 2nd option.
So, your relationships would be:
1 manufacturer makes many models
1 model has many trims
1 trim can have many colours, and 1 colour can be available on many trims
(so you'll need an association table for this relationship)
Manufacturer
 1\
   \
    \N
   Model
    1\
      \
       \N
      Trim           Colour
       1\             1/
         \            /
          \N         /M
          TrimColour
With additional information about colours: one GeneralColour can be named as many Colours by different Manufacturers, and one Manufacturer can "baptize" a GeneralColour with various Colour (names):
        Manufacturer
        1/        1\
        /           \
       /N            \
    Model             \      GeneralColour
     1\                \         1/
       \                \        /
        \N               \N     /M
       Trim             Colour
        1\               1/
          \              /
           \N           /M
            TrimColour
Thinking about it more clearly, the extra Manufacturer-Colour relationship is not needed:
    Manufacturer
     1\
       \
        \N
      Model              GeneralColour
       1\                    1/
         \                   /
          \N                /M
        Trim             Colour
         1\               1/
           \              /
            \N           /M
            TrimColour
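For concreteness, here is a minimal sketch of that last diagram as SQLite DDL, driven from Python's sqlite3 module; the table and column names are my own illustrative choices, not prescribed ones:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Manufacturer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Model (
    id INTEGER PRIMARY KEY,
    manufacturer_id INTEGER REFERENCES Manufacturer(id),
    name TEXT
);
CREATE TABLE Trim (
    id INTEGER PRIMARY KEY,
    model_id INTEGER REFERENCES Model(id),
    name TEXT
);
CREATE TABLE GeneralColour (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Colour (
    id INTEGER PRIMARY KEY,
    general_colour_id INTEGER REFERENCES GeneralColour(id),
    name TEXT  -- the manufacturer's marketing name for the colour
);
-- the M:N association table between Trim and Colour
CREATE TABLE TrimColour (
    trim_id INTEGER REFERENCES Trim(id),
    colour_id INTEGER REFERENCES Colour(id),
    PRIMARY KEY (trim_id, colour_id)
);
""")

Special cases like "Pink with Yellow Polka Dots" then become just one more row in Colour plus a single row in TrimColour pointing at the LX trim.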
If different trims of the same model may have different color options (as you imply), then you should associate the color with the trim; otherwise you will have incorrect/incompatible information. I.e., if "Pink with Yellow Polka Dots" is associated with the "2011 Nissan Sentra" model, then you will incorrectly show it as an option for trims other than the LX.
You're also missing the association of the trim to the model; without that, I don't know that you can properly complete your associations.
As requested in response to my comment...
I would just make 'color' a free-form text field, possibly with a pre-populated drop-down showing current popular colors in the database. The main advantage is that it makes your DB schema much simpler, and keeps your car model/color researchers from going insane. But it also allows for custom paint jobs that aren't available from the manufacturer at all.
manufacturers
-------------
id
models
------
id
manufacturer (FK to manufacturers.id)
model_name (VARCHAR)
trims
-----
id
model (FK to models.id)
cars
-------
id
trim (FK to trims.id)
year INT
color VARCHAR
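To feed that pre-populated drop-down of currently popular colors, a simple aggregate over the cars table is enough. A sketch, assuming the schema above lives in a SQLite file called cars.db (a hypothetical name):

import sqlite3

conn = sqlite3.connect("cars.db")  # hypothetical database file
popular_colors = conn.execute(
    """SELECT color, COUNT(*) AS n
       FROM cars
       GROUP BY color
       ORDER BY n DESC
       LIMIT 20"""
).fetchall()  # [(color, count), ...] rows for the drop-down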
If I were building a car website wherein I wanted to capture this
information
then you'd have to build a logical model that captured that information. (How hard was that?) And that means you have to model these facts.
Some colors apply to the model.
Some colors apply to the trim package.
(And I'll bet I can find a manufacturer where some colors apply to
the make.)
(And I'll bet that all these colors also have something to do with the year.)
Capturing all the known requirements is one thing. Implementing them is another. Once you understand how the colors actually work,
you're free to ignore whatever real-world behavior you want to.
But, as Dr. Phil often says,
"When you choose the behavior, you choose the consequences."
Simplifying the known requirements--ignoring the fact that some colors apply only to one or two trim packages--means you design your database to deliberately allow invalid data. Your database might end up with information about a "Pink with Yellow Polka Dots" Nissan Altima, or a "Copper" 2002 Nissan Sentra. (I think Nissan introduced copper in 2004.)
So here's the real question.
How much bad data can you tolerate?
That's always going to be application-dependent. A social media site that collected information about your car color would be a lot more tolerant of impossible color choices than a company that sells touch-up paint.
Related
I want to train my own model to detect and recognize ID cards with Tesseract. I want to extract key information like name and ID from them. The data looks like: [sample of data]
The training introduction can only take single-line text as input. I'm confused about how to train the detection model in Tesseract, and whether I should label single characters or the whole text line in each box. (https://github.com/tesseract-ocr/tesstrain)
One-by-one character replacement from image to text is based on training in groups.
So here, in the first Tesseract training test sample, the idea is to let Tesseract understand that the ch ligature is to be output as two letters, that δ is to be lower-case d, f as k, and that Uber is Aber, etc.
However, that does not correct the spelling of words without a dictionary of accepted character permutations, so you need either to train all the words you expect (e.g. 123 is allowed but not 321) or else to allow all numbers.
The problem then is: should ¦ be i, |, l, 1 or !? Only human, intelligent context is likely to agree on what is 100% correct, especially with italics: is / an i, |, l, 1 or !, or is it an italic /?
The clearer the contrast between the characters and the background, the better the result will usually be, and well-defined void space within a character will help distinguish B from 8; thus resolution is also a help or a hindrance.
= INT 3O 80 S~A MARIA
A dictionary entry of BO and STA would possibly help in this case.
Oh, I think I get it. Tesseract doesn't need a detection model to get the position of the text line; it recognizes each blob (letter) and uses the position of each letter to locate the text line.
Hi, I am trying to do one-hot encoding in Orange in order to conduct market basket analysis.
Currently I have transaction data as follows in my CSV:
C#    Items
C1    Apple, Orange
C2    Baby Milk, Apple, Orange
I would like to know what steps I can take, in Orange or other software, to process the data so that I can get it into this state:
C#    Apple    Orange    Baby Milk
C1    1        1         0
C2    1        1         1
Currently, when I try to preprocess the data in Orange using "continuous discrete variables - one feature per line", I get individual feature-value columns.
It is not entirely straightforward, but you could concatenate your products with a comma or semicolon, pass the result to Corpus, apply tokenization on your concatenation character (comma, semicolon) with a Regex, then use Bag of Words from the Text add-on. I have tried it with the Associate add-on, and it seems to work.
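If "other software" is acceptable, the same one-hot shape can also be produced outside Orange. A sketch with pandas, where the DataFrame holds the question's sample transactions in long format:

import pandas as pd

# Transactions in "long" format: one row per (customer, item) pair
df = pd.DataFrame({
    "C#":    ["C1", "C1", "C2", "C2", "C2"],
    "Items": ["Apple", "Orange", "Baby Milk", "Apple", "Orange"],
})

# crosstab counts item occurrences per customer;
# clip(upper=1) turns the counts into 0/1 indicators
onehot = pd.crosstab(df["C#"], df["Items"]).clip(upper=1)
print(onehot)

The resulting frame has one row per C# and one 0/1 column per item, which is exactly the shape needed for market basket analysis.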
I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, with 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s to train and 0s to test), and then feed the vectors into a classifier like random forest.
For such a task you'll need to parse your data first with different tools. First lower-case all your sentences. Then delete all stopwords (the, and, or, ...). Tokenize (an introduction here: https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming in order to keep only the root of the word; it can be helpful for sentiment classification.
Then you'll assign an index to each word of your vocabulary and replace the words in your sentences with these indexes:
Imagine your vocabulary is: ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['None'] = 0 # in case a new word is not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1 2 3].
However, you have to define a maximum length max_len for your sentences, and when a sentence contains fewer than max_len words, you pad your vector to size max_len with zeros.
In the previous example, if max_len = 5 then [1 2 3] -> [1 2 3 0 0].
This is a basic approach. Feel free to check preprocessing tools provided by libraries such as NLTK, Pandas ...
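A minimal pure-Python sketch of the indexing-and-padding scheme described above; build_index and encode are hypothetical helper names, and a real project would lean on NLTK or Keras tokenizers instead:

def build_index(sentences):
    index = {"None": 0}  # 0 is reserved for unknown words and padding
    for sent in sentences:
        for word in sent.lower().split():
            index.setdefault(word, len(index))
    return index

def encode(sentence, index, max_len=5):
    ids = [index.get(w, 0) for w in sentence.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # zero-pad up to max_len

index = build_index(["i love keras", "i love pytorch", "i love tensorflow"])
print(encode("I love Keras", index))  # -> [1, 2, 3, 0, 0]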
I have 3 different datasets; all of them are blood smear images stained with the same chemical substance. Blood smear images are images that capture your blood, including the red and white blood cells inside.
The first dataset contains 2 classes: normal vs blood cancer
The second dataset contains 2 classes: normal vs blood infection
The third dataset contains 2 classes: normal vs sickle cell disease
So, what I want to do is: when I input a blood smear image, the AI system will tell me whether it is normal, blood cancer, blood infection, or sickle cell disease (a 4-class classification task).
What should I do?
Should I mix these 3 datasets and train only 1 model to detect 4 classes?
Or should I train 3 different models and then combine them? If yes, what method should I use to combine them?
Update: I searched for a while. Can this task be called "learning without forgetting"?
I think it depends on the data.
You may use three different models and make three binary predictions for each image. So you get a vote (probability) for each x vs. normal. If the binary classifications are accurate, this should deliver okay results. But you kind of get a cumulated misclassification error in this case.
If you can afford it, you can train a four-class model and compare its test error to the series of binary classifications. I understand that you already have three models, so training another one may not be too expensive.
If ONLY one of the classes can occur at a time, a four-class model might be the way to go. If in fact two (or more) classes can occur jointly, a series of binary classifications would make sense.
As @Peter said, it is totally data-dependent. If the images of the 4 classes, namely normal, blood cancer, blood infection, and sickle cell disease, are easily distinguishable with the naked eye and there is no scope for confusion among the classes, then you should simply go for 1 model which outputs probabilities for all 4 classes (as mentioned by @maxi marufo). If there is confusion between classes and the images are NOT distinguishable with the naked eye, then you should use 3 different models, but then you'll need to combine their outputs. You simply get the predicted probabilities from all 3 models, say p1(normal) and p1(c1), p2(normal) and p2(c2), p3(normal) and p3(c3). Now you can average p1(normal), p2(normal), p3(normal), and then use a softmax over p(normal), p1(c1), p2(c2), p3(c3). Out of the multiple ways you could try, the above could be one.
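A small numpy sketch of that combination scheme; the probability values below are made-up placeholders standing in for the outputs of your three binary models:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical outputs of the three binary models on one image
p1_normal, p1_cancer    = 0.7, 0.3
p2_normal, p2_infection = 0.6, 0.4
p3_normal, p3_sickle    = 0.8, 0.2

p_normal = np.mean([p1_normal, p2_normal, p3_normal])
scores = softmax(np.array([p_normal, p1_cancer, p2_infection, p3_sickle]))
print(scores)  # 4 scores: normal, cancer, infection, sickle cell disease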
This is a multiclass classification problem. You can train just one model, with the final layer being a fully connected (dense) layer of 4 units (i.e. the output dimension) and a softmax activation function.
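For example, a minimal Keras sketch of such a 4-class head; the convolutional base and the input size are placeholder assumptions, not a recommended architecture:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),         # assumed image size
    layers.Conv2D(32, 3, activation="relu"),  # placeholder feature extractor
    layers.GlobalAveragePooling2D(),
    layers.Dense(4, activation="softmax"),    # final layer: 4 units, one per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])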
I'm not a Natural Language Processing student, yet I know it's not as trivial as strcmp(n1, n2).
Here's what I've learned so far:
comparing personal names can't be solved 100%
there are ways to achieve a certain degree of accuracy.
the answer will be locale-specific; that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but comparing 2 names, e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
c = [v for v in a if v in b]
return float(len(c)) / (len(a)+len(b)-len(c))
def name_compare(name1, name2):
return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
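A minimal hand-rolled sketch of American Soundex in Python (written out here rather than relying on a particular library; soundex is a hypothetical helper name):

def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    encoded = name[0].upper()          # keep the first letter
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:      # skip adjacent letters with the same code
            encoded += code
        if ch not in "hw":             # h and w do not reset the previous code
            prev = code
    return (encoded + "000")[:4]       # pad/truncate to letter + 3 digits

>>> soundex("Tsakala") == soundex("Tsakela")
True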
We've just been doing this sort of work non-stop lately, and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-English names, then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list, and when Tsakala did not match Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English-speaking world, middle names are inconsistently recorded, so you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard", and possibly "Steve" and "Stephen". In some communities it is quite common for 2 or 3 generations with the same name to live at the same address. The only way you can separate them is by date of birth. Date of birth may or may not be recorded; if you have the clout, then you should probably make recording the date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give it away due to privacy reasons.
Effectively, people-name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched, and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list, or may be able to look up details of the person on the internet, but you can't really expect your programme to do that.
Banks, credit-rating organisations and the government have a lot of detailed information about us: previous addresses, dates of birth, etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
I had real problems with Tanimoto using UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
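For example, it is in the standard library and works directly on Unicode strings, diacritics included; ratio() returns a similarity in [0, 1]:

import difflib

# Higher ratio means more similar; no encoding gymnastics needed
difflib.SequenceMatcher(None, "Berry Tsakala", "Bernard Tsakala").ratio()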