Hi I am trying to do one hot encoding in Orange in order to conduct market basket analysis.
Currently I have transaction data as follows in my CSV:
C#
Items
C1
Apple
Orange
C2
Baby Milk
Apple
Orange
I would like to find out what are the steps that I can do to process the data in orange or other software such that I am able to get this state for my data
C#
Apple
Orange
Baby Milk
C1
1
1
0
C2
1
1
1
Currently when I try to preprocess the data in orange using "continous discrete variables - one feature per line" I get individual feature value columns.
It is not entirely straightforward, but you could concatenate your products with comma or semicolon, pass it to Corpus, apply tokenization based on your concatenation character (comma, semicolon) with a Regex, then use Bag of Words from the Text add-on. I have tried it with Associate add-on, and it seems to work.
Related
I have a table like below and I want to return the name of the item with the greatest effect of a particular type. For example, I want the name of the ring with the best 'Shield' enchantment, in this case 'Brusef Amelion's Ring'.
Description
Apparel slot
Effect Type
Effect Value
Apron of Adroitness
Chest
Fortify Agility
5 pts
Brusef Amelion's Ring
Ring
Shield
18%
Cuirass of the Herald
Chest
Fortify Health
15 pts
Fortify Magicka Pants
Legs
Fortify Magicka
20 pts
Grand ring of Aegis
Ring
Shield
6%
I've tried using a MAXIFS statement:
=maxifs(Effect Value, Apparel Slot, "=Ring", Effect Type, "=Shield")
and that returns 0.18, as I'd expect. But I want it to return the name of this item, 'Brusef Amelion's Ring'. So I then tried using a vlookup on this value, but there doesn't seem to be the option to only lookup a value if ('Apparel Slot'='Ring' && 'Effect Type'='Shield'), for example.
I feel like there must be a way of nesting some specific functions here, but I can't quite figure it out.
Is there any way to do this while avoiding manually sorting and filtering my data before running each query?
Is this what you are looking for?
=DGET(A1:D6,"Description",{"Apparel slot","Effect Value";"Ring",MAXIFS(D2:D6,B2:B6,"Ring",C2:C6,"Shield")})
DGET function needs column headers to work with as you can see. So we want the value from the column Description. "Originally" this is a DGET function:
DGET(A2:F20,"price",{"Ticker";"Google"})
This is saying: Find price where ticker = google.
We can enhance this a bit by entering two criteria's this is done , or \ separated (depends on your country settings) like this:
DGET(A2:F20,"price",{"Ticker","Year;"Google",2020})
DGET(A2:F20;"price";{"Ticker"\"Year;"Google"\2020})
Find price where ticker = google and year = 2020.
Ofcource you can replace "Google" with a cell reference. Hope this helps.
I have two csv files containing countries and values that correspond to each country.
The data from CSV 1 denotes the number of times a country has been attacked on their own soil.
The data from CSV 2 denotes the number of times a country has attacked another country abroad.
There is overlap between the two sets of data and I intend to demonstrate values from both data sets in one grey scale range to be shown on a choropleth map.
I have some (obviously) phony data below to demonstrate what I'm working with.
TARGET.csv
country, code, value
Iran, IRN, 5
Russia, RUS, 4
United States, USA, 0
Egypt, EGY, 2
Spain, ESP, 1
ATTACKER.csv
country, code, value
Iran, IRN, 3
Russia, RUS, 9
United States, USA, 4
Egypt, EGY, 0
Spain, ESP, 0
There are more targets than attackers.
I want to ensure that I represent the data accurately, but do not know how I would create a normalized range of values between -1 and 1.
It is my understanding that displaying the data in this way would accurately represent the reality best, but I feel like I may be wrong.
In summation:
1) Am I thinking about this problem properly? Is this even the right way to think about displaying the data?
2) What is the proper language used to describe my question?
I am usually able to figure these things out but I'm stumped with dead-end search queries.
3) How do I make sure that my range is normalized. Notice that USA above appears as the only attacker who has never been a target, Would that make the USA the nearest value to +1, despite Russia's larger number of attacks?
I would appreciate whatever input you all can offer.
I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.
Following suggestions on other questions, I've tried playing with --psm and hocr without great success.
Working with a jpg I've posted on github, and converting it into a text-embedded pdf using this code tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf I get this result:
Clearly the engine is making a bloc containing the indented lines, and another containing the flush lines.
Confirming this is the text output of the flush lines:
Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.
to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.
application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——
Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)
you can user
psm 4 oem 1
or psm 4 oem 3
to get better text and accuracy
I have a 3d cylinder chart that I am having some problems with. I want to effectively sort the cylinders with the highest value at the back and the lowest value at the front. Otherwise the tallest valuest cover the smallest values.
I have tried sorting both a-z and z-a but I really need it to be dynamic based on the values. I have also tried sorting the values by the actual value field. both a-z and z-a but this seems to return completely random results.
the data in the database (example) looks like. I use a parameter to separate by supplier.
Date catgeory_Type cost supplier
01/01/2013 apple $5 abc
01/01/2013 pear $10 def
01/01/2013 bannana $15 cgi
01/02/2013 apple $7 etc
01/02/2013 pear $12 etc
01/02/2013 banana $18 etc
I believe I need some form of expression that sorts the values based on cost. as both a-z and z-a in the instance would provide cylinders that blocked other cylinders.
I have tried sorting the series group by :=Sum(Fields!cost.Value, "DataSet1") and =Fields!cost.Value but this seems to return random results.
I would be happy even if I could achieve a custom sort such as sort by "bannana, pear, apple" although for some "suppliers" this would still cause me an issue.
edit 1: strangely enough this works with a line chart but not a 3d cylinder
edit 2: example
attached is an example. I want the tallest cylinders at the back. but methods mentioned above do not work
In chart area properties -> 3D-options , Enable,
series clustering
Choose this option to cluster series groups. When multiple series for
bar or column charts are clustered, they are displayed along two
distinct rows in the chart area. If series are not clustered, their
corresponding data points are displayed adjacent to each other in one
row. This option is applicable only to bar and column charts.
Also try changing the Rotation & Inclination degrees, to get a better look.
Decrease wall thickness also.
I'm re-asking a question that was previously deleted here in SO for not being a "programming question". Hopefully, this is a bit more "programming" than the last post.
First, a few definitions:
model - 2011 Nissan Sentra
trim - 2011 Nissan Sentra LX
Generally, a particular vehicle would have a list of, say, available colors or equipment options. So a 2011 Nissan Sentra may be available in the following colors:
Black
White
Red
Then, the manufacturer may have made a special color only available to the 2011 Nissan Sentra LX trim:
Pink with Yellow Polka Dots
If I were building a car website wherein I wanted to capture this information, which of the following should I do:
Associate the colors to the model?
Associate the colors to the trim?
Associate the colors to the model and trim?
My gut feeling is that associating it to the model would be sufficient. Associating to trim would mean duplicates (e.g. 2011 Nissan Sentra LX and 2011 Nissan Sentre SE would both have "Black" as a color). Trying to associate colors to model and trim might be overkill.
Suggestions?
If there are special cases, as you say, where a manufacturer has made a special color only available to a specific trim, like "Pink with Yellow Polka Dots" for the "2011 Nissan Sentra LX trim"
and you want to have those special case stored, you should choose the 2nd option.
So, your relationships would be:
1 manufacturer makes many models
1 model has many trims
1 trim can have many colors and for 1 colour many trims have it
(so you'll need an association table for this relationship)
Manufacturer
1\
\
\N
Model
1\
\
\N
Trim Colour
1\ 1/
\ /
\N /M
TrimColour
With additional information about colours:
One GeneralColour can be named as many Colours by different Manufacturers and
one Manufacturer can "baptize" a GeneralColour with various Colour (names)
Manufacturer
1/ 1\
/ \
/N \
Model \ GeneralColour
1\ \ 1/
\ \ /
\N \N /M
Trim Colour
1\ 1/
\ /
\N /M
TrimColour
Thinking more clearly, the extra Manufacturer-Colour relationship is not needed:
Manufacturer
1\
\
\N
Model GeneralColour
1\ 1/
\ /
\N /M
Trim Colour
1\ 1/
\ /
\N /M
TrimColour
If different trims for the same model may have different color options (as you imply) then you should associate the color to the trim, otherwise you will have incorrect/incompatible information. aka If "pink with yellow polka dots" is associated to the "2011 Nissan Sentra" model then you will incorrectly show it as an option for trims other than LX.
You're missing the association of the trim to the model; without that, I don't know that you can really properly complete your associations.
As requested in response to my comment...
I would just make 'color' a free-form text field, possibly with a pre-populated drop-down showing current popular colors in the database. The main advantage is that it makes your DB schema much simpler, and keeps your car model/color researchers from going insane. But it also allows for custom paint jobs that aren't available from the manufacturer at all.
manufacturers
-------------
id
models
------
id
manufacturer (FK to manufacturers.id)
model_name (VARCHAR)
trims
-----
id
model (FK to models.id)
cars
-------
id
trim (FK to trims.id)
year INT
color VARCHAR
If I were building a car website wherein I wanted to capture this
information
then you'd have to build a logical model that captured that information. (How hard was that?) And that means you have to model these facts.
Some colors apply to the model.
Some colors apply to the trim package.
(And I'll bet I can find a manufacturer where some colors apply to
the make.)
(And I'll bet that all these colors also have something to do with the year.)
Capturing all the known requirements is one thing. Implementing them is another. Once you understand how the colors actually work,
you're free to ignore whatever real-world behavior you want to.
But, as Dr. Phil often says,
"When you choose the behavior, you choose the consequences."
Simplifying the known requirements--ignoring the fact that some colors apply only to one or two trim packages--means you design your database to deliberately allow invalid data. Your database might end up with information about a "Pink with Yellow Polka Dots" Nissan Altima, or a "Copper" 2002 Nissan Sentra. (I think Nissan introduced copper in 2004.)
So here's the real question.
How much bad data can you tolerate?
That's always going to be application-dependent. A social media site that collected information about your car color would be a lot more tolerant of impossible color choices than a company that sells touch-up paint.