How to interpret data extracted with Latent Dirichlet allocation - lda

I'm analyzing some files extracted with LDA. I have learned some basics of LDA from this introduction: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
I have three files:
Topic ids (5x600)
59673453648 64345309472 1.23984E+11 52934539940 1.06263E+14
1.05643E+14 1.44524E+11 1.09535E+14 1.06368E+14 62248718804
1.12535E+14 1.13771E+14 1.70701E+14 1.86305E+14 1.9114E+14
Topic names (5x600)
TBBT The_Mummy Spider-Man Inception Shrek
Outfitters Cheerleading Chanel Victoria's Secret LV
Pia Mia Ciara Usher Jay-z Akon
data.df (600X1100000)
id
1 0.000111111 0.000111111 0.000111111 0.000111111 0.000111111
2 9.883309999 9.883309999 9.883309999 9.883309999 9.883309999
3 6.772454300 6.772454300 6.772454300 6.772454300 6.772454300
I assume the topic ids are matched with the topic names, but how do I interpret the scores in data.df (600 columns)?

Related

1NF, 2NF, AND 3NF Normalization?

I have a teacher who doesn't like to explain to the class but loves putting up review questions for upcoming tests. Can anyone explain the image above? My main concern is the red underline, which shows that supplier and supplierPhone are repeated values. I thought that repeated values occurred when there were many occurrences of the same item in a column.
Another question I have: if Supplier is a repeating value, why isn't Part_Name a repeating value? They both have 2 items with the same names in their columns.
Example:
It's repeated because the result of the tuple is always the same. E.g. ABC Plastics will always have the same phone number, therefore having 2 rows with ABC Plastics means that we have redundant information in the phone number.
Part1 Company1 12341234
Part2 Company1 12341234
We could represent the same information with:
Part1 Company1
Part2 Company1
And
Company1 12341234.
Therefore having two rows with the same phone number is redundant.
This should answer your second question as well.
Essentially you're looking for tuples such that given the tuple (X, Y) exists, if there exists another tuple (X, Y') then Y' = Y
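That tuple condition is a functional dependency, and it can be checked mechanically. A minimal sketch in Python (the data and the helper name are illustrative, not from the question):

```python
# Check whether Y is functionally dependent on X: every X maps to exactly one Y.
def fd_violations(rows):
    """Return a dict {x: set_of_ys} for every X that maps to different Y values."""
    seen = {}
    for x, y in rows:
        seen.setdefault(x, set()).add(y)
    return {x: ys for x, ys in seen.items() if len(ys) > 1}

# Supplier -> phone: no violation, so repeating the phone per row is redundant
# and the pair (supplier, phone) should be stored once in its own table.
parts = [("ABC Plastics", "12341234"), ("ABC Plastics", "12341234")]
print(fd_violations(parts))  # {} -- X determines Y
```

If `fd_violations` returns an empty dict, X determines Y and the (X, Y) pairs belong in a separate table keyed by X.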
Looks like five tables to me.
model (entity)
modelid description price
1 Laserjet 423.56
2 256 Colour 89.99
part (entity)
partid name
PF123 Paper Spool
LC423 Laserjet cartridge
MT123 Power supply
etc
bill_of_materials (many to many relationship model >--< part )
modelid partid qty
1 PF123 2
1 LC423 4
1 MT123 1
2 MT123 2
supplier (entity)
supplier_id phone name
1 416-234-2342 ABC Plastics
2 905.. Jetson Carbons
3 767... ACME Power Supply
etc.
part_supplier (many to many relationship part >--< supplier )
part_id supplier_id
PF123 1
LC423 2
MT123 3
etc.
You have one row in model, part, supplier for each distinct entity
You have rows in bill_of_materials for each part that goes into each model.
You have a row in part_supplier for each supplier that can furnish each part. Notice that more than one part can come from one supplier, and more than one supplier can furnish each part. That's a many-to-many relationship.
The trick: Figure out what physical things you have in your application domain. Then make a table for each one. Then figure out how they relate to each other (that's what makes it relational.)
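The five tables above can be sketched end to end with SQLite (column types and the sample inserts are assumptions for illustration):

```python
import sqlite3

# Sketch of the five-table design above; types are assumed, data is sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model (modelid INTEGER PRIMARY KEY, description TEXT, price REAL);
CREATE TABLE part (partid TEXT PRIMARY KEY, name TEXT);
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, phone TEXT, name TEXT);
CREATE TABLE bill_of_materials (
    modelid INTEGER REFERENCES model, partid TEXT REFERENCES part, qty INTEGER);
CREATE TABLE part_supplier (
    part_id TEXT REFERENCES part, supplier_id INTEGER REFERENCES supplier);
""")
conn.executemany("INSERT INTO part VALUES (?, ?)",
                 [("PF123", "Paper Spool"), ("LC423", "Laserjet cartridge")])
conn.executemany("INSERT INTO supplier VALUES (?, ?, ?)",
                 [(1, "416-234-2342", "ABC Plastics")])
conn.executemany("INSERT INTO part_supplier VALUES (?, ?)",
                 [("PF123", 1), ("LC423", 1)])
# One supplier furnishes many parts, yet the phone number is stored exactly once.
rows = conn.execute("""
    SELECT p.name, s.phone FROM part p
    JOIN part_supplier ps ON ps.part_id = p.partid
    JOIN supplier s ON s.supplier_id = ps.supplier_id
    ORDER BY p.name""").fetchall()
print(rows)
```

The join reconstitutes the original wide rows on demand, so no information is lost by splitting the tables.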

Is there some kind of way to import data that consists of multiple rows?

In RapidMiner the data table I usually see is like this:
Row Age Class
1 19 Adult
2 10 Minor
3 15 Teenager
In the data table above this sentence, one row refers to one complete information.
But how do I input a data table to RapidMiner where more than one row refers to one complete information?
For example:
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
2 Hello 1.2819 0.8238 1.3465
3 Hello 1.3963 0.1758 1.4320
4 Eat 1.3918 0.3883 1.1756
5 Eat 1.4742 0.0526 1.2312
6 Eat 0.6698 0.2548 1.4769
7 Eat 0.3074 1.2214 0.2059
In the data table above, rows 1-3 refer to one complete piece of information: the combination of rho, theta, and phi from rows 1-3 means the word "hello". The same goes for rows 4-7, which together mean the word "eat". For further explanation, take a look at the table below.
Row Rho Theta Phi Word
----------------------------
1 |0.9384 0.4943 1.2750|
2 |1.2819 0.8238 1.3465| HELLO
3 |1.3963 0.1758 1.4320|
----------------------------
4 |1.3918 0.3883 1.1756|
5 |1.4742 0.0526 1.2312|
6 |0.6698 0.2548 1.4769| EAT
7 |0.3074 1.2214 0.2059|
----------------------------
Again, my problem is: how do I insert this kind of data table into RapidMiner so that it understands that multiple rows refer to one complete piece of information? Is there some kind of table like the one I displayed below?
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
. Hello 1.2819 0.8238 1.3465
1 Hello 1.3963 0.1758 1.4320
2 Eat 1.3918 0.3883 1.1756
. Eat 1.4742 0.0526 1.2312
. Eat 0.6698 0.2548 1.4769
2 Eat 0.3074 1.2214 0.2059
You can try the Pivot operator to group your results by word.
To do so, I would set the group attribute parameter to "Word" and the index attribute parameter to "Row". It's not exactly the same representation, but it may be close enough, depending on your use case, as multi-row examples are not part of RapidMiner's design.
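Outside RapidMiner, what Pivot does here can be sketched in plain Python: group by word, number the readings within each group (the index), and emit one record per word (only the Rho column is shown, to keep the sketch short):

```python
# Pivot sketch: one input row per reading, one output record per word,
# with readings numbered Rho_1, Rho_2, ... within each word.
readings = [
    (1, "Hello", 0.9384), (2, "Hello", 1.2819), (3, "Hello", 1.3963),
    (4, "Eat", 1.3918), (5, "Eat", 1.4742), (6, "Eat", 0.6698), (7, "Eat", 0.3074),
]
wide = {}
for _, word, rho in readings:
    rec = wide.setdefault(word, {})
    rec[f"Rho_{len(rec) + 1}"] = rho   # next index within this word's group
print(wide["Hello"])  # {'Rho_1': 0.9384, 'Rho_2': 1.2819, 'Rho_3': 1.3963}
print(wide["Eat"])
```

Words with different numbers of readings simply end up with different numbers of columns, which is why the pivoted representation has missing values for shorter groups.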

Panel data or time-series data and xt regression

I need help with a simple regression as well as an xt regression for panel data.
The dataset consists of 16 participants in which daily observations were made.
I would like to observe the difference between pre-test (from the first date on which observations were taken) and post-test (the last date on which observations were made) across different variables.
I was also advised to run xtreg, re.
What is this re, and what is its significance?
If the goal is to fit some xt model at the end, you will need the data in long form. I would use:
bysort id (year): keep if inlist(_n,1,_N)
For each id, this puts the data in ascending chronological order, and keeps the first and last observation for each id.
The RE part of your question is off-topic here. Try Statalist or CV SE site, but do augment your questions with details of the data and what you hope to accomplish. These may also reveal that getting rid of the intermediate data is a bad idea.
Edit:
Add this after the part above:
bysort id (year): gen t= _n
reshape wide x year, i(id) j(t)
order id x1 x2 year1 year2
Perhaps this sample code will set you in the direction you seek.
clear
input id year x
1 2001 11
1 2002 12
1 2003 13
1 2004 14
2 2001 21
2 2002 22
2 2003 23
3 2005 35
end
xtset id year
bysort id (year): generate firstx = x[1]
bysort id (year): generate lastx = x[_N]
list, sepby(id)
With regard to xtreg, re: that fits a random-effects model. See help xtreg for more details, as well as the documentation for xtreg in the Stata Longitudinal-Data/Panel-Data Reference Manual included with your Stata documentation.
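For readers outside Stata, the keep-first-and-last step can be sketched in Python (data loosely follows the example above; this is an illustration, not the answer's code):

```python
from itertools import groupby

# Keep the first and last observation per id, mirroring the Stata line
# `bysort id (year): keep if inlist(_n, 1, _N)`.
data = [
    (1, 2001, 11), (1, 2002, 12), (1, 2003, 13), (1, 2004, 14),
    (2, 2001, 21), (2, 2002, 22), (2, 2003, 23),
    (3, 2005, 35),
]
data.sort()                      # ascending by (id, year), like bysort id (year)
kept = []
for _id, grp in groupby(data, key=lambda row: row[0]):
    grp = list(grp)
    kept.append(grp[0])          # first observation (_n == 1)
    if len(grp) > 1:
        kept.append(grp[-1])     # last observation (_n == _N)
print(kept)
```

An id with a single observation is kept exactly once, just as inlist(_n, 1, _N) keeps the row where _n == 1 == _N.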

Designing database for Multilanguage dictionary

I want to make a database for an English-to-Myanmar multi-ethnic dictionary.
Currently, I have 4 dictionary datasets for English to Myanmar/Burma languages.
Example:
English#WordType#Translation#Language#Eg_Sentences#Synonyms
Love #(n)# translation # Language 1 #Eg_Sentences#Synonyms
Love #(v)# translation # Language 2 #Eg_Sentences#Synonyms
Love #(v)# translation # Language 3 #Eg_Sentences#Synonyms
Love #(n)# translation # Language 4 #Eg_Sentences#Synonyms
"WordType" can be "verb", "noun" , "adverb", etc...
Here is example data entries
http://i.imgur.com/hbE60Vm.png
I am not good at database design; I am thinking of creating 6 tables:
English_Word
PartOfSpeech
Translation
Language
Synonyms
Example_Sentences
In the near future, I want to add more Myanmar/Burma ethnic languages.
One English word can have many word types (noun, verb, adverb, ...) and can be translated (just as strings) into many languages.
But how is the relations between them? Here is my first database schema.
Could you give me feedback on my draft database table design?
my draft database table design
Many thanks in advance.
regards,
id should be an auto-increment field, and we can treat all words (in all languages) as words.
WORD (ID, NAME, LANGUAGE_ID, TYPE_ID) // LANGUAGE_ID AND TYPE_ID ARE FOREIGN KEYS
TYPE (ID, NAME) // PART OF SPEECH
LANGUAGE (ID, NAME)
SENTENCE (ID, EXAMPLE)
SYNONYMS (ID, WORD_ID, SYNONYM_ID) // WORD_ID AND SYNONYM_ID ARE FOREIGN KEYS TO THE WORD TABLE (A SYNONYM IS ALSO A WORD)
DICTIONARY (WORD_ID, TRANSLATION_ID, SENTENCE_ID) // A TRANSLATION IS ALSO A WORD
LANGUAGE
id name
1 english
2 tamil
3 gggggg
TYPE
ID NAME
1 NOUN
2 MMMMM
WORD
ID NAME LANGUAGE_ID TYPE_ID
1 LOVE 1 1
2 $#%#$% 2 1
3 PASSION 1 1
4 HHJHJHJ 1 2
SENTENCE
ID EXAMPLE
1 HAFKAKJDHAKJFHAHLFALFLASLDAL
SYNONYMS
ID WORD_ID SYNONYM_ID
1 1 3 // LOVE => SYNONYM PASSION
2 1 4
DICTIONARY
WORD_ID TRANSLATION_ID SENTENCE_ID
1 2 1 //LOVE $#%#$% HAFKAKJDHAKJFHAHLFALFLASLDAL
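The proposed schema can be tried out directly with SQLite (column types, the self-referencing DICTIONARY join, and the sample words are assumptions for illustration):

```python
import sqlite3

# Sketch of the answer's schema; types are assumed, sample data is illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE language (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
CREATE TABLE type     (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
CREATE TABLE word (
    id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT,
    language_id INTEGER REFERENCES language, type_id INTEGER REFERENCES type);
CREATE TABLE sentence (id INTEGER PRIMARY KEY AUTOINCREMENT, example TEXT);
CREATE TABLE synonyms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    word_id INTEGER REFERENCES word, synonym_id INTEGER REFERENCES word);
CREATE TABLE dictionary (
    word_id INTEGER REFERENCES word,
    translation_id INTEGER REFERENCES word,  -- a translation is also a word
    sentence_id INTEGER REFERENCES sentence);
""")
db.execute("INSERT INTO language (name) VALUES ('english'), ('tamil')")
db.execute("INSERT INTO type (name) VALUES ('noun')")
db.execute("INSERT INTO word (name, language_id, type_id) VALUES "
           "('love', 1, 1), ('anbu', 2, 1)")
db.execute("INSERT INTO dictionary VALUES (1, 2, NULL)")
# Joining word twice resolves both sides of a dictionary entry.
row = db.execute("""
    SELECT w.name, t.name FROM dictionary d
    JOIN word w ON w.id = d.word_id
    JOIN word t ON t.id = d.translation_id""").fetchone()
print(row)  # ('love', 'anbu')
```

Adding a new ethnic language later only means inserting a row into language and more rows into word and dictionary; no table changes are needed.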

Isolating unique observations and calculating the average in Stata

Currently I have a dataset that appears as follows:
mnbr firm contribution
1591 2 1
9246 6 1
812 6 1
674 6 1
And so on. The idea is that mnbr is the member number of employees who work at firm # whatever. If contribution is 1 (and I have dropped all the 0s for this purpose) said employee has contributed to a certain fund.
I additionally used codebook to determine the number of unique firms. The goal is to determine the average number of contributions per firm, i.e. there was 1 contribution for firm 2, 3 contributions for firm 6, and so on. The problem I run into is accessing that number of unique values from codebook.
I read some documentation online for
inspect *varlist*
display r(N_unique)
which suggests to me that r(N_unique) would store that value, yet unfortunately this method did not work for me. So that is part 1.
Part 2 is I'd also like to create a variable that shows the contributions in each firm i.e.
mnbr firm contribution average
1591 2 1 1
9246 6 . 2/3
812 6 1 2/3
674 6 1 2/3
to show that for firm 6, 2 out of the 3 employees contributed to this fund.
Thanks in advance for the help.
To answer your comment, this works for me:
clear
set more off
input ///
mnbr firm cont
1591 2 1
9246 6 .
812 6 1
674 6 1
end
list
// problem 1
inspect firm
display r(N_unique)
// problem 2
bysort firm: egen totc = total(cont)
by firm: gen share = totc / _N
list
You have to use r(N_unique) before running another command that overwrites r-class results, or it will be lost. You can also save the result to a local or a scalar.
Problem 2 is also addressed.
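For comparison outside Stata, the same two computations can be sketched in Python (sample data mirrors the input block above; missing contribution is None):

```python
from collections import Counter, defaultdict

# (mnbr, firm, cont); None plays the role of Stata's missing value.
rows = [(1591, 2, 1), (9246, 6, None), (812, 6, 1), (674, 6, 1)]

size = Counter(firm for _, firm, _ in rows)        # group size, like _N
totc = defaultdict(int)                            # like egen totc = total(cont)
for _, firm, cont in rows:
    totc[firm] += cont or 0                        # total() treats missing as 0
share = {firm: totc[firm] / size[firm] for firm in size}

print(len(size))   # number of unique firms, like r(N_unique) -> 2
print(share)       # firm 2: 1.0, firm 6: 2/3
```

Note that _N counts every row in the group while total() skips missings, which is exactly what produces the 2/3 the asker wants for firm 6.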