Analyze MySQL Text Data

This is a strange one, but I have found the Stack Overflow community to be very helpful. I have a MySQL table with a column full of parsed text data. I want to analyze the data and see in how many rows each word appears.
ID  columnName
1   Car
2   Dog
3   CAR CAR car CAR
From the above example, what I want returned is that the word CAR appears in two rows and the word Dog appears in one row. I don't care about the total word count so much as the number of rows each word appears in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in Python, that would show me the most popular words used and how many rows each word appears in?
I have no idea where to start, and it would be great if someone could assist me with this.

I'd use Python:
1) set up Python to work with MySQL (there are loads of tutorials online)
2) define:
from collections import defaultdict
tokenDict = defaultdict(lambda: 0)
This is a simple dictionary which returns 0 if there is no value for the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0).
3) read each row from the table, tokenize it and increment the token counts:
tokens = row.split(' ')               # tokenize
tokens = [t.lower() for t in tokens]  # lowercase
tokens = set(tokens)                  # remove duplicates within the row
for token in tokens:
    tokenDict[token] += 1
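Putting those steps together, here is a minimal end-to-end sketch. The table and column names (myTable, columnName) and the connection details are assumptions; adjust them to your schema:
import mysql.connector
from collections import defaultdict

# connection details and table/column names are placeholders
conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor()
cursor.execute("SELECT columnName FROM myTable")

tokenDict = defaultdict(int)
for (text,) in cursor:
    # set() means each word counts at most once per row
    for token in set(text.lower().split()):
        tokenDict[token] += 1

# most popular words first, with the number of rows they appear in
for token, row_count in sorted(tokenDict.items(), key=lambda kv: kv[1],
                               reverse=True)[:50]:
    print(token, row_count)
conn.close()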

Related

How to find missing numbers within a column of strings

I'm trying to find unaccounted-for numbers within a substantially large SQL dataset and facing some difficulty sorting.
By default the data for the column reads
'Brochure1: Brochure2: Brochure3:...Brochure(k-1): Brochure(k):'
where k stands in for the number of brochures a unique id is eligible for.
The issue arises as the brochures are accounted for; a sample of updated data would read
'Brochure1: 00001 Brochure2: 00002 Brochure3: 00003....'
How does one query out the missing numbers if, in a range of say 00001-88888, some haven't been accounted for next to Brochure(X):?
The right way:
You should change the structure of your database. If you care about performance, you should follow the good practices of relational databases, so, as the first comment under your question says: normalize. Instead of placing information about brochures in one column of the table, it's a much faster and clearer solution to create another table that describes the relations between brochures and <your-first-table-name>:
<your-first-table-name>_id | brochure_id
----------------------------+---------------
1 | 00002
1 | 00038
1 | 00281
2 | 28192
2 | 00293
... | ...
Not to mention that, if possible, you should treat brochure_id as an integer, so 12 instead of 0012.
The difference is that now you can make efficient and simple queries to find out how many brochures one ID from your first table has, or which ID a given brochure belongs to. If for some reason you need to keep the ordinal number of every single brochure, you can add a column to the above table, like brochure_number.
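With the normalized table in place, finding the unaccounted-for numbers becomes a simple set difference. A minimal sketch in Python (the table name brochure_assignments, the 88888 upper bound, and the connection details are all assumptions):
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cursor = conn.cursor()
# brochure_assignments is the normalized table sketched above (name assumed)
cursor.execute("SELECT DISTINCT brochure_id FROM brochure_assignments")
seen = {row[0] for row in cursor}

# every number in 1..88888 that never appears next to any Brochure(X):
missing = [n for n in range(1, 88889) if n not in seen]
print(missing[:20])
conn.close()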
What you want to achieve (not recommended): I think the fastest way to achieve your objective without changing the db structure is to get the value of your brochures column and then process it with your script. You really don't want to create a SQL statement to parse this kind of data. In PHP that would look something like this:
// Let's assume you already have your `brochures` column value in variable $brochures
$bs = str_replace(": ", ":", $brochures);
$bs = explode(" ", $bs);
$brochures = array();
foreach ($bs as $b) {
    $colon = strpos($b, ":");
    // key: the brochure number after "Brochure"; value: the 5-digit ID after the colon
    $brochures[substr($b, 8, $colon - 8)] = substr($b, $colon + 1, 5);
}
// Now you have the $brochures array, with keys representing the brochure number
// and values representing the ID of the brochure.
if (isset($brochures['3'])) {
    // this row has a defined Brochure3
} else {
    // ...
}

Best way to parse a big and intricate JSON file with OpenRefine (or R)

I know how to parse JSON cells in OpenRefine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 Airbnb rooms, identified by their IDs.
Here is an example of one JSON file: https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until November 2017, I would like to extract the availability of the room (true or false) and its price on that day.
I can't figure out how to parse out this information. I guess it implies a series of nested forEach calls, but I can't find the right way to do this with OpenRefine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionaries that confuses me.
Any help would be appreciated. If the operation is too difficult in OpenRefine, a solution in R (or Python) would also be fine for me.
Rather than just creating your project as text and working with GREL to parse it out...
The best way is to select the JSON record part that you want to work with using our visual importer wizard for JSON and XML files (you can even use a URL pointing to a JSON file, as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains the records you want to parse and work with (this can be any repeating part; just select one of them and OpenRefine will extract all the rest).
Limit the number of data rows that you want to load in during creation, or leave the default of all rows.
Click Create Project and now you're in Rows mode. However, if you think that Records mode might be better suited for your context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field. In the example, the key field would probably be the date, which is why I'd highlight the whole record for a given date. This way OpenRefine will have keys for each record, and Records mode lets you work with them better than Rows mode.
Feel free to take this example and make it better and even more helpful for all; add it to our Wiki section on How to Use.
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OpenRefine (OR) array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OpenRefine can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OpenRefine.
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from the example URL above, this gives me 441 rows for the single ID, each containing the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty simple JSON in each cell, so you can extract availability using
value.parseJson().available
etc.
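Since the asker mentioned that a Python solution would also be fine, here is a minimal sketch that walks the same nested structure outside OpenRefine. The calendar_months, days and available names come from the GREL expressions above; the date and price field names are assumptions about the API response:
import requests

url = ("https://fr.airbnb.com/api/v2/calendar_months"
       "?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr"
       "&listing_id=4212133&month=11&year=2016&count=12"
       "&_format=with_conditions")
data = requests.get(url).json()

rows = []
for month in data["calendar_months"]:   # one item per month
    for day in month["days"]:           # the nested day records
        # "date" and "price" are assumed field names; "available" is
        # confirmed by the GREL expression above
        rows.append((day.get("date"), day.get("available"), day.get("price")))
print(len(rows), rows[:3])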

MySQL finding data if any 4 of 5 columns are found in a row

I have an imported table of several thousand customers. The development I am working on runs on the basis of anonymity for purchase checkouts (customers do not need to log in to check out), but if enough of their details match a database record, then do a soft match, email the (probably new) email address, and eventually associate the anonymous checkout with the account record on file.
It is rolling out this way due to the age of the records: many people have the same postal address or names but not the same email address, likewise some people will have moved house and some will have changed name (marriage etc.).
What I think I am looking for is a MySQL CASE system; however, the CASE questions on Stack Overflow that I've found don't appear to cover what I'm trying to get from this query.
The query should work something like this:
$input[0] = postcode (zip code)
$input[1] = postal address
$input[2] = phone number
$input[3] = surname
$input[4] = forename
SELECT account_id FROM account WHERE <4 or more of the variables listed match the same row>
The only way I KNOW I can do this is with a massive bunch of OR statements, but that's excessive and I'm sure there's a cleaner, more concise method.
I also apologise in advance if this is relatively easy, but I don't [think I] know the keyword to research for constructing this. As I say, CASE is my best guess.
I'm having trouble working out how to manipulate CASE to fit what I'm trying to do. I do not need to return the values, only the account_id from the (single) valid row that matches 4 or 5 of the given inputs.
I imagine that I could construct a layout that does this:
SELECT account_id CASE <if postcode_column=postcode_var> X=X+1
CASE <if surname_column=surname_var> X=X+1
...
...
WHERE X > 3
Is CASE the right idea?
If not, what is the process I need to use to achieve the desired results?
What is [another] MySQL keyword / syntax I need to research, if not CASE?
Here is your pseudo query:
SELECT account_id
FROM account
WHERE (postcode = 'pc') +
      (postal_address = 'pa') +
      (phone_number = '12345678901') +
      (surname = 'sn') +
      (forename = 'fn') > 3
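This works because in MySQL a comparison evaluates to 1 when true and 0 when false, so the sum is the number of matching columns, and > 3 means at least four of the five matched. If you run it from application code, bind the user input as parameters rather than interpolating strings; a minimal sketch in Python (connection details assumed):
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="shop")
cursor = conn.cursor()
sql = """SELECT account_id FROM account
         WHERE (postcode = %s) + (postal_address = %s) +
               (phone_number = %s) + (surname = %s) +
               (forename = %s) > 3"""
# the five inputs from the checkout form, in the same order
cursor.execute(sql, ("pc", "pa", "12345678901", "sn", "fn"))
for (account_id,) in cursor:
    print(account_id)
conn.close()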

Dealing with 5,000 attributes

I have a data set which contains 5,000+ attributes.
The table looks like below:
id attr1 attr2 attr3
a  0     1     0
a  1     0     0
a  0     0     0
a  0     0     1
I wish to represent each record on a single row, as in the table below, to make it more amenable to data mining via clustering.
id, attr1, attr2, attr3
a 1 1 1
I have tried a multitude of ways of doing this:
I have tried importing it into a MySQL DB and getting the max value for each attribute (they can only be 1 or 0 for each ID), but a table can't hold 5,000+ attributes.
I have tried using the pivot function in Excel and getting the max value per attribute, but the number of columns a pivot can handle is far less than the 5,000 I'm currently looking at.
I have tried importing it into Tableau, but that also suffers from the fact that it can't handle so many records.
I just want to get Table 2 into either a text/CSV file or a database table.
Can anyone suggest anything at all - a piece of software or something I have not yet considered?
Here is a Python script which does what you ask for:
def merge_rows_by_id(path):
    rows = dict()
    with open(path) as in_file:
        header = in_file.readline().rstrip()
        for line in in_file:
            fields = line.split()
            id, attributes = fields[0], fields[1:]
            if id not in rows:
                rows[id] = attributes
            else:
                # element-wise max: a 1 in any row for this id wins
                rows[id] = [max(x) for x in zip(rows[id], attributes)]
    print(header)
    for id in rows:
        print('{},{}'.format(id, ','.join(rows[id])))

merge_rows_by_id('my-data.txt')
It was written for clarity more than maximum efficiency, although it's still pretty efficient. However, this will still leave you with lines with 5,000 attributes, just fewer of them.
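If the file fits in memory, the same collapse can also be done in a couple of lines with pandas; a sketch assuming whitespace-separated input with an id column, as in the sample above:
import pandas as pd

# read the whitespace-separated table, collapse to one row per id by
# taking the per-column max of the 0/1 attributes, then write out as CSV
df = pd.read_csv('my-data.txt', sep=r'\s+')
df.groupby('id').max().to_csv('merged.csv')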
I've seen this data "structure" used too often in bioinformatics, where the researchers just say "put everything we know about 'a' on one row", and then the set of "everything" doubles, and re-doubles, etc. I've had to teach them about data normalization to make an RDBMS handle what they've got. Usually attr_1…n are from one trial and attr_n+1…m are from a second trial, and so on, which allows for a sensible normalization of the data.

Stumbleupon type query

Wow, makes your head spin!
I am about to start a project, and although my MySQL is OK, I can't get my head around what's required for this:
I have a table of web addresses.
id,url
1,http://www.url1.com
2,http://www.url2.com
3,http://www.url3.com
4,http://www.url4.com
I have a table of users.
id,name
1,fred bloggs
2,john bloggs
3,amy bloggs
I have a table of categories.
id,name
1,science
2,tech
3,adult
4,stackoverflow
I have a table of the categories each user likes, stored as numerical refs relating to the categories' unique refs. For example:
user,category
1,4
1,6
1,7
1,10
2,3
2,4
3,5
.
.
.
I have a table of scores relating to each website address. When a user visits one of these sites and says they like it, it's stored like so:
url_ref,category
4,2
4,3
4,6
4,2
4,3
5,2
5,3
.
.
.
So based on the above data, URL 4 would score (in its own right) as follows: category 2 = 2, category 3 = 2, category 6 = 1.
What I was hoping to do was pick out a random URL from over 2,000,000 records based on the current user's interests.
So if the logged-in user likes categories 1, 2 and 3, then I would like to ORDER BY a score generated based on their interests.
If the logged-in user likes categories 2, 3 and 6, then URL 4's total score would be 5. However, if the current logged-in user only likes categories 2 and 6, URL 4's score would be 3. So the ORDER BY would be in the context of the logged-in user's interests.
Think of stumbleupon.
I was thinking of using a set of VIEWs to help with the subqueries.
I'm guessing that all 2,000,000 records will need to be looked at and, based on the id of each URL, scored against each selected category of the current user.
So we need to know the user ID, and this gets passed into the query as a constant from the start.
Ain't got a clue!
Chris Denman
What I was hoping to do was pick out a random URL from over 2,000,000 records based on the current user's interests.
This screams for predictive modeling, something you probably wouldn't be able to pull off in the database. Basically, you'd want to precalculate your score for a given interest (or, more likely, set of interests) / URL combination, and then query based on the precalculated values. You'd most likely be best off doing this in application code somewhere.
Since you're trying to guess whether a user will like or dislike a link based on what you know about them, Bayes seems like a good starting point (sorry for the Wikipedia link, but without knowing your programming language this is probably the best place to start): Naive Bayes Classifier
edit
The basic idea here is that you continually run your precalculation process, and once you have enough data you can try to distill it into a simple formula that you can use in your query. As you collect more data, you continue to run the precalculation process and use the expanded results to refine your formula. This gets really interesting if you have the means to suggest a link and then find out whether the user liked it or not, as you can use this feedback loop to really improve the prediction algorithm (have a read on machine learning, particularly genetic algorithms, for more on this).
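To make the Naive Bayes suggestion concrete, here is a toy sketch using scikit-learn. The feature encoding (one 0/1 flag per category carried by a link) and all of the data are hypothetical; it only illustrates the shape of the approach:
from sklearn.naive_bayes import BernoulliNB

# each row: which of four categories a link carries (hypothetical data)
X = [[1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
# 1 = the user liked that link, 0 = they did not
y = [1, 0, 1, 0]

model = BernoulliNB().fit(X, y)
# probability the user dislikes/likes an unseen link in categories 1 and 4
print(model.predict_proba([[1, 0, 0, 1]]))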
I did this in the end:
$dbh = new NewSys::mySqlAccess("xxxxxxxxxx","xxxxxxxxxx","xxxxxxxxx","localhost");
$icat{1}='animals pets';
$icat{2}='gadget addict';
$icat{3}='games online play';
$icat{4}='painting art';
$icat{5}='graphic designer design';
$icat{6}='philosophy';
$icat{7}='strange unusual bizarre';
$icat{8}='health fitness';
$icat{9}='photography photographer';
$icat{10}='reading books';
$icat{11}='humour humor comedy comedian funny';
$icat{12}='psychology psychologist';
$icat{13}='cartoons cartoonist';
$icat{14}='internet technology';
$icat{15}='science scientist';
$icat{16}='clothing fashion';
$icat{17}='movies movie latest';
$icat{18}="\"self improvement\"";
$icat{19}='drawing art';
$icat{20}='latest band member';
$icat{21}='shop prices';
$icat{22}='recipe recipes food';
$icat{23}='mythology';
$icat{24}='holiday resorts destinations';
$icat{25}="(rude words)";
$icat{26}="www website";
$dbh->Sql("DELETE FROM precalc WHERE member = '$fdat{cred_id}'");
$dbh->Sql("SELECT * FROM prefs WHERE member = '$fdat{cred_id}'");
@chos=();
while($dbh->FetchRow()){
    $cat=$dbh->Data('category');
    $cats{$cat}='#';
}
foreach $cat (keys %cats){
    push @chos,"\'$cat\'";
    push @strings,$icat{$cat};
}
$sqll=join(",",@chos);
$words=join(" ",@strings);
$dbh->Sql("select users.id,users.url,IFNULL((select sum(scoretot.scr) from scoretot where scoretot.id = users.id and scoretot.category IN \($sqll\)),0) as score from users WHERE MATCH (description,lasttweet) AGAINST ('$words' IN BOOLEAN MODE) AND IFNULL((SELECT ref FROM visited WHERE member = '$fdat{cred_id}' AND user = users.id LIMIT 1),0) = 0 ORDER BY score DESC limit 30");
$cnt=0;
while($dbh->FetchRow()){
    $id=$dbh->Data('id');
    $url=$dbh->Data('url');
    $score=$dbh->Data('score');
    # $dbh2 is presumably a second connection handle, opened like $dbh above,
    # so this INSERT doesn't disturb the SELECT still being fetched on $dbh
    $dbh2->Sql("INSERT INTO precalc (member,user,url,score) VALUES ('$fdat{cred_id}','$id','$url','$score')");
    $cnt++;
}
I came up with this answer about three months ago and now can't quite read it myself. So sorry, I can't explain how it finally worked, but it managed to query 2 million websites and choose one based on the history of a user's past votes on other sites.
Once I got it working, I moved on to another problem!
http://www.staggerupon.com is where it all happens!
Chris