How to find missing numbers within a column of strings - mysql

I'm trying to find unaccounted-for numbers within a large SQL dataset and am having some difficulty sorting them out.
By default, the data in the column reads
'Brochure1: Brochure2: Brochure3:...Brochure(k-1): Brochure(k):'
where k stands in for the number of brochures a unique id is eligible for.
The issue arises as the brochures are accounted for. A sample of updated data would read
'Brochure1: 00001 Brochure2: 00002 Brochure3: 00003....'
How does one query for the missing numbers if, in a range of say 00001-88888, some haven't been accounted for next to a Brochure(X): label?

The right way:
You should change the structure of your database. If you care about performance, follow good relational-database practice and, as the first comment under your question said, normalize. Instead of storing the brochure information in one column of the table, it's much faster and clearer to create another table that describes the relation between brochures and your-first-table-name:
<your-first-table-name>_id | brochure_id
----------------------------+---------------
1 | 00002
1 | 00038
1 | 00281
2 | 28192
2 | 00293
... | ...
Not to mention that, if possible, you should treat brochure_id as an integer, using 12 instead of 0012.
The difference is that you can now write simple, efficient queries to find out how many brochures one ID from your first table has, or which ID any brochure belongs to. If for some reason you need to keep the ordinal number of every single brochure, you can add a column to the above table, such as brochure_number.
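As a rough sketch of what those queries could look like, here is the normalized layout in Python, with an in-memory SQLite database standing in for MySQL. The table name record_brochures and the column record_id are placeholders chosen for illustration, not names from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE record_brochures ("
    " record_id INTEGER NOT NULL,"
    " brochure_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO record_brochures (record_id, brochure_id) VALUES (?, ?)",
    [(1, 2), (1, 38), (1, 281), (2, 28192), (2, 293)],
)

# How many brochures does each record have?
counts = conn.execute(
    "SELECT record_id, COUNT(*) FROM record_brochures"
    " GROUP BY record_id ORDER BY record_id"
).fetchall()

# Which record does brochure 281 belong to?
owner = conn.execute(
    "SELECT record_id FROM record_brochures WHERE brochure_id = ?", (281,)
).fetchone()

print(counts)  # [(1, 3), (2, 2)]
print(owner)   # (1,)
```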
What you want to achieve (not recommended):
I think the fastest way to achieve your objective without changing the db structure is to fetch the value of your brochures column and then process it in your script. You really don't want to write a SQL statement to parse this kind of data. In PHP that would look something like this:
// Let's assume you already have your `brochures` column value in variable $brochures.
// Match every "BrochureN:" label plus the five-digit ID after it, if there is one.
preg_match_all('/Brochure(\d+):\s*(\d{5})?/', $brochures, $matches, PREG_SET_ORDER);
$brochures = array();
foreach ($matches as $m) {
    // An unfilled brochure captures no ID, so store null for it.
    $brochures[$m[1]] = !empty($m[2]) ? $m[2] : null;
}
// Now you have $brochures array with keys representing the brochure number,
// and values representing the ID of brochure.
if (isset($brochures['3'])) {
    // that row has a defined Brochure3
} else {
    // ...
}
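To get from parsed rows to the actual list of missing numbers, one option is to collect every ID seen across all rows and subtract that set from the full range. A minimal Python sketch, assuming the column values have already been fetched into a list of strings (the function name and the sample rows are made up for illustration):

```python
import re


def missing_brochure_ids(column_values, low=1, high=88888):
    """Return the IDs in [low, high] that appear in none of the column values."""
    seen = set()
    for value in column_values:
        # Capture the digits that follow each "BrochureN:" label, if any.
        for digits in re.findall(r"Brochure\d+:\s*(\d+)", value):
            seen.add(int(digits))
    return sorted(set(range(low, high + 1)) - seen)


# Hypothetical sample rows in the format described in the question.
rows = [
    "Brochure1: 00001 Brochure2: 00002 Brochure3:",
    "Brochure1: 00004 Brochure2:",
]
print(missing_brochure_ids(rows, low=1, high=6))  # [3, 5, 6]
```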

Related

Rails, MySql, JSON column which stores array of UUIDs - Need to do exact match

I have a model called lists, which has a column called item_ids. item_ids is a JSON column (MySQL) and the column contains array of UUIDs, each referring to one item.
Now when someone creates a new list, I need to check whether there is an existing list with the same set of UUIDs, and I want to do this search in the query itself for a faster response. I'd also like to use ActiveRecord querying as much as possible.
How do I achieve this?
item_ids = ["11E85378-CFE8-39F8-89DC-7086913CFD4B", "11E85354-304C-0664-9E81-0A281BE2CA42"]
v = List.new(item_ids: item_ids)
v.save!
Now, how do I check whether a list exists whose item IDs exactly match those given in the query? The following won't work.
list_count = List.where(item_ids: item_ids).count
Edit 1
List.where("JSON_CONTAINS(item_ids, ?) ", item_ids.to_json).count
This statement works, but it also counts lists where only some of the items match. I'm looking for an exact match on the full set of items.
Edit 2
List.where("JSON_CONTAINS( item_ids, ?) and JSON_LENGTH(item_ids) = ?", item_ids.to_json, item_ids.size).count
Looks like this is working
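One caveat worth checking: containment plus equal length only amounts to an exact match when neither array contains duplicates. A small Python sketch of the same logic as the Edit 2 query (the function name is hypothetical, and plain lists stand in for the JSON arrays):

```python
def contains_and_same_length(stored, wanted):
    """Mimic the Edit 2 check: every wanted item is stored, and the lengths match."""
    return all(item in stored for item in wanted) and len(stored) == len(wanted)


wanted = ["A", "B"]
print(contains_and_same_length(["B", "A"], wanted))       # True
print(contains_and_same_length(["A", "B", "C"], wanted))  # False

# With duplicates the check passes although the contents differ.
print(contains_and_same_length(["A", "B", "B"], ["A", "B", "A"]))  # True
```

If item_ids can never hold the same UUID twice, this edge case does not arise.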
You can implement a has_many relation between lists and items and then query like this:
List.includes(:items).where(items: { id: item_ids })
To implement the has_many relation:
http://guides.rubyonrails.org/association_basics.html#the-has-many-through-association

UPDATE all empty array columns to be NULL

MySQL query is:
UPDATE buddy SET buddy_positions = CONCAT('| ',
(IF(buddy1_request = 'Yes', 'buddy1_main | ','')),
(IF(buddy2_request = 'Yes','buddy2_main | ','')),
(IF(buddy3_request = 'Yes','buddy3_main | ','')),
(IF(buddy4_request = 'Yes','buddy4_main | ','')),
(IF(buddy5_request = 'Yes','buddy5_main | ','')))
My problem is that if buddy1_request = 'Yes', I want the CONCAT to include the data from a separate column (buddy1_main) plus some extra bits. I'm struggling to output the value from another column, however: I can type it in manually but can't find a way to do this automatically.
EDIT
So my data looks like this:
|buddy1_request | buddy1_main | buddy2_request | buddy2_main |buddy_position|
|---------------|-------------|----------------|-------------|--------------|
|Yes |prop |no |NULL |(CONCAT HERE) |
So what I want to happen is that if buddy1_request says 'Yes', the contents of buddy1_main are included in the CONCAT for buddy_position, and so on.
In this example the output would simply be "prop"; however, if buddy2_request said "Yes", buddy2_main would have a value such as "winger" and the CONCAT would return "prop","winger".
My problem is that my current query returns the text "buddy1_main","buddy2_main"
I don't know how to reference the column and pull the value through in the CONCAT.
PS I don't see how this is bad design. It's a table where a player adds a friend, and if they click yes to playing with them that week it brings their position through as well into buddy1_main. I then need a way of outputting all the results into a table for clubs to view, so they know that the player also comes with X amount of people who can play positions X, Y and Z.
However if player 2 is unavailable but player 4 is available it needs to ignore player 2's position. I hope that makes sense, there's a lot of reasons it's done this way and it's actually a very complex system when you drill down into it all. I've kept it this way so it's modular and not linear as a model so I can change aspects to it as needed without having a knock on effect. I'm not concerned about how memory hungry it is on the server at this stage.
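The query returns the literal text because 'buddy1_main' in single quotes is a string, whereas buddy1_main without quotes refers to the column. The intended per-row behaviour can be sketched in Python as follows; the dictionary layout and the function name are assumptions for illustration, not part of the schema.

```python
def buddy_positions(row, slots=5):
    """Collect buddyN_main for every slot whose buddyN_request is 'Yes'."""
    positions = []
    for n in range(1, slots + 1):
        position = row.get(f"buddy{n}_main")
        # Skip slots that were not requested or have no position recorded.
        if row.get(f"buddy{n}_request") == "Yes" and position is not None:
            positions.append(position)
    # Mirror the '| a | b | ' layout that the CONCAT in the question builds.
    return "| " + "".join(f"{p} | " for p in positions)


row = {
    "buddy1_request": "Yes", "buddy1_main": "prop",
    "buddy2_request": "no", "buddy2_main": None,
}
print(repr(buddy_positions(row)))  # '| prop | '
```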

Analyze MySQL Text Data

This is a strange one, but I have found the Stack Overflow community to be very helpful. I have a MySQL table with a column full of parsed text data. I want to analyze the data and see in how many rows each word appears.
ID columnName
1 Car
2 Dog
3 CAR CAR car CAR
From the above example, what I want returned is that the word CAR appears in two rows and the word Dog appears in one row. I care less about the total word count than about how many rows each word appears in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in Python, that would show me the most popular words and how many rows each appears in?
I have no idea where to start, and it would be great if someone could assist me with this.
I'd use Python:
1) Set up Python to work with MySQL (there are loads of tutorials online).
2) Define:
from collections import defaultdict
tokenDict = defaultdict(int)
This is a simple dictionary which returns 0 if there is no value for the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0).
3) Read each row from the table, tokenize it and increment the token counts:
tokens = row.split()  # tokenize on whitespace
tokens = [t.lower() for t in tokens]  # lowercase
tokens = set(tokens)  # remove duplicates within the row
for token in tokens:
    tokenDict[token] = tokenDict[token] + 1
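Put together, the steps above can be sketched as one small function. This version uses collections.Counter in place of the defaultdict and assumes the column values are already loaded into a list of strings (the MySQL fetch is left out; the function name is hypothetical):

```python
from collections import Counter


def row_frequencies(rows):
    """Count, for each lowercased word, the number of rows it appears in."""
    counts = Counter()
    for text in rows:
        # A set ensures a word repeated within one row is counted once.
        counts.update(set(text.lower().split()))
    return counts


rows = ["Car", "Dog", "CAR CAR car CAR"]  # sample data from the question
freq = row_frequencies(rows)
print(freq.most_common())  # [('car', 2), ('dog', 1)]
```

most_common() sorts by count, which gives the "most popular words" listing directly.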

How to find duplicate records in ActiveRecord other than original one

Using Rails 4, Ruby 2, MySql
I would like to find all the records in my database which are repeats of another record - but not the original record itself.
This is so I can update_attributes(:duplicate => true) on each of these records and leave the original one not marked as a duplicate.
You could say that I am looking for the opposite of Uniq*. I don't want the unique values; I want all the values which are left over after the unique ones are taken out. I don't want all values which have a duplicate, as that would include the original.
I don't mind using pure SQL or Ruby for this but I would prefer to use active record to keep it Railsy.
Let's say the table is called "Leads" and we are looking for those where the field "telephone_number" is the same. I would leave record 1 alone and mark 2,3 and 4 as duplicate = true.
* If I wanted the opposite of Uniq I could do something like this (from "Find keep duplicates in Ruby hashes"):
b = a.group_by { |h| h[:telephone_number] }.values.select { |a| a.size > 1 }.flatten
But that is all the records, I want all the duplicated ones other than the original one I'm comparing it to.
I'm assuming your query returns all 'Leads' that have the same telephone number in an array b. You can then use
b.shift
which removes the first element (the original) from the b array. Note that shift returns the removed element, so don't assign its result back to b. Then you can continue with your original thought, update_attributes(:duplicate => true), on each remaining record.
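When several telephone numbers are involved, the first record has to be skipped per number rather than once overall. A minimal Python sketch of that grouping logic, with plain dictionaries standing in for Lead records (the function name and the sample numbers are made up):

```python
from collections import defaultdict


def duplicates_after_first(leads):
    """Return every lead whose telephone_number already appeared on an earlier lead."""
    groups = defaultdict(list)
    for lead in leads:
        groups[lead["telephone_number"]].append(lead)
    # Skip the first lead in each group; the rest are the repeats.
    return [lead for group in groups.values() for lead in group[1:]]


# Hypothetical sample data: leads 1-4 share a number, lead 5 is unique.
leads = [
    {"id": 1, "telephone_number": "555-0100"},
    {"id": 2, "telephone_number": "555-0100"},
    {"id": 3, "telephone_number": "555-0100"},
    {"id": 4, "telephone_number": "555-0100"},
    {"id": 5, "telephone_number": "555-0199"},
]
print([lead["id"] for lead in duplicates_after_first(leads)])  # [2, 3, 4]
```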

Dealing with 5,000 attributes

I have a data set which contains 5,000+ attributes.
The table looks like this:
id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 0
a 0 0 1
I wish to represent each ID on a single row, as in the table below, to make it more amenable to data mining via clustering.
id attr1 attr2 attr3
a 1 1 1
I have tried a multitude of ways of doing this:
I have tried importing it into a MySQL DB and getting the max value for each attribute (they can only be 1 or zero for each ID), but a table can't hold the 5,000+ attributes.
I have tried using the pivot function in Excel and getting the max value per attribute, but the number of columns a pivot can handle is far less than the 5,000 I'm currently looking at.
I have tried importing it into Tableau, but that also suffers from the fact that it can't handle so many columns.
I just want to get Table 2 in either a text/CSV file or a database table.
Can anyone suggest anything at all, a piece of software or something I have not yet considered?
Here is a Python script which does what you ask for:
def merge_rows_by_id(path):
    rows = dict()
    with open(path) as in_file:
        header = in_file.readline().rstrip()
        for line in in_file:
            fields = line.split()
            id, attributes = fields[0], fields[1:]
            if id not in rows:
                rows[id] = attributes
            else:
                # Element-wise max of the stored and the new '0'/'1' values
                rows[id] = [max(x) for x in zip(rows[id], attributes)]
    print(header)
    for id in rows:
        print('{},{}'.format(id, ','.join(rows[id])))

merge_rows_by_id('my-data.txt')
It was written for clarity more than maximum efficiency, although it's pretty efficient. However, this will still leave you with lines of 5,000 attributes, just fewer of them.
I've seen this data "structure" too often in bioinformatics, where the researchers just say "put everything we know about 'a' on one row", and then the set of "everything" doubles, and re-doubles, etc. I've had to teach them about data normalization so that an RDBMS can handle what they've got. Usually, attr_1…n are from one trial and attr_n+1…m are from a second trial, and so on, which allows for a sensible normalization of the data.
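One possible normalization, which is an assumption on my part rather than the per-trial split described above, is a "long" table with one (id, attribute) row for every attribute set to 1. That sidesteps the column limit entirely. A minimal Python sketch, with the rows already split into lists of strings (the function name is hypothetical):

```python
def wide_to_long(header, rows):
    """Turn wide 0/1 rows into (id, attribute) pairs, keeping attributes set to 1."""
    names = header[1:]
    pairs = set()
    for row in rows:
        record_id, values = row[0], row[1:]
        for name, value in zip(names, values):
            if value == "1":
                pairs.add((record_id, name))
    return sorted(pairs)


# Sample data from the question.
header = ["id", "attr1", "attr2", "attr3"]
rows = [
    ["a", "0", "1", "0"],
    ["a", "1", "0", "0"],
    ["a", "0", "0", "0"],
    ["a", "0", "0", "1"],
]
print(wide_to_long(header, rows))
# [('a', 'attr1'), ('a', 'attr2'), ('a', 'attr3')]
```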