I have a data set which contains 5,000+ attributes.
The table looks like below:
id attr1 attr2 attr3
a 0 1 0
a 1 0 0
a 0 0 0
a 0 0 1
I wish to represent each record on a single row, for example as in the table below, to make it more amenable to data mining via clustering:
id, attr1, attr2, attr3
a 1 1 1
I have tried a multitude of ways of doing this:
I have tried importing it into a MySQL DB and getting the max value for each attribute (they can only be 1 or 0 for each ID), but a table can't hold the 5,000+ attributes.
I have tried using the pivot function in Excel and getting the max value per attribute, but the number of columns a pivot can handle is far less than the 5,000 I'm currently looking at.
I have tried importing it into Tableau, but that also suffers from the fact that it can't handle so many columns.
I just want to get Table 2 into either a text/CSV file or a database table.
Can anyone suggest anything at all: a piece of software, or something I have not yet considered?
Here is a Python script which does what you ask for:
def merge_rows_by_id(path):
    rows = dict()
    with open(path) as in_file:
        header = in_file.readline().rstrip()
        for line in in_file:
            fields = line.split()
            # First field is the id; the rest are the 0/1 attribute values.
            id, attributes = fields[0], fields[1:]
            if id not in rows:
                rows[id] = attributes
            else:
                # Element-wise max of '0'/'1' strings: '1' wins if either row has it.
                rows[id] = [max(x) for x in zip(rows[id], attributes)]
    # Re-join the header with commas so it matches the comma-separated rows below.
    print(','.join(header.split()))
    for id in rows:
        print('{},{}'.format(id, ','.join(rows[id])))

merge_rows_by_id('my-data.txt')
This was written for clarity more than maximum efficiency, although it's pretty efficient. However, it will still leave you with lines of 5,000 attributes, just fewer of them.
I've seen this data "structure" used too often in bioinformatics, where the researchers just say "put everything we know about 'a' on one row", and then the set of "everything" doubles, and re-doubles, etc. I've had to teach them about data normalization to make an RDBMS handle what they've got. Usually attr_1…n are from one trial and attr_n+1…m are from a second trial, and so on, which allows for a sensible normalization of the data.
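For illustration, here is a minimal sketch of that normalization in Python (not part of the script above): it unpivots the wide layout into a long (id, attribute, value) table, which an RDBMS handles comfortably. The input file name and the space delimiter are assumptions based on the sample above.
import csv

# Unpivot wide rows (id, attr1..attrN) into long rows (id, attribute, value).
with open('my-data.txt') as in_file, \
     open('long-format.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(['id', 'attribute', 'value'])
    header = in_file.readline().split()   # e.g. ['id', 'attr1', 'attr2', ...]
    for line in in_file:
        fields = line.split()
        record_id, values = fields[0], fields[1:]
        for attr_name, value in zip(header[1:], values):
            writer.writerow([record_id, attr_name, value])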
I'm trying to find unaccounted-for numbers within a substantially large SQL dataset and facing some difficulty sorting it out.
By default, the data for the column reads:
'Brochure1: Brochure2: Brochure3:...Brochure(k-1): Brochure(k):'
where k stands in for the number of brochures a unique ID is eligible for.
The issue arises as the brochures are accounted for; a sample of updated data would read:
'Brochure1: 00001 Brochure2: 00002 Brochure3: 00003....'
How does one query out the missing numbers if, in a range of, say, 00001-88888, some haven't been accounted for next to Brochure(X)?
The right way:
You should change the structure of your database. If you care about performance, you should follow the good practices of relational databases; as the first comment under your question said: normalize. Instead of placing information about brochures in one column of the table, it's a much faster and clearer solution to create another table that describes the relations between brochures and <your-first-table-name>:
<your-first-table-name>_id | brochure_id
----------------------------+---------------
1 | 00002
1 | 00038
1 | 00281
2 | 28192
2 | 00293
... | ...
Not to mention that, if possible, you should treat brochure_id as an integer, so use 12 instead of 0012.
The difference here is that now you can make efficient and simple queries to find out how many brochures one ID from your first table has, or which ID a given brochure belongs to. If for some reason you need to keep the ordinal number of every single brochure, you can add a column to the above table, like brochure_number.
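With the normalized table in place, finding the unassigned numbers reduces to a set difference. Here is a minimal sketch in Python (the table name brochure_relations, the column name, and the connection details are assumptions, not part of your schema):
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user='user', password='secret', database='mydb')
cur = conn.cursor()
cur.execute("SELECT DISTINCT brochure_id FROM brochure_relations")
assigned = {row[0] for row in cur}

# Brochure numbers run 00001-88888 per the question; anything not assigned is missing.
missing = sorted(set(range(1, 88889)) - assigned)
print(len(missing), "missing; first few:", missing[:10])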
What you want to achieve (not recommended): I think the fastest way to achieve your objective without changing the DB structure is to get the value of your brochures column and then process it with your script. You really don't want to create an SQL statement to parse this kind of data. In PHP that would look something like this:
// Let's assume you already have your `brochures` column value in variable $brochures
$bs = str_replace(": ", ":", $brochures);
$bs = explode(" ", $bs);
$brochures = array();
foreach($bs as $b){
    // Key: the number after "Brochure" (read up to the colon, so two-digit
    // numbers also work); value: the 5-digit ID after the colon.
    $pos = strpos($b, ":");
    $brochures[substr($b, 8, $pos - 8)] = substr($b, $pos + 1, 5);
}
// Now you have the $brochures array with keys representing the brochure number,
// and values representing the ID of the brochure.
if(isset($brochures['3'])){
    // that row has a defined Brochure3
}else{
    // ...
}
NRrows = RSNonResourceCosts.RecordCount ' Number of rows in Non Resource table
NRCols = RSNonResourceCosts.Fields.Count ' Number of fields in Non Resource table
Dim CL(1 To 10) As Integer ' This is to count "filled rows" when the spreadsheet is filled
Dim Header(1 To 10) As String
'-----------
'Find the headers (taken from the actual table and not predefined as original)
For Each Recordsetfieldx In RSNonResourceCosts.Fields
    If C > 0 Then
        Header(C) = Recordsetfieldx.Name
    End If
    C = C + 1
Next Recordsetfieldx
R = 0
'Write to worksheet
RSNonResourceCosts.MoveFirst
Do Until RSNonResourceCosts.EOF
    For C = 1 To NRCols - 1
        FieldName = RSNonResourceCosts.Fields(C).Value
        If RSNonResourceCosts.Fields(Header(C)).Value <> "" Then
            CL(C) = CL(C) + 1
            WKS.Cells(200 + R, C) = RSNonResourceCosts.Fields(Header(C)).Value
        End If
    Next C
    RSNonResourceCosts.MoveNext
    R = R + 1
Loop
I attach the code. I have solved part of the original problem by defining the Recordset. The user can add a column to the table. The first part of the code determines the headers; the second part determines the values and writes them to the worksheet. The new rows are appearing first on the worksheet and in the wrong column. I tried attaching the worksheet but it looked awful. Any help would be appreciated.
Two things:
1) The order of your records is the order they appear in the recordset. If you want them in a particular order, try sorting them (perhaps with an ORDER BY in the underlying SQL statement).
2) For the column issue: in the first bit of code, I don't see where C is initialized. Keep in mind that Headers and Fields both start with an index of 0, so if you set Header(1) to the first field's header (index 0) but then copy the data in the fields without shifting the index value, everything will be shifted over by one column.
As an added note, you might want to consider what happens when you have more than 10 columns: using fixed-length arrays means your code will break. You might want to read about using a dynamic array and ReDim.
I don't yet feel like I have completely grasped the entirety of the problem, but let me take a stab at it. From what I do understand, data is being written from your recordset into Excel (good), but it is going into the 'wrong row' (question title) and the 'wrong column' (question text).
From what I see, I don't know the purpose of FieldName = RSNonResourceCosts.Fields(C).Value, but I want to make sure you understand that RSNonResourceCosts.Fields(C).Value is not necessarily equivalent to RSNonResourceCosts.Fields(Header(C)).Value. More than that, you are likely missing at least one column altogether in your output, or at least skipping over it accidentally. rs.Fields(0).Name is the first 'column' in a recordset, but it is completely ignored in your code. Perhaps this is intentional, maybe it is a key field or something useless to you, but it is important that you make that distinction intentionally. Since I don't see where your code populates the headers in your worksheet, I wonder if 'wrong column' means every record has been shifted a column and your last column is sitting empty. That, coupled with the dubious omission of C being initialized as 0 (not 1, or anything else) in your code, makes me concerned that Header(3) could possibly be field(1), or field(4), or who knows what. That would certainly also confuse the columns in your output, or at least make dependence on FieldName frustrating.
Another thing, really a shot in the dark: NRrows. Depending on how I create my recordset, I have had issues before with not getting the correct record count the first time. And if I base the population of a worksheet, array, etc. on the number of rows and each record's relative position in that number, my records get all sorts of wacky. Maybe you did this already, but since it isn't shown, I recommend a RSNonResourceCosts.MoveLast: RSNonResourceCosts.MoveFirst line before you define NRrows, just to be sure.
And last, if I am way off base here... then you really are going to have to show us the spreadsheet, even if it isn't your most beautiful work. We all know that if it were, you wouldn't be asking about it here... so set your pride aside, be more specific, and show us what the output looks like and how it should look.
This is a strange one, but I have found the Stack Overflow community to be very helpful. I have a MySQL table with a column full of parsed text data. I want to analyze the data and see in how many rows words appear.
ID columnName
1 Car
2 Dog
3 CAR CAR car CAR
From the above example, what I want returned is that the word CAR appears in two rows and the word Dog appears in one row. I don't really care about the total word count so much as how many rows the word appears in. The problem is that I don't know which words to search for. Is there a tool, or something I can build in Python, that would show me the most popular words used and how many rows the words appear in?
I have no idea where to start, and it would be great if someone could assist me with this.
I'd use Python:
1) Set up Python to work with MySQL (loads of tutorials online).
2) Define:
from collections import defaultdict
tokenDict = defaultdict(lambda: 0)
tokenDict here is a simple dictionary which returns 0 if there is no value for the given key (i.e. tokenDict['i_have_never_used_this_key_before'] will return 0)
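A quick demonstration of that behaviour (illustrative only):
from collections import defaultdict

counts = defaultdict(lambda: 0)
counts['seen'] += 1           # no KeyError: a missing key starts at 0
print(counts['seen'])         # 1
print(counts['never_used'])   # 0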
3) Read each row from the table, tokenize it, and increment the token counts:
tokens = row.split(' ')               # tokenize
tokens = [t.lower() for t in tokens]  # lowercase
tokens = set(tokens)                  # remove duplicates within the row
for token in tokens:
    tokenDict[token] = tokenDict[token] + 1
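Putting the pieces together, a minimal end-to-end sketch (the table and column names and the connection details are assumptions to adapt to your schema):
import mysql.connector  # pip install mysql-connector-python
from collections import defaultdict

tokenDict = defaultdict(lambda: 0)

conn = mysql.connector.connect(user='user', password='secret', database='mydb')
cur = conn.cursor()
cur.execute("SELECT columnName FROM myTable")
for (row,) in cur:
    tokens = set(t.lower() for t in row.split(' '))  # unique words per row
    for token in tokens:
        tokenDict[token] += 1

# Most popular words and the number of rows each appears in.
for token, row_count in sorted(tokenDict.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(token, row_count)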
I've been doing some harmless operations, basically combining two IDs to create a unique ID, and as the IDs are numbers, I decided to use math operations to combine them (rather than string concatenation). So, as my second ID is always < 10000, what I did was id1*10000 + id2.
The problem is, Tableau doesn't seem to know how to add those numbers. To illustrate, I created Calculation1 (which is Id1*10000), Calculation2 (which is Id2), and Calculation3 (which is Calculation1 + Calculation2).
Check the file: http://www.speedyshare.com/Wc5zP/Tableau-can-t-sum.twbx
The original data source is a CSV file, but it's extracted (to a .tde).
One thing that might be happening is that Tableau has some limitation on the size of the int it can store. Does anyone know how I can change this? (Int64 would do the trick, if that is actually the problem.)
Here is a snapshot:
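For what it's worth, a quick back-of-the-envelope check (plain Python, with made-up example IDs) of when id1*10000 + id2 outgrows a signed 32-bit integer, which would explain the behaviour if the field is stored as Int32:
INT32_MAX = 2**31 - 1  # 2,147,483,647

def combine(id1, id2):
    assert id2 < 10000   # id2 is always < 10000 per the question
    return id1 * 10000 + id2

for id1 in (123, 214748, 214749):
    key = combine(id1, 9999)
    status = "overflows Int32" if key > INT32_MAX else "fits in Int32"
    print(id1, key, status)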
I have a challenge with the following database structure:
a HEADER table called 'DOC', containing document details, among which the document ID;
a DETAIL table called 'DOC_SET', containing data related to the document.
The header table holds approximately 16,000 records. The detail table contains on average 75 records per header record (1.2 million records in total).
I have one source document and its related set (the source set). I would like to compare this source set to the other documents' sets (which I refer to as destination documents and sets). Through my application I have a list of IDs of the source set available, and as such also its length (in the example below, a list of 46 elements), which I can use in the query directly.
What I need per destination document is the length of the intersection (the number of shared elements) of the source and destination sets, and the length of the difference (the number of elements in the source set that are not in the destination set), for display. I also need a filter to retrieve only records for which a 75% intersection between source and destination, relative to the source set, is reached. For example, if a destination set shares 35 of the 46 source elements, the intersection is 35, the difference is 11, and 35/46 ≈ 0.76 passes the filter.
Currently I have a query which does this using subselects containing expressions, but it is utterly slow, and the results need to be available at page refresh in a web application. The point is, I only need to display about 20 results at a time, but when sorting on the calculated fields I need to calculate every destination record before being able to sort and paginate.
The query is something like this:
select
    DOC.id,
    calc_subquery._calcSetIntersection,
    calc_subquery._calcSetDifference
from
    DOC
    inner join
    (
        select
            DOC.id as document_id,
            (
                select count(*)
                from DOC_SET
                where DOC_SET.doc_id = DOC.id
                  and DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
            ) as _calcSetIntersection,
            46 - (
                select count(*)
                from DOC_SET
                where DOC_SET.doc_id = DOC.id
                  and DOC_SET.element_id in (60,114,130,187,267,394,421,424,426,603,604,814,909,1035,1142,1223,1314,1556,2349,2512,4953,5134,6318,6339,6344,6455,6528,6601,6688,6704,6705,6731,6894,6895,7033,7088,7103,7119,7129,7132,7133,7137,7154,7159,7188,7201)
            ) as _calcSetDifference
        from DOC
        where DOC.id = 2599
    ) as calc_subquery
    on DOC.id = calc_subquery.document_id
where
    DOC.id = 2599
    and _calcSetIntersection / 46 > 0.75;
I'm wondering:
1) whether this is possible in < 100 ms or so on an average-spec server running MySQL fully in memory (24 GB);
2) whether I should use a better-suited solution for this, perhaps a NoSQL solution (a rough in-memory sketch follows below);
3) whether I should use some sort of temporary table or cache containing calculated values. This is an issue for me, as the source set of IDs might change between queries, and the whole thing would need to be calculated again.
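For comparison, here is a minimal sketch of the in-memory route in Python: load DOC_SET once into a dict of sets, after which each page request is a cheap set intersection per document. The table and column names follow the question; the connection details are assumptions.
import mysql.connector  # pip install mysql-connector-python

# The 46 source element IDs from the query above (abbreviated here).
source_set = {60, 114, 130, 187, 267, 394, 7188, 7201}

conn = mysql.connector.connect(user='user', password='secret', database='mydb')
cur = conn.cursor()
cur.execute("SELECT doc_id, element_id FROM DOC_SET")

doc_sets = {}
for doc_id, element_id in cur:
    doc_sets.setdefault(doc_id, set()).add(element_id)  # ~1.2M rows, loaded once

results = []
for doc_id, dest in doc_sets.items():
    shared = len(source_set & dest)
    if shared / len(source_set) > 0.75:  # the 75% filter, relative to the source set
        results.append((doc_id, shared, len(source_set) - shared))

# Sort on the calculated intersection, then paginate in the application.
results.sort(key=lambda r: r[1], reverse=True)
print(results[:20])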
Anyway, any thoughts or solutions are really appreciated.
Kind regards,
Eric