In a large tab delimited file, get unique values of one field, sum their values in a second field - unique

First post; I normally lurk, but I couldn't find anything that quite fit my situation.
So, I have a large tab-delimited file (~3 billion lines) with two fields on each line. One is a string of fixed length (10 characters, all alpha, all caps), the other is an integer of variable size. Some entries in the first field are repeated across lines, like so:
AAABBBCCCD 6
QQQQQQQQQQ 1
ZZZTOPZZZZ 299
AAABBBCCCD 14
JHFDSJKHFJ 2
ZZZTOPZZZZ 1
What I want to do is compare values in the first field, find the unique ones, and sum every second-field value for each unique entry, resulting in output like this:
AAABBBCCCD 20
QQQQQQQQQQ 1
ZZZTOPZZZZ 300
JHFDSJKHFJ 2
I don't necessarily care if they're sorted by integer value, but it would be cool if they were. Not really a priority.
I've already tried some things in Perl that work on test files but perform much too slowly to be useful on the real thing. So, yeah, I'm open to any sort of solution, but at this point I'm particularly interested in whether there is any cool/clever bash-fu that would do the job.
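For this kind of key/sum aggregation, a single awk pass that keeps a running total per key is the usual shell-side approach. A minimal sketch (file names are placeholders, and memory use grows with the number of distinct keys rather than the number of lines):
# sum the second field per unique first-field value, then sort by the summed total
awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' input.tsv |
    sort -t$'\t' -k2,2nr > totals.tsv
The trailing sort is only there because ordering by the summed value was described as nice to have; dropping it saves an extra pass over the (much smaller) aggregated output.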

Related

Searching for a range within a comma-separated field

I have a field called 'fits' that contains comma separated values like this:
120102199105199205,130101199105199205,120101199107199201
where for each number the first 6 digits are a fitment code and the last 12 digits are a unique date range.
Now, I know that CSV is nasty and violates 1NF, but the alternative is normalizing the data into a couple more tables, which would be monstrous because of the potential number of records.
So in an effort to keep it simple I'm trying to write a query to select rows based on providing the fitment code and a range of dates. So far I have this:
SELECT data
FROM table
WHERE fits = ANY (
    SELECT fits
    FROM table
    WHERE fits LIKE '120102')
AND fits BETWEEN '120102199105000000' AND '120102999999199205'
The problem is, the BETWEEN ... AND doesn't work with the CSV data.
Is there a way to apply a range in the query with some kind of wildcard to this type of CSV data, or is the only SQL-side solution to normalize?
If you want to use BETWEEN for values in your fits column you will need to normalise the data.
This should not necessarily be a problem. How big could the table potentially get? You'd be surprised how many records a modern database server can handle.
1 million, 10 million, 100 million rows are really not a problem for any RDBMS (MySQL, MariaDB, MSSQL, PostgreSQL) on a modern PC.
Normalise your data and everything will be much easier.
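For illustration, a normalised layout might look roughly like the sketch below. The table and column names are invented, and the 12-digit range is read as two 6-digit YYYYMM dates (as in the example values above).
-- Hypothetical normalised table: one row per fitment instead of a CSV blob.
CREATE TABLE fitments (
    data_id      INT NOT NULL,       -- points back to the original row
    fitment_code CHAR(6) NOT NULL,   -- first 6 digits, e.g. '120102'
    range_start  CHAR(6) NOT NULL,   -- next 6 digits, e.g. '199105'
    range_end    CHAR(6) NOT NULL,   -- last 6 digits, e.g. '199205'
    KEY idx_fit_range (fitment_code, range_start, range_end)
);
-- The lookup then becomes an ordinary indexed query where BETWEEN works:
SELECT d.data
FROM   your_table d
JOIN   fitments f ON f.data_id = d.id
WHERE  f.fitment_code = '120102'
  AND  f.range_start BETWEEN '199105' AND '199205';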

Is the following possible in SQL query

Is the following possible? I've been racking my brain to think of a solution.
I have an SQL table, a very simple table: a few text columns and two int columns.
What I'd ideally like to do is allow the user to add a row with just the text columns, and have SQL automatically put the numbers in the integer columns.
Ideally I'd like these numbers to be random but not already existing in the column (so every row has a unique number). Also 10 digits long (but I think that might be pushing it).
Is there any way I can achieve this within the query itself?
Thanks
Sure - you pass the strings as parameters to the INSERT statement, and the values as well, after you have computed them. You can use an SQL function to generate the random number, or use the code you're calling from to generate it.
You can generate unique int numbers for a row by setting the column to AUTO_INCREMENT. However, if you want something like a random hash, you need to do it in your backend (or in a stored procedure).
Just a thought: if you generate long enough random values you usually don't need to worry about duplicates. So it's safe to generate a random value, try to insert it, and retry only if you get a duplicate-entry error. That won't happen most of the time, so it might be quicker than checking first with a SELECT.
You can generate a random number using MySQL. This will generate a random number between 0 and 10,000:
FLOOR(RAND() * 10001)
If you really want the numbers to always be 10 digits long, you can generate a number between 1,000,000,000 and 9,999,999,999 like this:
FLOOR(RAND() * 9000000000) + 1000000000
The chance of the number not being unique starts at roughly 1 in 9,000,000,000 and rises as you insert new rows. For a 0% chance of collision I'd suggest doing this the right way and handling it in code, not in the database.
The random function explained:
RAND() generates a random decimal number between 0 and 1 (never actually 1). We then multiply that number by the maximum number we wish to produce, plus 1. We add 1 because with a maximum of, say, 10, the biggest number produced would be 9.xxx and never actually 10 or above (remember, RAND() never returns 1), so adding 1 makes 10.xxx possible, which FLOOR() later turns into 10. In this case, though, we don't add 1, because that would make 10,000,000,000 possible and breach our 10-digit boundary. Then we add the minimum number we want produced (+ 1,000,000,000 here) while subtracting the same amount from the maximum we multiplied by.
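Putting the suggestions above together (generate in the INSERT, enforce uniqueness with a constraint, retry on the rare collision), a sketch could look like this; the table and column names are invented for illustration:
-- Hypothetical table: the application supplies only the text columns;
-- the 10-digit number is generated inside the INSERT itself.
CREATE TABLE items (
    rand_id BIGINT UNSIGNED NOT NULL UNIQUE,  -- the random 10-digit number
    title   VARCHAR(255) NOT NULL,
    body    TEXT
);
INSERT INTO items (rand_id, title, body)
VALUES (FLOOR(RAND() * 9000000000) + 1000000000, 'some title', 'some text');
-- On the rare duplicate-key error (MySQL error 1062), the application
-- simply retries the same INSERT, which re-evaluates RAND().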

Calculate and minimize row size to get around effective MySQL column limit

I've read the more nuanced responses to the question of how many columns is too many and would like to ask a follow up.
I inherited a pretty messy project (a survey framework), but one could argue that the DB design is actually properly normalized, i.e. a person really has as many attributes as there are questions.
I wouldn't defend that notion in a debate, but the more pressing point is that I have very limited time, I'm trying to help users of this framework out and the quickest fix I can think of right now is reducing the row size. I doubt I have the skill to change the DB model in the time I have.
The column number now is 4661, but they can hopefully reduce it to at least 3244, probably less (by reducing the actual number of questions).
The hard column limit is 4096, but right now I don't even succeed in adding 2500 columns, presumably because of the row size limit which is 65,535 bytes.
However, when I calculate my row size I end up with a much lower value, because nearly all columns are TINYINTs (survey responses ranging from 1-12).
It doesn't even work with 2000 TINYINTs (an example query that fails).
Using the formula given in the documentation I get 4996 bytes or less.
column.lengths = tinyints * 1
null.columns = length(all.columns)
variable.lengths.columns = 0

row.length = 1 +
             column.lengths +
             (null.columns + 7) / 8 +
             variable.lengths.columns
## 4996
What did I misunderstand in that row length calculation?
I overlooked this paragraph
Thus, using long column names can reduce the maximum number of
columns, as can the inclusion of ENUM or SET columns, or use of column
or table comments.
I had long column names; replacing them with sequential numbers allowed me to have more columns (ca. 2693). I'll have to see if the increase is sufficient. I don't know how the names are stored, presumably as strings, so maybe I can squeeze out a few more columns by using letters.
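As a sanity check, plugging the roughly 2693 nullable TINYINT columns that did fit back into the documented formula shows the row itself stays far below 65,535 bytes, which is consistent with the quoted paragraph: the limiting factor here is table metadata such as column names, not the data bytes per row.
-- Back-of-the-envelope check of the row-length formula for ~2693
-- nullable TINYINT columns (numbers taken from the text above):
SELECT 1                        -- row header
     + 2693 * 1                 -- column lengths: TINYINT = 1 byte each
     + FLOOR((2693 + 7) / 8)    -- NULL-flag bitmap
     + 0                        -- no variable-length columns
       AS estimated_row_length; -- 3031 bytes, well under 65,535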

Selecting unique fields but am unsure if DISTINCT or GROUP BY will do as I need

I am using MySQL ver 5.5.8.
Let's say I have a table, entries, structured like so:
entry_id int PK
member_id FK
There can be multiple entries for each member. I want to get 10 of them at random, but I need to fetch them in a way that makes the odds of being selected increase with the number of entries a member has. I know I could just do something like:
SELECT member_id
FROM entries
GROUP BY member_id
ORDER BY RAND()
LIMIT 10
But I'm not sure if that will do what I want. Will MySQL group the records THEN select 10? If that were the case then every member would have the same chance to get picked, which is not what I want. I have done some testing and searching but can't come up with a definitive answer. Does anyone know if this will do what I want or do I have to do things a different way? Any help would be appreciated. Thanks much!
LIMIT 10 will choose 10 records based on (in this case) a random order. This is indeed applied after the grouping.
Maybe you can ORDER BY RAND() / COUNT(*). That way, the number is likely to be smaller for members with more entries, so they are more likely to be in the top 10.
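Plugged into the original query, that suggestion would look like the sketch below; the bias toward members with more entries is approximate rather than exactly proportional.
SELECT member_id
FROM entries
GROUP BY member_id
ORDER BY RAND() / COUNT(*)
LIMIT 10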
[edit]
By the way, it seems that over time (as the data grows) ORDER BY RAND() becomes slower. There are a couple of ways to work around that. Mediawiki (software behind Wikipedia) has an interesting method: It generates a random number for each page, so when you select 'random page', it generates one random number between 0 and 1 and selects the page that is closest to that number:
WHERE number > {randomNumber} ORDER BY number LIMIT 1
That saves having to generate a temporary table for each query. You will need to periodically re-generate the numbers if your data grows, and you must make sure the numbers are evenly distributed. That is easy enough: for new records, you can just generate a random number. Periodically the entire list is updated: all records are queried, and each record, in that order, is assigned a number between 0 and 1 that increments by 1 / recordCount. That way, the records are evenly spaced, and the chance of finding each of them is the same.
You could use that method too. It will make your query faster in the long run, and you could make the distribution smarter: 1) Instead of using memberCount, you can use totalEntryCount. 2) Instead of incrementing by 1 / memberCount, you could use entryCountForMember / totalEntryCount. That way, the gap before members with more entries will be bigger, and therefore the chance of them matching the random number will be bigger as well. For instance, your members may look like this:
name  entries  number  delta
bob        10    0.10   0.10
john        1    0.11   0.01
jim         5    0.16   0.05
fred       84    1.00   0.84
The delta isn't saved, of course, but it shows the added number. In the Mediawiki example, this delta would be the same for each page, but in your case, it could depend on the number of entries. Now you see, there's only a small gap between bob and john, so the chance that you pick a random number between 0 and bob is ten times as large as picking a random number between bob and john. So, chances of picking bob are ten times as large as picking john.
You will need a (cron) job to periodically redistribute the numbers, because you don't want to do that on each modification; but for the kind of data you're dealing with it doesn't have to be real-time, and it makes your queries a lot faster if you have many members and many entries.
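A rough sketch of that precomputed-number approach in MySQL terms; the members table and the rand_key column are assumptions made for illustration, and the redistribution statement is what the periodic (cron) job would run:
-- One-time setup: a per-member key in [0, 1], indexed for fast lookup.
ALTER TABLE members
    ADD COLUMN rand_key DOUBLE NOT NULL DEFAULT 0,
    ADD INDEX idx_rand_key (rand_key);

-- Periodic redistribution: give each member a cumulative share of [0, 1]
-- proportional to its entry count, as in the bob/john table above.
UPDATE members m
JOIN (
    SELECT e1.member_id,
           SUM(e2.cnt) / (SELECT COUNT(*) FROM entries) AS cum_share
    FROM  (SELECT member_id, COUNT(*) AS cnt FROM entries GROUP BY member_id) e1
    JOIN  (SELECT member_id, COUNT(*) AS cnt FROM entries GROUP BY member_id) e2
          ON e2.member_id <= e1.member_id
    GROUP BY e1.member_id
) s ON s.member_id = m.member_id
SET m.rand_key = s.cum_share;

-- Picking one member, weighted by entry count, without ORDER BY RAND():
SET @r = RAND();
SELECT member_id
FROM   members
WHERE  rand_key > @r
ORDER  BY rand_key
LIMIT  1;
Drawing ten distinct members then means repeating the last lookup with fresh random values and skipping duplicates, which should still be far cheaper than ORDER BY RAND() once the table gets large.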

Which mysql method is fast?

I am saving some text in the database, say 1,000 characters.
But I display only the first 200 characters.
Method 1
I could save the first 200 characters in one column and the remaining characters in a second column of the SQL table.
Method 2
I can save everything in one column and, while displaying, query for only the first 200 characters.
It would be "cleaner" to store everything in one column, and you can select only the first 200 characters like this:
select substring(your_column, 1, 200) as your_column from your_table
It really is irrelevant, but if you are trying to optimize, then method 1 is better, as long as you limit your query to that column (or only query the columns you really need), because doing any substring on the server side takes time and resources (multiplied by the number of requests...). Method 2 is cleaner, but you are optimizing for time, so method 1.
This will come down to one of two things:
If you are pulling the entire row back into PHP and then only showing the first 200 chars, then your network speed will potentially be a bottleneck when pulling the data back.
If, on the other hand, you have two columns, you will potentially have a bottleneck at the drive access that fetches the data back to your PHP - longer rows can cause slower access when reading multiple rows.
This will come down to a server-specific weigh-up; it will really depend on how your server performs. I would suggest running some scenarios where your code tries to pull back a few hundred thousand rows each way to see how long it takes.
Method 2.
First, duplicate storage of data is usually bad (denormalization). This is certainly true in this case.
Second, it would take longer to write to two columns than to one.
Third, you have now made updates and deletes vulnerable to annoying inconsistencies (see #1).
Fourth, unless you are searching the first 200 characters for text, getting data out will be the same for both methods (just select a substring of the first 200 characters).
Fifth, even if you are searching the first 200 characters, you can index on those, and retrieval speed should be identical.
Sixth, you don't want a database design that limits your UX. What if you need to change to 500 characters? You'll have a lot of work to do.
This is a very obvious case of what not to do in database design.
reference: as answered by Joe Emison http://www.quora.com/MySQL/Which-mysql-method-is-fast
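For completeness, a small sketch of method 2 as described in the answers above, with the prefix index mentioned in the fifth point; the table and column names (posts, body) are made up:
-- Single-column storage (method 2); the prefix index covers searches
-- over the first 200 characters.
CREATE TABLE posts (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT NOT NULL,
    KEY idx_body_prefix (body(200))
);
-- Display query: only the 200-character preview leaves the server.
SELECT id, SUBSTRING(body, 1, 200) AS preview
FROM posts;
Whether a prefix length of 200 actually fits the index-size limit depends on the character set and storage engine, so treat it as illustrative.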