Distributed database use cases - mysql

At the moment i do have a mysql database, and the data iam collecting is 5 Terrabyte a year. I will save my data all the time, i dont think i want to delete something very early.
I ask myself if i should use a distributed database because my data will grow every year. And after 5 years i will have 25 Terrabyte without index. (just calculated the raw data i save every day)
i have 5 tables and the most queries are joins over multiple tables.
And i need to access mostly 1-2 columns over many rows at a specific timestamp.
Would a distributed database be a prefered database than only a single mysql database?
Paritioning will be difficult, because all my tables are really high connected.
I know it depends on the queries and on the database table design and i can also have a distributed mysql database.
i just want to know when i should think about a distributed database.
Would this be a use case? or could mysql handle this large dataset?
EDIT:
in average i will have 1500 clients writing data per second, they affect all tables.
i just need the old dataset for analytics. Like machine learning and
pattern matching.
also a client should be able to see the historical data

Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiple by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.

It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-a-kind-a out of the box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Related

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I would have multiple tables and normalize everything the queries would need a lot of joins (which probably makes everything quite slow)
Queries mostly need to be performed on the whole ~ 500 million rows database, rarely on separately downloaded subsets
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries then you don't know the queries. It is to easy to make choices now that will slow down some sub-set of queries. When you know how the data will be queried you can optimize then -- it is easy to normalize after the fact (pull out temperature data into a related table for example.) For now I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table-scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2-seconds by by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: Start with the minimal datasize for each field: Look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (utf8 or utf8mb4) only for columns that need it), NOT NULL (NULL costs a bit), etc.
Sure, it is possible to have 'queries that never come back'. This one 'never comes back, even with only 100 rows in a: SELECT * FROM a JOIN a JOIN a JOIN a JOIN a. The resultset has 10 billion rows.

Ways to optimize MySQL table [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a big big table the size of table is in GB's around 130 GB. Every day data is dumped in the table.
I'd like to optimize the table... Can anyone suggest me how I should go about it?
Any input will be a great help.
It depends how you are trying to optimize it.
For querying speed, appropriate indexes including multi-column indexes would be a very good place to start. Do explains on all your queries to see what is taking up so much time. Optimize the code that's reading the data to store it instead of requerying.
If old data is less important or you're getting too much data to handle, you can rotate tables by year, month, week, or day. That way the data writing is always to a pretty minimal table. The older tables are all dated (ie tablefoo_2011_04) so that you have a backlog.
If you are trying to optimize size in the same table, make sure you are using appropriate types. If you get variable length strings, use a varchar instead of statically sized data. Don't use strings for status indicators, use an enum or int with a secondary lookup table.
The server should have a lot of ram so that it's not going to disk all the time.
You can also look at using a caching layer such as memcached.
More information about what the actual problem is, your situation, and what you are trying to optimize for would be helpful.
If your table is a sort of logging table, there can be several strategy for optimizing.
(1) Store essential data only.
If there are not necessary - nullable - columns in it and they does not be used for aggregation or analytics, store them into other table. Keep the main table smaller.
Ex) Don't store raw HTTP_USER_AGENT string. Preprocessing the agent string and store smaller data what you exactly want to review.
(2) Make the table as fixed format.
Use CHAR then VARCHAR for almost-fixed-length strings. This will be helpful for sped up SELECT queries.
Ex) ip VARCHAR(15) => ip CHAR(15)
(3) Summarize old data and dump them into other table periodically.
If you don't have to review the whole data everyday, divide it into periodically table (year/month/day) and store summarize data for old ones.
Ex) Table_2011_11 / Table_2011_11_28
(4) Don't use too many indexes for big table.
Too many indexes cause heavy load for inserting queries.
(5) Use ARCHIVE engine.
MySQL supports ARCHIVE ENGINE. This engine supports zlib for data compression.
http://dev.mysql.com/doc/refman/5.0/en/archive-storage-engine.html
It fits for logging generally(AFAIK), lack of ORDER BY, REPLACE, DELETE and UPDATE are not a big problem for logging.
You should show us what your SHOW CREATE TABLE tablename outputs so we can see the columns, indexes and so on.
From the glimpse of everything, it seems MySQL's partitioning is what you need to implement in order to increase performance further.
A few possible strategies.
If the dataset is so large, it may be of use to store certain information redundantly: keeping cache tables if certain records are accessed much more frequently than others, denormalize information (either to limit the number of joins or creating tables with less columns so you have a lean table to keep in memory at all times), or keeping summaries for the fast lookup of totals.
The summaries-table(s) can be kept in synch by either periodically generating them or by the use of triggers, or even combining both by having a cache table for the latest day on which you can calculate actual totals, and summaries for the historical data... will give you full precision while not requiring to read the full index. Test to see what delivers best performance in your situation.
Splitting your table by periods is certainly an option. It's like partitioning, but Mayflower Blog advises to do it yourself as the MySQL implementation seems to have certain limitations.
Additionally to this: if the data in those historical tables is never changed and you want to reduce space, you could use myisampack. Indexes are supported (you have to rebuild) and performance gain is reported, but I suspect you would gain speed on reading individual rows but face decreasing performance on large reads (as lots of rows need unpacking).
And last: you could think about what you need from the historical data. Does it need the exact same information you have for more recent entries, or are there things that just aren't important anymore? I could imagine if you have an access log, for example, that it stores all sorts of information like ip, referal url, requested url, user agent... Perhaps in 5 years time the user agent isn't interesting at all to know, it's fine to combine all requests from one ip for one page + css + javascript + images into one entry (perhaps have a different many-to-one table for the precise files), and the referal urls only need number of occurances and can be decoupled from exact time or ip.
Don't forget to consider the speed of the medium on which the data is stored. I think you can use raid disks to speed up access or maybe store the table in RAM but at 130GB that might be a challenge! Then consider the processor too. I realise this isn't a direct answer to your question but it may help achieve your aims.
You can still try to do partitioning using tablespaces or "table-per-period" structure as #Evan advised.
If your fulltext searching is failing may be should go to Sphinx/Lucene/Solr. External search engines can definitely help you to get faster.
If we are talking about table structure than you should use the smallest datatype if it possible.
If optimize table is too slow and it's true for the really big tables you can backup this table and restore it. Off course in this case you will need to get some downtime.
As bottom line:
if your issue concerning fulltext searching than before applying any table changes try to use external search engines.

handling large dataset using MySQL

I am trying to apply for a job, which asks for the experiences on handling large scale data sets using relational database, like mySQL.
I would like to know which specific skill sets are required for handling large scale data using MySQL.
Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks and understanding the MySQL internals to see how queries get planned an executed by the database server in order to increase query performance
Configuring the MySQL server to be able to handle a lot of concurrent connections, and access it's data fast. Hardware bottlenecks, and the advantages to using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 Array to increase IO performance))
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remis if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, and has a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually) and reduces the number of writes you have to do in order to create/update a new record.
There are different levels of normalization (as #Adam Robinson pointed out in the comments below) which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond the 3NF (3rd Normal Form). Which the definition of, if you were to read that wikipedia link above, will probably make your head hurt. So in lamens (at the risk of dumbing it down too far...) a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table, and referenced by the Employees table (in the previous example) by the zip_code). This eliminates large sets of duplication within your tables, so any change that is required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code.
Each sub-set of data is moved to it's own table and is identified by it's own primary key (this is touched/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but the years_employed can be derived by subtracting start_date from end_date. This is important because as end-date increases, so does years_employed so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1)
A fully normalized (3NF) database table structure is great, if you've got a very heavy write-load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is, all your reads become much more expensive, because you have to (typically) run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive and harder to create proper indexes for when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables, and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results you're going to see a significant performance drop. This is less-so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (i mean, unless you're prepared for them to fail and have a very fast backup recovery system in place...then use them and gonuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:
You need to know how to process the data in "chunks". That means instead of simply trying to manipulate the entire data set, you need to break it into smaller more manageable pieces. For example, if you had a table with 1 Billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to it's knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance
You can raed/learn how to handle large dataset with MySQL But it is not equivalent to having actual experiences.
Straight and simple answer: Study about partitioned database and find appropriate MySQL data structure types for large scale datasets similar with the partitioned database architecture.

Which granulary to choose for database table partitioning?

I have a 20-million record table in MySQL database. SELECT's work really fast because I have set up good indexes, but INSERT and UPDATE operation is getting to be really slow. The database is back-end of a web application under heavy load. INSERTs and UPDATEs are really slow because there are some 5 indexes on this table and index size is about 1GB now - I guess it takes to much time to compute.
To solve this problem, I decided to partition a table. I run MySQL 4, and cannot upgrade (no direct control over server), so I'll do manual partitioning - create a separate table for each section.
The data-set is composed from about 18000 different logical slices, which could be queried completely separately. Therefore, I could create 18000 tables named (maindata1, maindata2, etc.). However, I'm not sure that this is optimal way do to it? Beside the obvious fact that I'll have to browse through 18000 items in administration tool whenever I want to do something manually, I'm concerned about file-system performance. File-system is ext3. I'm not sure how fast it is in locating files in a directory with 36000 files (there's data file and index file).
If this is a problem, I could join some slices of data together into a same table. For example: maindata10, maindata20, etc. where maindata10 would contain slices 1, 2, 3...10. If I would go for "groups" of 10, I would only have 1800 tables. If I would group 20, I would get 900 tables.
I wonder what would be the optimal size of this grouping, i.e. number of files in a directory vs table size?
Edit: I also wonder if it would be a good idea to use multiple separate databases to group files together. So, even if I would have 18000 tables, I could group them in, say, 30 databases of 600 tables each. It seems like this would be much easier to manage. I don't know if having multiple databases would increase or decrease performance or memory footprint (it would complicate backup and restore though)
There are a few tactics you could follow to boost performance. By "partitions" I assume you mean "versions of tables with the same column layout but different data contents."
Get a server that will run mySQL 5 if you possibly can. It's faster and better at this stuff, enough so that you may not have a problem after you upgrade.
Are you using InnoDB? If so, can you switch to myISAM? (If you need rigid transactional integrity you might not be able to switch).
For partitioning, you might try to figure out what kind of data-slice combination will give you roughly equal-size partitions (by row count). If I were you I'd go for no more than about 20 partitions unless you can prove to yourself that you need to.
If only a few of your data slices are being actively updated (for example, if they are "this month's data" and "last month's data), I might consider splitting those into smaller slices. For example, you might have "this week's data", "last week's," and "the week before" in their own partitions. Then, when your partitions cool off, copy their data and combine them into bigger groups like "the quarter before last." This has the disadvantage that it will require routine Sunday-evening style maintenance jobs to run. But it has the advantage that most or all updates only happen on a small fraction of your table.
you should look into the merge engine if you are using myISAM, this way you can get pretty much the same functionality as a partitioning of mysql5, you will be able to run the same select as you are running now.

Building a Large Table in MySQL

This is my first time building a database with a table containing 10 million records. The table is a members table that will contain all the details of a member.
What do I need to pay attention when I build the database?
Do I need a special version of MySQL? Should I use MyISAM or InnoDB?
For a start, you may need to step back and re-examine your schema. How did you end up with 10 million rows in the member table? Do you actually have 10 million members (it seems like a lot)?
I suspect (although I'm not sure) that you have less than 10 million members in which case your table will not be correctly structured. Please post the schema, that's the first step to us helping you out.
If you do have 10 million members, my advice is to make your application vendor-agnostic to begin with (i.e., standard SQL). Then, if you start running into problems, just toss out your current DBMS and replace it with a more powerful one.
Once you've established you have one that's suitable, then, and only then would I advise using vendor-specific stuff. Otherwise it will be a painful process to change.
BTW, 10 million rows is not really considered a big database table, at least not where I come from.
Beyond that, the following is important (not necessarily an exhaustive list but a good start).
Design your tables for 3NF always. Once you identify performance problems, you can violate that rule provided you understand the consequences.
Don't bother performance tuning during development, your queries are in a state of flux. Just accept the fact they may not run fast.
Once the majority of queries are locked down, then start tuning your tables. Add whatever indexes speed up the selects, de-normalize and so forth.
Tuning is not a set-and-forget operation (which is why we pay our DBAs so much). Continuously monitor performance and tune to suit.
I prefer to keep my SQL standard to retain the ability to switch vendors at any time. But I'm pragmatic. Use vendor-specific stuff if it really gives you a boost. Just be aware of what you're losing and try to isolate the vendor-specific stuff as much as possible.
People that use "select * from ..." when they don't need every column should be beaten into submission.
Likewise those that select every row to filter out on the client side. The people that write our DBMS' aren't sitting around all day playing Solitaire, they know how to make queries run fast. Let the database do what it's best at. Filtering and aggregation is best done on the server side - only send what is needed across the wire.
Generate your queries to be useful. Other than the DoD who require reports detailing every component of their aircraft carriers down to the nuts-and-bolts level, no-one's interested in reading your 1200-page report no matter how useful you think it may be. In fact, I don't think the DoD reads theirs either, but I wouldn't want some general chewing me out because I didn't deliver - those guys can be loud and they have a fair bit of sophisticated weaponry under their control.
At least use InnoDB. You will feel the pain when you realize MyISAM has just lost your data...
Apart from this, you should give more information about what you want to do.
You don't need to use InnoDB if you don't have data integrity and atomic action requirements. You want to use InnoDB if you have foreign keys between tables and you are required to keep the constraints, or if you need to update multiple tables in atomic operation. Otherwise, if you just need to use the table to do analysis, MyISAM is fine.
For queries, make sure you build smart indexes to suite your query. For example, if you want to sort by columns c and selecting based on columns a, and b, make sure you have an index that covers columns a, b, and c, in that order, and that index includes full length of each column, rather than a prefix. If you don't do your index right, sorting over a large amount of data will kill you. See http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Just a note about InnoDB and setting up and testing a large table with it. If you start injecting your data, it will take hours. Make sure you issue commits periodically, otherwise if you want to stop and redo for whatever reason, you end up have to 1) wait hours for transaction recovery, or 2) kill mysqld, set InnoDB recover flag to no recover and restart. Also if you want to re-inject data from scratch, DROP the table and recreate it is almost instantaneous, but it will take hours to actually "DELETE FROM table".