I have a 4-way parallel input file and a 6-way parallel input file. I want to write the data into an 8-way multifile. How can you do this? - ab-initio

You must repartition both input flows to 8-way using a Partition component.
If the order doesn't matter and won't matter downstream, use
Partition by Round Robin, which does what it sounds like: the records
are redistributed evenly to the output flow's partitions.
If the records are sorted or will be sorted downstream, use Partition
by Key on both the 4-way and 6-way input flows, using the sort key. Partition by Key ensures all records with the same key will be on the same parallel partition, so they can be correctly sorted. If records are already sorted on the input flows, connect the two Partition by Key components to a Merge component using the same key, to combine them while maintaining order. If you will be sorting after combining the input flows, you can use a Gather component, since the order of individual records within each partition doesn't matter yet -- you are about to sort them in the 8-way flow.
If the records will be processed downstream in groups, e.g. by Rollup or Join, then you must partition by at least the first field of a multi-part key. This is more subtle, because a key may consist of multiple fields, e.g. {state; city}. You can partition by fewer fields than the full key before a Rollup, because that is sufficient to ensure all records with the same key are on the same partition: partitioning by {state} alone ensures that all records with any given {state; city} value will be on the same partition.

Related

Is there a benefit to indexing a MySQL column if I always have a different value in every row?

The question is about columns like a timestamp, where a different value is stored in every row.
I have already searched Stack Overflow and read about indexes, but I don't understand the benefit if no value ever equals another. The index cardinality will be equal to the number of rows. What is the benefit?
This kind of column would actually be an excellent candidate for an index, preferably a unique one.
Tables are unsorted sets of data, so without any knowledge about the table, the database will have to go over the entire table sequentially to find the rows you're looking for (O(n) complexity, where n is the number of rows).
An index is essentially a tree that stores values in sorted order, which allows the database to find the rows you're looking for intelligently (O(log n)). In addition, making the index unique tells the database there can be only one row per timestamp value, so once a single row is retrieved the database can stop searching for more.
The performance benefit for such an index, assuming you search for rows according to timestamps, should be significant.
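As a rough illustration (the events table and its columns are assumptions, not taken from the question), a unique index on the timestamp column turns lookups into index searches rather than full scans:

    -- Hypothetical table for illustration only.
    CREATE TABLE events (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at DATETIME(6) NOT NULL,
        payload    VARCHAR(255),
        PRIMARY KEY (id),
        UNIQUE KEY uq_created_at (created_at)   -- one row per timestamp value
    );

    -- With the unique index this is an index lookup, not a full table scan,
    -- and MySQL can stop after the first match because the index is unique.
    SELECT * FROM events WHERE created_at = '2024-01-15 10:30:00';

    -- Range queries on the timestamp also benefit from the sorted index.
    SELECT * FROM events
    WHERE created_at BETWEEN '2024-01-01' AND '2024-02-01';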
An index is a map between key values and retrieval pointers. The DBMS uses an index during a query if a strategy that uses the index appears to be optimal.
If the index never gets used, then it is useless.
Indexes can speed up lookups based on a single keyed value, or based on a range of key values (depending on the index type), or by allowing index-only retrieval in cases where only the key is needed for the query. Speedups can be as low as two to one or as high as a hundred to one, depending on the size of the table and various other factors.
If your timestamp field is never used in the WHERE clause or the ON clause of a query, the chances are you are better off with no index. The art of choosing indexes well goes a lot deeper than this, but this is a start.
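Continuing the hypothetical events table from above, EXPLAIN shows whether the index is actually chosen, and a query that needs only the indexed column can often be answered from the index alone:

    -- EXPLAIN reveals which access path the optimizer picked for this predicate.
    EXPLAIN SELECT * FROM events WHERE created_at >= '2024-01-01';

    -- Only the indexed column is referenced here, so this can typically be
    -- satisfied from the index itself ("Using index" in the EXPLAIN output).
    SELECT COUNT(*) FROM events WHERE created_at >= '2024-01-01';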

Database Denormalization for Performance Gains for SELECT reports

Note: This question can be answered keeping MySQL or MSSQL RDBMS in mind
Background:
Let's say you have a table named records. This table has 20 fields, some of which are VARCHAR(255).
You have to run reports on two fields named amount (FLOAT) and status (INT).
Since one record can have only one status, it is kept in the same table.
Table is indexed on status and amount.
Situation:
Indexing is working OK; even with more than 10 million records, the response times for grouping on those two fields are acceptable.
However, as the data grows, the efficiency of the index is reduced, because the RDBMS still has to read through all of that row data and not just those two fields. This results in slower and slower reports even with proper indexing.
Question:
Although amount has a one-to-one relation with the record, and it does not really make sense to put amount and status in a separate table along with a record id foreign key, do you think that would make things more efficient, even if the schema becomes less normalized?
Why do I ask this question?
Because it sounds like simple logic to me: if I have a separate table which contains a record id and the corresponding amount, then reports on amount and status will be much faster than with the current setup, because the database now has to look at less data, less data has to pass through the data bus, and all the fields that are not needed for the report are not being read at the OS level. I know that when I run a report on amount and status the database won't care about the other fields and the values stored in them, but it still has to read all that row data in order to parse the records, and at the disk level that still results in reads.
Denormalizing the database gives you a very good performance (response time) gain, but you have to compromise on space usage.
In your case, I think partitioning the table horizontally can give some increase in performance.
Range – this partitioning mode allows a DBA to specify various
ranges for which data is assigned. For example, a DBA may create a
partitioned table that is segmented by three partitions that contain
data for the 1980's, 1990's, and everything beyond and including the
year 2000.
Hash – this partitioning mode allows a DBA to separate data based on
a computed hash key that is defined on one or more table columns,
with the end goal being an equal distribution of values among
partitions. For example, a DBA may create a partitioned table that
has ten partitions that are based on the table's primary key.
Key – a special form of Hash where MySQL guarantees even
distribution of data through a system-generated hash key.
List – this partitioning mode allows a DBA to segment data based on
a pre-defined list of values that the DBA specifies. For example, a
DBA may create a partitioned table that contains three partitions
based on the years 2004, 2005, and 2006.
Composite – this final partitioning mode allows a DBA to perform
sub-partitioning where a table is initially partitioned by, for
example range partitioning, but then each partition is segmented
even further by another method (for example, hash).
Taken from the MySQL dev documentation.
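A minimal sketch of how range partitioning might be applied to the records table described above (the created column and the partition boundaries are assumptions for illustration):

    CREATE TABLE records (
        id      INT UNSIGNED NOT NULL,
        created DATE NOT NULL,
        status  INT NOT NULL,
        amount  FLOAT NOT NULL,
        -- ...plus the other VARCHAR(255) columns from the original table
        PRIMARY KEY (id, created)   -- the partitioning column must appear in every unique key
    )
    PARTITION BY RANGE (YEAR(created)) (
        PARTITION p_old  VALUES LESS THAN (2010),
        PARTITION p2010s VALUES LESS THAN (2020),
        PARTITION p_now  VALUES LESS THAN MAXVALUE
    );

    -- Reports that filter on the partition column only scan the matching partitions.
    SELECT status, SUM(amount)
    FROM records
    WHERE created >= '2020-01-01'
    GROUP BY status;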

MySQL clustered index: how do gaps in id values impact query performance?

My data will be distributed among 50 databases with identical schema, let's say only one table ORDER (one DB for each of 50 clients) but each record must be globally identifiable. I plan to use numeric UID as PK.
My understanding is that MySQL will create a clustered index for this PK.
The data will always be inserted in monotonically increasing UID order.
Question about query performance: I have two choices when deciding how to generate the UIDs. Which one will be better for query performance (within a given DB), or does it not matter?
1) For each client/database I assign a fixed, hardcoded 'range' which will definitely be sufficient for all future records there: I pick a really huge numeric range on the scale of 10^15, and within that range I increment by one, so that all UID values for this particular DB will be large but there will be no 'holes' between them.
2) I use a globally shared HiLo generator for records in all databases, which means for a given DB the records there will have a smaller value (compared to 10^15 scale in #1) but there will be more 'holes' between sequential UID records (or rather, between batches of UID: i.e. if the batch size is 100 there will be UIDs: 100,101,102,...199, and then 1400,1401,1402..1499, and then possibly 16000,16001,..16099)
The simplest solution would be adding an instance_id column to all tables, predetermined for each database, and using the standard auto_increment mechanism. The actual unique id for a record would be the tuple (instance_id, autoinc_val).
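A minimal sketch of that approach for an InnoDB ORDER table (every column besides instance_id and id is illustrative):

    -- instance_id is predetermined per database (e.g. 1..50);
    -- the pair (instance_id, id) is the globally unique identifier.
    CREATE TABLE `ORDER` (
        instance_id SMALLINT UNSIGNED NOT NULL,
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        created_at  DATETIME NOT NULL,
        PRIMARY KEY (instance_id, id),
        KEY idx_id (id)   -- InnoDB requires the AUTO_INCREMENT column to lead some index
    ) ENGINE=InnoDB;

    -- Inserts never specify id; each database fills in its own
    -- monotonically increasing values, and instance_id is a constant per database.
    INSERT INTO `ORDER` (instance_id, created_at) VALUES (7, NOW());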

How is MySQL ORDER BY implemented internally?

How is MySQL ORDER BY implemented internally? Would ordering by multiple columns involve scanning the data set multiple times, once for each column specified in the ORDER BY clause?
Here's the description:
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Unless you have out-of-row columns (BLOB or TEXT) or your SELECT list is too large, this algorithm is used:
Read the rows that match the WHERE clause.
For each row, record a tuple of values consisting of the sort key value and row position, and also the columns required for the query.
Sort the tuples by sort key value.
Retrieve the rows in sorted order, but read the required columns directly from the sorted tuples rather than by accessing the table a second time.
Ordering by multiple columns does not require scanning the dataset twice, since all data required for sorting can be fetched in a single read.
Note that MySQL can avoid the sort completely and just read the rows in order, if you have an index whose leftmost part matches your ORDER BY clause.
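As a rough illustration (the purchases table and its index are assumptions), EXPLAIN shows the difference between an ORDER BY that is satisfied by an index and one that falls back to a filesort:

    CREATE TABLE purchases (
        id       INT UNSIGNED NOT NULL PRIMARY KEY,
        customer INT UNSIGNED NOT NULL,
        created  DATETIME NOT NULL,
        KEY idx_customer_created (customer, created)
    );

    -- customer is constant, so rows can be read from idx_customer_created in
    -- created order: EXPLAIN shows no "Using filesort".
    EXPLAIN SELECT id FROM purchases WHERE customer = 42 ORDER BY created;

    -- No index starts with created, so EXPLAIN shows "Using filesort".
    EXPLAIN SELECT id FROM purchases ORDER BY created DESC, id;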
MySQL is canny. Its sorting strategy depends on a few factors -
Available Indexes
Expected size of result
MySQL version
MySQL has two methods to produce sorted/ordered streams of data.
1. Smart use of Indexes
First, the MySQL optimiser analyses the query and figures out whether it can just take advantage of the sorted indexes available. If yes, it naturally returns records in index order. (The exception is the NDB engine, which needs to perform a merge sort once it gets data from all storage nodes.)
Hats off to the MySQL optimiser, which smartly figures out whether index access is cheaper than the other access methods.
A really interesting thing to note here:
The index may also be used even if ORDER BY doesn’t match the index exactly, as long as other columns in ORDER BY are constants
Sometimes the optimizer may not use the index if it finds index access more expensive than scanning through the table.
2. Filesort Algorithm
If indexes cannot be used to satisfy an ORDER BY clause, MySQL falls back to the filesort algorithm. This is a really interesting algorithm. In a nutshell, it works like this:
It scans through the table and finds the rows which match the WHERE condition.
It maintains a buffer and stores a few values (sort key value, row pointer and the columns required by the query) from each row in it. The size of this buffer is the system variable sort_buffer_size.
When the buffer is full, it runs a quicksort on it based on the sort key, writes the sorted chunk to a temp file on disk, and remembers a pointer to it.
It repeats the same step on chunks of data until there are no more rows left.
Now it has a number of chunks which are sorted.
Finally, it applies a merge sort to all the sorted chunks and puts the result in one result file.
In the end, it fetches the rows from the sorted result file.
If the expected result fits in one chunk, the data never hits disk, but remains in RAM.
For more detailed info: https://www.pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql
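If you want to observe this behaviour, the sort buffer size and the number of on-disk merge passes are both visible (a sketch; the 4 MB value is just an example):

    -- How big each in-memory sort chunk is allowed to be.
    SHOW VARIABLES LIKE 'sort_buffer_size';

    -- Sort_merge_passes grows when sorted chunks spill to disk and have to be merged.
    SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';

    -- Optionally raise the buffer for the current session so more sorts stay in RAM.
    SET SESSION sort_buffer_size = 4 * 1024 * 1024;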

Which is faster: Many rows or many columns?

In MySQL, is it generally faster/more efficient/scalable to return 100 rows with 3 columns, or 1 row with 100 columns?
In other words, when storing many key => value pairs related to a record, is it better to store each key => value pair in a separate row with the record_id as a key, or to have one row per record_id with a column for each key?
Also, assume also that keys will need to be added/removed fairly regularly, which I assume would affect the long term maintainability of the many column approach once the table gets sufficiently large.
Edit: to clarify, by "a regular basis" I mean the addition or removal of a key once a month or so.
You should never add or remove columns on a regular basis.
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
There are a lot of bad things about this model and I would not use it if there was any other alternative. If you don't know the majority (except a few user customizable fields) of data columns you need for your application, then you need to spend more time in design and figure it out.
If your keys are preset (known at design time), then yes, you should put each key into a separate column.
If they are not known at design time, then you have to return your data as a list of key-value pairs which you should later parse outside the RDBMS.
If you are storing key/value pairs, you should have a table with two columns, one for the key (make this the PK for the table) and one for the value (probably don't need this indexed at all). Remember, "The key, the whole key, and nothing but the key."
In the multi-column approach, you will find that your table grows without bound, because removing a column will nuke all its values and you won't want to do it. I speak from experience here, having worked on a legacy system that had one table with almost 1000 columns, most of which were bit fields. Eventually, you stop being able to make the case for deleting any of the columns, because someone might be using one, and the last time you did it you had to work till 2 am rolling back to backups.
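A minimal sketch of that key/value layout, adapted to per-record pairs as in the question (table and column names are assumptions):

    -- One row per (record, key); the primary key is the whole key.
    CREATE TABLE record_attributes (
        record_id  INT UNSIGNED NOT NULL,
        attr_key   VARCHAR(64)  NOT NULL,
        attr_value VARCHAR(255) NULL,
        PRIMARY KEY (record_id, attr_key)
    );

    -- Fetch all key => value pairs for one record.
    SELECT attr_key, attr_value
    FROM record_attributes
    WHERE record_id = 123;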
First: determine how frequently your data needs to be accessed. If the data always needs to be retrieved in one shot and most of it is used, then consider storing all the key pairs as a serialized value or as an XML value. If you need to do any sort of complex analysis on that data and you need the value pairs, then columns are OK, but limit them to values you know you will need for your queries. It's generally easier to design queries that use one column per parameter than one row per parameter. You will also find it easier to work with the returned values if they are all in one row rather than spread across many.
Second: separate your most frequently accessed data and put it in its own table, and the other data in another. 100 columns is a lot, by the way, so I recommend that you split your data into smaller, more manageable chunks.
Lastly: if you have keys that may change frequently, you should create the key in its own table and then store its numeric id in the table that holds the values. This assumes that you will be using the same key more than once, and it should speed up your search when you go to do your lookup.
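A sketch of that last suggestion, under the same assumed naming: keys live in their own lookup table and values reference them by a numeric id.

    -- Lookup table of keys; reused across records.
    CREATE TABLE attribute_keys (
        key_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
        key_name VARCHAR(64) NOT NULL,
        PRIMARY KEY (key_id),
        UNIQUE KEY uq_key_name (key_name)
    );

    -- Values reference the key by its compact numeric id.
    CREATE TABLE record_values (
        record_id  INT UNSIGNED NOT NULL,
        key_id     SMALLINT UNSIGNED NOT NULL,
        attr_value VARCHAR(255) NULL,
        PRIMARY KEY (record_id, key_id),
        FOREIGN KEY (key_id) REFERENCES attribute_keys (key_id)
    );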