How does Dremel or its implementation (say Drill) handle large columnar data layout in memory? - apache-drill

I am reading the Google Dremel white paper and learned that it converts complex data into a columnar data layout.
Where is this data stored?
Since Drill has no central metadata repository, I assume it must be held in memory.
If so, how does Drill handle this data when I have billions of rows?

To get complete, consistent query results from billions of rows, you would use a distributed file system connected to multiple Drillbits, simulate a distributed file system by copying files to each node, or use an NFS volume such as Amazon Elastic File System. Drill queries big data efficiently using a number of techniques, including these:
Relies on the cluster nodes to handle failures (doesn't spend time on failure-related tasks).
Uses an in-memory data model that's hierarchical and columnar (doesn't access the disk for columns that are not involved in an analytic query, processing the columnar data without row materialization).
Uses columnar storage optimizations and execution (keeps memory footprint low).
Uses vectorization to work on arrays of values from different records rather than single values from one record at a time.
For more information, see http://drill.apache.org/docs/performance/.
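As a rough illustration of the columnar idea (this is not Drill's actual implementation), the sketch below pivots row records into one array per column, so that a query touching a single column never reads the others. The record fields and the `to_columns` helper are made up for the example, and every row is assumed to carry the same fields.

```python
from typing import Any


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Hypothetical records; the field names are assumptions for illustration.
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]
columns = to_columns(rows)

# An aggregate over "bytes" scans one array and never touches "user" or
# "country"; no row objects are rebuilt along the way.
total_bytes = sum(columns["bytes"])
```

The aggregate works on a whole array of values from different records at once, which is the access pattern that vectorized execution relies on.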

Related

How to efficiently use MySQL for stock/time-series data?

I use Python and MySQL to ingest data via an API, generate signals, and execute orders. Currently, things are functional yet coupled: a single script fetches data, stores it in MySQL, generates signals, and then executes orders. Tightly coupled does not mean all the logic is in the same file; there are separate functions for different tasks. But if the script breaks, everything halts. The DB tables are created on the fly, based on the instruments that pass a filter mechanism: the Python code creates one table per instrument, all with the same schema but each named after its instrument.
Now I want to separate the parts:
Data Ingestion (A Must)
Signal Generation
Order Execution
Reporting
I am mainly focusing on the first three. My concern is that if separate processes are running and acting on the same tables, will that cause locking or similar problems? How do I handle it smoothly? Or is MySQL good enough for this, or should I move to another DB such as Postgres?
We are already using a Digital Ocean instance, and MySQL is currently installed on that same instance.
If you intend to ingest/query time-series at scale, a conventional RDBMS will fall short at one point or another. They are designed for a use case in which reads are more frequent than writes, and optimise for that.
There is a whole family of databases designed specifically for working with Time-Series data. These time-series databases can ingest data at high throughput while running queries on top, and they usually give you lifecycle capabilities so you can decide what to do when data keeps growing.
There are many options available, both open source and proprietary. Out of those databases I would recommend trying QuestDB, for a few reasons:
It is open source and Apache 2.0 licensed, so you can use it anywhere for anything
It is a single binary (or docker container) to operate
You query data using SQL (with extensions for time series)
You can insert data using SQL, but you will experience locks if using concurrent clients. However you can also ingest data using the ILP protocol which is designed for ingestion speed. There are official clients in 7 languages so you don't have to deal with the low-level details
It is blazingly fast. I have seen over 2 million inserts per second on a single instance and some users report sustained workloads of over 100,000 events per second
It is well supported on Digital Ocean
There are a lot of public references (and many users who are not a reference) in the finance/trading/crypto industries
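To give a feel for what the ILP text format looks like on the wire, here is a minimal sketch that formats one row as a line-protocol string. The `ilp_line` helper, the table name, and the columns are hypothetical; the sketch skips escaping of spaces, commas, and quotes, and it is not a substitute for the official clients mentioned above.

```python
def ilp_line(table: str, symbols: dict, fields: dict, ts_nanos: int) -> str:
    """Format one row as a line-protocol string (no escaping handled)."""

    def fmt(value):
        if isinstance(value, bool):
            raise TypeError("booleans are left out of this sketch")
        if isinstance(value, int):
            return f"{value}i"      # integers carry an 'i' suffix
        if isinstance(value, float):
            return repr(value)      # floats are written as plain decimals
        if isinstance(value, str):
            return f'"{value}"'     # string fields are double-quoted
        raise TypeError(f"unsupported field type: {type(value).__name__}")

    symbol_part = "".join(f",{k}={v}" for k, v in symbols.items())
    field_part = ",".join(f"{k}={fmt(v)}" for k, v in fields.items())
    return f"{table}{symbol_part} {field_part} {ts_nanos}"


# Hypothetical trade row; the names and values are assumptions.
line = ilp_line(
    "trades",
    {"symbol": "AAPL"},
    {"price": 189.5, "size": 100},
    1700000000000000000,
)
```

Each line is one row: the table name and symbol columns come first, then the value columns, then a nanosecond timestamp.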

Handling 2000+ requests per second on MySQL?

Are there any tools or a proper way to handle more than 2000 requests (mostly write requests) per second to a MySQL database without reaching the queue limit?
There are a few different ways to handle massive amounts of requests to a MySQL (or any other relational/RDB) database. As traffic starts to grow, you can employ replication, which lets additional machines serve read-only queries (no INSERTs, UPDATEs, DELETEs, etc.) while all writes go to a single "master" machine. The read replicas copy the written data from the master (the write-allowed instance) but may lag slightly behind the latest writes for a short period of time. Oracle (owner of the MySQL project) has a good article about it (and scaling PHP) here: http://www.oracle.com/technetwork/articles/dsl/white-php-part1-355135.html
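The read/write split described above can be sketched at the application level as a small router that sends writes to the primary and spreads reads across replicas. The `Router` class, the verb list, and the server names are assumptions for illustration; real setups often delegate this to a proxy or the driver, and the sketch ignores replication lag and transactions.

```python
import itertools

# Statements starting with these verbs are treated as writes (an assumed,
# deliberately incomplete list).
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}


class Router:
    """Pick a server for each statement: primary for writes, replicas for reads."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql: str) -> str:
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in WRITE_VERBS or self._replicas is None:
            return self.primary
        return next(self._replicas)


# Hypothetical host names.
router = Router("db-primary", ["db-replica-1", "db-replica-2"])
```

A call such as `router.target_for("INSERT INTO t VALUES (1)")` returns the primary, while successive SELECT statements alternate between the replicas.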
Once your app begins taking on requests on a truly massive scale (like Facebook, Google, etc. level) you will want to consider other strategies such as clustering, utilizing NoSQL (for certain functions such as search, analytics, logging, monitoring, etc.), splitting tables and databases based on geographic regions (if it makes sense). There is a starter white paper here: https://www.mysql.com/why-mysql/white-papers/guide-to-scaling-web-databases-with-mysql-cluster/
You can also conduct generic searches for "scaling MySQL" which deliver even more results.
MariaDB 10+ comes with Galera Cluster that allows you to have multiple MASTER servers and you can load balance either by IP or through a device.
Also, the number of requests per second depends on how fast a write completes. If you have a simple atomic raw write, you can turn off INDEXES on the receiving table so it's as fast as your server can handle. That raw table can be MyISAM rather than InnoDB, which is usually up to 10x faster for writes. Have another process read the raw data in bulk into another table with proper indexes. We've had success with up to 10K transactions/second this way.
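A minimal sketch of that raw-table pattern follows, using Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it says nothing about MyISAM or InnoDB performance. The table names, columns, and helper functions are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw landing table: no indexes, so each write does as little work as possible.
conn.execute("CREATE TABLE raw_events (payload TEXT, created_at TEXT)")
# Query table: indexed, filled in bulk by a separate step.
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_events_created ON events (created_at)")


def write_raw(conn: sqlite3.Connection, payload: str, created_at: str) -> None:
    """Fast path: append one row to the unindexed table."""
    conn.execute("INSERT INTO raw_events VALUES (?, ?)", (payload, created_at))


def flush_raw(conn: sqlite3.Connection) -> int:
    """Bulk path: move everything from the raw table into the indexed table."""
    with conn:  # one transaction, so the copy and the delete succeed or fail together
        moved = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
        conn.execute(
            "INSERT INTO events (payload, created_at) "
            "SELECT payload, created_at FROM raw_events"
        )
        conn.execute("DELETE FROM raw_events")
    return moved
```

In a real deployment the flush would run in its own process on a schedule, and it would need to avoid deleting rows that arrived after the copy started, for example by moving only rows up to a recorded cutoff.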

Hive or HBase: which one should I use for my data analytics?

I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop and Hive or HBase.
Currently I have 4 physical machines for a POC. Could someone please help me come up with the most efficient architecture?
I will get 5 GB of data per day.
I have to send a daily status report to each customer.
I also have to provide analysis reports on request, for example a 1-week report or a report for the first 2 weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want to get the best performance out of Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or other approaches, such as emerging technologies like Spark. The advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It's batch processing, and 5 more GB a day shouldn't have a big impact. It does have high latency overhead, but that shouldn't be a problem if you run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with either of those. HBase uses the DFS to store the data and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from these two sources.
For on-request reports that specify a date range, you should store the data in an efficient way so you don't have to read data that's not needed for your report. Hive supports partitioning, and that can be done by date (e.g. /<year>/<month>/<day>/). Using partitioning can significantly reduce your job execution times.
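To make the benefit of date partitioning concrete, the sketch below lists the only directories a job would need to read for a given date range, following the /<year>/<month>/<day>/ layout mentioned above. The `partition_paths` helper and the root path are hypothetical; Hive does this pruning itself when a query filters on the partition columns.

```python
from datetime import date, timedelta


def partition_paths(root: str, start: date, end: date) -> list[str]:
    """Return one partition directory per day in the inclusive range."""
    days = (end - start).days
    return [
        f"{root}/{d.year:04d}/{d.month:02d}/{d.day:02d}"
        for d in (start + timedelta(days=i) for i in range(days + 1))
    ]


# A three-day report reads three directories, however large the table is.
paths = partition_paths("/data/events", date(2014, 5, 30), date(2014, 6, 1))
```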
If you go with the NoSQL approach, be sure the rowkeys have a date as prefix (e.g. 20140521...) so that you can select those that start with the dates you want.
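The date-prefix idea works because a store that keeps rowkeys sorted, as HBase does, can jump straight to the first key with a given prefix and stop at the last one. The sketch below models that with a sorted Python list; the key format and the `prefix_scan` helper are assumptions for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys: list[str], prefix: str) -> list[str]:
    """Return all keys starting with prefix, using binary search on sorted keys."""
    lo = bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after any character used in these keys, so this finds
    # the end of the prefix range.
    hi = bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


# Hypothetical rowkeys: <yyyymmdd>-<customer>.
keys = sorted([
    "20140520-cust1",
    "20140521-cust1",
    "20140521-cust2",
    "20140522-cust1",
])
one_day = prefix_scan(keys, "20140521")
```

Without the date prefix, the matching rows would be scattered across the key space and the store would have to scan everything.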
Some questions you should also consider are:
How much data do you want to store in your cluster (e.g. the last 180 days)? This will affect the number of nodes and disks. Beware that data is usually replicated 3 times.
How many files do you have in HDFS? When the number of files is high, the Namenode will be hit hard on retrieval of file metadata. Some solutions exist, such as a replicated Namenode or the MapR Hadoop distribution, which doesn't rely on a Namenode per se.
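As a worked example of the sizing question, the arithmetic below combines the 5 GB per day from the question with the 180-day retention and 3x replication mentioned above. The `raw_storage_gb` helper is hypothetical, and the estimate ignores the existing 150 GB, compression, and working space.

```python
def raw_storage_gb(daily_gb: float, retention_days: int, replication: int = 3) -> float:
    """Disk space needed for replicated data over the retention window."""
    return daily_gb * retention_days * replication


# 5 GB/day * 180 days * 3 replicas = 2700 GB across the cluster.
needed_gb = raw_storage_gb(5, 180)
```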

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as our primary data store for our application that collects a very large amount of data and requires large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However being very new to this field, I am not sure how well they would scale and also how much load they would put on the Cassandra cluster itself when running? Are there other options as well?
You should take a look at sqoop, it has an integration with Cassandra as shown here.
This will also scale easily. You need a Hadoop cluster to get sqoop working, and the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
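The two steps above can be sketched as follows. This is not sqoop's code: the `split_range` and `transfer_slice` helpers are hypothetical, a thread pool stands in for the mappers, and the transfer itself is a placeholder that only counts rows.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(lo: int, hi: int, n: int) -> list[tuple[int, int]]:
    """Split [lo, hi) into n contiguous slices of near-equal size."""
    step, extra = divmod(hi - lo, n)
    slices = []
    start = lo
    for i in range(n):
        # The first `extra` slices take one additional element each.
        end = start + step + (1 if i < extra else 0)
        slices.append((start, end))
        start = end
    return slices


def transfer_slice(bounds: tuple[int, int]) -> int:
    """Placeholder mapper: a real one would copy rows in [start, end)."""
    start, end = bounds
    return end - start


slices = split_range(0, 10, 3)
# One worker per slice, standing in for one mapper per slice.
with ThreadPoolExecutor(max_workers=len(slices)) as pool:
    transferred = sum(pool.map(transfer_slice, slices))
```

Adding more slices and more workers raises throughput only as long as the source and target can keep up, which is the resource trade-off described above.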
As far as the load on the Cassandra cluster, I am not certain since I have not used the Cassandra connector with sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could for example do it once a day at a certain time where the traffic is lowest, so that in case your Cassandra availability drops the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
As far as I could find, there is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different approaches, such as the COPY command and plain CQL queries, and all of them time out regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a 'where' clause. This is a big restriction for me, and it may be one of the main reasons not to use Cassandra, at least in my case.

Database for Large number of 1kB data chunks (MySQL?)

I have a very large dataset, each item in the dataset being roughly 1kB in size. The data needs to be queried rapidly by many applications distributed over a network. The dataset has more than a million items (so 500 million+ 1kB data chunks).
What would be the best method for storing this dataset (I need to allow adding more items and reading them rapidly, but never modifying already added data)? Would a MySQL DB using the binary blob format be appropriate?
Or should each of these be stored as files on a file system?
edit: the number is 1 million items now, but needs to be able to scale to well over 500 million items easily.
Since there is no need to index anything inside the object, I would say a filesystem is probably your best bet, not a relational database. There's only a unique ID and a blob, so there really isn't any structure here and no value in putting it in a database.
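If the items are stored as plain files, one common way to keep directories small is to derive the path from a hash of the ID. The sketch below is one such layout under assumed names (`blob_path`, the root directory, and the two-level fan-out are illustrative choices, not something the question specifies).

```python
import hashlib
from pathlib import PurePosixPath


def blob_path(root: str, item_id: str) -> PurePosixPath:
    """Map an item ID to <root>/<2 hex chars>/<2 hex chars>/<item_id>."""
    digest = hashlib.sha1(item_id.encode("utf-8")).hexdigest()
    # Two levels of 256 directories each give 65,536 leaf directories.
    return PurePosixPath(root) / digest[:2] / digest[2:4] / item_id


path = blob_path("/data/blobs", "item-42")
```

With 65,536 leaf directories, 500 million items come to roughly 7,600 files per directory, and the path for any ID can be computed without a lookup table.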
You could use a web server to provide access to the repository. And then a caching solution like nginx w/memcache to keep it all in memory and scale out using load balancing.
And if you run into further performance issues, you can remove the filesystem and roll your own like Facebook did with their photos system. This can reduce the unnecessary IO operations for pulling unneeded meta-data from the file system like security information.
If you need to retrieve saved data, then storing it in files is certainly not a good idea.
MySQL is a good choice, but make sure you have the right indexes set.
Regarding the binary blob: it depends on what you plan to store. Give us more details.
That's one GB of data. What are you going to use the database for?
That's definitely just a file; read it into RAM when starting up.
Scaling to 500 million is easy. That just takes some more machines.
Depending on the precise application characteristics, you might be able to normalize or compress the data in RAM.
You might be able to keep things on disk and use a database, but that seriously limits your scalability in terms of simultaneous access. You get 50 disk accesses/sec from a disk, so just count how many disks you need.
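The disk count follows directly from that figure. In the sketch below, the 50 accesses per second per disk comes from the answer above, while the `disks_needed` helper and the target read rate are assumptions for illustration; caching and sequential reads would change the picture.

```python
import math


def disks_needed(reads_per_sec: float, accesses_per_disk: float = 50) -> int:
    """Number of disks required if every read costs one disk access."""
    return math.ceil(reads_per_sec / accesses_per_disk)


# A hypothetical load of 2,000 uncached reads per second needs 40 disks.
disks = disks_needed(2000)
```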