Kafka partitions meaning - configuration

When we decide on partitions, should we do that per topic, or is it a decision that applies across all topics?
If T1 is partitioned into 3 partitions,
and T2 is partitioned into 2 partitions,
can they both be consumed by 1 consumer?
Or is it better to make the number of partitions equal if the topics must be consumed by 1 consumer?
I ask because a high-level consumer can be created by passing topics and a partition number.
So I wonder: should I pass to that constructor only topics with an equal partition number?

When we create a high-level consumer, we pass not a partition number but the intended number of consuming threads (streams).
The answer is yes, they can be consumed by 1 consumer
(if that consumer is subscribed to both topics).
The consumer just opens N streams, the intended number of consuming threads (you pass that as a parameter!).
If N < P (the total number of partitions across all topics), then some streams will collect data from several partitions. If N > P, some streams will sit in a non-busy wait.
It is desirable to have P = N, but it is even better to have N > P, because if new partitions appear tomorrow you will already be ready for the greater load.
I've done some research on that and wrote it up in a blog entry.
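For illustration only, here is a minimal sketch using the modern kafka-python client rather than the old high-level consumer discussed above; the topic names, broker address and group id are assumptions. A single consumer in a consumer group gets assigned all partitions of both topics automatically, regardless of their partition counts:

from kafka import KafkaConsumer  # pip install kafka-python

# One consumer, two topics with different partition counts (T1: 3, T2: 2).
# The group coordinator assigns all partitions of both topics to this consumer.
consumer = KafkaConsumer(
    "T1", "T2",
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)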

[mysql]: Querying more db data vs loop on large array

I need to compare two scenarios, which can be fulfilled either by fetching:
an additional timestamp column from the MySQL database, or
looping over the resulting array.
Elaborating further:
CASE 1: 144-byte columns + a 4-byte timestamp column for 10K rows, then looping over an array of size 50 (download size: 1,480,000 bytes).
CASE 2: 144-byte columns for 10K rows, looping over an array of size 10,000 (download size: 1,440,000 bytes).
The data download is roughly 40KB more for Case 1, while Case 2 needs roughly 10,000 more loop iterations.
Which of the 2 scenarios would be faster: downloading 40KB more, or 10,000 extra loop iterations?
Your first scenario is by far the best. Here is why.
SQL is designed to extract subsets of rows from tables. It's designed to allow your tables to be many orders of magnitude bigger than your RAM. If you use a query like
SELECT *
FROM mytable
WHERE mytimestamp >= CURDATE() - INTERVAL 1 DAY
  AND mytimestamp < CURDATE()
you will get all the rows with timestamps anytime yesterday, for example. If you put an index on the mytimestamp column, SQL will satisfy this query very quickly and only give the rows you need. And it will satisfy any queries looking for a range of timestamps similarly quickly.
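As a rough, runnable sketch of the trade-off (sqlite3 stands in for MySQL here, and the table layout and filter value are assumptions): let the database filter on an indexed timestamp column versus pulling everything and looping in application code.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (payload TEXT, mytimestamp INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [("x" * 144, i % 7) for i in range(10000)])
conn.execute("CREATE INDEX idx_mytimestamp ON mytable (mytimestamp)")

# Case 1: the database filters on the indexed column; only matching rows travel.
case1 = conn.execute(
    "SELECT payload FROM mytable WHERE mytimestamp = ?", (3,)).fetchall()

# Case 2: pull every row and filter in application code.
case2 = [payload for payload, ts in
         conn.execute("SELECT payload, mytimestamp FROM mytable") if ts == 3]

assert len(case1) == len(case2)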
There are no answers that are true in 100% of situations. For example, I would prefer to do the first query when I use a fast enough network (anything 1Gbps or higher) and data transfer is free. The difference in the example you show is 40,000 bytes, but it's only 2.7% of the total.
On the other hand, if you need to optimize to minimize bandwidth usage, that's a different scenario. Like if you are transmitting data to a satellite over a very slow link. Or you have a lot of competing network traffic (enough to use up all the bandwidth), and saving 2.7% is significant. Or if you are using a cloud vendor that charges for total bytes transferred on the network.
You aren't taking into account the overhead of executing 1000 separate queries. That means 1000x the bytes sent to the database server, as you send queries. That takes some network bandwidth too. Also the database server needs to parse and optimize each query (MySQL does not cache optimization plans as some other RDBMS products do). And then begin executing the query, starting with an index lookup without the context of the previous query result.
"Performance" is a slippery term. You seem to be concerned only with bandwidth use, but there are other ways of measuring performance.
Throughput (bytes per second)
Latency (seconds per response)
Wait time (seconds until the request begins)
All of these can be affected by the total load, i.e. number of concurrent requests. A busy system may have traffic congestion or request queueing.
You can see that this isn't a simple problem.

Parallelism at Kafka Topics or Partitions Level

In order to separate my data based on a key, should I use multiple topics or multiple partitions within the same topic? I'm asking on the basis of overhead, computation, data storage and the load caused on the server.
I would recommend separating (partitioning) your data into multiple partitions within the same topic.
I assume the data logically belongs together (for example a stream of click events).
The advantage of partitioning your data using multiple partitions within the same topic is mainly that all Kafka APIs are implemented to be used like this.
Splitting your data across multiple topics would probably lead to much more code in the producer and consumer implementations.
As @rmetzger suggested, splitting records into multiple topics would increase the complexity at the producer level; however, there are some other factors worth considering.
In Kafka the main unit of parallelism is the number of partitions in a topic, because that determines how many consumer instances you can spawn to keep reading data from the same topic in parallel.
E.g. if you have a separate topic per event type with N partitions, then while consuming you will be able to create N consumer instances, each dedicated to consuming from a specific partition concurrently. But in that case the ordering of the messages is not guaranteed across partitions, i.e. ordering of the messages is lost in the presence of parallel consumption.
On the other hand, keeping the records within the same topic in separate partitions will make this a lot easier to implement and lets you consume messages in order (Kafka only provides a total order over messages within a partition, not between different partitions in a topic). But you will be limited to running only one consumer process in that case.
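As a hedged sketch of the single-topic route (kafka-python; the topic name, broker address and keys below are made up): keying each record sends everything with the same key to the same partition, so per-key ordering is preserved while consumers can still scale out per partition.

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")
events = [("user-1", b"click"), ("user-2", b"view"), ("user-1", b"purchase")]
for key, payload in events:
    # Records sharing a key hash to the same partition, so all "user-1"
    # events stay in order relative to each other.
    producer.send("events", key=key.encode("utf-8"), value=payload)
producer.flush()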

Database design suggestions for a data scraping/warehouse application?

I'm looking into the database design for a data-warehouse kind of project which involves a large number of inserts daily. The data archives will later be used to generate reports. I will have a list of users (for example, a set of 2 million users) for which I need to monitor the daily social networking activities associated with them.
For example, let there be a set of 100 users say U1,U2,...,U100
I need to insert their daily status count into my database.
Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like this:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly, the database should hold archived details for the full set of users.
In a later phase, I envision generating aggregate reports from these data, like total points scored on each day, week, month, etc., and comparing them with older data.
I need to start things from scratch. I am experienced with PHP as a server-side scripting language and with MySQL, but I am unsure about the database side. Since I need to process about a million insertions daily, what all should be taken care of?
How should I design a MySQL database in this regard: which storage engine should be used, and which design patterns should be followed, keeping in mind that the data should later be usable effectively with aggregate functions?
Currently I envision the DB design with one table storing all the user IDs, referenced by a foreign key, and a separate status-count table for each day. Could lots of tables create some overhead?
Does MySQL fit my requirements? 2 million or more DB operations are done every day; how should the server and other factors be considered in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I plan to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) In a later phase, the archived data (collected over past days) is used as a data warehouse and aggregation tasks are to be performed on it.
Comments:
I have read that MyISAM is the best choice for data warehousing projects, and at the same time I have heard that InnoDB excels in many ways. Many have suggested proper tuning to get it done; I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
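A minimal sketch of that single-table design; sqlite3 is used here only so the snippet runs, and on MySQL you would add ENGINE=MyISAM as suggested above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE status_count (
        user_id INTEGER NOT NULL,
        day     DATE    NOT NULL,
        count   INTEGER NOT NULL,
        PRIMARY KEY (user_id, day)   -- the clustering key (User id, Date)
    )""")
# The second unique index suggested above, (Date, User id):
conn.execute("CREATE UNIQUE INDEX ux_day_user ON status_count (day, user_id)")
conn.execute("INSERT INTO status_count VALUES (1, '2013-07-01', 1)")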
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
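To show why this layout keeps later aggregation simple, here is a hedged sketch (sqlite3 as a stand-in; only the columns needed for a weekly roll-up are included, and the week column is an assumption):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar_dimension (
        calendar_dimension_id INTEGER PRIMARY KEY,
        calendar_date TEXT,
        week INTEGER);
    CREATE TABLE daily_status_fact (
        user_dimension_id INTEGER,
        calendar_dimension_id INTEGER,
        daily_status INTEGER);
    INSERT INTO calendar_dimension VALUES (1, '2013-07-01', 27), (2, '2013-07-02', 27);
    INSERT INTO daily_status_fact VALUES (1, 1, 1), (1, 2, 2);
""")
# Total status per user per week, straight off the fact table.
for row in conn.execute("""
        SELECT f.user_dimension_id, c.week, SUM(f.daily_status) AS weekly_status
        FROM daily_status_fact f
        JOIN calendar_dimension c USING (calendar_dimension_id)
        GROUP BY f.user_dimension_id, c.week"""):
    print(row)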
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, mysql can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud based server with 8GB of ram running apache + php + mysql and the inserts aren't really noticeable to the php users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics DB such as Amazon Redshift.
I would create a fact table holding each user's status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, so don't bog that down with comparisons to previous data points. You can do those with the data mart (star schema) after it's loaded; that's what it's for: analysis and aggregation.
If there are a large number of DML operations and you are mostly selecting records from the database, the MyISAM engine would be preferable. InnoDB is mainly used for transaction control and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, the MyISAM engine also tends to work faster than InnoDB; look at which tables or data you need for your reports.
Remember that if you generate reports from a MySQL database by processing millions of rows in PHP, you could run into problems; you may encounter 500 or 501 errors frequently.
So from a report-generation point of view, the MyISAM engine for the required tables will be useful.
You can also store data in multiple tables to reduce overhead; otherwise there is a chance of a DB table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table with DAY, USER_ID, and STATUS_COUNT columns.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
However, I would not compute the delta from the previous day during the insertion of the current day's row. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), not to mention that the delta doesn't seem to serve any practical purpose at the detail level.
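A hedged sketch of that LAG() approach, computing each day's delta from the cumulative totals in the question; sqlite3 is used only so the snippet runs (LAG() needs SQLite 3.25+ or MySQL 8.0+):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status (user_id INTEGER, day TEXT, total INTEGER)")
conn.executemany("INSERT INTO status VALUES (1, ?, ?)", [
    ("2013-06-30", 99), ("2013-07-01", 100), ("2013-07-02", 102),
    ("2013-07-03", 102), ("2013-07-04", 105),
])
rows = conn.execute("""
    SELECT day,
           total - LAG(total) OVER (PARTITION BY user_id ORDER BY day) AS delta
    FROM status
""").fetchall()
print(rows)  # delta is None for the first day, then 1, 2, 0, 3 as in the question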
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about roughly three-quarters of a billion records per year, and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing aggregation processing using Hadoop (I'd recommend Spark over plain old Map/Reduce also; it's far more powerful). This will alleviate any computation burden from your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.

Best database design for a "sensor system"

I'm doing schoolwork, and I have to build a vehicle tracker system. I thought of these three designs. What do you think?
My database schemas
Opinions?
If you always measure and store all parameters within one measuring session, then go for design 1.
Moving the attributes into separate tables only makes sense if the attributes are rarely stored and/or rarely needed.
If you have separate sensors for position and temperature, then go for design 3.
This is most probable, since the position is measured by a GPS tracker while temperature and oil level are measured by the vehicle's sensors, which are separate devices, and the measurements are performed at separate times.
You may even need to add a separate table for each sensor (i.e. if different sensors measure gas and temperature at different times, then make two tables for them).
Moving liquid into a separate table (as in design 2) makes sense if the list of liquids you use is not known at design time (i.e. some third liquid, like hydrogen or helium-3 or whatever they invent next, will be used by the vehicles and you will need to track it without redesigning the database).
This is not a likely scenario, of course.
If you're reading from the sensors at the same time, the second design looks like overkill to me. It would only make sense to keep the information separate if you read it at different times.
I would suggest the first design.
Your application needs to deal with two types of things:
Sensors = which type, where in the engine, and even parameters such as polling frequency and the like.
Reads = individual time-stamped recordings from one (or several?) sensors.
There are a few things to think about:
- How can we abstract the sensor concept? The idea is that we could then identify and deal with sensor instances through their properties, rather than having to know where they are found in the database.
- Is it best to keep all measurements for a given timestamp in a single "Read" record, or to have one record per sensor per read, even if several measurements come in sets?
A quick answer to the last question is that a single read event per record seems more flexible; we'll be able to handle, in the very same fashion, both groups of measurements that are systematically polled at the same time and other measurements that are asynchronous to the former. Even if right now all measurements come at once, the potential for easily adding sensors without changing the database schema, and for handling them in like fashion, is appealing.
Maybe the following design would be closer to what you need:
tblSensors
    SensorId       PK
    Name           clear-text description of the sensor ("Oil Temp.")
    LongName       longer description ("Oil Temperature, Sensor TH-B14 in crankshaft")
    SensorType     enumeration ("TEMP", "PRESSURE", "VELOCITY", ...)
    SensorSubType  enumeration ("OIL", "AIR", ...)
    Location       enumeration ("ENGINE", "GENERAL", "EXHAUST", ...)
    OtherCrit      other criteria which may be used to identify/search for the sensor
tblReads
    ReadId         PK
    DateTime
    SensorId       FK to tblSensors
    Measurement    integer value
    Measurement2   optional extra measurement (maybe to handle, say, all of a GPS sensor read as one "value")
    Measurement3   ... possibly more columns for different types of variables (real-valued?)
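One way to turn that sketch into concrete DDL; this is only a sketch, with column types assumed and sqlite3 used as a neutral stand-in:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblSensors (
        SensorId      INTEGER PRIMARY KEY,
        Name          TEXT,
        LongName      TEXT,
        SensorType    TEXT,   -- 'TEMP', 'PRESSURE', 'VELOCITY', ...
        SensorSubType TEXT,   -- 'OIL', 'AIR', ...
        Location      TEXT    -- 'ENGINE', 'GENERAL', 'EXHAUST', ...
    );
    CREATE TABLE tblReads (
        ReadId       INTEGER PRIMARY KEY,
        DateTime     TEXT,
        SensorId     INTEGER REFERENCES tblSensors (SensorId),
        Measurement  INTEGER,
        Measurement2 INTEGER,  -- optional extra measurement
        Measurement3 REAL      -- further columns for other value types
    );
    CREATE INDEX ix_reads_sensor_time ON tblReads (SensorId, DateTime);
""")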
In addition to the above you'd have a few tables where the "enumerations" for the various types of sensors are defined, and the tie-in to the application logic would be by way of the mnemonic-like "keys" of these enumerations. E.g.:
SELECT S.Name, R.DateTime, R.Measurement
FROM tblReads R
JOIN tblSensors S ON S.SensorId = R.SensorId
WHERE S.SensorType IN ('TEMP', 'PRESSURE')
  AND S.Location = 'ENGINE'
  AND R.DateTime > '04/07/2009'
ORDER BY R.DateTime
This would not prevent you from also referring to the sensors by their id, and from grouping reads onto the same result line. E.g.:
SELECT R1.DateTime,
       R1.Measurement AS OilTemp,
       R2.Measurement AS OilPress,
       R3.Measurement AS MotorRpms
FROM tblReads R1
LEFT OUTER JOIN tblReads R2
       ON R2.DateTime = R1.DateTime AND R2.SensorId = 11
LEFT OUTER JOIN tblReads R3
       ON R3.DateTime = R1.DateTime AND R3.SensorId = 44
WHERE R1.SensorId = 17
  AND R1.DateTime > '04/07/2009' AND R1.DateTime < '04/08/2009'
ORDER BY R3.Measurement DESC   -- sort by motor RPM, fastest first

How do I assess the hash collision probability?

I'm developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary files' names to my application. My application must process each file within a limited period of time, otherwise it is shut down - that's a watchdog-like security measure. Processing files is likely to take long so I need to design the application capable of handling this scenario. If my application gets shut down next time the search system wants to index the same file it will likely give it a different temporary name.
The obvious solution is to provide an intermediate layer between the search system and the backend. It will queue the request to the backend and wait for the result to arrive. If the request times out in the intermediate layer - no problem, the backend will continue working, only the intermediate layer is restarted and it can retrieve the result from the backend when the request is later repeated by the search system.
The problem is how to identify the files. Their names change randomly. I intend to use a hash function like MD5 to hash the file contents. I'm well aware of the birthday paradox and used an estimation from the linked article to compute the probability. If I assume I have no more than 100,000 files, the probability of two files having the same MD5 (128 bit) is about 1.47×10^-29.
Should I care of such collision probability or just assume that equal hash values mean equal file contents?
An equal hash means an equal file, unless someone malicious is messing around with your files and injecting collisions (this could be the case if they are downloaded from the internet). If that is the case, go for a SHA-2-based function.
There are no accidental MD5 collisions; 1.47×10^-29 is a really, really, really small number.
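For reference, the figure in the question comes from the standard birthday approximation p ≈ n^2 / (2 * 2^bits); a quick check (this snippet is an addition, not part of the original answer):

n, bits = 100_000, 128        # number of files and MD5 output size
p = n * n / (2 * 2 ** bits)   # birthday-bound approximation
print(p)                      # ~1.47e-29, the number quoted above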
To overcome the issue of rehashing big files, I would have a 3-phased identity scheme (a sketch follows below):
Filesize alone
Filesize + a hash of 4 × 64K blocks at different positions in the file
A full hash
So if you see a file with a new size, you know for certain you do not have a duplicate. And so on.
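A sketch of that three-phase check in Python; the helper names are made up, SHA-256 is picked arbitrarily, and the sample offsets are one possible choice:

import hashlib
import os

SAMPLE = 64 * 1024  # 64K samples, four of them, as described above

def quick_signature(path):
    # Phase 2: file size plus a hash of 4 x 64K blocks at different positions.
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for frac in (0.0, 0.25, 0.5, 0.75):
            f.seek(int(size * frac))
            h.update(f.read(SAMPLE))
    return size, h.hexdigest()

def full_hash(path):
    # Phase 3: hash the whole file, streamed in 1 MiB chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def same_file(a, b):
    # Cheapest test first: size, then sampled hash, then full hash.
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    if quick_signature(a) != quick_signature(b):
        return False
    return full_hash(a) == full_hash(b)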
Just because the probability is 1/X it does not mean that it won't happen to you until you have X records. It's like the lottery, you're not likely to win, but somebody out there will win.
With the speed and capacity of computers these days (not even talking about security, just reliability) there is really no reason not to use a bigger/better hash function than MD5 for anything critical. Stepping up to SHA-1 should help you sleep better at night, but if you want to be extra cautious then go to SHA-256 and never think about it again.
If performance is truly an issue then use BLAKE2, which is actually faster than MD5 yet supports 256+ bit digests to make collisions less likely, while having the same or better performance. However, while BLAKE2 has been well adopted, it would probably require adding a new dependency to your project.
I think you shouldn't care.
However, you should if you have the notion of two equal files having different real names (not MD5-based ones). For example, in a search system two documents might have exactly the same content but be distinct because they're located in different places.
I came up with a Monte Carlo approach to be able to sleep safely while using UUID for distributed systems that have to serialize without collisions.
from random import randint
from math import log
from collections import Counter

def colltest(exp):
    # Draw random values from a 2**exp space until the first repeat, then
    # report the order of magnitude (log2) of how many uniques came before it.
    uniques = set()
    while True:
        r = randint(0, 2 ** exp)
        if r in uniques:
            return int(log(len(uniques) + 1, 2))
        uniques.add(r)

for k, v in sorted(Counter(colltest(20) for i in range(1000)).items()):
    print(k, "hash orders of magnitude events before collision:", v)
would print something like:
5 hash orders of magnitude events before collision: 1
6 hash orders of magnitude events before collision: 5
7 hash orders of magnitude events before collision: 21
8 hash orders of magnitude events before collision: 91
9 hash orders of magnitude events before collision: 274
10 hash orders of magnitude events before collision: 469
11 hash orders of magnitude events before collision: 138
12 hash orders of magnitude events before collision: 1
I had heard the rule of thumb before: if you need to store about 2**(x/2) keys, use a hashing function with a keyspace of at least 2**x.
Repeated experiments show that, over a population of 1000 runs in a 2**20 space, you sometimes get a collision as early as about 2**(x/4) keys.
For uuid4, which has 122 random bits, that means I can sleep safely while several computers pick random UUIDs until I have about 2**31 items. Peak transactions in the system I am thinking about are roughly 10-20 events per second; I'm assuming an average of 7. That gives me an operating window of roughly 10 years, given that extreme paranoia.
Here's an interactive calculator that lets you estimate probability of collision for any hash size and number of objects - http://everydayinternetstuff.com/2015/04/hash-collision-probability-calculator/