Can suggestion a solution for big, relational data analyzer please?

Can suggestion a solution for big, relational data analyzer please? - mysql

I`m looking for some suggestions on my requirements. Below are the description of my requirements. Feel free to contact me for any details please. Even some suggestions on how I can describe my questions more clearly is also very appreciate:)
Requirements description
I have some data, the format is like below:
router, interface,timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, it is some network raw data. I omit some columns just to make it looks more clear. The volume of data is very big. And it is generating very fast, like 1 billion rows every 5min...
What I want is to do some real time analysis on these data.
For example:
draw a line using the timestamp
select sum(bits) , timestamp from raw_data group by router,interface where interface = 1, router=r1.
find out which 3 src_ip sending the most data for one interface
select sum(bits) from raw_data where router=r1 and interface=2 group by src_ip order by sum(bits) desc limit 3
I have already tried some solutions and each of them is not very suitable for it. For example :
rdbms
MySQL seems fine except a few problems:
the data is too big
I`m having a lot more columns than I described here. To improve my query speed, I have to some index on most of the columns. But i think create index on big table and the index containing too many columns is not very good, right?
openTSDB
OpenTSDB is a good timeseries database. But also not suitable for my requirements.
openTSDB is having problem to solve the TOP N problem. In my requirements "to get top 3 src_ip which sending most data", openTSDB can not resolve this.
Spark
I know that apache spark can be used like RDBMS. It having the feature called spark SQL. I did not try but I guess the performance should not satisfy the real time analysis/query requirement, right? After all, spark is more suitable for offline calculation, right?
Elastic Search
I really give a lot hope on ES when I know this project. But it is not suitable either. Because When you aggregating more than one column, you have to use the so called nested bucket aggregation in elasticsearch. And the result of this aggregation can not be sorted. You have to retrieve all the result and sort by your self. In my case, the result is too much. To sort the result will be very difficult
So.... I`m stuck here. Can anyone give some suggestions please?

I don't see why ES would not be able to achieve your requirements. I think you misunderstood this part
But it is not suitable either. Because When you aggregating more than one column, you have to use the so called nested bucket aggregation in elasticsearch. And the result of this aggregation can not be sorted.
Your first requirement draw a line using the timestamp could be easily achieved with a query/aggregation like this:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 1
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "1m"
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
As for your second requirement find out which 3 src_ip sending the most data for one interface, it can also easily be achieved with a query/aggregation like this one:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 2
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_src_ip": {
"terms": {
"field": "src_ip",
"size": 3,
"order": {
"sum_bits": "desc"
}
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
UPDATE
According to your comment, your second requirement above could change to find the top 3 combination of src_ip/dst_ip. This would be doable with a terms aggregation using a script instead of a term which would build the src/dest combination and provide the sum of bits for each couple, like this:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 2
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_src_ip": {
"terms": {
"script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
"size": 3,
"order": {
"sum_bits": "desc"
}
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
Note that in order to run this last query, you'll need to enable dynamic scripting. Also since you'll have billions of documents, scripting might not be the best solution, but it's worth giving it a try before diving further. One other possible solution would be to add a combination field (src_ip-dst_ip) at indexing time so that you can use it as a field in your terms aggregation without having to resort to scripting.

You can try Axibase Time Series Database which is non-relational but supports SQL queries in addition to rest-like API. Here's a Top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
ATSD Community Edition is free.
Disclosure: I work for Axibase

Related

Elastic Search - Nested aggregation

I would like to form a nested aggregation type query in elastic search. Basically , the nested aggregation is at four levels.
groupId.keyword
---direction
--billingCallType
--durationCallAnswered
example:
"aggregations": {
"avgCallDuration": {
"terms": {
"field": "groupId.keyword",
"size": 10000,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_key": "asc"
}
]
},
"aggregations": {
"call_direction": {
"terms" : {
"field": "direction"
},
"aggregations": {
"call_type" : {
"terms": {
"field": "billingCallType"
},
"aggregations": {
"avg_value": {
"terms": {
"field": "durationCallAnswered"
}
}
}
}
}
}
}
}
}
This is part of a query . While running this , I am getting the error as
"type": "illegal_argument_exception",
"reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [direction] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
Can anyone throw light on this?

Tldr;
As the error state, you are performing an aggregation on a text field, the field direction.
Aggregation are not supported by default on text field, as it is very expensive (cpu and memory wise).
They are 3 solutions to your issue,
Change the mapping from text to keyword (will require re indexing, most efficient way to query the data)
Change the mapping to add to this field fielddata: true (flexible, but not optimised)
Don't do the aggregation on this field :)

Need documentation for *.analysis.windows.net/public/reports/querydata

I am reverse engineering an app that sends queries to
SOMESERVERNAME.analysis.windows.net/public/reports/querydata via an HTTP POST of an JSON-structured query.
Some initial lines of a sample query are at the end of this message.
I can't find any documentation on this anywhere. I don't know if this is some secret API or what. I ultimately would like to just ignore the aggregations altogether and just dump the raw data, which seems to sit in some flat-file type container on the back-end, but without some API documentation I'm stuck with just re-running the super basic handful of queries I've been able to intercept.
Note: this app is an embedded analytics page created with PowerBI, but the only REST API I can find for PowerBI has nothing to do with querying, but just basic object management.
Thanks!
{
"version": "1.0.0",
"queries": [
{
"Query": {
"Commands": [
{
"SemanticQueryDataShapeCommand": {
"Query": {
"Version": 2,
"From": [
{
"Name": "s",
"Entity": "Sheet1"
}
],
"Select": [
{
"Aggregation": {
"Expression": {
"Column": {
"Expression": {
"SourceRef": {
"Source": "s"
}
},
"Property": "Total"
}
},
"Function": 0
},
"Name": "Sum(Sheet1.Total)"
}
],
"Where": [
{
"Condition": {
"In": {
"Expressions": [
{
"Column": {
"Expression": {
"SourceRef": {
"Source": "s"
}
},
"Property": "Year"
}
}
],
"Values": [
[
{
"Literal": {
"Value": "'2018'"
}
}
]
]
}
}
},
............

I have built a client that scrapes data off a specific Power BI report using the same API, but probably you'll be able to adapt it to your use case. Maybe we can even abstract the code into a more generalized Power BI client!
Having tinkered with the API for two days, I realised that there are many ways the data can be formatted:
"nested"/multidimensional data can be unflattened, flattened by 1 degree, etc.
a primary "table" of a result dataset (in data.PH) can reference others (in data.SH)
The basics are as follows:
A dataset is structured like a multidimensional table, with cells containing values.
In a set of cells, the first always has a field S that contains the schema of its and all subsequent cells.
The schema maps a field of each cell's object with a selection from your query, e.g. the G0 field with the queried column age.
My client seems to work only with a specific type of query (SemanticQueryDataShapeCommand), a specific nr of dimensions and a specific column marked as primary (via Binding.Primary). But maybe that helps! https://github.com/derhuerst/fetch-bvg-occupancy/blob/1ebb864b1ff7130f9d2f0ab031c6d78bcabdd633/lib/parse-dataset.js

The only documented way to use this API is through the ADOMD.NET or OleDb provider.
If you want to send a DAX/MDX query and retrieve data programmatically, there's a sample of how to front-end the service with a simple REST API here.

ElasticSearch - Aggregations: filtering and sorting buckets

Using ElasticSearch 5.2
As an example I'll be using a database representing all people in the world.
Example request: Get the average salary of a person, grouped by country, for everyone older than 30 years old. (Response should contain the top 10 countries with their average salary)
Steps i took to build my aggregation query:
Filter the raw dataset (filter: age > 30)
Aggregate on 'country'
Use metric Avg on field 'salary'
The problem that I'm facing is that I can either apply a filter before aggregating, or sort the buckets, but it seems like ES does not allow me to do both?
In other words: I cannot apply my "order" query on "myInnerAggregation" because it's preceded by "myFilterAggregation"
GET myIndex/_search
{
"size":0,
"aggs": {
"myAggregation": {
"terms": {
"size":10,
"field": "country",
"order": [
{
"myInnerAggregation": "desc"
// I cannot specify "myInnerAggregation" here, only
// "myFilterAggregation" (because it's out of scope I guess)
}
]
},
"aggs": {
"myFilterAggregation": {
"filter": { /* age > 30 (filter syntax is not the problem)*/},
"aggs": {
"myInnerAggergation": {
"avg": {
"field": "salary"
}
}
}
}
}
}
}
}
I've seen that ES 6.1+ has support for "bucket_sort", which would probably solve this problem, but I cannot believe a simple aggregation like this cannot be handled by ES 5.2?

This should do the trick:
"order": [
{
"MyFilterAggregation>MyInnerAggregation": "desc"
}
]

Try moving the filter to the query part of the request. You don't have to declare your filter as an aggregation, aggregations respect filters defined in the query part of your request. Hopefully this way you can get around your issue.

Possible to aggregate and filter on same key?

I'm not very experienced with Elastic but is it possible to aggregate and filter on the same key?
Say I want to be able to filter on the city and aggregate on the department ID for counts of a certain department in that city, but also be able to filter on that as well. Think of it as a checkbox for the city, then children checkboxes under city which can also be filtered on, or 'checked'.
This may be a dumb question but is there anyway to do this? I know it is invalid JSON due to the same key (department.id). Would the pipeline aggregation be something viable to use?
The top query would be a "match_all" query.
Aggregations such as:
"aggregations": {
"department.id": {
"terms": {
"field": "department.id",
"size": 10
}
},
"department.id": {
"filter": {
"bool": {
"filter": {
"terms": {
"city": ["chicago"]
}
}
}
}
}
}

Identifying Duplicates in CouchDB

I'm new to CouchDB and document-oriented databases in general.
I've been playing around with CouchDB, and was able to get familiar with creating documents (with perl) and using the Map/Reduce functions in Futon to query the data and create views.
One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.
For example, if I have the following documents:
{
"_id": "123",
"name": "carl",
"timestamp": "2012-01-27T17:06:03Z"
}
{
"_id": "124",
"name": "carl",
"timestamp": "2012-01-27T17:07:03Z"
}
And I wanted to get a list of document id's that had duplicate "name" values, is this something I could do with the Futon Map/Reduce?
The result was hoping to achieve is as follows:
{
"name": "carl",
"dupes": [ "123", "124" ]
}
..or..
{
"carl": [ "123", "124" ]
}
.. which would be the value, and associated document ids which contain those duplicate values.
I've tried a few different things with Map/Reduce, but so far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.
I know i could just pull the data I need with perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand it's benefits / limitations.
Another way I'm thinking about doing this is to use a single document like an RDBMS table:
{
"_id": "names",
"rec1": {
"_id": "123",
"name": "carl",
"timestamp": "2012-01-27T17:06:03Z"
},
"rec2": {
"_id": "124",
"name": "carl",
"timestamp": "2012-01-27T17:07:03Z"
}
}
.. which should allow me to use the Map/Reduce functions in the way I originally thought. However I'm not sure if this is ideal.
I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.
Thanks!
Edit: Fixed JSON syntax in some of the examples.

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.
In both cases, a map function like this should suffice:
function (doc) {
emit(doc.name);
}
For your reduce function, just enter _count.
Your view output will look like: (based on your 2 documents)
{
"rows": [
{ "key": "carl", "value": 2 }
]
}
From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.
function (head, req) {
var row, duplicates = [];
while (row = getRow()) {
if (row.value > 1) {
duplicates.push(row);
}
}
send(JSON.stringify(duplicates));
}
Read up about _list functions, they're pretty handy and versatile.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008