I have a Cassandra DB with data that has a TTL of X hours for every column value, and this needs to be pushed to an Elasticsearch cluster in real time.
I have seen past posts on StackOverflow that advise using tools such as Logstash or pushing data directly from the application layer.
However, how can one preserve the TTL of the imported data once it is copied into ES version >= 5.0?
There was once a field called _ttl, which was deprecated in ES 2.0 and removed in ES 5.0.
As of ES 5, there are two main ways of preserving the TTL of your data. First, make sure to create a TTL field in your ES documents that is set to the creation date of your row in Cassandra plus the TTL in seconds. So if in Cassandra you have a record like this:
INSERT INTO keyspace.table (userid, creation_date, name)
VALUES (3715e600-2eb0-11e2-81c1-0800200c9a66, '2017-05-24', 'Mary')
USING TTL 86400;
Then you should export the following document to ES:
{
  "userid": "3715e600-2eb0-11e2-81c1-0800200c9a66",
  "name": "Mary",
  "creation_date": "2017-05-24T00:00:00.000Z",
  "ttl_date": "2017-05-25T00:00:00.000Z"
}
Then you can either:
A. Use a cron job that will regularly perform a delete-by-query based on your ttl_date field, i.e. call the following command from your cron:
curl -XPOST 'localhost:9200/your_index/_delete_by_query' -H 'Content-Type: application/json' -d '{
  "query": {
    "range": {
      "ttl_date": {
        "lt": "now"
      }
    }
  }
}'
B. Or use time-based indices and insert each document into an index matching its ttl_date field. For instance, the document above would be inserted into the index named your_index-2017-05-25. Then, with the Curator tool, you can easily delete indices that have expired.
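If you push the data from the application layer, the ttl_date (and, for option B, the daily index name) can be computed as each row is exported. Below is a minimal sketch in Python, assuming the cassandra-driver and elasticsearch client libraries; the keyspace, table, column and index names are placeholders, not something prescribed by Cassandra or ES:

from datetime import datetime, timedelta
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

session = Cluster(["127.0.0.1"]).connect()
es = Elasticsearch(["localhost:9200"])

# TTL(name) returns the remaining TTL in seconds for that column (None if no TTL is set).
rows = session.execute("SELECT userid, name, creation_date, TTL(name) AS remaining_ttl FROM keyspace.table")
for row in rows:
    ttl_date = datetime.utcnow() + timedelta(seconds=row.remaining_ttl or 0)
    doc = {
        "userid": str(row.userid),
        "name": row.name,
        "creation_date": str(row.creation_date),
        "ttl_date": ttl_date.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    }
    # Option A: write to a single index and let the cron job above delete expired docs.
    # Option B: derive a daily index from the expiry date so whole indices can be dropped.
    index_name = "your_index-%s" % ttl_date.strftime("%Y-%m-%d")
    es.index(index=index_name, doc_type="doc", id=doc["userid"], body=doc)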
[
  {
    "link": "https://twitter.com/GreenAddress/status/550793651186855937",
    "pDate": "2015 01 1",
    "title": "GreenAddress",
    "description": "btcarchitect coinkite blockchain circlebits coinbase bitgo some maybe some are oracle cosigners which require lesszero trust"
  },
  {
    "link": "https://twitter.com/Bit_Swift/status/550765718581411840",
    "pDate": "2015 01 1",
    "title": "Bitswift™",
    "description": "swiftstealth offers you privacy in bitswift v2 swiftstealth enables stealth address use on the bitswift blockchain swift"
  },
  {
    "link": "https://twitter.com/allenday/status/550741133500772352",
    "pDate": "2015 01 1",
    "title": "Allen Day, PhD",
    "description": "all in one article bitcoin blockchain 3dprinting drones and deeplearninghttp simondlr compost101071618938adecentralizedaivia simondlr"
  }
]
My test.json file looks like the above, and my MySQL DB table is here.
I can load a text file in CSV format, but I have no idea how to load a JSON text file into MySQL.
I tried CREATE TABLE test (data JSON); and INSERT INTO test VALUES ('{json type}');. When loading CSV data, LOAD DATA INFILE 'test.txt' made it possible, so I wonder whether JSON has the same functionality.
Thanks for any advice.
MySQL does have a JSON data type. However, it will not work with your file and current table structure as-is, since each row's field must contain a JSON value. Loading your data will require a little bit of programming work. Depending on your current ability, you will need to write code that does the following:
Open a database connection
Read the JSON and loop through each value
Store each value using the following INSERT query:
INSERT INTO news(link, date, title, description) VALUES($link, $pDate, $title, $description);
Depending on your language and database driver, close the database connection. A minimal example of these steps is sketched below.
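For example, a rough sketch of those steps in Python, assuming a news table with plain text/VARCHAR columns named link, pDate, title and description (adjust to your actual column names), the mysql-connector-python driver, and placeholder connection credentials:

import json
import mysql.connector

# 1. Open a database connection (credentials are placeholders).
conn = mysql.connector.connect(host="localhost", user="root", password="secret", database="mydb")
cursor = conn.cursor()

# 2. Read the JSON file and loop through each value.
with open("test.json", encoding="utf-8") as f:
    rows = json.load(f)  # the file contains a JSON array of objects

# 3. Store each value with a parameterized INSERT query (avoids quoting issues).
sql = "INSERT INTO news (link, pDate, title, description) VALUES (%s, %s, %s, %s)"
for row in rows:
    cursor.execute(sql, (row["link"], row["pDate"], row["title"], row["description"]))

# 4. Commit and close the database connection.
conn.commit()
cursor.close()
conn.close()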
I am querying data from an MS SQL Server and saving it into CSVs. With those CSVs,
I am modelling the data in Neo4j. But the MS SQL database updates regularly, so I also want to update the Neo4j data on a regular basis. Neo4j has two types of nodes: 1. X and 2. Y. The query and indexes below are used to model the data:
CREATE INDEX ON :X(X_Number, X_Description,X_Type)
CREATE INDEX ON :Y(Y_Number, Y_Name)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {
  X_Number: line.X_num,
  X_Description: line.X_txt,
  X_Type: line.X_Type
})
MERGE (y:Y {
  Y_Number: line.Y_number,
  Y_Name: line.Y_name
})
MERGE (y)-[:delivered_by]->(x)
Now there are two kinds of updates:
There may be new X and Y nodes, which can be taken care of by the MERGE command.
But there can also be modifications to the properties of already created X and Y nodes. For example, a node X {X_Number: 1, X_Description: "abc", X_Type: "z"} might, in the updated data, have its properties changed to X {X_Number: 1, X_Description: "def", X_Type: "y"}.
So I don't want to create a new node for X_Number: 1, but just want to update the existing node's properties such as X_Description and X_Type.
You could just rewrite your query to support both new nodes and changes to existing nodes by merging only on the X_Number or Y_Number properties.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {X_Number: line.X_num})
SET x.X_Description = line.X_txt, x.X_Type = line.X_Type
MERGE (y:Y {Y_Number: line.Y_number})
SET y.Y_Name = line.Y_name
MERGE (y)-[:delivered_by]->(x)
This way the MERGE statements will always match the existing X and Y nodes based on the X_Number and Y_Number properties, which presumably are immutable. The existing X_Description, X_Type and Y_Name properties will then be updated with the newer values.
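Since the MS SQL data changes regularly, you could re-run this Cypher from a small script on a schedule (e.g. cron or Task Scheduler) after each CSV export. A minimal sketch, assuming the official neo4j Python driver and placeholder connection details:

from neo4j import GraphDatabase

REFRESH_QUERY = """
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {X_Number: line.X_num})
SET x.X_Description = line.X_txt, x.X_Type = line.X_Type
MERGE (y:Y {Y_Number: line.Y_number})
SET y.Y_Name = line.Y_name
MERGE (y)-[:delivered_by]->(x)
"""

# Connection details are placeholders; LOAD CSV reads from Neo4j's import directory.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # session.run uses an auto-commit transaction, which USING PERIODIC COMMIT requires.
    session.run(REFRESH_QUERY).consume()
driver.close()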
I would like to know how to correctly format JSON cache rules for MaxScale. I need to store multiple tables from multiple databases for multiple users; how do I format this correctly?
Here, I can store one table from one database and use it for one user:
{
  "store": [
    {
      "attribute": "table",
      "op": "=",
      "value": "databse_alpha.xhtml_content"
    }
  ],
  "use": [
    {
      "attribute": "user",
      "op": "=",
      "value": "'user_databse_1'#'%'"
    }
  ]
}
I need to create rules to store multiple tables for multiple users, for example table1 and table2 being accessed by user1, table3 and table4 being accessed by user2, and so on.
Thanks.
In MaxScale 2.1 it is only possible to give a single pair of store/use values in the cache filter rules.
I took the liberty of opening a feature request for MaxScale on the MariaDB Jira as it appears this functionality was not yet requested.
I think that as a workaround you should be able to create two cache filters, each with a different set of rules, and then in your service use:
filters = cache1 | cache2
Note that picking out the table using exact match, as in your sample above, implies that the statement has to be parsed, which carries a significant performance cost. You'll get much better performance if no matching is needed or if it is performed using regular expressions.
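As a rough illustration of that workaround (the filter names, rules file paths and service name below are placeholders), each rules file keeps the single store/use format from your question, one pair per filter, and the filters are then chained in the service definition in maxscale.cnf:

[cache1]
type=filter
module=cache
rules=/etc/maxscale.d/cache_rules_user1.json

[cache2]
type=filter
module=cache
rules=/etc/maxscale.d/cache_rules_user2.json

[My-Service]
type=service
# ... existing router/servers settings ...
filters=cache1 | cache2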
I am using AppEngine/BigQuery, and the Timestamp field has stopped parsing.
Here is my Schema:
[
{"name":"RowID","type":"string"},
{"name":"Timestamp","type":"timestamp"},
{"name":"Keyword","type":"string"},
{"name":"Engine","type":"string"},
{"name":"Locale","type":"string"},
{"name":"Geo","type":"string"},
{"name":"Device","type":"string"},
{"name":"Metrics","type":"record", "fields":[
{"name":"GlobalSearchVolume","type":"integer"},
{"name":"CPC","type":"float"},
{"name":"Competition","type":"float"}
]}
]
and here is a JSON row that is being shipped to BQ for this schema:
{
"RowID":"6263121748743343555",
"Timestamp":"2015-01-13T07:04:05.999999999Z",
"Keyword":"buy laptop",
"Engine":"google",
"Locale":"en_us",
"Geo":"",
"Device":"d",
"Metrics":{
"GlobalSearchVolume":3600,
"CPC":7.079999923706055,
"Competition":1
}
}
This data is accepted by BigQuery, but the timestamp is nil (1970-01-01 00:00:00 UTC) as seen here:
I have also tried sending the UNIX timestamp instead, to no avail. Can you see any errors in my schema or input data that would cause the timestamp not to parse?
I had a similar issue, but I was only checking the details in the preview window. When I actually ran queries, the timestamps worked correctly. It often took 24 hours for the preview details to update and show the actual timestamp values.
At http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html, it says
Automatic ID Generation
The index operation can be executed without specifying the id.
In such a case, an id will be generated automatically.
In addition, the op_type will automatically be set to create.
Here is an example (note the POST used instead of PUT):
$ curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{
"user" : "kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}'
So based on my reading, if I run that request twice, it should only index one document, and the second one should return a 409. But when I actually run it (on Elasticsearch 1.3.2), it creates a document every time! What's going on, and how can I get it to index only if the document doesn't already exist, without specifying the document id?
You can't; if you don't specify an id, a new GUID will be generated. The create op_type just means Elasticsearch knows it doesn't need to do an update, since the id is new and unique.
You could checksum your data and use that as the id, but that is a bad idea if the data ever changes.
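For example, a rough sketch of the checksum-as-id idea in Python, using the requests library against a local Elasticsearch; the index/type names and the hashing scheme are illustrative assumptions, not part of the original answer:

import hashlib
import json
import requests

doc = {
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch",
}

# Deterministic id: checksum of the canonical JSON body.
doc_id = hashlib.sha1(json.dumps(doc, sort_keys=True).encode("utf-8")).hexdigest()

# op_type=create makes the request fail with 409 Conflict if the id already exists.
resp = requests.put(
    "http://localhost:9200/twitter/tweet/%s?op_type=create" % doc_id,
    data=json.dumps(doc),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code)  # 201 the first time, 409 for a duplicate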