Non-technical terms on Elasticsearch, Logstash and Kibana - json

I have a question. I know that Logstash lets us read csv/log files and filter them using separators and columns, and that it outputs to elasticsearch so the data can be used by Kibana. However, after writing the conf file, do I need to specify an index pattern by using a command like:
CURL -XPUT 'http://localhost:5601/test' d
I ask because I know that when you have a JSON file, you have to define the mapping etc. Do I need to do this step for csv files and other non-JSON files? Sorry for asking, I just need to clear this up.

When you insert documents into a new elasticsearch index, a mapping is created for you. This may not be a good thing, as it's based on the initial value of each field. Imagine a field that normally contains a string, but the initial document contains an integer - now your mapping is wrong. This is a good case for creating a mapping.
If you insert documents through logstash into an index named logstash-YYYY.MM.dd (the default), logstash will apply its own mapping. It will use any type hints you gave it in grok{}, e.g.:
%{NUMBER:bytes:int}
and it will also make a "raw" (not analyzed) version of each string, which you can access as myField.raw. This may also not be what you want, but you can make your own mapping and provide it as an argument in the elasticsearch{} output stanza.
You can also make templates, which elasticsearch will apply when an index pattern matches the template definition.
So, you only need to create a mapping if you don't like the default behaviors of elasticsearch or logstash.
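As a rough illustration of what "creating a mapping" means, here is a minimal sketch that only builds the JSON body for an explicit mapping. It assumes Elasticsearch 7 or later (where mappings have no type name), and the field names are hypothetical columns from a CSV, not anything from the question.

```python
import json


def build_mapping(field_types):
    """Return an index-creation body that fixes the type of each field."""
    return {
        "mappings": {
            "properties": {
                name: {"type": es_type} for name, es_type in field_types.items()
            }
        }
    }


# Hypothetical CSV columns; "keyword" is an un-analyzed string, "text" is analyzed.
body = build_mapping({"bytes": "integer", "client": "keyword", "message": "text"})
print(json.dumps(body, indent=2))
```

A body like this would be sent with a PUT request when the index is created. Note that Elasticsearch listens on port 9200 by default; 5601 is Kibana's port.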
Hope that helps.

Related

Spark partition projection/pushdown and schema inference with partitioned JSON

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with read(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), spark attempts to read the entire dataset in order to infer the schema.
When I try to figure out my partition paths in advance and pass them with read(paths :_*), spark throws an error that it cannot infer the schema and I need to specify the schema manually. (Note that in this case, unless I specify basePath, spark also loses the columns for type and dt, but that's fine, I can live with that.)
What I'm looking for, I think, is some option that tells spark either to infer the schema from only the relevant partitions, so the partition filter is pushed down, or that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of calling msck or glue to maintain a hive metastore. In addition, the schema changes over time, so it can't be specified in advance - taking advantage of spark JSON schema inference is an explicit goal.
Can anyone help?
Could you read each day you are interested in using schema inference and then union the dataframes using schema merge code like this:
Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema
One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.
Since you know the first partition and the path, first try to read a single JSON file like s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json, then extract the schema from it.
Then run the full read and pass the extracted schema before loading: read.schema(json_schema).json(json_root_path).where(...
The schema must be a StructType to be accepted.
I've found a question that may partially help you: Create dataframe with schema provided as JSON file
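The two ideas in this thread (listing the partition paths up front, and taking the schema from one sample file) can be sketched roughly as below, using PySpark. The helper names are hypothetical, `spark` is assumed to be an existing SparkSession, and because the schema changes over time a single sample file may miss fields, so treat this as a starting point only.

```python
from datetime import date, timedelta


def partition_paths(root, type_value, start, end):
    """List the dt= partition directories for one type, start..end inclusive."""
    days = (end - start).days
    return [
        f"{root.rstrip('/')}/type={type_value}/dt={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days + 1)
    ]


def read_partitions(spark, root, paths, sample_path):
    """Infer the schema from one sample file, then read only the listed paths.

    Assumption: `spark` is a SparkSession supplied by the caller.
    """
    schema = spark.read.json(sample_path).schema  # StructType from one file
    return (
        spark.read
        .schema(schema)             # the schema is set before the load call
        .option("basePath", root)   # lets spark still derive type and dt from the paths
        .json(paths)
    )


paths = partition_paths("s3a://bucket/path", "something", date(2020, 1, 1), date(2020, 1, 3))
print(paths)
```

Only `partition_paths` runs here; `read_partitions` does nothing until it is called with a real session.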

JSON in Wordpress DB: two keys/values for a value/key

I'm trying to understand what kind of json this structure can be:
{s:11:"current_tab";s:7:"content";}
At least I believe that it is json. Can anybody help me understand how I can query and work with this?
I found it in a mysql DB with wordpress, in the wp_postmeta table.
This is not JSON; it is PHP's serialized data representation, produced by serialize() and read back with unserialize(). It is used for storing or passing PHP values without losing their type or structure. You will normally find this type of data in plugin or theme configuration, and it is widely used in WordPress databases.
Let’s say a theme is creating an array for storing color and a path.
In pure PHP, it looks like:
$settings = array(
'color' => 'green',
'path' => 'https://example.com'
);
When that array is stored in the database, it is converted into the serialized representation and looks like:
a:2:{s:5:"color";s:5:"green";s:4:"path";s:19:"https://example.com";}
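The advantage is that the serialized representation is a plain string, so it can be stored in a single database column while keeping the array's types and structure. The drawback is that serialized data cannot safely be changed by a simple search & replace, as you would do with a text editor, because every string is prefixed with its length and that prefix must stay in sync with the value.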
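To make the length prefixes concrete, here is a small sketch in Python (not PHP, and not how WordPress does it internally) that imitates the format for a flat array of strings and shows why a naive search & replace corrupts the data. The function names are hypothetical, and the checker assumes values never contain the sequence `";`.

```python
import re


def php_serialize_str(value):
    """Serialize one string as s:<byte length>:"<value>";"""
    return f's:{len(value.encode("utf-8"))}:"{value}";'


def php_serialize_array(mapping):
    """Serialize a flat dict of strings as a:<count>:{...}."""
    body = "".join(php_serialize_str(k) + php_serialize_str(v) for k, v in mapping.items())
    return f"a:{len(mapping)}:{{{body}}}"


def string_lengths_ok(serialized):
    """Check that every s:N:"..."; entry declares the right byte length."""
    return all(
        int(n) == len(text.encode("utf-8"))
        for n, text in re.findall(r's:(\d+):"(.*?)";', serialized)
    )


original = php_serialize_array({"current_tab": "content"})
edited = original.replace("content", "settings")  # naive text replacement

print(original)
print(string_lengths_ok(original), string_lengths_ok(edited))
```

After the replacement the value is 8 bytes long but is still declared as `s:7`, so PHP would fail to unserialize it. This is why such values are usually unserialized, changed, and serialized again.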
You can find more info about this type of data (and also for the PHP methods used to create and retrieve them) by searching for serialized data in WordPress. Also you can find a detailed example here

Apache NiFi: Changing Date and Time format in csv

I have a csv which contains a column with a date and time. I want to change the format of the date-time column. The first 3 rows of my csv look like the following.
Dater,test1,test2,test3,test4,test5,test6,test7,test8,test9,test10,test11
20011018182036,,,,,166366183,,,,,,
20191018182037,,27,94783564564,,162635463,817038655446,,,0,,
I want to change the csv to look like this.
Dater,test1,test2,test3,test4,test5,test6,test7,test8,test9,test10,test11
2001-10-18-18-20-36,,,,,166366183,,,,,,
2019-10-18-18-20-37,,27,94783564564,,162635463,817038655446,,,0,,
How is this possible?
I tried using the UpdateRecord Processor.
My properties look like this:
But this approach doesn't work, since the data gets routed to failure by the UpdateRecord Processor. Please suggest a method to complete the task.
I was able to accomplish this using the UpdateRecord Processor. The expression language I used is ${field.value:toDate('yyyyMMddHHmmss'):format('yyyy-MM-dd HH:mm:ss')}.
On its own this didn't work, because every time the data was routed to the failure path of the UpdateRecord Processor.
To fix this error I changed the configuration of the CSVRecordSetWriter. The Schema Access Strategy must be changed to Use String Fields from Header. By default it is Use Schema Name Property.
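For checking what the converted file should look like outside NiFi, the same transformation can be sketched in plain Python. This is not a NiFi configuration; it only reproduces the target format from the question (yyyy-MM-dd-HH-mm-ss), and the shortened sample header is an assumption made to keep the example small.

```python
import csv
import io
from datetime import datetime


def reformat_timestamp(value):
    """Turn 20011018182036 into 2001-10-18-18-20-36."""
    return datetime.strptime(value, "%Y%m%d%H%M%S").strftime("%Y-%m-%d-%H-%M-%S")


def reformat_csv(text, column="Dater"):
    """Rewrite one timestamp column of a CSV and leave all other fields as strings."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = reformat_timestamp(row[column])
        writer.writerow(row)
    return out.getvalue()


sample = (
    "Dater,test1,test2\n"
    "20011018182036,,\n"
    "20191018182037,,27\n"
)
print(reformat_csv(sample))
```

Every field other than the timestamp is passed through as a string, which mirrors the idea of treating all header fields as strings in the writer.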
Strategy: use UpdateRecord to manipulate the timestamp value using expression language:
${field.value:toDate():format('ddMMyyyy')}
Flow:
GenerateFlowFile:
UpdateRecord:
Setup reader and writer to inherit schema. Include header line. Leave other properties untouched.
Result:
However this solution might not satisfy you because of a strange problem. When you format the date like that:
${field.value:toDate():format('dd-MM-yyyy')}
ConvertRecord routes to the failure relationship:
Type coercion does not work properly. Maybe it is a bug. I could not find a solution for this problem.

Delete/ignore unwanted elements in json

I want to delete/ignore the elements in the following json record:
{"_scroll_id":"==","timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":6908915,"max_score":null,"hits":[{"_index":"abc_v1","_type":"composite_request_response_v1","_id":"123","_score":1.0,"_source":{"response":{"testResults":{"docsisResults":{"devices":[{"upstreamSection":{"upstreams":[]},"fluxSection":{"fluxInfo":[{}]}}],"events":[]},"mocaResults":{"statuses":[]}}}},"sort":[null,1.0]}]}},
I have the records in the above format and want to delete the highlighted part of each record. Can someone suggest ways to accomplish that? Is there any way to do it using hive/pig/linux/python?
There is the JSON SerDe in Hive, see this: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe
So you can define only the columns that you need in the table definition, put your file in the table location, and then select only the defined columns. Alternatively, you can pre-process/transform your files before loading them using Java + Jackson (a library to serialize or map Java objects to JSON and vice versa). This gives you maximum flexibility, though it is not as simple as using the JSON SerDe.
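Since python was also mentioned in the question, a pre-processing step along those lines could look like the sketch below. The highlighting in the original record is not visible here, so the path that gets removed ("sort" under the first hit) is only a hypothetical stand-in, and `drop_path` is an illustrative helper, not a library function.

```python
import json


def drop_path(node, path):
    """Remove the key at `path` (a list of dict keys / list indexes) in place.

    Does nothing if any part of the path is missing.
    """
    for key in path[:-1]:
        try:
            node = node[key]
        except (KeyError, IndexError, TypeError):
            return
    if isinstance(node, dict):
        node.pop(path[-1], None)


# Shortened version of the record from the question.
record = json.loads(
    '{"hits":{"total":1,"hits":[{"_id":"123","_source":{"response":'
    '{"testResults":{"mocaResults":{"statuses":[]}}}},"sort":[null,1.0]}]}}'
)
drop_path(record, ["hits", "hits", 0, "sort"])
print(json.dumps(record))
```

If each record sits on its own line of the file, the same call can be applied line by line before loading the result into Hive.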

How to combine multiple MySQL databases using D2RQ?

I have four different MySQL databases that I need to convert into Linked Data and then run queries on the aggregated data. I have generated the D2RQ maps separately and then manually copied them together into a single file. I have read up some material on customizing the maps but am finding it hard to do so in my case because:
The ontology classes do not correspond to table names. In fact, most classes are column headers.
When I open the combined mapping in Protege, it generates only 3 classes (ClassMap, Database, and PropertyBridge) and lists all the column headers as instances of these.
If I import this file into my ontology, everything becomes annotation.
Please suggest an efficient way to generate a single graph that is formed by mapping these databases to my ontology.
Here is an example. I am using the EEM ontology to refine the mapping file generated by D2RQ. This is a section from the mapping file:
map:scan_event_scanDate a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanDate;
d2rq:propertyDefinitionLabel "scan_event scanDate";
d2rq:column "scan_event.scanDate";
# Manually added
d2rq:datatype xsd:int;
.
map:scan_event_scanTime a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanTime;
d2rq:propertyDefinitionLabel "scan_event scanTime";
d2rq:column "scan_event.scanTime";
# Manually added
d2rq:datatype xsd:time;
.
The ontology I am interested in has the following:
Data property: eventOccurredAt
Domain: EPCISevent
Range: datetime
Now, how should I modify the mapping file so that the date and time are two different relationships?
I think the best way to generate a single graph of your 4 databases is to convert them one by one to a Jena Model using D2RQ, and then use the Union method to create a global model.
For your D2RQ mapping file, you should carefully read "The mapping language" in the D2RQ documentation; it's not normal to have classes corresponding to columns.
If you give an example of your table structure, I can give you an illustration of a mapping file.
Good luck