Indexing CSV blobs does not work in Azure Search - csv

I have a number of TSV files as Azure blobs that have following as the first four tab-separated columns:
metadata_path, document_url, access_date, content_type
I want to index them as described here: https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs
My request for creating an indexer has the following body:
{
"name" : "webdata",
"dataSourceName" : "webdata",
"targetIndexName" : "webdata",
"schedule" : { "interval" : "PT1H", "startTime" : "2017-01-09T11:00:00Z" },
"parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "metadata_path,document_url,access_date,content_type" , "firstLineContainsHeaders" : true, "delimitedTextDelimiter" : "\t" } },
"fieldMappings" : [ { "sourceFieldName" : "document_url", "targetFieldName" : "id", "mappingFunction" : { "name" : "base64Encode", "parameters" : "useHttpServerUtilityUrlTokenEncode" : false } } }, { "sourceFieldName" : "document_url", "targetFieldName" : "url" }, { "sourceFieldName" : "content_type", "targetFieldName" : "content_type" } ]
}
I am receiving an error:
{
"error": {
"code": "",
"message": "Data source does not contain column 'document_url', which is required because it maps to the document key field 'id' in the index 'webdata'. Ensure that the 'document_url' column is present in the data source, or add a field mapping that maps one of the existing column names to 'id'."
}
}
What do I do wrong?

What do I do wrong?
In your case, you supply the json format is invalid. The following is the request for creating an indexer. Detail info we could refer to this document
{
"name" : "Required for POST, optional for PUT. The name of the indexer",
"description" : "Optional. Anything you want, or null",
"dataSourceName" : "Required. The name of an existing data source",
"targetIndexName" : "Required. The name of an existing index",
"schedule" : { Optional. See Indexing Schedule below. },
"parameters" : { Optional. See Indexing Parameters below. },
"fieldMappings" : { Optional. See Field Mappings below. },
"disabled" : Optional boolean value indicating whether the indexer is disabled. False by default.
}
If we want to create an indexer with Rest API. We need 3 steps to do that. I also do a demo for it.
If Azure search SDK is acceptable, you also could refer to another SO thread.
1.Create datasource.
POST https://[service name].search.windows.net/datasources?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]
{
"name" : "my-blob-datasource",
"type" : "azureblob",
"credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
"container" : { "name" : "my-container", "query" : "<optional, my-folder>" }
}
2.Create an index
{
"name" : "my-target-index",
"fields": [
{ "name": "metadata_path","type": "Edm.String", "key": true, "searchable": true },
{ "name": "document_url", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "access_date", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "content_type", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
]
}
3. Create an indexer.

Below is the request body that works:
{
"name" : "webdata",
"dataSourceName" : "webdata",
"targetIndexName" : "webdata",
"schedule" :
{
"interval" : "PT1H",
"startTime" : "2017-01-09T11:00:00Z"
},
"parameters" :
{
"configuration" :
{
"parsingMode" : "delimitedText",
"delimitedTextHeaders" : "document_url,content_type,link_text" ,
"firstLineContainsHeaders" : true,
"delimitedTextDelimiter" : "\t",
"indexedFileNameExtensions" : ".tsv"
}
},
"fieldMappings" :
[
{
"sourceFieldName" : "document_url",
"targetFieldName" : "id",
"mappingFunction" : {
"name" : "base64Encode",
"parameters" : {
"useHttpServerUtilityUrlTokenEncode" : false
}
}
},
{
"sourceFieldName" : "document_url",
"targetFieldName" : "document_url"
},
{
"sourceFieldName" : "content_type",
"targetFieldName" : "content_type"
},
{
"sourceFieldName" : "link_text",
"targetFieldName" : "link_text"
}
]
}

Related

Creating Multiple QueueConfigurations in CloudFormation

I'm currently trying to write multiple QueueConfigurations into my CloudFormation template. Each is an SQS queue that is triggered when an object is created to a specified prefix. Here's what I have so far:
{
"Resources": {
"S3Bucket": {
"Type" : "AWS::S3::Bucket",
"Properties" :
"BucketName" : { "Ref" : "paramBucketName" },
"LoggingConfiguration" : {
"DestinationBucketName" : "test-bucket",
"LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
},
"NotificationConfiguration" : {
"QueueConfigurations" : [{
"Id" : "1",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder1/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
}],
"QueueConfigurations" : [{
"Id" : "2",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder2/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}]
}
}
}
}
}
}
I've encountered the error saying Encountered unsupported property Id. I thought that by defining the ID, I would be able to avoid the Duplicate object key error.
Does anyone know how to create multiple triggers in a single CloudFormation template? Thanks for the help in advance.
It should be structured like the below, There should only be one QueueConfigurations attribute
that contains all queue configurations within it. Also the Id parameter is not a valid property.
{
"Resources": {
"S3Bucket": {
"Type" : "AWS::S3::Bucket",
"Properties" :
"BucketName" : { "Ref" : "paramBucketName" },
"LoggingConfiguration" : {
"DestinationBucketName" : "test-bucket",
"LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
},
"NotificationConfiguration" : {
"QueueConfigurations" : [{
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder1/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
},
{
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder2/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}]
}
}
}
}
}
}
There is more information about QueueConfiguration in the documentation.

Manipulating JSON messages from Kafka topic using Logstash filter

I am using Logstash 2.4 to read JSON messages from a Kafka topic and send them to an Elasticsearch Index.
The JSON format is as below --
{
"schema":
{
"type": "struct",
"fields": [
{
"type":"string",
"optional":false,
"field":"reloadID"
},
{
"type":"string",
"optional":false,
"field":"externalAccountID"
},
{
"type":"int64",
"optional":false,
"name":"org.apache.kafka.connect.data.Timestamp",
"version":1,
"field":"reloadDate"
},
{
"type":"int32",
"optional":false,
"field":"reloadAmount"
},
{
"type":"string",
"optional":true,
"field":"reloadChannel"
}
],
"optional":false,
"name":"reload"
},
"payload":
{
"reloadID":"328424295",
"externalAccountID":"9831200013",
"reloadDate":1446242463000,
"reloadAmount":240,
"reloadChannel":"C1"
}
}
Without any filter in my config file, the target documents from the ES index look like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfcyTU4SyCFNFP2z5-l",
"_score" : 1.0,
"_source" : {
"schema" : {
"type" : "struct",
"fields" : [ {
"type" : "string",
"optional" : false,
"field" : "reloadID"
}, {
"type" : "string",
"optional" : false,
"field" : "externalAccountID"
}, {
"type" : "int64",
"optional" : false,
"name" : "org.apache.kafka.connect.data.Timestamp",
"version" : 1,
"field" : "reloadDate"
}, {
"type" : "int32",
"optional" : false,
"field" : "reloadAmount"
}, {
"type" : "string",
"optional" : true,
"field" : "reloadChannel"
} ],
"optional" : false,
"name" : "reload"
},
"payload" : {
"reloadID" : "155559213",
"externalAccountID" : "9831200014",
"reloadDate" : 1449529746000,
"reloadAmount" : 140,
"reloadChannel" : "C1"
},
"#version" : "1",
"#timestamp" : "2016-10-19T11:56:09.973Z",
}
}
But, I want only the value part of the "payload" field to move to my ES index as the target JSON body. So I tried to use the 'mutate' filter in the config file as below --
input {
kafka {
zk_connect => "zksrv-1:2181,zksrv-2:2181,zksrv-4:2181"
group_id => "logstash"
topic_id => "reload"
consumer_threads => 3
}
}
filter {
mutate {
remove_field => [ "schema","#version","#timestamp" ]
}
}
output {
elasticsearch {
hosts => ["datanode-6:9200","datanode-2:9200"]
index => "kafka_reloads"
}
}
With this filter, the ES documents now look like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfch0yhSyCFNFP2z59f",
"_score" : 1.0,
"_source" : {
"payload" : {
"reloadID" : "850846698",
"externalAccountID" : "9831200013",
"reloadDate" : 1449356706000,
"reloadAmount" : 30,
"reloadChannel" : "C1"
}
}
}
But actually It should be like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfch0yhSyCFNFP2z59f",
"_score" : 1.0,
"_source" : {
"reloadID" : "850846698",
"externalAccountID" : "9831200013",
"reloadDate" : 1449356706000,
"reloadAmount" : 30,
"reloadChannel" : "C1"
}
}
Is there a way to do this? Can anyone help me on this?
I also tried the below filter --
filter {
json {
source => "payload"
}
}
But that is giving me errors like --
Error parsing json {:source=>"payload", :raw=>{"reloadID"=>"572584696", "externalAccountID"=>"9831200011", "reloadDate"=>1449093851000, "reloadAmount"=>180, "reloadChannel"=>"C1"}, :exception=>java.lang.ClassCastException: org.jruby.RubyHash cannot be cast to org.jruby.RubyIO, :level=>:warn}
Any help will be much appreciated.
Thanks
Gautam Ghosh
You can achieve what you want using the following ruby filter:
ruby {
code => "
event.to_hash.delete_if {|k, v| k != 'payload'}
event.to_hash.update(event['payload'].to_hash)
event.to_hash.delete_if {|k, v| k == 'payload'}
"
}
What it does is:
remove all fields but the payload one
copy all payload inner fields at the root level
delete the payload field itself
You'll end up with what you need.
It's been a while but here there is a valid workaround, hope it would be useful.
json_encode {
source => "json"
target => "json_string"
}
json {
source => "json_string"
}

mongoDB limit the number of entries scanned

I'm trying mongoDB and I need translate this following SQL query.
SELECT * FROM infos_cli
WHERE MATCH(denomination) AGAINST('cafe')
WHERE code_postal LIKE '34%'
My full text index definition:
db.infos_cli.createIndex(
{ "code_postal": 1,
"denomination": "text"
},
{default_language: "french"},
{name: "indexSerch"}
)
And my query in mongoDb:
db.infos_cli.find({code_postal : /34/, $text: {$search: "cafe"}})
But it's not working.
Can anyone explain how I've to do ?
In this case please create separate index for postal field and for text search
db.articles.createIndex({ author : 1}) //postal... in your case
db.articles.createIndex({
"denomination" : "text"
}, {
default_language : "french"
}, {
name : "indexSerch"
})
my query
db.getCollection('articles').find({
$text : {
$search : "coffee",
$language : "french"
}
}).explain()
Result shows that there is TEXT phase and IXSCAN which is desired!
result from explain:
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "testCode.articles",
"indexFilterSet" : false,
"parsedQuery" : {
"$text" : {
"$search" : "coffee",
"$language" : "french",
"$caseSensitive" : false,
"$diacriticSensitive" : false
}
},
"winningPlan" : {
"stage" : "TEXT",
"indexPrefix" : {},
"indexName" : "denomination_text",
"parsedTextQuery" : {
"terms" : [
"coffe"
],
"negatedTerms" : [],
"phrases" : [],
"negatedPhrases" : []
},
"inputStage" : {
"stage" : "TEXT_MATCH",
"inputStage" : {
"stage" : "TEXT_OR",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"_fts" : "text",
"_ftsx" : 1
},
"indexName" : "denomination_text",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "backward",
"indexBounds" : {}
}
}
}
},
"rejectedPlans" : []
},
"serverInfo" : {
"host" : "gbernas3-lt",
"port" : 27017,
"version" : "3.2.0",
"gitVersion" : "45d947729a0315accb6d4f15a6b06be6d9c19fe7"
},
"ok" : 1
}
any comments welcome!

Custom analyzer appearing in type mapping but not working in Elasticsearch

I'm trying to add a custom analyzer to my index while also mapping that analyzer to a property on a type. Here is my JSON object for doing this:
{ "settings" : {
"analysis" : {
"analyzer" : {
"test_analyzer" : {
"type" : "custom",
"tokenizer": "standard",
"filter" : ["lowercase", "asciifolding"],
"char_filter": ["html_strip"]
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"checkanalyzer" : {
"type" : "string",
"analyzer" : "test_analyzer"
}
}
}
}
}
I know this analyzer works because I've tested it using /wp2/_analyze?analyzer=test_analyzer -d '<p>Testing analyzer.</p>' and also it shows up as the analyzer for the checkanalyzer property when I check /wp2/test/_mapping. However, if I add a document like {"checkanalyzer": "<p>The tags should not show up</p>"}, the HTML tags don't get stripped out when I retrieve the document using the _search endpoint. Am I misunderstanding how the mapping works or is there something wrong with my JSON object? I'm dynamically creating the wp2 index and also the test type when I make this call to Elasticsearch, not sure if that matters.
The html doesn't get removed from the source, it gets removed from the terms generated by that source. You can see this if you use a terms aggregation:
POST /test_index/_search
{
"aggs": {
"checkanalyzer_field_terms": {
"terms": {
"field": "checkanalyzer"
}
}
}
}
{
"took": 77,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"checkanalyzer": "<p>The tags should not show up</p>"
}
}
]
},
"aggregations": {
"checkanalyzer_field_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "not",
"doc_count": 1
},
{
"key": "should",
"doc_count": 1
},
{
"key": "show",
"doc_count": 1
},
{
"key": "tags",
"doc_count": 1
},
{
"key": "the",
"doc_count": 1
},
{
"key": "up",
"doc_count": 1
}
]
}
}
}
Here's some code I used to test it:
http://sense.qbox.io/gist/2971767aa0f5949510fa0669dad6729bbcdf8570
Now if you want to completely strip out the html prior to indexing and storing the content as is, you can use the mapper attachment plugin - in which when you define the mapping, you can categorize the content_type to be "html."
The mapper attachment is useful for many things especially if you are handling multiple document types, but most notably - I believe just using this for the purpose of stripping out the html tags is sufficient enough (which you cannot do with the html_strip char filter).
Just a forewarning though - NONE of the html tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: You cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
You'll need to result in this mapping:
{
"html5-es" : {
"aliases" : { },
"mappings" : {
"document" : {
"properties" : {
"delete" : {
"type" : "boolean"
},
"file" : {
"type" : "attachment",
"fields" : {
"content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"author" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
},
"content_length" : {
"type" : "integer"
},
"language" : {
"type" : "string"
}
}
},
"hash_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"raw_content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "raw"
},
"title" : {
"type" : "string"
}
}
}
},
"settings" : { //insert your own settings here },
"warmers" : { }
}
}
Such that in NEST, I will assemble the content as such:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I have tested this in Sense - results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}

DataTables Uncaught SyntaxError: Unexpected token :

I try to use the DataTables component with data provided by a REST API. Chrome reports the following error Uncaught SyntaxError: Unexpected token : on line 2 (see JSON below) when I use server-side data but it works if I use a text file. The setup is:
$('#table_id')
.dataTable({
"bProcessing": true,
"bServerSide": true,
"sAjaxSource": "http://mylocalhost:8888/_ah/api/realestate/v1/properties/demo",
//"sAjaxSource": "data.txt",
"sAjaxDataProp": "items",
"aoColumns": [{
"mData": "id"
}],
"fnServerData": function (
sUrl,
aoData,
fnCallback,
oSettings) {
oSettings.jqXHR = $
.ajax({
"url": sUrl,
"data": aoData,
"success": fnCallback,
"dataType": "jsonp",
"cache": false
});
}
}
}
The JSON returned by the server or in the data.txt file:
{
"iTotalRecords" : 10,
"iTotalDisplayRecords" : 10,
"sEcho" : "1",
"items" : [ {
"id" : "0"
}, {
"id" : "1"
}, {
"id" : "2"
}, {
"id" : "3"
}, {
"id" : "4"
}, {
"id" : "5"
}, {
"id" : "6"
}, {
"id" : "7"
}, {
"id" : "8"
}, {
"id" : "9"
} ]
}
Changing the sAjaxSource to data.txt works but not when the data comes from the server whereas the data are the same.