stream2es: command not found - json

I'd like to use a script to convert the Enron email dataset first to mbox format and then to a JSON document. After that, the script should automatically import this JSON into Elasticsearch using the stream2es utility. This is where I hit a problem: when I launch the script, everything goes well except the stream2es step, which fails with stream2es: command not found.
I have a folder containing the script, the Enron email folder, and stream2es. I granted the necessary permissions to stream2es, so I think I have everything needed to make the script work.
Here is the script:
#!/bin/sh
#
# Loading enron data into elasticsearch
#
# Prerequisites:
# make sure that stream2es utility is present in the path
# install beautifulsoup4 and lxml:
# sudo easy_install beautifulsoup4
# sudo easy_install lxml
#
# The mailboxes__jsonify_mbox.py and mailboxes__convert_enron_inbox_to_mbox.py are modified
# versions of https://github.com/ptwobrussell/Mining-the-Social-Web/tree/master/python_code
#
#if [ ! -d enron_mail_20110402 ]; then
# echo "Downloading enron file"
# curl -O -L http://www.cs.cmu.edu/~enron/enron_mail_20110402.tgz
# tar -xzf enron_mail_20110402.tgz
#fi
if [ ! -f enron.mbox.json ]; then
echo "Converting enron emails to mbox format"
python mailboxes__convert_enron_inbox_to_mbox.py allen-p > enron.mbox # allen-p is one of the folders within Enron dataset
echo "Converting enron emails to json format"
python mailboxes__jsonify_mbox.py enron.mbox > enron.mbox.json
rm enron.mbox
fi
echo "Indexing enron emails"
es_host="http://localhost:9200"
curl -XDELETE "$es_host/enron"
curl -XPUT "$es_host/enron" -d '{
"settings": {
"index.number_of_replicas": 0,
"index.number_of_shards": 5,
"index.refresh_interval": -1
},
"mappings": {
"email": {
"properties": {
"Bcc": {
"type": "string",
"index": "not_analyzed"
},
"Cc": {
"type": "string",
"index": "not_analyzed"
},
"Content-Transfer-Encoding": {
"type": "string",
"index": "not_analyzed"
},
"Content-Type": {
"type": "string",
"index": "not_analyzed"
},
"Date": {
"type" : "date",
"format" : "EEE, dd MMM YYYY HH:mm:ss Z"
},
"From": {
"type": "string",
"index": "not_analyzed"
},
"Message-ID": {
"type": "string",
"index": "not_analyzed"
},
"Mime-Version": {
"type": "string",
"index": "not_analyzed"
},
"Subject": {
"type": "string"
},
"To": {
"type": "string",
"index": "not_analyzed"
},
"X-FileName": {
"type": "string",
"index": "not_analyzed"
},
"X-Folder": {
"type": "string",
"index": "not_analyzed"
},
"X-From": {
"type": "string",
"index": "not_analyzed"
},
"X-Origin": {
"type": "string",
"index": "not_analyzed"
},
"X-To": {
"type": "string",
"index": "not_analyzed"
},
"X-bcc": {
"type": "string",
"index": "not_analyzed"
},
"X-cc": {
"type": "string",
"index": "not_analyzed"
},
"bytes": {
"type": "long"
},
"offset": {
"type": "long"
},
"parts": {
"dynamic": "true",
"properties": {
"content": {
"type": "string"
},
"contentType": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}'
stream2es stdin --target $es_host/enron/email < enron.mbox.json
Can anyone help me to solve the stream2es command not found problem? Thank you guys.

command not found means that the shell cannot find the stream2es command. The shell only searches the folders listed in $PATH, and the current folder is normally not one of them, so having stream2es next to the script is not enough. You have two options:
Either your script calls ./stream2es (i.e. the stream2es file located in the same folder), or
you move stream2es into a folder that is on your $PATH.
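A minimal sketch of both options follows. It uses a stand-in stub named stream2es in a temporary folder (an assumption for illustration only; the real stream2es file would take its place) to show that the command resolves once it is called by path or once its folder is on $PATH.

```shell
#!/bin/sh
# Illustrative setup only: a stub stands in for the real stream2es file.
workdir=$(mktemp -d)
printf '#!/bin/sh\necho "stub stream2es called"\n' > "$workdir/stream2es"
chmod +x "$workdir/stream2es"
cd "$workdir" || exit 1

# Option 1: call the file by an explicit path; no PATH lookup is involved.
out_by_path=$(./stream2es)

# Option 2: put the folder on PATH so the bare command name resolves.
PATH="$workdir:$PATH"
export PATH
out_by_name=$(stream2es)

echo "$out_by_path"
echo "$out_by_name"
command -v stream2es   # prints the location the shell resolved
```

In the script from the question, this would mean either changing the last line to start with ./stream2es (which works only when the script is run from that folder) or adding a line such as PATH="$PWD:$PATH" near the top.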

Azure ARM Template - Running DSC script without triggering extension install?

I am trying to deploy an Active Directory forest with two DCs. I've managed to deploy the DCs and install the ADDS features on both VMs. The "PDC" has a DSC script that runs and configures the forest; again, that works great. The issue I have is running a second DSC script on the second DC. This script runs the ADDS configuration to promote the VM to a DC and join it to the forest. I've created a nested JSON template that gets called by the main template, but I am hitting this error:
"Multiple VMExtensions per handler not supported for OS type 'Windows'. VMExtension 'PrepareBDC' with handler 'Microsoft.Powershell.DSC' already added or specified in input."
I've spent the last hour or so whizzing around the internet looking for answers, and everyone seems to say the same thing: you can't install the same extension twice. OK, I can see why that would make sense. My question is: can I configure the nested template so that it doesn't try to install the extension and just uses what's already installed on the VM?
Main template snippet:
{
"type": "Microsoft.Compute/virtualMachines/extensions",
"name": "[concat(variables('dc2name'), '/PrepareDC2AD')]",
"apiVersion": "2018-06-01",
"location": "[resourceGroup().location]",
"dependsOn": [
"[resourceId('Microsoft.Compute/virtualMachines', variables('dc2name'))]"
],
"properties": {
"publisher": "Microsoft.Powershell",
"type": "DSC",
"typeHandlerVersion": "2.19",
"autoUpgradeMinorVersion": true,
"settings": {
"ModulesUrl": "[concat(parameters('Artifacts Location'), '/dsc/PrepareADBDC.zip', parameters('Artifacts Location SAS Token'))]",
"ConfigurationFunction": "PrepareADBDC.ps1\\PrepareADBDC",
"Properties": {
"DNSServer": "[variables('dc1ipaddress')]"
}
}
}
},
{
"name": "ConfiguringDC2",
"type": "Microsoft.Resources/deployments",
"apiVersion": "2016-09-01",
"dependsOn": [
"[concat('Microsoft.Compute/virtualMachines/',variables('dc1name'),'/extensions/CreateADForest')]",
"[concat('Microsoft.Compute/virtualMachines/',variables('dc2name'),'/extensions/PrepareDC2AD')]"
],
"properties": {
"mode": "Incremental",
"templateLink": {
"uri": "[concat(parameters('Artifacts Location'), '/nestedtemplates/configureADBDC.json', parameters('Artifacts Location SAS Token'))]",
"contentVersion": "1.0.0.0"
},
"parameters": {
"adBDCVMName": {
"value": "[variables('dc2name')]"
},
"location": {
"value": "[resourceGroup().location]"
},
"adminUsername": {
"value": "[parameters('Administrator User')]"
},
"adminPassword": {
"value": "[parameters('Administrator Password')]"
},
"domainName": {
"value": "[parameters('Domain Name')]"
},
"adBDCConfigurationFunction": {
"value": "ConfigureADBDC.ps1\\ConfigureADBDC"
},
"adBDCConfigurationModulesURL": {
"value": "[concat(parameters('Artifacts Location'), '/dsc/ConfigureADBDC.zip', parameters('Artifacts Location SAS Token'))]"
}
}
}
},
The nested template:
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"adBDCVMName": {
"type": "string"
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]"
},
"adminUsername": {
"type": "string"
},
"adminPassword": {
"type": "securestring"
},
"domainName": {
"type": "string"
},
"adBDCConfigurationFunction": {
"type": "string"
},
"adBDCConfigurationModulesURL": {
"type": "string"
}
},
"resources": [
{
"type": "Microsoft.Compute/virtualMachines/extensions",
"name": "[concat(parameters('adBDCVMName'),'/PrepareBDC')]",
"apiVersion": "2016-03-30",
"location": "[parameters('location')]",
"properties": {
"publisher": "Microsoft.Powershell",
"type": "DSC",
"typeHandlerVersion": "2.21",
"autoUpgradeMinorVersion": true,
"forceUpdateTag": "1.0",
"settings": {
"modulesURL": "[parameters('adBDCConfigurationModulesURL')]",
"wmfVersion": "4.0",
"configurationFunction": "[parameters('adBDCConfigurationFunction')]",
"properties": {
"domainName": "[parameters('domainName')]",
"adminCreds": {
"userName": "[parameters('adminUsername')]",
"password": "privateSettingsRef:adminPassword"
}
}
},
"protectedSettings": {
"items": {
"adminPassword": "[parameters('adminPassword')]"
}
}
}
}
]
}
This error means exactly what it says: you cannot have multiple copies of the same extension on a VM. What you need to do is apply the same extension to the VM again, and all the inputs have to be the same. You can have a look at this example, which does exactly that: the template installs the extension a second time to join the BDC to the domain.
But I don't like that approach. I use PowerShell DSC to wait for the domain to be created and then join the BDC to the domain in one go. You would use this PowerShell DSC snippet:
xWaitForADDomain DscForestWait {
DomainName = $DomainName
DomainUserCredential = $DomainCreds
RetryCount = $RetryCount
RetryIntervalSec = $RetryIntervalSec
}
Here's a complete example

js-beautify config for Arrays of Objects

Is there any way to configure js-beautify to keep the following format:
"structure": [
{
"name": "heading",
"text": "",
"default": "",
"type": "string"
},
{
"name": "flickr-album-id",
"text": "",
"type": "string"
},
]
js-beautify pulls the curly bracket of the first object up into the first line. I know of the keep_array_indentation option, but I don't want to disable the general auto-indentation, because indentation errors won't get fixed otherwise.
The result of js-beautify, which I want to prevent, is:
"structure": [{
"name": "heading",
"text": "",
"default": "",
"type": "string"
},
{
"name": "flickr-album-id",
"text": "",
"type": "string"
},
]

Post a json body with swagger

I would like to POST a JSON body with Swagger, like this:
curl -H "Content-Type: application/json" -X POST -d '{"username":"foobar","password":"xxxxxxxxxxxxxxxxx", "email": "foo@bar.com"}' http://localhost/user/register
Currently, I have this definition:
"/auth/register": {
"post": {
"tags": [
"auth"
],
"summary": "Create a new user account",
"parameters": [
{
"name": "username",
"in": "query",
"description": "The username of the user",
"required": true,
"type": "string"
},
{
"name": "password",
"in": "query",
"description": "The password of the user",
"required": true,
"type": "string",
"format": "password"
},
{
"name": "email",
"in": "query",
"description": "The email of the user",
"required": true,
"type": "string",
"format": "email"
}
],
"responses": {
"201": {
"description": "The user account has been created",
"schema": {
"$ref": "#/definitions/User"
}
},
"default": {
"description": "Unexpected error",
"schema": {
"$ref": "#/definitions/Errors"
}
}
}
}
}
But the data is sent in the URL. Here is the curl command generated by Swagger:
curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' 'http://localhost/user/register?username=foobar&password=password&email=foo%40bar.com'
I understand that the query keyword is wrong here, but I couldn't find a way to POST a JSON body. I tried formData, but it didn't work.
You need to use the body parameter:
"parameters": [
{
"in": "body",
"name": "body",
"description": "Pet object that needs to be added to the store",
"required": false,
"schema": {
"$ref": "#/definitions/Pet"
}
}
],
and #/definitions/Pet is defined as a model:
"Pet": {
"required": [
"name",
"photoUrls"
],
"properties": {
"id": {
"type": "integer",
"format": "int64"
},
"category": {
"$ref": "#/definitions/Category"
},
"name": {
"type": "string",
"example": "doggie"
},
"photoUrls": {
"type": "array",
"xml": {
"name": "photoUrl",
"wrapped": true
},
"items": {
"type": "string"
}
},
"tags": {
"type": "array",
"xml": {
"name": "tag",
"wrapped": true
},
"items": {
"$ref": "#/definitions/Tag"
}
},
"status": {
"type": "string",
"description": "pet status in the store",
"enum": [
"available",
"pending",
"sold"
]
}
},
"xml": {
"name": "Pet"
}
},
Ref: https://github.com/OpenAPITools/openapi-generator/blob/master/modules/openapi-generator/src/test/resources/2_0/petstore.json#L35-L43
OpenAPI/Swagger v2 spec: https://github.com/OAI/OpenAPI-Specification/blob/master/versions/2.0.md#parameter-object
In the OpenAPI v3 spec, the body parameter no longer exists. To define the HTTP payload, use requestBody instead, e.g. https://github.com/OpenAPITools/openapi-generator/blob/master/modules/openapi-generator/src/test/resources/3_0/petstore.json#L39-L41
OpenAPI v3 spec: https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.0.md#requestBodyObject
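Applied to the /auth/register operation from the question, a Swagger v2 body parameter might look like the excerpt below. This is only a sketch: the definition name RegisterRequest and the description text are hypothetical and not part of the original spec, and the other keys of the document (tags, summary, responses and so on) are omitted.

```json
{
  "paths": {
    "/auth/register": {
      "post": {
        "consumes": ["application/json"],
        "parameters": [
          {
            "in": "body",
            "name": "body",
            "description": "The account to create",
            "required": true,
            "schema": {
              "$ref": "#/definitions/RegisterRequest"
            }
          }
        ]
      }
    }
  },
  "definitions": {
    "RegisterRequest": {
      "type": "object",
      "required": ["username", "password", "email"],
      "properties": {
        "username": { "type": "string" },
        "password": { "type": "string", "format": "password" },
        "email": { "type": "string", "format": "email" }
      }
    }
  }
}
```

With a single body parameter in place of the three query parameters, the three fields travel in the request payload instead of the URL.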

How can I index .JSON in elasticsearch

I am just starting with Elasticsearch and I don't know anything about it.
I have the following JSON:
[
{
"label": "Admin Law",
"tags": [
"#admin"
],
"owner": "generalTopicTagText"
},
{
"label": "Judicial review",
"tags": [
"#JR"
],
"owner": "generalTopicTagText"
},
{
"label": "Admiralty/Shipping",
"tags": [
"#shipping"
],
"owner": "generalTopicTagText"
}
]
My mapping is this:
{
"topic_tax": {
"properties": {
"label": {
"type": "string",
"index": "not_analyzed"
},
"tags": {
"type": "string",
"index_name": "tag"
},
"owner": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
I need to put the first JSON file into Elasticsearch, but it does not work.
All I know is that my mapping defines only one of these:
{
"label": "Judicial review",
"tags": [
"#JR"
],
"owner": "generalTopicTagText"
}
So when I try to load all of them with my elasticsearch.init, it does not work.
I really don't know how to declare the mapping JSON so that it accepts the whole file; it feels like I need something like a for loop there.
You have to insert them one JSON document at a time. What you should do instead is use the bulk API of Elasticsearch to insert multiple documents in one request. Check this API doc to see how it works.
To index a single document, you can do something like this:
curl -XPUT 'localhost:9200/es/post/1' -d '{
"text" : "your test message!"
}'
Here is the documentation for indexing JSON with Elasticsearch.
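As a rough sketch of the bulk approach, the three documents from the question can be written as a newline-delimited bulk body, in which every document is preceded by an action line. The index name topics and the ES_URL variable are assumptions made for this example. The type name topic_tax is taken from the question's mapping and only applies to Elasticsearch versions that still use mapping types.

```shell
#!/bin/sh
# Bulk request body: each document is preceded by an action line, and the
# file must end with a newline.
bulk_file="${TMPDIR:-/tmp}/topics.bulk"
cat > "$bulk_file" <<'EOF'
{"index":{}}
{"label":"Admin Law","tags":["#admin"],"owner":"generalTopicTagText"}
{"index":{}}
{"label":"Judicial review","tags":["#JR"],"owner":"generalTopicTagText"}
{"index":{}}
{"label":"Admiralty/Shipping","tags":["#shipping"],"owner":"generalTopicTagText"}
EOF

line_count=$(wc -l < "$bulk_file" | tr -d ' ')
echo "wrote $line_count lines to $bulk_file"

# Send only when ES_URL is set, e.g. ES_URL=http://localhost:9200
# "topics" (index) and "topic_tax" (type) are assumed names.
if [ -n "${ES_URL:-}" ]; then
  curl -s -XPOST "$ES_URL/topics/topic_tax/_bulk" \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary @"$bulk_file"
fi
```

--data-binary is used rather than -d because -d strips newlines, and the bulk format depends on them.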

configure an elasticsearch index with json not taking

I'm using the following JSON to configure Elasticsearch. The goal is to set up the index and the type in one swoop (this is a requirement, since I'm setting up Docker images). This is as far as I've gotten with a configuration that still allows Elasticsearch to start successfully. The problem is that the index isn't created, yet no error is reported. Other forms I've tried prevent the service from starting.
{
"cluster": {
"name": "MyClusterName"
},
"node": {
"name": "MyNodeName"
},
"indices": {
"number_of_shards": 4,
"index.number_of_replicas": 4
},
"index": {
"analysis": {
"analyzer": {
"my_ngram_analyzer": {
"tokenizer": "my_ngram_tokenizer",
"filter": "lowercase"
},
"my_lowercase_whitespace_analyzer": {
"tokenizer": "whitespace",
"filter": "lowercase"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "20"
}
}
},
"index": {
"settings": {
"_id": "indexindexer"
},
"mappings": {
"inventoryIndex": {
"_id": {
"path": "indexName"
},
"_routing": {
"required": true,
"path": "indexName"
},
"properties": {
"indexName": {
"type": "string",
"index": "not_analyzed"
},
"startedOn": {
"type": "date",
"index": "not_analyzed"
},
"deleted": {
"type": "boolean",
"index": "not_analyzed"
},
"deletedOn": {
"type": "date",
"index": "not_analyzed"
},
"archived": {
"type": "boolean",
"index": "not_analyzed"
},
"archivedOn": {
"type": "date",
"index": "not_analyzed"
},
"failure": {
"type": "boolean",
"index": "not_analyzed"
},
"failureOn": {
"type": "date",
"index": "not_analyzed"
}
}
}
}
}
}
}
I may have a workaround using curl in a post-boot script but I would prefer to have the configuration handled in the config file.
Thanks!
It appears that Elasticsearch will not allow all of the configuration to be done in a single yml file. The workaround I've found is to create an index template and place it in the <es-config>/templates/ directory; then, after spinning up the service, I use curl to create the index. The template's index pattern matches the new index, and the index is provisioned according to the template.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html
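A minimal sketch of that workaround follows, assuming an older Elasticsearch release that still loads file-based templates from the config directory. The template name inventory_template, the pattern inventory*, the index name inventory, and the ES_CONFIG and ES_URL variables are all hypothetical. The settings and the two mapped fields are reduced from the question's configuration.

```shell
#!/bin/sh
# Assumed location of the Elasticsearch config directory; override via ES_CONFIG.
es_config="${ES_CONFIG:-${TMPDIR:-/tmp}/es-config}"
mkdir -p "$es_config/templates"

# File-based template: applied to any new index whose name matches "template".
cat > "$es_config/templates/inventory_template.json" <<'EOF'
{
  "inventory_template": {
    "template": "inventory*",
    "settings": {
      "index.number_of_shards": 4,
      "index.number_of_replicas": 4
    },
    "mappings": {
      "inventoryIndex": {
        "properties": {
          "indexName": { "type": "string", "index": "not_analyzed" },
          "startedOn": { "type": "date" }
        }
      }
    }
  }
}
EOF
echo "template written to $es_config/templates/inventory_template.json"

# After the node is up, creating a matching index picks up the template.
if [ -n "${ES_URL:-}" ]; then
  curl -s -XPUT "$ES_URL/inventory"
fi
```

The index name is lowercase because Elasticsearch rejects index names that contain uppercase letters.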