I have a Snakemake pipeline where I get my input/output paths for my file folders from a json file and use the expand function to obtain the paths.
import json

with open('config.json', 'r') as f:
    config = json.load(f)

wildcard = ["1234", "5678"]

rule them_all:
    input:
        expand('config["data_input"]/data_{wc}.tab', wc = wildcard)
    output:
        expand('config["data_output"]/output_{wc}.rda', wc = wildcard)
    shell:
        "Rscript ./my_script.R"
My config.json is
{
    "data_input": "/very/long/path",
    "data_output": "/slightly/different/long/path"
}
While trying to make a dry run, though, I get the following error:
$ snakemake -np
Building DAG of jobs...
MissingInputException in line 12 of /path/to/Snakefile:
Missing input files for rule them_all:
config["data_input"]/data_1234.tab
config["data_input"]/data_5678.tab
The files are there and their path is /very/long/path/data_1234.tab.
This is probably a low-hanging fruit, but what am I doing wrong in the syntax for the expansion? Or is it the way I call the json file?
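expand() treats its first argument as a plain string template. Anything inside the quotation marks, including config["data_input"], is never evaluated as Python, so the dictionary access ends up in the path literally. The lookup has to happen outside the string and be passed to expand() as the value of a wildcard.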
The correct syntax, in this case, would be e.g.
expand('{input_folder}/data_{wc}.tab', wc = wildcard, input_folder = config["data_input"])
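To see why the original pattern fails, here is a minimal sketch in plain Python (no Snakemake required). It assumes that expand() fills placeholders roughly the way str.format does, which is a simplification; the config values and wildcard list are copied from the question.

```python
# Simplified stand-in for what expand() does with one wildcard:
# it fills {placeholders} in a string template and nothing more.
config = {
    "data_input": "/very/long/path",
    "data_output": "/slightly/different/long/path",
}
wildcard = ["1234", "5678"]

# The dictionary access sits inside the quotes, so it is just text.
broken = ['config["data_input"]/data_{wc}.tab'.format(wc=wc) for wc in wildcard]

# The lookup happens in Python first; only its result enters the template.
fixed = [
    "{input_folder}/data_{wc}.tab".format(input_folder=config["data_input"], wc=wc)
    for wc in wildcard
]

print(broken[0])  # config["data_input"]/data_1234.tab
print(fixed[0])   # /very/long/path/data_1234.tab
```

The first list reproduces the paths from the MissingInputException; the second gives the paths that actually exist on disk. The output line of the rule needs the same change with config["data_output"].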
I want to add the contents of a shell script into the body of pkg_preinst_${PN} or pkg_postinst_${PN} function (BitBake recipe of a software package).
For example, let's consider this "PREINST" shell script:
$ cat PREINST
#! /bin/sh
chmod +x /usr/bin/mybin
Executing a simple 'cat' command inside pkg_preinst function doesn't work:
pkg_preinst_${PN}() {
    cat ${S}/path/to/PREINST
}
With this approach, the .spec file generated for the rpm package does not contain what I expected:
%pre
cat /Full/Path/To/Variable/S/path/to/PREINST
As you can see, the %pre section contains only the 'cat' command, not the actual contents of the PREINST file.
Is it possible to include the contents of PREINST file into the generated .spec file in some way?
Thank you in advance!
Finally I solved this issue by prepending this code to the do_package task:
do_package_prepend() {
    PREINST_path = "${S}/${MYMODULE}/PREINST"
    POSTINST_path = "${S}/${MYMODULE}/POSTINST"
    PREINST = open(PREINST_path, "r")
    POSTINST = open(POSTINST_path, "r")
    d.setVar("pkg_preinst", PREINST.read())
    d.setVar("pkg_postinst", POSTINST.read())
}
It sets the "pkg_preinst" and "pkg_postinst" keys in the global 'd' dictionary to the contents of the PREINST and POSTINST files respectively. Now it works! :)
Please advise on how to get the Foreach Loop Container coordinated with the Execute Process Task so that only 2014-06-20 files are unzipped when my package user variable User::datePart = 2014-06-20.
source folder has 4 zip files with 2 different time stamps (sample):
2014-06-20_24632_1403294308_settings_publisher.txt.zip
2014-06-20_24632_1403294309_settings_campaign.txt.zip
2014-06-21_24632_1403294308_settings_publisher.txt
2014-06-21_24632_1403294309_settings_campaign.txt
What I've tried:
package variable user::datePart set to 2014-06-20
foreach loop container:
collection Foreach File Enumerator expressions: FileSpec =@[User::datePart] +"*.txt.zip"
collection Folder: C:\Users\me\Downloads\MarinmultipleZipped Files: *.txt.zip
collection Files: *.txt.zip
collection Retrieve file name: Fully qualified
variable mappings I set User::zippedFile set to 0
inside foreach loop container an Execute Process Task
Task property DelayValidation = True
Process executable = C:\Program Files (x86)\7-Zip\7z.exe,
expressions property Arguments = "e " +@[User::zippedFile]+ " " +"-C:\Users\me\Downloads\test2"
when I run this, it looks like success, but only the first two files are unzipping, and this is regardless of whether timestamp is 2014-06-20 or 2014-06-21 - which is very weird.
try removing the '.txt' part like this:
collection Foreach File Enumerator expressions: FileSpec =@[User::datePart] +"*.zip"
Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into the ES server.
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your file for the bulk protocol, you can use jq.
Assuming the file contains just the documents themselves:
$ echo '{"foo":"bar"}{"baz":"qux"}' |
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
And if the file contains the documents in a top level list they have to be unwrapped first:
$ echo '[{"foo":"bar"},{"baz":"qux"}]' |
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
jq's -c flag makes sure that each document is on a line by itself.
If you want to pipe straight to curl, you'll want to use --data-binary @-, and not just -d, otherwise curl will strip the newlines again.
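The same transformation can be sketched in Python for cases where jq is not at hand. This is only an illustration: the function name to_bulk_lines is made up here, and the index and type names are the placeholder values from the jq example.

```python
import json


def to_bulk_lines(documents, index="myindex", doc_type="mytype"):
    """Return a newline-delimited bulk body for a list of documents."""
    lines = []
    for doc in documents:
        action = {"index": {"_index": index, "_type": doc_type}}
        # separators=(",", ":") gives compact one-line output, like jq -c
        lines.append(json.dumps(action, separators=(",", ":")))
        lines.append(json.dumps(doc, separators=(",", ":")))
    # The bulk body has to end with a newline after the last line.
    return "\n".join(lines) + "\n"


# Same input as the top-level list example above.
docs = json.loads('[{"foo":"bar"},{"baz":"qux"}]')
body = to_bulk_lines(docs)
print(body, end="")
```

The printed body has the same four lines as the jq output and can be written to a file for curl to post with --data-binary.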
You should use Bulk API. Note that you will need to add a header line before each json document.
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
I'm sure someone wants this so I'll make it easy to find.
FYI - This is using Node.js (essentially as a batch script) on the same server as the brand new ES instance. Ran it on 2 files with 4000 items each and it only took about 12 seconds on my shared virtual server. YMMV
var elasticsearch = require('elasticsearch'),
    fs = require('fs'),
    pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
    forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set

var client = new elasticsearch.Client({ // default is fine for me, change as you see fit
    host: 'localhost:9200',
    log: 'trace'
});

for (var i = 0; i < pubs.length; i++) {
    client.create({
        index: "epubs", // name your index
        type: "pub", // describe the data that's getting created
        id: i, // increment ID every iteration - I already sorted mine but not a requirement
        body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
    }, function(error, response) {
        if (error) {
            console.error(error);
            return;
        }
        else {
            console.log(response); // I don't recommend this but I like having my console flooded with stuff. It looks cool. Like I'm compiling a kernel really fast.
        }
    });
}

for (var a = 0; a < forms.length; a++) { // Same stuff here, just slight changes in type and variables
    client.create({
        index: "epubs",
        type: "form",
        id: a,
        body: forms[a]
    }, function(error, response) {
        if (error) {
            console.error(error);
            return;
        }
        else {
            console.log(response);
        }
    });
}
Hope I can help more than just myself with this. Not rocket science but may save someone 10 minutes.
Cheers
jq is a lightweight and flexible command-line JSON processor.
Usage:
cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-
We’re taking the file file.json and piping its contents to jq first with the -c flag to construct compact output. Here’s the nugget: We’re taking advantage of the fact that jq can construct not only one but multiple objects per line of input. For each line, we’re creating the control JSON Elasticsearch needs (with the ID from our original object) and creating a second line that is just our original JSON object (.).
At this point we have our JSON formatted the way Elasticsearch’s bulk API expects it, so we just pipe it to curl which POSTs it to Elasticsearch!
Credit goes to Kevin Marsh
Import no, but you can index the documents by using the ES API.
You can use the index API to load each line (using some kind of code to read the file and make the curl calls), or the bulk API to load them all at once, assuming your data file can be formatted to work with it.
Read more here : ES API
If you are comfortable with shell, a simple script like this might do the trick (not tested):
while IFS= read -r line
do
    curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -d "$line"
done <myfile.json
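The per-line loop can also be sketched in Python with only the standard library. This is a rough sketch, not a tested loader: the function name build_index_requests and the myindex/mytype names in the URL are placeholders, and the requests are only built here, not sent.

```python
import urllib.request


def build_index_requests(lines, base_url):
    """Yield one POST request per non-empty line (one JSON document each)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        yield urllib.request.Request(
            base_url,
            data=line.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


# Placeholder index and type names; replace them with your own.
url = "http://localhost:9200/myindex/mytype/"
# Stand-in for the lines of myfile.json (one document per line).
sample = ['{"foo": "bar"}\n', "\n", '{"baz": "qux"}\n']
requests = list(build_index_requests(sample, url))
print(len(requests))

# To actually send them to a running server:
# for req in requests:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.status)
```

For a real file, pass the open file object instead of the sample list. One request per document is slow for large files, which is where the bulk API comes in.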
Personally, I would probably use Python, with either pyes or the elasticsearch client.
pyes on github
elastic search python client
Stream2es is also very useful for quickly loading data into es and may have a way to simply stream a file in. (I have not tested a file but have used it to load wikipedia doc for es perf testing)
Stream2es is the easiest way IMO.
e.g. assuming a file "some.json" containing a list of JSON documents, one per line:
curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"
You can use esbulk, a fast and simple bulk indexer:
$ esbulk -index myindex file.ldj
Here's an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11s.
Disclaimer: I'm the author.
You can use the Elasticsearch Gatherer Plugin.
The gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in gatherer zip archives which are a special kind of plugins distributable over Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are maintained in a special index.
This plugin is under development.
Milestone 1 - deploy gatherer zips to nodes
Milestone 2 - job specification and execution
Milestone 3 - porting JDBC river to JDBC gatherer
Milestone 4 - gatherer job distribution by load/queue length/node name, cron jobs
Milestone 5 - more gatherers, more content adapters
reference https://github.com/jprante/elasticsearch-gatherer
One way is to create a bash script that does a bulk insert:
curl -XPOST http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true --data-binary @myjsonfile.json
After you run the insert, run this command to get the count:
curl http://127.0.0.1:9200/myindexname/type/_count
The script works well when run manually, but when I schedule it as a cron job it shows:
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "<html>\r\n<head><tit...") at /usr/local/lib/perl5/site_perl/5.14.2/JSON.pm line 171.
script itself:
# REST config variables
$ENV{'PERL_LWP_SSL_VERIFY_NONE'} = 0;
print "test\n";
my $client = REST::Client->new();
$client->addHeader('Authorization', 'Basic YWRtaW46cmFyaXRhbg==');
$client->addHeader('content_type', 'application/json');
$client->addHeader('accept', 'application/json');
$client->setHost('http://10.10.10.10');
$client->setTimeout(1000);
$useragent = $client->getUseragent();
print "test\n";
#Getting racks by pod
$req = '/api/v2/racks?name_like=2t';
#print " rekvest {$req}\n";
$client->request('GET', qq($req));
$racks = from_json($client->responseContent());
$datadump = Dumper (from_json($client->responseContent()));
crontab -l
*/2 * * * * /usr/local/bin/perl /folder/api/2t.pl > /dmitry/api/damnitout 2>&1
Appreciate any suggestion
Thank you,
Dmitry
It is difficult to say what is really happening, but in my experience 99% of issues with running things from crontab stem from differences in environment variables.
A typical way to debug this is to add a block like this at the beginning of your script:
foreach my $key (keys %ENV) {
    print "$key = $ENV{$key}\n";
}
Run it in the console, look at the output, and save it to a log file.
Now repeat the same from crontab and save the output to a log file (you have already done that - this is good).
Check whether the environment variables differ between the two runs and fix any differences. In Perl, the easiest way is probably to alter the environment by changing %ENV. Once all differences are sorted out, there is no reason for this not to work.
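Comparing the two logs by eye gets tedious, so here is a small Python sketch that does the comparison. It assumes each variable was printed on one line as "KEY = value", as in the Perl block above. The helper names and the sample dumps are made up for illustration, and values containing newlines would not be parsed correctly.

```python
def parse_env_dump(text):
    """Parse lines of the form 'KEY = value' into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # ignore lines that are not 'KEY = value'
            env[key] = value
    return env


def diff_env(console, cron):
    """Return {key: (console_value, cron_value)} for every differing variable."""
    keys = set(console) | set(cron)
    return {
        k: (console.get(k), cron.get(k))
        for k in sorted(keys)
        if console.get(k) != cron.get(k)
    }


# Made-up sample dumps; in practice read the two log files instead.
console_log = "PATH = /usr/local/bin:/usr/bin\nLANG = en_US.UTF-8\nHOME = /root\n"
cron_log = "PATH = /usr/bin:/bin\nHOME = /root\n"

differences = diff_env(parse_env_dump(console_log), parse_env_dump(cron_log))
for key, (console_value, cron_value) in differences.items():
    print(f"{key}: console={console_value!r} cron={cron_value!r}")
```

A variable that is missing in one run shows up with None on that side. Those are the candidates to set explicitly through %ENV at the top of the Perl script.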
Good luck!