Partitioning data across hosts in Ansible (access "index" of host in task?) - jinja2

I am trying to use Ansible to do some parallel computation. My data is trivially parallelizable, I just need to split the file across my hosts (EC2 instances). Is there a canonical way to do this?
The next best thing would be to have a counter that increments for each host. Assuming I have already split my data into my number of workers, I would like to be able to say within each worker task:
- copy: src=data/users-{{ host_index }}.csv dest=/mnt/users.csv
Then each worker can process its copy of users.csv with a separate script that is agnostic to which set of users it has. Is there any way to get this counter index?
I am a beginner to Ansible, so I wonder whether I am overlooking a simple module or idiom, either in Ansible or Jinja. Thanks in advance.

It turns out that the ec2_facts module gives me access to a variable called ami_launch_index, which assigns a zero-indexed unique ID to each EC2 instance. Here is the code for copying files with numerical suffixes over to their corresponding EC2 instances:
tasks:
  - name: Gather ec2 facts
    action: ec2_facts
    register: facts

  - name: Share data to nodes
    copy: src=data/websites-{{ facts.ansible_facts.ansible_ec2_ami_launch_index }}.txt dest=/mnt/websites.txt
The copy line produces the following for the src values:
data/websites-1.txt
data/websites-0.txt
data/websites-2.txt
(There is no guarantee that the hosts will iterate in ami_launch_index order)
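If you are not on EC2, or you want an index that does not depend on launch order, a provider-agnostic alternative is to derive the index from the host's position in its inventory group. A minimal sketch, assuming a group named workers (the group name is an assumption, not from the original answer):
- name: Share data to nodes by inventory position
  copy:
    src: "data/websites-{{ groups['workers'].index(inventory_hostname) }}.txt"
    dest: /mnt/websites.txt
The expression is just the list index of inventory_hostname within the workers group, so it stays stable as long as the inventory order does not change.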

Related

Azure Data Factory: filter files to process

I have a pipeline that processes some files, and in some cases "groups" of files, meaning the files should be processed together and are correlated by a timestamp.
Ex.
Timestamp#Customer.csv
Timestamp#Customer_Offices.csv
Timestamp_1#Customer.csv
Timestamp_1#Customer_Offices.csv
...
I have a table with all the scopes and files with their respective file masks, and I have populated a variable at the beginning of the pipeline based on a parameter.
The Get files activity goes to an sFTP location and grabs files from a folder. I then only want to process the "Customer.csv" and "Customer_Offices.csv" files, because the folder location has more file types or scopes to be processed by other pipelines. If I don't filter, the next activities end up processing metadata of files they are not supposed to. In terms of efficiency and performance this is bad, and it is even causing some issues with files being left behind.
I've tried something like
@variables('FilesToSearch').contains(@endswith(item().name, 'do I need this 2nd parm in arrays ?'))
but no luck... :(
Any help will be highly appreciated,
Best regards,
Manuel
The contains function can work directly on a string to find a substring, so you can try an expression like this:
@contains(item().name,'Customer')
and there is no need to create a variable.
Or use the endswith function with this expression:
@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))
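For context, a minimal sketch of where this condition would typically live: a Filter activity placed between the Get files activity and the downstream processing. The activity name and the assumption that "Get files" is a Get Metadata activity returning childItems are mine, not from the original question:
{
  "name": "Filter customer files",
  "type": "Filter",
  "typeProperties": {
    "items": {
      "value": "@activity('Get files').output.childItems",
      "type": "Expression"
    },
    "condition": {
      "value": "@or(endswith(item().name,'Customer.csv'),endswith(item().name,'Customer_Offices.csv'))",
      "type": "Expression"
    }
  }
}
Only items that satisfy the condition are passed on, typically to a ForEach that iterates over the filter's output.value.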

How to work with configuration files in Airflow

In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are listed as a property in each separate DAG, which will obviously become problematic in the future: if the directory name were to change, we'd have to go into each DAG and update this piece of code (possibly even missing one).
I was looking into creating some sort of configuration file which can be parsed by Airflow and used by the various DAGs when a certain property is required, but I cannot seem to find any documentation or guide on how to do this. The most I could find was the documentation on setting up Connection IDs, but that does not meet my use case.
The question of my post: is the above scenario possible, and if so, how?
Thanks in advance.
There are a few ways you can accomplish this based on your setup:
You can use a DagFactory-type approach where you have a function generate DAGs. You can find an example of what that looks like here.
You can store a JSON config as an Airflow Variable, and parse through that to generate a DAG. You can store something like this in Admin -> Variables:
[
  {
    "table": "users",
    "schema": "app_one",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_one_users",
    "redshift_conn_id": "postgres_default"
  },
  {
    "table": "users",
    "schema": "app_two",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_two_users",
    "redshift_conn_id": "postgres_default"
  }
]
Your DAG could get generated as:
import json

from airflow.models import Variable

# Assumes `dag` and the DummyOperator / RedshiftToS3Transfer imports are already
# set up as usual for your Airflow version.
sync_config = json.loads(Variable.get("sync_config"))

with dag:
    start = DummyOperator(task_id='begin_dag')
    # One transfer task per entry in the JSON config stored in the Variable.
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
Similarly, you can just store that config as a local file and open it as you would any other file. Keep in mind the best answer to this will depend on your infrastructure and use case.
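As a minimal sketch of that last option (the path and key names below are assumptions), a shared config file can be loaded at DAG-parse time and reused across DAGs:
import json
from pathlib import Path

# Hypothetical location for settings shared by several DAGs.
CONFIG_PATH = Path("/usr/local/airflow/config/common.json")

def load_common_config():
    """Return the dict of settings shared by several DAGs."""
    with open(CONFIG_PATH) as f:
        return json.load(f)

common = load_common_config()
input_dir = common["input_directory"]  # e.g. the directory every DAG reads files from
If the directory ever changes, only common.json needs to be updated.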

ZooKeeper Multi-Server Setup by Example

The ZooKeeper multi-server config docs show the following configuration that can be placed inside zoo.cfg (ZK's config file) on each server:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
Furthermore, they state that you need a myid file on each ZK node whose content matches the id in the corresponding server.<id> entry above. So, for example, in a 3-node "ensemble" (ZK cluster), the first node's myid file would simply contain the value 1, the second node's myid file would contain 2, and so forth.
I have a few practical questions about what this looks like in the real world:
1. Can localhost be used? If zoo.cfg has to be repeated on each node in the ensemble, is it OK to define the current server as localhost? For example, in a 3-node ensemble, would it be OK for Server #2's zoo.cfg to look like:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=localhost:2888:3888 # After all, we're on server #2!
server.3=zoo3:2888:3888
Or is this not advised/not possible?
2. Do the server ids have to be numerical? For instance, could I have a 5-node ensemble where each server's zoo.cfg looks like:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.red=zoo1:2888:3888
server.green=zoo2:2888:3888
server.blue=zoo3:2888:3888
server.orange=zoo4:2888:3888
server.purple=zoo5:2888:3888
And, say, Server 1's myid would contain the value red inside of it (etc.)?
1. Can localhost be used?
This is a good question, as the ZooKeeper docs don't make it crystal clear whether the configuration file only accepts IP addresses. It says only hostname, which could mean an IP address, a DNS name, or a name from the hosts file, such as localhost.
server.x=[hostname]:nnnnn[:nnnnn], etc
(No Java system property)
servers making up the ZooKeeper ensemble. When the server starts up, it determines which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII, and it should match x in server.x in the left hand side of this setting.
However, note that ZooKeeper recommends using the exact same configuration file on all hosts:
ZooKeeper's behavior is governed by the ZooKeeper configuration file. This file is designed so that the exact same file can be used by all the servers that make up a ZooKeeper server assuming the disk layouts are the same. If servers use different configuration files, care must be taken to ensure that the list of servers in all of the different configuration files match.
So simply put the machine's IP address (or a resolvable hostname) and everything should work. Also, I have personally tested using 0.0.0.0 (in a situation where the interface IP address was different from the public IP address) and it does work.
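For illustration, a minimal sketch of what this looks like per node, using the example configuration above (the hostnames zoo1-zoo3 are assumed to resolve on every machine, and dataDir is /var/zookeeper/ as in the config):
# identical server list in zoo.cfg on all three nodes
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

# only the myid file differs; on the node named zoo2:
echo "2" > /var/zookeeper/myid
Each node then determines its own identity from myid, so the shared server list never needs per-host edits.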
2. Do the server ids have to be numerical?
According to the ZooKeeper multi-server configuration docs, myid needs to be a numerical value from 1 to 255:
The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
Since myid must match the x in the server.x parameter, we can infer that x must be a numerical value as well.

JMeter: How to map specific variable values from a CSV file to specific thread groups in a test plan

I have a test plan with 12 thread groups, each one being one test scenario. I want to use unique login credentials for each thread group, so I've created a CSV file, added a CSV Data Set Config element to each thread group, and selected "All Threads" in "Sharing mode". Whenever I execute the test plan (all thread groups concurrently), the thread groups are not taking variable rows sequentially. I expected that the 1st thread group in the test plan would consume the 1st row of variables in the CSV file, based on the post: JMeter test plan with different parameter for each thread
But that is not happening, and I am unable to understand the pattern of variable allocation. Please help me resolve this issue.
My CSV file looks like below:
userName,password,message
userone,sample123,message1
usertwo,sample123,message2
.
.
so on...
Refer below for the configuration of the CSV Data Set Config element:
Thanks!
Threads and thread groups are different things. When you choose "All Threads" in "Sharing mode", it just means that all threads in the same thread group will share the CSV file. Thread groups are always independent.
You have 2 simple options:
Use one thread group and control what users are doing with controllers. For example, a Throughput Controller can let you control how many threads perform one script scenario or another within the same thread group.
Split your CSV so that each thread group has its own CSV file.
And many more complicated options, for example:
Use the __CSVRead or __StringFromFile function, which lets you read one line at a time. That way you can assign each thread group its own file or range of lines to read, rather than reading the entire file (see the sketch after this list).
If your usernames and passwords are predictable (e.g. user1, user2, etc), you could use a counter and a range for each thread group.
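As a minimal sketch of the __CSVRead option (the file name users_tg1.csv is an assumption; each thread group would point at its own file), the login sampler's fields could look like:
Username: ${__CSVRead(users_tg1.csv,0)}
Password: ${__CSVRead(users_tg1.csv,1)}${__CSVRead(users_tg1.csv,next)}
The trailing __CSVRead(...,next) call advances to the next line of the file after the last column has been read, so successive iterations pick up successive credentials.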

Is there an easy way for cfengine3 to copy different files based on the OS it's running

I have two different versions of Linux/Unix, each running cfengine3. Is it possible to have one promises.cf file I can put on both machines that will copy different files based on what OS is on the client? I have been searching around the internet for a few hours now and have not found anything useful yet.
There are several ways of doing this. At the simplest, you can have different files: promises depending on the operating system, for example:
files:

  ubuntu_10::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.ubuntu_10");

  suse_9::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.suse_9");

  redhat_5::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.redhat_5");

  windows_7::
    "/etc/hosts"
      copy_from => mycopy("$(repository)/etc.hosts.windows_7");
This example can be easily simplified by realizing that the built-in CFEngine variable $(sys.flavor) contains the type and version of the operating system, so we could rewrite this example as follows:
"/etc/hosts"
copy_from => mycopy("$(repository)/etc.$(sys.flavor)");
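Note that mycopy is not a built-in body; the examples above assume you have defined one yourself. A minimal sketch of what such a copy_from body might look like (a plain digest-compared copy; adjust to your needs):
body copy_from mycopy(path)
{
  source  => "$(path)";
  compare => "digest";
}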
A more flexible way to achieve this task is known in CFEngine terminology as "hierarchical copy." In this pattern, you specify an arbitrary list of variables by which you want files to be differentiated, and the order in which they should be considered, from most specific to most general. When the copy promise is executed, the most-specific file found will be copied.
This pattern is very simple to implement:
# Use single copy for all files
body agent control
{
  files_single_copy => { ".*" };
}

bundle agent test
{
  vars:
    "suffixes" slist => { ".$(sys.fqhost)", ".$(sys.uqhost)", ".$(sys.domain)",
                          ".$(sys.flavor)", ".$(sys.ostype)", "" };

  files:
    "/etc/hosts"
      copy_from => local_dcp("$(repository)/etc/hosts$(suffixes)");
}
As you can see, we are defining a list variable called $(suffixes) that contains the criteria by which we want to differentiate the files. All the variables contained in this list are automatically defined by CFEngine, although you could use any arbitrary CFEngine variables. Then we simply include that variable, as a scalar, in our copy_from parameter. Because CFEngine does automatic list expansion, it will try each value in turn, executing the copy promise multiple times (once for each value in the list) and copying the first file that exists. For example, for a Linux SuSE 11 machine called superman.justiceleague.com, the $(suffixes) variable will contain the following values:
{ ".superman.justiceleague.com", ".superman", ".justiceleague.com", ".suse_11",
".linux", "" }
When the file-copy promise is executed, implicit looping will cause these strings to be appended in sequence to "$(repository)/etc/hosts", so the following filenames will be attempted in sequence: hosts.superman.justiceleague.com, hosts.superman, hosts.justiceleague.com, hosts.suse_11, hosts.linux and hosts. The first one to exist will be copied over /etc/hosts on the client, and the rest will be skipped.
For this technique to work, we have to enable "single copy" on all the files you want to process. This is a configuration parameter that tells CFEngine to copy each file at most once, ignoring successive copy operations for the same destination file. The files_single_copy parameter in the agent control body specifies a list of regular expressions to match filenames to which single-copy should apply. By setting it to ".*" we match all filenames.
For hosts that don't match any of the existing files, the last item on the list (an empty string) will cause the generic hosts file to be copied. Note that the dot for each of the filenames is included in $(suffixes), except for the last element.
I hope this helps.
(p.s. and shameless plug: this is taken from my upcoming book, "Learning CFEngine 3", published by O'Reilly)