Zabbix - Utilization of housekeeper processes over 75% - configuration

I'm monitoring about 1,300 hosts with Zabbix. After I applied templates to all hosts in bulk, I got the "Utilization of housekeeper processes over 75%" alarm, and it has not been resolved for about 20 hours. I did not define any housekeeper settings in my server configuration. How can I resolve this alarm, and what is its impact? The database is PostgreSQL.
Server config:
StartPollers=50
StartPollersUnreachable=50
StartPingers=50
StartDiscoverers=50
StartHTTPPollers=50
CacheSize=1024M
HistoryCacheSize=1024M
TrendCacheSize=1024M
ValueCacheSize=1024M
LogSlowQueries=3000
MaxHousekeeperDelete=5000
PostgreSQL config:
max_connections = 1000
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 1048kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 12
max_parallel_workers_per_gather = 4
max_parallel_workers = 12
max_parallel_maintenance_workers = 4
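For reference, the housekeeper itself is controlled by just a couple of zabbix_server.conf parameters; a minimal sketch follows (HousekeepingFrequency is shown at its default value because it is not set in the config above, and MaxHousekeeperDelete repeats the value already used there):
# How often, in hours, the housekeeper wakes up to delete outdated history, trends and events;
# 0 disables internal housekeeping entirely (common when the history tables are partitioned).
HousekeepingFrequency=1
# Maximum number of rows deleted per table in one housekeeper cycle; a higher value clears
# backlog faster at the cost of longer DELETE transactions in PostgreSQL.
MaxHousekeeperDelete=5000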

Retention Fail with Bareos

I am encountering backup disk saturation issues in production.
Directory to back up = 400 GB
Backup disk = 2 TB
Scheduling:
A Full backup every 2 days
An Incremental backup every hour
Retention duration of 2 days
I have reproduced these issues in a prototype.
Directory to back up = 400 MB
Size of the backup directory = 48 GB
Scheduling:
A Full backup every hour
An Incremental backup every 15 minutes
Retention duration of 2 hours
The issue is that after 5 days the prototype backup directory has reached 48 GB (which accurately reproduces the saturation of the 2 TB production backup disk).
My observations are:
The backup volumes remain on the disk (see the file sizes in the backup directory).
It appears that the retention duration defined in the Pool configuration is not being taken into account.
Here is an example of a backup still present in the volumes, i.e. data that should no longer exist.
My tests:
Adding to the storage Pool:
File Retention = 2 hours
Recycle Oldest Volume = yes
Result: no success.
Here is the prototype's configuration file:
Job {
Name = "backup-Client2"
Type = Backup
Level = Incremental
Client = Client2
FileSet = "FileSet_Client2"
Schedule = "Schedule_Client2"
Storage = Storage_Client2
Messages = Standard
Pool = Incremental_Client2
Priority = 1
Write Bootstrap = "/var/lib/bareos/%c.bsr"
Full Backup Pool = Full_Client2
Incremental Backup Pool = Incremental_Client2
}
FileSet {
Name = "FileSet_Client2"
Description = "Sauvegarde de tous les répertoires présents dans /home et /etc"
Include {
Options {
signature = MD5
}
File = "/home"
File = "/DATA"
File = "/etc"
}
}
Schedule {
Name = "Schedule_Client2"
Run = Full hourly
Run = Incremental hourly at 0:15
Run = Incremental hourly at 0:30
Run = Incremental hourly at 0:45
}
Client {
Name = "Client2"
Address = "192.168.1.78"
Password = "*****"
}
Storage {
Name = Storage_Client2
Address = 192.168.1.78
Password = "******"
Device = Client2FileStorage
Media Type = File
}
Pool {
Name = Incremental_Client2
Pool Type = Backup
Recycle = yes
AutoPrune = yes
Action On Purge = Truncate
Volume Retention = 1 hours
File Retention = 1 hours
Recycle Oldest Volume = yes
Label Format = "Incremental_Client2-"
}
Pool {
Name = Full_Client2
Pool Type = Backup
Recycle = yes
AutoPrune = yes
Action On Purge = Truncate
Volume Retention = 2 hours
File Retention = 2 hours
Recycle Oldest Volume = yes
Label Format = "Full_Client2-"
}
Did I forget a parameter?
Thanks to anyone who can give me some leads toward a solution.
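For comparison, here is a minimal sketch of Pool directives that are often needed before volumes actually become eligible for pruning and recycling; the values are illustrative assumptions, not a tested configuration. A volume is generally only pruned once it is marked Full or Used, so bounding its lifetime matters:
Pool {
Name = Incremental_Client2
Pool Type = Backup
Recycle = yes
AutoPrune = yes
Volume Retention = 1 hours
Maximum Volume Jobs = 1          # mark the volume Used after a single job
Volume Use Duration = 1 hours    # or stop appending to it after an hour
Maximum Volumes = 10             # cap how many volumes the pool may create
Label Format = "Incremental_Client2-"
}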

Restarting the smbd daemon without interrupting a transfer on the Windows client

The problem: there is a server (cluster) running SMB, and the server is joined to an AD domain. Sometimes the smbd service has to be restarted (a reload is not enough), but if a Windows client is copying a file at that moment, the transfer is interrupted, and after clicking the "Retry" button the download starts from the very beginning. Is it possible to make the transfer continue from the point where it was interrupted, perhaps by configuring the client accordingly? The client connects over SMBv3 or SMBv2.
The server runs Ubuntu 18.04.
The SMB share is created on ZFS.
smb.conf:
[global]
workgroup = TEST247
realm = test247.ru
security = ads
auth methods = winbind
interfaces = 172.16.11.170/24
bind interfaces only = yes
netbios name = SERVER
encrypt passwords = true
map to guest = Bad User
max log size = 300
dns proxy = no
socket options = TCP_NODELAY
domain master = no
local master = no
preferred master = no
os level = 0
domain logons = no
load printers = no
show add printer wizard = no
log level = 0 vfs:2
max log size = 0
syslog = 0
printcap name = /dev/null
disable spoolss = yes
name resolve order = lmhosts wins host bcast
machine password timeout = 604800
name cache timeout = 660
idmap config TEST247 : backend = rid
idmap config TEST247 : base_rid = 0
idmap config TEST247 : range = 100000 - 200000
idmap config * : range = 200001-300000
idmap config * : backend = tdb
idmap cache time = 604800
idmap negative cache time = 60
winbind rpc only = yes
winbind cache time = 120
winbind enum groups = yes
winbind enum users = yes
winbind max domain connections = 10
winbind use default domain = yes
winbind refresh tickets = yes
winbind reconnect delay = 15
winbind request timeout = 25
winbind separator = ^
private dir = /var/lib/samba/private
lock directory = /run/samba
state directory = /var/lib/samba
cache directory = /var/cache/samba
pid directory = /run/samba
log file = /var/log/samba/smb.%m
include = /etc/samba/smb-res.conf
testparm:
testparm -s /etc/samba/smb.conf
Load smb config files from /etc/samba/smb.conf
WARNING: The "auth methods" option is deprecated
WARNING: The "syslog" option is deprecated
Loaded services file OK.
Server role: ROLE_DOMAIN_MEMBER
smb-res.conf:
[test109_smb]
comment = test109_smb share
path = /config/pool/test109/smb
browseable = yes
writable = yes
inherit acls = yes
inherit owner = no
inherit permissions = yes
map acl inherit = yes
nt acl support = yes
create mask = 0777
force create mode = 0777
force directory mode = 0777
store dos attributes = yes
public = no
admin users =
valid users =
write list =
read list =
invalid users =
vfs objects = acl_xattr
full_audit:prefix = %S|%u|%I
full_audit:facility = local5
full_audit:priority = notice
full_audit:success = none
full_audit:failure = none
shadow: snapdir = .zfs/snapshot
shadow: sort = desc
shadow: localtime = yes
shadow: format = shadow_%d.%m.%Y-%H:%M:%S
worm: grace_period = 30
cryptfile: method = grasshopper
Resuming a copy operation doesn't depend on the SMB client or server but on the application that is doing the copying.
The standard Windows copy doesn't know how to resume.
Other (third-party) apps (Total Commander, perhaps?) can be more intelligent about it. You could even write your own app to do a resumable copy.
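As an illustration of that last point, here is a minimal sketch of a resumable copy in C#; the class name, paths and buffer size are assumptions for illustration, not something from the post. If the destination file already exists, both streams are positioned past the bytes that were copied before the interruption.
using System;
using System.IO;

class ResumableCopy
{
    static void Main(string[] args)
    {
        string source = args[0];      // e.g. a UNC path to the SMB share
        string destination = args[1]; // local target file

        using var src = new FileStream(source, FileMode.Open, FileAccess.Read);
        using var dst = new FileStream(destination, FileMode.OpenOrCreate, FileAccess.Write);

        // Resume where the previous attempt stopped instead of starting over.
        long resumeAt = dst.Length;
        src.Seek(resumeAt, SeekOrigin.Begin);
        dst.Seek(resumeAt, SeekOrigin.Begin);

        var buffer = new byte[1024 * 1024];
        int read;
        while ((read = src.Read(buffer, 0, buffer.Length)) > 0)
        {
            dst.Write(buffer, 0, read); // wrap this loop in retry logic to survive an smbd restart
        }
    }
}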

Couchbase benchmark reveals very slow INSERTs and GETs (using KeyValue operations); slower than persisted MySQL data

I did a small benchmark test to compare Couchbase (running on Windows) with Redis and MySQL (EDIT: added Aerospike to the test).
We are inserting 100,000 JSON "documents" into four databases/stores:
Redis (just insert, there is nothing else)
Couchbase (in-memory Ephemeral buckets, JSON Index on JobId)
MySql (Simple table; Id (int), Data (MediumText), index on Id)
Aerospike (in-memory storage)
The JSON file is 67 lines, about 1800 bytes.
INSERT:
Couchbase: 60-100 seconds (EDIT: seems to vary quite a bit!)
MySql: 30 seconds
Redis: 8 seconds
Aerospike: 71 seconds
READ:
We read 1000 times; we do this 10 times and look at the averages.
Couchbase: 600-700 ms for 1000 GETs (Using KeyValue operations, not Query API. Using Query API, this takes about 1500 ms)
MySql: 90-100 ms for 1000 GETs
Redis: 50-60 ms for 1000 GETs
Aerospike: 750 ms for 1000 GETs
Conclusion:
Couchbase seems slowest (the INSERT times seem to vary a lot), and Aerospike is also very slow. Both use in-memory storage (Couchbase => Ephemeral bucket, Aerospike => storage-engine memory).
Question: Why are the in-memory writes and reads on Couchbase so slow, even slower than regular MySQL (on an SSD)?
CODE
Note: Using Task.WhenAll, or awaiting each call, doesn't make a difference.
INSERT
Couchbase:
IBucket bucket = await cluster.BucketAsync("halo"); // <-- ephemeral
IScope scope = bucket.Scope("myScope");
var collection = scope.Collection("myCollection");
// EDIT: Added this to avoid measuring lazy loading:
JObject t = JObject.FromObject(_baseJsonObject);
t["JobId"] = 0;
t["CustomerName"] = $"{firstnames[rand.Next(0, firstnames.Count - 1)]} {lastnames[rand.Next(0, lastnames.Count - 1)]}";
await collection.InsertAsync("0", t);
await collection.RemoveAsync("0");
List<Task> insertTasks = new List<Task>();
sw.Start();
foreach (JObject temp in jsonObjects) // jsonObjects is pre-created, so it's not a factor in the test
{
insertTasks.Add(collection.InsertAsync(temp.GetValue("JobId").ToString(), temp));
}
await Task.WhenAll(insertTasks);
sw.Stop();
Console.WriteLine($"Adding {nbr} to Couchbase took {sw.ElapsedMilliseconds} ms");
Redis (using ServiceStack!)
sw.Restart();
using (var client = redisManager.GetClient())
{
foreach (JObject temp in jsonObjects)
{
client.Set($"jobId:{temp.GetValue("JobId")}", temp.ToString());
}
}
sw.Stop();
Console.WriteLine($"Adding {nbr} to Redis took {sw.ElapsedMilliseconds} ms");
sw.Reset();
Mysql:
MySql.Data.MySqlClient.MySqlConnection mySqlConnection = new MySql.Data.MySqlClient.MySqlConnection("Server=localhost;Database=test;port=3306;User Id=root;password=root;");
mySqlConnection.Open();
sw.Restart();
foreach (JObject temp in jsonObjects)
{
MySql.Data.MySqlClient.MySqlCommand cmd = new MySql.Data.MySqlClient.MySqlCommand($"INSERT INTO test (id, data) VALUES ('{temp.GetValue("JobId")}', @data)", mySqlConnection);
cmd.Parameters.AddWithValue("@data", temp.ToString());
cmd.ExecuteNonQuery();
}
sw.Stop();
Console.WriteLine($"Adding {nbr} to MySql took {sw.ElapsedMilliseconds} ms");
sw.Reset();
READ
Couchbase:
IBucket bucket = await cluster.BucketAsync("halo");
IScope scope = bucket.Scope("myScope");
var collection = scope.Collection("myCollection");
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 1000; i++)
{
string key = $"{r.Next(1, 100000)}";
var result = await collection.GetAsync(key);
}
sw.Stop();
Console.WriteLine($"Couchbase Q: {q}\t{sw.ElapsedMilliseconds}");
Redis:
Stopwatch sw = Stopwatch.StartNew();
using (var client = redisManager.GetClient())
{
for (int i = 0; i < nbr; i++)
{
client.Get<string>($"jobId:{r.Next(1, 100000)}");
}
}
sw.Stop();
Console.WriteLine($"Redis Q: {q}\t{sw.ElapsedMilliseconds}");
MySQL:
MySqlConnection mySqlConnection = new MySql.Data.MySqlClient.MySqlConnection("Server=localhost;Database=test;port=3306;User Id=root;password=root;");
mySqlConnection.Open();
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < nbr; i++)
{
MySqlCommand cmd = new MySql.Data.MySqlClient.MySqlCommand($"SELECT data FROM test WHERE Id='{r.Next(1, 100000)}'", mySqlConnection);
using MySqlDataReader rdr = cmd.ExecuteReader();
while (rdr.Read())
{
}
}
sw.Stop();
Console.WriteLine($"MySql Q: {q} \t{sw.ElapsedMilliseconds} ms");
sw.Reset();
Couchbase setup and Bucket Durability settings (screenshots):
I only have 1 node (no cluster); it runs locally on my machine: Ryzen 3900X (12 cores), M.2 SSD, Windows 10, 32 GB RAM.
If you made it this far, here is a GitHub repo with my benchmark code:
https://github.com/tedekeroth/CouchbaseTests
I took your CouchbaseTests and commented out the non-Couchbase bits. I fixed the query to select from the collection (myCollection) instead of jobcache, removed the Metrics option, and created an index on JobId:
create index mybucket_JobId on default:myBucket.myScope.myCollection (JobId)
It inserts the 100,000 documents in 19 seconds, kv-fetches the documents in 146 usec on average, and queries by JobId in 965 usec on average.
Couchbase Q: 0 187
Couchbase Q: 1 176
Couchbase Q: 2 143
Couchbase Q: 3 147
Couchbase Q: 4 140
Couchbase Q: 5 138
Couchbase Q: 6 136
Couchbase Q: 7 139
Couchbase Q: 8 125
Couchbase Q: 9 129
average et: 146 ms per 1000 -> 146 usec / request
Couchbase Q: 0 1155
Couchbase Q: 1 1086
Couchbase Q: 2 1004
Couchbase Q: 3 901
Couchbase Q: 4 920
Couchbase Q: 5 929
Couchbase Q: 6 912
Couchbase Q: 7 911
Couchbase Q: 8 911
Couchbase Q: 9 927
average et: 965 ms per 1000 -> 965 usec / request (coincidentally exactly the same as with the Java API).
This was on 7.0 build 3739 on a MacBook Pro with the Couchbase server running locally.
######################################################################
I have a small LoadDriver application for the Java SDK that uses the KV API. With 4 threads, it shows an average response time of 54 microseconds and a throughput of 73,238 requests/second. It uses the travel-sample bucket on a Couchbase server on localhost: git@github.com:mikereiche/loaddriver.git
Run: seconds: 10, threads: 4, timeout: 40000us, threshold: 8000us requests/second: 0 (max), forced GC interval: 0ms
count: 729873, requests/second: 72987, max: 2796us avg: 54us, aggregate rq/s: 73238
For the query API I get the following, which is 18 times slower.
Run: seconds: 10, threads: 4, timeout: 40000us, threshold: 8000us requests/second: 0 (max), forced GC interval: 0ms
count: 41378, requests/second: 4137, max: 12032us avg: 965us, aggregate rq/s: 4144
I would have to run such a comparison myself to do a full investigation, but two things stand out.
Your parallel execution isn't truly parallel. async methods run synchronously up to the first await, so all of the code in InsertAsync/GetAsync before the first await runs sequentially as you add your tasks, not in parallel.
CouchbaseNetClient does some lazy connection setup in the background, and you're paying that cost in the timed section. Depending on the environment, including SSL negotiation and such things, this can be a significant initial latency.
You can potentially address the first issue by using Task.Run to kick off each operation, but you may need to pre-size the default ThreadPool.
You can address the second issue by doing at least one operation on the bucket (including bucket.WaitUntilReadyAsync()) before the timed section.
60 seconds for inserts still looks abnormal. How many nodes and what Durability setting are you using?
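A minimal sketch of both suggestions applied to the insert loop from the question (it reuses the bucket, collection, jsonObjects, nbr and sw variables from the code above; the timeout value is an assumption):
// Warm up the SDK before timing so lazy connection/bootstrap cost is not measured.
await bucket.WaitUntilReadyAsync(TimeSpan.FromSeconds(10));
sw.Restart();
var insertTasks = new List<Task>();
foreach (JObject temp in jsonObjects)
{
    // Task.Run pushes the synchronous part of InsertAsync (everything before its first await)
    // onto the thread pool, so queuing 100,000 operations is not serialized on one caller thread.
    insertTasks.Add(Task.Run(() => collection.InsertAsync(temp.GetValue("JobId").ToString(), temp)));
}
await Task.WhenAll(insertTasks);
sw.Stop();
Console.WriteLine($"Adding {nbr} to Couchbase took {sw.ElapsedMilliseconds} ms");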

Xmx settings in Elastic Beanstalk through environment properties

I have been trying to increase the memory of my Elastic Beanstalk application using JAVA_OPTS in the environment settings, with the values -Xms1G -Xmx3G. Attached is an image showing how I changed the settings.
After applying the changes and restarting the VM, I do not see the changes reflected on the server.
This is how I am verifying:
sudo jmap -heap
Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 1035993088 (988.0MB)
NewSize = 21495808 (20.5MB)
MaxNewSize = 344981504 (329.0MB)
OldSize = 43515904 (41.5MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 21807104 (20.796875MB)
CompressedClassSpaceSize = 1073741824 (1024.0MB)
MaxMetaspaceSize = 17592186044415 MB
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 192413696 (183.5MB)
used = 18710296 (17.843528747558594MB)
free = 173703400 (165.6564712524414MB)
9.723993867879342% used
From Space:
capacity = 26738688 (25.5MB)
used = 22166296 (21.139427185058594MB)
free = 4572392 (4.360572814941406MB)
82.89971445121017% used
To Space:
capacity = 27262976 (26.0MB)
used = 0 (0.0MB)
free = 27262976 (26.0MB)
0.0% used
PS Old Generation
capacity = 691011584 (659.0MB)
used = 571332904 (544.8655166625977MB)
free = 119678680 (114.13448333740234MB)
Heap settings cannot be set through environment properties. You have to provide them via a Procfile, and the Procfile has to be bundled when uploading.
I had to create a zip file that contained the war and the Procfile.
Procfile contents:
web: java -jar -Xms1G -Xmx3G application.war
How to test that this works:
Find the process id of your webapp/java process from top.
Use jmap -heap with that pid to get the heap allocation. I tested this on the EC2 instance behind Elastic Beanstalk.
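As a quick sketch of that check (the pgrep pattern is an assumption based on the Procfile above):
pid=$(pgrep -f 'application.war')   # find the Java process started from the Procfile
sudo jmap -heap "$pid"              # MaxHeapSize should now report roughly 3 GB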

Quartz clustered scheduler deadlock

We randomly get the exception below while running a Quartz clustered scheduler on 6 instances:
Couldn't acquire next trigger: Deadlock found when trying to get lock;
try restarting transaction
Here is our quartzConfig.properties:
scheduler.skipUpdateCheck = true
scheduler.instanceName = 'quartzScheduler'
scheduler.instanceId = 'AUTO'
threadPool.threadCount = 13
threadPool.threadPriority = 5
jobStore.misfireThreshold = 300000
jobStore.'class' = 'org.quartz.impl.jdbcjobstore.JobStoreTX'
jobStore.driverDelegateClass = 'org.quartz.impl.jdbcjobstore.StdJDBCDelegate'
jobStore.useProperties = true
jobStore.dataSource = 'myDS'
jobStore.tablePrefix = 'QRTZ_'
jobStore.isClustered = true
jobStore.clusterCheckinInterval = 10000
dataSource.myDS.driver='com.mysql.jdbc.Driver'
dataSource.myDS.maxConnections = 15
We are using the Quartz Grails plugin (with Quartz 2.2.1) in our application, with a MySQL database.
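For reference, two JobStore-related settings that are often adjusted for this particular deadlock, shown as a sketch in the same Grails-style syntax as the config above (the values are assumptions to validate against your own load, not a known fix):
// Acquire the next triggers while holding the database lock, which serializes the
// competing acquisition transactions across the 6 cluster nodes.
jobStore.acquireTriggersWithinLock = true
// Give the pool more headroom than threadPool.threadCount so cluster check-ins and
// misfire handling do not contend with worker threads for connections.
dataSource.myDS.maxConnections = 25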