I have a directory of roughly 45,000 json files. The total size is around 12.8 GB currently. This is website data from kissmetrics and its structure is detailed here.
The data:
Each file is multiple json documents separated by a newline
It will be updated every 12 hours with new additional files
I want to import this data to mongoDB using mongoimport. I've tried this shell script to make the process easier:
for filename in revisions/*;
do
echo $filename
mongoimport --host <HOSTNAME>:<PORT> --db <DBNAME> --collection <COLLECTIONNAME> \
--ssl --sslCAFile ~/mongodb.pem --username <USERNAME> --password <PASSWORD> \
--authenticationDatabase admin $filename
done
This sometimes fails with errors like the following:
2016-06-18T00:31:10.781+0000 using 1 decoding workers
2016-06-18T00:31:10.781+0000 using 1 insert workers
2016-06-18T00:31:10.781+0000 filesize: 113 bytes
2016-06-18T00:31:10.781+0000 using fields:
2016-06-18T00:31:10.822+0000 connected to: <HOSTNAME>:<PORT>
2016-06-18T00:31:10.822+0000 ns: <DBNAME>.<COLLECTION>
2016-06-18T00:31:10.822+0000 connected to node type: standalone
2016-06-18T00:31:10.822+0000 standalone server: setting write concern w to 1
2016-06-18T00:31:10.822+0000 using write concern: w='1', j=false, fsync=false, wtimeout=0
2016-06-18T00:31:10.822+0000 standalone server: setting write concern w to 1
2016-06-18T00:31:10.822+0000 using write concern: w='1', j=false, fsync=false, wtimeout=0
2016-06-18T00:31:10.824+0000 Failed: error processing document #1: invalid character 'l' looking for beginning of value
2016-06-18T00:31:10.824+0000 imported 0 documents
I can run into this error on any file, and from my inspection it is not due to malformed data.
The error may happen hours into the import.
Can I parse the error in mongoimport to retry the same document? I don't know if the error will have this same form, so I'm not sure if I can try to handle it in bash. Can I keep track of progress in bash and restart if terminated early? Any suggestions on importing large data of this size or handling the error in shell?
Typically a given command will return a non-zero exit code when it fails (and the codes are hopefully documented on the man page for the command).
So if you want to do something hacky and just retry once,
cmd="mongoimport --foo --bar..."
$cmd
ret=$?
if [ $ret -ne 0 ]; then
    echo "retrying..."
    $cmd
    if [ $? -ne 0 ]; then
        echo "failed again. Sadness."
        exit 1
    fi
fi
Or if you really need what mongoimport outputs, capture it like this:
results=$(mongoimport --foo --bar... 2>&1)
Now the variable $results will contain what was written to stdout. The 2>&1 redirects stderr into the capture as well, since many tools write their log and error messages there.
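To also cover the "keep track of progress and restart" part of the question, here is a minimal sketch that builds on the exit-code check above. The names retry_import, import_one, done_log, max_tries, db and coll are made up for illustration, and the connection flags from the question are left out; treat it as a starting point rather than a tested import pipeline.

```shell
#!/bin/bash
# Hypothetical settings: adjust to taste.
max_tries=3
done_log=imported.log   # one successfully imported file name per line

# Hypothetical wrapper around the real command; add your own
# connection and authentication flags here.
import_one() {
    mongoimport --db "$db" --collection "$coll" --file "$1"
}

# retry_import FILE: skip files already recorded in $done_log,
# otherwise try the import up to $max_tries times.
retry_import() {
    file=$1
    # A previous run already finished this file: nothing to do.
    grep -qxF -- "$file" "$done_log" 2>/dev/null && return 0
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if import_one "$file"; then
            printf '%s\n' "$file" >> "$done_log"
            return 0
        fi
        echo "attempt $try failed for $file" >&2
        try=$((try + 1))
    done
    return 1
}
```

You would then drive it with something like `for f in revisions/*; do retry_import "$f" || break; done`. Because finished files are recorded in the log, rerunning the loop after a crash picks up where it stopped. Note that retrying a file that was partially imported can insert duplicates unless the documents carry their own _id.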
To my eyes the following JSON looks valid.
{
"DescribeDBLogFiles": [
{
"LogFileName": "error/postgresql.log.2022-09-14-00",
"LastWritten": 1663199972348,
"Size": 3032193
}
]
}
A) But jq, json_pp, and Python's json.tool module deem it invalid:
# jq 1.6
> echo "$logfiles" | jq
parse error: Invalid numeric literal at line 1, column 2
# json_pp 4.02
> echo "$logfiles" | json_pp
malformed JSON string, neither array, object, number, string or atom,
at character offset 0 (before "\x{1b}[?1h\x{1b}=\r{...") at /usr/bin/json_pp line 51
> python3 -m json.tool <<< "$logfiles"
Expecting value: line 1 column 1 (char 0)
B) On the other hand, if the above JSON is copied and pasted into an online validator, both validators I tried (1 and 2) deem it valid.
As hinted by json_pp's error above, hexdump <<< "$logfiles" indeed shows additional, surrounding characters. Here's the prefix: 5b1b 313f 1b68 0d3d 1b7b ...., where 7b is {.
The JSON is output to a logfiles variable by this command:
logfiles=$(aws rds describe-db-log-files \
--db-instance-identifier somedb \
--filename-contains 2022-09-14)
# where `aws` is
alias aws='docker run --rm -it -v ~/.aws:/root/.aws amazon/aws-cli:2.7.31'
> bash --version
GNU bash, version 5.0.17(1)-release (x86_64-pc-linux-gnu)
Have perused this GitHub issue, yet can't figure out the cause. I suspect that double quotes get mangled somehow when using echo - some reported that printf "worked" for them.
Using docker run --rm -it -v to produce the JSON added some unprintable characters to the start of the JSON data. That makes the contents of the $logfiles variable invalid JSON.
The -t option allocates a tty and -i keeps stdin open so the session is interactive. In this case the -t is allowing the shell to read login scripts (e.g. .bashrc). Something in your startup scripts is outputting ANSI escape codes. Often this is done to clear the screen, set up other things for the interactive shell, or make the output more visually appealing by colorizing portions of the data.
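If the explanation above is right, dropping -t from the alias should stop the extra characters at the source. If you cannot change the alias, a rough workaround is to strip the control sequences before parsing. The sketch below is only that: strip_ansi is a made-up helper name, and it handles just the kinds of sequences seen in the hexdump (ESC [ ... letter, ESC =, ESC >, and carriage returns), not every terminal escape.

```shell
# Hypothetical helper: remove common terminal escape sequences from stdin.
strip_ansi() {
    esc=$(printf '\033')   # the literal ESC byte (0x1b)
    # 1st expression: CSI sequences such as ESC[?1h or ESC[0m
    # 2nd expression: keypad-mode sequences ESC= and ESC>
    sed -e "s/${esc}\[[0-9;?]*[A-Za-z]//g" -e "s/${esc}[=>]//g" | tr -d '\r'
}

# Example with the same prefix as in the hexdump from the question.
printf '\033[?1h\033=\r{"Size": 3032193}\n' | strip_ansi
```

With the real data this would be used as `printf '%s\n' "$logfiles" | strip_ansi | jq .`, assuming no other control characters are present.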
I'm a newbie to coding and am making a legislative database for use in academic work. I have downloaded the California legislative information into a directory on a partitioned portion of my hard drive. I loaded the schema into the MySQL DB with no issues and downloaded the data, but I am having problems getting it loaded. Let's call my workspace home directory home. Within that directory are my modules (I have node in there but I would love to avoid using it until I make an app), my json package and settings files, and a subdirectory called pubinfo. This is all set up.
Within the pubinfo directory are my SQL table files and the shell commands for loading the data into MySQL, where I have a DB with tables ready for data insertion, as well as subdirectories for legislative sessions labeled from 2001 to 2019 by session (2001, 2003, and so on in steps of 2 years). The loadData.sh file is below. The instructions from the California data website said to download these files, unzip them, and then run the script in my pubinfo directory...
if [ $# -gt 0 ]; then
echo Usage: .loadData.sh
exit 1
fi
if [ -z "$MYSQL_PWD" ]; then
read -p "Please enter root password:" MYSQL_PWD
export MYSQL_PWD=${MYSQL_PWD}
fi
while read -r lcTable
do
if [ -e ${lcTable}.dat ]; then
echo Processing table: ${lcTable}
if [ -z "$MYSQL_PWD" ]; then
mysql -uroot -p -Dcapublic -v -v -f < ${lcTable}.sql 2>&1 > ${lcTable}.log
else
mysql -uroot -Dcapublic -v -v -f < ${lcTable}.sql 2>&1 > ${lcTable}.log
fi
fi
done < "tables_lc.lst"
When run, the output on my zsh terminal is '/usr/local/bin/loadData.sh: line 29: location_code_tbl.sql: No such file or directory'. I should add that I symlinked the shell file into my path so that it could be called from anywhere. I plan to remove it once this is all uploaded. I suppose I could symlink all the SQL tables as well, but I know there has to be an easier way to iterate through subdirectories while using the SQL tables and files in my main directory. I am just not familiar with zsh or bash; I had to take a Udemy course just to set up the MySQL DB. Anyway, I was hoping someone would be able to help. If you have any questions that I did not address here, I can answer them. And in case it matters, my machine is a newer MacBook Pro running the most current MySQL version, and my editor is Visual Studio Code in addition to the good old terminal.
Thanks!
I would like to automate host creation on a Zabbix server without using an agent on the hosts. I tried using Discovery rules and sending JSON data with zabbix_sender, but without luck: the server does not accept the data.
Environment:
Zabbix server 3.4 installed on CentOS 7. Hosts run Windows or Ubuntu.
On the server I created a host named zab_trap.
In that host I created a Discovery rule with key zab_trap.discovery and type Zabbix trapper. Then in the Discovery rule I created a Host prototype with name {#RH.NAME}.
Command line with JSON "data":
zabbix_sender.exe -z zab_server -s zab_trap -k zab_trap.discovery -o "{"data":[{"{#RH.NAME}":"HOST1"}]}"
I expected that "HOST1" will be created. But after execution I got:
"info from server: "processed: 0; failed: 1; total: 1; seconds spent: 0.000188"
sent: 1; skipped: 0; total: 1"
And there is no error in zabbix_server.log (with debug level 5)
I see this:
trapper got '{"request":"sender data","data":[{"host":"zab_trap","key":"zab_trap.discovery","value":"'{data:[{{#RH.NAME}:HOST1}]}'"}]}'
I think that maybe there is something wrong with JSON syntax.
Please help.
It seems I have found a solution. The problem lies in how the JSON is sent. As far as I can tell, writing the JSON directly on the command line does not work properly because of quoting: the double quotes are stripped before zabbix_sender sees them, as the trapper log above shows. It works if zabbix_sender sends a file containing the JSON.
Command line:
zabbix_sender -z zab_server -s zab_trap -i test.json
The file test.json contains this line:
- zab_trap.discovery {"data":[{"{#RH.NAME}":"HOST1"}]}
Host created.
If you want to use the command line without a JSON file, you need to clean the string with:
zabbix_sender.exe -z zab_server -s zab_trap -k zab_trap.discovery -o "$(echo '{"data":[{"{#RH.NAME}":"HOST1"}]}' | tr -cd '[:print:]')"
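To see why the inline JSON loses its quotes, here is a small sketch for a POSIX shell such as bash (the Windows cmd prompt has different quoting rules, so this is an illustration of the mechanism rather than a reproduction of the exact Windows case). The variable names broken and intact are made up for the example.

```shell
# Inside a double-quoted string, every inner " ends or restarts the quoting,
# so the shell removes all of them and glues the pieces together.
broken="{"data":[{"{#RH.NAME}":"HOST1"}]}"

# Inside single quotes every character is kept literally.
intact='{"data":[{"{#RH.NAME}":"HOST1"}]}'

printf '%s\n' "$broken"   # {data:[{{#RH.NAME}:HOST1}]}
printf '%s\n' "$intact"   # {"data":[{"{#RH.NAME}":"HOST1"}]}
```

The first result has the same shape as the value in the trapper log from the question, which is not valid JSON, so the discovery rule rejects it.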
The top part of the following script works great: the .dat files are created via the MySQL command and work perfectly with gnuplot (via the command line). The problem is getting the bottom part (gnuplot) to work correctly. I'm pretty sure I have a couple of problems in the code: the variables and the array. I need to plot each .dat file, put the title in the graph (from the title in customers.txt), and name the output file (.png).
Any guidance would be appreciated. Thanks a lot -- RichR
#!/bin/bash
set -x
databases=""
titles=""
while read -r ipAddr dbName title; do
dbName=$(echo "$dbName" | sed -e 's/pacsdb//')
rm -f "$dbName.dat"
touch "$dbName.dat"
databases=("$dbName.dat")
titles="$titles $title"
while read -r period; do
mysql -uroot -pxxxx -h "$ipAddr" "pacsdb$dbName" -se \
"SELECT COUNT(*) FROM tables WHERE some.info BETWEEN $period;" >> "$dbName.dat"
done < periods.txt
done < customers.txt
for database in "${databases[#]}"; do
gnuplot << EOF
set a bunch of options
set output "/var/www/$dbName.png"
plot "$dbName.dat" using 2:xtic(1) title "$titles"
EOF
done
exit 0
customers.txt example line-
192.168.179.222 pacsdbgibsonia "Gibsonia Animal Hospital"
Error output.....
+ for database in '"${databases[#]}"'
+ gnuplot
line 0: warning: Skipping unreadable file ".dat"
line 0: No data in plot
+ exit 0
to initialise databases array:
databases=()
to append $dbName.dat to databases array:
databases+=("$dbName.dat")
to retrieve dbName, remove suffix pattern .dat
dbName=${database%.dat}
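Putting the three pieces together, a minimal sketch of the array handling could look like this. The database names are placeholders (gibsonia comes from the customers.txt example, example is made up), the echo stands in for the gnuplot here-document, and the loop expands the array with "${databases[@]}".

```shell
#!/bin/bash
databases=()                          # start with an empty array

# Stand-in for the "while read ... done < customers.txt" loop.
for dbName in gibsonia example; do
    databases+=("$dbName.dat")        # append, do not overwrite
done

# One iteration per collected .dat file.
for database in "${databases[@]}"; do
    dbName=${database%.dat}           # strip the .dat suffix again
    echo "would plot $database into /var/www/$dbName.png"
done
```

Titles could be collected the same way in a second array (for example titles+=("$title")) and looked up by index, assuming both arrays are filled in the same loop.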
I'm going to design a network analyzer for WiFi (802.11).
Currently I use tshark to capture and parse the WiFi frames and then pipe the output to a Perl script that stores the parsed information in a MySQL database.
I just found out that I lose a lot of frames in this process. I checked, and the frames seem to be lost in the pipe (when the output is delivered to Perl to get stored in MySQL).
Here is how it goes:
(Tshark) -------frames are lost----> (Perl) --------> (MySQL)
This is how I pipe the output of tshark to the script:
sudo tshark -i mon0 -t ad -T fields -e frame.time -e frame.len -e frame.cap_len -e radiotap.length | perl tshark-sql-capture.pl
This is a simplified version of the Perl script I use (tshark-sql-capture.pl):
# preparing the MySQL
my $dns = "DBI:mysql:capture;localhost";
my $dbh = DBI->connect($dns,user,pass);
my $db = "captured";
while (<STDIN>) {
chomp($data = <STDIN>);
($time, $frame_len, $cap_len, $radiotap_len) = split " ", $data;
my $sth = $dbh-> prepare("INSERT INTO $db VALUES (str_to_date('$time','%M %d, %Y %H:%i:%s.%f'), '$frame_len', '$cap_len', '$radiotap_len'\n)" );
$sth->execute;
}
#Terminate MySQL
$dbh->disconnect;
Any idea that helps improve the performance is appreciated. Or maybe there is an alternative mechanism that can do better.
Right now my performance is 50%, meaning I can store in MySQL only around half of the packets I've captured.
Things written to a pipe don't get lost. What's probably really going on is that tshark tries to write to the pipe, but perl+mysql is too slow to process the input, so the pipe is full. The write would block, so tshark just drops the packets.
Bottleneck could be either MySQL or Perl itself but probably the DB. Check CPU usage, measure insert rate. Then pick a faster DB or write to multiple DBs. You can also try batch inserts and increasing the size of the pipe buffer.
Update
while (<STDIN>)
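This reads a line into $_, which you then ignore. The chomp($data = <STDIN>) on the next line reads a second line, so only every other line reaches the database, which would explain losing about half of the frames.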
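The same double-read pattern can be reproduced in a few lines of shell, as an analogy for the Perl loop rather than a fix for it. The function names double_read and single_read are made up for the example.

```shell
# Mimics the Perl loop: the loop condition reads one line,
# then the body reads (and uses) the next one.
double_read() {
    while read -r ignored; do
        read -r data
        printf '%s\n' "$data"
    done
}

# Corrected pattern: the line read by the loop condition is the one used.
single_read() {
    while read -r data; do
        printf '%s\n' "$data"
    done
}

printf '1\n2\n3\n4\n' | double_read   # prints 2 and 4
printf '1\n2\n3\n4\n' | single_read   # prints 1, 2, 3 and 4
```

In the Perl script the equivalent change would be to use the line already read by while (<STDIN>) instead of reading from STDIN a second time.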
For pipe problems, you can improve packet capture with GULP http://staff.washington.edu/corey/gulp/
From the Man pages:
1) reduce packet loss of a tcpdump packet capture:
(gulp -c works in any pipeline as it does no data interpretation)
tcpdump -i eth1 -w - ... | gulp -c > pcapfile
or if you have more than 2 CPUs, run tcpdump and gulp on different CPUs
taskset -c 2 tcpdump -i eth1 -w - ... | gulp -c > pcapfile
(gulp uses CPUs #0,1 so use #2 for tcpdump to reduce interference)
You can use a FIFO file, then read the packets from it and insert them into MySQL using INSERT DELAYED.
sudo tshark -i mon0 -t ad -T fields -e frame.time -e frame.len -e frame.cap_len -e radiotap.length > MYFIFO
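For anyone who has not used a FIFO before, here is a minimal sketch of the mechanics. The printf producer stands in for the tshark command, the counting loop stands in for the database loader, and the scratch directory from mktemp is an assumption (MYFIFO could live anywhere).

```shell
# Create the named pipe in a scratch directory.
workdir=$(mktemp -d)
fifo="$workdir/MYFIFO"
mkfifo "$fifo"

# Stand-in producer, started in the background because opening a FIFO
# for writing blocks until a reader opens it.
printf 'frame1\nframe2\nframe3\n' > "$fifo" &

# Consumer: read the FIFO line by line, as the loader script would.
count=0
while read -r line; do
    count=$((count + 1))
done < "$fifo"
wait
echo "read $count lines"

rm -rf "$workdir"
```

Keep in mind that a FIFO is still a pipe with a limited kernel buffer, so a slow reader will stall the writer in the same way; it mainly decouples how the two processes are started.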