So the thing that makes this whole question hard is that I am working in a bash shell environment. I am parsing a large amount of data that is all located in text files in a set of directories. The environment I am working in does not have a GUI; it is just the shell. I am executing the commands from the shell through the mysql client, so I am not logged into an interactive mysql session.
I am a partner on a project whose main part is a bash script that searches for information and writes it to text files in several directories. My part parses out the needed data and inserts it into the database.
I run my main loop through a shell script. It loops through a set of directories and searches for the .txt files in each, then passes each file's contents to my procedure, in something like the code below.
NOTE: I am not an expert in bash and have just started learning.
mysql -u user -p'mypassword' --database=dbname <<EOF
call Procedure_Name("`cat ${textfile}`");
EOF
Since I am working in mysql and bash only I can not use another language to make my life easier so I use SUBSTRING_INDEX mostly. So an illustration of the procedure is shown below.
DELIMITER $$
CREATE PROCEDURE Procedure_name(textfile LONGTEXT)
BEGIN
DECLARE data LONGTEXT;
SET data = SUBSTRING_INDEX(SUBSTRING_INDEX(textfile,"(+++)",1),"(++)",-1);
INSERT INTO Table_Name (column) values (data);
END; $$
DELIMITER ;
The text file has a clean structure that allows me to cut it up, but the problem I am having is that special characters inside the text file are causing my procedure to throw an error. I believe they are escape characters and I need a way around this. Just about any character could appear in the data I am parsing, so I need a way to ignore these characters in the procedure or to keep them from affecting my process.
I tried looking into mysql_real_escape_string() however the parameters were hard to figure out and it looks like it only works in PHP but I am not sure. So I would like to do something at the beginning of my procedure to maybe insert "\"'s or something into the string to not cause my procedure to fail.
Also, these textfiles range from 16k to 11000k so I need something that can handle that. My process works sometimes but is getting caught up on a lot of stuff and my searching has not helped me at all. So any help would be greatly appreciated!!!
And thanks to all for reading this long description. Normally I can find my answer or piece it together from other questions, but I had no luck this time, so I figured it was about time to make an account and ask something.
Your question is really too broad, but here is an example of what I mean
a script file:
#!/bin/bash
case $# in
1 ) inFile=$1 ;;
* ) echo "usage: myLoader infile"; exit 1 ;;
esac
awk 'BEGIN {
    FS="\t"; OFS="|"
}
{
    # Remove unwanted characters; repeat sub() as many times as needed.
    sub(/badChars/, "", $0); sub(/otherBads/, "", $0)
    # But be careful: it is easy to delete too much with too broad a brush.
    print $1, $2, $5, $4, $9
}' "$inFile" > "$inFile.psv"
bcp -in -f ${formatFile:-formatFile} $inFile.psv
Note how awk makes it very easy, by repeating sub(...) commands, to remove any "bad chars" you may have in your source data AND to reorganize the order of the columns in your data. Each $n is the value of the nth column on a line, so printing $1, $2, $5 skips fields $3 and $4, for example.
The OFS is set to the pipe char, making it easy to see in your output where exactly the field boundaries are AND if there are any leading or trailing whitespace characters that may be throwing off your load.
The > $inFile.psv keeps your original file, just in case you make a mistake in the awk script.
If you create really small test data files, you can eliminate saving to a file and just let the output go to the screen, editing until you get it right.
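As a minimal sketch of that edit-and-check loop, the script below feeds one made-up tab-separated line through the same kind of awk filter and prints the result straight to the screen. The sample data, the choice of carriage returns as the "bad char", and the function name clean_fields are assumptions for illustration only.

```shell
#!/bin/bash
# Clean and reorder tab-separated input, writing pipe-separated output.
clean_fields() {
    awk 'BEGIN { FS = "\t"; OFS = "|" }
    {
        # Strip carriage returns; modifying $0 makes awk re-split the fields.
        gsub(/\r/, "", $0)
        # Reorder the columns: first, third, then second.
        print $1, $3, $2
    }'
}

# One hypothetical input line with a Windows-style line ending.
printf 'a\tb\tc\r\n' | clean_fields
```

Run as written, this prints a|c|b. Once the output looks right on a handful of such lines, the same awk body can be pointed at the real file.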
You'll have to find out exactly how mySQL's equivalent of bcp works. I'm pretty sure I've seen postings here. Either that, or post a separate question, "I have this pipe-delimited file with 8 columns, how do I load it to my table?".
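For what it's worth, the closest MySQL counterpart to bcp is probably the LOAD DATA statement (or the mysqlimport wrapper around it). Here is a rough sketch, assuming a hypothetical table my_table with columns col_a and col_b, a server that permits LOCAL loads, and the placeholder credentials from the question. None of these names come from the original post.

```shell
#!/bin/bash
# Build the LOAD DATA statement for a pipe-delimited file.
# ESCAPED BY '' asks MySQL to treat backslashes in the data literally.
build_load_sql() {
    local file=$1 table=$2 columns=$3
    printf "LOAD DATA LOCAL INFILE '%s' INTO TABLE %s FIELDS TERMINATED BY '|' ESCAPED BY '' (%s);\n" \
        "$file" "$table" "$columns"
}

# Run the load. --local-infile=1 enables LOCAL loads on the client side;
# the server must allow them as well (local_infile system variable).
load_psv() {
    mysql --local-infile=1 -u user -p'mypassword' --database=dbname \
        -e "$(build_load_sql "$@")"
}

# Print the statement first, to check it before calling load_psv.
build_load_sql "data.psv" "my_table" "col_a, col_b"
```

With escaping switched off like this, a literal pipe character inside a field would still be read as a field boundary, so the delimiter has to be something that cannot occur in the data.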
The reference in my sample code to ${formatFile} reflects my hope that the mySQL equivalent of bcp can take a format file that specifies the order and types of fields to be loaded into a table. Good bcp fmt files allow a fair amount of flexibility, but you'll have to read the man page for that utility AND do some research to understand the scope of and constraints on that flexibility.
Going forward, you should post individual questions like, "I've tried x using lang Y to filter Z characters. Right now I'm getting output z, What am I doing wrong?"
Divide and conquer. There is no easy way. Reset those customer and boss expectations, you're learning something new, and it will take a little study to get it right. Good luck.
IHTH
I have the latest TCL build from Active State installed on a desktop and laptop both running Windows 10. I'm new to TCL and a novice developer and my reason for learning TCL is to enhance my value on the F5 platform. I figured a good first step would be to stop the occasional work I do in VBScript and port that to TCL. Learning the language itself is coming along alright, but I'm worried my project isn't viable due to performance. My VBScripts absolutely destroy my TCL scripts in performance. I didn't expect that outcome as my understanding was TCL was so "fast" and that's why it was chosen by F5 for iRules etc.
So the question is, am I doing something wrong? Is the port for Windows just not quite there? Perhaps I misunderstood the way in which TCL is fast and it's not fast for file parsing applications?
My test application is a firewall log parser. Take a log with 6 million hits and find the unique src/dst/port/policy entries and count them; split up into accept and deny. Opening the file and reading the lines is fine, TCL processes 18k lines/second while VBScript does 11k. As soon as I do anything with the data, the tide turns. I need to break the four pieces of data noted above from the line read and put in array. I've "split" the line, done a for-next to read and match each part of the line, that's the slowest. I've done a regexp with subvariables that extracts all four elements in a single line, and that's much faster, but it's twice as slow as doing four regexps with a single variable and then cleaning the excess data from the match away with trims. But even this method is four times slower than VBScript with ad-hoc splits/for-next matching and trims. On my desktop, i get 7k lines/second with TCL and 25k with VBscript.
Then there's the array, I assume because my 3-dimensional array isn't a real array that searching through 3x as many lines is slowing it down. I may try to break up the array so it's looking through a third of the data currently. But the truth is, by the time the script gets to the point where there's a couple hundred entries in the array, it's dropped from processing 7k lines/second to less than 2k. My VBscript drops from about 25k lines to 22k lines. And so I don't see much hope.
I guess what I'm looking for in an answer, for those with TCL experience and general programming experience, is TCL natively slower than VB and other scripts for what I'm doing? Is it the port for Windows that's slowing it down? What kind of applications is TCL "fast" at or good at? If I need to try a different kind of project than reading and manipulating data from files I'm open to that.
edited to add code examples as requested:
while { [gets $infile line] >= 0 } {
# (some other commands cut out for the sake of space; they don't contribute to the slowness)
regexp {srcip=(.*)srcport.*dstip=(.*)dstport=(.*)dstint.*policyid=(.*)dstcount} $line -> srcip dstip dstport policyid
# The above was unexpectedly slow. The fastest way to extract the data that I've found so far:
regexp {srcip=(.*)srcport} $line srcip
set srcip [string trim $srcip "cdiloprsty="]
regexp {dstip=(.*)dstport} $line dstip
set dstip [string trim $dstip "cdiloprsty="]
regexp {dstport=(.*)dstint} $line dstport
set dstport [string trim $dstport "cdiloprsty="]
regexp {policyid=(.*)dstcount} $line a policyid
set policyid [string trim $policyid "cdiloprsty="]
Here is the array search that really bogs down after a while:
set start [array startsearch uList]
while {[array anymore uList $start]} {
incr f
#"key" returns the NAME of the association and uList(key) the VALUE associated with name
set key [array nextelement uList $start]
if {$uCheck == $uList($key)} {
##puts "$key CONDITOIN MET"
set flag true
adduList $uCheck $key $flag2
set flag2 false
break
}
}
Your question is still a bit broad in scope.
F5 has published some comments on why they chose Tcl and how it is fast for their specific use cases. Those are actually a bit different from a log-parsing use case, as they do all the heavy lifting in C code (via custom commands) and use Tcl mostly as a fast dispatcher and for a bit of flow control. And Tcl is really good at that compared to various other languages.
For things like log parsing, Tcl is often beaten in performance by languages like Python and Perl in simple benchmarks. There are a variety of reasons for that; here are some of them:
Tcl uses a different regexp style (DFA), which is more robust for nasty patterns but slower for simple patterns.
Tcl has a more abstract I/O layer than, for example, Python, and usually converts the input to Unicode, which has some overhead if you do not disable it (via fconfigure).
Tcl has proper multithreading instead of a global lock, which costs around 10-20% performance for single-threaded use cases.
So how to get your code fast(er)?
Try a more specific regular expression, those greedy .* patterns are bad for performance.
Try to use string commands instead of regexp, some string first commands followed by string range could be faster than a regexp for these simple patterns.
Use a different structure for that array, you probably want either a dict or some form of nested list.
Put your code inside a proc rather than leaving it all in a top-level script, and use local variables instead of globals, so that the bytecode is faster.
If you want, use one thread for reading lines from file and multiple threads for extracting data, like a typical producer-consumer pattern.
My problem is simple: I'm trying to write a Tcl script that uses $grofile instead of writing the file name every time I need it.
So, what I did in TkConsole was:
% set grofile "file.gro"
% mol load gro ${grofile}
and, indeed, I succeeded uploading the file.
In the script I have the same lines, but still have this error:
wrong # args: should be "set varName ?newValue?"
can't read "grofile": no such variable
I tried to solve my problem with
% set grofile [./file.gro]
and I have this error,
invalid command name "./file.gro"
can't read "grofile": no such variable
I tried also with
% set grofile [file ./file.gro r]
and I got the first error, again.
I haven't found any simple way to avoid using the explicit name of the file I want to upload. It seems like you only can use the most trivial, but tedious way:
mol load file.gro
mol addfile file.xtc
and so on and so on...
Can you help me with a brief explanation about why in the TkConsole I can upload the file and use it as a variable while I can not in the tcl script?
Also, if you can see where my mistake is, I would appreciate you pointing it out.
I apologize if it is basic, but I could not find any answer. Thanks.
I add the head of my script:
set grofile "sim.part0001_protein_lipid.gro"
set xtcfile "protein_lipid.xtc"
set intime "0-5ms"
set system "lower"
source view_change_render.tcl
source cg_bonds.tcl
mol load gro $grofile xtc ${system}_${intime}_${xtcfile}
It was solved, thanks for your help.
You may think you've typed the same thing, but you haven't. I'm guessing that your real filename has spaces in it, and that you've not put double-quotes around it. That will confuse set as Tcl's general parser will end up giving set more arguments than it expects. (Tcl's general parser does not know that set only takes one or two arguments, by very long standing policy of the language.)
So you should really do:
set grofile "file.gro"
Don't leave the double quotes out if you have a complicated name.
Also, this won't work:
set grofile [./file.gro]
because […] is used to indicate running something as a command and using the result of that. While ./file.gro is actually a legal command name in Tcl, it's… highly unlikely.
And this won't work:
set grofile [file ./file.gro r]
Because the file command requires a subcommand as a first argument. The word you give is not one of the standard file subcommands, and none of them accept those arguments anyway, which look suitable for open (though that returns a channel handle suitable for use with commands like gets and read).
The TkConsole is actually pretty reasonable as quick-and-dirty terminal emulations go (given that it omits a lot of the complicated cases). The real problem is that you're not being consistently accurate about what you're really typing; that matters hugely in most programming languages, not just Tcl. You need to learn to be really exacting; cut-n-paste when creating a question helps a lot.
This should be an incredibly easy question but I am not very familiar with bash and I am taking way longer than I should to figure it out.
declare -a ids=( 1 2 3 )
for i in "${ids[@]}";
do
re= $(mysql -h .... "SELECT col_A FROM DBA WHERE id=$i")
if [ $re -eq 0 ]; then
echo sucess
fi
done
This is an example of what I am trying to do, I have an id array and I want to send a query to my db so I can get a flag in the row with a certain id and then do something based on that. But I keep getting unexpected token errors and I am not entirely sure why
Edit: While copying the code and deleting some private information somehow I deleted the then, it was present in the code I was testing.
Based on what you described and the partial script, I am not certain I can completely create what you are trying to do but the token error messages you are experiencing usually have to do with the way bash handles whitespace as a delimiter. A few comments based on what you posted:
You need to remove the space around the equal sign when assigning a variable, so the space after the equal sign in re= needs to be removed.
Because bash is sensitive to whitespace, you need to quote values in variable assignments that might contain a space. To be safe, put quotes around the command substitution $( ).
You were missing the then in the if statement.
It is important that variables inside test brackets, that is, single [ ]s, are quoted. Using an unquoted string with -eq, or even just the unquoted string alone within test brackets, normally works; however, this is an unsafe practice and can give unpredictable results.
So, taking into account the items noted, the updated script would look something like:
declare -a ids=( 1 2 3 )
for i in "${ids[@]}";
do
re="$(mysql -h .... "SELECT col_A FROM DBA WHERE id=$i")"
if [ "$re" -eq "0" ]; then
echo "success"
fi
done
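To try the loop without a database, here is a hedged sketch in which a stand-in function, query_flag, plays the role of the mysql call. The function names and the values it returns are made up for illustration. It also guards against non-numeric results, since the mysql client prints a column-name header unless it is told not to (the -N option), and that header would make -eq fail.

```shell
#!/bin/bash
# Stand-in for the real query so the loop can run anywhere.
# In a real script this might be something like:
#   mysql -N -B -h "$host" -e "SELECT col_A FROM DBA WHERE id=$1"
# (-N drops the column-name header, -B gives plain tab-separated output).
query_flag() {
    case $1 in
        2) echo 1 ;;
        *) echo 0 ;;
    esac
}

check_ids() {
    local i re
    for i in "$@"; do
        re="$(query_flag "$i")"
        # Only compare with -eq once we know the result is a plain number.
        if [[ $re =~ ^[0-9]+$ ]] && [ "$re" -eq 0 ]; then
            echo "id $i: success"
        else
            echo "id $i: flag is '$re'"
        fi
    done
}

declare -a ids=( 1 2 3 )
check_ids "${ids[@]}"
```

With the stub as written, ids 1 and 3 report success and id 2 reports its flag value.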
Can you try working the edits mentioned into your script and see if you are able to get it working? It will also be helpful to use a site like ShellCheck to learn more about potential pitfalls and the quirks of bash syntax. This will help ensure you are working toward a solution to your specific need rather than getting trapped by some tricky syntax.
After you have worked through those edits, can you report back your experience?
EDIT
Based on your comments, there is a good chance you are not running your script with bash despite including #!/bin/bash at the top of your script. When you run the script as sh scriptname.sh, you are forcing the script to be run by sh, not bash. Try running your script like this: /bin/bash scriptname.sh, then report back on your experience.
For more information on the differences between various shells, see Unix/Linux : Difference between sh , csh , ksh and bash Shell
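One quick way to confirm which shell is actually interpreting a script is to look at BASH_VERSION, which bash sets itself and which is empty in shells such as dash. This is a small sketch; the function name detect_shell is made up, and the check is written in plain POSIX syntax so that it also runs when the script is started with sh.

```shell
#!/bin/bash
# Report which kind of shell is interpreting this script.
detect_shell() {
    if [ -n "${BASH_VERSION:-}" ]; then
        echo "bash"
    else
        echo "other"
    fi
}

# Warn early instead of failing later on bash-only syntax.
if [ "$(detect_shell)" != "bash" ]; then
    echo "warning: not running under bash; arrays and [[ ]] may fail" >&2
fi
```

Putting a check like this near the top of a script makes the "run with sh by mistake" case show up as a clear warning rather than as an unexpected token error further down.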
Your problem with your if statement is that you do not have the then keyword. A simple fix is:
declare -a ids=( 1 2 3 )
for i in "${ids[@]}";
do
re=$(mysql -h .... "SELECT col_A FROM DBA WHERE id=$i")
if [ "$re" -eq 0 ]; then
echo success
fi
done
Also here is a great reference on if statements in bash
I have a MySQL dump file over 1 terabyte big. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo but I'm kind of disappointed I did. I created a regex CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB) to extract the CREATE TABLE clause, and that seems to be working well testing in NotePad++.
However, the ETA of extracting all instances is over 3 hours, and I cannot even be sure that it is doing it correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and output file came blank. I created a smaller subset of data and the procedure is still not working. It works in regex testers however, but grep is not liking it and yielding an empty output. Here is the command I'm running. I'd provide the sample but I don't want to breach confidentiality for my client. It's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems to not match on new lines right after the CREATE\s+TABLE part.
You can use Perl for this task... this should be really fast.
Perl's .. (range) operator is stateful - it remembers state between evaluations.
What it means is: if your definition of table starts with CREATE TABLE and ends with something like ENGINE=InnoDB DEFAULT CHARSET=utf8; then below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you this also:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you progress bar, speed and ETA.
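The range idea can be tried on a tiny sample before committing to a multi-hour run. The sketch below swaps in sed's address range, which behaves like the Perl range operator for this purpose and also works one line at a time. The sample dump contents and the function name extract_create_tables are made up, and anchoring the start pattern with ^ assumes that CREATE TABLE begins its line in the dump.

```shell
#!/bin/bash
# Print every block from a line starting with CREATE TABLE through the
# next line containing ENGINE=InnoDB. Reads the named files, or stdin.
extract_create_tables() {
    sed -n '/^CREATE TABLE/,/ENGINE=InnoDB/p' "$@"
}

# Tiny made-up sample standing in for a real dump.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-- MySQL dump
DROP TABLE IF EXISTS `plates`;
CREATE TABLE `plates` (
  `id` int NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `plates` VALUES (1),(2);
EOF

extract_create_tables "$sample"
rm -f "$sample"
```

Only the three lines of the table definition should come out; the comment, DROP, and INSERT lines are skipped.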
You can use the following:
grep -ioP "^CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt > output.txt
If you can run mysqldump again, simply add --no-data.
Got it! grep does not support matching across multiple lines. I found this question helpful and I ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt
>> set signal_name [get_fanout abc_signal]
{xyz_blah_blah}
>> echo $signal_name
#142
>> set signal_name [get_fanout abc_signal]
{xyz_blah_blah}
>> echo $signal_name
#144
>>
I tried other things like catch, and everywhere it returns #number. My goal is to be able to print the actual value, xyz_blah_blah, instead of the number.
I am new to Tcl and want to understand whether this is an array, a pointer to an array, or something like that. When I try the exact same thing with a different command that returns just a plain value, it works. This is a new command that returns its value in braces.
Please help. Thanks.
Every Tcl command produces a result value, which you capture and use by putting the call of the command in [square brackets] and putting the whole lot as part of an argument to another command. Thus, in:
set signal_name [get_fanout abc_signal]
the result of the call to get_fanout is used as the second argument to set. I suggest that you might also like to try doing this:
puts "-->[get_fanout abc_signal]<--"
It's just the same, except this time we're concatenating it with some other small string bits and printing the whole lot out. (In case you're wondering, the result of puts itself is always the empty string if there isn't an error, and set returns the contents of the variable.)
If that is still printing the wrong value (as well as the right one beforehand, without arrow marks around it) the real issue may well be that get_fanout is not doing what you expect. While it is possible to capture the standard output of a command, doing so is a considerably more advanced technique; it is probably better to consider whether there is an alternate mechanism to achieve what you want. (The get_fanout command is not a standard part of the Tcl language library or any very common add-on library like Tk or the Tcllib collection, so we can only guess at its behavior.)