This is weird and I'm not sure who the culprit really is.
I'm doing some scripting on FreeBSD (6.2) which makes extensive use of the following bashism:
do_something <(mysql --skip-column-names -B -e 'select ... from ... where ...;')
... where "do_something is a somewhat crufty utility (in Perl) that won't read from a pipeline. If I use a regular file it works fine. My bash script using things like exec 4< <(...) with these sorts of queries (following by loops of the form while read x y z <&4; do ... never seem to have any issues.
However, Perl (5.8.x) seems to periodically block (apparently forever). I tried replacing the chomp(my $data = <MYDATA>); with a routine that used sysread, and I wrote some test cases in Python for comparison. These seem to block far less often than the idiomatic Perl code, but they still do it sometimes. (The Python code using f.read() or os.read(f.fileno()...) behaves about the same in this regard.)
I've tried reproducing the issue using ... <(cat ...) (where I'm cat-ing the regular file) and that never reproduces the stall.
I've glanced at some ktrace/kdump data ... but I'm far more familiar with Linux strace or even Solaris truss ... so I haven't figured out what's going on from there yet, either.
I suppose we can mostly rule out Perl, because I've reproduced the same issue using Python ... and I don't see how bash could be doing anything wrong here (it's just creating a named pipe in /var/tmp/sh-np-xxx and wiring the processes up to it).
What could the mysql shell/utility be doing that might cause this? I don't think I've seen it from anything else (such as cat or dd). I haven't tested this scenario under Linux ... but I've used <(...) (process substitution) for years under Linux and don't recall ever seeing this.
Is it a FreeBSD issue?
Sure I can work around the issue using temporary files ... but I'd sure rather understand why it's doing this (and avoid some of the races and clean-up messiness that temporary files entail).
Any suggestions?
The big difference between operating on the output of mysql and directly on a file is timing. When the perl process is stalled, the big question is: why is it not making forward progress? You can use the "l" option to ps to see the wait channel for the perl process; that way you can see whether it is blocked on a read, or whether something else is going on. If it is really blocked on pipe input, I expect the MWCHAN entry for perl to be "piperd".
The same information would be interesting for the mysql process.
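For example (the PID here is hypothetical):
# 12345 = PID of the stalled perl process
ps -l -p 12345
# Look at the MWCHAN column; "piperd" means it is blocked reading from a pipe.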
What does your Python test code look like?
Another way of writing this while avoiding the bashism is the following; that would allow you to rule out bash:
mysql --skip-column-names -B -e 'select ... from ... where ...;' | do_something /dev/stdin
Other interesting questions:
Does the --unbuffered option to mysql change anything?
Does piping the mysql output through dd change anything? (e.g. perlscript <(mysql ... | dd))
Summary: Need more information.
Related
I am looking for an elegant way to parse a text file (i.e. a log file containing source and destination IPs and lots of other data) keeping each line intact, and replacing all IPv4 addresses with the same IP followed by a comma and the GeoIP country code of that IP.
I have tried doing this in bash, sed, perl, and python. I tried a hundred perl one-liners and never quite got it, because a substitution like s/original/replacement/g doesn't want to execute a GeoIP lookup in the replacement field. For example:
perl -pe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/($1,system(geoiplookup $1))/g' < log.csv
results in:
"srcip=(110.110.110.110,system(geoiplookup 110.110.110.110))"
instead of executing geoiplookup.
I've tried this with backticks as well as exec, lots of different punctuation, with the same result.
In Python I tried some code that looks like:
rexp_ip = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
repl = { rexp_ip: rexp_ip+".test" }
---
while line:
line = i.readline()
print(re.sub(rexp_ip, lambda m: str(repl.get(m.group())), line))
It seems pretty close but I'm not sure whether I'm on the right track here.
I would be open to bash, sed, awk, perl, python, or any other solution.
This seems fairly simple to me and I may be over-thinking it!
I am guessing I'm not the first person who has tried this and maybe I'm 'reinventing the wheel' here.
Any insight would be appreciated.
I may have solved my own problem using perl with the /e switch:
$ perl -lpe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/(`printf $1;geoiplookup $1`)/eg' < log.csv
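For completeness, since the question mentions being open to bash as well, here is a rough bash sketch of the same idea. It assumes geoiplookup prints output of the form "GeoIP Country Edition: US, United States"; adjust the parsing if your output differs:
while IFS= read -r line; do
    # find each unique IPv4-looking string on the line
    for ip in $(grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' <<< "$line" | sort -u); do
        cc=$(geoiplookup "$ip" | awk -F': ' '{print $2}' | cut -d, -f1)
        # append ",CC" after every occurrence of this IP
        line=${line//"$ip"/"$ip,$cc"}
    done
    printf '%s\n' "$line"
done < log.csv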
Conclusion:
No WORDCHARS alternative in Bash; where C-w stops deleting can't be configured.
mysql depends on editline, which is customizable with ~/.editrc.
redis-cli depends on linenoise, which deletes the whole word without considering : or -.
In zsh, WORDCHARS controls the behavior of C-w when deleting a word. Is there any alternative in readline?
I've recently noticed that the behavior of C-w in mysql/redis-cli differs from that in Bash, even though I thought both depend on readline.
Take the string foo:bar as an example: only bar is deleted by C-w in Bash, while in mysql/redis-cli the whole word foo:bar is deleted.
How do I control this behavior?
There are two commands that do a backward kill word:
backward-kill-word
unix-word-rubout
backward-kill-word deletes bar,
unix-word-rubout deletes foo:bar
Run the following command to find out what C-w is bound to:
bind -P | grep C-w
It seems bash doesn't have WORDCHARS like zsh does.
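If you do want to change what C-w does, here are a couple of hedged examples (the editline binding is from memory; check editrc(5)):
# ~/.inputrc (readline, used by bash): make C-w stop at characters like ':' and '-'
"\C-w": backward-kill-word
# Note: the terminal driver usually owns C-w (stty werase), so you may need
# `stty werase undef` in your shell startup for the binding to take effect.

# ~/.editrc (editline, used by mysql):
bind "^W" ed-delete-prev-word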
I have a MySQL dump file over 1 terabyte in size. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo, but I'm kind of disappointed I did. I created a regex CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB) to extract the CREATE TABLE clauses, and that seems to be working well when testing in Notepad++.
However, the ETA of extracting all instances is over 3 hours, and I cannot even be sure that it is doing it correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and the output file came out blank. I created a smaller subset of data and the procedure is still not working. It works in regex testers, but grep is not liking it and yields empty output. Here is the command I'm running. I'd provide the sample, but I don't want to breach confidentiality for my client. It's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems not to match across the newlines right after the CREATE\s+TABLE part.
You can use Perl for this task... this should be really fast.
Perl's .. (range) operator is stateful - it remembers state between evaluations.
What this means is: if your table definition starts with CREATE TABLE and ends with something like ENGINE=InnoDB DEFAULT CHARSET=utf8; then the command below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you this also:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you a progress bar, speed, and ETA.
You can use the following:
grep -ioP "^CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt > output.txt
If you can run mysqldump again, simply add --no-data.
Got it! grep does not support matching across multiple lines. I found this question helpful and ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt
Since I've hit so many startup errors, I decided to analyze the MySQL startup shell script, but there is a code fragment I can't understand clearly.
version:
mysql Ver 14.14 Distrib 5.5.43, for osx10.8 (i386) using readline 5.1
368 #
369 # First, try to find BASEDIR and ledir (where mysqld is)
370 #
372 if echo '/usr/local/mysql/share' | grep '^/usr/local/mysql' > /dev/null
373 then
374 relpkgdata=`echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'`
375 else
376 # pkgdatadir is not relative to prefix
377 relpkgdata='/usr/local/mysql/share'
378 fi
What's the purpose of line 372? It seems a little weird.
Any help will be appreciated.
At first glance, this is very strange indeed... but here's a solution to this mystery.
372: if echo '/usr/local/mysql/share' | grep '^/usr/local/mysql' > /dev/null
373: then
grep returns true if it matches and false if it doesn't, so this is testing whether the string /usr/local/mysql/share begins with (^) /usr/local/mysql. The output goes to /dev/null because we don't need to see it; we just want the result of the comparison.
"Well," you interject, "that's obvious enough. The question is why?" Stick with me.
If it matches:
374: relpkgdata=`echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'`
Beginning with /usr/local/mysql/share, strip off the beginning /usr/local/mysql, then strip off the beginning / then prepend ./.
So /usr/local/mysql/share becomes ./share.
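You can check that in a shell:
$ echo '/usr/local/mysql/share' | sed -e 's,^/usr/local/mysql,,' -e 's,^/,,' -e 's,^,./,'
./share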
Otherwise, use the string /usr/local/mysql/share.
375: else
376: # pkgdatadir is not relative to prefix
377: relpkgdata='/usr/local/mysql/share'
"That's all fine, too," I hear you say, "but why go through all these gyrations to (apparently) compare and massage two fixed literal strings?? We already know the answer, so what's up with all the tests and substitution?"
It's a fair question.
My first suspicion was that there was some sort of magic bash hackery going on that I didn't recognize, but no, this code is really all too simple to be something along those lines.
My second suspicion, since this is notably absent from MySQL 5.0.96 (which I am not running but keep on hand for reference), was that this was an abandoned attempt to introduce some new magical behavior into mysqld_safe which was never finished and replaced with actual variables, the testing and massaging of which would have made a lot more sense than doing the same thing to literal strings.
But, no. When the only tool you have is a hammer, everything looks like a nail. What this is, is an example of doing something simple... the hard way. At least that's what it looks like to me. There actually is a somewhat rational explanation. To find the answer, you have to look into the source code (not binary) distribution.
MySQL has a lot of "hard-coded" defaults, and this turns out to be an example of one of them.
In the source file scripts/mysqld_safe.sh, the snippet above looks very different:
if echo '#pkgdatadir#' | grep '^#prefix#' > /dev/null
then
relpkgdata=`echo '#pkgdatadir#' | sed -e 's,^#prefix#,,' -e 's,^/,,' -e 's,^,./,'`
else
# pkgdatadir is not relative to prefix
relpkgdata='#pkgdatadir#'
fi
Ah, source munging. Pattern substitution.
When you're compiling MySQL from source, the file scripts/Makefile contains instructions that use sed to replace placeholders like #prefix# and #pkgdatadir# with the literal values. The same thing, of course, happens when Oracle or the Linux distro maintainers compile their binary distribution from source. These paths get hard-coded into many, many, many places in the code, including this script... resulting in the otherwise incomprehensible comparison of two literal strings that somebody should already have known the answer to.
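Conceptually, the build step does something like this (illustrative only; not the actual Makefile rule):
sed -e 's,#prefix#,/usr/local/mysql,g' \
    -e 's,#pkgdatadir#,/usr/local/mysql/share,g' \
    scripts/mysqld_safe.sh > mysqld_safe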
Instead of testing at build time whether one path is an anchored substring of the other (and whether the relpkgdata value should therefore be expressed relative to the current directory), and modifying this script accordingly, that logical test is deferred until runtime, comparing two literals that were substituted in for their placeholders at build time.
I've gone into this amount of detail not because it will help you troubleshoot (I suspect it won't), but because it was just bizarre enough to warrant some further investigation.
If you are having difficulty getting MySQL Server running... well, you shouldn't be, because it's a well-established system and it should work. If /bin/sh on your system isn't a symlink to /bin/bash, you might want to change mysqld_safe's shebang line from #!/bin/sh to #!/bin/bash, but beyond that, I suspect you are sniffing down the wrong rabbit hole by looking at mysqld_safe to get to the bottom of your issue. As convoluted as mysqld_safe is, it can't be said that it isn't time-tested. As they say, "the problem is somewhere else."
If I may, I'll suggest that you familiarize yourself with some of our other communities where you're likely to find the answer you need, particularly Ask Ubuntu, Super User, Server Fault, and Database Administrators. Familiarize yourself with each site's community, scope, and the level of existing expertise that each community expects on the part of those who ask questions there, and search the sites for the specific problem you're encountering. It's very likely someone has seen it and we've fixed it on one of them, if not here on SO.
I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parsers that can handle that without failing, or without taking days. Furthermore, the fact that the text content is largely HTML but not necessarily valid HTML means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
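(For what it's worth, both GNU and BSD sort spill to temporary files rather than buffering everything in memory, so a follow-up pass along these lines should be safe; the output filename is just an example:)
sort -u extrLinks1 > extrLinks1.uniq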
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 kB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "((src)|(href))\b*=\b*[\"'][\w://\.]+[\"']"
That should be super fast, no?