I have a MySQL dump file that is over 1 terabyte in size. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo, but I'm kind of disappointed I did. I created the regex CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB) to extract the CREATE TABLE clauses, and it seems to work well when I test it in Notepad++.
However, the ETA of extracting all instances is over 3 hours, and I cannot even be sure that it is doing it correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and the output file came out blank. I created a smaller subset of the data and the procedure is still not working. The regex works in regex testers, but grep does not like it and yields an empty output. Here is the command I'm running. I'd provide a sample, but I don't want to breach my client's confidentiality; it's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems to fail to match as soon as there is a newline right after the CREATE\s+TABLE part.
You can use Perl for this task... this should be really fast.
Perl's .. (range) operator is stateful - it remembers state between evaluations.
What this means is: if your table definition starts with CREATE TABLE and ends with something like ENGINE=InnoDB DEFAULT CHARSET=utf8;, then the one-liner below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you this also:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you a progress bar, speed, and ETA.
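If you want to sanity-check the behaviour on a small slice before committing to the full 1 TB run, something like this should work (the sample size and file names are only placeholders):
head -c 100000000 INPUT_FILE.sql > sample.sql
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' sample.sql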
You can use the following:
grep -ioP "^CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt > output.txt
If you can run mysqldump again, simply add --no-data.
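For example (a rough sketch; substitute your own credentials and database name):
mysqldump --no-data -u user -p your_database > schema_only.sql
That skips the row data and dumps only the schema statements, including the CREATE TABLE definitions.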
Got it! grep does not support matching across multiple lines. I found this question helpful, and I ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt
I am trying to update a simple JSON file (it consists of one object with several key/value pairs), and I am using the same command yet getting different results (sometimes even having the whole JSON wiped by the 2nd command). The command I am trying is:
cat ~/Desktop/config.json | jq '.Option = "klay 10"' | tee ~/Desktop/config.json
This command perfectly replaces the value of the minerOptions key with "klay 10", my intended output.
Then, I try to run the same process on the newly updated file (just the value changed for that one key) and only get an interactive terminal with no result. ps unfortunately isn't helpful in showing what's going on. This is what I do after getting that first command to perfectly change the value of the key:
cat ~/Desktop/config.json | jq ‘.othOptions = "-epool etc-eu1.nanopool.org:14324 -ewal 0xc63c1e59c54ca935bd491ac68fe9a7f1139bdbc0 -mode 1"' | tee ~/Desktop/config.json
which I would have expected to replace the othOptions key's value with the assigned string, just as the first command did. I tried sending stdout directly to the file, but got no result there either. I even tried piping one more time, creating a temp file and then moving it over the original. All of these, unlike the first, essentially identical command, just show a > prompt and produce absolutely zero output; when I quit the process, the file holds the same value as before, not the new one.
What am I missing here that causes the same command with just a different input to behave differently? The key in the second command comes right after the first and has an identical structure; it's not creating an object or anything, just a key/value pair like the first. I thought it could be tee, but any other approach, like redirecting stdout to a file, produces the same constant > prompt waiting for more input.
I genuinely looked everywhere I could online for why this could be happening before resorting to SE; it's giving me such a headache for what I thought should be simple.
As @GordonDavisson pointed out, using tee to overwrite the input file is a well-known recipe for disaster (see e.g. the jq FAQ). If you absolutely, positively want to overwrite the file unconditionally, then you might want to consider using sponge, as in
jq ... config.json | sponge config.json
or more safely:
cp -p config.json config.json.bak && jq ... config.json | sponge config.json
For further details about this and other options, search for ‘sponge’ in the FAQ.
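If sponge isn't available, the usual alternative (again only a sketch, adapting the command from the question) is to write to a temporary file and then move it into place:
jq '.Option = "klay 10"' config.json > config.json.tmp && mv config.json.tmp config.json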
So the thing that makes this whole question hard is that I am working in a bash shell environment. I am parsing a large amount of data that is all located in text files in a set of directories. The environment I am working in does not have a GUI; it is just the shell, and I am executing the commands from the shell through mysql; I am not logged into mysql interactively.
I am a partner on a project; the main part is a bash script that searches for information and writes it into text files in several directories. My part parses out the needed data and inserts it into the database.
I run my main loop through a shell script. It loops through a set of directories and searches for the .txt files in each. I then pass the information to my procedure, with something like the below.
NOTE: I am not an expert in bash and have just started learning.
mysql -u user -p'mypassword' --database=dbname <<EOF
call Procedure_Name("`cat ${textfile}`");
EOF
Since I am working in MySQL and bash only, I cannot use another language to make my life easier, so I mostly use SUBSTRING_INDEX. An illustration of the procedure is shown below.
DELIMITER $$
CREATE PROCEDURE Procedure_name(textfile LONGTEXT)
BEGIN
DECLARE data LONGTEXT;
SET data = SUBSTRING_INDEX(SUBSTRING_INDEX(textfile,"(+++)",1),"(++)",-1);
INSERT INTO Table_Name (column) values (data);
END$$
DELIMITER ;
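To illustrate what that nested SUBSTRING_INDEX call does: it peels out the text between the last "(++)" marker and the first "(+++)" marker. A quick check from the shell with a made-up sample string:
mysql -u user -p'mypassword' -e 'SELECT SUBSTRING_INDEX(SUBSTRING_INDEX("head(++)middle(+++)tail", "(+++)", 1), "(++)", -1);'
# prints: middle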
The text file has a clean structure that allows me to cut it up, but the problem I am having is that special characters inside the text file are causing my procedure to throw an error. I believe they are being treated as escape characters, and I need a way around this. Just about any character could appear in the data I am parsing, so I need a way to ignore these characters in the procedure or to keep them from affecting my process.
I tried looking into mysql_real_escape_string(); however, the parameters were hard to figure out, and it looks like it only works in PHP, but I am not sure. So I would like to do something at the beginning of my procedure, maybe inserting \"'s or something into the string, so that my procedure does not fail.
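Something along these lines is roughly what I mean, though it is only an untested sketch (backslash-escaping backslashes and double quotes in the shell before making the call):
escaped=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "${textfile}")
mysql -u user -p'mypassword' --database=dbname <<EOF
call Procedure_Name("${escaped}");
EOF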
Also, these text files range from 16k to 11000k in size, so I need something that can handle that. My process works sometimes but gets caught up on a lot of the data, and my searching has not helped me at all. So any help would be greatly appreciated!
And thanks to all for reading this long description. Normally I can find my answer or piece it together from other questions, but I had no luck this time, so I figured it was about time to make an account and ask something.
Your question is really too broad, but here is an example of what I mean
a script file:
#!/bin/bash
case $# in
1 ) inFile=$1 ;;
* ) echo "usage: myLoader infile"; exit 1 ;;
esac
awk 'BEGIN {
FS="\t"'; OFS="|"
}
{
sub(/badChars/, "", $0); sub(/otherBads/, "", $0) ; # .... as many as needed
# but be careful, it's easy to delete stuff with too broad a brush.
print $1, $2, $5, $4, $9
}' $inFile > $inFile.psv
bcp -in -f ${formatFile:-formatFile} $inFile.psv
Note how awk makes it very easy, by repeating sub(...) commands, to remove any "bad chars" you may have in your source data AND to reorganize the order of the columns in your data. Each $n is the value of the numbered column on a line, so $1, $2, $5 skips fields $3 and $4, for example.
The OFS is set to the pipe char, making it easy to see in your output where exactly the field boundaries are AND if there are any leading or trailing whitespace characters that may be throwing off your load.
The > $inFile.psv keeps your original file, just in case you make a mistake in the awk script.
If you create really small test data files, you can eliminate saving to a file and just let the output go to the screen, editing until you get it right.
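For instance, something like this (reusing the same placeholder patterns as above) prints the reformatted rows straight to the terminal:
head -20 "$inFile" | awk 'BEGIN { FS="\t"; OFS="|" } { sub(/badChars/, "", $0); print $1, $2, $5, $4, $9 }'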
You'll have to find out exactly how MySQL's equivalent of bcp works. I'm pretty sure I've seen postings here. Either that, or post a separate question, "I have this pipe-delimited file with 8 columns, how do I load it into my table?".
The reference in my sample code to ${formatFile} is that, hopefully, MySQL's bcp-like command can take a format file that specifies the order and types of the fields to be loaded into a table. Good bcp fmt files allow a fair amount of flexibility, but you'll have to read the man page for that utility AND do some research to understand the scope and constraints of that flexibility.
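For what it's worth, I believe MySQL's counterpart is LOAD DATA INFILE, where the field and line terminators (plus the column list) play the role of a simple format file. A rough sketch, with made-up table and column names, that you'd want to verify against the MySQL docs:
mysql -u user -p'mypassword' --database=dbname <<EOF
LOAD DATA LOCAL INFILE '$inFile.psv'
INTO TABLE Table_Name
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(col1, col2, col3, col4, col5);
EOF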
Going forward, you should post individual questions like, "I've tried x using lang Y to filter Z characters. Right now I'm getting output z, What am I doing wrong?"
Divide and conquer. There is no easy way. Reset those customer and boss expectations, you're learning something new, and it will take a little study to get it right. Good luck.
IHTH
I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parsers that can handle that without failing or taking days. Furthermore, the fact that the text content is largely HTML, but not necessarily valid HTML, means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plaintext).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
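If I do end up de-duplicating, it will be a separate pass afterwards, something like the following (the output name is just an example):
sort -u extrLinks1 > extrLinks1.uniq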
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example:
head -c 100000 myFile | grep -E "(src|href)[[:space:]]*=[[:space:]]*[\"'][[:alnum:]_:/.]+[\"']"
That should be super fast, no?
I can't find anything about this from searching here.
I use mysql on the command line at work, and I work with fairly large tables, so I set the mysql pager to allow a more readable result when I run a query that returns thousands of rows. I use the command below to set the pager.
\P less -Sin
This suits my needs, but it has left me wondering whether there are any other pager styles that can be used with mysql on the command line.
The MySQL client just passes its output to whatever command you specify with \P (for "Pager").
-Sin are command-line switches for the program less. From man less:
-i Causes searches to ignore case
-n Suppresses line numbers
-S Causes lines longer than the screen width to be chopped rather than folded.
For more options of the MySQL client, see reference.
mysql> pager less
PAGER set to 'less'
You might want to try pspg:
Unix pager designed for work with tables. Designed for PostgreSQL, but MySQL is supported too.
Main target
possibility to freeze first few rows, first few columns
possibility to use fancy colors - like mcview or FoxPro
This post is old, but still very helpful.
You can set the pager to whatever you want, including a script that parses all output before feeding it back to you. The examples there include using an add-on tool that makes EXPLAIN output more readable.
Also note that, to turn off this functionality and return to normal stdout, the command is nopager.
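For example, in the client that looks something like this:
mysql> nopager
PAGER set to stdout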
If you don't like less you can use more :)
\P more
This is weird and I'm not sure who the culprit really is.
I'm doing some scripting on FreeBSD (6.2) which makes extensive use of the following bash-ism:
do_something <(mysql --skip-column-names -B -e 'select ... from ... where ...;')
... where "do_something is a somewhat crufty utility (in Perl) that won't read from a pipeline. If I use a regular file it works fine. My bash script using things like exec 4< <(...) with these sorts of queries (following by loops of the form while read x y z <&4; do ... never seem to have any issues.
However, Perl (5.8.x) seems to periodically block (apparently forever). I tried replacing the chomp(my $data = <MYDATA>); with a routine that used sysread, and I wrote some test cases in Python for comparison. These seem to block far less often than the idiomatic Perl code, but they still do it sometimes. (The Python code using f.read() or os.read(f.fileno()...) seems to behave about the same with respect to this issue.)
I've tried reproducing the issue using ... <(cat ...) (where I'm cat-ing the regular file), and that never seems to reproduce the stall.
I've glanced at some ktrace/kdump data ... but I'm far more familiar with Linux strace or even Solaris truss ... so I haven't figured out what's going from there yet, either.
I suppose we can mostly rule out Perl, because I've reproduced the same issue using Python ... I don't see how the bash could be doing anything wrong here (it's just creating a named pipe in /var/tmp/sh-np-xxx and wiring the processes up to that).
What could the mysql shell/utility be doing that might cause this? I don't think I've seen it from anything else (such as cat or dd). I haven't tested this scenario under Linux ... but I've used <(...) (process substitution) for years under Linux and don't recall ever seeing this.
Is it a FreeBSD issue?
Sure I can work around the issue using temporary files ... but I'd sure rather understand why it's doing this (and avoid some of the races and clean-up messiness that temporary files entail).
Any suggestions?
The big difference between operating on the output of mysql and directly on a file is timing. When the perl process is stalled, the big question is: "why is it not making forward progress"? You can use the "l" option to ps to see the wait channel for the perl process; that way you can see if it blocked on a read, or if something else is going on. If it is really blocked on pipe input, I expect the MWCHAN entry for perl to be "piperd".
The same information would be interesting for the mysql process.
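On FreeBSD that could look something like the following (the process names are just the obvious candidates to filter for):
ps -axl | egrep 'perl|mysql'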
What does your Python test code look like?
Another way of writing this while avoiding the bashism is this; that would allow you to rule out bash:
mysql --skip-column-names -B -e 'select ... from ... where ...;' | do_something /dev/stdin
Other interesting questions:
Does the --unbuffered option to mysql change anything?
Does piping the mysql output through dd change anything? (e.g. perlscript <(mysql ... | dd))
Summary: Need more information.