Python - iterate and replace (IPv4) with (IPv4 plus GeoIP) - language-agnostic

I am looking for an elegant way to parse a text file (i.e. a log file containing source and destination IPs and lots of other data) keeping each line intact, and replacing all IPv4 addresses with the same IP followed by a comma and the GeoIP country code of that IP.
I have tried doing this in bash, sed, perl, and python. I tried a hundred perl one-liners and never quite got it, because a substitution like s/original/replacement/g won't execute a GeoIP lookup in the replacement field. For example:
perl -pe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/($1,system(geoiplookup $1))/g' < log.csv
results in:
"srcip=(110.110.110.110,system(geoiplookup 110.110.110.110))"
instead of executing geoiplookup.
I've tried this with backticks as well as exec, lots of different punctuation, with the same result.
In Python I tried some code that looks like:
rexp_ip = r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
repl = { rexp_ip: rexp_ip+".test" }
---
for line in i:
    print(re.sub(rexp_ip, lambda m: str(repl.get(m.group())), line))
It seems pretty close but I'm not sure whether I'm on the right track here.
I would be open to bash, sed, awk, perl, python, or any other solution.
This seems fairly simple to me and I may be over-thinking it!
I am guessing I'm not the first person who has tried this and maybe I'm 'reinventing the wheel' here.
Any insight would be appreciated.

I may have solved my own problem using perl with the /e switch:
$ perl -lpe 's/([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})/(`printf $1;geoiplookup $1`)/eg' < log.csv
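For anyone who would rather do this in Python, here is a minimal sketch of the same idea. It assumes the geoiplookup CLI is installed and prints something like "GeoIP Country Edition: US, United States"; the file name and the country-code parsing are only illustrative.

import re
import subprocess

rexp_ip = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def add_country(m):
    ip = m.group(0)
    # Run geoiplookup for each match (assumes the GeoIP CLI is on the PATH).
    out = subprocess.run(["geoiplookup", ip], capture_output=True, text=True).stdout
    cc = re.search(r":\s*([A-Z]{2}),", out)
    return ip + "," + (cc.group(1) if cc else "??")

with open("log.csv") as f:
    for line in f:
        print(rexp_ip.sub(add_country, line), end="")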

Related

Tab Delimited to CSV

For a data mining project I need to convert 80 tab-delimited files (100 MB each) to CSV files. Is anybody aware of tools that could be handy in this case?
Download python: https://www.python.org/downloads/
Install it.
And run a script similar to the following.
Save the following as convert_tsv_to_csv.py, or anything else ending in .py:
import csv
with open('C:\\path\\to\\file', 'r', newline='') as f:
    tab_file = csv.reader(f, dialect=csv.excel_tab)
    with open('C:\\path\\to\\outfile.csv', 'w', newline='') as g:
        comma_file = csv.writer(g, dialect=csv.excel)
        for row in tab_file:
            comma_file.writerow(row)
Change the paths and run it like: python convert_tsv_to_csv.py
The basic idea:
If the files are big, read them line by line.
Learn your basic tools.
On any UNIX/Linux/OS X system, each of the following commands should do the trick:
sed -i -e 's/\t/,/g' *.csv
perl -i -p -e 's/\t/,/g' *.csv
These perform the basic tab-to-comma substitution. They won't take care of things like quoting and escaping if your data contains columns with a tab or comma in them, and they won't change the file name for you! Note that the syntax of sed and perl is very similar... -i is in-place editing, -e is execute a command, s/// is the syntax for regular expression substitutions, etc.
Either way, your basic unix tools for this job are:
- extremely fast (the "stream editor" sed is well optimized, low-level C code)
- handy (just some 10 keypresses!)
- easy to use, once you've learned the basics (i.e. read the manual)

CSV transformation

I have been looking for a way to reformat a CSV (pipe-separated) file with some 'if' conditions. I'm pretty sure this can be done in PHP (strpos and if statements) or using XSLT, but I wanted to know whether this is the best/easiest way to do it before I go and learn my way around a new language. Here is a small example of the kind of thing I'm trying to achieve (the real file is about 25000 lines; does this change the answer?):
99407350|Math Book #13 (Random Information)|AB Collings|http:www.abc.com/ABC
497790366|English Book|Harold Herbert|http:www.abc.com/HH
Transform to this:
99407350|Math Book|#13|AB Collings|http:www.abc.com/ABC
497790366|English Book||Harold Herbert|http:www.abc.com/HH
Any advice about which direction I need to look in would be great.
PHP provides str_getcsv() (PHP 5.3+) and fgetcsv() (PHP 4 and 5) for this, so if you are working in a PHP environment, use that. See e.g. http://www.php.net/manual/en/function.fgetcsv.php
If you do something yourself, remember to cope with "...|..." and/or \| to have | inside a field. Or test to make sure it can't happen - e.g. check the code that exports the database to CSV if that's what's happening.
Note also - on Unix / Solaris / Linux / OS X systems,
awk -F '|' '(NF != 9)' yourfile.csv | wc
will count the number of lines with other than 9 fields; if you are certain | never occurs except as a field delimiter, awk is a perfectly fine language for this too, e.g. with
awk -F '|' '{ gsub(/ [(].*[)]/, "", $1); print}' yourfile.csv
Here, [(] matches ( in a way that works across different versions of awk, and same for [)].
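If you are certain the data is that clean, a small Python sketch along the following lines would produce the transform shown in the example; the field layout (the title is always the second field) and the file name are assumptions taken from the sample lines above.

import re

with open("yourfile.csv") as f:
    for line in f:
        fields = line.rstrip("\n").split("|")
        title = re.sub(r"\s*\([^)]*\)", "", fields[1])   # drop "(Random Information)"-style comments
        m = re.search(r"\s*(#\S+)\s*$", title)           # a trailing "#13" becomes its own field
        number = m.group(1) if m else ""
        title = title[:m.start()] if m else title
        print("|".join([fields[0], title.strip(), number] + fields[2:]))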

Cannot properly decode html entities in perl

I am having an issue which I am unable to solve after spending the last 10 hours searching around the internet for an answer.
I have some data in this format
??E??0??<?20120529184453+0200?20120529184453+0200???G0E?5?=20111213T103134000-136.225.6.103-30365316-1448169323, ver: 12??W??tP?2??
??|?????
??:o?????tP???B#?????B#??????)0????
49471010550??? ???tP???3??<????????????????
I have some PHP code, not written by me, which just runs html_entity_decode on that and returns the correct results.
When I try running Perl's decode_entities I get a completely different result. After some debugging it seems to me that PHP is "properly" replacing what seem to be invalid entities, such as &#0; or &#8;, with their ASCII counterparts, namely NULL and backspace for the 2 cases mentioned.
Perl on the other hand does not seem to decode those "invalid" entities and leaves them alone, which later on screws up the result (which goes through unpack or, in PHP's case, bin2hex, and fails because rather than unpacking null to 00 it will unpack each individual character of &#0;).
I have tried everything I can think of include running the following substitution in perl after running decode_entities
$var =~ s/&#(\d+);/chr($1)/g
however that does not work at all.
This is driving me mad and I would like to have this done in perl rather than PHP. I really hope I don't have to write 1000 pattern-matching lines in perl to cover all possible entities and numbers.
Anybody that has an idea how to go about this problem without resorting to having to parse PHPs entire html_entity_decode function into perl or writing endless lines of pattern matching?
You're almost there. Instead of
$var =~ s/&#(\d+);/chr($1)/g
say
$var =~ s/&#(\d+);/chr($1)/ge
The /e modifier instructs Perl to 'e'valuate the replacement pattern.
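For what it's worth, the same trick in Python is to pass a callable to re.sub, which plays the role of Perl's /e; a tiny sketch with made-up sample input:

import re

var = "A&#0;B&#8;C"  # sample string containing numeric entities
# The callable is evaluated for each match, just like Perl's chr($1) under /e.
var = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), var)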

Extracting URLs from large text/HTML files

I have a lot of text that I need to process for valid URLs.
The input is vaguely HTML-ish, in that it's mostly HTML. However, it's not really valid HTML.
I've been trying to do it with regex, and having issues.
Before you say (or possibly scream - I've read the other HTML + regex questions) "use a parser", there is one thing you need to consider:
The files I am working with are about 5 GB in size
I don't know of any parsers that can handle that without failing or taking days. Furthermore, the text content is largely HTML but not necessarily valid HTML, which means it would require a very tolerant parser. Lastly, not all links are necessarily in <a> tags (some may be just plain text).
Given that I don't really care about document structure, are there any better alternatives WRT extracting links?
Right now I'm using the regex:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))) (in grep -E)
but even with that, I gave up after letting it run for about 3 hours.
Are there significant differences in Regex engine performance? I'm using MacOS's command-line grep. If there are other compatible implementations with better performance, that might be an option.
I don't care too much about language/platform, though MacOS/command line would be nice.
I wound up stringing a couple of grep commands together:
pv -cN source allContent | grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )" | grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)" | pv -cN out > extrLinks1
I used pv to give me a progress indicator.
grep -oP "(?:\"([^\"' ]*?)\")|(?:'([^\"' ]*?)')|(?:([^\"' ]*?) )"
Pulls out anything that looks like a word or quoted text, and has no spaces.
grep -E "(http)|(www)|(\.com)|(\.net)|(\.to)|(\.cc)|(\.info)|(\.org)"
Filters the output for anything that looks like it could be a URL.
Finally,
pv -cN out > extrLinks1
Outputs it to a file, and gives a nice activity meter.
I'll probably push the generated file through sort -u to remove duplicate entries, but I didn't want to string that on the end because it would add another layer of complexity, and I'm pretty sure that sort will try to buffer the whole file, which could cause a crash.
Anyways, as it's running right now, it looks like it's going to take about 40 minutes. I didn't know about pv before. It's a really cool utility!
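If the shell pipeline ever gets unwieldy, a rough Python equivalent that streams the file line by line might look like the sketch below; it reuses the same keyword filter and file names as above, but the tokenising is only an approximation of what the greps do.

import re

# Tokens: runs of characters with no spaces or quotes, roughly what the first grep pulls out.
token = re.compile(r"""[^"'\s]+""")
# Crude URL filter, same keywords as the second grep.
urlish = re.compile(r"http|www|\.com|\.net|\.to|\.cc|\.info|\.org")

with open("allContent", errors="replace") as src, open("extrLinks1", "w") as out:
    for line in src:
        for candidate in token.findall(line):
            if urlish.search(candidate):
                out.write(candidate + "\n")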
I think you are on the right track, and grep should be able to handle a 5 GB file. Try simplifying your regex to avoid the | operator and so many parentheses. Also, use the head command to grab the first 100 KB before running against the whole file, and chain the greps using pipes to achieve more specificity. For example,
head -c 100000 myFile | grep -E "((src)|(href))\b*=\b*[\"'][\w://\.]+[\"']"
That should be super fast, no?

FreeBSD, MySQL, Perl, bash: intermittent blocking on named pipes?

This is weird and I'm not sure who the culprit really is.
I'm doing some scripting, on FreeBSD (6.2), which makes extensive use of the following bash-ism:
do_something <(mysql --skip-column-names -B -e 'select ... from ... where ...;')
... where "do_something is a somewhat crufty utility (in Perl) that won't read from a pipeline. If I use a regular file it works fine. My bash script using things like exec 4< <(...) with these sorts of queries (following by loops of the form while read x y z <&4; do ... never seem to have any issues.
However, Perl (5.8.x) seems to periodically block (apparently forever). I tried changing out the chomp(my $data = <MYDATA>); with a routine that used sysread and I wrote some test cases in Python for comparison. These seem to block far less often than the idiomatic Perl code, but they still do it sometimes. (The Python code using f.read() or os.read(f.fileno()...) seems to behave about equally in this issue).
I've tried reproducing the issue using ... <(cat ...) (where I'm cat-ing the regular file) and that never seems to reproduce the stall.
I've glanced at some ktrace/kdump data ... but I'm far more familiar with Linux strace or even Solaris truss ... so I haven't figured out what's going on from there yet, either.
I suppose we can mostly rule out Perl, because I've reproduced the same issue using Python ... I don't see how the bash could be doing anything wrong here (it's just creating a named pipe in /var/tmp/sh-np-xxx and wiring the processes up to that).
What could the mysql shell/utility be doing that might cause this? I don't think I've seen it from anything else (such as cat or dd). I haven't tested this scenario under Linux ... but I've used <(...) (process substitution) for years under Linux and don't recall ever seeing this.
Is it a FreeBSD issue?
Sure I can work around the issue using temporary files ... but I'd sure rather understand why it's doing this (and avoid some of the races and clean-up messiness that temporary files entail).
Any suggestions?
The big difference between operating on the output of mysql and directly on a file is timing. When the perl process is stalled, the big question is: "why is it not making forward progress"? You can use the "l" option to ps to see the wait channel for the perl process; that way you can see if it blocked on a read, or if something else is going on. If it is really blocked on pipe input, I expect the MWCHAN entry for perl to be "piperd".
The same information would be interesting for the mysql process.
What does your Python test code look like?
Another way of writing this while avoiding the bashism is this; that would allow you to rule out bash:
mysql --skip-column-names -B -e 'select ... from ... where ...;' | do_something /dev/stdin
Other interesting questions:
Does the --unbuffered option to mysql change anything?
Does piping the mysql output through dd change anything? (e.g. perlscript <(mysql ... | dd))
Summary: Need more information.