Strip_Tags yielding strange result and how to display HTML tags as text - html

Good morning!
I'm trying to scrape a CME web page to pull the table at the bottom to a two dimensional array. (http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500_quotes_settlements_futures.html)
Code is below. Problem is, the var_dump says string(507) but there are only about 300 characters displayed! Three questions:
1) How do I display any hidden tags or characters?
2) Why does it say 507 chars but only about 300 chars displayed?
3) How do I remove whatever characters are hidden?
Thank you for your help!
Here is the code I used:
$EMiniURL = "http://www.cmegroup.com/trading/equity-index/us-index/e-mini-sandp500_quotes_settlements_futures.html";
$EMiniRaw = file_get_contents($EMiniURL);
$EMiniRaw = strip_tags($EMiniRaw);
$StartChr = strpos($EMiniRaw, "About This Report") + strlen("About This Report");
$EndChr = strpos($EMiniRaw, "Total", $StartChr) - strlen("Total");
$TotalLen = $EndChr - $StartChr;
$RawStr = substr ($EMiniRaw, $StartChr, $TotalLen);
var_dump ($RawStr);
And here is the var_dump result:
string(532) " DEC 14 1938.50 1964.50 1935.75 1959.75 +21.75 1960.25 1,551,405 2,751,445 MAR 15 1931.00 1956.25B 1928.00A 1952.00A +21.75 1952.50 2,244 5,495 JUN 15 1920.25 1949.00B 1920.25 1945.00A +22.00 1945.50 88 350 SEP 15 1925.00 1937.75B 1925.00 1937.75B +21.75 1938.75 6 204 DEC 15 1935.75 1935.75 1935.75 1935.75 +22.00 1932.75 1 212 "

It turns out the hidden characters were newlines and tabs. PHP's strip_tags only removes the markup, not the whitespace between the tags, so I used the following to collapse each run of whitespace into a single space:
$NewStr = preg_replace("/\s+/", " ", $OldStr);
Just saying ;)
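For anyone who wants to check the same thing outside PHP, here is a minimal Python sketch of the two steps: make the hidden characters visible, then collapse them. The sample string and the helper name collapse_whitespace are made up for illustration; they are not taken from the CME page.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)

# Hypothetical fragment resembling tag-stripped table text.
raw = "DEC 14\n\t\t1938.50\r\n\t 1964.50"

print(repr(raw))                 # repr() writes \n, \t and \r out explicitly
print(len(raw))                  # the length counts the hidden characters too
print(collapse_whitespace(raw))
```

The gap between the reported length and what the browser shows comes from those whitespace characters: they count towards the string length, but HTML rendering collapses them.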

Related

Decoding a hex file

I would like to use a web service that delivers a binary file with some data. I know the expected result, but I don't know how to decode this binary with a script.
Here is my binary :
https://pastebin.com/3vnM8CVk
0a39 0a06 3939 3831 3438 1206 4467 616d
6178 1a0b 6361 7264 6963 6f6e 5f33 3222
0d54 6865 204f 6c64 2047 7561 7264 2a02
....
Some parts are in ASCII, so they are easy to spot: at the end of the file you get the vehicle name in ASCII followed by some data. It should be kill/victory/battle/XP/money data, but I don't understand how to decode these hex values. I tried comparing two vehicles that have the same number of kills, but I don't see any match.
Is there a way to decode this data?
Thanks :)
Hello guys, after one year I started looking for a solution again. Here is the packet structure I have guessed (I still don't know what the parts between [ ] are for):
[52 37 08 01 10] 4E [18] EA [01 25] AB AA AA 3E [28] D4 [01 30] EC [01 38] 88 01 [40] 91 05 [48] 9F CA 22 [50] F5 C2 9A 02 [5A 12]
The byte groups outside [ ], in order:
4E = Victories
EA 01 = Battles
AB AA AA 3E = Victory Ratio
D4 01 = Deaths
EC 01 = Respawns
88 01 = Air target
91 05 = Ground target
9F CA 22 = Xp
F5 C2 9A 02 = Money earned
So here is the result :
Victory : 78
Battles : 234
Victory Ratio : ? (should be around 33%)
Deaths : 212
Respawns : 236
Air Target : 136
Ground Target : 657
Xp : ? (should be around 566.56k)
Money : ? (should be around 4.63M)
Is there a special way to calculate the value of a long hex sequence like this?
F5 C2 9A 02 (should be around 4.63M)
To tell you a bit more: I know the result, but I don't know how to calculate it from the hex bytes in the packet.
If I look at a packet with a small amount of money or XP, the value fits in a single byte:
[52 1E 08 01 10] 01 [18] [01 25] 00 00 80 3F [28] 01 [30] 01 [48] 24 [50] 6E [5A 09]
6E = 110 Money earned
24 = 36 XP earned
Another example:
[52 21 08 01 10] 02 [18] 03 [25] AB AA 2A 3F [28] 02 [30] 03 [40] 01 [48] 78 [50] C7 08 [5A 09]
XP earned = hex 78 = 120
Money earned = hex C7 08 = 705
How can C7 08 give 705 decimal?
Here is the full content, just in case. I know how to isolate these parts, so I don't need to decode all of the hex data:
https://pastebin.com/vAKPynNb
What you are asking is essentially how to reverse engineer a binary file. There are already a lot of threads about this on SO:
Reverse engineer a binary dictionary file to extract strings
Tools to help reverse engineer binary file formats
https://reverseengineering.stackexchange.com/questions/3495/what-tools-exist-for-excavating-data-structures-from-flat-binary-files
http://www.iwriteiam.nl/Ha_HTCABFF.html
The takeaway from all of them is that there is no single solution for you; you need to put in the effort to figure it out. There are tools to help, but don't expect a magic-wand tool that hands you the structure and the data.
Any kind of file read is done in text or binary mode with basic file handles, and some languages offer typed reads of int, float and so on, or arrays of them.
The complexity behind these reads is almost always hidden from ordinary users, so there is more to learn when it comes to reading and writing data structures.
In this case the key concepts are offset and seek: work out the offset of the data, seek to it, and read. Once the data is read, it must also be converted to a suitable data type.
The following code shows the basics of these operations: it writes data, then reads a block back to recover the numbers. It is written in PHP because the OP commented on the question that he uses PHP.
The size of the number block is calculated from these byte sizes and comes to 11: char: 1 byte, short: 2 bytes, int: 4 bytes, float: 4 bytes.
<?php
$filename = "testdata.dat";
$filehandle = fopen($filename, "w+");
$data=["test string","another test string",77,777,77777,7.77];
fwrite($filehandle,$data[0]);
fwrite($filehandle,$data[1]);
$numbers=array_slice($data,2);
fwrite($filehandle,pack("c1s1i1f1",...$numbers));
fwrite($filehandle,"end"); // gives 3 to offset
fclose($filehandle);
$filename = "testdata.dat";
$filehandle = fopen($filename, "rb+");
$offset=filesize($filename)-11-3;
fseek($filehandle,$offset);
$numberblock= fread($filehandle,11);
$numbersback=unpack("c1a/s1b/i1c/f1d",$numberblock);
var_dump($numbersback);
fclose($filehandle);
?>
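The same offset arithmetic can be sketched in Python with the struct module, which makes the 11-byte figure easy to check. This is an illustrative equivalent that works on an in-memory byte string rather than a file; it uses the "<" prefix to pin the layout to little-endian with no padding, which is a choice made here and not something the PHP code does.

```python
import struct

# "<" = little-endian, no padding; b = signed char, h = short, i = int, f = float.
RECORD = struct.Struct("<bhif")

payload = b"test string" + b"another test string"
block = RECORD.pack(77, 777, 77777, 7.77)
data = payload + block + b"end"

# The numbers sit just before the 3-byte trailer, so count back from the end.
offset = len(data) - RECORD.size - len(b"end")
char_v, short_v, int_v, float_v = RECORD.unpack_from(data, offset)

print(RECORD.size)
print(char_v, short_v, int_v, round(float_v, 2))
```

Fixing the byte order and sizes in the format string avoids the machine-dependent behaviour that affects some pack/unpack codes.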
Once this example is understood, the rest is to find the data structure in the requested file. I wrote another example, but it relies on assumptions, and I leave it to readers to work out which assumptions I made. Be careful though: I know nothing about the real structure, so the values will not be correct.
<?php
$filename = "testfile";
$filehandle = fopen($filename, "rb");
$offset=17827-2*41; //filesize minus 2 user area
fseek($filehandle,$offset);
print $user1 = fread($filehandle, 41);echo "<br>";
$user1pr=unpack("s1kill/s1victory/s1battle/s1XP/s1Money/f1Life",$user1);
var_dump($user1pr); echo "<br>";
fseek($filehandle,$offset+41);
print $user2 = fread($filehandle, 41);echo "<br>";
$user2pr=unpack("s1kill/s1victory/s1battle/i1XP/i1Money/f1Life",$user2);
var_dump($user2pr); echo "<br>";
echo "<br><br>";
$repackeduser2=pack("s3i2f1",$user2pr["kill"],$user2pr["victory"],
$user2pr["battle"],$user2pr["XP"],$user2pr["Money"],
$user2pr["Life"]
);
print $user2 . "<br>" .$repackeduser2;
print "<br>3*s1=6bytes, 2*i=6bytes, 1*f=*bytes (machine dependent)<br>";
print pack("s1",$user2pr["kill"]) ."<br>";
print pack("s1",$user2pr["victory"]) ."<br>";
print pack("s1",$user2pr["battle"]) ."<br>";
print pack("i1",$user2pr["XP"]) ."<br>";
print pack("i1",$user2pr["Money"]) ."<br>";
print pack("f1",$user2pr["Life"]) ."<br>";
fclose($filehandle);
?>
PS: pack and unpack use machine-dependent sizes for some data types such as int and float, so be careful when working with them. Read the official PHP pack and unpack manuals.
This looks more like the hexdump of a binary file. Some methods of converting hex to strings resulted in the same scrambled output. Only some lines are readable like this...
Dgamaxcardicon_32" The Old Guard
As @Tarun Lalwani said, you would have to know the structure of this data to get it in plaintext.
If you have access to the raw binary, you could try using strings https://serverfault.com/questions/51477/linux-command-to-find-strings-in-binary-or-non-ascii-file
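One observation that may help, offered as an inference from the sample bytes and not as a confirmed description of the format: the multi-byte values in the question are consistent with little-endian base-128 varints (the integer encoding used by Protocol Buffers), and the four-byte group after 25 is consistent with a little-endian 32-bit float. A minimal Python sketch under that assumption; read_varint is an illustrative helper name, not part of any library:

```python
import struct

def read_varint(buf, pos=0):
    """Decode one base-128 varint starting at buf[pos].

    Each byte carries 7 data bits, least significant group first;
    a set high bit means another byte follows.
    Returns (value, position of the next unread byte).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

# Byte groups copied from the annotated packet in the question.
samples = {
    "battles": "EA 01",
    "ground targets": "91 05",
    "xp": "9F CA 22",
    "money": "F5 C2 9A 02",
}
decoded = {name: read_varint(bytes.fromhex(h))[0] for name, h in samples.items()}
ratio = struct.unpack("<f", bytes.fromhex("AB AA AA 3E"))[0]

print(decoded)
print(round(ratio, 4))
```

Under this reading EA 01 gives 234, 91 05 gives 657, 9F CA 22 gives 566559 and F5 C2 9A 02 gives 4628853, which line up with the expected battles, ground targets, XP (about 566.56k) and money (about 4.63M), and the float comes out at roughly one third. C7 08 gives 1095, not the 705 quoted in the question, so that sample is worth rechecking.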

Highlight a matched pattern in a DNA sequence with HTML markup using Perl

I am working on generating an HTML page using a CGI script in Perl.
I need to filter some sequences in order to check whether they contain a specific pattern; if they contain it, I need to print those sequences on my page with 50 bases per line and highlight the pattern in the sequences. My sequences are in a hash called %hash; the keys are the names, the values are the actual sequences.
my %hash2;
foreach my $key (keys %hash) {
if ($hash{$key} =~ s!(aaagg)!<b>$1</b>!) {
$hash2{$key} = $hash{$key}
}
}
foreach my $key (keys %hash2) {
print "<p> <b> $key </b> </p>";
print "<p>$_</p>\n" for unpack '(A50)*', $hash2{$key};
}
This method does the job, but highlighting the pattern "aaagg" this way messes up the unpacking of the lines (unpack '(A50)*'), because the sequences now contain the extra characters of the bold tags, which are included in the unpacking count. Besides making the lines different lengths, this is a big problem when a tag falls across two lines of 50 characters: the tag remains open and everything after it is bold.
The script below uses a single randomly generated DNA sequence of length 243 (generated using http://www.bioinformatics.org/sms2/random_dna.html) and a variable length pattern.
It works by first recording the positions which need to be highlighted instead of changing the sequence string. The highlighting is inserted after the sequence is split into chunks of 50 bases.
The highlighting is done in reverse order to minimize bookkeeping busy work.
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use YAML::XS;
my $PRETTY_WIDTH = 50;
# I am using bold-italic so the highlighting
# is visible on Stackoverflow, but in real
# life, this would be something like:
# my @PRETTY_MARKUP = ('<span class="highlighted-match">', '</span>');
my @PRETTY_MARKUP = ('<b><i>', '</i></b>');
use constant { BAŞ => 0, SON => 1, ROW => 0, COL => 1 };
my $sequence = q{ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgcttttccccgtaatgagggccccatattcaggccgtcgtccggaattgtcttggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatatctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagaccggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct};
my $wanted = 'c..?gg';
my @pos;
while ($sequence =~ /($wanted)/g) {
    push @pos, [ pos($sequence) - length($1), pos($sequence) ];
}
print Dump \@pos;
my @output = unpack "(A$PRETTY_WIDTH)*", $sequence;
print Dump \@output;
while (my $pos = pop @pos) {
    my @rc = map pos_to_rc($_, $PRETTY_WIDTH), @$pos;
    substr($output[ $rc[$_][ROW] ], $rc[$_][COL], 0, $PRETTY_MARKUP[$_]) for SON, BAŞ;
}
print Dump \@output;
sub pos_to_rc {
my $r = int( $_[0] / $_[1] );
my $c = $_[0] - $r * $_[1];
[ $r, $c ];
}
Output:
C:\...\Temp> perl s.pl
---
- - 0
- 4
- - 76
- 80
- - 87
- 91
- - 97
- 102
- - 104
- 108
- - 165
- 170
- - 184
- 188
- - 198
- 202
- - 226
- 231
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
Especially since this turns out to have been a homework assignment, it is now up to you to take this and apply it to all sequences in your hash table.
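For comparison, the same idea (record match positions on the untouched sequence, split into rows, then insert markup from the last match backwards) can be sketched in Python. This is an illustrative sketch, not a translation of the Perl script: the function name highlight and its parameters are made up here, and unlike the script above it closes and reopens the tag at a row boundary so that no tag is left open across two lines.

```python
import re

def highlight(sequence, pattern, width=50, markup=("<b>", "</b>")):
    """Split sequence into rows of `width` characters and wrap regex matches in markup.

    Match positions are taken from the raw sequence, so the tags never
    count towards the row width.
    """
    spans = [m.span() for m in re.finditer(pattern, sequence)]
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    # Process the last match first so earlier offsets in a row stay valid.
    for start, end in reversed(spans):
        first_row, last_row = start // width, (end - 1) // width
        for row in range(last_row, first_row - 1, -1):
            row_start = row * width
            lo = max(start, row_start) - row_start
            hi = min(end, row_start + width) - row_start
            text = rows[row]
            rows[row] = text[:lo] + markup[0] + text[lo:hi] + markup[1] + text[hi:]
    return rows

for line in highlight("ccaaaggttaaaggcc", "aaagg", width=6):
    print(f"<p>{line}</p>")
```

Each row still holds exactly `width` bases once the tags are ignored, which is the property the original substitution approach lost.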

JSONDecodeError when iterating twitter data

I'm trying to iterate twitter data which is stored in a json file:
import json

fname = 'test.json'
with open(fname, 'r') as f:
    for line in f:
        tweet = json.loads(line)['text']
        print(tweet)
It prints the first tweet in the file just fine but when it iterates for a second time it gives me a JSONDecodeError:
JSONDecodeError: Expecting value: line 2 column 1 (char 1)
My JSON file is approximately 650 MB in size.
To get the twitter data I used the StreamListener from the Twitter API.
Here is a glimpse into my JSON file:
{"created_at":"Sun Apr 24 05:37:02 +0000 2016","id":724109877732204544,"id_str":"724109877732204544","text":"JONES RETURNS WITH A UNANIMOUS DECISION WIN IVER OVINCE SAINT PREUX! #UFC197 https:\/\/t.co\/KlfaAh9h21","source":"\u003ca href=\"http:\/\/instagram.com\" rel=\"nofollow\"\u003eInstagram\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":714389668633116672,"id_str":"714389668633116672","name":"Leon Doyle","screen_name":"TheLDPodcast","location":"Dublin, Ireland","url":"http:\/\/www.youtube.com","description":"A weekly\/bi-weekly podcast focused mainly around MMA, Boxing, fighting etc. With the occasional random topic.","protected":false,"verified":false,"followers_count":7,"friends_count":59,"listed_count":0,"favourites_count":3,"statuses_count":31,"created_at":"Mon Mar 28 09:52:24 +0000 2016","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"004455","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/714390864030797824\/REXXKCvs_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/714390864030797824\/REXXKCvs_normal.jpg","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"UFC197","
indices":[69,76]}],"urls":[{"url":"https:\/\/t.co\/KlfaAh9h21","expanded_url":"https:\/\/www.instagram.com\/p\/BEkk6Gewpqy\/","display_url":"instagram.com\/p\/BEkk6Gewpqy\/","indices":[77,100]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1461476222819"}
{"created_at":"Sun Apr 24 05:37:03 +0000 2016","id":724109879200366592,"id_str":"724109879200366592","text":"regrann from #ufc - #AndStill UFC flyweight champ #MightyMouseUFC! #UFC197\n\nPresented by\u2026 https:\/\/t.co\/zbE5CsFxMJ","source":"\u003ca href=\"http:\/\/instagram.com\" rel=\"nofollow\"\u003eInstagram\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1070221260,"id_str":"1070221260","name":"Will Manuel","screen_name":"TheWillManuel","location":"Kenai, AK","url":null,"description":"Alaskan. Paramedic. Firefighter. Industrial Security. Libertarian. 2nd Amendment. Liberty. BJJ & Muay Thai novice. #TeamRed #RedemptionMMA #BJJ #MuayThai #MMA","protected":false,"verified":false,"followers_count":437,"friends_count":573,"listed_count":32,"favourites_count":2516,"statuses_count":3184,"created_at":"Tue Jan 08 07:22:47 +0000 2013","utc_offset":-28800,"time_zone":"Alaska","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/579042288040435713\/VeA-zI45.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/579042288040435713\/VeA-zI45.jpeg","profile_background_tile":true,"profile_link_color":"4A913C","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/715188796615237632\/JvxeLz8D_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/715188796615237632\/JvxeLz8D_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1070221260\/1447179132","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"
notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"AndStill","indices":[22,31]},{"text":"UFC197","indices":[69,76]}],"urls":[{"url":"https:\/\/t.co\/zbE5CsFxMJ","expanded_url":"https:\/\/www.instagram.com\/p\/BEkk6a0QMeX\/","display_url":"instagram.com\/p\/BEkk6a0QMeX\/","indices":[92,115]}],"user_mentions":[{"screen_name":"ufc","name":"#UFC197","id":6446742,"id_str":"6446742","indices":[13,17]},{"screen_name":"MightyMouseUFC","name":"Demetrious Johnson","id":140845817,"id_str":"140845817","indices":[52,67]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1461476223169"}
{"created_at":"Sun Apr 24 05:37:03 +0000 2016","id":724109882341896192,"id_str":"724109882341896192","text":"RT #BESTFlGHTS: Jon Jones flips off Daniel Cormier at #UFC197 https:\/\/t.co\/S0pDvRWhfW","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1019191860,"id_str":"1019191860","name":"Paul","screen_name":"Paulie_Frat","location":"Mount Pocono, PA","url":null,"description":"...","protected":false,"verified":false,"followers_count":272,"friends_count":259,"listed_count":0,"favourites_count":1580,"statuses_count":1622,"created_at":"Tue Dec 18 07:10:12 +0000 2012","utc_offset":-14400,"time_zone":"Eastern Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_tile":true,"profile_link_color":"009999","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/512140999444164608\/4H2fiOtg_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/512140999444164608\/4H2fiOtg_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1019191860\/1461422809","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Sun Apr 24 05:12:13 +0000 
2016","id":724103630702432256,"id_str":"724103630702432256","text":"Jon Jones flips off Daniel Cormier at #UFC197 https:\/\/t.co\/S0pDvRWhfW","source":"\u003ca href=\"http:\/\/bufferapp.com\" rel=\"nofollow\"\u003eBuffer\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1370712786,"id_str":"1370712786","name":"BEST FIGHTS","screen_name":"BESTFlGHTS","location":"MMA, Boxing, Street Fights","url":"http:\/\/snapchat.com\/add\/wshhfans","description":"Parody, we do not own the content posted DM's are open send me your fight","protected":false,"verified":false,"followers_count":156257,"friends_count":17861,"listed_count":83,"favourites_count":1,"statuses_count":6723,"created_at":"Sun Apr 21 22:43:19 +0000 2013","utc_offset":-25200,"time_zone":"Arizona","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_tile":true,"profile_link_color":"ABB8C2","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/620356388833734657\/NvmkmGDk_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/620356388833734657\/NvmkmGDk_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/1370712786\/1460756748","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":740,"favorite_count":624,"entities":{"hashtags":[{"text":"UFC19
7","indices":[38,45]}],"urls":[{"url":"https:\/\/t.co\/S0pDvRWhfW","expanded_url":"http:\/\/vine.co\/v\/iU5T53X6U7J","display_url":"vine.co\/v\/iU5T53X6U7J","indices":[46,69]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"UFC197","indices":[54,61]}],"urls":[{"url":"https:\/\/t.co\/S0pDvRWhfW","expanded_url":"http:\/\/vine.co\/v\/iU5T53X6U7J","display_url":"vine.co\/v\/iU5T53X6U7J","indices":[62,85]}],"user_mentions":[{"screen_name":"BESTFlGHTS","name":"BEST FIGHTS","id":1370712786,"id_str":"1370712786","indices":[3,14]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1461476223918"}
How can I solve this issue?
If your JSON file has exactly the same structure as the piece you are posting, the empty lines between tweets indeed cause a JSONDecodeError. If that's the problem, just check that the line is not empty before processing:
with open(fname, 'r') as f:
    for line in f:
        if (not line.strip()):
            continue
        tweet = json.loads(line)['text']
        print(tweet)
Hope it helps.
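If the file may also contain lines that are not valid JSON, or objects without a 'text' key, a slightly more defensive variant could look like the sketch below. The helper name iter_tweet_texts and the sample lines are hypothetical, and skipping such lines (instead of stopping) is an assumption about what you want.

```python
import json

def iter_tweet_texts(lines):
    """Yield the 'text' field of each JSON object, one object per non-blank line."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank separator line between tweets
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"skipping line {number}: {err}")
            continue
        if isinstance(tweet, dict) and "text" in tweet:
            yield tweet["text"]

# Made-up sample standing in for the lines of the real file.
sample = ['{"text": "first"}\n', '\n', '{"text": "second"}\n', 'not json\n']
print(list(iter_tweet_texts(sample)))
```

Because it accepts any iterable of lines, the open file object can be passed in directly, so the 650 MB file is still read one line at a time.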

how to store text containing escape sequences in ms access

When I try to store text containing C code in an MS Access table (programmatically), it replaces the escape sequences ('\n', '\t') with some question-mark symbol.
Example :
code to store :
#include<stdio.h>
int main()
{
printf("\n\n\t Hi there...");
return 0;
}
When I look at the MS Access table for the code inserted above, every newline and '\t' character is shown replaced with a '?' kind of symbol.
My question: "Is there any other data type for an MS Access field that stores code as it is, without replacing escape sequences with some symbol?"
and
"Will a 'raw' data type, as present in other DBMSs such as MySQL, do the job?"
This is how it shows in access-07 :
It looks like the line breaks in your source text are not the Windows-standard CRLF (carriage return, line feed). Find out the character codes of those mystery characters.
Using the procedure below, I can feed it a text string, and it will list the code of each character. Here is an example from the Immediate window.
AsciiValues "a" & vbcrlf & "b"
position Asc AscW
1 97 97
2 13 13
3 10 10
4 98 98
If I want to examine the value stored in a table text field, I can use DLookup to fetch that value and feed it to the function.
AsciiValues DLookup("memo_field", "tblFoo", "id=1")
position Asc AscW
1 108 108
2 105 105
3 110 110
4 101 101
5 32 32
Once you determine the codes of the problem characters, you can execute an UPDATE statement to replace the problem character codes with suitable alternatives.
UPDATE YourTable
SET YourField = Replace(YourField, Chr(x), Chr(y));
And this is the procedure ...
Public Sub AsciiValues(ByVal pInput As String)
Dim i As Long
Dim lngSize As Long
lngSize = Len(pInput)
Debug.Print "position", "Asc", "AscW"
For i = 1 To lngSize
Debug.Print i, Asc(Mid(pInput, i, 1)), AscW(Mid(pInput, i, 1))
Next
End Sub
I'd say it's probably that you're lacking the whole newline. A newline in Access consists of a Carriage Return (ASCII 13) AND a Line Feed (ASCII 10). This is abbreviated as CRLF. You probably only have one or the other, but not both.
Use HansUp's AsciiValues procedure to take a look.
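If the text is prepared outside Access before being inserted, the same inspection and repair can be sketched in Python. This is an illustrative sketch only: char_codes and to_crlf are made-up helper names, and it assumes the fix you want is to turn every lone CR or lone LF into a full CRLF pair.

```python
import re

def char_codes(text):
    """Return (position, code point) pairs, with positions starting at 1."""
    return [(i, ord(ch)) for i, ch in enumerate(text, start=1)]

def to_crlf(text):
    """Normalise CRLF, lone CR and lone LF line endings to CRLF."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

source = 'printf("hi");\nreturn 0;'   # hypothetical text with bare LF line endings
for position, code in char_codes("a\r\nb"):
    print(position, code)
print(repr(to_crlf(source)))
```

Matching "\r\n" first in the alternation keeps existing CRLF pairs from being doubled.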

Automatically sum numeric columns and print total

Given the output of git ... --stat:
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)
I wanted to produce the sum of the numeric columns but preserve the formatting of the line. In the interest of generality, I produced this awk script that automatically sums any numeric columns and produces a summary line:
{
for (i = 1; i <= NF; ++i) {
if ($i + 0 != 0) {
numeric[i] = 1;
total[i] += $i;
}
}
}
END {
# re-use non-numeric columns of last line
for (i = 1; i <= NF; ++i) {
if (numeric[i])
$i = total[i]
}
print
}
Yielding:
44 files changed, 1080 insertions(+), 338 deletions(-)
Awk has several features that simplify the problem, like automatic string->number conversion, all arrays as associative arrays, and the ability to overwrite auto-split positional parameters and then print the equivalent lines.
Is there a better language for this hack?
Perl - 47 char
Inspired by ChristopheD's awk solution. Used with the -an command-line switch. 43 chars + 4 chars for the command-line switch:
$i-=@a=map{($b[$i++]+=$_)||$_}@F}{print"@a"
I can get it to 45 (41 + -ap switch) with a little bit of cheating:
$i=0;$_="Ctrl-M@{[map{($b[$i++]+=$_)||$_}@F]}"
Older, hash-based 66 char solution:
@a=(),s#(\d+)(\D+)#$b{$a[@a]=$2}+=$1#gefor<>;print map$b{$_}.$_,@a
Ruby — 87
puts ' '+[*$<].map(&:split).inject{|i,j|[0,3,5].map{|k|i[k]=i[k].to_i+j[k].to_i};i}*' '
Python - 101 chars
import sys
print" ".join(`sum(map(int,x))`if"A">x[0]else x[0]for x in zip(*map(str.split,sys.stdin)))
Using reduce is longer at 126 chars
import sys
print" ".join(reduce(lambda X,Y:[str(int(x)+int(y))if"A">x[0]else x for x,y in zip(X,Y)],map(str.split,sys.stdin)))
AWK - 63 characters
(in a bash script, $1 is the filename provided as command line argument):
awk -F' ' '{x+=$1;y+=$4;z+=$6}END{print x,$2,$3,y,$5,z,$7}' $1
One could of course also pipe the input in (would save another 3 characters when allowed).
This problem is not challenging or difficult... it is "cute" though.
Here is solution in Python:
import sys
r = []
for s in sys.stdin:
    r = map(lambda x,y:(x or 0)+int(y) if y.isdigit() else y, r, s.split())
print ' '.join(map(str, r))
What does it do... it keeps tally in r while proceeding line by line. Splits the line, then for each element of the list, if it is a number, adds it to the tally or keeps it as string. At the end they all get re-mapped to string and merged with spaces in between to be printed.
Alternative, more "algebraic" implementation, if we did not care about reading all input at once:
import sys
def totalize(l):
    try: r = str(sum(map(int,l)))
    except: r = l[-1]
    return r
print ' '.join(map(totalize, zip(*map(str.split, sys.stdin))))
What does this one do? totalize() takes a list of strings and tries to calculate sum of the numbers; if that fails, it simply returns the last one. zip() is fed with a matrix that is list of rows, each of them being list of column items in the row - zip transposes the matrix so it turns into list of column items and then totalize is invoked on each column and the results are joined as before.
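The two snippets above use the Python 2 print statement. A Python 3 sketch of the same transpose-and-totalize idea, written as a function over an iterable of lines so it can be tried without stdin (sum_columns is a name chosen here for illustration):

```python
def totalize(column):
    """Sum a column if every item is an integer, else keep its last item."""
    try:
        return str(sum(int(item) for item in column))
    except ValueError:
        return column[-1]

def sum_columns(lines):
    """Transpose whitespace-split rows and totalize each column."""
    rows = [line.split() for line in lines if line.strip()]
    return " ".join(totalize(column) for column in zip(*rows))

sample = [
    "3 files changed, 72 insertions(+), 21 deletions(-)",
    "3 files changed, 27 insertions(+), 4 deletions(-)",
    "4 files changed, 164 insertions(+), 0 deletions(-)",
]
print(sum_columns(sample))
```

To run it on real input, pass sys.stdin instead of the sample list. Note that zip() stops at the shortest row, so rows with differing field counts are truncated.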
At the expense of making your code slightly longer, I moved the main parsing into the BEGIN clause so the main clause is only processing numeric fields. For a slightly larger input file, I was able to measure a significant improvement in speed.
BEGIN {
getline
for (i = 1; i <= NF; ++i) {
# need to test for 0, too, in this version
if ($i == 0 || $i + 0 != 0) {
numeric[i] = 1;
total[i] = $i;
}
}
}
{
for (i in numeric) total[i] += $i
}
END {
# re-use non-numeric columns of last line
for (i = 1; i <= NF; ++i) {
if (numeric[i])
$i = total[i]
}
print
}
I made a test file using your data and doing paste file file file ... and cat file file file ... so that the result had 147 fields and 1960 records. My version took about 1/4 as long as yours. On the original data, the difference was not measurable.
JavaScript (Rhino) - 183 154 139 bytes
Golfed:
x=[n=0,0,0];s=[];readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){x[n]+=+b;s[n++]=c;n%=3});print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2])
Readable-ish:
x=[n=0,0,0];
s=[];
readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){
x[n]+=+b;
s[n++]=c;
n%=3
});
print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2]);
PHP 152 130 Chars
Input:
$i = "
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)";
Code:
$a = explode(" ", $i);
foreach($a as $k => $v){
if($k % 7 == 0)
$x += $v;
if(3-$k % 7 == 0)
$y += $v;
if(5-$k % 7 == 0)
$z += $v;
}
echo "$x $a[1] $a[2] $y $a[4] $z $a[6]";
Output:
44 files changed, 1080 insertions(+), 338 deletions(-)
Note: explode() will require that there is a space char before the new line.
Haskell - 151 135 bytes
import Char
c a b|all isDigit(a++b)=show$read a+read b|True=a
main=interact$unwords.foldl1(zipWith c).map words.filter(not.null).lines
... but I'm sure it can be done better/smaller.
Lua, 140 bytes
I know Lua isn't the best golfing language, but compared by the size of the runtimes, it does pretty well I think.
f,i,d,s=0,0,0,io.read"*a"for g,a,j,b,e,c in s:gmatch("(%d+)(.-)(%d+)(.-)(%d+)(.-)")do f,i,d=f+g,i+j,d+e end print(table.concat{f,a,i,b,d,c})
PHP, 176 166 164 159 158 153
for($a=-1;$a<count($l=explode("
",$i));$r=explode(" ",$l[++$a]))for($b=-1;$b<count($r);$c[++$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b]);echo join(" ",$c);
This would, however, require the whole input in $i... A variant with $i replaced with $_POST["i"] so it would be sent in a textarea... Has 162 chars:
for($a=-1;$a<count($l=explode("
",$_POST["i"]));$r=explode(" ",$l[$a++]))for($b=0;$b<count($r);$c[$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b])$b++;echo join(" ",$c);
This is a version with
NO HARDCODED COLUMNS