I have a PGN (Portable Game Notation) of a chess game. What I would like is to get just a list of the moves. For example:
PGN :
1. e4 e5 2. f4 exf4 3. Nf3 d5 4. exd5 Nf6 5. Nc3 Nxd5 6. Nxd5 Qxd5 7. d4 Bg4 8.
Bxf4 Nc6 9. Be2 O-O-O 10. c3 Qe4 11. Qd2 Rxd4 12. Nxd4 Nxd4 13. cxd4 Bb4 14.
Kf2 Bxd2 15. Bxg4+ f5 16. Bxd2 fxg4 17. Rhe1 Qxd4+ 18. Be3 Qxb2+ 19. Kf1 Re8
0-1
output:
['e4','e5','f4','exf4','Nf3','d5', .... , 'Re8']
My idea was to take the string, split it at the spaces, and then build a new array that way, but I'm wondering if there are any better ways of doing this. There's no specific language; I'm just interested in general. Could be Python, JavaScript, doesn't really matter.
Also, sometimes PGN comes with annotations in the middle of the string, or "variations" which are denoted in brackets. I'd like to ignore these. Any ideas?
Thanks
Strange, I couldn't find good PGN parsers for Ruby or Javascript. Here are two other libraries that I briefly tested:
PHP: https://github.com/DHTMLGoodies/chessParser (seems to be broken; when I tried I always got an empty array of games)
Perl: http://metacpan.org/pod/Chess::PGN::Parse
(seems to work, at least I could see the moves of a PGN game. Not easy to get started, though.)
Maybe it is really the best approach to write the parser yourself. You can eliminate the comments with regular expressions as they are not nested.
(from Wikipedia)
Comments are inserted by either a ";" (a comment that continues to the end of the line) or a "{" (which continues until a matching "}"). Comments do not nest.
After the comments and the variations are gone, you can parse the moves as you intended (split on whitespace and filter out the move numbers). Note that variations are enclosed in parentheses and, unlike comments, can nest.
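A minimal Python sketch of that approach, assuming the input is the movetext of a single game with no tag-pair header. The function name pgn_moves is made up for illustration, and this is not a full PGN parser. Variations are stripped innermost-first so that nested ones disappear too.

```python
import re

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def pgn_moves(movetext):
    """Return the mainline moves of one game's movetext as a list of strings."""
    # Brace comments do not nest, so a single non-greedy-style pattern is enough.
    text = re.sub(r"\{[^}]*\}", " ", movetext)
    # Semicolon comments run to the end of the line.
    text = re.sub(r";[^\n]*", " ", text)
    # Variations may nest: remove the innermost pair of parentheses repeatedly.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\([^()]*\)", " ", text)
    moves = []
    for token in text.split():
        # Drop move numbers such as "1.", "3..." or the prefix of "12.e4".
        token = re.sub(r"^\d+\.+", "", token)
        if token and token not in RESULTS:
            moves.append(token)
    return moves

sample = "1. e4 {king's pawn} e5 (1... c5 2. Nf3 (2. c3) d6) 2. Nf3 ; note\n2... Nc6 1-0"
print(pgn_moves(sample))  # ['e4', 'e5', 'Nf3', 'Nc6']
```

Check symbols such as + and # are left attached to the move, as in the output the question asks for.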
I've just started using the Ruby PGN gem at https://rubygems.org/gems/pgn . It has a parser module; you can do PGN -> FEN, play through the game, set up positions with FEN import, etc. I've also been using a branch of it at https://github.com/tobiasvl/pgn/tree/pgn-annotations which is able to parse PGNs containing variations and comments.
Here's a JavaScript version: https://github.com/jhlywa/chess.js
I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of names, and then use the Levenshtein algorithm to compare each line with a list of names in my database. If I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest, my guess is the distance matches for 4 & 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of folks coming in, so if any changes to that I can do to help, please let me know.
Since your goal is to get names only, I would suggest restricting tessedit_char_whitelist to uppercase English letters, digits and the period ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you do not get unexpected characters in the output, like \\) [ .
Your initial approach of calculating the Levenshtein distance is fine if you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
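As a rough illustration of the matching step, here is a small Python sketch. The roster names, the best_match helper and the 0.4 threshold are all assumptions made up for this example. It uses a plain dynamic-programming Levenshtein distance and rejects a line when even the closest name is too far away.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def best_match(ocr_line, known_names, max_ratio=0.4):
    """Return the closest known name, or None if nothing is close enough."""
    # Keep only letters and spaces; OCR noise and the row number are dropped.
    cleaned = "".join(c for c in ocr_line.upper() if c.isalpha() or c == " ").strip()
    if not cleaned:
        return None
    best = min(known_names, key=lambda name: levenshtein(cleaned, name.upper()))
    distance = levenshtein(cleaned, best.upper())
    if distance <= max_ratio * max(len(cleaned), len(best)):
        return best
    return None

# Hypothetical roster; the real list would come from the database.
roster = ["Kim Taylor", "Lin Davis", "Adam Hart"]
print(best_match("4. KIM TAYLOE", roster))  # Kim Taylor
print(best_match("5. LN] Davis", roster))   # Lin Davis
print(best_match("wau HQ", roster))         # None
```

The threshold is relative to the name length so that short names are not matched too eagerly; the right value would have to be tuned on real sheets.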
I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract text ROIs around them. In the best case you will be able to extract separate characters, but even if you don't, you will get better results and will be able to distinguish the resulting names line by line.
You should also try the other output-quality improvements recommended in the Tesseract OCR wiki.
I have written a programme which merges two 1D arrays containing names. I print the list of arr1, arr2 and arr3.
I am using Lazarus Free Pascal v. 1.0.14 . I was wondering if anyone knows how to break the results in the dos-like window because the list is so long that I can only see the last few names in the returned results. The rest go by too fast to read.
I know I can save the results to a file, and I also use the delay command, but I would like to know if there is a way to somehow break up the results, slow them down, or even edit the output console?
I appreciate your help.
This isn't really a programming question, because your console application should output the values without pause. Otherwise your program would become useless if you ever wanted it to run as part of another pipeline in an automated fashion.
Instead you need a tool that you wrap around your program to paginate the output if, and when, you so desire. Such tools are known as terminal pagers and the basic one that ships with Windows is called more. You execute your program and pipe the output to the more program. Like this:
C:\SomeDir>MyProject.exe <input_args> | more
You can change the code of your loop in the following way:
Say you print the results with the following loop:
for i:=0 to 250 do
  WriteLn(ArrUnited[i]);
You can replace it with:
for i:=0 to 250 do
begin
  WriteLn(ArrUnited[i]);
  if (i mod 25) = 24 then // wait for the user to press Enter every 25 rows
    ReadLn;
end;
In future, please post an MCVE in your questions; otherwise everyone has to guess what your code is.
I want to replace the characters '<' and '>' with &lt; and &gt; in COBOL. I was wondering about the INSPECT statement, but it looks like that statement can only be used to translate one character into another. My intention is to replace all HTML special characters with their HTML entities.
Can anyone figure out some way to do it? Maybe looping over the string and testing each char is the only way?
GnuCOBOL or IBM COBOL examples are welcome.
My best attempt so far looks something like this: (http://ideone.com/MKiAc6)
       IDENTIFICATION DIVISION.
       PROGRAM-ID. HTMLSECURE.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       77 INPTXT PIC X(50).
       77 OUTTXT PIC X(500).
       77 I PIC 9(4) COMP VALUE 1.
       77 P PIC 9(4) COMP VALUE 1.
       PROCEDURE DIVISION.
           MOVE 1 TO P
           MOVE SPACES TO OUTTXT
           MOVE '<SCRIPT> TEST TEST </SCRIPT>' TO INPTXT
           PERFORM VARYING I FROM 1 BY 1
                   UNTIL I > LENGTH OF INPTXT
               EVALUATE INPTXT(I:1)
                   WHEN '<'
                       MOVE "&lt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN '>'
                       MOVE "&gt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN OTHER
                       MOVE INPTXT(I:1) TO OUTTXT(P:1)
                       ADD 1 TO P
               END-EVALUATE
           END-PERFORM
           DISPLAY OUTTXT
           STOP RUN
           .
GnuCOBOL (yes, another name branding change) has an intrinsic function extension, FUNCTION SUBSTITUTE.
move function substitute(inptxt, ">", "&gt;", "<", "&lt;") to where-ever-including-inptxt
It takes a subject string, followed by pairs of patterns and replacements. (These are not regex patterns, just straight text matching.) See http://opencobol.add1tocobol.com/gnucobol/#function-substitute for some more details. The patterns and replacements can all be different lengths.
As intrinsic functions return anonymous COBOL fields, the result of the function can be used to overwrite the subject field, without worry of sliding overlap or other "change while reading" problems.
COBOL is a language of fixed-length fields. So no, INSPECT is not going to be able to do what you want.
If you need this for an IBM Mainframe, your SORT product (assuming sufficiently up-to-date) can do this using FINDREP.
If you look at the XML processing possibilities in Enterprise COBOL, you will see that they do exactly what you want (I'd guess). GnuCOBOL can also readily interface with lots of other things. If you are writing GnuCOBOL for running on a non-Mainframe, I'd suggest you ask on the GnuCOBOL part of SourceForge.
Otherwise, yes, it would come down to looping through the data. Once you clarify what you want a bit more, you may get examples of that if you still need them.
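To show the shape of that loop outside COBOL, here is a short Python sketch that builds the output one character at a time from a lookup table. The function name escape_html is made up, and the entries for & and the double quote are an assumption about which other characters you would want to cover; the question itself only names < and >.

```python
# Lookup table of characters to replace. Only "<" and ">" come from the
# question; "&" and '"' are assumed extras for a typical HTML escape.
ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
}

def escape_html(text):
    """Copy text character by character, expanding special characters."""
    pieces = []
    for char in text:
        # The replacement may be longer than the character it replaces,
        # which is why a one-for-one translation is not enough.
        pieces.append(ENTITIES.get(char, char))
    return "".join(pieces)

print(escape_html("<SCRIPT> TEST TEST </SCRIPT>"))
# &lt;SCRIPT&gt; TEST TEST &lt;/SCRIPT&gt;
```

The same structure maps onto the COBOL version: an input index, an output pointer that advances by the length of whatever was written, and a table of replacements.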
According to Adobe's manual on PDF Open Parameters, PDF files can be opened with certain parameters from the command line or from a link in HTML.
These open parameters include page=pagenum, zoom=scale, comment=commentID and others (the first parameter should be preceded by a # and each subsequent one by a &).
The official PDF Open Parameters document from Adobe gives this example:
#page=1&comment=452fde0e-fd22-457c-84aa-2cf5bed5a349
but the comment part doesn't work for me!
page=pagenum and zoom=scale work for me well. But comment=commentID does not work. I tried on Adobe reader 6.0.0 and Adobe Pro Extended 9.0.0: I can't get to the specified comment.
Also, I get the comment ID by exporting the comments in XFDF format and in the resulting file, there is a name attribute for every comment that I hope corresponds to the ID (well, the appearance looks like the example in the manual).
I thought maybe there is a setting that I should first enable (or maybe disable in adobe) or maybe I am getting the comment IDs wrong, or maybe something else?!
Any help would be extremely appreciated
According to the docs, you must include a page=X along with your comment=foo. Your copied sample has it, but it's copied from the docs, not something you did yourself.
Are you missing a page= when setting comment?
BASTARDS!
From the last page of the manual you linked:
URL Limitations
●Only one digit following a decimal point is retained for float values.
●Individual parameters, together with their values (separated by & or #), can be no greater then 32 characters in length.
Emphasis added.
The comment ID is a 16-byte value expressed as hex, with four hyphens thrown in to break up the monotony. That's 36 characters right there... starting with "comment=" adds another 8 characters. 44 characters total.
According to that, a comment ID can NEVER WORK, including the samples they show in their docs.
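The arithmetic is easy to check mechanically. This Python snippet measures the example parameter copied from the manual against the 32-character limit quoted above; the variable names are just for illustration.

```python
MAX_PARAM_LEN = 32  # limit quoted from the manual's "URL Limitations" section

# Example parameter taken from Adobe's own documentation.
param = "comment=452fde0e-fd22-457c-84aa-2cf5bed5a349"
comment_id = param.split("=", 1)[1]

id_len = len(comment_id)       # 32 hex digits plus 4 hyphens
param_len = len(param)         # the ID plus the 8 characters of "comment="
exceeds_limit = param_len > MAX_PARAM_LEN

print(id_len, param_len, exceeds_limit)  # 36 44 True
```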
Are you just trying it on the command line, or have you tried via a web browser too? I wonder if that makes a difference. If not, we're looking at a feature that CANNOT WORK. EVER... and probably never has.
I have an address string in MySQL that has been mashed together by the source. I think it should be possible to use a regular expression or some other method to separate the string into usable parts in MySQL, but I am not aware of how this could be achieved.
Basically each string looks something like these examples (I have added a marker to the top to show what each bit is):
<-------------><-------><-><-->
123 Fake StreetRESERVOIRVIC3001
<-----------------><--------------------><------><-><-->
Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001
<-------------------><-------------------><--- ><><-->
Grange Police StationShop 1 Fairytale LaneGRANGEWA8001
The address is supposed to be broken up into one or two lines of address information, followed by suburb, state and postcode. I'm in Australia, so the state will be one of NSW, VIC, QLD, WA, SA, NT or ACT, and the postcode will always be a 4-digit number at the very end.
The possible ways to break it up are that the suburb will always be capitalised, the state and postcode will be predictable within the last 6 or 7 characters (depending on the state), and the first two lines of address information will be separated by a change in case with no space character in between.
I have some 100,000 records like this, so going through them by hand would be very time consuming. Any help on a way of doing this programmatically would be much appreciated.
With no spaces? Most gross...
MySQL doesn't have the tools to deal with that, so you'll have to access the database with an external program. I tend to use Perl for manipulations like this.
Start from the end and work backwards... we know the last four characters should be digits, and the letters preceding them should be one of 7 options. Use that knowledge and you'll be down 2 fields and 6-7 characters.
It looks like your example now has a town in all capital letters at the end... Parse that out, and it should match the state and postcode. I'm certain you can find a database of postcodes online within a few minutes.
With the name and street address remaining, that will have some variability to it, and I wish you a bit of luck there. You may have a head-start with being able to concentrate on the lack of a space between a lowercase and capital, or a letter and number as a breaking point.
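The backwards approach could look roughly like this in Python (used here purely for illustration; the same steps translate to Perl). The function name split_address and the returned keys are made up. It assumes the text before the suburb ends in a lowercase letter, and that a lowercase letter immediately followed by a capital or a digit marks the break between address lines, so a name like "McDonald" would be split incorrectly.

```python
import re

# State and postcode are anchored at the very end of the string.
STATE_POSTCODE = re.compile(r"(NSW|VIC|QLD|ACT|WA|SA|NT)(\d{4})$")
# The suburb is the trailing run of capitals (spaces allowed inside it).
SUBURB = re.compile(r"[A-Z][A-Z ]*$")
# Zero-width break: lowercase letter directly followed by a capital or digit.
# Splitting on an empty match needs Python 3.7 or later.
LINE_BREAK = re.compile(r"(?<=[a-z])(?=[A-Z0-9])")

def split_address(raw):
    """Split a mashed-together address, working from the end backwards."""
    raw = raw.strip()
    tail = STATE_POSTCODE.search(raw)
    if tail is None:
        return None
    state, postcode = tail.groups()
    rest = raw[:tail.start()]

    suburb_match = SUBURB.search(rest)
    if suburb_match is None:
        return None
    suburb = suburb_match.group().strip()
    head = rest[:suburb_match.start()]

    return {
        "lines": LINE_BREAK.split(head),
        "suburb": suburb,
        "state": state,
        "postcode": postcode,
    }

print(split_address("Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001"))
```

Records that come back as None, or with more than two address lines, are the ones worth reviewing by hand.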
Challenge accepted. I'll even throw in some basic punctuation to allow for "101 St. Mark's St." and the like.
/^(([\w\'\.](?=[a-z \'\.])| )+[a-z\'\.])?(([\w\'\.](?=[a-z \d\'\.])| )+[a-z\.\'])([A-Z]+)(NSW|VIC|QLD|WA|SA|NT|ACT)(\d{4})/
Could probably use a little more clean-up, but it should work in any language which supports basic regex with lookahead (some implementations, like JavaScript's and (I think) Ruby's, support lookahead, but not lookbehind). (That, and this puzzle kept me up well past my bed time.) At the very least, it worked on the three examples you provided.
By the way, 2problems.com is a great site for quickly testing regular expressions. It's what I used to work this puzzle out. The guy who built it must have been a real genius. (koff koff)
Rubular is another good option, though since it works by making Ajax calls to a Ruby script behind-the-scenes, it's a bit slower. It does have the nice feature of being able to link to entered patterns and haystacks, though; here's this pattern on Rubular. The 2problems guy really should get around to implementing something like that some day.