NLTK reconstruct sentence from tokens - nltk

I have used NLTK to tokenise a sentence; however, I would now like to reconstruct the sentence into a string.
I've looked over the docs but can't see an obvious way to do this. Is this possible at all?
tokens = [token.lower() for token in tokensCorrect]

NLTK provides no function that recovers the original string. Whitespace is thrown away during tokenization, so there is no way to get back exactly what you started with: the original might have included newlines and multiple spaces, and those are gone. The best you can do is to join the tokens into a string that looks like a normal sentence. A simple " ".join(tokens) puts a space before and after all punctuation, which looks odd:
>>> print(" ".join(tokens))
This is a sentence .
So you need to get rid of spaces before most punctuation, except for a select few like ( and `` that should have the space after them removed. Even then it's sometimes guesswork, since the apostrophe ' is sometimes used between words, sometimes before, and sometimes after. ("Nuthin' doin', y'all!") Good luck with that.
My recommendation is to hold on to the original strings from which you tokenized the sentence, and go back to those. You don't show where your sentences come from so there's nothing more to say really.
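If the original strings really are gone, the spacing rules described above can be sketched roughly as follows. This is a minimal illustration, not an NLTK feature: the function name rejoin and the exact punctuation sets are assumptions, and it deliberately leaves apostrophes alone because they are ambiguous.

```python
import re

def rejoin(tokens):
    """Join tokens with spaces, then tidy the spacing around punctuation."""
    text = " ".join(tokens)
    # Remove the space before closing punctuation and closing quotes ('').
    text = re.sub(r"\s+([.,;:!?)\]}]|'')", r"\1", text)
    # Remove the space after opening brackets and opening quotes (``).
    text = re.sub(r"([(\[{]|``)\s+", r"\1", text)
    return text

print(rejoin(["This", "is", "a", "sentence", "."]))
```

Recent NLTK releases also include nltk.tokenize.treebank.TreebankWordDetokenizer, which applies heuristics of the same kind, but it cannot restore the original whitespace either.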

Related

Godot/gdscript strings with escape characters from database

I'm making a dialogue system in gdscript and am struggling with escape characters, specifically '\n'.
I'm using CastleDB as, although not perfect, it has allowed me to have almost everything stored in data and will allow the person doing the writing for the game to do everything outside the engine, without me having to copy and paste stuff in.
I've hit a stumbling block with escape characters. A single text entry in CastleDB doesn't support spaces, and '\n' within the string prints to '\n', not a space, in the dialog box.
I've tried using the format string function with 'some text here {space} some more text', with the space referencing a string consisting of just \n. This still prints \n. If I feed some constant string with \n in the middle directly into the function which displays the dialog text, it adds a space so I'm not really sure what is going on here.
I don't have a computer science background (I've done some C up until pointers, at which point I decided to return later).
Is there something going on in the background with my string in gdscript? It prints out just like you would expect a string to, apart from ignoring my escape characters.
Could it be something to do with the fact that it comes in as a JSON? As far as I'm aware, even if a string is chopped up and reassembled, it should still just behave like a string...?!
Anyway, I haven't included any code because I don't know what code you'd need to see. I'm hoping it's something simple that because I'm teaching myself as I go I just wasn't aware of, but can post code if it helps.
Thanks,
James
Escape sequences are a way of getting around issues with syntax. When you type a string in most programming languages, it starts with " and ends with another ". And it needs to stay on one line. Simple, right?
What if you want to put an actual " in your string? Or a new line? We need some way of telling the compiler, "hey, we want to insert a newline here, even though we can't use an actual newline character". So we use a \ to let the compiler know that the next character is part of an escape sequence.
This causes another problem: What if we literally want to put a backslash in a string? That's where the double backslash comes from: \\ is the escape sequence for \, since \ by itself has a special meaning.
But CastleDB (apparently, I'm not familiar with it) doesn't recognize escape sequences, so when you type \n it thinks you literally want \ followed by n. When it converts this to JSON, it inserts the \\ because JSON does recognize escape sequences.
GDScript also recognizes escape sequences, so print("Hello\nworld!") prints
Hello
world!
You could try input_string.replace("\\n", "\n") to replace the \n escape sequences.
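The same round trip can be reproduced outside Godot. The sketch below uses Python rather than GDScript, and the JSON snippet and the "text" key are made up for illustration; it assumes the database stored a literal backslash followed by n, as described above.

```python
import json

# Assumed file contents: the JSON source holds \\n, i.e. an escaped backslash
# followed by the letter n, not a newline escape.
raw_json = r'{"text": "Hello\\nworld!"}'

text = json.loads(raw_json)["text"]
# After parsing, text contains a backslash and an "n", not a line break.
print(len(text), "\n" in text)

# Turn the two-character sequence into a real newline character.
fixed = text.replace("\\n", "\n")
print(fixed)
```

The replace call is the same one suggested for GDScript: the first argument is the two literal characters, the second is the actual newline.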
I've solved this by looking at the way CastleDB data is stored on the project's GitHub page.
For some reason "\n" was stored as "\\n" behind the scenes. Now that I know why it was printing weirdly I can change it, even though it feels like a messy solution!
To add even more weirdness to this whole backslash business, Stack Overflow displays a double backslash as a single backslash, so I have to write \ \ \n minus the spaces to get \\n...
I'm sure there must be a reason, but it eludes me.

Remove Speech Marks from file where Text Qualifier is "

Introduction
We have a pretty standard way of importing .txt and .csv into our data warehouse using SSIS.
Our txt/csvs are produced with speech marks as text qualifiers. So a typical file may look like the below:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used - particularly in free text fields such as NAME where people have nicknames/alias. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks? i.e. the speech marks surrounding "STAN", so that column would be treated as MOORE, STANLEY STAN.
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.
I have a few suggestions:
I know Excel has a setting that says something like "Treat Consecutive Delimiters as one."
Change your delimiter to something else, like a pipe (the thing above the backslash, not sure what it is called elsewhere, looks like a vertical line).
You can distinguish text qualifiers from quote marks that are meant to be included in the resulting value, because any text qualifier either immediately precedes or immediately follows a comma (or sits at the start or end of the line). A quote character anywhere else is not a qualifier.
If you do not need to pass the data through any T-SQL you might want to replace non-delimiter quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead.
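The "next to a comma" rule could be applied as a pre-processing step before SSIS reads the file. The following is only a sketch in Python under that assumption; the function name drop_stray_quotes is hypothetical, and a stray quote that happens to sit right beside a comma inside a value would still be misread as a qualifier.

```python
import csv
import re

# A quote counts as a text qualifier only at the start or end of the line
# or directly next to a comma. Any other quote is treated as stray.
STRAY_QUOTE = re.compile(r'(?<!^)(?<!,)"(?!,|$)')

def drop_stray_quotes(line):
    """Remove quote characters that cannot be text qualifiers."""
    return STRAY_QUOTE.sub("", line)

row = '"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"'
cleaned = drop_stray_quotes(row)
fields = next(csv.reader([cleaned]))
print(fields[7])
```

Swapping the empty replacement string for a single quote gives the "replace with single quotes" variant instead of removal.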
Hope this helps,
Joey

How to get rid of &nbsp; in AA

I am reading data (a combination of letters and numbers) from an Excel sheet and putting it into a text field in the target application, where the input should yield a unique item from a database.
However, there is sometimes whitespace after the data in the Excel cell, which results in a "no data found" when this whitespace is entered into the search field in the target application. The whitespace does not seem to be an ordinary space though, since I am unable to trim it AA-internally. I guess it is a &nbsp; (or some similar HTML special character).
edit: confirmed to be a &nbsp; by now.
Q: How can I get rid of such characters AA-internally?
Tried: Neither (a) Trim, (b) Replace " " -> "", nor (c) Replace "&nbsp;" -> "" works.
Workaround: I am currently checking the length of the data provided: if it's longer than 10 chars I only take the leftmost 10 chars. This works here, since it's a business rule for the data I am working with, but I am still interested in a proper solution, since there may be upcoming cases where no business rule will help me out.
AA Version: 11.3.1
Thankful for input...
Okay, since it's a non-breaking space character, you can replace it using a regex in the Replace command.
Find: \u00a0
Options: Regular Expression.
Got rid of it using Replace Command with RegEx ticked:
[^a-z A-Z 0-9]
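For comparison, here is what the \u00a0 replacement does when written out in Python. This is not AA syntax; clean_cell and the sample values are made up for illustration.

```python
import re

NBSP = "\u00a0"  # the non-breaking space character that &nbsp; stands for

def clean_cell(value):
    """Drop every non-breaking space, then trim ordinary whitespace at the ends."""
    return re.sub(NBSP, "", value).strip()

print(repr(clean_cell("AB12345678\u00a0")))
```

Targeting \u00a0 directly is narrower than the [^a-z A-Z 0-9] whitelist, which would also strip hyphens, underscores and any other punctuation from the value.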

I need to remove a piece of every line in my json file

I have JSON output open in Notepad and I know it is not in the correct format. At the end of each line there is a timestamp which is causing the bad format. I want to get rid of it using find and replace since the file is pretty big. The format is as follows:
"eventtimestamp": "05 23 2017 04:01:02"}
The above piece comes at the end of every line. How can I get rid of it using find and replace or any other way?
All help is appreciated.
Thank you
If you need to alter every line in a consistent way then regex find/replace is a good option. Free tools like atom.io, Notepad++, and plenty of others offer this feature.
Assuming "eventtimestamp" is constant, then a simple regex that says "find everything starting with "eventtimestamp" and up to a '}'" will work.
"eventtimestamp".*(?=})
And "replace" that with an empty string.
ps) here's a demo of the regex in regexr.com--hovering over the parts of the pattern will explain what they do.
If you are not sure that the eventtimestamp field always comes at the end of a line and/or as the last element of the object, prefer this kind of pattern: "eventtimestamp":\s*"[^"]+",?.
Note the useful idiom of a negated character class between two delimiters, "[^"]+", which can be adapted to any other delimiter.
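For a file too large to edit comfortably by hand, the same find/replace can be scripted. The sketch below is one possible variant in Python: the sample line and the "id" key are invented, and the pattern consumes an optional comma before the field so that no dangling comma is left in front of the closing brace.

```python
import re

# Optional leading comma, the key, then a quoted value with no embedded quotes.
TIMESTAMP_FIELD = re.compile(r',?\s*"eventtimestamp":\s*"[^"]*"')

def drop_timestamp(line):
    """Remove the eventtimestamp field from one line of JSON text."""
    return TIMESTAMP_FIELD.sub("", line)

sample = '{"id": 7, "eventtimestamp": "05 23 2017 04:01:02"}'
print(drop_timestamp(sample))
```

If the field can also appear first in the object, this version would leave the comma that follows it, so the trailing-comma form of the pattern given above is the better fit for that case.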

MySQL matching this regex while it shouldn't

I'm trying to recognize quoting (citing) of somebody else's sentence in markdown text, which I have in my local copy of the MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
it matches some unwanted data, which according to https://regex101.com/, it should not with this particular regular expression.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL Workbench didn't show those carriage return and newline symbols until I copy-pasted the text here.
Can I normalize (remove \r and \n) with some update query?
Is the MySQL regex implementation different from POSIX standard regex?
Do you by any chance have a reasonably clean solution for recognizing quoting in markdown text?
Thanks!
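You've got an awful lot of parens in there. Try this, which is functionally what you intended above (note that a POSIX class such as [:blank:] is only valid inside a bracket expression, hence the doubled brackets):
select * from github_discussions where body rlike '.*[[:blank:]]+>[[:blank:]]+.+'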
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[[:blank:]]*>[[:blank:]]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
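A likely reason for the mismatch with regex101 is the backslash: inside a MySQL string literal with default settings, an unrecognized escape such as \s is reduced to the plain letter s before the regex engine sees it. The Python sketch below only simulates that effect by substituting the letter by hand; it does not run MySQL, and the shortened quoted sample is taken from the question.

```python
import re

as_written = r"(.)*(\s){1,}(>)(\s){1,}(.)+"
# Assumption: MySQL hands the pattern to its regex engine with \s turned into s.
as_mysql_sees_it = as_written.replace("\\s", "s")

test_string = "`Params` is plural -> contain<s>s</s>"
quoted = "Yes, I believe so.\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski wrote:"

for name, pattern in [("as written", as_written), ("as MySQL sees it", as_mysql_sees_it)]:
    print(name,
          bool(re.search(pattern, test_string)),
          bool(re.search(pattern, quoted)))
```

Under this assumption the pattern looks for the letters "s>s", which is exactly what the <s>s</s> markup in the test string contains and what the genuine quote lacks. Using [[:space:]] or [[:blank:]] avoids the backslash entirely.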
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)
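One way to approach that exercise, shown here in Python rather than MySQL, is to let ^ match at the start of every line instead of only at the start of the whole body. The helper name has_markdown_quote is hypothetical, and how multi-line anchoring carries over to a given MySQL version is not covered here.

```python
import re

# Optional blanks, then ">", then at least one blank, at the start of any line.
QUOTE_LINE = re.compile(r"^[ \t]*>[ \t]+", re.MULTILINE)

def has_markdown_quote(body):
    """Return True if any line of body starts like a markdown block quote."""
    return QUOTE_LINE.search(body) is not None

email = ("Yes, I believe so.\r\n\r\nK\r\n\r\n"
         "> On 19-Jul-2014, at 17:33, Stefan Karpinski wrote:\r\n> \r\n"
         "> This is the standard 3-clause BSD license, right?")

print(has_markdown_quote(email))
print(has_markdown_quote("this is before > and after"))
```

With re.MULTILINE the anchor matches right after each \n, so bodies with \r\n line endings do not need to be normalized first.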