I've got some JSON coming in that is occasionally dirty, with random quotes inserted. As an example:
"contact_interests": "Interests:|Poet - Several poems have been published. One poem was set to music, recorded and released in 1972. A recent poem |"Little Brother|" has been set to music and will be recorded and released by 2014.Read |"mystery books, love long walks/hikes, prayer, family.|",
We need to find and replace all occurrences of |" except where those characters terminate a line (|",)
What would be the Regex to accomplish this? Thanks.
You can try this:
\|"(?=[^,])
It matches |" only when the next character is something other than a comma.
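To see the idea in action, here is a minimal Python sketch. It assumes the stray |" sequences should simply be deleted (the question does not say what to replace them with), and it uses a negative lookahead (?!,) instead of the (?=[^,]) form above, so that a |" at the very end of the input is also matched. The names STRAY_QUOTE and strip_stray_quotes are made up for illustration.

```python
import re

# |" that is NOT immediately followed by a comma.
# Unlike (?=[^,]), the negative lookahead also matches at end of input.
STRAY_QUOTE = re.compile(r'\|"(?!,)')

def strip_stray_quotes(line: str) -> str:
    """Delete stray |" sequences, keeping the line-terminating |", intact."""
    return STRAY_QUOTE.sub("", line)

sample = 'A recent poem |"Little Brother|" has been set to music.|",'
cleaned = strip_stray_quotes(sample)
print(cleaned)
```

Only the final |", survives, because it is the one occurrence followed by a comma.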
I am reading data (a combination of letters and numbers) from an Excel sheet and putting it into a text field in the target application, where the input should yield a unique item from a database.
However, there is sometimes a whitespace character after the data in the Excel cell, which results in "no data found" when it is entered into the search field in the target application. It does not seem to be an ordinary space, since I am unable to trim it within AA. I guess it is a &nbsp; (or some similar HTML special character).
edit: confirmed to be a &nbsp; by now.
Q: How can I get rid of such characters within AA?
Tried: Neither (a) Trim, (b) Replace " " -> "", nor (c) Replace "&nbsp;" -> "" works.
Workaround: I am currently checking the length of the data provided: if it's longer than 10 characters, I only take the leftmost 10. This works here, since it's a business rule for the data I am working with, but I am still interested in a proper solution, since there may be upcoming cases where no business rule will help me out.
AA Version: 11.3.1
Thankful for input...
Okay, since it's a non-breaking space character, you can replace it using a regex in the Replace command.
Find: \u00a0
Options: Regular Expression.
Got rid of it using Replace Command with RegEx ticked:
[^a-z A-Z 0-9]
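Outside of AA, the same two patterns can be tried out in Python to see what they do. This is only an illustrative sketch: the sample value and the function names are invented, and the second function uses a tightened version of the character class above without the embedded spaces, so ordinary spaces are removed as well.

```python
import re

def strip_nbsp(value: str) -> str:
    # Remove every non-breaking space (U+00A0), wherever it appears.
    return re.sub(r"\u00a0", "", value)

def keep_alphanumerics(value: str) -> str:
    # Remove everything that is not an ASCII letter or digit.
    return re.sub(r"[^a-zA-Z0-9]", "", value)

raw = "AB12345678\u00a0"  # hypothetical cell value with a trailing NBSP
print(repr(strip_nbsp(raw)))
print(repr(keep_alphanumerics(raw)))
```

The first pattern is the safer choice when the data may legitimately contain punctuation, since it removes only the one offending character.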
I have some JSON output open in Notepad and I know it is not in the correct format. At the end of each line there is a timestamp which is causing the bad format. I want to get rid of it using find and replace, since the file is pretty big. The format is as follows:
"eventtimestamp": "05 23 2017 04:01:02"}
The above piece comes in at the end of every line. How can I get rid of it using find and replace, or any other way?
All help is appreciated.
Thank you
If you need to alter every line in a consistent way then regex find/replace is a good option. Free tools like atom.io, Notepad++, and plenty of others offer this feature.
Assuming "eventtimestamp" is constant, then a simple regex that says "find everything starting with "eventtimestamp" and up to a '}'" will work.
"eventtimestamp".*(?=})
And "replace" that with an empty string.
P.S. Here's a demo of the regex on regexr.com; hovering over the parts of the pattern will explain what they do.
If you are not sure that the eventtimestamp field always comes at the end of a line and/or as the last element of the object, prefer a pattern like this: "eventtimestamp":\s*"[^"]+",?.
Note the useful negated character class idiom "[^"]+" (a delimiter, anything that is not the delimiter, then the delimiter again), which can be adapted to any other delimiter.
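As a rough illustration, here is how the field could be stripped line by line in Python. The pattern is a variation on the ones above, not the same regex: it also consumes a comma before the field so that no trailing comma is left behind, which assumes the timestamp is never the first key in the object. The sample line (including the "id" key) is invented, since only the end of each line is shown in the question.

```python
import json
import re

# Optional preceding comma, then the key, then a quoted value
# matched with the negated character class idiom "[^"]*".
TIMESTAMP_FIELD = re.compile(r',?\s*"eventtimestamp":\s*"[^"]*"')

def drop_timestamp(line: str) -> str:
    return TIMESTAMP_FIELD.sub("", line)

line = '{"id": 1, "eventtimestamp": "05 23 2017 04:01:02"}'  # hypothetical
cleaned = drop_timestamp(line)
print(cleaned)
print(json.loads(cleaned))
```

For a large file, the same function can be applied to each line while reading and writing, rather than loading everything into an editor.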
I have used NLTK to tokenise a sentence; I would now like to reconstruct the sentence into a string.
I've looked over the docs but can't see an obvious way to do this. Is this possible at all?
tokens = [token.lower() for token in tokensCorrect]
NLTK provides no such function. Whitespace is thrown away during tokenization, so there is no way to get back exactly what you started with; the whitespace might have included newlines and multiple spaces, and there's no way to get these back. The best you can do is to join the tokens into a string that looks like a normal sentence. A simple " ".join(tokens) will put a space before and after all punctuation, which looks odd:
>>> print(" ".join(tokens))
This is a sentence .
So you need to get rid of spaces before most punctuation, except for a select few like ( and `` that should have the space after them removed. Even then it's sometimes guesswork, since the apostrophe ' is sometimes used between words, sometimes before, and sometimes after. ("Nuthin' doin', y'all!") Good luck with that.
My recommendation is to hold on to the original strings from which you tokenized the sentence, and go back to those. You don't show where your sentences come from so there's nothing more to say really.
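The punctuation cleanup described above could be sketched roughly as follows. This is a heuristic only, under the assumption that closing punctuation attaches to the preceding token and that opening brackets and `` attach to the following one; it makes no attempt at the apostrophe cases, and join_tokens is a made-up helper name rather than an NLTK function.

```python
import re

def join_tokens(tokens):
    """Join tokens with spaces, then tidy spacing around punctuation."""
    text = " ".join(tokens)
    # Drop the space before closing punctuation: "sentence ." -> "sentence."
    text = re.sub(r"\s+([.,!?;:)\]])", r"\1", text)
    # Drop the space after opening brackets and the `` opening quote.
    text = re.sub(r"([(\[]|``)\s+", r"\1", text)
    return text

print(join_tokens(["This", "is", "a", "sentence", "."]))
print(join_tokens(["Hello", ",", "(", "world", ")", "!"]))
```

The result reads like a normal sentence but is not guaranteed to equal the original string, for the reasons given above.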
I'm developing a website which lets people create their own translator. They can choose the name of the URL, and it is sent to a database and I use .htaccess to redirect website.com/nameoftheirtranslator
to:
website.com/translator.php?name=nameoftheirtranslator
Here's my problem:
Recently, I've noticed that someone has created a translator with special characters in the name -> "LAEFÊVËŠI".
But when it is processed (posted to a php file, then mysqli_real_escape_string) and added to the database it appears as simply "LAEFVI" - so you can see the special characters have been lost somewhere.
I'm not quite sure what to do here, but I think there are two paths:
Try to keep the characters and do some encoding (no idea where to start)
Ditch them and tell users to only use 'normal' characters in the names of their translators (not ideal)
I'm wondering whether it's even possible to have a url like website.com/LAEFÊVËŠI - can that be interpreted by the server?
EDIT1: I notice that Stack Overflow, on this very question, translates the special characters in my title to .../using-special-characters-in-urls! This seems like a great solution. I guess I could make a function that translates special characters like â to their normal equivalent (like a)? And I suppose I would just ignore other characters like /##"',&? Now that I think of it, there must be some fairly standard/good-practice strategies for getting around problems like this.
EDIT2: Actually, now that I think about it (more) - I really want this thing to be usable by people of any language (not just English), so I would really love to be able to have special characters in the urls. Having said this, I've just found that Google doesn't interpret â as a, so people may have a hard time finding the LAEFÊVËŠI translator if I don't translate the letters to normal characters. Ahh!
Okay, after that crazy episode, here's what happened:
Found out that I was removing all the non alpha-numeric characters with PHP preg_replace().
Altered preg_replace so it only removes spaces and used rawurlencode():
$name = mysqli_real_escape_string($con, rawurlencode( preg_replace("/\s/", '', $name) ));
Now everything is in the database encoded, safe and sound.
Used this rewrite rule RewriteRule ^([^/.]+)$ process.php?name=$1 [B]
Ran around in circles for 2 hours thinking my rewrite was wrong because I was getting "page not found"
Realised that process.php didn't have a rawurlencode() to read in the name
$name = rawurlencode($_GET['name']);
Now it works.
WOO!
Sleep time.
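For anyone curious what the encoded name looks like, the steps above can be mimicked in Python. This is a sketch, not the PHP code itself: urllib.parse.quote with safe="" is used as a rough analogue of rawurlencode(), and encode_name is a hypothetical helper name.

```python
from urllib.parse import quote, unquote

def encode_name(name: str) -> str:
    # Remove all whitespace, then percent-encode the UTF-8 bytes,
    # mirroring preg_replace("/\s/", '', ...) followed by rawurlencode().
    return quote("".join(name.split()), safe="")

stored = encode_name("LAEFÊVËŠI")
print(stored)           # the form kept in the database
print(unquote(stored))  # decoding restores the original characters
```

The lookup has to compare like with like: if the database holds the encoded form, the name read from the request must be encoded the same way before querying, which is what the last fix above does.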
I have a document that was converted from PDF to HTML for use on a company website, to be referenced and indexed for search. I'm attempting to format the converted document to meet my needs, and in doing so I am trying to clean up some of the junk that was carried over from the PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of 4 lines. Unfortunately, they are not exactly the same, so they cannot be removed with a simple literal replace. The lines contain numbers that increment with the page numbers. How can I remove the following example from my HTML file?
Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>
I've tried many different regular expressions, but as my skill in that area is limited I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it would seem all I need is a wildcard replace for the two numbers in the code, and the rest is literal.
Any help is appreciated.
The search & replace in Notepad++ is quite odd. I can't find newline characters with a regular expression, although the documentation says:
As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.
I updated to the last version, but it just doesn't work. Using the extended mode allows me to find newlines, but I can't specify wildcards.
However, you can use macros to overcome this problem.
prepare a search that will find a unique passage (like Title<br>\r\n, here you can use the extended mode)
start recording a macro
press F3 to use your search
mark the four lines and delete them
stop recording the macro ... done!
Just replay it and it deletes what you wanted to delete.
If I have understood your request correctly this pattern matches your string:
Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>
I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
edit
As I do not use Notepad++ I cannot be sure that this pattern will work for you. Apologies if that transpires to be the case. (I'm a TextPad man myself, and it does work with that tool).
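If the editor's regex support gets in the way, the same pattern can be applied with a short script. Below is a minimal Python sketch based on the pattern above, assuming the four lines always appear in this order and that "Title" and "Footer" are the literal texts from the example; it also tolerates Windows line endings. The sample HTML and the name PAGE_JUNK are invented for illustration.

```python
import re

# Four-line block: header, page number, rule, anchor + footer.
PAGE_JUNK = re.compile(
    r"Title<br> ?\r?\n"
    r"[0-9]+<br> ?\r?\n"
    r"<hr> ?\r?\n"
    r"<A name=[0-9]+></a>Footer<br> ?\r?\n?"
)

html = (
    "First paragraph<br>\n"
    "Title<br>\n"
    "10<br>\n"
    "<hr>\n"
    "<A name=11></a>Footer<br>\n"
    "Second paragraph<br>\n"
)

cleaned = PAGE_JUNK.sub("", html)
print(cleaned)
```

Because the two numbers are matched with [0-9]+, every page's block is removed in one pass regardless of the page number.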