Using Personal Dictionary files - hunspell

The man page mentions the "Personal dictionary file" ...
https://manpages.ubuntu.com/manpages/trusty/en/man4/hunspell.4.html
Personal dictionaries are simple word lists. Asterisk at the
first character position signs prohibition. A second word separated
by a slash sets the affixation.
If I can include those words in the main dictionary file, why would anyone use a personal file? Is any language using this?

Load a CSV onto Apache Beam where there is a comma in some of the fields

I am loading a CSV into Apache Beam, but the CSV I am loading has commas in the fields. It looks like this:
ID, Name
1, Barack Obama
2, Barry, Bonds
How can I go about fixing this issue?
This is not specific to Beam, but a general problem with CSV. It's unclear if the second line should have ID="2, Barry" Name="Bonds" or the other way around.
If you can use some context (e.g. ID is always an integer, only one field could possibly contain commas), you could solve this by reading it as a text file line by line and parsing each line into separate fields with a custom DoFn (assuming rows don't also contain newlines).
Generally, non-separating commas should be inside quotes in well-formed CSV, which makes this much more tractable (e.g. it would just work with the Beam Dataframes read_csv.)
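As a rough illustration of the context-based approach above, here is a minimal sketch. It assumes the ID is always an integer and that only the Name field can contain commas, so each line is split on the first comma only. The function name parse_line is hypothetical; in a pipeline it could be called from a custom DoFn or a map step after reading the file as text.

```python
def parse_line(line):
    """Split an 'ID, Name' line on the first comma only.

    Assumption: the ID never contains a comma, so everything after
    the first comma belongs to the Name field.
    """
    id_text, name = line.split(",", 1)
    return int(id_text.strip()), name.strip()


# Data rows from the question (header line skipped).
rows = [parse_line(line) for line in ["1, Barack Obama", "2, Barry, Bonds"]]
```

With the sample rows, the second entry keeps "Barry, Bonds" together as the name. If the comma could instead belong to the ID, a different rule would be needed.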

Remove Speech Marks from file where Text Qualifier is "

Introduction
We have a pretty standard way of importing .txt and .csv into our data warehouse using SSIS.
Our txt/csvs are produced with speech marks as text qualifiers. So a typical file may look like the below:
"0001","025",1,"01/01/19","28/12/18",4,"ST","SMITH,JOHN","15/01/19"
"0002","807",1,"01/01/19","29/12/18",3,"ST","JONES,JOY","06/02/19"
"0003","160",1,"01/01/19","29/12/18",3,"ST","LEWIS,HANNAH","18/01/19"
We have set all our SSIS packages to strip out the speech marks by setting Text Qualifier = "
Problem
However, as some of our data entry is done manually, speech marks are sometimes used - particularly in free text fields such as NAME where people have nicknames/alias. This causes errors in our SSIS loading.
An example of a problematic row would be:
"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"
My question
Is there a way to somehow strip out these problematic speech marks? i.e. the speech marks surrounding "STAN", so that column would be treated as MOORE, STANLEY STAN.
If there was a way within SSIS to do this, great. If not, we are open to other ideas outside of SSIS.
Solution needs to be scalable as we have hundreds of SSIS packages where this problem can occur.
I have a few suggestions:
I know Excel has a setting that says something like "Treat Consecutive Delimiters as one."
Change your delimiter to something else, like a pipe (the vertical-line character, |, found above the backslash on many keyboards).
You can distinguish delimiters from quote marks that are meant to be included in the resulting value, because any string delimiter either immediately precedes or immediately follows a comma (or sits at the very start or end of the line). A quote character anywhere else is not a delimiter.
If you do not need to pass the data through any T-SQL, you might want to replace non-delimiter quotes with single quotes or, depending on the final output, maybe the HTML entity (&quot;) instead.
Hope this helps,
Joey
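The rule described above (a quote is a delimiter only when it touches a comma or the start or end of the line) could be applied as a pre-processing step outside SSIS. The following is only a sketch under that assumption; the names STRAY_QUOTE and strip_stray_quotes are hypothetical, and a stray quote that happens to sit right next to a comma inside a value would still be misread.

```python
import re

# A quote is treated as stray when the characters on BOTH sides exist
# and neither is a comma or a line break. Quotes at the start or end
# of the line, or next to a comma, are left alone as text qualifiers.
STRAY_QUOTE = re.compile(r'(?<=[^,\r\n])"(?=[^,\r\n])')


def strip_stray_quotes(line):
    """Remove quotes that cannot be text qualifiers under the rule above."""
    return STRAY_QUOTE.sub("", line)


bad_row = '"0004","645",1,"01/01/19","29/12/18",3,"ST","MOORE,STANLEY "STAN"","12/04/19"'
clean_row = strip_stray_quotes(bad_row)
```

For the problematic row from the question, the name field becomes "MOORE,STANLEY STAN" while every other field is unchanged. To replace the stray quotes with single quotes rather than dropping them, pass "'" as the replacement string.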

Is there any technical difference between CSV, a TSV or a TXT file?

I use these files constantly in my application, but aren't CSV, TSV or TXT files all flat files?
The content is:
"sample","sample"
They are all text files, following the same "guidelines". The difference between them, as long as the creator followed the "rules", is that:
A csv file will have comma separated values and a tsv file will have tab separated values.
For .txt files, there is no formatting specified.
.csv stands for comma separated values, .tsv stands for tab separated values.
As the names suggest, different elements in the file are separated by ',' and '\t' respectively.
The type is chosen depending on the data. If we have, say, numbers larger than 3 digits, we might need commas as part of the content, and it would be better to use a tsv in that case so the commas do not clash with the separator.
Both are types of text files and are increasingly used for classification and data mining purposes.
They do not have any other technical distinguishing factor.
A text file (which might have a txt file extension) will have lines separated by a platform specific line separator (CRLF on Windows, LF on Linux, and so on), and it will tend to contain characters human readable as text in some encoding. Apart from that human readability expectation this allows pretty much any file content on some platforms, so this is more of a content classification than a specific file format.
The other two formats are usually considered special cases of a text file intended to allow easy automated processing; tsv, a "tab separated values" file is simpler than csv, a "comma separated values" file.
csv will have commas as field separators, and it may use quoting and escaping especially to handle commas and quotes occurring in those fields. It may also include a header line as the first line in the file. The last line in the file may or may not end with a line separator.
(Details.)
tsv simply disallows tabs in the values, the header line is mandatory, the final line separator is mandatory.
(Details.)
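To make the difference concrete, here is a small sketch using Python's standard csv module. The sample row is made up for illustration. The csv writer quotes fields that contain commas, while the tsv line is a plain tab join, which works only because none of the values contains a tab.

```python
import csv
import io

# Hypothetical sample row; two of the fields contain commas.
row = ["0001", "SMITH,JOHN", "1,000"]

# csv: the writer adds quotes around fields that contain the separator.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# tsv: no quoting layer, so values must not contain tabs.
tsv_line = "\t".join(row) + "\n"

# Both lines parse back to the original fields.
csv_fields = next(csv.reader(io.StringIO(csv_line)))
tsv_fields = tsv_line.rstrip("\n").split("\t")
```

Here csv_line contains "SMITH,JOHN" and "1,000" wrapped in double quotes, whereas tsv_line carries the same values unquoted.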
A "flat file", in connection with databases, is a text file as opposed to a machine optimized storage method (such as a fixed size record file or a compressed backup file or a file using more elaborate markup language supporting data validation); a flat file tends to be csv or tsv or similar.
This answer benefited from a comment by Alex Shpilkin.

Taking count in Rapidminer

How can I take a row count of a list that is in a Word document? If the same list is in Excel, I am able to take the count using the Aggregate operator, but with a Word document it does not work.
I recommend the answer from @awchisholm as it's the easiest solution. However, if you have several Word documents this might become impractical.
In this case you can use the operator Loop Zip files to unzip the Word document and look inside for the file /word/document.xml. Using RapidMiner's text functions (or Read XML), look for each instance of <w:p ...>...</w:p>. Each one represents a paragraph, i.e. a new line of the list, so you can count them from there.
There is also an XML file in the unzipped directory called /docProps/app.xml, which you can read to find some meta information about the document, such as the number of words, characters and pages. Unfortunately, I've found it unreliable for the number of lines, which is why I recommend searching for the <w:p> tag.
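The same idea can be checked outside RapidMiner. The sketch below is an illustration only, not a RapidMiner process: it opens a .docx as a zip archive, parses word/document.xml and counts the w:p elements. The helper names count_paragraphs and make_sample_docx are hypothetical, and the sample document is built in memory just so the example is self-contained. Note that empty paragraphs are counted too.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_paragraphs(docx):
    """Count <w:p> elements in word/document.xml of a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return sum(1 for _ in root.iter("{%s}p" % W_NS))


def make_sample_docx(lines):
    """Build a minimal in-memory archive holding only word/document.xml."""
    body = "".join("<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % text for text in lines)
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample = make_sample_docx(["apples", "pears", "plums"])
paragraph_count = count_paragraphs(sample)
```

For a real file, pass its path to count_paragraphs instead of the in-memory sample.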
RapidMiner cannot easily read Word documents. You have to save the document as a text file and use the Read CSV operator to read the file.

Find all JPG pathnames in HTML files and convert them into all lowercase

I have a very basic understanding of regexp. I have searched and searched the internet for this.....
I have a linux server which only likes lowercase file names and I stupidly have image filenames in title case!
I want to batch find all jpg pathnames in my HTML files and convert them into all lowercase with Regex.
My-File-Name1.jpg needs to be my-file-name1.jpg
I think I need a regex expression to find them all, and another that replaces them converted into lowercase.
Any help?
EDIT
@Sniffer gave me the regex that gets the filename path.
In Notepad++, use find and replace in regex mode. You can use
([\w/-]+)\.jpe?g to find image pathnames, and
\L\1 in the replace field to change to lowercase, or
\U\2 in the replace field to change to uppercase.
I found the lower/uppercase regex here http://sourceforge.net/p/notepad-plus/discussion/331754/thread/ecb11904/
Usually I would say use an HTML parser which is the best tool for the job here but since you only want jpg files then you might be able to find them all by using the following:
([\w/-]+)\.jpe?g
As you can see, I have added the forward slash / and the dash - to the character class. WARNING: the dash - should always be the last character in the class; keep that in mind if you have more special characters.
You will have to match this globally in your file.
As for the conversion, it can't be done using a regex alone. You will have to call an API that converts a string to lower case, and use it on the captured group $1.
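One way to combine the match with a lowercase call is a substitution callback. This is a minimal sketch in Python, chosen only for illustration: it reuses the character class from the pattern above, matches the whole pathname including the extension, and adds a case-insensitive flag (an assumption, so that .JPG is also caught). The names JPG_PATH and lowercase_jpg_paths are hypothetical.

```python
import re

# Same character class as above; IGNORECASE is an added assumption so
# that extensions such as .JPG or .Jpeg are matched as well.
JPG_PATH = re.compile(r"[\w/-]+\.jpe?g", re.IGNORECASE)


def lowercase_jpg_paths(html):
    """Lowercase every JPG pathname found in the given HTML text."""
    return JPG_PATH.sub(lambda match: match.group(0).lower(), html)


page = '<p>Photo</p><img src="images/My-File-Name1.jpg" alt="My File">'
fixed_page = lowercase_jpg_paths(page)
```

Only the matched pathname is lowercased; the rest of the markup, such as the alt text, keeps its case. The files on disk still have to be renamed separately.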