Docx doesn't read accented words properly in Python - nltk

I've got a problem when trying to tokenize text with the Moses tokenizer. The tokenizer treats accented characters such as 'é' or 'è' as separate special characters and splits them off when tokenizing.
Steps:
--> Read the text from a .docx file
--> Tokenize the text with the Moses tokenizer
from docx import Document
from sacremoses import MosesTokenizer  # or the MosesTokenizer shipped with older NLTK versions

file_docx = Document('file.docx')  # path is a placeholder for the .docx file in question
tokenizer = MosesTokenizer(lang='FR')

for i in file_docx.paragraphs:
    text = i.text
    tok = tokenizer.tokenize(text)
    print(text)
    print(tok)
Results:
J'atteste que j'étais présent pour toute la procédure.
['J', '\\'', 'atteste', 'que', 'j', '\\'', 'e', '́', 'tais', 'pre', '́', 'sent', 'pour', 'toute', 'la', 'proce', '́', 'dure', '.']
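The tokens 'e' and '́' in the output show that the accents come through as separate combining marks, i.e. the text stored in the .docx is in decomposed (NFD-style) form. A likely fix, sketched here and not part of the original post, is to normalize each paragraph to NFC with the standard unicodedata module so base letters and accents are recombined before tokenizing:
import unicodedata

text = "J'atteste que j'e\u0301tais pre\u0301sent pour toute la proce\u0301dure."  # decomposed, as read from the .docx
normalized = unicodedata.normalize('NFC', text)  # recompose 'e' + U+0301 into 'é'
print(normalized)  # J'atteste que j'étais présent pour toute la procédure.
tok = tokenizer.tokenize(normalized)  # `tokenizer` is the MosesTokenizer from the code above
After normalization, 'étais', 'présent', and 'procédure' should no longer be split apart at the accented letters.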

Related

What is the encoding scheme for this Arabic web page?

I am trying to find the encoding scheme for this page (and others), which is surely Arabic but uses lower-ASCII Latin characters to encode the contents.
http://www.saintcyrille.com/2011a.htm
http://www.saintcyrille.com/2011b.htm (English version/translation of that same page)
I have seen several sites and even PDF documents with this encoding, but I can't find the name or method of it.
This specific page is from 2011 and I think this is a pre-Unicode method of encoding Arabic that has fallen out of fashion.
Some sample text:
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'
An extraordinary mojibake case. It looks like the high byte of the Unicode code points has been dropped from the Arabic text. For instance, ا (U+0627, Arabic Letter Alef) appears as ' (U+0027, Apostrophe).
Let's suppose that the missing high byte is always 0x06, as in the following PowerShell script (I added some more strings from the very end of the page http://www.saintcyrille.com/2011a.htm to your sample text):
$mojibakes = @'
E3'!K
'D1J'6) 'D1H-J) 'DA5-J)
*#ED'* AJ 3A1 'D*CHJF
JDBJG'
'D#( / 3'EJ -D'B 'DJ3H9J
'D0J J#*J .5J5'K EF -D( # 3H1J'
ED'-8'* :
'D#CD 'D5J'EJ 7H'D 'D#3(H9 'D98JE E-(0 ,/'K H'D5HE JF*GJ (9/ B/'3 'D9J/
J-(0 'D*B/E DD'9*1'A (9J/'K 9F JHE 'D9J/ (B/1 'D%EC'F -*I *3*7J9H' 'DE4'1C) AJ 'D5DH'* HB/'3 'D9J/ HFF5- D0DC 'D'3*A'/) EF -AD) 'D*H() 'D,E'9J) JHE 'D,E9) 15 '(1JD 2011 -J+ JGJ# 'D,EJ9 E9'K DFH'D 31 'DE5'D-) ( 9// EF 'D#('! 'DCGF) 3JCHF -'61'K )
(5F/HB 'D5HE) 9F/ E/.D 'DCFJ3) AAJ A*1) 'D#9J'/ *8G1 AJF' #9E'D 'D1-E) H'D5/B'* HE' JB'(DG' H0DC 9ED EB(HD HEE/H-
HDF' H7J/ 'D#ED #F *4'1CH' 'D'-*A'D'* AJ 19J*CE HCD 9'E H#F*E (.J1
'DE3J- B#'E ... -#B'K B#'E
'@ -split [System.Environment]::NewLine
Function highByte ([byte]$lowByte, [switch]$moreInfo) {
    if ( $moreInfo.IsPresent -and (
         $lowByte -lt 0x20 -or $lowByte -gt 0x7f )) {
        Write-Host $lowByte -ForegroundColor Cyan
    }
    if ( $lowByte -eq 0x20 ) { 0,$lowByte } else { 6,$lowByte }
}
foreach ( $mojibake in $mojibakes ) {
    $aux = [System.Text.Encoding]::
        GetEncoding( 1252).GetBytes( [char[]]$mojibake )
    [System.Text.Encoding]::BigEndianUnicode.GetString(
        $aux.ForEach({(highByte -lowByte $_)})
    )
    ''  # new line separator for better readability
}
The output (run through Google Translate) seems to give roughly the same sense as the English version of the page, after a fashion…
Output: .\SO\70062779.ps1
مساءً
الرياضة الروحية الفصحية
تأملات في سفر التكوين
يلقيها
الأب د سامي حلاق اليسوعي
الذي يأتي خصيصاً من حلب ـ سوريا
ملاحظات غ
الأكل الصيامي طوال الأسبوع العظيم محبذ جداً والصوم ينتهي بعد قداس
العيد
يحبذ التقدم للاعتراف بعيداً عن يوم العيد بقدر الإمكان حتى تستطيعوا
المشاركة في الصلوات وقداس العيد ، وننصح لذلك ، الاستفادة من حفلة
التوبة الجماعية يوم الجمعة رص ابريل زذرر حيث يهيأ الجميع معاً لنوال سر
المصالحة ب عدد من الأباء الكهنة سيكون حاضراً ة
بصندوق الصومة عند مدخل الكنيسة ففي فترة الأعياد تظهر فينا أعمال الرحمة
والصدقات وما يقابلها وذلك عمل مقبول وممدوح
ولنا وطيد الأمل أن تشاركوا الاحتفالات في رعيتكم وكل عام وأنتم بخير
المسيح قـام خخخ حـقاً قـام
Please keep in mind that I do not understand Arabic.
- The script does not handle numbers: the year 2011 in note #2 is incorrectly transformed to زذرر, for instance.
- Handling of spaces is unclear: is 0x20 always a space, or should it be transformed to ؠ (U+0620, Arabic Letter Kashmiri Yeh)?
- Moreover, there is the problematic presumption about the Unicode range U+0600-U+067F (where are U+0680-U+06FF and the others?).
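For reference, the same reconstruction can be sketched in Python (my own translation of the script above, not part of the original answer): take each character's Windows-1252 byte value and, unless it is a space, move it into the U+06xx range.
def unmojibake(line):
    """Map each Windows-1252 byte into the Arabic block (0x0600 + byte), keeping spaces."""
    return ''.join(' ' if b == 0x20 else chr(0x0600 + b)
                   for b in line.encode('cp1252'))

print(unmojibake("'D1J'6) 'D1H-J) 'DA5-J)"))  # الرياضة الروحية الفصحية
It shares the same caveats about digits and the handling of 0x20 noted above.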

Read a file in R with mixed character encodings

I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.
So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:
with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')
There do seem to be R library functions for guessing character encodings, like stringi::stri_enc_detect, but when possible, it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. It looks like the best way to do this is to take advantage of the fact that when iconv fails to convert a string, it returns NA.
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = F, function(line) {
        if (validUTF8(line))
            return(line)
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })
If you create a test file with
$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'
then linewise.decode("inptest") indeed returns
[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).

Extracting &lt; and &gt; from HTML using Python

I have an HTML document in UTF-8 encoding like the one below. I want to extract the OWNER, NVCODE, and CKHEWAT tags from it using Python and bs4, but < and > are converted to &lt; and &gt;, so I am not able to extract the text from the OWNER, NVCODE, and CKHEWAT tags.
Kindly guide me on how to extract the text from these tags.
<?xml version="1.0" encoding="utf-8"?><html><body><string xmlns="http://tempuri.org/"><root><OWNER>अराजी मतरुका वासीदेह </OWNER><NVCODE>00108</NVCODE><CKHEWAT>811</CKHEWAT></root></string></body></html>
My code
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
soup.find('string').text
Check this, from the Beautiful Soup documentation on output formatters:
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes six possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:
french = "<p>Il a dit <<Sacré bleu!>></p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
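The quoted documentation explains the escaping behaviour; for the actual extraction, one approach (my own sketch, assuming the server really does return the inner markup escaped as &lt;...&gt;) is to take the text of the <string> element, which Beautiful Soup has already unescaped, and parse that string a second time:
import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # `url` is the page from the question
soup = BeautifulSoup(response.text, "lxml")

# .text of the <string> element is the unescaped inner XML:
# "<root><OWNER>अराजी मतरुका वासीदेह </OWNER><NVCODE>00108</NVCODE><CKHEWAT>811</CKHEWAT></root>"
inner = BeautifulSoup(soup.find('string').text, "lxml")

# lxml's HTML parser lowercases tag names, hence 'owner' rather than 'OWNER'
print(inner.find('owner').text, inner.find('nvcode').text, inner.find('ckhewat').text)
Using the "xml" parser for the inner parse would preserve the original tag case instead.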

NLTK tokenize but don't split named entities

I am working on a simple grammar-based parser. For this I need to tokenize the input first. Lots of city names appear in my texts (e.g., New York, San Francisco, etc.). When I just use the standard nltk word_tokenize, all these cities are split.
from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')
Current output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']
Desired output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
How can I tokenize such sentences without splitting named entities?
Identify the named entities, then walk the result and join the chunked tokens together:
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> toks = word_tokenize('What are we going to do in San Francisco?')
>>> chunks = ne_chunk(pos_tag(toks))
>>> [ w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks ]
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
Each element of chunks is either a (word, pos) tuple or a Tree() containing the parts of the chunk.
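If this is needed in several places, the same idea can be wrapped in a small helper (a convenience sketch of my own, not part of the original answer):
from nltk import ne_chunk, pos_tag, word_tokenize

def tokenize_keep_entities(sentence):
    """Tokenize, but keep multi-word named entities together as single tokens."""
    chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
    return [w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w)
            for w in chunks]

print(tokenize_keep_entities('What are we going to do in San Francisco?'))
# ['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']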

Extracting greek characters from technical PDF documents when using Python 3

I'm currently trying to construct a database of chemicals used in a university department, and their hazard classes. I then wish to output to a csv file. One step is to pull all the synonyms for the various chemicals from standard PDFs, such as this for gamma hexalactone:
sample PDF
At the moment, the code I'm using to extract the text just loses the Greek characters, which I need to transfer. It looks like this:
pdfReader = PyPDF2.PdfFileReader(inpathf)
txtObj = ''
for pageNum in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    txtObj += str(pageObj.extractText())
inpathf.close()
outputf.write(txtObj)
outputf.close()
return txtObj
Parameters are extracted from ~2000 PDFs and stored in a dictionary before being transferred to a csv file:
def Outfile_csv(outfile, dict1, length):
    outputfile = open((outfile) + '.csv', 'w', newline='')
    output_list = []
    outputWriter = csv.writer(outputfile)
    outputWriter.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                           'TWA limits /ppm', 'STEL limits /ppm'])
    for r in range(0, length):
        output_list = []
        for s in range(0, 7):
            if s == 0 or s == 3:
                output_list.append(str((dict1[s][r])).encode('utf-8'))
            else:
                output_list.append(str(dict1[s][r]))
        outputWriter.writerow(output_list)
    outputfile.close()
I also can't write out to the CSV in cases where there are Greek characters - those data are simply not placed in the csv file. Many thanks for any help - a day playing with codecs and the contents of Stack Exchange has not helped yet. I'm using Python 3.4 and Windows 8.
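For the CSV half of this, Python 3's csv module handles non-ASCII text as long as the output file is opened with an explicit encoding and the strings are written unmodified; calling .encode('utf-8') on individual cells puts b'...' byte literals in the file instead. A minimal sketch of my own, using a placeholder row rather than real data:
import csv

# Placeholder row; 'utf-8-sig' writes a BOM so Excel on Windows detects the encoding.
rows = [['sample.pdf', 'example name', 'γ-hexalactone (example synonym)', '000-00-0', 'H000', '', '']]

with open('chemicals.csv', 'w', newline='', encoding='utf-8-sig') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                     'TWA limits /ppm', 'STEL limits /ppm'])
    writer.writerows(rows)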