NLTK Python Extract .txt files from a local folder, file recall issue - nltk

I'm writing a simple program in Python/NLTK (on a Windows PC) for a university exam, and I'm frankly new to coding.
I have a folder on my computer called "Reviews" that contains 50 .txt files.
My objective is to load these files from the folder and refer to each one by name, then build some lists from them and compare them using techniques such as FreqDist.
First I imported nltk, os and PlaintextCorpusReader.
import nltk
import os
from nltk.corpus import PlaintextCorpusReader
That works. Then I tried to list the contents of the folder.
foldercontent = PlaintextCorpusReader("C:\\Users\\Mgmura\\Desktop\\Reviews", '.*', encoding='latin1')
print(foldercontent.fileids())
This works too: the output lists all 50 .txt files in the folder. Next I tried to do something (show the sentences) with the contents of a single .txt file.
foldercontent.sents('it_quattroruote_giulia.txt')
The output shows some sentences, so it works fine.
Now the real issue. If I try to refer to a single file by name, I get a NameError like the one below.
> NameError                                 Traceback (most recent call last)
> <ipython-input-1-3dd9ed6446c9> in <module>()
> ----> 1 it_quattroruote_giulia
>
> NameError: name 'it_quattroruote_giulia' is not defined
So the real question is: how can I assign a name to every .txt file and refer to it later?
Thanks in advance
Marco
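One way to read the NameError: a file name is a string, not a Python variable, so it_quattroruote_giulia is never defined. A minimal sketch, assuming the goal is to look up each file's tokens by name, keeps them in a dictionary keyed by file name. To stay self-contained it uses only the standard library, a temporary folder with made-up review files, and collections.Counter as a stand-in for FreqDist; with the corpus reader from the question, foldercontent.words(fileid) would supply the tokens instead. The helper load_reviews and the second file name are hypothetical.

```python
import tempfile
from collections import Counter
from pathlib import Path


def load_reviews(folder):
    """Return {file name: list of lowercase tokens} for every .txt file in folder."""
    texts = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # latin1 matches the encoding used in the question
        texts[path.name] = path.read_text(encoding="latin1").lower().split()
    return texts


# Demo data: hypothetical stand-ins for the files in the "Reviews" folder.
demo_dir = tempfile.mkdtemp()
Path(demo_dir, "it_quattroruote_giulia.txt").write_text(
    "la nuova giulia ha una linea elegante e la guida precisa", encoding="latin1")
Path(demo_dir, "en_topgear_giulia.txt").write_text(
    "the giulia is a great car", encoding="latin1")

reviews = load_reviews(demo_dir)

# "Recall" a file by its name: the name is a dictionary key, not a variable.
giulia_tokens = reviews["it_quattroruote_giulia.txt"]

# One frequency table per file, also keyed by file name.
frequencies = {name: Counter(tokens) for name, tokens in reviews.items()}
print(frequencies["it_quattroruote_giulia.txt"].most_common(2))
```

Looping over reviews.items() then gives every file name together with its tokens, which is usually easier to compare than 50 separately named variables.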

Related

'b' Added to file name when trying to load data file in Jupyter

When trying to load a data file into a Jupyter notebook I get the following error message
File b'data_file.csv' does not exist: b'data_file.csv'
Following suggestions I can find online on this problem, I tried the following variations, including specifying the full path and utf encoding
pd.read_csv("data_file.csv")
pd.read_csv("C:\\FULL_PATH\\EBI\\data_file.csv")
pd.read_csv(r"data_file.csv")
pd.read_csv(r"C:\\FULL_PATH\\EBI\\data_file.csv")
pd.read_csv("data_file.csv",encoding='utf-8')
pd.read_csv("C:\\FULL_PATH\\EBI\\data_file.csv",encoding='utf-8')
pd.read_csv(r"data_file.csv",encoding='utf-8')
pd.read_csv(r"C:\\FULL_PATH\\EBI\\data_file.csv",encoding='utf-8')
as well as
pd.read_csv('C:\\FULL_PATH\\EBI\\"data_file.csv"')
However, all of these yield the same error message
File b'data_file.csv' does not exist: b'data_file.csv'
Not sure if it is helpful to add that the Jupyter notebook is being run on a Windows Server 2012 platform. Please note that I checked using os.getcwd() that the full path is indeed as quoted above.
Any suggestions would be much appreciated!
Assuming the file is in your working directory, could you try:
import os
import pandas as pd
# build the absolute path from the current working directory
file = os.path.join(os.getcwd(), "data_file.csv")
df = pd.read_csv(file)
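If the error persists, it can help to check what Python actually sees before calling pd.read_csv. A minimal diagnostic sketch using only the standard library; describe_path is a hypothetical helper, not part of pandas, and data_file.csv is the name from the question. Listing similar names is a guess at one common cause on Windows, where known file extensions are hidden by default.

```python
import os


def describe_path(filename):
    """Report where a file name resolves to and whether it exists."""
    full_path = os.path.abspath(filename)  # resolved against os.getcwd()
    folder = os.path.dirname(full_path)
    report = {
        "full_path": full_path,
        "exists": os.path.isfile(full_path),
        # Names with the same stem can reveal a hidden double extension,
        # e.g. "data_file.csv.txt" or "data_file.csv.csv".
        "similar_names": [],
    }
    if os.path.isdir(folder):
        stem = os.path.basename(filename).split(".")[0].lower()
        report["similar_names"] = sorted(
            name for name in os.listdir(folder)
            if name.lower().startswith(stem)
        )
    return report


print(describe_path("data_file.csv"))
```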

Python Os.walk misses few files to process in the directory

Out of 10 files in the directory, only 8 are processed and 2 are not. But if I delete the 8 processed files and run the code again with only the 2 missed files, it works. Why is os.walk missing files? Also, is there a way to process all the files in the directory one after another without missing any?
Note: The solution will be used for the folder that contains 100K JSON files.
import os

for root, dirs, files in os.walk('D:/M'):
    for file in files:
        if file.endswith(".json"):
            Strfil = os.path.join(root, file)
            with open(Strfil, 'r') as json_file:
                pass  # process json_file here
For file-system-related tasks it is better to use the pathlib module.
With pathlib you can do something like this:
from pathlib import Path

json_files = list(Path("D:/M").glob("**/*.json"))
for f in json_files:
    with open(f, 'r') as json_file:
        pass  # process json_file here
I think any file whose full path is longer than about 250 characters will be skipped by Windows as 'too long' (the classic MAX_PATH limit is 260 characters). What I suggest is to map the network drive to make the path much shorter,
e.g. z:\myfile.xlsx instead of c:\a\b\c\d\e\f\g\myfile.xlsx
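To see which files are skipped and why, rather than guessing, one option is to record every failure while iterating. A minimal sketch, assuming each file holds one UTF-8 encoded JSON document; load_all_json is a hypothetical helper, and the 250-character threshold is simply the figure from the answer above, not a value read from Windows.

```python
import json
from pathlib import Path


def load_all_json(folder, max_path_len=250):
    """Load every .json file under folder; collect failures instead of skipping."""
    loaded, failed, long_paths = {}, {}, []
    for path in sorted(Path(folder).rglob("*.json")):
        if len(str(path.resolve())) > max_path_len:
            long_paths.append(path)  # candidate for the "too long" problem
        try:
            with open(path, "r", encoding="utf-8") as json_file:
                loaded[path] = json.load(json_file)
        except (OSError, ValueError) as exc:
            # OSError: the file could not be opened. ValueError: invalid JSON
            # or undecodable text (JSONDecodeError and UnicodeDecodeError
            # are both subclasses of ValueError).
            failed[path] = repr(exc)
    return loaded, failed, long_paths
```

Calling loaded, failed, long_paths = load_all_json("D:/M") and comparing len(loaded) + len(failed) with the number of files in the folder shows whether anything was silently dropped. For 100K files it may be preferable to process each document inside the loop instead of keeping them all in memory.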

Access a hidden library function in Python?

While coding I came across this:
from hidden_lib import train_classifier
Out of curiosity, is there a way to access the function using the terminal and see what's inside there?
You can use the "inspect" library to do that, but it will work only if the source code of "hidden_lib" is somewhere on your machine:
>>> import hidden_lib
>>> import inspect
>>> print(inspect.getsource(hidden_lib.train_classifier))
Otherwise the library will raise an exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\inspect.py", line 701, in getsource
    lines, lnum = getsourcelines(object)
  File "C:\Python27\lib\inspect.py", line 690, in getsourcelines
    lines, lnum = findsource(object)
  File "C:\Python27\lib\inspect.py", line 529, in findsource
    raise IOError('source code not available')
IOError: source code not available
In that case you need to decompile the .pyc file first. To do that, go to:
https://github.com/wibiti/uncompyle2
then download the package, go to the package folder and install it:
C:\package_location> C:\Python27\python.exe setup.py install
Now you can easily find the location of the library by typing [1]:
>>> hidden_lib.__file__
Then go to that directory and decompile the file:
>C:\Python27\python.exe C:\Python27\Scripts\uncompyle2 -o C:\path_pointed_by_[1]\hidden_lib.py C:\path_pointed_by_[1]\hidden_lib.pyc
The sources should be decompiled successfully:
# 2016.05.07 17:47:36 Central European Daylight Time
+++ okay decompyling hidden_lib.pyc
# decompiled 1 files: 1 okay, 0 failed, 0 verify failed
# 2016.05.07 17:47:36 Central European Daylight Time
And now you can display the sources of the functions exposed by hidden_lib in the way I described at the beginning of the post. If you are using IPython you can also type hidden_lib.train_classifier?? to see the source; the built-in help(hidden_lib.train_classifier) shows the signature and docstring rather than the source.
IMPORTANT NOTE: the uncompyle2 library (which I used) works only with Python 2.7; if you want to do the same for Python 3.x you need to find another, similar library.
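On Python 3 the first step above can be wrapped so that a missing source file is reported instead of raising. A minimal sketch; source_or_location is a hypothetical helper, and json.dumps merely stands in for hidden_lib.train_classifier, which is not available here.

```python
import inspect
import json


def source_or_location(func):
    """Return the source of func, or a note on where its module lives."""
    try:
        return inspect.getsource(func)
    except (OSError, TypeError):
        # OSError: no source file available (e.g. only a .pyc was shipped).
        # TypeError: built-in or C-implemented object with no Python source.
        module = inspect.getmodule(func)
        location = getattr(module, "__file__", None)
        return "source not available; module file: {}".format(location)


print(source_or_location(json.dumps)[:60])   # pure-Python function
print(source_or_location(len))               # built-in, no Python source
```

When only the fallback message comes back, the reported module file is the .pyc (or extension module) that a decompiler would need as input.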

I get a mysterious "Neo.ClientError.Statement.InvalidSyntax" error when loading a CSV in Neo4j

For a course on Excel I was trying to load a CSV in Neo4j (first time using this application) when I was blocked at the first step of replicating an example shown in said course: loading.
The command used in the example was this:
LOAD CSV WITH HEADERS FROM "file:/path/to/file/file.csv"
as row
CREATE (m:movie {name:row.movie})
But it gave syntax errors. I found out I could correct it by using double backslashes and a "file://" prefix:
LOAD CSV WITH HEADERS FROM "file://C:\\path\\to\\file\\file.csv"
as row
CREATE (m:movie {name:row.movie})
Neo4j accepts this syntax, processes for a few moments, and returns YET ANOTHER error:
Neo.TransientError.Statement.ExternalResourceFailure
I tried the same commands (original and my own) in the online Neo4j console but had no luck. I can reach the file using that path without a problem; it really is there. The CSV file consists of just 5 strings of regular letters, that's all. No fancy formatting or characters.
What's going on?
Not that mysterious: Neo4j's LOAD CSV clause looks for the specified CSV file in the import directory configured for that database, as specified at the top of its server configuration file (i.e. dbms.directories.import=import in your neo4j.conf file).
You should create the import directory in...
"C:\Users\[User Name]\Documents\Neo4j\default.graphdb\"
If you place your CSV file in there, you can specify any sub-directory or just the "file.csv" you want to import with LOAD CSV, as below.
LOAD CSV WITH HEADERS FROM "file:///file.csv"
AS row
RETURN row
LIMIT 5
Try using:
"file:///C:/path/to/file/file.csv"
Since your file is on your local computer, the third / following the file scheme is not preceded by a host name or address -- but it still needs to be there. Also, file URI path separators should be forward slashes (even on Windows machines).
See the File URI scheme Wikipedia page if you need more information.
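The URI form in the answer above can also be generated rather than typed by hand. A small sketch in Python, assuming an absolute Windows path with a drive letter; windows_path_to_file_uri is a hypothetical helper that builds the string manually, and the path is the placeholder from the question. Whether Neo4j is allowed to read that location still depends on the import-directory setting described in the other answer.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Build a file:/// URI from an absolute Windows path with a drive letter."""
    posix_form = PureWindowsPath(path).as_posix()  # backslashes -> forward slashes
    # Percent-encode spaces and similar characters, but keep "/" and the drive colon.
    return "file:///" + quote(posix_form, safe="/:")


print(windows_path_to_file_uri(r"C:\path\to\file\file.csv"))
# file:///C:/path/to/file/file.csv
```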

get data from .csv file, analyze, produce output - python3

I am trying to complete an assignment in Python 3. It is very similar to the pdf found here
I have a few questions, both about how to get the information I need and, if possible, about some code that could move me along. I am new to Python. Right now, with the code I have, I keep getting the error "directory not found" after running a function that tries to read the data. I know the .csv file should be in the directory where I save my work in WingIDE, but I can't get it to work correctly.
My first question: after reading each line of the .csv file with my get_data_list function, what is the best way to take each category and feed it into an efficiency equation?
Here is my get_data_list function:
def get_data_list(filename):
    data_list = []
    # the with block closes the file even if an error occurs
    with open(filename, "r") as data_file:
        for line_str in data_file:
            data_list.append(line_str.strip().split(','))
    return data_list
When I run get_data_list("player_regular_season.csv") I get the following error:
builtins.IOError: [Errno 2] No such file or directory:'player_regular_season.csv'
As a first attempt, put the data file in the same directory as the Python program and launch the program from that directory.
Also try writing a single-purpose script to learn how to work with directories. Study the functions in section 15.1.5 of the standard documentation ("Files and Directories"), namely os.getcwd() and os.chdir(path), and then section 10.1 on os.path ("Common pathname manipulations"), namely os.path.isfile(path).
But also read about the other functions in those documents to learn what is available.
Once you know how to work with filenames and paths, have a look at section 13.1 on the csv module ("CSV File Reading and Writing"). So as not to be overwhelmed by all the material, start from the end, with 13.1.5. Examples of using the csv module.
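As a starting point for those suggestions, here is a minimal sketch of get_data_list rewritten with os.path.isfile and the csv module. The wording of the error message is an illustrative choice, and the file name in the usage note is the one from the question.

```python
import csv
import os


def get_data_list(filename):
    """Return the rows of a CSV file as a list of lists of strings."""
    if not os.path.isfile(filename):
        # Show where Python is looking, which is usually the real problem.
        raise FileNotFoundError(
            "{!r} not found (working directory: {!r})".format(filename, os.getcwd())
        )
    # newline="" is what the csv module documentation recommends for open().
    with open(filename, "r", newline="") as data_file:
        return [row for row in csv.reader(data_file)]
```

Calling get_data_list("player_regular_season.csv") then either returns the rows or names the directory that was searched, which should make a wrong working directory in the IDE easy to spot.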