Python os.walk misses a few files when processing a directory - json

Out of 10 files in the directory, only 8 are processed and 2 are skipped. But if I delete the 8 processed files and run again with only the 2 missed files, they are processed fine. Why is os.walk missing files? Also, is there a way to process all the files in the directory one after another without missing any?
Note: The solution will be used for the folder that contains 100K JSON files.
import os

for root, dirs, files in os.walk('D:/M'):
    for file in files:
        if file.endswith(".json"):
            Strfil = os.path.join(root, file)
            with open(Strfil, 'r') as json_file:
                ...  # the processing of json_file was not shown in the original post

For file-system-related tasks it is better to use the pathlib module.
With pathlib you can do something like this:
from pathlib import Path

json_files = list(Path("D:/M").glob("**/*.json"))
for f in json_files:
    with open(f, 'r') as json_file:
        ...  # process json_file here
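To check that no file is skipped silently, a minimal sketch along the following lines records every file that fails to open or parse instead of passing over it. The function name `process_all_json` and the `handle` callback are hypothetical names chosen for illustration; they are not part of the answer above.

```python
import json
from pathlib import Path


def process_all_json(root, handle):
    """Call handle(path, data) for every .json file under root.

    Returns (processed, failed): the number of files handled, and a dict
    mapping each failing path to the exception that was raised for it.
    """
    processed, failed = 0, {}
    # rglob yields paths one at a time, so a folder with 100K files is
    # not collected into a list first.
    for path in Path(root).rglob("*.json"):
        try:
            with open(path, "r", encoding="utf-8") as json_file:
                handle(path, json.load(json_file))
            processed += 1
        except (OSError, ValueError) as exc:
            # OSError: the file could not be opened or read.
            # ValueError: malformed JSON or a decoding error
            # (also raised by handle itself, if it raises ValueError).
            failed[path] = exc
    return processed, failed
```

After a run, comparing `processed + len(failed)` with the number of files expected in the folder shows whether anything was never seen at all, and `failed` shows why the rest could not be read.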

I think any file whose full path is longer than about 260 characters (the classic Windows MAX_PATH limit) will be skipped by Windows as 'too long'. What I suggest is to map the network drive so that the path becomes much shorter.
e.g. z:\myfile.xlsx instead of c:\a\b\c\d\e\f\g\myfile.xlsx
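Whether path length is really the cause can be checked before remapping anything. The sketch below lists files whose absolute path reaches a given length; the helper name `find_long_paths` is hypothetical, and the default of 260 is an assumption based on the classic Windows MAX_PATH value, so adjust it to your system.

```python
import os


def find_long_paths(root, limit=260):
    """Return absolute file paths under root whose length is at least limit."""
    long_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                long_paths.append(full)
    return long_paths
```

If this returns an empty list for the folder in question, over-long paths are probably not why the two files are being missed.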


Colab how to get file id for existing file

I am starting with Colab for ML and I have a problem importing files from my Google Drive into the notebook. Say I have a file pretrained_vgg19.mat in my drive at drive/jupyter/pretrained_vgg19.mat. The code snippet for importing files from Drive says that I need to use the file_ID, which looks like laggVyWshwcyP6kEI-y_W3P8D26sz. How do I get this file_ID?
See PyDrive documentation for the ListFile command:
from pydrive.drive import GoogleDrive
drive = GoogleDrive(gauth) # Create GoogleDrive instance with authenticated GoogleAuth instance
# Auto-iterate through all files in the root folder.
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))
Now all you need to do is tweak the search parameters, since you know the title of the file already. See docs.
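# PyDrive returns 'title' in its file metadata (see above), so the query filters on title as well.
file_list = drive.ListFile({'q': "title='pretrained_vgg19.mat' and trashed=false"}).GetList()
for file in file_list:
    print('%s' % (file['id']))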
Note that it is possible to have files with the same folder name and file name, because you can create multiple folders with identical paths in Google Drive. If there is even a chance of this, you will get multiple files returned in your list operation and will need to use some other criteria in order to select the correct one.
user244343's answer didn't work for me since the gauth object doesn't exist. I did this instead (test.zip needs to point to the right folder and file in your Drive!):
!apt-get install -qq xattr
filename = "/content/drive/My\ Drive/test.zip"
# Retrieving the file ID for a file in `"/content/drive/My Drive/"`:
id = !xattr -p 'user.drive.id' {filename}
print(id)

Include JSON file to build/output directory without import

Maybe the title is a bit strange, but I can't seem to find anything about it on Google.
Question: I have a folder that only contains .ts files and .json files. TypeScript compiles the .ts files and puts them into a separate directory (not as a bundle, just the directory structure 'as-is').
Src /
    Workers /
        [ModuleA.ts, ModuleA.json],
        [ModuleB.ts, ModuleB.json],
        [ModuleC.ts, ModuleC.json]
Most of the time I can just require('*.json') and the JSON file will also be placed into the build directory.
But now I have a situation where importing the JSON makes no sense, because the JSON file gets updated every few seconds and I read the file with fs.readFile('*.json'), so I also don't want it floating around in the v8 cache (through require).
So how do I 'include' a JSON/non-TypeScript file in the build when it is not explicitly imported by either require or import?
For now I just use gulp to copy every .json file in the src folder over to the respective dist/** folder.
But I still find it strange that TypeScript doesn't have something built in for this.
Maybe you should check out --resolveJsonModule; it's a newer feature of TypeScript.

Pandas read_csv reads the file within Eclipse+PyDev but fails when I run standalone

I'm currently developing a GUI using Python and Tkinter.
One of the tasks is to open and read some *.csv files.
In order to perform this task I have written the following code:
ReadData=pd.read_csv(ResultFile,skipinitialspace=True).values
While I run the code within the Eclipse+PyDev IDE everything works fine. But as soon as I run my code from a DOS window, i.e. python MainGrap.py, the code fails, stating that the file doesn't exist.
I first load the path to a file via self.Inp_Filename=askopenfilename(), then I create a list of the folders by means of the following function:
def PathDisintegrator(Inp_File):
    Folders = os.path.split(Inp_File)
    LastFolder = Folders[1]
    RootPath = Folders[0]
    Dirs=[]
    while not(LastFolder==''):
        Dirs.insert(0,LastFolder)
        Folders = os.path.split(RootPath)
        LastFolder = Folders[1]
        RootPath = Folders[0]
    Dirs.insert(0,RootPath[:-1])
    Dirs=Dirs[:-1]
    return(Dirs)
Then I can recreate the full path to file via the following function:
def PathAndFile(Folders,File):
    FileOut=''
    for item in Folders:
        FileOut=FileOut+os.sep+item
        #FileOut=FileOut+r"\\"+item
    FileOut=FileOut[1:]+os.sep+File
    return(FileOut)
I have printed out the file path even within the parser of Pandas and it looks fine to me: D:\Abaqus_Runs\DOWLEX_PET_LAMINATE_PROTO_REFERENCE_SI_Version_2_Revision_2_MDangle0_Rate0_01_MOVING_NODE_out.csv
The problem here is that your Python environment in Eclipse can see the folder where your csv resides but the one in the terminal cannot.
You can observe what the system paths are by doing:
In [331]:
import sys
sys.path
Out[331]:
['',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\python34.zip',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\DLLs',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\win32',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\win32\\lib',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\Pythonwin',
'C:\\WinPython-64bit-3.4.2.4\\python-3.4.2.amd64\\lib\\site-packages\\IPython\\extensions']
So you need to provide a complete path or append the path to where the csv resides to your sys path. Note that backslashes must be escaped e.g. 'c:\\data\\my.csv' but if you use forward slashes then it works fine: e.g. 'c:/data/my.csv'
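To check whether a complete path is actually being passed, and whether the file is visible from the running process, a small diagnostic like the one below can be run both from the IDE and from the terminal. The helper name `describe_path` is hypothetical and introduced only for this sketch.

```python
import os


def describe_path(path):
    """Return facts that help explain a 'file does not exist' error."""
    return {
        "cwd": os.getcwd(),                    # directory that relative paths are resolved against
        "is_absolute": os.path.isabs(path),    # True if the path is complete
        "resolved": os.path.abspath(path),     # what the path resolves to from here
        "exists": os.path.exists(path),        # whether this process can see the file
    }
```

Printing `describe_path(ResultFile)` just before the `pd.read_csv` call in both environments shows where the two runs differ.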

2 Hdfs files comparison

I have 6000+ .csv files in /hadoop/hdfs/location1 and 6100+ .csv files in /hadoop/hdfs/location2.
I want to compare these two HDFS directories and find the files that differ. The differing (non-similar) .csv files should be placed in a third HDFS directory (/hadoop/hdfs/location3). I am not sure whether the Unix diff command can be used on the HDFS file system.
Any idea on how to resolve this would be appreciated.
Anshul
You could use some python (perl/etc.) script to check it. Depending on your special needs and speed, you could check for file-size first. Are the filenames identical? Are the creation-dates identical etc.?
If you want to use python, check out the filecmp module.
>>> import filecmp
>>> filecmp.cmp('undoc.rst', 'undoc.rst')
True
>>> filecmp.cmp('undoc.rst', 'index.rst')
False
Look at the below post which provides an answer on how to compare 2 HDFS files. You will need to extend this for 2 folders.
HDFS File Comparison
You could easily do this with the Java API and create a small app:
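FileSystem fs = FileSystem.get(conf);
FileChecksum chksum1 = fs.getFileChecksum(new Path("/path/to/file"));
FileChecksum chksum2 = fs.getFileChecksum(new Path("/path/to/file2"));
// Compare by value; == would only compare object references.
// getFileChecksum may return null on file systems without checksum support.
return chksum1 != null && chksum1.equals(chksum2);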
There are no HDFS commands for comparing files.
See the post below: this can be achieved by writing a Pig program, or else by writing a MapReduce program.
Equivalent of linux 'diff' in Apache Pig
I think below steps will solve your problem:
1. Write the list of file names in the first location to one file.
2. Write the list of file names in the second location to another file.
3. Find the diff between the two list files using Unix commands.
4. Copy whichever files the diff reports to the other location.
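The comparison step above can also be sketched in Python instead of Unix commands. This assumes the two listings have already been captured (for example from `hdfs dfs -ls`) and reduced to bare file names; the function name `names_only_in_one` and the sample names are hypothetical. It compares names only, not file contents.

```python
def names_only_in_one(listing_a, listing_b):
    """Return (only_in_a, only_in_b) as sorted lists of file names."""
    set_a, set_b = set(listing_a), set(listing_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Illustrative listings; real ones would come from the two HDFS directories.
location1 = ["a.csv", "b.csv", "c.csv"]
location2 = ["b.csv", "c.csv", "d.csv"]

only_1, only_2 = names_only_in_one(location1, location2)
print(only_1)  # ['a.csv']
print(only_2)  # ['d.csv']
```

The two resulting lists are the files to copy into the third location.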
I hope this helps you. Otherwise let me know.

get data from .csv file, analyze, produce output - python3

I am trying to complete an assignment in Python3. It is very similar to the pdf found here
I have a few questions, both about how to get the information I need and, if possible, about some code that could move me along. I am new to Python. Right now, with the code I have, I keep getting the error "directory not found" after running a function to try to read the data. I know the .csv file should be in the directory where I save it in WingIDE, but I can't get it to work correctly.
My first question is: after getting each line of the .csv file from my get_data_list, what is the best way to take each category and feed it into an efficiency equation?
Here is my get_data_list function:
def get_data_list(filename):
    data_list = []
    # Open with a context manager so the file is closed after reading.
    with open(filename, "r") as data_file:
        for line_str in data_file:
            data_list.append(line_str.strip().split(','))
    return data_list
When I run get_data_list("player_regular_season.csv") I get the following error:
builtins.IOError: [Errno 2] No such file or directory:'player_regular_season.csv'
As a first try, put the data file in the same directory as the Python program and launch the program from that directory.
Try also a single purpose script to learn how to work with directories. Learn the functions from the standard doc 15.1.5. Files and Directories, namely os.getcwd(), os.chdir(path), and then 10.1. os.path — Common pathname manipulations, namely os.path.isfile(path).
Also read the documentation of the other functions in those chapters to learn what is available.
When knowing how to work with filenames and paths, have a look at the 13.1. csv — CSV File Reading and Writing. Not to be scared of all the stuff, start from the end -- 13.1.5. Examples of using the csv module.
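Putting those pieces together, a minimal sketch might look like the following. It reuses the name `get_data_list` from the question but swaps the manual `split(',')` for `csv.reader`, and the check with `os.path.isfile` plus the wording of the error message are additions made for illustration.

```python
import csv
import os


def get_data_list(filename):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    if not os.path.isfile(filename):
        # Show where Python is looking, which is usually what explains
        # a "No such file or directory" error.
        raise FileNotFoundError(
            f"{filename!r} not found; current directory is {os.getcwd()!r}"
        )
    with open(filename, "r", newline="") as data_file:
        return [row for row in csv.reader(data_file)]
```

If the error is raised, either move the .csv file into the directory shown in the message or pass the full path to the file.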