pythonic way to access data in file structures - html

I want to access every value (~10000) in each of ~1000 .txt files stored across ~20 directories in the most efficient manner possible. Once the data is read, I want to place it in an HTML string so I can display an HTML page with a table for each file. Pseudo:
fh = open('MyHtmlFile.html', 'w')
fh.write('''<head>Lots of tables</head><body>''')
for eachDirectory in rootFolder:
    for eachFile in eachDirectory:
        concat = ''
        for eachData in eachFile:
            concat = concat + '<tr><td>' + eachData + '</td></tr>'
        table = '''
        <table>%s</table>
        ''' % (concat)
        fh.write(table)
fh.write('''</body>''')
fh.close()
There must be a better way (I imagine it would take forever)! I've checked out set() and read a bit about hashtables but rather ask the experts before the hole is dug.
Thank you for your time!
/Karl

import os, os.path

# If you're on Python 2.5 or newer, use 'with'
# (needs 'from __future__ import with_statement' on 2.5)
fh = open('MyHtmlFile.html', 'w')
fh.write('<html>\r\n<head><title>Lots of tables</title></head>\r\n<body>\r\n')
# os.walk will recursively descend the tree
for dirpath, dirnames, filenames in os.walk(rootFolder):
    for filename in filenames:
        # again, use 'with' on Python 2.5 or newer
        infile = open(os.path.join(dirpath, filename))
        # format each line as a table row, join the rows,
        # then format them into the table
        # (on Python 2.6 or newer you could use 'str.format' instead)
        fh.write('<table>\r\n%s\r\n</table>' %
                 '\r\n'.join('<tr><td>%s</td></tr>' % line for line in infile))
        infile.close()
fh.write('\r\n</body></html>')
fh.close()

Why do you "imagine it would take forever"? You are reading the file and then printing it out - that's pretty much the only thing you have as a requirement - and that's all you're doing.
You could tweak the script in a couple of ways (read blocks not lines, adjust buffers, print out instead of concatenating, etc.), but if you don't know how much time do you take now, how do you know what is better/worse?
Profile first, then find if the script is too slow, then find a place where it's slow, and only then optimise (or ask about optimisation).
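If you do want to measure it, the standard library's cProfile module is enough. A minimal sketch, assuming the directory-walking and HTML-writing code is wrapped in a function (build_html here is a hypothetical name):
import cProfile
import pstats

# Hypothetical: wrap the page-building code in build_html()
# so it can be profiled as a single call.
cProfile.run('build_html()', 'profile.out')

# Print the ten functions with the largest cumulative time
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)
That will tell you whether the time actually goes into reading the files, building the strings, or writing the output, before you change anything.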

Related

How to split a large CSV file with no code?

I have a CSV file which has nearly 22M records. I want to split this into multiple CSV files so that I can use it further.
I tried to open it using Excel (I tried the Transform Data option as well), Notepad++ and Notepad, but they all give me an error.
When I explored the options, I found that the file can be split using coding approaches like Java, Python, etc. I am not very familiar with coding and want to know if there is any option to split the file without any coding. Also, since the file has client-sensitive data, I don't want to download/use any external tools.
Any help would be much appreciated.
I know you're concerned about security of sensitive data, and that makes you want to avoid external tools (even a nominally trusted tool like Google Big Query... unless your data is medical in nature).
I know you don't want a custom solution w/Python, but I don't understand why that is—this is a big problem, and CSVs can be tricky to handle.
Maybe your CSV is a "simple one" where there are no embedded line breaks, and the quoting is minimal. But if it isn't, you're going to want to a tool that's meant for CSV.
And because the file is so big, I don't see how you can do it without code. Even if you could load it into a trusted tool, how would you process the 22M records?
I look forward to seeing what else the community has to offer you.
The best I can think of, based on my experience, is exactly what you said you don't want:
a small-ish Python script that uses the csv library to correctly read in your large file and write out several smaller files. If you don't trust this, or me, maybe find someone you do trust who can read it and assure you it won't compromise your sensitive data.
#!/usr/bin/env python3
import csv

MAX_ROWS = 22_000

# The name of your input
INPUT_CSV = 'big.csv'

# The "base name" of all new sub-CSVs; a counter will be added after the '-':
# e.g., new-1.csv, new-2.csv, etc...
NEW_BASE = 'new-'

# This function will be called over-and-over to make a new CSV file
def make_new_csv(i, header=None):
    # New name
    new_name = f'{NEW_BASE}{i}.csv'
    # Create a new file from that name
    new_f = open(new_name, 'w', newline='')
    # Create a "writer", a dedicated object for writing "rows" of CSV data
    writer = csv.writer(new_f)
    if header:
        writer.writerow(header)
    return new_f, writer

# Open your input CSV
with open(INPUT_CSV, newline='') as in_f:
    # Like the "writer", dedicated to reading CSV data
    reader = csv.reader(in_f)
    # Read the header once so it can be repeated in every new file
    your_header = next(reader)
    # Give your new files unique, sequential names: e.g., new-1.csv, new-2.csv, etc...
    new_i = 1
    # Make first new file and writer
    new_f, writer = make_new_csv(new_i, your_header)
    # Loop over all input rows, and count how many
    # records have been written to each "new file"
    new_rows = 0
    for row in reader:
        if new_rows == MAX_ROWS:
            new_f.close()  # This file is full, close it and...
            new_i += 1
            new_f, writer = make_new_csv(new_i, your_header)  # get a new file and writer
            new_rows = 0  # Reset row counter
        writer.writerow(row)
        new_rows += 1
    # All done reading input rows, close last file
    new_f.close()
There's also a fantastic tool I use daily for processing large CSVs, also with sensitive client contact and personally identifying information: GoCSV.
Its split command is exactly what you need:
Split a CSV into multiple files.
Usage:
gocsv split --max-rows N [--filename-base FILENAME] FILE
I'd recommend downloading it for your platform, unzipping it, putting a sample file with non-sensitive information in that folder and trying it out:
gocsv split --max-rows 1000 --filename-base New sample.csv
would end up creating a number of smaller CSVs, New-1.csv, New-2.csv, etc..., each with a header and no more than 1000 rows.

Create libsvm from multiple csv files for xgboost external memory training

I am trying to train an xgboost model using its external memory version, which takes a libsvm file as the training set. Right now, all the data is stored in a bunch of csv files which, combined, are way larger than the memory I have, say 70G (any one of them can easily be read on its own). I just wonder how to create one large libsvm file for xgboost, or if there is any other workaround for this. Thank you.
If your csv files do not have headers you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers you'll want to do something trickier, like taking all but the first line of each file with tail.
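If you'd rather do that part in Python, a minimal sketch of the same header-skipping concatenation could look like this (combined.csv and the *.csv pattern are just assumed names; it keeps the header from the first file and drops it from the rest):
import glob
import shutil

OUTPUT = 'combined.csv'  # assumed output name
files = sorted(p for p in glob.glob('*.csv') if p != OUTPUT)

with open(OUTPUT, 'w') as out:
    for i, path in enumerate(files):
        with open(path) as f:
            header = f.readline()        # read (and skip) this file's header
            if i == 0:
                out.write(header)        # keep the header from the first file only
            shutil.copyfileobj(f, out)   # copy the remaining data lines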
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.
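For reference, a rough sketch of what such a conversion does (this is not phraug's actual script, just an illustration; it assumes the label sits in the first column, every feature is numeric, and there is no header row):
import csv
import glob

OUTPUT = 'train.libsvm'  # assumed output name

with open(OUTPUT, 'w') as out:
    for path in sorted(glob.glob('data/*.csv')):  # assumed input location
        with open(path, newline='') as f:
            for row in csv.reader(f):
                label, features = row[0], row[1:]
                # libsvm rows are "label index:value ...", with 1-based indices;
                # zero-valued features are dropped to keep the file sparse
                pairs = ' '.join(f'{i}:{v}' for i, v in enumerate(features, 1)
                                 if float(v) != 0)
                out.write(f'{label} {pairs}\n')
Because it streams one row at a time, it never needs to hold the 70G of csv data in memory.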

How to load only first n files in pyspark spark.read.csv from a single directory

I have a scenario where I am loading and processing 4TB of data, which is about 15000 .csv files in a folder. Since I have limited resources, I am planning to process them in two batches and then union them.
I am trying to understand if I can load only 50% (or the first n files in batch 1 and the rest in batch 2) using spark.read.csv.
I cannot use a regular expression, as these files are generated from multiple sources and they are uneven in number (from some sources there are few and from other sources there are many). If I consider processing files in uneven batches using wildcards or a regex, I may not get optimized performance.
Is there a way I can tell the spark.read.csv reader to pick the first n files, and then load the remaining files separately?
I know this can be done by writing another program, but I would prefer not to, as I have more than 20000 files and I don't want to iterate over them.
It's easy if you use the Hadoop API to list the files first and then create dataframes from chunks of that list. For example:
path = '/path/to/files/'
from py4j.java_gateway import java_import
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
list_status = fs.listStatus(spark._jvm.org.apache.hadoop.fs.Path(path))
paths = [file.getPath().toString() for file in list_status]
df1 = spark.read.csv(paths[:7500])
df2 = spark.read.csv(paths[7500:])
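If you then want to stitch the two halves back together after processing each batch, a union is enough. A minimal sketch, where df1_processed and df2_processed are hypothetical names for whatever your per-batch processing produces (both must share the same schema; on Spark versions before 2.0 the call is unionAll):
# Combine the two processed batches back into a single DataFrame
combined = df1_processed.union(df2_processed)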

Pari/GP directing output

Is there an easy, convenient way to direct output in Pari/GP to a file? My aim is to get the full decimal expansion of 2^400000-1, either on screen or in a text file.
(23:37) gp > 2^400000-1
%947 = 996014342993......(4438 digits)......609762267975[+++]
GP terminal output gives this, which is not the goal. Basic output redirection does not work either. Any ideas? Thanks.
(23:38) gp > 2^400000-1 > output.txt
There is a manual online; it does not say much about output, except for the variable TeXstyle. I am unsure how to work with that, though.
Quick and easy is to just do print(2^400000-1) and then cut and paste. Otherwise, use write(filename, 2^400000-1) if you want it in a file.
Some other possibilities:
writebin(filename,2^400000-1) writes the object binary structure in a file: this is faster than traditional output (which implies a binary to decimal conversion), and loading it into another session will be faster as well. This is useful for a huge atomic write.
C-style output: fileopen, then successive filewrite calls allow many writes to a file referenced by a descriptor (which avoids re-opening / flushing / closing the file after each write). This is useful for a large write operation done through many tiny writes to a given file, e.g., character by character.

how to use import.bat

I'm new to neo4j. I'm trying to load csv files using import.bat, with the shell (on Windows).
I have 500,000 nodes and 37 million relationships. The import.bat is not working.
The code in the shell cmd:
../neo4j-community-3.0.4/bin/neo4j-import \
--into ../neo4j-community-3.0.4/data/databases/graph.db \
--nodes:Chain import\entity.csv
--relationships import\roles.csv
but I do not know where to keep the csv files or how to use import.bat with the shell.
I'm not sure I'm in the right place:
neo4j-sh(?)$
(I looked at a lot of examples; for me it just does not work.)
I tried to start the server from the cmd line and it's not working. This is what I did:
neo4j-community-3.0.4/bin/neo4j.bat start
I want to work with indexes. I set the index, but when I try to use it, it's not working:
start n= node:Chain(entity_id='1') return n;
I set the properties:
node_keys_indexable=entity_id
and also:
node_auto_indexing=true
Without indexes, this query:
match p = (a:Chain)-[:tsuma*1..3]->(b:Chain)
where a.entity_id= 1
return p;
which tries to get one node with 3 levels of relationships, returned 49 relationships in 5 minutes.
That's a lot of time!
Your import command looks correct. You point to the csv files where they are, just like you point to the --into directory. If you're unsure, use fully qualified names like /home/me/some-directory/entities.csv. What does it say? (It's really hard to help you without knowing the error.)
What's the error?
Legacy indexes don't go well with the importer, so enabling legacy indexes afterwards doesn't index your data. Could you instead use an index (CREATE INDEX ...)?